UndefinedConversionError with UTF-8 encoding in xml plugin

Description

-----------
Synopsis:
There seems to be a bug in the xml filter plugin.
When fed text containing non-ASCII UTF-8 characters (such as "ñ" or "ç"), the xml filter plugin reports:

Exception in filterworker {"exception"=>#<Encoding::UndefinedConversionError: ""\xC3\xB1"" from UTF-8 to US-ASCII>, "backtrace"=>["org/jruby/ext/stringio/StringIO.java:1140:in `write'", "nokogiri/XmlNode.java:1149:in `native_write_to'", "/home/user/applications/logstash-1.4.1/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.1-java/lib/nokogiri/xml/node.rb:798:in `write_to'", "/home/user/applications/logstash-1.4.1/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.1-java/lib/nokogiri/xml/node.rb:730:in `serialize'", "/home/user/applications/logstash-1.4.1/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.1-java/lib/nokogiri/xml/node.rb:753:in `to_xml'", "/home/user/applications/logstash-1.4.1/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.1-java/lib/nokogiri/xml/node.rb:613:in `to_s'", "/home/user/applications/logstash-1.4.1/lib/logstash/filters/xml.rb:118:in `filter'", "/home/user/applications/logstash-1.4.1/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.1-java/lib/nokogiri/xml/node_set.rb:237:in `each'", "org/jruby/RubyInteger.java:133:in `upto'", "/home/user/applications/logstash-1.4.1/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.1-java/lib/nokogiri/xml/node_set.rb:236:in `each'", "/home/user/applications/logstash-1.4.1/lib/logstash/filters/xml.rb:109:in `filter'", "org/jruby/RubyHash.java:1339:in `each'", "/home/user/applications/logstash-1.4.1/lib/logstash/filters/xml.rb:102:in `filter'", "(eval):23:in `initialize'", "org/jruby/RubyProc.java:271:in `call'", "/home/user/applications/logstash-1.4.1/lib/logstash/pipeline.rb:262:in `filter'", "/home/user/applications/logstash-1.4.1/lib/logstash/pipeline.rb:203:in `filterworker'", "/home/user/applications/logstash-1.4.1/lib/logstash/pipeline.rb:143:in `start_filters'"], :level=>:error}

----------
Affected versions:
1.4.0, 1.4.1
Note: using the same config and input with logstash 1.3.3 does NOT trigger the issue.

----------
Details:
See the attached config file for details of the setup.

Note the "c3 b1" and "c3 a7" sequences in the hexdump (hexdump -C test_utf8.log):
[...]
00000140 3c 6c 61 6e 67 20 74 65 78 74 3d 22 45 73 70 61 |<lang text="Espa|
00000150 c3 b1 6f 6c 22 3e 65 73 5f 45 53 3c 2f 6c 61 6e |..ol">es_ES</lan|
00000160 67 3e 3c 6c 61 6e 67 20 74 65 78 74 3d 22 46 72 |g><lang text="Fr|
00000170 61 6e c3 a7 61 69 73 22 3e 66 72 5f 46 52 3c 2f |an..ais">fr_FR</|
[...]

Also note that the "charset" option in the input codec does not seem to matter: setting it to "ISO-8859-1" produces the same result.
This looks to me like an internal conversion happening somewhere inside the xml plugin.
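
For illustration, here is a minimal Ruby sketch of what the error message means. It is not the xml filter's actual code path (that goes through Nokogiri and StringIO, per the backtrace); it only assumes that somewhere a UTF-8 string is being transcoded to US-ASCII, which is what the exception text suggests. The sample string is taken from the hexdump above.

```ruby
# Illustration only: NOT the plugin's code, just the same class of failure.
text = "Espa\u00F1ol" # UTF-8 string; "ñ" is encoded as the bytes C3 B1

# Show the raw bytes, comparable to the hexdump of the log file.
bytes = text.bytes.map { |b| format("%02x", b) }.join(" ")
puts bytes

# Transcoding to US-ASCII fails because "ñ" has no ASCII representation.
error = nil
begin
  text.encode("US-ASCII")
rescue Encoding::UndefinedConversionError => e
  error = e
end

puts error.class
puts "#{error.source_encoding_name} -> #{error.destination_encoding_name}"
```

If the conversion really is internal to the plugin, this would also explain why the input codec's "charset" setting makes no difference.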

----------
Reproduce:
Please find attached the config file and the input (log file), as well as a simple script containing the invocation command.

Make sure to:
1) Adjust the path in the config file to your environment (test_utf8.conf, line 3).
2) Adjust the paths to both the logstash executable and the config file in the invocation command (start.sh).

Important: to reproduce the issue, make sure the log file is not accidentally modified; this can easily happen unnoticed (e.g. transfer in text mode).
Best run
grep -aP '\xC3' *.log
to check that the multi-byte UTF-8 characters are still present.
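
If grep with -P is not available, roughly the same check could be sketched in Ruby. The helper name lines_with_c3 is made up for this example; it reads each file in binary mode, so no encoding conversion takes place, and reports the line numbers that contain the byte 0xC3.

```ruby
# Hypothetical helper (name invented for this example): returns the 1-based
# numbers of all lines in the file that contain the byte 0xC3.
def lines_with_c3(path)
  hits = []
  # "rb" = binary mode, so the bytes are read exactly as stored on disk.
  File.open(path, "rb") do |file|
    file.each_line.with_index(1) do |line, number|
      hits << number if line.include?("\xC3".b)
    end
  end
  hits
end

# Usage: ruby check_c3.rb test_utf8.log
ARGV.each do |path|
  puts "#{path}: lines #{lines_with_c3(path).inspect}"
end
```

An empty list for the log file would mean the non-ASCII characters were lost or re-encoded in transit.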

Environment

None

Assignee

Philippe Weber

Reporter

TheOlf
