The Big UTF-8 Ticket
Description
Activity

Ryan Bellows April 24, 2014 at 10:37 PM
this is in 1.4.0
Jordan Sissel April 24, 2014 at 10:32 PM
Most of the known UTF-8 problems were fixed in 1.4.0. I recommend filing a separate bug any time you find an issue so we can more adequately assess and respond to it.

Ryan Bellows April 24, 2014 at 8:35 PM
Any update on this issue? I have a simple file input that crashes until I wipe out the sincedb file.
{:timestamp=>"2014-04-24T20:33:52.729000+0000", :message=>"A plugin had an unrecoverable error. Will restart this plugin.\n Plugin: <LogStash::Inputs::File type=>\"sfo1-php-errors\", path=>[\"/mnt/codebase/log/php_errors.log\"], sincedb_path=>\"/var/tmp/sfo1-logstash-php-errors.sincedb\", start_position=>\"end\">\n Error: invalid byte sequence in UTF-8", :level=>:error}

Miral Popat February 10, 2014 at 10:33 PM
I get the logstash error "invalid byte sequence in UTF-8".

Miral Popat February 10, 2014 at 10:30 PM
When is this issue likely to be fixed? I am using 1.3.3.
Users continue to have UTF-8 problems.
Current solutions:
- Users can set the correct charset for their data at input time.
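As a rough illustration of what setting a charset amounts to, the Ruby sketch below transcodes bytes from a declared source charset to UTF-8. The helper name `to_utf8` and the sample bytes are made up for this example; it is not logstash's actual code.

```ruby
# Hypothetical helper (not part of logstash): interpret raw bytes in the
# charset the user declared, then transcode them to UTF-8.
def to_utf8(raw, charset)
  # dup so the caller's string is not mutated by force_encoding
  raw.dup.force_encoding(charset).encode("UTF-8")
end

latin1_bytes = "caf\xE9".b            # "café" encoded as ISO-8859-1 (assumed sample)
text = to_utf8(latin1_bytes, "ISO-8859-1")

puts text                             # café
puts text.valid_encoding?             # true
```

This only works when the declared charset really matches the data, which is exactly the promise users sometimes break.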
Current problems:
- logstash sometimes crashes when users submit data that they have declared (through their logstash configuration) to be UTF-8 but that is not.
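One way this kind of failure can arise in plain Ruby: a string is tagged as UTF-8 although its bytes are not valid UTF-8, and a later operation such as a regexp match raises. The sample line and the helper `match_error` are hypothetical; where exactly logstash hits the error is not established here.

```ruby
# Hypothetical helper: returns the exception message if matching raises,
# otherwise nil.
def match_error(line, pattern)
  line =~ pattern
  nil
rescue ArgumentError => e
  e.message
end

# "café" as ISO-8859-1 bytes, mislabeled as UTF-8 (assumed sample data).
line = "PHP Warning: caf\xE9".b.force_encoding("UTF-8")

puts line.valid_encoding?                   # false
puts match_error(line, /Warning/).inspect   # message of the ArgumentError raised by the match
```

Nothing fails when the label is applied; the error only surfaces later, when something actually inspects the characters.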
Before discussing options, I want to make clear that we want the following two properties:
- We should avoid default configurations that permit data loss or data corruption.
- We should avoid solutions to this encoding problem that, by default, cause performance problems.
Possible solutions:
1. Validate all input as UTF-8 and, if it is not valid, try alternate charsets? This has a performance cost and also causes corruption if we pick the wrong charset.
2. Log the invalid data, but otherwise drop the event. This is also bad because dropping data is bad.
3. Continue to crash. This is bad because crashing is bad.
Any other options? All three solutions I can think of drop or corrupt data.
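To make the trade-offs of the first two options concrete, here is a minimal Ruby sketch. The function `coerce_to_utf8`, its status symbols, and the default fallback list are all assumptions for illustration, not a proposed implementation.

```ruby
# Hypothetical sketch: validate as UTF-8, then try fallback charsets.
# Returns [string_or_nil, status].
def coerce_to_utf8(raw, fallbacks = ["ISO-8859-1"])
  candidate = raw.dup.force_encoding("UTF-8")
  # valid_encoding? scans the whole string: this is the per-event cost.
  return [candidate, :valid] if candidate.valid_encoding?

  fallbacks.each do |charset|
    guess = raw.dup.force_encoding(charset)
    next unless guess.valid_encoding?
    # The guess may be wrong; the result is valid UTF-8 but possibly corrupt.
    return [guess.encode("UTF-8"), :guessed]
  end

  # Nothing fit: the caller would have to log and drop the event (option 2).
  [nil, :rejected]
end

# Windows-1252 curly quotes around "hi" (assumed sample data).
text, status = coerce_to_utf8("\x93hi\x94".b)
puts status                              # guessed
puts text == "\u201Chi\u201D"            # false: the quotes became C1 control characters

p coerce_to_utf8("\xFF".b, [])           # [nil, :rejected]
```

The second call shows the corruption case: every byte is valid ISO-8859-1, so the wrong guess succeeds silently and the original quotation marks are lost.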