I want to explore the possibility of further decoupling event format from transport.
Graylog2 supports AMQP as an input mechanism. The problem is that the messages need to be GELF or syslog formatted. Currently our approach is that the plugin does any custom formatting for the output. In the case of the gelf output, we format a GELF message. Well, technically we let the library format a GELF message. End pedantry.
So now we're faced with a need to allow output to graylog2 over AMQP. Do we create a new plugin called gelf_amqp, or do we bloat the config syntax for gelf to encompass AMQP options? Neither sounds appealing to me.
Many other things that we support as outputs also support a secondary transport. One example is Graphite. Who knows what else may come down the pipe? (No pun intended.)
We have filters today that are essentially 'codecs' that act on specified fields in the event. JSON filter is an example. I also want a CSV filter. Both CSV and JSON are serialization formats.
If we properly implement codecs, we can have a 'codec' filter that lets you use any codec on any field in an event. This will let people log CSV over syslog, etc.
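To make the "any codec on any field" idea concrete, here is a rough sketch. It is not Logstash code: `DECODERS`, `codec_filter`, and the `csv_naive` name are all hypothetical, events are plain hashes, and the CSV decoder is a bare comma split that ignores quoting.

```ruby
require "json"

# Hypothetical decoder table; the names are illustrative, not a Logstash API.
DECODERS = {
  "json" => ->(raw) { JSON.parse(raw) },
  # Naive comma split for illustration only; real CSV needs quoting rules.
  "csv_naive" => ->(raw) { raw.split(",") },
}

# Apply the named codec to one field of the event and store the result in
# `target` (defaults to overwriting the source field). Returns a new hash.
def codec_filter(event, field:, codec:, target: field)
  decoder = DECODERS.fetch(codec)
  event.merge(target => decoder.call(event.fetch(field)))
end

event = { "message" => "a,b,c", "host" => "web1" }
parsed = codec_filter(event, field: "message", codec: "csv_naive", target: "columns")
puts parsed.inspect
```

The filter itself knows nothing about CSV or JSON; it only looks up a decoder by name, which is the property that would let the same codec be reused on an input or an output.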
The idea is to add an optional parameter to outputs/base.rb that specifies the event format. This way, there's no need to define a custom plugin for es river, graphite river, or graylog river. You simply add an additional attribute to an existing output and the event is formatted differently. Longer term (or even as a big refactor) we might now have lib/logstash/formats so that community contribution is much easier.
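A minimal sketch of what that optional parameter could look like. This assumes nothing about the real outputs/base.rb: `SketchOutput`, `FORMATS`, and the `format:` option are hypothetical names, and "delivery" just appends to an array so the example runs on its own.

```ruby
require "json"

# Hypothetical format registry; something like lib/logstash/formats could
# populate a table like this. Names are illustrative only.
FORMATS = {
  "json"  => ->(event) { JSON.generate(event) },
  "plain" => ->(event) { event["message"].to_s },
}

# Stand-in for an output base class that accepts an optional format setting.
class SketchOutput
  attr_reader :sent

  def initialize(format: "json")
    @formatter = FORMATS.fetch(format) { raise ArgumentError, "unknown format: #{format}" }
    @sent = []
  end

  # Formatting happens here, once, for every transport.
  def receive(event)
    deliver(@formatter.call(event))
  end

  # A transport-specific subclass (amqp, tcp, ...) would override only this.
  def deliver(payload)
    @sent << payload
  end
end

out = SketchOutput.new(format: "plain")
out.receive({ "message" => "hello", "host" => "web1" })
puts out.sent.inspect
```

With this split, an AMQP output that needs GELF would be the existing AMQP transport plus a different format value, not a new gelf_amqp plugin.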
Downside? ElasticSearch river currently also does the autoconfiguration stuff for you. How to address that?
Instead of doing format as an attribute, flip everything on its head and make transport the attribute. If the transport provided matches a plugin, that plugin's attributes are pulled into the existing one. I don't like this approach in the least as I've described it, but it's worth mentioning.
Got some more feedback from Jordan on IRC. So that we're all on the same page, we're going to think of these in terms of a new feature called "codecs". As with inputs and outputs, there are encoders and decoders. Conceptually, they work more like serializers and deserializers.
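As a rough illustration of the encoder/decoder pairing (not the eventual Logstash codec API; `JsonCodecSketch` and its method names are assumptions), a codec can be one object that an output calls `encode` on and an input calls `decode` on:

```ruby
require "json"

# Hypothetical codec: pairs a serializer with its deserializer.
class JsonCodecSketch
  # Output side: event hash -> bytes for the wire.
  def encode(event)
    JSON.generate(event)
  end

  # Input side: bytes from the wire -> event hash.
  def decode(data)
    JSON.parse(data)
  end
end

codec = JsonCodecSketch.new
wire  = codec.encode({ "message" => "hello", "tags" => ["a", "b"] })
event = codec.decode(wire)
puts event.inspect
```

Because neither method touches a socket or a queue, the same codec object could sit behind any transport.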
With that in mind, we now have to consider how to express this in Logstash config. Obviously in many cases, this is pretty simple:
Those are the ones logstash understands right now. The difference between json and json_event is that json_event is the internal logstash json format whereas json is logstash's way of saying "I know that this message is json and will try to parse it that way".
Specifically around this topic, we want to add GELF as a supported codec. GELF is pretty straightforward. There's no real variant allowed.
Now let's take a look at some other possible codecs we might need. Some of these are already things that people are asking for:
Now some of those require an IDL or a schema file. How do we express that in the logstash config language? Something like Avro embeds the schema in the message, whereas protobufs requires you to compile the .proto file first. Obviously there are extra steps.
Personally, I don't think the schema can be appropriately represented without overly complicating the config language.
new config stanzas that work much like filters: "encoder" or "serializer" and "decoder" or "deserializer"
move the encoder/decoder into the actual input/output lines
Either way, for many of these, we'll need a way to express the location/shape of the schema. In the case of protobufs and ruby, we actually have to require in the generated rb file. For avro, we could require a path to a json file that we prepend to the message.
Obviously none of these are as clean as they could be. We already allow external files for filters, so the logic is there. Additionally, how do we start to handle event.sprintf in these cases (for substitution)? This also has the downside of introducing additional latency in the pipeline.
I'm +1'ing the fuck out of this. The lumberjack project for a lightweight shipper in C is awesome, but having a non-C shipper would be boss - that is what I am building with my own beaver project - and using something faster than ujson to json encode data in python would be killer.
+1 for Protobuf! Would be super mega ultra killer!
Done; codecs are implemented and released.