Decouple message format from transport

Description

I want to explore the possibility of further decoupling the event format from the transport.

Original use case

Graylog2 supports AMQP as an input mechanism. The problem is that the messages need to be GELF- or syslog-formatted. Currently our approach is that the plugin does any custom formatting for the output. In the case of the gelf output, we format a GELF message. Well, technically we let the library format the GELF message. end pedant.

So now we're faced with a need to allow output to Graylog2 over AMQP. Do we create a new plugin called gelf_amqp, or do we bloat the config syntax for gelf to encompass AMQP options? Neither sounds appealing to me.
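For illustration, here's roughly the shape we'd want to end up with. The format attribute is hypothetical and the amqp option names are from memory, so treat this as a sketch rather than working config:

    output {
      amqp {
        host => "graylog.example.com"
        exchange_type => "fanout"
        name => "logstash-gelf"
        # hypothetical attribute: choose the wire format independently of the
        # transport instead of writing a dedicated gelf_amqp plugin
        format => "gelf"
      }
    }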

Additional use cases

Many other things that we support as outputs also have secondary transport support. One example is Graphite. Who knows what else may come down the pipe? (No pun intended.)

Use case with Filters

We have filters today that are essentially 'codecs' that act on specified fields in the event. The JSON filter is an example. I also want a CSV filter. Both CSV and JSON are serialization formats.

If we properly implement codecs, we can have a 'codec' filter that lets you use any codec on any field in an event. This will let people log CSV over syslog, etc.
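A hypothetical 'codec' filter might look something like this (invented syntax, purely to make the idea concrete):

    filter {
      # hypothetical filter: decode a named field with a named codec
      codec {
        field => "message"
        codec => "csv"
        # invented option: column names to assign to the parsed values
        columns => ["timestamp", "host", "status"]
      }
    }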

Implementation

The idea is to add an optional parameter to outputs/base.rb that specifies the event format. This way, there's no need to define a custom plugin for an ES river, Graphite river, or Graylog river. You simply add an additional attribute to an existing output and the event is formatted differently. Longer term (or even as a big refactor) we might then have lib/logstash/formats so that community contribution is much easier.
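Concretely, the amqp output sketched earlier could feed an elasticsearch river, a graphite consumer, or a graylog river just by changing one attribute (format is still hypothetical here):

    output {
      amqp {
        host => "mq.example.com"
        exchange_type => "direct"
        name => "events"
        # swap "json_event" for "graphite" or "gelf"; the transport stays the same
        format => "json_event"
      }
    }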

Downside? The elasticsearch river output currently also does the autoconfiguration stuff for you. How do we address that?

Alternate implementation

Instead of doing format as an attribute, flip everything on its head and make transport the attribute. If the transport provided matches a plugin, that plugin's attributes are pulled into the existing one. I don't like this approach in the least as I've described it, but it's worth mentioning.
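As described, that would look something like the following (again, invented syntax). Note how the amqp plugin's attributes bleed into the gelf block, which is exactly the part I dislike:

    output {
      gelf {
        # hypothetical: transport names another plugin, and that plugin's
        # attributes (exchange_type, name) get pulled into this block
        transport => "amqp"
        host => "mq.example.com"
        exchange_type => "fanout"
        name => "gelf-messages"
      }
    }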

Activity

John E. Vincent
February 20, 2012, 3:30 PM

Got some more feedback from Jordan on IRC. So that we're all on the same page, we're going to think of these in terms of a new feature called "codecs". As with inputs and outputs, there are encoders and decoders. Conceptually, they work more like serializers and deserializers.

With that in mind, we now have to consider how to express this in Logstash config. Obviously in many cases, this is pretty simple:

  • raw

  • json_event

  • json

Those are the ones logstash understands right now. The difference between json and json_event is that json_event is the internal logstash json format whereas json is logstash's way of saying "I know that this message is json and will try to parse it that way".
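For reference, these are selected today via the format option on inputs, something along these lines (option names from memory, so double-check the docs):

    input {
      tcp {
        port => 3333
        # or "json", or "raw" for plain text
        format => "json_event"
      }
    }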

Specifically around this topic, we want to add GELF as a supported codec. GELF is pretty straightforward. There are no real variants allowed.

Now let's take a look at some other possible codecs we might need. Some of these are already things that people are asking for:

  • thrift

  • protobufs

  • avro

  • msgpack

  • hessian

Now some of those require an IDL or a schema file. How do we express that in the logstash config language? Something like Avro embeds the schema in the message, whereas protobufs require you to compile the .proto file first. Obviously there are extra steps.

Personally, I don't think the schema can be appropriately represented without overly complicating the config language.

Some ideas:

  • new config stanzas that work much like filters: "encoder" or "serializer", and "decoder" or "deserializer"

  • move the encoder/decoder into the actual input/output lines

Either way, for many of these, we'll need a way to express the location/shape of the schema. In the case of protobufs and ruby, we actually have to require in the generated .rb file. For avro, we could require a path to a json file that gets prepended to the message.
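To make both ideas concrete, here are two sketches. Nothing below is real syntax; the stanza names, option names, and schema handling are all invented:

    # Idea 1: a separate decoder stanza that works much like a filter
    decoder {
      protobuf {
        field => "message"
        # generated by protoc's ruby output; logstash would require this file in
        include => "/etc/logstash/schemas/event.pb.rb"
      }
    }

    # Idea 2: the codec declared directly on the input line
    input {
      amqp {
        host => "mq.example.com"
        name => "events"
        codec => "avro"
        schema => "/etc/logstash/schemas/event.avsc"
      }
    }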

Obviously none of these are as clean as they could be. We already allow external files for filters, so the logic is there. Additionally, how do we start to handle event.sprintf in these cases (for substitution)? This also has the downside of introducing additional latency in the pipeline.

SidneiS
March 15, 2012, 6:02 AM

BSON and Sentry (base64+json+gzip) are two more formats that would be interesting. See: https://groups.google.com/d/topic/logstash-users/X9Q2vYOWjeE/discussion

Jose Diaz-Gonzalez
August 1, 2012, 6:47 PM

I'm +1'ing the fuck out of this. The lumberjack project for a lightweight shipper in C is awesome, but having a non-C shipper would be boss - that's what I'm building with my own beaver project - and using something faster than ujson to JSON-encode data in Python would be killer.

Alexander Jäger
November 12, 2012, 8:05 PM

+1 for Protobuf! Would be super mega ultra killer!

Jordan Sissel
September 4, 2013, 6:01 AM

Done; codecs are implemented and released.
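The released syntax attaches a codec directly to an input or output, roughly like this (see the current docs for the full list of codec names):

    input {
      tcp {
        port => 5000
        codec => "json"
      }
    }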

Assignee

Jordan Sissel

Reporter

John E. Vincent
