The current logstash json schema has a few problems:
It uses two namespacing techniques when only one is needed ("@" prefixing, like "@source", and an "@fields" object as a second namespace).
@source_host and @source_path duplicate @source.
Not all events have all '@named' fields.
The known '@named' fields are not well documented.
I always describe events as "timestamp plus data" so let's start there, and make it versioned just because that's smarter.
The most minimal schema will have two fields: timestamp and version. All other values are optional.
Here's my proposal for a minimal schema with only two required fields: version and timestamp.
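As a rough illustration in Python: a minimal event and a check for the two required fields might look like the sketch below. The '@version' key name and both helper names are assumptions made for this sketch, not settled parts of the proposal; only '@timestamp' is named elsewhere in this discussion.

```python
from datetime import datetime, timezone

# Assumption: the version field is spelled "@version"; the proposal text
# only says "version", so treat this key name as illustrative.
REQUIRED_FIELDS = ("@timestamp", "@version")


def minimal_event(version=1):
    """Build an event carrying only the two required fields."""
    now = datetime.now(timezone.utc)
    # ISO 8601 in UTC with millisecond precision, e.g. 2013-01-01T00:00:00.000Z
    timestamp = now.strftime("%Y-%m-%dT%H:%M:%S.") + f"{now.microsecond // 1000:03d}Z"
    return {"@timestamp": timestamp, "@version": version}


def has_required_fields(event):
    """Return True if the event carries both required fields."""
    return all(name in event for name in REQUIRED_FIELDS)
```

Any extra keys on the event (a 'message', for instance) are optional and do not affect the check.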
Removes all other '@-named' fields: @source_host, @source, @source_path, @type, @tags, @message.
The previous '@fields' namespace is gone; all "event fields" are now top-level.
The previous schema logstash used shall be known as 'version 0'
'json_event' should accept both. 'version 0' events must be converted to 'version 1' events
kibana isn't polluted with "@" symbols everywhere
most relevant data is in 'event fields' which is now top-level, no longer "@fields.somefield"
fewer "required" event fields.
the 'json' input format can go away or generally mean the same thing as json_event
Transition and Backwards Compatibility notes:
For previous events with a '@fields.foo = bar', now it will be 'foo = bar'. Elasticsearch lets you search by leaf names, so "foo:bar" will find both events. (victory)
Since @message is gone, need to figure out what to do about it.
Write an elasticsearch input plugin to allow conversion of old indexes to new schema.
describe conversion of 'version 0' events to 'version 1' (@fields flattening, @source removal, etc)
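A first pass at that conversion could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the '@version' key name is assumed, and since what to do about @message is still open, copying @message, @type and @tags to unprefixed names here is a placeholder choice, not a decision. The function name is hypothetical.

```python
def convert_v0_to_v1(event):
    """Convert a 'version 0' event dict to a 'version 1' event dict."""
    # Assumption: an event that already has "@version" is version 1.
    if "@version" in event:
        return dict(event)

    converted = {"@timestamp": event.get("@timestamp"), "@version": 1}

    # @fields flattening: '@fields.foo = bar' becomes top-level 'foo = bar'.
    for name, value in (event.get("@fields") or {}).items():
        if name not in ("@timestamp", "@version"):
            converted[name] = value

    # Placeholder handling (undecided in the notes above): keep these values
    # under unprefixed names unless a flattened field already uses the name.
    for old_name in ("@message", "@type", "@tags"):
        if old_name in event:
            converted.setdefault(old_name[1:], event[old_name])

    # @source removal: @source, @source_host and @source_path are not copied.
    return converted
```

Letting flattened '@fields' values win on a name collision is one possible policy; the opposite choice would be just as easy to express.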
Re: Kibana, yeah. Kibana will continue to be mostly agnostic about the form of data coming from elasticsearch. Since @timestamp stays the same, kibana doesn't need any changes (especially kibana3, which has even fewer assumptions about data)
First off, dropping @fields and moving fields one level up is awesome. Thanks for that.
However, I'm not looking forward to the overloading of the non-prefixed (@) fields with meta-data from logstash' internal processing. Specifically, I'd like to keep @type, @tags and @message separate from fields supplied by my upstream applications that push into Logstash via redis.
When logging application events from Python, for example, my event structure will already have a 'message' field. I don't need to have all of the fields from the original event indexed separately in ElasticSearch, but I'd like to keep that whole structure in @message, too, like it is now.
In other events, the use of the field 'tags' is pretty common, and I'd not like to see those replaced or polluted by Logstash' own tags.
For @type, I've wished for a while to be able to set that manually from the event structure, but from a different key. In my case 'event'.
Previously, I did a bunch of work on the GELF output to make sure there's a clear distinction between the original fields and what GELF or Logstash require. I hope we can avoid the overloading of the field namespace here, too.
For all intents and purposes, logstash now only requires one field in any logstash event, and that's @timestamp. Everything else is optional. To that end, it doesn't make sense to differentiate what is a "logstash field" because no such thing exists anymore.
Regarding @type and @tags, neither will be required in the future. Currently in logstash 1.1.13, @type and @tags are most useful for making action decisions (filter an event, output an event). In logstash 1.2, you'll have conditionals that let you do the same, but you can make decisions based on any event property, not just the type or tags.
Conditionals allow the 'type' to no longer be a required setting.
So it is my intent to have logstash only require the @timestamp field; anything else is just stuff you explicitly include.
Some inputs will provide fields, like the stdin and file inputs provide a 'message' - for example:
Hope this answers your concerns. Please let me know if you have further questions.
"polluted by Logstash' own tags"
I'm not sure I understand. Logstash doesn't own any tags, and the intent is that it only tags things upon your instruction (the config file).
Ah, I had not seen the conditionals stuff. That seems pretty cool then.
What I mean is that all filters currently (in master) have configuration for working with tags. They always operate on a field called 'tags'. If you happen to already have a 'tags' field from your event source and then use add_tag, the tag would be added to that field, which might actually not be an array, but a string.
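To make the concern concrete, here is a small Python sketch (not Logstash's actual add_tag code; the function is hypothetical) of what a tag helper would have to do to cope with a source-supplied 'tags' value that is a plain string:

```python
def add_tag(event, tag):
    """Add a tag to event['tags'], tolerating a missing or string value."""
    existing = event.get("tags")
    if existing is None:
        tags = []
    elif isinstance(existing, str):
        # A source-supplied string is wrapped in a list instead of being
        # appended to character by character or overwritten.
        tags = [existing]
    else:
        tags = list(existing)
    if tag not in tags:
        tags.append(tag)
    event["tags"] = tags
    return event
```

Even with this normalization, the application's own tags and the ones added by filters end up mixed in the same field, which is the overloading I'd like to avoid.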
I'd also like to retain the ability to have access to the original event source. E.g.:
That way, I can exclude certain fields to be separately indexed, but still see the whole thing if needed.
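In Python terms, the idea could be sketched like this. The function name, the parameter names, and the choice of '@message' as the key holding the raw structure are assumptions for illustration only:

```python
import json


def build_event(source_event, timestamp, indexed_fields, raw_key="@message"):
    """Keep the whole source event as JSON under raw_key and promote
    only the selected fields to the top level of the event."""
    event = {
        "@timestamp": timestamp,
        # The full original structure, serialized so it stays available.
        raw_key: json.dumps(source_event, sort_keys=True),
    }
    for name in indexed_fields:
        if name in source_event:
            event[name] = source_event[name]
    return event
```

Fields left out of `indexed_fields` are not promoted, but they can still be recovered by parsing the value stored under `raw_key`.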