CSV filter not parsing data correctly
Description
Activity

MarcM March 1, 2014 at 5:23 PM (edited)
Good workaround!
But like John, I think a fix for the CSV filter is still required, to address or troubleshoot why the separators were not identified.
On the other hand, since AWS recently added three more fields (x-host-header, cs-protocol, cs-bytes), I have adapted the grok pattern written by John.
filter {
  grok {
    type => "cloudfront"
    pattern => "%{DATE_EU:date}\t%{TIME:time}\t%{WORD:x-edge-location}\t(?:%{NUMBER:sc-bytes}|-)\t%{IPORHOST:c-ip}\t%{WORD:cs-method}\t%{HOSTNAME:cs-host}\t%{NOTSPACE:cs-uri-stem}\t%{NUMBER:sc-status}\t%{GREEDYDATA:referrer}\t%{GREEDYDATA:User-Agent}\t%{GREEDYDATA:cs-uri-query}\t%{GREEDYDATA:cookies}\t%{WORD:x-edge-result-type}\t%{NOTSPACE:x-edge-request-id}\t%{HOSTNAME:x-host-header}\t%{URIPROTO:cs-protocol}\t%{INT:cs-bytes}"
  }
  mutate {
    type => "cloudfront"
    add_field => [ "listener_timestamp", "%{date} %{time}" ]
  }
  date {
    type => "cloudfront"
    match => [ "listener_timestamp", "yy-MM-dd HH:mm:ss" ]
  }
}

john Phan December 28, 2013 at 12:33 PM
Workaround used.

john Phan November 5, 2013 at 11:30 AM
For those who want to process CloudFront logs, use the following filter.
filter {
  grok {
    type => "cloudfront"
    pattern => "%{DATE_EU:date}\t%{TIME:time}\t%{WORD:x-edge-location}\t(?:%{NUMBER:sc-bytes}|-)\t%{IPORHOST:c-ip}\t%{WORD:cs-method}\t%{HOSTNAME:cs(Host)}\t%{NOTSPACE:cs-uri-stem}\t%{NUMBER:sc-status}\t?(?:%{QS:referrer}|-)?\t?(?:%{QS:User-Agent}|-)?\t%{NOTSPACE:cs-uri-query}\t?(?:%{NOTSPACE:cookies}|-)?\t?(?:%{WORD:x-edge-result-type})?\t%{NOTSPACE:x-edge-request-id}"
  }
  mutate {
    type => "cloudfront"
    add_field => [ "listener_timestamp", "%{date} %{time}" ]
  }
  date {
    type => "cloudfront"
    match => [ "listener_timestamp", "yy-MM-dd HH:mm:ss" ]
  }
}
This has been confirmed to be working on Logstash 1.2.1.
However, a fix for the CSV filter is still required, to address or troubleshoot why the separators were not identified.
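One possibility worth checking (this is only an assumption, not something verified in this thread) is how the separator string reaches the filter: if the config parser hands the csv filter the literal two characters \t instead of a real tab, the separator will never match anything in the message. A minimal sketch of that check, with an actual tab character typed between the quotes:
filter {
  if [type] == "cloudfront" {
    csv {
      # assumption: a real tab character is typed here instead of the escape sequence \t
      separator => "	"
      columns => [ "date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method", "Host", "cs-uri-stem", "sc-status", "Referer", "User-Agent", "cs-uri-query", "Cookie", "x-edge-result-type", "x-edge-request-id" ]
    }
  }
}
If the columns populate with that change, the problem is in the escape handling of the config rather than in the csv filter itself.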

john Phan November 4, 2013 at 4:37 PM
Hey Philippe,
I got this working with grok and will post the config later.
However, I would still like to work out why the CSV filter is not working as expected, just so I know what is going on with it.
Regards
John

Philippe Weber November 1, 2013 at 6:53 AM
Then the only solution for you would be to define a custom grok pattern.
For inspiration, I'm using the following for WebLogic access logs, which are also tab-separated like your CloudFront logs:
TAB \t
WEBLOGIC_ACCESSLOG %{ISO8601_DATE:date}%{TAB}%{TIME:time}%{TAB}%{NUMBER:time_taken:float}%{TAB}%{IPORHOST:c_ip}%{TAB}(?:%{IPORHOST:x_ClientIP}|-)%{TAB}%{NUMBER:sc_status:int}%{TAB}(?:%{NUMBER:bytes:int}|-)%{TAB}(?:-|%{USERNAME:x_AuthUser})%{TAB}%{WORD:cs_method}%{TAB}%{NOTSPACE:cs_uri}%{TAB}%{DATA:x_UserAgent}%{TAB}(?:-|%{DATA:x_Referer})%{TAB}%{DATA:x_Scheme}%{TAB}%{DATA:x_Protocol}%{TAB}%{GREEDYDATA:x_AcceptLanguage}
and then use a grok filter:
grok {
  patterns_dir => "./patterns"
  pattern => "%{WEBLOGIC_ACCESSLOG}"
}
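Purely as an untested sketch, the same patterns-file approach could be adapted to the CloudFront field list (a hypothetical ./patterns file; the field names follow the cloudfront_fields header shown in the output below, CLOUDFRONT_ACCESSLOG is a made-up name, and NOTSPACE is used for the date, URI, referrer, and cookie fields to keep the sketch permissive):
TAB \t
CLOUDFRONT_ACCESSLOG %{NOTSPACE:date}%{TAB}%{TIME:time}%{TAB}%{WORD:x_edge_location}%{TAB}(?:%{NUMBER:sc_bytes:int}|-)%{TAB}%{IPORHOST:c_ip}%{TAB}%{WORD:cs_method}%{TAB}%{HOSTNAME:cs_host}%{TAB}%{NOTSPACE:cs_uri_stem}%{TAB}%{NUMBER:sc_status:int}%{TAB}%{NOTSPACE:referrer}%{TAB}%{NOTSPACE:user_agent}%{TAB}%{NOTSPACE:cs_uri_query}%{TAB}%{NOTSPACE:cookies}%{TAB}%{WORD:x_edge_result_type}%{TAB}%{NOTSPACE:x_edge_request_id}
used from the config in the same way:
grok {
  patterns_dir => "./patterns"
  pattern => "%{CLOUDFRONT_ACCESSLOG}"
}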
I have the following config:
input {
  s3 {
    bucket => "test"
    region => "hidden"
    credentials => [ "test", "test" ]
    add_field => { "Environment" => "TEST" }
    add_field => { "Service" => "TEST_DATA" }
    add_field => { "PLATFORM" => "test" }
    sincedb_path => "/opt/logstash/.sincedb_test"
    type => "cloudfront"
    codec => plain { charset => "ASCII" }
  }
}
filter {
  if [type] == "cloudfront" {
    csv {
      separator => "\t"
      columns => [ "date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method", "Host", "cs-uri-stem", "sc-status", "Referer", "User-Agent", "cs-uri-query", "Cookie", "x-edge-result-type", "x-edge-request-id" ]
    }
  }
}
output {
  elasticsearch { host => "127.0.0.1" }
}
This is the output I get from the system in the console:
{ "_index": "logstash-2013.10.31", "_type": "logs", "_id": "TAhK9DbASXOFYO1Tx7T1ZA", "_score": null, "_source": { "message": [ "2013-06-07\t14:35:52\tJFK5\t670\t206.132.241.39\tGET\td38qae89xj7q6j.cloudfront.net\t/test/test_asset.html\t200\t-\tAkamai%20Edge\t-\t-\tHit\tjissg5gl1LOUv64PH6yc8uXpJ6q5gJtx7F3cC9cb3gF4ooxqlx7tqg==\n" ], "@timestamp": "2013-10-31T15:30:40.126Z", "@version": "1", "type": "cloudfront", "Service": "TEST_DATA", "cloudfront_version": "1.0", "cloudfront_fields": "date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id", "date": "2013-06-07\t14:35:52\tJFK5\t670\t206.132.241.39\tGET\td38qae89xj7q6j.cloudfront.net\t/test/test_asset.html\t200\t-\tAkamai%20Edge\t-\t-\tHit\tjissg5gl1LOUv64PH6yc8uXpJ6q5gJtx7F3cC9cb3gF4ooxqlx7tqg==" }, "sort": [ 1383233440126 ] }
These are the logs concerning the parsing, taken from a -vv debugging session:
{:timestamp=>"2013-10-31T15:30:40.139000+0000", :message=>"Running csv filter", :event=>#<LogStash::Event:0x1b87966c @data={"message"=>"2013-06-07\t14:35:52\tJFK5\t670\t206.132.241.39\tGET\td38qae89xj7q6j.cloudfront.net\t/test/test_asset.html\t200\t-\tAkamai%20Edge\t-\t-\tHit\tjissg5gl1LOUv64PH6yc8uXpJ6q5gJtx7F3cC9cb3gF4ooxqlx7tqg==\n", "@timestamp"=>"2013-10-31T15:30:40.126Z", "@version"=>"1", "type"=>"cloudfront", "Service"=>"TEST_DATA", "cloudfront_version"=>"1.0", "cloudfront_fields"=>"date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id"}, @cancelled=false>, :level=>:debug, :file=>"logstash/logstash-1.2.1-flatjar.jar!/logstash/filters/csv.rb", :line=>"42", :method=>"filter"}
{:timestamp=>"2013-10-31T15:30:40.149000+0000", :message=>"Event after csv filter", :event=>#<LogStash::Event:0x1b87966c @data={"message"=>["2013-06-07\t14:35:52\tJFK5\t670\t206.132.241.39\tGET\td38qae89xj7q6j.cloudfront.net\t/test/test_asset.html\t200\t-\tAkamai%20Edge\t-\t-\tHit\tjissg5gl1LOUv64PH6yc8uXpJ6q5gJtx7F3cC9cb3gF4ooxqlx7tqg==\n"], "@timestamp"=>"2013-10-31T15:30:40.126Z", "@version"=>"1", "type"=>"cloudfront", "Service"=>"TEST_DATA", "cloudfront_version"=>"1.0", "cloudfront_fields"=>"date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id", "date"=>"2013-06-07\t14:35:52\tJFK5\t670\t206.132.241.39\tGET\td38qae89xj7q6j.cloudfront.net\t/test/test_asset.html\t200\t-\tAkamai%20Edge\t-\t-\tHit\tjissg5gl1LOUv64PH6yc8uXpJ6q5gJtx7F3cC9cb3gF4ooxqlx7tqg=="}, @cancelled=false>, :level=>:debug, :file=>"logstash/logstash-1.2.1-flatjar.jar!/logstash/filters/csv.rb", :line=>"83", :method=>"filter"}
{:timestamp=>"2013-10-31T15:30:46.500000+0000", :message=>"output received", :event=>#<LogStash::Event:0x1b87966c @data={"message"=>["2013-06-07\t14:35:52\tJFK5\t670\t206.132.241.39\tGET\td38qae89xj7q6j.cloudfront.net\t/test/test_asset.html\t200\t-\tAkamai%20Edge\t-\t-\tHit\tjissg5gl1LOUv64PH6yc8uXpJ6q5gJtx7F3cC9cb3gF4ooxqlx7tqg==\n"], "@timestamp"=>"2013-10-31T15:30:40.126Z", "@version"=>"1", "type"=>"cloudfront", "Service"=>"TEST_DATA", "cloudfront_version"=>"1.0", "cloudfront_fields"=>"date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id", "date"=>"2013-06-07\t14:35:52\tJFK5\t670\t206.132.241.39\tGET\td38qae89xj7q6j.cloudfront.net\t/test/test_asset.html\t200\t-\tAkamai%20Edge\t-\t-\tHit\tjissg5gl1LOUv64PH6yc8uXpJ6q5gJtx7F3cC9cb3gF4ooxqlx7tqg=="}, @cancelled=false>, :level=>:info, :file=>"(eval)", :line=>"37", :method=>"initialize"}
I can't use 1.2.2 yet, as there is an issue with S3 buckets in 1.2.2 that I hit when I tried upgrading; I am looking into that in another ticket and trying to debug it on the side as well.
I don't know why the csv filter runs into an issue after the first column: the whole line ends up in the first column ("date") and the parse never finishes. I did have this working, but I have messed around with the config so much that I can't remember how I got it working.
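For reference, here is a minimal sketch (untested, using the stdin and stdout plugins) that isolates the csv filter from the S3 input, so a single raw log line can be pasted in and the parsed event inspected:
input {
  stdin { type => "cloudfront" }
}
filter {
  if [type] == "cloudfront" {
    csv {
      separator => "\t"
      columns => [ "date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method", "Host", "cs-uri-stem", "sc-status", "Referer", "User-Agent", "cs-uri-query", "Cookie", "x-edge-result-type", "x-edge-request-id" ]
    }
  }
}
output {
  # rubydebug prints the whole event, so each parsed column is visible
  stdout { codec => rubydebug }
}
Pasting one CloudFront line into the terminal should show whether the columns come out as separate fields or the whole line lands in "date", as in the debug output above.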
Can someone help me?
Regards
John