Duplicate Events at Midnight

Description

Hi everyone,
Ive been struggling and working through setting up logstash - elasticsearch -kibana and working with it for the past month now.
Im at the point of now playing with using logstash and log-forwarder, trying to use filters and send to elasticsearch. After looking through kibana it looks like I have duplicate logs being sent through to ES right at midnight.
I have done some googling and troubleshooting and am still having issues pinpointing the problem.

I am running ElasticSearch 1.1.0, logstash-1.4.0, and kibana 3.0.0..

Here is the situation, we have a central syslog server that stores a bunch of logs. The logs are stored and rotated daily at midnight with the naming convention of logname_YYYY-MM-DD_messages.log. There is a clean up script that zips up anything older than 7 days, and deletes anything older than 30.
It looks like at midnight every night there is a spike in logs where data is duplicated while searching in kibana. I have verified this by doing a search within kibana and ES by tying together a string with a date time stamp combined with other unique characters within the message portion of the log. For example: BARW-INET-FW01 AND "Apr 30 08:41:19" AND "75416". I hit typically two results. Once at the original time the message was generated, and the other right around midnight.

Attached are my rules set up within logstash-forwarder on the central log server, and the configurations on the elasticsearch server. They are very basic at this point, and Im sure I can have my configurations on the logstash forwarder a bit more simple.

Initially I thought within the logstash-forwarder config file, I could be able to add the actual date variable into the filename (https://logstash.jira.com/browse/LOGSTASH-608):
From "paths": [ "/var/log/syslog/OREM-BARW-INET-FW01*.log" ] – to trying "paths": [ "/var/log/syslog/10.3.5.23_%{YEAR}%{MONTHNUM}%{MONTHDAY}*.log" ] which did not work after restarting the logstash-forwader service. Is there a way to use variable type pattern matching to specify which files search for?

I have a feeling that it has to do with the log-rotate at midnight, where logstash somehow sends data again. I read a post to a similar issue where Jordan (https://groups.google.com/forum/#!topic/logstash-users/4xaN6LlYyWA) responded with setting '--window-size 1' on lumberjack. I could not really find any documentation on where this is done. Can anyone lend a hand in this? Are there any strategies/best practices to fix this issue. Essentially the logs are getting duplicated at midnight taking up twice the amount of space it should..

Im not sure if adding a date variable in the input section of the logstash-forwarder file would help...but I would really like to find a way to resolve this. Is it a matter of using logstash forwarder vs just an agent?

Thank you for the help....

Activity

Show:
Philippe Weber
May 16, 2014, 10:03 AM

As you are using a wildcard, is it not simply that logstash read the new file created by logrotate.

You could restrict your file input to only read the current logfile (without the timestamp) and do not reread the others file, no?

Dave Marcinowski
May 19, 2014, 7:32 PM

Hey Philippe
Thanks for the response.
Im not sure that I totally understand, so I apologize for the questions:

To your first point, of using the wild card. If logstash was reading the new file created by logrotate, shouldnt that be a near empty file that is created at 12AM?
How would I get logstash to start reading the new logfiles after the log rotate without rereading the previous days, file?

Which leads to your second point, how would I go about that logstash only read the current logfile. Would having a date variable within the configuration help with that. In other words, today is 5/19 and at midnight it will be 5/20. If I had a date variable couldnt logstash start reading the new file with the 5/20 date as that now would be the current date. All other files would be ignored.....Is there a more simple alternative that I am not seeing. I guess my confusion is why logstash would reread a file it has been parsing all day, when it should start parsing a new file at midnight.....

How would my input look to only reading the current logfile?

Again thanks you for the patience and help

Dave

Philippe Weber
May 20, 2014, 4:54 AM

What is sure is that the file input does not have any logic related to "current day" as you expect. It would require a complex logic to read the last lines of "previous" log file at rotation time. and what about more complex rotation config (by size, by hours)

A kind of standard way to use log rotation is to have one primary log file without timestamp /var/log/my-application.log that is always the current log file. And then you could have rotation by size of by date that creates backup like /var/log/my-application-2014-05-0-19.log. In this configuration, pointing logstash only to the current /var/log/my-application.log is enough, no need for wildcard and no dual reading of same duplicate lines;

What I suspect in your case is that logstash detects a new file around midnight and starts reading from it but this file contains data already processed thus creating doublons, you should carefully review your log rotation script/config/... to understand why it is the case

Assignee

Logstash Developers

Reporter

Dave Marcinowski

Affects versions

Configure