Logstash (node client) haphazardly connects to Elasticsearch

Description

  1.  

    1. Logstash (node client) haphazardly connects to Elasticsearch

  1.  

    1.  

      1. Cross-posted:

  1.  

    1.  

      1. Summary

  • Logstash (node client) haphazardly connects to Elasticsearch.

  • Since it sometimes connects, it rules out any obvious issues with

  • Logstash configuration

  • Elasticsearch configuration

  • Network configuration

  • We have 2 Elasticsearch outputs on 1 Logstash instance

  • The first output in the conf file will connect, the second will get a `MasterNotDiscoveredException`.

  • If we reverse the order of the outputs (2nd becomes 1st), then what was previously 2nd now connects.

  • Sometimes both connect.

  • I believe this is an Elasticsearch issue (and not Logstash) because

  • Logstash uses Node Client which finally uses the jars provided by Elasticsearch.

  • The connection request from Logstash arrives at the Elasticsearch master.

  1.  

    1.  

      1. Setup

  • Elasticsearch (1.0.1) with Logstash (1.4.0) (These versions allow node client).

  • Elasticsearch and Logstash are in different lans, however tcpdump reveals that they are receiving each others' traffic.

  • We are using unicast. Logstash is not in the `discovery.zen.ping.unicast.hosts` list. The rest of the cluster is.

  • Elasticsearch cluster has 7 nodes with 3 dedicated masters that are not doing anything except being masters.

  • The cluster has about 600 shards (60 indices).

  1.  

    1.  

      1. Details

  • There are no errors in the log files when run at trace level.

  • Most of the lines look like:

```
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Submitting new rescheduling cluster info update job
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Performing ClusterInfoUpdateJob
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Scheduling next run for updating cluster info in: 30s
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Skipping ClusterInfoUpdatedJob since it is disabled
[2014-08-25 16:46:54,288][TRACE][transport.netty ] [es_master1] channel closed: [id: 0x72c1a352, /10.22.65.83:59057 => /10.22.165.105:9300]
[2014-08-25 16:46:54,340][TRACE][transport.netty ] [es_master1] channel opened: [id: 0xbc141af9, /10.22.65.83:59072 => /10.22.165.105:9300]
```

  • On the Logstash side, there are a lot of tcp resets. However, I believe that's because Logstash tries port 9300-9305, and Elasticsearch closes the ones it does not intend to use.

```
16:47:40.233712 IP logstash1.59310 > es_master_1.vrace: Flags [.], ack 1153, win 4, length 0
16:47:40.234470 IP logstash1.35367 > es_master_1.9301: Flags [S], seq 923830868, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.235067 IP es_master_1.9301 > logstash1.35367: Flags [R.], seq 2246437439, ack 923830869, win 0, length 0
16:47:40.240005 IP logstash1.54146 > es_master_1.9303: Flags [S], seq 3764962801, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.240539 IP es_master_1.9303 > logstash1.54146: Flags [R.], seq 2168257573, ack 3764962802, win 0, length 0
16:47:40.240748 IP logstash1.54507 > es_master_1.9305: Flags [S], seq 244766238, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.241278 IP es_master_1.9305 > logstash1.54507: Flags [R.], seq 1372279245, ack 244766239, win 0, length 0
16:47:40.241761 IP logstash1.35916 > es_master_1.9302: Flags [S], seq 4110405003, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.242295 IP es_master_1.9302 > logstash1.35916: Flags [R.], seq 2701921297, ack 4110405004, win 0, length 0
16:47:40.243685 IP logstash1.38021 > es_master_1.9304: Flags [S], seq 3891794882, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.244448 IP es_master_1.9304 > logstash1.38021: Flags [R.], seq 2041403893, ack 3891794883, win 0, length 0
```

  1.  

    1.  

      1. What I've tried

  • Thinking it could be a timeout somewhere, I've tried increasing these in Logstash (to no avail).

  • `discovery.zen.ping_timeout`

  • `discovery.zen.join_timeout`

  • `discovery.zen.ping.fd.ping_timeout`

  • `discovery.zen.ping.fd.ping_retries`

  • I tried changing this in Elasticsearch master

  • `discovery.zen.publish_timeout: 1`

  • Not everything is available from the config so for some of them, I had to patch Logstash.

  1. Clues?

  • When Elasticsearch and Logstash are in the same LAN, this does not happen.

  • This would seem to indicate a network issue ... but what? ... especially since they do connect sometimes.

  • The traceroute from Logstash to Elasticsearch looks like:

```
traceroute to es_master_1 (XX.XX.XXX.105), 30 hops max, 60 byte packets
1 XX.XX.XX.2 (XX.XX.XX.2) 1.103 ms 1.197 ms 1.300 ms
2 XXX.XX.XX.169 (XXX.XX.XX.169) 0.187 ms 0.233 ms 0.220 ms
3 XXX.XX.XX.162 (XXX.XX.XX.162) 1.447 ms 1.474 ms 1.571 ms
4 XXX.XXX.XX.33 (XXX.XXX.XX.33) 1.402 ms 1.524 ms 1.588 ms
5 XXX.XXX.XX.26 (XXX.XXX.XX.26) 1.940 ms 1.979 ms 2.076 ms
6 XXX.XX.XX.190 (XXX.XX.XX.190) 0.520 ms 0.488 ms 0.463 ms
7 XXX.XX.XX.197 (XXX.XX.XX.197) 2.082 ms 1.739 ms 1.859 ms
8 es_master_1 (XX.XX.XXX.105) 0.522 ms 0.458 ms 0.497 ms
```

Gliffy Diagrams

Activity

Show:
Incomplete

Details

Assignee

Reporter

Affects versions

Created August 25, 2014 at 10:47 PM
Updated February 6, 2015 at 10:56 PM
Resolved February 6, 2015 at 10:56 PM