- Logstash (node client) haphazardly connects to Elasticsearch
- Logstash (node client) haphazardly connects to Elasticsearch.
- Since it sometimes connects, it rules out any obvious issues with
- Logstash configuration
- Elasticsearch configuration
- Network configuration
- We have 2 Elasticsearch outputs on 1 Logstash instance
- The first output in the conf file will connect, the second will get a `MasterNotDiscoveredException`.
- If we reverse the order of the outputs (2nd becomes 1st), then what was previously 2nd now connects.
- Sometimes both connect.
- I believe this is an Elasticsearch issue (and not Logstash) because
- Logstash uses Node Client which finally uses the jars provided by Elasticsearch.
- The connection request from Logstash arrives at the Elasticsearch master.
- Elasticsearch (1.0.1) with Logstash (1.4.0) (These versions allow node client).
- Elasticsearch and Logstash are in different lans, however tcpdump reveals that they are receiving each others' traffic.
- We are using unicast. Logstash is not in the `discovery.zen.ping.unicast.hosts` list. The rest of the cluster is.
- Elasticsearch cluster has 7 nodes with 3 dedicated masters that are not doing anything except being masters.
- The cluster has about 600 shards (60 indices).
- There are no errors in the log files when run at trace level.
- Most of the lines look like:
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Submitting new rescheduling cluster info update job
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Performing ClusterInfoUpdateJob
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Scheduling next run for updating cluster info in: 30s
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Skipping ClusterInfoUpdatedJob since it is disabled
[2014-08-25 16:46:54,288][TRACE][transport.netty ] [es_master1] channel closed: [id: 0x72c1a352, /10.22.65.83:59057 => /10.22.165.105:9300]
[2014-08-25 16:46:54,340][TRACE][transport.netty ] [es_master1] channel opened: [id: 0xbc141af9, /10.22.65.83:59072 => /10.22.165.105:9300]
- On the Logstash side, there are a lot of tcp resets. However, I believe that's because Logstash tries port 9300-9305, and Elasticsearch closes the ones it does not intend to use.
16:47:40.233712 IP logstash1.59310 > es_master_1.vrace: Flags [.], ack 1153, win 4, length 0
16:47:40.234470 IP logstash1.35367 > es_master_1.9301: Flags [S], seq 923830868, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.235067 IP es_master_1.9301 > logstash1.35367: Flags [R.], seq 2246437439, ack 923830869, win 0, length 0
16:47:40.240005 IP logstash1.54146 > es_master_1.9303: Flags [S], seq 3764962801, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.240539 IP es_master_1.9303 > logstash1.54146: Flags [R.], seq 2168257573, ack 3764962802, win 0, length 0
16:47:40.240748 IP logstash1.54507 > es_master_1.9305: Flags [S], seq 244766238, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.241278 IP es_master_1.9305 > logstash1.54507: Flags [R.], seq 1372279245, ack 244766239, win 0, length 0
16:47:40.241761 IP logstash1.35916 > es_master_1.9302: Flags [S], seq 4110405003, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.242295 IP es_master_1.9302 > logstash1.35916: Flags [R.], seq 2701921297, ack 4110405004, win 0, length 0
16:47:40.243685 IP logstash1.38021 > es_master_1.9304: Flags [S], seq 3891794882, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.244448 IP es_master_1.9304 > logstash1.38021: Flags [R.], seq 2041403893, ack 3891794883, win 0, length 0
- What I've tried
- Thinking it could be a timeout somewhere, I've tried increasing these in Logstash (to no avail).
- I tried changing this in Elasticsearch master
- `discovery.zen.publish_timeout: 1`
- Not everything is available from the config so for some of them, I had to patch Logstash.
- When Elasticsearch and Logstash are in the same LAN, this does not happen.
- This would seem to indicate a network issue ... but what? ... especially since they do connect sometimes.
- The traceroute from Logstash to Elasticsearch looks like:
traceroute to es_master_1 (XX.XX.XXX.105), 30 hops max, 60 byte packets
1 XX.XX.XX.2 (XX.XX.XX.2) 1.103 ms 1.197 ms 1.300 ms
2 XXX.XX.XX.169 (XXX.XX.XX.169) 0.187 ms 0.233 ms 0.220 ms
3 XXX.XX.XX.162 (XXX.XX.XX.162) 1.447 ms 1.474 ms 1.571 ms
4 XXX.XXX.XX.33 (XXX.XXX.XX.33) 1.402 ms 1.524 ms 1.588 ms
5 XXX.XXX.XX.26 (XXX.XXX.XX.26) 1.940 ms 1.979 ms 2.076 ms
6 XXX.XX.XX.190 (XXX.XX.XX.190) 0.520 ms 0.488 ms 0.463 ms
7 XXX.XX.XX.197 (XXX.XX.XX.197) 2.082 ms 1.739 ms 1.859 ms
8 es_master_1 (XX.XX.XXX.105) 0.522 ms 0.458 ms 0.497 ms