Logstash (node client) haphazardly connects to Elasticsearch
Created August 25, 2014 at 10:47 PM
Updated February 6, 2015 at 10:56 PM
Resolved February 6, 2015 at 10:56 PM
Cross-posted:
https://github.com/elasticsearch/elasticsearch/issues/7445
https://logstash.jira.com/browse/LOGSTASH-2283
Summary
Logstash (node client) connects to Elasticsearch only haphazardly.
Since it does sometimes connect, this rules out any obvious problems with:
Logstash configuration
Elasticsearch configuration
Network configuration
We have 2 Elasticsearch outputs on 1 Logstash instance (roughly sketched below).
The first output in the conf file will connect; the second will get a `MasterNotDiscoveredException`.
If we reverse the order of the outputs (2nd becomes 1st), then the one that was previously 2nd now connects.
Sometimes both connect.
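The sketch below shows what such a pair of outputs might look like in the Logstash conf. The cluster name, hostnames, and index names are made up for illustration; they are not our actual values.
```
output {
  # First output in the file -- this one usually joins the cluster fine.
  elasticsearch {
    cluster => "my_cluster"                    # hypothetical cluster name
    host    => "es_master_1"                   # hypothetical master hostname
    index   => "logstash-app1-%{+YYYY.MM.dd}"  # hypothetical index pattern
  }
  # Second output -- this one fails with MasterNotDiscoveredException.
  elasticsearch {
    cluster => "my_cluster"
    host    => "es_master_1"
    index   => "logstash-app2-%{+YYYY.MM.dd}"
  }
}
```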
I believe this is an Elasticsearch issue (and not a Logstash one) because:
Logstash uses the node client, which ultimately uses the jars provided by Elasticsearch.
The connection request from Logstash does arrive at the Elasticsearch master.
Setup
Elasticsearch 1.0.1 with Logstash 1.4.0 (these versions support the node client).
Elasticsearch and Logstash are in different LANs; however, tcpdump shows that they are receiving each other's traffic.
We are using unicast discovery. Logstash is not in the `discovery.zen.ping.unicast.hosts` list; the rest of the cluster is (sketched below).
The Elasticsearch cluster has 7 nodes, 3 of which are dedicated masters that do nothing except act as masters.
The cluster has about 600 shards (60 indices).
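For context, the discovery-related part of the cluster config would look roughly like this. The hostnames are illustrative rather than copied from our files, and disabling multicast is my assumption, since we rely on unicast:
```
# elasticsearch.yml on the cluster nodes (illustrative hostnames)
discovery.zen.ping.multicast.enabled: false   # assumption: multicast off, unicast only
discovery.zen.ping.unicast.hosts: ["es_master_1", "es_master_2", "es_master_3"]
# Note: the Logstash node client is deliberately NOT in this list; it only
# dials out to these hosts when it joins.
```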
Details
There are no errors in the log files, even when running at trace level.
Most of the lines look like:
```
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Submitting new rescheduling cluster info update job
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Performing ClusterInfoUpdateJob
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Scheduling next run for updating cluster info in: 30s
[2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Skipping ClusterInfoUpdatedJob since it is disabled
[2014-08-25 16:46:54,288][TRACE][transport.netty ] [es_master1] channel closed: [id: 0x72c1a352, /10.22.65.83:59057 => /10.22.165.105:9300]
[2014-08-25 16:46:54,340][TRACE][transport.netty ] [es_master1] channel opened: [id: 0xbc141af9, /10.22.65.83:59072 => /10.22.165.105:9300]
```
On the Logstash side, there are a lot of TCP resets. However, I believe that is because Logstash tries ports 9300-9305, and Elasticsearch resets the ones it is not actually listening on (see the note after the capture).
```
16:47:40.233712 IP logstash1.59310 > es_master_1.vrace: Flags [.], ack 1153, win 4, length 0
16:47:40.234470 IP logstash1.35367 > es_master_1.9301: Flags [S], seq 923830868, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.235067 IP es_master_1.9301 > logstash1.35367: Flags [R.], seq 2246437439, ack 923830869, win 0, length 0
16:47:40.240005 IP logstash1.54146 > es_master_1.9303: Flags [S], seq 3764962801, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.240539 IP es_master_1.9303 > logstash1.54146: Flags [R.], seq 2168257573, ack 3764962802, win 0, length 0
16:47:40.240748 IP logstash1.54507 > es_master_1.9305: Flags [S], seq 244766238, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.241278 IP es_master_1.9305 > logstash1.54507: Flags [R.], seq 1372279245, ack 244766239, win 0, length 0
16:47:40.241761 IP logstash1.35916 > es_master_1.9302: Flags [S], seq 4110405003, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.242295 IP es_master_1.9302 > logstash1.35916: Flags [R.], seq 2701921297, ack 4110405004, win 0, length 0
16:47:40.243685 IP logstash1.38021 > es_master_1.9304: Flags [S], seq 3891794882, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
16:47:40.244448 IP es_master_1.9304 > logstash1.38021: Flags [R.], seq 2041403893, ack 3891794883, win 0, length 0
```
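My reading of this: when a unicast host is listed without an explicit port, the node client probes the default 9300-9305 range, but the master only listens on 9300, so the other probes are refused. Assuming that is what is going on, pinning the port in the unicast list handed to the node client should silence the extra resets (a sketch only; it should not change the discovery outcome either way):
```
# Hypothetical: list the master with an explicit transport port so that only
# 9300 is probed, instead of the default 9300-9305 sweep per host.
discovery.zen.ping.unicast.hosts: ["es_master_1:9300"]
```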
What I've tried
Thinking it could be a timeout somewhere, I've tried increasing these on the Logstash (node client) side, to no avail:
`discovery.zen.ping_timeout`
`discovery.zen.join_timeout`
`discovery.zen.ping.fd.ping_timeout`
`discovery.zen.ping.fd.ping_retries`
I also tried changing this on the Elasticsearch master:
`discovery.zen.publish_timeout: 1`
Not all of these are exposed through the Logstash configuration, so for some of them I had to patch Logstash.
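Written out as elasticsearch-style settings, the changes above would look roughly like this. The timeout values are placeholders (the exact numbers varied between attempts); only `publish_timeout: 1` is the value quoted above:
```
# Placeholder values -- not the exact numbers we tried.
# On the Logstash (node client) side:
discovery.zen.ping_timeout: 60s
discovery.zen.join_timeout: 120s
# (plus the fault-detection ping_timeout / ping_retries settings listed above)

# On the Elasticsearch master:
discovery.zen.publish_timeout: 1
```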
Clues?
When Elasticsearch and Logstash are in the same LAN, this does not happen.
That would seem to point to a network issue, but what, exactly? Especially since they do connect sometimes.
The traceroute from Logstash to Elasticsearch looks like:
```
traceroute to es_master_1 (XX.XX.XXX.105), 30 hops max, 60 byte packets
1 XX.XX.XX.2 (XX.XX.XX.2) 1.103 ms 1.197 ms 1.300 ms
2 XXX.XX.XX.169 (XXX.XX.XX.169) 0.187 ms 0.233 ms 0.220 ms
3 XXX.XX.XX.162 (XXX.XX.XX.162) 1.447 ms 1.474 ms 1.571 ms
4 XXX.XXX.XX.33 (XXX.XXX.XX.33) 1.402 ms 1.524 ms 1.588 ms
5 XXX.XXX.XX.26 (XXX.XXX.XX.26) 1.940 ms 1.979 ms 2.076 ms
6 XXX.XX.XX.190 (XXX.XX.XX.190) 0.520 ms 0.488 ms 0.463 ms
7 XXX.XX.XX.197 (XXX.XX.XX.197) 2.082 ms 1.739 ms 1.859 ms
8 es_master_1 (XX.XX.XXX.105) 0.522 ms 0.458 ms 0.497 ms
```