Uploaded image for project: 'logstash'
  1. LOGSTASH-2283

Logstash (node client) haphazardly connects to Elasticsearch

    Details

      Description

        1. Logstash (node client) haphazardly connects to Elasticsearch
          1. Cross-posted:
          1. Summary
      • Logstash (node client) haphazardly connects to Elasticsearch.
      • Since it sometimes connects, it rules out any obvious issues with
      • Logstash configuration
      • Elasticsearch configuration
      • Network configuration
      • We have 2 Elasticsearch outputs on 1 Logstash instance
      • The first output in the conf file will connect, the second will get a `MasterNotDiscoveredException`.
      • If we reverse the order of the outputs (2nd becomes 1st), then what was previously 2nd now connects.
      • Sometimes both connect.
      • I believe this is an Elasticsearch issue (and not Logstash) because
      • Logstash uses Node Client which finally uses the jars provided by Elasticsearch.
      • The connection request from Logstash arrives at the Elasticsearch master.
          1. Setup
      • Elasticsearch (1.0.1) with Logstash (1.4.0) (These versions allow node client).
      • Elasticsearch and Logstash are in different lans, however tcpdump reveals that they are receiving each others' traffic.
      • We are using unicast. Logstash is not in the `discovery.zen.ping.unicast.hosts` list. The rest of the cluster is.
      • Elasticsearch cluster has 7 nodes with 3 dedicated masters that are not doing anything except being masters.
      • The cluster has about 600 shards (60 indices).
          1. Details
      • There are no errors in the log files when run at trace level.
      • Most of the lines look like:

      ```
      [2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Submitting new rescheduling cluster info update job
      [2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Performing ClusterInfoUpdateJob
      [2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Scheduling next run for updating cluster info in: 30s
      [2014-08-25 16:46:53,142][TRACE][cluster ] [es_master1] Skipping ClusterInfoUpdatedJob since it is disabled
      [2014-08-25 16:46:54,288][TRACE][transport.netty ] [es_master1] channel closed: [id: 0x72c1a352, /10.22.65.83:59057 => /10.22.165.105:9300]
      [2014-08-25 16:46:54,340][TRACE][transport.netty ] [es_master1] channel opened: [id: 0xbc141af9, /10.22.65.83:59072 => /10.22.165.105:9300]
      ```

      • On the Logstash side, there are a lot of tcp resets. However, I believe that's because Logstash tries port 9300-9305, and Elasticsearch closes the ones it does not intend to use.

      ```
      16:47:40.233712 IP logstash1.59310 > es_master_1.vrace: Flags [.], ack 1153, win 4, length 0
      16:47:40.234470 IP logstash1.35367 > es_master_1.9301: Flags [S], seq 923830868, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
      16:47:40.235067 IP es_master_1.9301 > logstash1.35367: Flags [R.], seq 2246437439, ack 923830869, win 0, length 0
      16:47:40.240005 IP logstash1.54146 > es_master_1.9303: Flags [S], seq 3764962801, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
      16:47:40.240539 IP es_master_1.9303 > logstash1.54146: Flags [R.], seq 2168257573, ack 3764962802, win 0, length 0
      16:47:40.240748 IP logstash1.54507 > es_master_1.9305: Flags [S], seq 244766238, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
      16:47:40.241278 IP es_master_1.9305 > logstash1.54507: Flags [R.], seq 1372279245, ack 244766239, win 0, length 0
      16:47:40.241761 IP logstash1.35916 > es_master_1.9302: Flags [S], seq 4110405003, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
      16:47:40.242295 IP es_master_1.9302 > logstash1.35916: Flags [R.], seq 2701921297, ack 4110405004, win 0, length 0
      16:47:40.243685 IP logstash1.38021 > es_master_1.9304: Flags [S], seq 3891794882, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 14], length 0
      16:47:40.244448 IP es_master_1.9304 > logstash1.38021: Flags [R.], seq 2041403893, ack 3891794883, win 0, length 0
      ```

          1. What I've tried
      • Thinking it could be a timeout somewhere, I've tried increasing these in Logstash (to no avail).
      • `discovery.zen.ping_timeout`
      • `discovery.zen.join_timeout`
      • `discovery.zen.ping.fd.ping_timeout`
      • `discovery.zen.ping.fd.ping_retries`
      • I tried changing this in Elasticsearch master
      • `discovery.zen.publish_timeout: 1`
      • Not everything is available from the config so for some of them, I had to patch Logstash.
      1. Clues?
      • When Elasticsearch and Logstash are in the same LAN, this does not happen.
      • This would seem to indicate a network issue ... but what? ... especially since they do connect sometimes.
      • The traceroute from Logstash to Elasticsearch looks like:

      ```
      traceroute to es_master_1 (XX.XX.XXX.105), 30 hops max, 60 byte packets
      1 XX.XX.XX.2 (XX.XX.XX.2) 1.103 ms 1.197 ms 1.300 ms
      2 XXX.XX.XX.169 (XXX.XX.XX.169) 0.187 ms 0.233 ms 0.220 ms
      3 XXX.XX.XX.162 (XXX.XX.XX.162) 1.447 ms 1.474 ms 1.571 ms
      4 XXX.XXX.XX.33 (XXX.XXX.XX.33) 1.402 ms 1.524 ms 1.588 ms
      5 XXX.XXX.XX.26 (XXX.XXX.XX.26) 1.940 ms 1.979 ms 2.076 ms
      6 XXX.XX.XX.190 (XXX.XX.XX.190) 0.520 ms 0.488 ms 0.463 ms
      7 XXX.XX.XX.197 (XXX.XX.XX.197) 2.082 ms 1.739 ms 1.859 ms
      8 es_master_1 (XX.XX.XXX.105) 0.522 ms 0.458 ms 0.497 ms
      ```

        Gliffy Diagrams

          Attachments

            Activity

              People

              • Assignee:
                logstash-dev Logstash Developers (Inactive)
                Reporter:
                gseng Gavin Seng
              • Votes:
                0 Vote for this issue
                Watchers:
                1 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: