Details
- Type: Bug/Feature
- Status: Resolved
- Resolution: Fixed
- Affects Version/s: 1.4.0
- Fix Version/s: 1.4.0
- Labels:
Description
In our production environment, I've observed that we are unable to push above 600 msg/sec through the UDP input. After reviewing the code in depth, I think we have two levels of bottleneck.
The first level is that on a single listener port, we only have one thread pulling data off the socket queue (versus, say, TCP, where you can easily spin up a thread per connection, or have worker threads like some of the other connectors). I considered whether this is a real bottleneck, since the jRuby socket code is pretty efficient. Jordan Sissel has experiments suggesting the TCP socket code is capable of >100k msg/sec on a single connection, so it follows that the UDP socket should be similar. So I could write a new socket connection handler, but I'm not sure that's really our bottleneck. And if it is, we can use multiple sockets to effectively "multi-thread", and that's far easier than writing nonblocking socket code.
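The multiple-sockets workaround mentioned above could look something like this minimal sketch in plain Ruby (not the actual plugin code; the ports and the `RECEIVED` hand-off queue are illustrative assumptions):

```ruby
require "socket"

# Illustrative hand-off point standing in for downstream processing.
RECEIVED = Queue.new

# One reader thread per UDP socket, so receive work is spread across
# threads instead of a single listener thread on one port.
def start_udp_readers(ports)
  ports.map do |port|
    sock = UDPSocket.new
    sock.bind("127.0.0.1", port)
    Thread.new do
      loop do
        payload, _addr = sock.recvfrom(8192) # block until a datagram arrives
        RECEIVED << payload                  # hand off for processing
      end
    end
  end
end
```

Senders would then need to spread traffic across the ports themselves, which is the main cost of this approach compared with true multi-threading behind one socket.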
The second level, I believe, is that every single UDP datagram that gets pulled off the socket buffer is getting decoded, validated as UTF-8, converted to UTF-8 as needed, transformed into an event, and then pushed into the plugin output_queue for filter processing. This is all done serially, on a single thread, inline with the datagram being pulled off the buffer. I think we're hitting this hard because our delimiters are not UTF-8, but I'm betting this is typical, with lots of users having Windows-1252 or ISO-8859-1 or whatever is spewed out. Other users are reporting ~500/sec rates…
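A rough sketch of the per-datagram work described above, in plain Ruby (the `cook_datagram` name, the Windows-1252 fallback encoding, and the hash stand-in for an event are all assumptions, not the plugin's actual code):

```ruby
# Validate each payload as UTF-8 and transcode it when it is not,
# inline on the receiving thread. Windows-1252 is an assumed source
# encoding; the real input would make this configurable.
def cook_datagram(raw)
  text = raw.dup.force_encoding(Encoding::UTF_8)
  unless text.valid_encoding?
    text = raw.dup.force_encoding(Encoding::Windows_1252)
               .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
  { "message" => text } # stand-in for building a Logstash event
end
```

Even when the payload is already valid UTF-8, the validation pass alone is per-byte work done serially per datagram, which is why this step scales poorly at high message rates.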
So, what did I change? I added an input queue so that the main UDP plugin thread can just pull data off the socket queue as fast as possible, and I set up a worker thread pool, with the number of workers configurable in the config file. These threads pull from the input queue, do all the decode/convert/transform work, and then output the cooked events.
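The shape of that change can be sketched as follows in plain Ruby (class and method names here are illustrative, not the plugin's API; the cook step is passed in as a block):

```ruby
# Sketch: the socket reader thread only enqueues raw payloads, and a
# configurable pool of workers does the decode/transform work.
class UdpWorkerPool
  def initialize(workers: 2, queue_size: 2000, &cook)
    @input_queue  = SizedQueue.new(queue_size) # bounded, applies backpressure
    @output_queue = Queue.new
    @threads = workers.times.map do
      Thread.new do
        while (raw = @input_queue.pop)         # nil sentinel ends the loop
          @output_queue << cook.call(raw)      # decode/convert/transform here
        end
      end
    end
  end

  # Called by the socket reader thread; blocks only when the queue is full.
  def push(raw)
    @input_queue << raw
  end

  def pop_event
    @output_queue.pop
  end

  def shutdown
    @threads.size.times { @input_queue << nil } # one sentinel per worker
    @threads.each(&:join)
  end
end
```

The `SizedQueue` matters here: a bounded input queue pushes back on the reader when the workers fall behind, rather than growing without limit.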
In prod, I was able to eliminate data loss and handle at least 2x the traffic we’re processing, using two workers and any buffer size >2,000.
I’ve been trying to do some pure benchmarking, but so far I can’t seem to replicate the ~600 msg/sec bottleneck we saw in prod. I’m easily able to do 9k+ msg/sec in test, with or without the new code.