fluentd out_forward plugin is stuck longer than expected

Hao Gao
Published in Hadoop Noob
4 min read · Feb 18, 2019

Recently I have been working on a log ingestion project in which we use fluentd to collect and stream logs to Kinesis for further data processing. We use a very standard fluentd cluster setup, described here. I will not debate whether the setup is optimal or not; I just want to give some context.

In this setup, the fluentd instances on the nodes that produce logs act as log forwarders. Aggregators centralize the logs and then push them to other places, e.g. Kinesis.

Our log forwarders are still on version 0.12. Each forwarder uses the out_forward plugin to forward the data. The configuration looks something like this:

<match pattern>
  @type forward
  expire_dns_cache 0
  dns_round_robin true
  heartbeat_type tcp
  heartbeat_interval 1
  <server>
    host my-log-aggregator.example.com
    port 24224
  </server>
</match>

As you can see, we set dns_round_robin to true because we put several log aggregators behind the DNS name. We want each aggregator behind the DNS to get a fair chance to ingest the data; in other words, we get client-side load balancing. If you are familiar with fluentd, you know we can also achieve client-side load balancing through multiple <server> </server> sections. We didn't do that because we have hundreds of instances: if we ever want to add or remove log aggregators in the aggregator cluster, we would need to change the config on all the log forwarders and then restart them. By using dns_round_robin, we only need to update the DNS record. That is also why we set expire_dns_cache to 0, so we pick up DNS changes once the TTL expires.
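
For reference, the multiple <server> alternative we decided against would look roughly like this (the aggregator host names here are made up):

<match pattern>
  @type forward
  heartbeat_type tcp
  heartbeat_interval 1
  <server>
    host aggregator-1.example.com
    port 24224
  </server>
  <server>
    host aggregator-2.example.com
    port 24224
  </server>
</match>

Every time the aggregator list changes, this block has to be edited and fluentd restarted on every forwarder, which is exactly what we wanted to avoid.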

For load balancing, this setup works well. We do see traffic from the forwarders split evenly across all aggregators.

We also noticed one issue. Let's say we have 20 aggregators sitting behind the DNS name. I stop one of them but do not remove it from the DNS, so there are still 20 IP addresses behind the DNS but only 19 of them have an aggregator listening. I would expect a 19/20 chance of hitting a good IP, so data on the log forwarder should keep moving, with performance only slightly impacted. But in reality, we found that right after we stop one aggregator, the log forwarders start having trouble sending data, and they don't seem to recover until we restart the stopped aggregator.

To debug this, I ssh'd into one log forwarder and used strace to monitor the network activity:

strace -p $PID -e trace=network -f

The output shows all the system calls related to networking that the process makes.

If you are familiar with fluentd, you probably know the out_forward plugin uses two threads: one for the heartbeat, and another that actually sends data to the destination. How can you tell which thread is which? I usually just check them one by one; each thread follows a different pattern. For example, the heartbeat thread runs every second, resolving the DNS name first and then pinging one of the IPs behind it. If you change heartbeat_interval to 5, it should run every 5 seconds. For the thread that actually sends the data, if you remove the -e trace=network flag, you can observe the buffer data path in the strace output.

Examining both threads, strace shows they get stuck for nearly 2 minutes, move a little bit, and then get stuck for another 2 minutes. I figured out that the stuck system call is connect (just man connect on your laptop). After reading the man page, I figured there had to be a way to control this timeout. Great, let's find out where we can pass the timeout value.

After researching the out_forward plugin source code a little bit, I think the issue is related to this piece of code:

def connect(node)
  # TODO unix socket?
  TCPSocket.new(node.resolved_host, node.port)
end

The above method is called every time the out_forward plugin sends a heartbeat or forwards data, because no matter whether it is a heartbeat or a data transfer, it always starts by initializing a new socket. So when it tries to establish the TCP connection, the connect system call (not the connect method above) gets called, and that is where it waits for 2 minutes.

If we look at TCPSocket.new, we cannot see where to set a timeout, and there is nothing about the default timeout value either. If we open irb and type:

require 'socket'
TCPSocket.new(INVALID_IP, 8888)

Make sure INVALID_IP is a non-existent (unreachable) IP address. You may see a different timeout than my 2 minutes; I think it depends on the OS settings, although I didn't figure out where exactly it is configured in the OS. Let me know in the comments if you know.

It seems we cannot use the TCPSocket API, so we need something else to bypass it. Luckily, since we are on Ruby 2.0 or above, there is a newer API, Socket.tcp, which can actually accept a timeout.
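
For example, in irb (again with INVALID_IP standing in for an unreachable address), something like this should fail fast with Errno::ETIMEDOUT instead of hanging for minutes:

require 'socket'
# connect_timeout is in seconds; when it expires, Socket.tcp raises Errno::ETIMEDOUT
Socket.tcp(INVALID_IP, 8888, connect_timeout: 1)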

So here is what I got; it is very simple. I tested the code change in a scenario where I set the timeout to 1 second and put 1 bad IP out of 2 IPs in total behind the DNS. out_forward kept working well, with only a slight performance decrease.
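
A minimal sketch of what the patched connect method can look like (connect_timeout here is an assumed setting that would be wired up to a plugin config parameter):

def connect(node)
  # TODO unix socket?
  # Socket.tcp takes a connect_timeout in seconds and raises Errno::ETIMEDOUT
  # instead of blocking for minutes on an unreachable host.
  # @connect_timeout is a hypothetical setting, e.g. 1 second.
  Socket.tcp(node.resolved_host, node.port, connect_timeout: @connect_timeout)
end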
