Fluent Bit-Fluentd: TCP Story

Vishal Jain
Published in FluentD
5 min read · Oct 25, 2021

Yes, this is a pretty good use case: Fluent Bit is responsible for collecting the logs, and the Host in the Fluent Bit output (configuration file) is a DNS name pointing to an NLB (Network Load Balancer), which in turn points to a fleet of EC2 machines.

Let me share the pictorial view first!! 😄
And here you go!!😎

This diagram is designed for this particular use case only!!
Fig 1. The high-level flow of requirements

I don’t want to jump into the requirements directly, but rather start with the problems we face while setting up the infrastructure and configuration to achieve a highly available, scalable architecture with no data loss!!

General problems (for TCP) we can face while setting up the above flow include:

  1. My TCP connection with the Fluentd machine behaves differently (OMG!!)
  2. Why is there data loss? (😒😩)
  3. Am I sure about the TCP connections maintained at the Fluentd level? 🧐
  4. What is the exact or most efficient way to drop a TCP connection in either direction?
  5. How do I check everything about the TCP connections? 👨‍💻🕵️‍♂️🕵️‍♂️🕵️‍♂️

I came across all these questions while setting up the architecture and configuration at the Fluentd level. (You might have seen them too!! If not, you should at least start thinking about these points.)

Without any further ado, let me tell you that there are some best practices we should follow to avoid TCP problems and to answer the questions above.

Let’s go then!!! 🛹🛼🏄‍♂️🏄‍♂️🏄‍♂️🏄‍♂️🏄‍♂️🏄‍♂️

If I list those best practices, the following come to mind:

  1. The configuration file or code block at the Fluent Bit and Fluentd level
  2. Some server tweaks
  3. Some way to track the TCP calls between client and server
  4. Some key metrics to observe

The magic starts now!!!

Configuration File

Fluent Bit has an [OUTPUT] section in its conf which simply says: hey buddy, tell me what I should do with an incoming chunk. Shall I send it somewhere over TCP, or something else (drop it in S3, Kafka, etc.)?

[OUTPUT]
    Name          forward
    Match         *
    Host          127.0.0.1
    Port          24284
    Shared_Key    secret
    Self_Hostname flb.local
    tls           on
    tls.verify    off

We use forward as the output name because we send chunks directly to Fluentd and there is good interoperability between Fluent Bit and Fluentd; under the hood, the forward protocol talks to Fluentd over TCP.

The following are the key properties that impact the TCP connections (the nature of the TCP connection between client and server), so understand them closely!!

  • net.connect_timeout: Maximum time, expressed in seconds, to wait for a TCP connection to be established, including the TLS handshake. The default value is 10 seconds.
  • net.keepalive: Enable or disable connection keepalive support. Accepts a boolean value: on / off. The default value is on.
  • net.keepalive_idle_timeout: Maximum time, expressed in seconds, for an idle keepalive connection. The default value is 30 seconds.
  • net.tcp_keepalive: Enable or disable TCP keepalive support. Accepts a boolean value: on / off. The default value is off.
  • net.tcp_keepalive_time: Interval between the last data packet sent and the first TCP keepalive probe.
  • net.tcp_keepalive_interval: Interval between TCP keepalive probes when no response is received to a probe.
  • net.tcp_keepalive_probes: Number of unacknowledged probes before considering a connection dead.
  • net.keepalive_max_recycle: Maximum number of times a keepalive connection can be used before it is destroyed. The default value is 0.

Setting these properties intelligently can save resources and keep the TCP connections very smooth: the keepalive property decides whether a connection stays alive, and the idle timeout decides for how long. Under very high traffic, proper values for these config properties show great results!!

Now the NLB (Network Load Balancer) comes into the picture. BTW, the default idle timeout of an NLB is 350 seconds and it can’t be changed! So when setting up the keepalive behaviour of client and server, those values should always correspond to this idle timeout!!
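To make this concrete, here is a sketch of a Fluent Bit output with keepalive tuned safely below the NLB’s 350-second idle timeout (the host name and all values are illustrative, not prescriptive; tune them to your traffic):

```
[OUTPUT]
    Name                       forward
    Match                      *
    # logs-nlb.example.internal is a hypothetical NLB DNS name
    Host                       logs-nlb.example.internal
    Port                       24284
    net.connect_timeout        10
    net.keepalive              on
    net.keepalive_idle_timeout 30
    net.keepalive_max_recycle  2000
```

Keeping net.keepalive_idle_timeout well under 350 seconds means Fluent Bit retires idle connections before the NLB silently drops them.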

Fluentd has a source section to capture the incoming data from Fluent Bit via the load balancer, and it looks like the one below:

<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

The load balancer keeps calling the Fluentd machine to check its health, and in the case of an NLB, if the health check is set up on the same TCP port as the listener, it calls Fluentd directly, because the target group listens on the same port as the listener!!
Now, the default behaviour of Fluentd is to reset [RST] the connection whenever there is a [FIN] from the source of the TCP connection. That could be a good fit if you can trade off some data or packet loss!!

But in general, we don’t want to lose even a single chunk of data. To avoid this, we have the [linger_timeout] option, whose default value is zero: while it is zero, whenever there is a [FIN] request from the source, Fluentd will immediately close or reset [RST] the connection, and data or packets can be lost!! Set the [linger_timeout] value according to your TCP keepalive behaviour.

If you use an NLB, you can see the [RST] reset counts for targets in the NLB metrics; they often show a fairly high number precisely because of [linger_timeout]!
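As a sketch, a forward source with a non-zero linger_timeout could look like the following (the value of 5 seconds is illustrative, not prescriptive; align it with your keepalive settings):

```
<source>
  @type forward
  port 24224
  bind 0.0.0.0
  linger_timeout 5
</source>
```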

Server Tweaks

Definitely, the server’s default properties play an important role in keeping the TCP connections well set up. Please see the server config below when setting up Fluentd on the machine.

net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200

The TCP keepalive process waits two hours (7200 secs) of socket inactivity before sending the first keepalive probe, and then resends it every 75 seconds. As long as there is active TCP/IP socket communication going on, no keepalive packets are needed.
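A quick back-of-the-envelope check of what those defaults mean: a silently dead peer is only declared dead after the idle time plus all unanswered probes.

```shell
# Worst-case detection time for a dead peer with the defaults above:
# tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes
idle=7200     # seconds of inactivity before the first probe
intvl=75      # seconds between probes
probes=9      # unanswered probes before giving up
echo $(( idle + intvl * probes ))   # prints 7875 (about 2 hours 11 minutes)
```

That is far above the NLB’s 350-second idle timeout, which is exactly why tuning these values matters in this setup.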

Track The TCP Call

We have set up and followed the best practices, but still we should not believe it until we can see somewhere that TCP now works perfectly. For that, I say: use tcpdump.

We should capture a tcpdump on the server for some period and then observe those dumps using Wireshark or any other tool!!

To capture the dump and write it into a file, use the command below (filtering on the forward port keeps the capture small):

tcpdump -i any -w sampledumps.pcap 'port 24224'

Then import this dump file into Wireshark and understand the nature of the TCP connections. It looks like 👇

Fig 2. Sample tcpdump in Wireshark tool
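Once the capture is open in Wireshark, a few standard display filters help isolate the interesting packets (these are built-in Wireshark filters, shown here as examples):

```
tcp.flags.reset == 1          # only segments with the RST flag set
tcp.analysis.retransmission   # retransmitted segments (possible loss)
tcp.analysis.keep_alive       # TCP keepalive probes
```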

If something looks bad, revisit the config at both the application and server levels. Keep playing with the settings until you get the desired result: no packet loss and no connection dropped for a long period.

Metrics To Observe

We should focus on the metrics below while observing the TCP connections or the config of Fluentd and Fluent Bit.

  • [RST] reset counts from either side
  • Whether packets are lost or not; you can use the metrics endpoints exposed by Fluentd and Fluent Bit (https://docs.fluentd.org/monitoring-fluentd/monitoring-rest-api)
  • TCP keepalive timing
  • Chunk size allowed at both the Fluent Bit and Fluentd levels
  • Buffer behaviour (although it is a really solid part of Fluentd, it is still better to observe)
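The Fluentd metrics endpoint mentioned above comes from the monitor_agent input plugin; here is a sketch of enabling it (port 24220 is the conventional example port from the docs):

```
<source>
  @type monitor_agent
  bind 0.0.0.0
  port 24220
</source>
```

Then curl http://localhost:24220/api/plugins.json shows per-plugin stats such as buffer queue length and retry counts.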


Note: I always welcome improvements or suggestions!!😄😄😄
