Exposing TFTP Server as Kubernetes Service — Part 6

Darpan Malhotra
May 26, 2022


In Part 5, we saw the impact of the conntrack and NAT TFTP helper modules and how they helped in exposing TFTP as a NodePort service. Now that the TFTP server pod is functionally exposed as a NodePort service, it is time to consider its performance.

Note that the TFTP server is running as a container on a Linux machine, where Kubernetes networking has introduced iptables rules that make use of NAT. This means every packet is inspected (or tracked) and mangled, so the network stack can slow down performance. This article discusses the performance numbers measured for the containerized TFTP server exposed as a NodePort service.
(Of course, a tool/framework is required to measure TFTP server performance; its details are omitted as that is not the focus of this series of articles.)
This article walks through the iterative journey I went through, striving for improved performance of the TFTP server pod exposed as a NodePort service.

Iteration 1:
As a baseline, we measure the performance of the TFTP server as it ran before being containerized and deployed to Kubernetes.
Setup:

  • TFTP server runs as a Linux application.
  • Performance measurement tool sends TFTP traffic to the IP address of the Linux VM where TFTP server application runs.

Test: Measure the performance of TFTP server to transfer a file of size 5KB.
Result: 8500 file transfers per second were served.

Iteration 2:
Setup:

  • TFTP server runs as a pod on a Linux VM. The pod is exposed as a NodePort service.
  • Performance measurement tool sends TFTP traffic to the IP address of the Linux VM where TFTP server pod runs.

Test: Measure the performance of TFTP server to transfer a file of size 5KB.
Result: 1000 file transfers per second were served.

Phew!! Just 1000 file transfers/second. I was extremely disappointed to see this significant drop from 8500 to 1000 file transfers per second. I couldn't sleep that night.

Iteration 3:
As a single TFTP pod was giving very poor performance, the next day I decided to run more replicas of the pod on the same node. The hope was that if the number of pods increased, the performance would get better.
Setup:

  • Three TFTP server pods running on the same Linux VM. These pods are exposed as a NodePort service.
  • Performance measurement tool sends TFTP traffic to the IP address of the Linux VM where TFTP server pods run.

Test: Measure the performance of TFTP server to transfer a file of size 5KB.
Result: 1000 file transfers per second were served.

Yikes!! The performance number was still the same: 1000. Running multiple pods showed no improvement in performance. I couldn't sleep that night either... two sleepless nights in a row.

Iteration 4:
I knew it was Kubernetes networking (by now we know it is actually Linux networking) that slowed down the performance. But I had no idea where or what to investigate. Thinking deeper: if conntrack and NAT paved the way for exposing the TFTP server as a NodePort service, maybe that is also where the performance roadblocks are.
So far, we have only been listing entries in the conntrack and expectation tables. The investigation began the next day with these questions in mind:

  1. Are these tables bounded, i.e., what is the size of each of these tables? Put differently, how many connections can netfilter track? I could not think of an answer. I could only guess that conntrack entries live in memory (RAM), so there should be a limit on each table.
  2. What if the kernel receives more connections than can be tracked?
    Possible answers: either drop those connections, or remove existing entries to make room for new ones. We have seen ASSURED flags on some conntrack entries, so removing existing entries would be catastrophic, especially if an entry corresponds to a TCP connection or a UDP stream. So, my guess is that any connection received after the conntrack tables hit their limit gets dropped.
  3. Why is increasing the number of pods not improving the performance? My guess is that something on the traffic-receiving side of the node is indifferent to the number of backend pods actually processing that traffic.

To find the correct answers to these questions (and not be influenced by my guesses), we have to further analyze the conntrack module of the Netfilter framework in the Linux kernel.
Getting back to reading mode, I found this piece of documentation, which describes the tuning parameters of the conntrack module. All the netfilter-related kernel parameters can be listed as shown below.
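One way to do this is to grep the full sysctl output (any equivalent invocation works just as well):

# sysctl -a 2>/dev/null | grep nf_conntrack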

As per my understanding of the document, the following netfilter-related kernel parameters are of interest:
nf_conntrack_helper
nf_conntrack_buckets
nf_conntrack_max
nf_conntrack_expect_max
nf_conntrack_count
nf_conntrack_udp_timeout
nf_conntrack_udp_timeout_stream

This document answers the first question: the conntrack tables are bounded, and their sizes are controlled by nf_conntrack_max and nf_conntrack_expect_max.

nf_conntrack_max: Size of the connection tracking table.
nf_conntrack_expect_max: Maximum size of the expectation table.

As I was sending a lot of traffic to measure performance, nf_conntrack_max becomes very critical. Also, with the tftp helper in use, nf_conntrack_expect_max becomes equally critical.
One important thing to recall from the discussion of conntrack and expectation table entries in Part 5 is that, for establishing one TFTP file transfer session:

  • There are four events in the conntrack table, resulting in two entries in the conntrack table.
  • There are two events in the expectation table, resulting in one entry in the expectation table.

Basically, the TFTP protocol bloats conntrack compared to protocols like DNS (UDP) or HTTP (TCP).
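To see these events for yourself while a single file transfer runs, conntrack-tools can follow both tables live (purely optional, but a quick way to reproduce the counts above):

# conntrack -E
# conntrack -E expect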

In this iteration, I repeated the test run in Iteration 2, just to observe the impact of the default values of the netfilter-related kernel parameters. So, the setup and results remain the same as in Iteration 2.

Let us examine the current values of these parameters on the worker node (learn-k8s-2):
nf_conntrack_helper = 1
nf_conntrack_buckets = 65536
nf_conntrack_max = 131072
nf_conntrack_expect_max = 1024
nf_conntrack_count = 54
nf_conntrack_udp_timeout = 30
nf_conntrack_udp_timeout_stream = 180

I started the performance test (which lasts 10 minutes) and observed the following while it was running:

A. Monitored the count of connection entries in conntrack table using the following command:

# watch -n 1 cat /proc/sys/net/netfilter/nf_conntrack_count

The count grew rapidly to approximately 131072 and remained around that number for the rest of the test.

B. Monitored the count of entries in the expectation table. For the conntrack table we used nf_conntrack_count, but there is no equivalent counter for the expectation table. So, while the test was running, I listed the actual entries in the expectation table; the last line of the output shows the number of entries.

# conntrack -L expect
..
..
conntrack v1.4.4 (conntrack-tools): 1024 expectations have been shown.

The count grew rapidly to 1024 and did not increase any further.
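One simple way to keep an eye on this continuously is to count the listed entries once per second (each expectation is printed on its own line, and the summary line goes to stderr, so it is filtered out here):

# watch -n 1 "conntrack -L expect 2>/dev/null | wc -l"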

C. The system logs had the following messages:

# tail -f /var/log/messages
..
..
May 20 07:50:28 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 20 07:50:28 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 20 07:50:28 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 20 07:50:28 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
..
..
May 20 07:50:38 learn-k8s-2 kernel: nf_conntrack: expectation table full
May 20 07:50:38 learn-k8s-2 kernel: nf_conntrack: expectation table full
May 20 07:50:38 learn-k8s-2 kernel: nf_conntrack: expectation table full

This aligns with the previous two observations (A and B). The kernel is clearly telling us that the conntrack and expectation tables are getting full and any additional packets are being dropped. That answers the second question.
Thinking deeper, even the third question is answered: if there is a limit on conntrack entries on the Linux machine, then the number of pods running on that machine does not matter. Only as many connections reach the pods as the conntrack limit (nf_conntrack_max) on the node allows. If conntrack has dropped the packets, none of the pods will receive them anyway.
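To quantify the drops beyond the log messages, the in-kernel conntrack statistics can be consulted; while the table is full, the drop-related counters keep climbing (the same counters are also exposed under /proc/net/stat/nf_conntrack):

# conntrack -S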

D. When the test stopped (i.e., no more incoming traffic), nf_conntrack_count did not immediately return to normal levels (approximately 50, the number before the test started).
This is because of the following parameters:
nf_conntrack_udp_timeout = 30
nf_conntrack_udp_timeout_stream = 180
In the previous article (Part 5), we already saw these timeouts showing up in the conntrack entries, but we did not know their source. Revisiting those events: the first connection gets a timeout of 30 seconds (nf_conntrack_udp_timeout), while the second connection, after its final update, gets a timeout of 180 seconds (nf_conntrack_udp_timeout_stream). The timeouts shown in the conntrack entries we listed were 2 seconds lower than these values, which simply means 2 seconds had passed since the entries were created.

Iteration 5:
In Iteration 4, we learnt the impact of the default values of the netfilter-related kernel parameters: the conntrack and expectation tables were getting full. The default limits of 128K (conntrack table) and 1K (expectation table) are too small, so the parameters that control the size of these tables need to be tuned. I simply increased the size of both tables.
Setup:

  • Same as in Iteration 2.
  • Use sysctl to set nf_conntrack_expect_max to 8K and nf_conntrack_max to 512K.
# vi /etc/sysctl.conf
..
..
net.netfilter.nf_conntrack_expect_max=8192
net.netfilter.nf_conntrack_max=524288
..
..
  • Load the sysctls from the /etc/sysctl.conf file.
# sysctl -p

Test: Measure the performance of TFTP server to transfer a file of size 5KB.
Result: 4000 file transfers per second were served.

Wow!! The TFTP server performance increased. That's great news. Finally, I could sleep that night. But there is bad news as well. While the test was running, the system logs had the following messages:

# tail -f /var/log/messages
..
..
May 21 08:23:44 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 21 08:23:44 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 21 08:23:44 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 21 08:23:44 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
..
..

This means the size of the expectation table is now sufficient, but the conntrack table is still getting full.

Iteration 6:
In Iteration 5, we realized the size of the conntrack table needs to be increased further.
Setup:

  • Same as in Iteration 2.
  • Use sysctl to set nf_conntrack_expect_max to 8K and nf_conntrack_max to 1M.
# vi /etc/sysctl.conf
..
..
net.netfilter.nf_conntrack_expect_max=8192
net.netfilter.nf_conntrack_max=1048576
..
..
  • Load the sysctls from the /etc/sysctl.conf file.
# sysctl -p

Test: Measure the performance of TFTP server to transfer a file of size 5KB.
Result: 4100 file transfers per second were served.

The performance did not change much, and the system logs kept raising the same complaint:

# tail -f /var/log/messages
..
..
May 22 08:35:10 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 22 08:35:10 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 22 08:35:10 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 22 08:35:10 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 22 08:35:10 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
..
..

1M is already a big number; I cannot simply keep increasing it to 2M in the next iteration. Note that every conntrack entry consumes RAM, so having millions of entries will eat up many megabytes of the system's memory.
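To get a feel for the actual memory cost, the conntrack slab cache can be inspected; the object count and object size columns give a rough idea of the RAM consumed by entries (the cache name may carry a suffix on older kernels):

# grep nf_conntrack /proc/slabinfo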

Iteration 7:
In Iteration 6, I realized that monotonically increasing the size of the conntrack table isn't a good approach. During the test, I could see conntrack entries sitting around for a long time. The default UDP timeouts (30s and 180s) are rather long when a single file transfer takes a few milliseconds.
In this iteration, I decided to reduce the timeouts of conntrack entries and bring the conntrack table size back down to 512K. That way, existing entries expire and make room for new ones.
Setup:

  • Same as in Iteration 2.
  • Use sysctl to set nf_conntrack_expect_max to 8K and nf_conntrack_max to 512K.
# vi /etc/sysctl.conf
..
..
net.netfilter.nf_conntrack_expect_max=8192
net.netfilter.nf_conntrack_max=524288
net.netfilter.nf_conntrack_udp_timeout=10
net.netfilter.nf_conntrack_udp_timeout_stream=20
..
..
  • Load the sysctls from the /etc/sysctl.conf file.
# sysctl -p

Test: Measure the performance of TFTP server to transfer a file of size 5KB.
Result: 4600 file transfers per second were served.

The performance increased further. While the test was running, the count of connection entries in the conntrack table of the node was being monitored (same command as used in Iteration 4).

# watch -n 1 cat /proc/sys/net/netfilter/nf_conntrack_count

Interestingly, the count never reached 512K. But, to my surprise, the system logs on the node again complained about the conntrack table being full:

# tail -f /var/log/messages
..
..
May 23 08:47:11 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 23 08:47:11 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 23 08:47:11 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
May 23 08:47:11 learn-k8s-2 kernel: nf_conntrack: table full, dropping packet
..
..

That's shocking. How could the table be full if nf_conntrack_count on the node never touched the limit (nf_conntrack_max)? In fact, it was well below the limit. Thinking deeper: the TFTP connections from external clients are seen (and tracked) not only by the node, but also by the TFTP server pod. The container uses a Red Hat UBI base image, so the pod must also be tracking connections. In past iterations, we have been configuring sysctls on the node. The question is: how are the sysctls of the node related to the sysctls of the pod?

To answer this question, we need to explore how these netfilter-related sysctls behave when pods are running on a node. I could find only one Kubernetes document that covers this topic well. Note that each pod has its own network namespace in the Linux kernel.

It turns out sysctls are categorized as:

  1. Namespaced v/s node-level (unnamespaced)
A number of sysctls are namespaced in today's Linux kernels, which means they can be set independently for each pod on a node.
Only namespaced sysctls are configurable via the pod securityContext within Kubernetes.
The parameters under net.* can generally be set in the container networking namespace.
However, there are exceptions (e.g., net.netfilter.nf_conntrack_max and net.netfilter.nf_conntrack_expect_max can be set in the container networking namespace, but they are unnamespaced).

  2. Safe v/s unsafe sysctls

All safe sysctls are enabled by default.
All unsafe sysctls are disabled by default and must be allowed manually by the cluster admin on a per-node basis.

My interest lies in the following 4 sysctls:
net.netfilter.nf_conntrack_expect_max
net.netfilter.nf_conntrack_max
net.netfilter.nf_conntrack_udp_timeout
net.netfilter.nf_conntrack_udp_timeout_stream

As per the document, all four are unsafe. So, if they need to be configured on the TFTP server pod, they have to be allowed (whitelisted) on the node.
Thankfully, the document also mentions that nf_conntrack_max and nf_conntrack_expect_max are unnamespaced. This means all pods (namespaces) share the same limits on conntrack entries. So, based on the values set on the node, the TFTP server pod should also see:
net.netfilter.nf_conntrack_expect_max=8192
net.netfilter.nf_conntrack_max=524288

Let us verify the sysctls on the pod:

# kubectl exec tftp-server-69d49ffb8b-44m45 -- cat /proc/sys/net/netfilter/nf_conntrack_expect_max
8192
# kubectl exec tftp-server-69d49ffb8b-44m45 -- cat /proc/sys/net/netfilter/nf_conntrack_max
524288
# kubectl exec tftp-server-69d49ffb8b-44m45 -- cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout
30
# kubectl exec tftp-server-69d49ffb8b-44m45 -- cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream
180

So, the pod sees the same values of nf_conntrack_expect_max and nf_conntrack_max as the node. But nf_conntrack_udp_timeout and nf_conntrack_udp_timeout_stream still have their default values (30s and 180s), which differ from the node (10s and 20s). Because the timeouts are high in the pod, old conntrack entries in the pod do not expire fast enough to make room for new entries. This means these timeouts need to be adjusted on the pod as well.

Iteration 8:
After Iteration 7, it is clear that increasing the size of the conntrack tables and decreasing the UDP connection timeouts on the node alone is not sufficient. The UDP timeout sysctls need to be adjusted for the pod as well.

As both timeout-related sysctls are unsafe, per the Kubernetes documentation, we have to manually allow them on the node. The Kubernetes agent that runs on the node is kubelet, and its config file is located at /var/lib/kubelet/config.yaml. The KubeletConfiguration object is documented here. The documentation says:

allowedUnsafeSysctls []string — A comma separated whitelist of unsafe sysctls or sysctl patterns

Setup:

  • Modify the KubeletConfiguration object defined in kubelet's config file to allow the two unsafe sysctls (a sketch of both changes follows this list).
  • The config change requires restarting the kubelet service.
# systemctl restart kubelet.service
  • Configure securityContext in the PodSpec to set these sysctls.
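A minimal sketch of what both changes can look like (fragments only; the values mirror the node-level settings from Iteration 7, and everything else in the files stays unchanged):

/var/lib/kubelet/config.yaml (fragment):
allowedUnsafeSysctls:
- net.netfilter.nf_conntrack_udp_timeout
- net.netfilter.nf_conntrack_udp_timeout_stream

Pod template spec in tftp-server-deployment.yaml (fragment):
spec:
  securityContext:
    sysctls:
    - name: net.netfilter.nf_conntrack_udp_timeout
      value: "10"
    - name: net.netfilter.nf_conntrack_udp_timeout_stream
      value: "20"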

Test: Measure the performance of TFTP server pod to transfer a file of size 5KB.
Result: 5100 file transfers per second were served. I also verified the sysctls on the pod:

# kubectl exec tftp-server-5576d84bd9-5g5jx -- cat /proc/sys/net/netfilter/nf_conntrack_expect_max
8192
# kubectl exec tftp-server-5576d84bd9-5g5jx -- cat /proc/sys/net/netfilter/nf_conntrack_max
524288
# kubectl exec tftp-server-5576d84bd9-5g5jx -- cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout
10
# kubectl exec tftp-server-5576d84bd9-5g5jx -- cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout_stream
20

This is a big improvement over the level where we started (Iteration 2). Finally, in this iteration, the system logs did not complain about the conntrack table getting full. While the test was running, the count of connection entries in the conntrack table was monitored on both the node and the pod.

  • To monitor the count on the node, the same command as shown in Iteration 4 was used.
# watch -n 1 cat /proc/sys/net/netfilter/nf_conntrack_count
  • To monitor the count on the pod, the following command was used.
# watch -n 1 "kubectl exec -it tftp-server-5576d84bd9-5g5jx -- cat /proc/sys/net/netfilter/nf_conntrack_count"

Interestingly, the count peaked at around 200K. This means we could even reduce nf_conntrack_max from 512K to 256K and save some RAM.

Iteration 9:
This is the same test as run in Iteration 8, but with an increased number of TFTP pods.
Setup:

  • Update the replica count in the TFTP server Deployment manifest (tftp-server-deployment.yaml) to 2, as sketched below, and create the pods. Both pods run on the same node (learn-k8s-2).
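The manifest change itself is just the replica count; alternatively, kubectl scale achieves the same thing (assuming the Deployment is named tftp-server, as the pod names suggest):

spec:
  replicas: 2

# kubectl scale deployment tftp-server --replicas=2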

Test: Measure the performance of TFTP server pod to transfer a file of size 5KB.
Result: 5800 file transfers per second were served.

In this article, we learnt that conntrack has to be tuned for running high-performance applications in Kubernetes. As TFTP is a UDP-based protocol, we tuned the UDP timeouts. Should your application use TCP, tune the various TCP timeout-related sysctls instead.
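For TCP, the per-state conntrack timeouts (e.g., nf_conntrack_tcp_timeout_established, nf_conntrack_tcp_timeout_time_wait) are the knobs to look at; they can be listed the same way:

# sysctl -a 2>/dev/null | grep nf_conntrack_tcp_timeout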

We now have a performant TFTP service exposed as a NodePort service. In the next article, we will expose the TFTP server pod as a LoadBalancer service in an on-prem environment.
