Exposing TFTP Server as Kubernetes Service — Part 3

Darpan Malhotra
8 min read · May 26, 2022


Up to Part 2 of this series, we created an on-prem Kubernetes cluster with Calico CNI, deployed a TFTP server pod and exposed it as a ClusterIP service.
The TFTP server was accessible as a ClusterIP service to other pods in the cluster. However, the TFTP clients are outside the cluster. We know that a Kubernetes service of type=NodePort is one of the simplest ways to expose a network service offered by pods to clients outside the cluster.
By default, Kubernetes exposes NodePort services in the range 30000–32767. But the client devices running outside the Kubernetes cluster need the TFTP network service on its standard port 69. So, the service manifest file needs to be updated to set type: NodePort and manually specify nodePort: 69.
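
For reference, here is a minimal sketch of what the updated tftp-server-service.yaml could look like. The selector label (app: tftp-server) and the targetPort are assumptions carried over from the deployment in the earlier parts of this series:

# cat tftp-server-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: tftp-server
spec:
  type: NodePort
  selector:
    app: tftp-server      # assumed pod label from Part 2
  ports:
  - protocol: UDP
    port: 69
    targetPort: 69
    nodePort: 69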

Let us apply the manifest.

# kubectl apply -f tftp-server-service.yaml 
The Service “tftp-server” is invalid: spec.ports[0].nodePort: Invalid value: 69: provided port is not in the valid range. The range of valid ports is 30000–32767

Oops… the default port range for NodePort services is 30000–32767. So, we need to adjust this range to accommodate port 69. As this cluster was bootstrapped with kubeadm, the static pod manifests are available at /etc/kubernetes/manifests. Modify kube-apiserver.yaml and add the flag --service-node-port-range=20-2000.
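
After editing, the flag should appear as one more argument in the kube-apiserver command list (a quick check; the surrounding arguments will differ per cluster):

# grep service-node-port-range /etc/kubernetes/manifests/kube-apiserver.yaml
    - --service-node-port-range=20-2000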

Wait for kube-apiserver to be automatically re-deployed by the kubelet. Now kube-apiserver runs with the updated configuration, and we can apply the TFTP service manifest file again.
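
As an optional sanity check, the restarted kube-apiserver's command line should show the new flag (an illustrative check, run on the control-plane node):

# ps -ef | grep kube-apiserver | grep -o 'service-node-port-range=[0-9-]*'
service-node-port-range=20-2000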

# kubectl apply -f tftp-server-service.yaml 
service/tftp-server configured

The service is configured successfully; it is now time to access it from outside the cluster.

Let an external TFTP client (client-node-4: 10.10.100.197) connect to learn-k8s-2 (10.10.100.208) on port 69.

# tftp 10.10.100.208
tftp> get dummy.txt
Transfer timed out.

Duh… the file could not be transferred and the TFTP operation timed out. This is a huge problem (and the most important focus area of this series). The TFTP clients are external to the Kubernetes cluster and the file cannot be transferred to them. What could be the problem here? The client is timing out, which means it is not receiving any response from the TFTP server. Is the TFTP server pod even receiving the request from the external client?
To answer these questions, the network engineer in me wants to inspect the packets and analyze what is going on. As discussed in Part 2, we have the following options of interfaces to capture packets at:
A. Interface of server pod (eth0)
B. Other side of veth pair (cali*)
C. Tunnel interface of node (tunl0)
D. Physical ethernet interface of nodes (ens160)

On the client side, only option D is valid. On the server side, we will capture packets as per options B, C and D; the capture commands are sketched below.
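
The captures referenced below appear to have been taken with tcpdump (the "[ 1 packet captured ]" notes match its summary line). A sketch of possible commands, filtering on TFTP's well-known UDP port 69 on the physical interface and capturing all UDP and ICMP traffic on the tunnel and pod-facing interfaces:

# tcpdump -ni ens160 udp port 69 or icmp
# tcpdump -ni tunl0 udp or icmp
# tcpdump -ni calic0bf1043683 udp or icmp

The first command is run on both client-node-4 and learn-k8s-2; the other two only on learn-k8s-2.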

A. Packets on client node (client-node-4) at ethernet interface of node (ens160)

Observations:

  • Client (10.10.100.197) sends an RRQ to the worker node running the server pod (10.10.100.208), but never gets a response [ 1 packet captured ].

B. Packets on server node (learn-k8s-2) at ethernet interface of node (ens160)

Observations:

  • The worker node running the TFTP pod (10.10.100.208) receives the RRQ from the client (10.10.100.197), but never sends a response [ 1 packet captured ].

C. Packets on server node (learn-k8s-2) at tunnel interface (tunl0)

Observations:

  • The tunnel interface does not see any packets at all [ 0 packets captured ].
    This means the tunnel interface has no role to play when the packet directly reaches the node running the target pod. But what if the client had sent the packet to a node which does not have the server pod running (i.e. learn-k8s-3)? Maybe the tunnel interface has a role in that case. We will explore this traffic path in the next article of this series.

D. Packets on server node (learn-k8s-2) at veth interface (calic0bf1043683)

Observations:

  • The server pod receives an RRQ, with source IP=10.10.100.208 and destination IP=192.168.29.67.
  • The actual packet had source IP=10.10.100.197 and destination IP=10.10.100.208. Clearly, on the way to the server pod, both SNAT and DNAT got applied to the client’s actual packet.
  • Who applied these NAT transformations? Of course, iptables !!! We will shortly analyze the iptables rules on the node running the TFTP server pod (learn-k8s-2). The quick summary is:
  1. DNAT happens in the PREROUTING chain of the nat table.
  2. DNAT has translated the destination IP address to 192.168.29.67, i.e. the packet gets routed to the TFTP server pod (well, that’s why calic0bf1043683 is even seeing this packet).
  3. SNAT happens in the POSTROUTING chain of the nat table.
  4. SNAT has translated the source IP address to 10.10.100.208, i.e. the worker node’s IP address.
  • Overall, the connection tuple in the original packet is 10.10.100.197:58358 → 10.10.100.208:69. After NAT, it is changed to 10.10.100.208:51770 → 192.168.29.67:69.
  • The TFTP server pod has received the RRQ and sent a data packet in response. The response is 192.168.29.67:60932 → 10.10.100.208:51770.
  • But this response packet is not seen going out from the ens160 interface of the server node to the actual client (10.10.100.197).
  • An ICMP message from the server node to the pod is seen, reporting an error: Destination unreachable (Port unreachable).

As promised above, we will now analyze the iptables rules on the server node (learn-k8s-2) to understand how the packet gets mangled by NAT.

A. Every incoming packet goes through the PREROUTING chain. Kubernetes makes use of the PREROUTING chain in the nat table to implement its services.

Every incoming packet will match the rule to jump to the KUBE-SERVICES chain.
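
To follow along on learn-k8s-2, the chains can be listed straight from the nat table. The same pattern works for every chain we walk through below (KUBE-SERVICES, KUBE-NODEPORTS, KUBE-EXT-*, KUBE-SVC-*, KUBE-SEP-* and the POSTROUTING chains):

# iptables -t nat -L PREROUTING -n --line-numbers
# iptables -t nat -L KUBE-SERVICES -n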

B. KUBE-SERVICES chain is the top-level collection of all Kubernetes services. There is a single chain for all NodePort services (KUBE-NODEPORTS), and the jump to it is the last rule.

The client has generated a UDP packet destined to 10.10.100.208:69. None of the KUBE-SVC-* rules match. The last rule matches, and hence the packet will jump to the KUBE-NODEPORTS chain.

C. KUBE-NODEPORTS chain is a collection of all NodePort services.

As we have only one NodePort service defined in this Kubernetes cluster, we see only one rule. And this rule matches the packet coming from the client (protocol = UDP, destination port = 69). So, the packet jumps to KUBE-EXT-HJOS6SHZL66STTLG.
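
In iptables-save format, this NodePort plumbing looks roughly like the listing below (an illustrative reconstruction; the default/ namespace in the comment is an assumption, and comment text varies across kube-proxy versions). It also shows why the jump to KUBE-NODEPORTS is the last rule of KUBE-SERVICES, as noted in step B:

# iptables-save -t nat | grep KUBE-NODEPORTS
:KUBE-NODEPORTS - [0:0]
-A KUBE-SERVICES -m comment --comment "kubernetes service nodeports; NOTE: this must be the last rule in this chain" -m addrtype --dst-type LOCAL -j KUBE-NODEPORTS
-A KUBE-NODEPORTS -p udp -m comment --comment "default/tftp-server" -m udp --dport 69 -j KUBE-EXT-HJOS6SHZL66STTLG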

D. In the KUBE-EXT-HJOS6SHZL66STTLG chain, there are two rules, and the packet matches both.

As per the first rule, the packet is MARKed.

Then, the packet jumps to the KUBE-SVC-HJOS6SHZL66STTLG chain.

E. Every KUBE-SVC-* chain has a collection of relevant Service Endpoints (KUBE-SEP-*). This is where the actual load-balancing across KUBE-SEP-* chains happens.
As we have only one pod (i.e. only one Service Endpoint), the KUBE-SVC-HJOS6SHZL66STTLG chain has only one KUBE-SEP-* entry. If there were more pods, we would have seen more KUBE-SEP-* entries in this chain, with a target selected randomly on the basis of the statistic match and its probability, as illustrated below.

As the source IP of the packet from the client does not belong to 192.168.0.0/16, it will match the first rule, and also the second rule, i.e. the jump to KUBE-SEP-XMDC2IVYMAURXD3K.
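
For illustration, if this service had two endpoints, kube-proxy would typically spread traffic across them with the iptables statistic match, roughly like this (the second KUBE-SEP-* chain name below is purely hypothetical):

-A KUBE-SVC-HJOS6SHZL66STTLG -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-XMDC2IVYMAURXD3K
-A KUBE-SVC-HJOS6SHZL66STTLG -j KUBE-SEP-HYPOTHETICAL2ND

The first endpoint is picked with 50% probability; otherwise the packet falls through to the last rule.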

F. Each KUBE-SEP-* represents the actual Service Endpoint ( # kubectl get ep). This is where DNAT happens.

Here, the incoming packet is DNATed to 192.168.29.67:69. This means, the incoming packet needs to be FORWARDED to 192.168.29.67.
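
In iptables-save format, a KUBE-SEP-* chain for a UDP endpoint is roughly of this shape (illustrative; the default/tftp-server comment assumes the default namespace):

-A KUBE-SEP-XMDC2IVYMAURXD3K -s 192.168.29.67/32 -m comment --comment "default/tftp-server" -j KUBE-MARK-MASQ
-A KUBE-SEP-XMDC2IVYMAURXD3K -p udp -m comment --comment "default/tftp-server" -m udp -j DNAT --to-destination 192.168.29.67:69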

G. We saw the packet getting SNATed too. So, let us analyze the POSTROUTING rules in the nat table.

Every packet matches the 3rd rule and jumps to the KUBE-POSTROUTING chain.

H. KUBE-POSTROUTING will apply SNAT on the forwarded packet. Note that MASQUERADE is a special form of SNAT where the source IP is dynamically picked from the outgoing interface.
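
In recent kube-proxy versions, the KUBE-POSTROUTING chain looks roughly like the sketch below: packets that do not carry the 0x4000 mark (set earlier when the packet was MARKed in step D) are returned untouched, and marked packets are MASQUERADEd (illustrative; exact rules vary by version):

-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE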

Overall, kube-proxy adds rules to iptables to DNAT the traffic for NodeIP:NodePort (10.10.100.208:69) to PodIP:TargetPort (192.168.29.67:69). iptables rules to SNAT the traffic are also added.
This explains the journey of the incoming packet as it passes through iptables on the node.

Let us get back to the fact that we are seeing an ICMP error message after the TFTP server pod sent its first response packet. Thanks to the creators of ICMP, the error message tells me that something is off with the ports used in the response packet. We already know that a TFTP server does not use port 69 as the source port in the response packet; in this case, it is 60932. One can imagine that a very basic implementation of NAT would create a translation entry in a table (database) when the first packet comes in, and use the same entry to do reverse NAT when the response packet goes out, i.e. track the connection. In the case of TFTP and our example packet capture, the two flows are:
INCOMING: 10.10.100.208:51770 → 192.168.29.67:69
OUTGOING: 192.168.29.67:60932 → 10.10.100.208:51770
Given that the request is going to port 69 and the response is coming from port 60932, the NAT module of the Linux kernel is unable to establish that these two flows are related. They are treated as two independent, unrelated flows. TFTP appears not to be a NAT-friendly protocol. Hence, no response packet is seen by the client (10.10.100.197).
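
This can be observed directly in the kernel’s connection tracking table (requires the conntrack tool). The entry below is an illustrative reconstruction using the addresses and ports from the capture above; note the [UNREPLIED] state and the expected reply source port 69:

# conntrack -L -p udp --orig-port-dst 69
udp      17 27 src=10.10.100.197 dst=10.10.100.208 sport=58358 dport=69 [UNREPLIED] src=192.168.29.67 dst=10.10.100.208 sport=69 dport=51770 mark=0 use=1

The pod’s actual reply comes from port 60932, so it never matches the expected reply tuple, reverse NAT is never applied, and the client never sees a response.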

(Do make a note of the phrases used above: “entry of translation in a table”, “track the connection” and “two flows are related”. We will revisit them in the next articles.)

Overall, the situation is that NodePort services are built upon NAT (at least when kube-proxy is in use), and TFTP does not seem to work properly with NAT.
If you read any literature on NodePort, you will be convinced that it just works. All you need to do is provide the transport layer protocol (TCP/UDP) and port number in the service definition, and things will work. Unfortunately, things did not work for me. This ruined my plan of running the TFTP server as a containerized application in Kubernetes, accessible to external clients. So, the challenge is that the TFTP server pod cannot be exposed as a NodePort service… Challenge Accepted !

To resolve the problem, we need to build a deeper understanding of iptables and operations like NAT performed by it. Enter the netfilter framework !
We will discuss netfilter in the next article of this series.
