Demystifying Kubernetes Services Packet Path

Abhishek Mitra
11 min read · Jul 22, 2020


In this article I will take a stab at demystifying the various K8s Service types and exactly how they function internally. We will take a closer look at the packet path from the point where traffic enters the K8s cluster. For the rest of the article I will assume the reader has some context on pod-to-pod networking; please see my earlier articles on that topic for more background.

I picked this topic because I find there is some confusion around it; after looking through some GitHub issues and actual tcpdump captures, I will attempt to clarify the packet flow. To start off, let's look at what a Kubernetes Service is by definition:

In Kubernetes, a Service is an abstraction which defines a logical set of Pods and a policy by which to access them.

The set of Pods is usually chosen by the configured selectors (Services can also be configured without selectors, but I will not be covering that here). These Pods become the backends for the Service and are referred to as endpoints. The corresponding Endpoints object is updated as Pods come and go, and kube-proxy watches it to keep the node-level rules in sync.

What happens when we access a Service on a cluster?

Fig 1: External User accessing service on a K8s cluster

Before diving into the packet flow let me list out some info regarding the infrastructure on which the K8s cluster is running.

  • It is a K8s cluster running on VMware with 1 master and 3 worker nodes
  • Host Network — 172.26.236.0/24
  • Pod Network (Calico) — 192.168.0.0/16
  • Service Network (service-cluster-ip-range) — 10.96.0.0/12

Since this is VMware infrastructure, there is no out-of-the-box load balancer support, so I am using MetalLB to simulate one. However, what I want to discuss is not how MetalLB manages the LoadBalancer, but how the constructs interact from the Kubernetes point of view.
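I am not covering MetalLB itself, but for reference a layer-2 setup could look roughly like the sketch below. Treat this as an assumption rather than the configuration actually used: the ConfigMap format shown is the one used by MetalLB releases of this era (newer releases use IPAddressPool custom resources instead), and the address range is only inferred from the external IP that appears later in the article.

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2          # announce the IPs via ARP on the host network
      addresses:
      - 172.26.236.180-172.26.236.190   # assumed pool within 172.26.236.0/24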

A representation of how the Kubernetes Service types depend on one another:
Kubernetes Service Relationship

The diagram shows how the Service types on K8s build on each other: a LoadBalancer Service is implemented on top of a NodePort Service, which in turn is built on a ClusterIP Service.

Having said that, we can redraw Fig1 as shown below.

Fig2: Fig1 redrawn to show more detail inside the cluster

The following is the distribution of nodes

Fig3. Node distribution

I have deployed the following services to understand the packet flow

Fig4: Pod distribution

The service in question, which the user is trying to access, is my-nginx (refer to Fig1):

my-nginx     LoadBalancer   10.106.189.218   172.26.236.180   80:31428/TCP   24h   run=my-nginx
Fig5: Pod distribution of Fig4

The diagram above illustrates what happens when the incoming request hits:

  1. Node1: all requests are forwarded to Node2 and Node3 (since the application Pods are not on Node1), and the traffic is distributed round-robin, 50-50. This is the default behaviour with iptables.
  2. Node2/Node3: the receiving node responds, again distributing in a round-robin fashion; the difference is that traffic is never sent to Node1, since kube-proxy keeps track of which nodes have backends.
root@k8s-master-176:~/ansible-data# kubectl get ep
NAME ENDPOINTS AGE
kubernetes 172.26.236.176:6443 25h
my-nginx 192.168.100.1:80,192.168.189.131:80 25h

root@k8s-master-176:~/ansible-data# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-nginx-65c68bbcdf-5vmch 1/1 Running 0 25h 192.168.189.131 k8s-worker-178 <none> <none>
my-nginx-65c68bbcdf-mzzsg 1/1 Running 0 25h 192.168.100.1 k8s-worker-179 <none> <none>
root@k8s-master-176:~/ansible-data#
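One quick way to observe this distribution in practice is to fire a few requests at the external IP and then check the nginx access logs of each Pod. This is a rough sketch: the Pod names are taken from the output above, and it assumes the official nginx image, which links its access log to stdout so kubectl logs shows it.

# send a handful of requests to the external (LoadBalancer) IP
for i in $(seq 1 10); do curl -s -o /dev/null http://172.26.236.180/; done

# see which pod served them; note that with the default externalTrafficPolicy
# (Cluster) the client address in the log is a node/tunnel IP, not the real
# client IP (more on this later)
kubectl logs my-nginx-65c68bbcdf-5vmch | tail
kubectl logs my-nginx-65c68bbcdf-mzzsg | tail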

However, we want to know the “how” first, and then we can focus on the “why”!

Enter Kube-Proxy …

  • Each node runs a kube-proxy process (deployed as a DaemonSet).
  • kube-proxy watches the Kubernetes control plane for the addition and removal of Service and Endpoint objects.
  • For each Service, it installs iptables rules, which capture traffic to the Service’s clusterIP and port, and redirect that traffic to one of the Service's backend sets.
  • For each Endpoint object, it installs iptables rules which select a backend Pod.
  • It maintains network rules on every node and allows network communication to your Pods from network sessions inside or outside of your cluster.

Needless to say the kube-proxy plays a crucial role in implementing the concept of a service in Kubernetes.

Kube-Proxy Modes of Operation

  • User space : This mode gets its name because the service routing takes place in kube-proxy in the user process space instead of in the kernel network stack. It is not commonly used as it is slow and outdated.
  • Iptables : This mode uses Linux kernel-level Netfilter rules to configure all routing for Kubernetes Services. This mode is the default for kube-proxy on most platforms. When load balancing for multiple backend pods, it uses unweighted round-robin scheduling. This is the mode we will be looking at.
  • IPVS(IP Virtual Server): Built on the Netfilter framework, IPVS implements Layer-4 load balancing in the Linux kernel, supporting multiple load-balancing algorithms, including least connections and shortest expected delay. This kube-proxy mode became generally available in Kubernetes 1.11, but it requires the Linux kernel to have the IPVS modules loaded. It is also not as widely supported by various Kubernetes networking projects as the iptables mode.
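Before digging into the rules, it helps to confirm which mode kube-proxy is actually running in. A minimal check, assuming a kubeadm-provisioned cluster where kube-proxy is configured through the kube-proxy ConfigMap and its Pods carry the k8s-app=kube-proxy label (both assumptions; other installers may differ):

# the mode field is empty ("") or "iptables" for the default iptables mode
kubectl -n kube-system get configmap kube-proxy -o yaml | grep "mode:"

# kube-proxy also logs which proxier it picked at startup (wording varies by version)
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i proxier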

What does kube-proxy do to enable service abstraction ?

  1. The first thing kube-proxy does after a Service is created is start listening on the allocated node port on all worker nodes:
#service in k8s api
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx
#node1
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3273/kube-proxy
#node2
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3013/kube-proxy
#node3
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3567/kube-proxy

The node port is always allocated from the range 30000 - 32767 (the default, configurable via the API server's --service-node-port-range flag). It is the API server's job to make sure there is no collision in this range.
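For context, here is a minimal sketch of what the my-nginx Service could look like, reconstructed from the kubectl output above; treat it as a sketch rather than the exact manifest used. The nodePort field is optional, and when it is omitted the API server picks a free port from the range for you.

apiVersion: v1
kind: Service
metadata:
  name: my-nginx
spec:
  type: LoadBalancer
  selector:
    run: my-nginx       # matches the pods behind the service
  ports:
  - port: 80            # the cluster IP / external IP port
    targetPort: 80      # the container port on the backend pods
    nodePort: 31428     # the node port kube-proxy listens on (optional to pin)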

2. In addition to this, on each node kube-proxy also creates rules in the iptables NAT table:

#service in k8s api
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx
#NAT table data
#iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- anywhere 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-JD5MR3NA4I4DYORP tcp -- anywhere 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-BEPXDJBUHFCSYIC3 tcp -- anywhere 10.106.189.218 /* default/my-nginx: cluster IP */ tcp dpt:http
KUBE-FW-BEPXDJBUHFCSYIC3 tcp -- anywhere 172.26.236.180 /* default/my-nginx: loadbalancer IP */ tcp dpt:http
KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

3. Now that we have the firewall chain name, KUBE-FW-BEPXDJBUHFCSYIC3, we can inspect it on each node:

#iptables -t nat -L KUBE-FW-BEPXDJBUHFCSYIC3 -v
#node-1
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
#node-2
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
#node-3
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */

4. If we look at the individual chains and follow the Service, we find the Service chain KUBE-SVC-BEPXDJBUHFCSYIC3:

#iptables -t nat -L KUBE-SVC-BEPXDJBUHFCSYIC3 -v
#node-1

Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
1 64 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
1 64 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */
#node-2
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */
#node-3
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */

5. The endpoint (KUBE-SEP-*) chains give an indication of how the rules are set up on each node to modify the packet (the output below is identical on each node). Here we finally see the Pod IPs, which are the actual backends.

Chain KUBE-SEP-IQE2HBJHR23S7HKO (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any sjo-vlan630-gw.abc.com anywhere /* default/my-nginx: */
1 64 DNAT tcp -- any any anywhere anywhere /* default/my-nginx: */ tcp to:192.168.100.1:80
Chain KUBE-SEP-KXZ7A5GCKOVBKJL6 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any 192.168.189.131 anywhere /* default/my-nginx: */
1 64 DNAT tcp -- any any anywhere anywhere /* default/my-nginx: */ tcp to:192.168.189.131:80
Fig6: IPTable chains in packet flow

The KUBE-FW-BEPXDJBUHFCSYIC3 chain has three rules, each jumping to another chain for handling the packet.

  1. KUBE-MARK-MASQ adds a Netfilter mark to packets destined for the my-nginx service which originate outside the cluster’s network. Packets with this mark will be altered in a POSTROUTING rule to use source network address translation (SNAT) with the node’s IP address as their source IP address.
  2. KUBE-SVC-BEPXDJBUHFCSYIC3 chain applies to all traffic bound for our my-nginx service, regardless of source, and has rules for each of the service endpoints (the two pods, in this case). Which endpoint chain to use gets determined in a purely random fashion (50–50)
  3. KUBE-SEP-IQE2HBJHR23S7HKO
  • KUBE-MARK-MASQ again adds a Netfilter mark to the packet for SNAT, if needed
  • The DNAT rule sets up a destination NAT using the 192.168.100.1:80 endpoint as the destination.

4. KUBE-SEP-KXZ7A5GCKOVBKJL6

  • KUBE-MARK-MASQ again adds a Netfilter mark to the packet for SNAT, if needed.
  • The DNAT rule sets up a destination NAT using the 192.168.189.131:80 endpoint as the destination.

5. KUBE-MARK-DROP adds a Netfilter mark to packets which do not have destination NAT enabled by this point. These packets will be discarded in the KUBE-FIREWALL chain.
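To reproduce this chain walk on your own cluster without hunting for the hashed chain names, you can simply grep the NAT table for the Service name, since kube-proxy tags every rule with a comment (a quick check, assuming kube-proxy is in iptables mode):

# dump all NAT rules kube-proxy created for the my-nginx service
iptables-save -t nat | grep my-nginx

# or list a specific chain with counters, e.g. the service chain
iptables -t nat -L KUBE-SVC-BEPXDJBUHFCSYIC3 -n -v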

What did we learn !!

If you have come this far, you must have noticed a few problems with this approach:

  • Packets can be blackholed (dropped) or take an extra network hop when the Pod is not located on the node on which the external request is received (more on this in the externalTrafficPolicy discussion below).
  • The random/round-robin load balancing through sequentially evaluated iptables rules becomes inefficient as the system scales.

This behaviour is controlled by a field in the Kubernetes Service spec: externalTrafficPolicy.

If a Service's .spec.externalTrafficPolicy is set to Cluster (the default), the client's IP address is not propagated to the end Pods.

By setting .spec.externalTrafficPolicy to Local, the client IP address is propagated to the end Pods, but this could result in uneven distribution of traffic. Nodes without any Pods for a particular LoadBalancer Service will fail the external load balancer's health check (for example, an AWS NLB Target Group's check) against the auto-assigned .spec.healthCheckNodePort and will not receive any traffic.

Fig7: difference between externalTrafficPolicy
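Switching between the two policies is a one-field change on the Service. Here is a minimal sketch using kubectl patch (the field only applies to NodePort/LoadBalancer Services; when it is set to Local on a LoadBalancer Service, Kubernetes also allocates a healthCheckNodePort that the external load balancer can probe):

# switch the my-nginx service to Local (preserves the client source IP)
kubectl patch svc my-nginx -p '{"spec":{"externalTrafficPolicy":"Local"}}'

# inspect the resulting fields
kubectl get svc my-nginx -o yaml | grep -E "externalTrafficPolicy|healthCheckNodePort"

With that in mind, let us look at each policy in turn.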

externalTrafficPolicy: Cluster

  • This is the default mode of operation
  • In this mode we see unnecessary network hops when the request is received on a node where the pod does not reside
  • As mentioned earlier, the client IP is lost when these hops happen because of SNAT, so the destination Pod only sees the proxying node's IP
  • In this case the traffic is forwarded/routed over the Calico tunnel interface (see the quick tcpdump check below)

Fig8: SNAT (Source Network Address Translation)
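A quick way to confirm the tunnel hop on the node that received the request (assuming Calico is running in its default IPIP mode, where the tunnel interface is named tunl0; with VXLAN it would be vxlan.calico instead):

# watch the proxied traffic leaving the receiving node towards the backend pod
tcpdump -ni tunl0 tcp port 80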

externalTrafficPolicy: Local

  • When this is used, kube-proxy programs the forwarding rules only on those nodes where a backend Pod is present (refer to the section "What does kube-proxy do to enable service abstraction?")
  • For this example (Fig7), the rules are applied on Node2 only
  • However, this introduces a new problem: if an NLB is placed in front of Node1 and Node2, it will balance traffic evenly across Node1 and Node2. This implies Node2 receives only 50% of the traffic, and the two available Pods between them service only that 50%.
Fig9: externalTrafficPolicy: Local with blackhole
  • This also implies that each of the active Pods services only 25% of the traffic.
Fig10: externalTrafficPolicy: Local traffic distribution

To avoid uneven distribution of traffic we can use pod anti-affinity (against the node’s hostname label) so that pods are spread out across as many nodes as possible:

affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app
            operator: In
            values:
            - my-app
        topologyKey: kubernetes.io/hostname

As your application scales and is spread across more nodes, imbalanced traffic should become less of a concern, as a smaller percentage of the traffic will be unevenly distributed.

So, from the above discussion, externalTrafficPolicy: Local is a better fit for most systems when the kube-proxy mode is iptables.

However, the best option, although still not widely used, is kube-proxy in IPVS mode; the iptables mode is currently more popular in the community. As per kubernetes.io:

  • In ipvs mode, kube-proxy watches Kubernetes Services and Endpoints, calls netlink interface to create IPVS rules accordingly and synchronizes IPVS rules with Kubernetes Services and Endpoints periodically.
  • This control loop ensures that IPVS status matches the desired state. When accessing a Service, IPVS directs traffic to one of the backend Pods.
  • The IPVS proxy mode is based on netfilter hook function that is similar to iptables mode, but uses a hash table as the underlying data structure and works in the kernel space.
  • That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic.

IPVS provides more options for balancing traffic to backend Pods; these are (a configuration sketch follows the list):

  • rr: round-robin
  • lc: least connection (smallest number of open connections)
  • dh: destination hashing
  • sh: source hashing
  • sed: shortest expected delay
  • nq: never queue
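For completeness, here is a minimal sketch of what switching to IPVS could look like on a kubeadm-style cluster. Assumptions: kube-proxy is configured through the kube-proxy ConfigMap, the IPVS kernel modules are available on every node, and ipvsadm is installed for inspection.

# 1. load the IPVS kernel modules on every node
modprobe -a ip_vs ip_vs_rr ip_vs_lc ip_vs_sh nf_conntrack

# 2. set the mode (and optionally the scheduler) in the KubeProxyConfiguration
#    held in the kube-proxy ConfigMap, e.g.:
#      mode: "ipvs"
#      ipvs:
#        scheduler: "lc"   # least connection, one of the algorithms above
kubectl -n kube-system edit configmap kube-proxy

# 3. restart kube-proxy and inspect the virtual servers it programs
kubectl -n kube-system rollout restart daemonset kube-proxy
ipvsadm -Ln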

Conclusion:-

We went over the path traffic takes when it hits a Service on a K8s cluster and how the various modes are implemented. What we did not discuss here is Ingress and how that contributes to K8s traffic routing.
