Demystifying Kubernetes Services Packet Path
In this article I will take a stab at demystifying the various K8s Service types and exactly how they function internally. We will take a closer look at the packet path from the point where traffic enters the K8s cluster. For the rest of the article I will assume the reader has some context on pod-to-pod networking; please refer to the articles linked below for more background on the same.
I picked this topic because I find there is some confusion around it; after looking through some GitHub issues and actual tcpdump captures, I will attempt to clarify the packet flow. To start off, let's look at what Kubernetes Services are by definition:
In Kubernetes, a Service is an abstraction which defines a logical set of Pods and a policy by which to access them.
The set of pods is usually chosen by the configured selectors (Services can also be configured without selectors, but I will not cover that here). These pods become the backends for the Service and are referred to as endpoints. The kube-proxy updates these endpoints as and when pods are added or removed.
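As a quick illustration, a minimal Service with a selector might look like the sketch below (the names, labels, and ports here are hypothetical, not taken from this cluster); Kubernetes populates the matching Endpoints object with the IPs of the pods that carry the selected label.

apiVersion: v1
kind: Service
metadata:
  name: web              # hypothetical Service name
spec:
  selector:
    app: web             # pods carrying this label become the Service endpoints
  ports:
  - protocol: TCP
    port: 80             # port exposed on the Service's ClusterIP
    targetPort: 8080     # port the backend pods actually listen on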
What happens when we access a Service on a cluster?
Before diving into the packet flow, here is some information about the infrastructure the K8s cluster is running on:
- It is a four-node K8s cluster running on VMware (1 master, 3 workers)
- Host Network — 172.26.236.0/24
- Pod Network (Calico) — 192.168.0.0/16
- Service Network (service-cluster-ip-range) — 10.96.0.0/12
Since this is VMware infrastructure, there is no out-of-the-box load balancer support, so I am using MetalLB to simulate one. However, what I want to discuss is not how MetalLB manages the LoadBalancer, but how the constructs interact from a Kubernetes point of view.
The diagram shows how the Service types in K8s depend on each other and how their functionality overlaps.
Having said that, we can redraw Fig1 as shown below.
The following is the distribution of nodes:
I have deployed the following services to understand the packet flow:
The Service in question, which the user is trying to access, is my-nginx (refer Fig1):
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx
The diagram above illustrates what happens when an incoming request hits:
- Node1: all requests are forwarded to Node2 and Node3 (since the application pods are not on Node1), and the traffic is distributed round-robin, 50-50. This is the default behaviour with iptables.
- Node2 or Node3: the request is again served in a round-robin fashion, the difference being that traffic is never sent to Node1, since kube-proxy keeps track of the backends.
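For reference, a Service manifest that would produce the my-nginx entry shown above could look like the sketch below; the type, selector, and ports are taken from the kubectl output, everything else (including spelling out the nodePort) is an assumption.

apiVersion: v1
kind: Service
metadata:
  name: my-nginx
spec:
  type: LoadBalancer     # MetalLB assigns the external IP 172.26.236.180
  selector:
    run: my-nginx        # matches the two nginx pods listed below
  ports:
  - port: 80             # ClusterIP port (10.106.189.218:80)
    targetPort: 80       # container port on the pods
    nodePort: 31428      # normally auto-allocated from 30000-32767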
root@k8s-master-176:~/ansible-data# kubectl get ep
NAME ENDPOINTS AGE
kubernetes 172.26.236.176:6443 25h
my-nginx 192.168.100.1:80,192.168.189.131:80 25h
root@k8s-master-176:~/ansible-data# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-nginx-65c68bbcdf-5vmch 1/1 Running 0 25h 192.168.189.131 k8s-worker-178 <none> <none>
my-nginx-65c68bbcdf-mzzsg 1/1 Running 0 25h 192.168.100.1 k8s-worker-179 <none> <none>
root@k8s-master-176:~/ansible-data#
First, however, we want to know the “how”; then we can focus on the “why”!
Enter Kube-Proxy …
- Each node runs a kube-proxy container process (deployed as a DaemonSet). kube-proxy watches the Kubernetes control plane for the addition and removal of Service and Endpoint objects.
- For each Service, it installs iptables rules which capture traffic to the Service's clusterIP and port, and redirect that traffic to one of the Service's backend sets.
- For each Endpoint object, it installs iptables rules which select a backend Pod.
- It maintains network rules on every node and allows network communication to your Pods from network sessions inside or outside of your cluster.
Needless to say, kube-proxy plays a crucial role in implementing the concept of a Service in Kubernetes.
Kube-Proxy Modes of Operation
- User space: This mode gets its name because the service routing takes place in kube-proxy in the user process space instead of in the kernel network stack. It is not commonly used as it is slow and outdated.
- iptables: This mode uses Linux kernel-level Netfilter rules to configure all routing for Kubernetes Services. This mode is the default for kube-proxy on most platforms. When load balancing for multiple backend pods, it uses unweighted round-robin scheduling. This is the mode we will be looking at.
- IPVS (IP Virtual Server): Built on the Netfilter framework, IPVS implements Layer-4 load balancing in the Linux kernel, supporting multiple load-balancing algorithms, including least connections and shortest expected delay. This kube-proxy mode became generally available in Kubernetes 1.11, but it requires the Linux kernel to have the IPVS modules loaded. It is also not as widely supported by various Kubernetes networking projects as the iptables mode.
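The mode is selected through kube-proxy's own configuration. On a kubeadm-built cluster this typically lives in the kube-proxy ConfigMap in kube-system; a minimal sketch of the relevant part of the KubeProxyConfiguration (the ConfigMap wrapping is omitted, and the exact location can vary by distribution) looks like this:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"    # "" or "iptables" is the Linux default; "ipvs" and the legacy "userspace" are the alternatives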
What does kube-proxy do to enable service abstraction?
- The first thing kube-proxy does after a Service (of type NodePort or LoadBalancer) is created is to start listening on the allocated node port on all worker nodes:
#service in k8s api
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx

#node1
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3273/kube-proxy

#node2
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3013/kube-proxy

#node3
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3567/kube-proxy
The node port is always allocated from the range 30000 - 32767, and the control plane's port allocator makes sure there are no collisions within this range.
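As a side note, on a kubeadm-style control plane this range is configured on the API server. A hedged sketch of the relevant excerpt of the kube-apiserver static pod manifest (the path /etc/kubernetes/manifests/kube-apiserver.yaml and the surrounding fields are assumptions about a standard kubeadm layout):

# excerpt from the kube-apiserver static pod manifest (assumed kubeadm layout)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --service-cluster-ip-range=10.96.0.0/12    # matches the service network listed earlier
    - --service-node-port-range=30000-32767      # NodePort allocation range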
2. In addition to this, on each node kube-proxy also creates rules in the NAT table!
#service in k8s api
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx

#NAT table data
#iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- anywhere 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-JD5MR3NA4I4DYORP tcp -- anywhere 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-BEPXDJBUHFCSYIC3 tcp -- anywhere 10.106.189.218 /* default/my-nginx: cluster IP */ tcp dpt:http
KUBE-FW-BEPXDJBUHFCSYIC3 tcp -- anywhere 172.26.236.180 /* default/my-nginx: loadbalancer IP */ tcp dpt:http
KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
3. Now that we have the firewall chain name, KUBE-FW-BEPXDJBUHFCSYIC3, let's inspect it on each node:
#iptables -t nat -L KUBE-FW-BEPXDJBUHFCSYIC3 -v
#node-1
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */

#node-2
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */

#node-3
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
4. If we look at the individual chains and follow the Service, we find the KUBE-SVC-BEPXDJBUHFCSYIC3 chain:
#iptables -t nat -L KUBE-SVC-BEPXDJBUHFCSYIC3 -v
#node-1
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
1 64 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
1 64 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */

#node-2
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */

#node-3
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */
5. The endpoint (KUBE-SEP-*) chains give a picture of how the rules are set up on each node to modify the packet (the output below is the same on each node). Here we finally see the pod IPs as well, which are the actual backends.
Chain KUBE-SEP-IQE2HBJHR23S7HKO (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any sjo-vlan630-gw.abc.com anywhere /* default/my-nginx: */
1 64 DNAT tcp -- any any anywhere anywhere /* default/my-nginx: */ tcp to:192.168.100.1:80

Chain KUBE-SEP-KXZ7A5GCKOVBKJL6 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any 192.168.189.131 anywhere /* default/my-nginx: */
1 64 DNAT tcp -- any any anywhere anywhere /* default/my-nginx: */ tcp to:192.168.189.131:80
The KUBE-FW-BEPXDJBUHFCSYIC3 chain has three rules, each jumping to another chain for handling the packet.
1. KUBE-MARK-MASQ adds a Netfilter mark to packets destined for the my-nginx service which originate outside the cluster's network. Packets with this mark will be altered in a POSTROUTING rule to use source network address translation (SNAT), with the node's IP address as their source IP address.
2. The KUBE-SVC-BEPXDJBUHFCSYIC3 chain applies to all traffic bound for our my-nginx service, regardless of source, and has a rule for each of the service endpoints (the two pods, in this case). Which endpoint chain to use is determined in a purely random fashion: the first KUBE-SEP rule matches with probability 0.5 (the statistic match shown above) and the second catches whatever falls through, so each endpoint receives roughly half the traffic. In general, with N endpoints, kube-proxy gives the i-th rule a probability of 1/(N-i+1), which works out to 1/N per endpoint.
3. KUBE-SEP-IQE2HBJHR23S7HKO
- KUBE-MARK-MASQ again adds a Netfilter mark to the packet for SNAT, if needed.
- The DNAT rule sets up a destination NAT using the 192.168.100.1:80 endpoint as the destination.
4. KUBE-SEP-KXZ7A5GCKOVBKJL6
- KUBE-MARK-MASQ again adds a Netfilter mark to the packet for SNAT, if needed.
- The DNAT rule sets up a destination NAT using the 192.168.189.131:80 endpoint as the destination.
5. KUBE-MARK-DROP adds a Netfilter mark to packets which do not have destination NAT enabled by this point. These packets will be discarded in the KUBE-FIREWALL chain.
What did we learn !!
If you have come this far, you must have noticed a few problems with this approach:
- The blackhole part, where packets are dropped. This happens when the pod is not located on the node that receives the external request.
- The round-robin load balancing is highly inefficient as the system scales.
This can be explained by a field in the Kubernetes Service: externalTrafficPolicy.
If a Service's .spec.externalTrafficPolicy is set to Cluster (the default), the client's IP address is not propagated to the end Pods.
By setting .spec.externalTrafficPolicy to Local, the client IP address is propagated to the end Pods, but this can result in an uneven distribution of traffic. Nodes without any Pods for a particular LoadBalancer Service will fail the NLB Target Group's health check on the auto-assigned .spec.healthCheckNodePort and will not receive any traffic.
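For example, switching the my-nginx Service to the local policy only requires setting this one field; the rest of the spec below is the same sketch as before.

apiVersion: v1
kind: Service
metadata:
  name: my-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client IP; only nodes with local backends serve traffic
  selector:
    run: my-nginx
  ports:
  - port: 80
    targetPort: 80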
externalTrafficPolicy: Cluster
- This is the default mode of operation
- In this mode we see unnecessary network hops when the request is received on a node where the pod does not reside
- As mentioned earlier, the client IP is lost when these extra network hops happen, because of SNAT; the destination pod only sees the proxying node's IP.
- In this case the traffic is forwarded/routed over the Calico tunnel interface.
externalTrafficPolicy: Local
- When this is used, kube-proxy applies the rules only on those nodes where a backend pod is present (refer to the section "What does kube-proxy do to enable service abstraction?").
- For this example (Fig7), the rules are therefore applied on Node2 only.
- However, this introduces a new problem. If an NLB were placed in front of Node1 and Node2, it would balance traffic evenly across both nodes, so Node2 would receive only 50% of the total traffic and the two available pods would have to serve everything out of that 50%.
- This also implies that each of the active pods services only about 25% of the traffic.
To avoid uneven distribution of traffic we can use pod anti-affinity (against the node’s hostname label) so that pods are spread out across as many nodes as possible:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                  # soft preference, not a hard requirement
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app           # label carried by the application pods
            operator: In
            values:
            - my-app
        topologyKey: kubernetes.io/hostname   # spread pods across distinct nodes
As your application scales and is spread across more nodes, imbalanced traffic should become less of a concern as a smaller percentage of traffic will be unevenly distributed:
So, from the discussion above, externalTrafficPolicy: Local is a better fit for most systems when the kube-proxy mode is iptables.
However, the best option, although still not widely used, is kube-proxy in IPVS mode; iptables mode is currently more popular in the community. As per kubernetes.io:
- In ipvs mode, kube-proxy watches Kubernetes Services and Endpoints, calls the netlink interface to create IPVS rules accordingly, and synchronizes IPVS rules with Kubernetes Services and Endpoints periodically. This control loop ensures that IPVS status matches the desired state. When accessing a Service, IPVS directs traffic to one of the backend Pods.
- The IPVS proxy mode is based on netfilter hook functions similar to iptables mode, but uses a hash table as the underlying data structure and works in the kernel space.
- That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic.
IPVS provides more options for balancing traffic to backend Pods; these are:
- rr: round-robin
- lc: least connection (smallest number of open connections)
- dh: destination hashing
- sh: source hashing
- sed: shortest expected delay
- nq: never queue
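If you do switch kube-proxy to IPVS mode, the scheduling algorithm is chosen in the same KubeProxyConfiguration sketched earlier; a minimal example (the scheduler value here is just an illustration) might be:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"    # least connection; an empty value falls back to round-robin ("rr")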
Conclusion:-
We went over the path traffic takes when it hits a Service on a K8s cluster and how the various kube-proxy modes are implemented. What we didn't discuss here was Ingress and how that contributes to K8s!