Demystifying Kubernetes Services Packet Path
In this article I will take a stab at demystifying the various K8s Service types and exactly how they function internally. We will take a closer look at the packet path from the point where traffic enters the K8s cluster. For the rest of the article I will assume the reader has some context on pod-to-pod networking; please refer to the articles linked below for more background on the same.
I picked this topic because I find there is some confusion around it; after looking through some GitHub issues and actual tcpdump captures, I will attempt to clarify the packet flow. To start off, let's look at what Kubernetes Services are by definition:
In Kubernetes, a Service is an abstraction which defines a logical set of Pods and a policy by which to access them.
The set of pods is usually chosen by the configured selectors (Services can also be configured without selectors, but I will not cover that here). These pods become the backends for the Service and are referred to as endpoints. The kube-proxy updates these endpoints as and when pods are added or removed.
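As a quick illustration, a minimal Service with a selector might look like the sketch below (the names, labels, and ports here are hypothetical, not taken from this cluster); Kubernetes populates the matching Endpoints object with the IPs of the pods that carry the selected label.

apiVersion: v1
kind: Service
metadata:
  name: web              # hypothetical Service name
spec:
  selector:
    app: web             # pods carrying this label become the Service endpoints
  ports:
  - protocol: TCP
    port: 80             # port exposed on the Service's ClusterIP
    targetPort: 8080     # port the backend pods actually listen on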
What happens when we access a Service on a cluster?
Before diving into the packet flow, here is some information about the infrastructure the K8s cluster is running on:
- It is a four-node K8s cluster running on VMware (1 master, 3 workers)
- Host Network — 172.26.236.0/24
- Pod Network (Calico) — 192.168.0.0/16
- Service Network (service-cluster-ip-range) — 10.96.0.0/12
Since this is VMware infrastructure, there is no out-of-the-box load balancer support, so I am using MetalLB to simulate one. However, what I want to discuss is not how MetalLB manages the LoadBalancer, but how the constructs interact from a Kubernetes point of view.
The diagram shows how the Service types in K8s depend on each other and how their functionality overlaps.
Having said that, we can redraw Fig1 as shown below.
The following is the distribution of nodes:
I have deployed the following services to understand the packet flow:
The Service in question, which the user is trying to access, is my-nginx (refer Fig1):
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx
The diagram above illustrates what happens when an incoming request hits:
- Node1: all requests are forwarded to Node2 and Node3 (since the application pods are not on Node1), and the traffic is distributed round-robin, 50-50. This is the default behaviour with iptables.
- Node2 or Node3: the request is again served in a round-robin fashion, the difference being that traffic is never sent to Node1, since kube-proxy keeps track of the backends.
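For reference, a Service manifest that would produce the my-nginx entry shown above could look like the sketch below; the type, selector, and ports are taken from the kubectl output, everything else (including spelling out the nodePort) is an assumption.

apiVersion: v1
kind: Service
metadata:
  name: my-nginx
spec:
  type: LoadBalancer     # MetalLB assigns the external IP 172.26.236.180
  selector:
    run: my-nginx        # matches the two nginx pods listed below
  ports:
  - port: 80             # ClusterIP port (10.106.189.218:80)
    targetPort: 80       # container port on the pods
    nodePort: 31428      # normally auto-allocated from 30000-32767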
root@k8s-master-176:~/ansible-data# kubectl get ep
NAME ENDPOINTS AGE
kubernetes 172.26.236.176:6443 25h
my-nginx 192.168.100.1:80,192.168.189.131:80 25h
root@k8s-master-176:~/ansible-data# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
my-nginx-65c68bbcdf-5vmch 1/1 Running 0 25h 192.168.189.131 k8s-worker-178 <none> <none>
my-nginx-65c68bbcdf-mzzsg 1/1 Running 0 25h 192.168.100.1 k8s-worker-179 <none> <none>
root@k8s-master-176:~/ansible-data#
First, however, we want to know the “how”; then we can focus on the “why”!
Enter Kube-Proxy …
- Each node runs a kube-proxy container process (deployed as a DaemonSet). kube-proxy watches the Kubernetes control plane for the addition and removal of Service and Endpoint objects.
- For each Service, it installs iptables rules which capture traffic to the Service's clusterIP and port, and redirect that traffic to one of the Service's backend sets.
- For each Endpoint object, it installs iptables rules which select a backend Pod.
- It maintains network rules on every node and allows network communication to your Pods from network sessions inside or outside of your cluster.
Needless to say, kube-proxy plays a crucial role in implementing the concept of a Service in Kubernetes.
Kube-Proxy Modes of Operation
- User space: This mode gets its name because the service routing takes place in kube-proxy in the user process space instead of in the kernel network stack. It is not commonly used as it is slow and outdated.
- iptables: This mode uses Linux kernel-level Netfilter rules to configure all routing for Kubernetes Services. This mode is the default for kube-proxy on most platforms. When load balancing for multiple backend pods, it uses unweighted round-robin scheduling. This is the mode we will be looking at.
- IPVS (IP Virtual Server): Built on the Netfilter framework, IPVS implements Layer-4 load balancing in the Linux kernel, supporting multiple load-balancing algorithms, including least connections and shortest expected delay. This kube-proxy mode became generally available in Kubernetes 1.11, but it requires the Linux kernel to have the IPVS modules loaded. It is also not as widely supported by various Kubernetes networking projects as the iptables mode.
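The mode is selected through kube-proxy's own configuration. On a kubeadm-built cluster this typically lives in the kube-proxy ConfigMap in kube-system; a minimal sketch of the relevant part of the KubeProxyConfiguration (the ConfigMap wrapping is omitted, and the exact location can vary by distribution) looks like this:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "iptables"    # "" or "iptables" is the Linux default; "ipvs" and the legacy "userspace" are the alternatives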
What does kube-proxy do to enable service abstraction?
- The first thing kube-proxy does after a Service (of type NodePort or LoadBalancer) is created is to start listening on the allocated node port on all worker nodes:
#service in k8s api
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx

#node1
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3273/kube-proxy

#node2
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3013/kube-proxy

#node3
tcp 0 0 0.0.0.0:31428 0.0.0.0:* LISTEN 3567/kube-proxy
The node port is always allocated from the range 30000 - 32767, and the control plane's port allocator makes sure there are no collisions within this range.
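As a side note, on a kubeadm-style control plane this range is configured on the API server. A hedged sketch of the relevant excerpt of the kube-apiserver static pod manifest (the path /etc/kubernetes/manifests/kube-apiserver.yaml and the surrounding fields are assumptions about a standard kubeadm layout):

# excerpt from the kube-apiserver static pod manifest (assumed kubeadm layout)
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    - --service-cluster-ip-range=10.96.0.0/12    # matches the service network listed earlier
    - --service-node-port-range=30000-32767      # NodePort allocation range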
2. In addition to this, on each node kube-proxy also creates rules in the NAT table!
#service in k8s api
my-nginx LoadBalancer 10.106.189.218 172.26.236.180 80:31428/TCP 24h run=my-nginx

#NAT table data
#iptables -t nat -L KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
target prot opt source destination
KUBE-SVC-NPX46M4PTMTKRN6Y tcp -- anywhere 10.96.0.1 /* default/kubernetes:https cluster IP */ tcp dpt:https
KUBE-SVC-TCOU7JCQXEZGVUNU udp -- anywhere 10.96.0.10 /* kube-system/kube-dns:dns cluster IP */ udp dpt:domain
KUBE-SVC-ERIFXISQEP7F7OF4 tcp -- anywhere 10.96.0.10 /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:domain
KUBE-SVC-JD5MR3NA4I4DYORP tcp -- anywhere 10.96.0.10 /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
KUBE-SVC-BEPXDJBUHFCSYIC3 tcp -- anywhere 10.106.189.218 /* default/my-nginx: cluster IP */ tcp dpt:http
KUBE-FW-BEPXDJBUHFCSYIC3 tcp -- anywhere 172.26.236.180 /* default/my-nginx: loadbalancer IP */ tcp dpt:http
KUBE-NODEPORTS all -- anywhere anywhere /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
3. Now that we have the firewall chain name, KUBE-FW-BEPXDJBUHFCSYIC3, let's inspect it on each node:
#iptables -t nat -L KUBE-FW-BEPXDJBUHFCSYIC3 -v
#node-1
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */

#node-2
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */

#node-3
Chain KUBE-FW-BEPXDJBUHFCSYIC3 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-SVC-BEPXDJBUHFCSYIC3 all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
0 0 KUBE-MARK-DROP all -- any any anywhere anywhere /* default/my-nginx: loadbalancer IP */
4. If we look at the individual chains and follow the Service, we find the KUBE-SVC-BEPXDJBUHFCSYIC3 chain:
#iptables -t nat -L KUBE-SVC-BEPXDJBUHFCSYIC3 -v
#node-1
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
1 64 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
1 64 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */

#node-2
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */

#node-3
Chain KUBE-SVC-BEPXDJBUHFCSYIC3 (3 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SEP-IQE2HBJHR23S7HKO all -- any any anywhere anywhere /* default/my-nginx: */ statistic mode random probability 0.50000000000
0 0 KUBE-SEP-KXZ7A5GCKOVBKJL6 all -- any any anywhere anywhere /* default/my-nginx: */
5. The endpoint (KUBE-SEP-*) chains give a picture of how the rules are set up on each node to modify the packet (the output below is the same on each node). Here we finally see the pod IPs as well, which are the actual backends.
Chain KUBE-SEP-IQE2HBJHR23S7HKO (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any sjo-vlan630-gw.abc.com anywhere /* default/my-nginx: */
1 64 DNAT tcp -- any any anywhere anywhere /* default/my-nginx: */ tcp to:192.168.100.1:80

Chain KUBE-SEP-KXZ7A5GCKOVBKJL6 (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ all -- any any 192.168.189.131 anywhere /* default/my-nginx: */
1 64 DNAT tcp -- any any anywhere anywhere /* default/my-nginx: */ tcp to:192.168.189.131:80
The KUBE-FW-BEPXDJBUHFCSYIC3 chain has three rules, each jumping to another chain for handling the packet.
1. KUBE-MARK-MASQ adds a Netfilter mark to packets destined for the my-nginx service which originate outside the cluster's network. Packets with this mark will be altered in a POSTROUTING rule to use source network address translation (SNAT), with the node's IP address as their source IP address.
2. The KUBE-SVC-BEPXDJBUHFCSYIC3 chain applies to all traffic bound for our my-nginx service, regardless of source, and has a rule for each of the service endpoints (the two pods, in this case). Which endpoint chain to use is determined in a purely random fashion: the first KUBE-SEP rule matches with probability 0.5 (the statistic match shown above) and the second catches whatever falls through, so each endpoint receives roughly half the traffic. In general, with N endpoints, kube-proxy gives the i-th rule a probability of 1/(N-i+1), which works out to 1/N per endpoint.
3. KUBE-SEP-IQE2HBJHR23S7HKO
- KUBE-MARK-MASQ again adds a Netfilter mark to the packet for SNAT, if needed.
- The DNAT rule sets up a destination NAT using the 192.168.100.1:80 endpoint as the destination.
4. KUBE-SEP-KXZ7A5GCKOVBKJL6
- KUBE-MARK-MASQ again adds a Netfilter mark to the packet for SNAT, if needed.
- The DNAT rule sets up a destination NAT using the 192.168.189.131:80 endpoint as the destination.
5. KUBE-MARK-DROP adds a Netfilter mark to packets which do not have destination NAT enabled by this point. These packets will be discarded in the KUBE-FIREWALL chain.
What did we learn !!
If you have come this far, you must have noticed a few problems with this approach:
- The blackhole part, where packets are dropped. This happens when the pod is not located on the node that receives the external request.
- The round-robin load balancing is highly inefficient as the system scales.
This can be explained by a field in the Kubernetes Service: externalTrafficPolicy.
If a Service's .spec.externalTrafficPolicy is set to Cluster (the default), the client's IP address is not propagated to the end Pods.
By setting .spec.externalTrafficPolicy to Local, the client IP address is propagated to the end Pods, but this can result in an uneven distribution of traffic. Nodes without any Pods for a particular LoadBalancer Service will fail the NLB Target Group's health check on the auto-assigned .spec.healthCheckNodePort and will not receive any traffic.
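For example, switching the my-nginx Service to the local policy only requires setting this one field; the rest of the spec below is the same sketch as before.

apiVersion: v1
kind: Service
metadata:
  name: my-nginx
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local   # preserve client IP; only nodes with local backends serve traffic
  selector:
    run: my-nginx
  ports:
  - port: 80
    targetPort: 80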
externalTrafficPolicy: Cluster
- This is the default mode of operation
- In this mode we see unnecessary network hops when the request is received on a node where the pod does not reside
- As mentioned earlier, the client IP is lost when these extra network hops happen, because of SNAT; the destination pod only sees the proxying node's IP.
- In this case the traffic is forwarded/routed over the Calico tunnel interface.
externalTrafficPolicy: Local
- When this is used, kube-proxy applies the rules only on those nodes where a backend pod is present (refer to the section "What does kube-proxy do to enable service abstraction?").
- For this example (Fig7), the rules are therefore applied on Node2 only.
- However, this introduces a new problem. If an NLB were placed in front of Node1 and Node2, it would balance traffic evenly across both nodes, so Node2 would receive only 50% of the total traffic and the two available pods would have to serve everything out of that 50%.
- This also implies that each of the active pods services only about 25% of the traffic.
To avoid uneven distribution of traffic we can use pod anti-affinity (against the node’s hostname label) so that pods are spread out across as many nodes as possible:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                  # soft preference, not a hard requirement
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: k8s-app           # label carried by the application pods
            operator: In
            values:
            - my-app
        topologyKey: kubernetes.io/hostname   # spread pods across distinct nodes
As your application scales and is spread across more nodes, imbalanced traffic should become less of a concern as a smaller percentage of traffic will be unevenly distributed:
So, from the discussion above, externalTrafficPolicy: Local is a better fit for most systems when the kube-proxy mode is iptables.
However, the best option, although still not widely used, is kube-proxy in IPVS mode; iptables mode is currently more popular in the community. As per kubernetes.io:
- In ipvs mode, kube-proxy watches Kubernetes Services and Endpoints, calls the netlink interface to create IPVS rules accordingly, and synchronizes IPVS rules with Kubernetes Services and Endpoints periodically. This control loop ensures that IPVS status matches the desired state. When accessing a Service, IPVS directs traffic to one of the backend Pods.
- The IPVS proxy mode is based on netfilter hook functions similar to iptables mode, but uses a hash table as the underlying data structure and works in the kernel space.
- That means kube-proxy in IPVS mode redirects traffic with lower latency than kube-proxy in iptables mode, with much better performance when synchronising proxy rules. Compared to the other proxy modes, IPVS mode also supports a higher throughput of network traffic.
IPVS provides more options for balancing traffic to backend Pods; these are:
- rr: round-robin
- lc: least connection (smallest number of open connections)
- dh: destination hashing
- sh: source hashing
- sed: shortest expected delay
- nq: never queue
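If you do switch kube-proxy to IPVS mode, the scheduling algorithm is chosen in the same KubeProxyConfiguration sketched earlier; a minimal example (the scheduler value here is just an illustration) might be:

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
mode: "ipvs"
ipvs:
  scheduler: "lc"    # least connection; an empty value falls back to round-robin ("rr")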
Conclusion:-
We went over the path traffic takes when it hits a Service on a K8s cluster and how the various kube-proxy modes are implemented. What we didn't discuss here was Ingress and how that contributes to K8s!