K8s: A Closer Look at Kube-Proxy
The Kubernetes network proxy (aka kube-proxy) is a daemon running on each node of the cluster. It reflects the services defined in the cluster and manages the rules that load-balance requests to a service’s backend pods.
Quick example: Let’s say we have several pods of an API microservice running in our cluster, with those replicas being exposed by a service. When a request reaches the service virtual IP, how is the request forwarded to one of the underlying pods? Well… simply by using the rules that kube-proxy created. OK, it’s not that simple under the hood, but we get the big picture here.
kube-proxy can run in three different modes:
- iptables (default mode)
- ipvs
- userspace (“legacy” mode, not recommended anymore)
While the iptables mode is totally fine for many clusters and workloads, ipvs can be useful when the number of services is large (more than 1,000). Indeed, as iptables rules are evaluated sequentially, routing performance can degrade when many services exist in the cluster.
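To get a sense of the scale involved, we can count the per-service rules kube-proxy maintains in the nat table. The command below is meant to be run on a cluster node; the count obviously depends on your cluster:

```shell
# Each service gets its own KUBE-SVC-* chain in the nat table; in iptables
# mode a packet may be matched against these rules sequentially, so the
# lookup cost grows linearly with the number of services.
$ iptables-save -t nat | grep -c 'KUBE-SVC'
```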
Tigera (the creator and maintainer of the Calico networking solution) details the differences between the iptables and ipvs modes in this great article, which also provides a high-level comparison of the two.
In this article, we will focus on the iptables mode (an upcoming article will be dedicated to ipvs mode) and thus illustrate how kube-proxy defines iptables rules.
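As a quick sanity check before diving in, the active mode can be read from the kube-proxy ConfigMap on kubeadm-based clusters (illustrative output; an empty value means kube-proxy falls back to the default, i.e. iptables):

```shell
$ kubectl -n kube-system get configmap kube-proxy -o yaml | grep 'mode:'
    mode: ""
```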
For that purpose, we will use a freshly created two-node cluster:
$ kubectl get nodes
NAME    STATUS   ROLES                  AGE   VERSION
k8s-1   Ready    control-plane,master   57s   v1.20.0
k8s-2   Ready    <none>                 41s   v1.20.0
In the next part, we will deploy a simple application using a deployment resource and expose it through a service of type NodePort.
Deploy a Sample Application
First, we create a deployment based on the ghost image (ghost is a free and open-source blogging platform) and specify two replicas:
$ kubectl create deploy ghost --image=ghost --replicas=2
Next, we expose the pods using a service of type NodePort:
$ kubectl expose deploy/ghost \
  --port 80 \
  --target-port 2368 \
  --type NodePort
Then we get the information related to this newly created service:
$ kubectl describe svc ghost
Name:                     ghost
Namespace:                default
Type:                     NodePort
IP:                       10.98.141.188
Port:                     <unset>  80/TCP
TargetPort:               2368/TCP
NodePort:                 <unset>  30966/TCP
Endpoints:                10.44.0.3:2368,10.44.0.4:2368
Session Affinity:         None
External Traffic Policy:  Cluster
The important things to note here:
- The virtual IP address (VIP) allocated to the service: 10.98.141.188.
- The NodePort 30966 has been allocated to the service. Through this port, we can access the ghost web interface from any node of the cluster (the cluster’s nodes used in this example have the IP addresses 192.168.64.35 and 192.168.64.36).
- The Endpoints property shows the IP addresses of the pods exposed by the service. In other words, each request that reaches the service’s virtual IP (10.98.141.188) on port 80 will be forwarded to one of the underlying pods’ IPs (10.44.0.3 or 10.44.0.4) on port 2368, the backend being picked at random.
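For instance, the NodePort can be reached with a plain HTTP request to either node (illustrative commands, using the node IPs and port from the output above):

```shell
# Both nodes answer, since kube-proxy programs the same rules on every
# node; the packet is forwarded to a backend pod regardless of whether
# that pod runs locally (External Traffic Policy: Cluster).
$ curl -sI http://192.168.64.35:30966 | head -1
$ curl -sI http://192.168.64.36:30966 | head -1
```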
Note: Endpoints can also be retrieved using the standard kubectl get command:
$ kubectl get endpoints
NAME         ENDPOINTS                       AGE
ghost        10.44.0.3:2368,10.44.0.4:2368   4m
kubernetes   192.168.64.35:6443              6m
Next, we will have a closer look into the iptables rules that kube-proxy has created to route requests towards the backend pods.
A Closer Look Into iptables
Each time a service is created/deleted or the endpoints are modified (e.g. if the number of underlying pods changes due to the scaling of the related deployment), kube-proxy is responsible for updating the iptables rules on each node of the cluster. Let’s see how this is done with the service we defined previously.
As there are quite a lot of iptables chains, we will only consider the main ones involved in routing a request that arrives on the NodePort and is forwarded to one of the underlying pods.
The KUBE-NODEPORTS chain handles the packets arriving on a service of type NodePort. Each packet coming in on port 30966 is thus first processed by KUBE-MARK-MASQ, which tags the packet with the 0x4000 mark.
Note: This mark is later used by the KUBE-POSTROUTING chain, which masquerades (source-NATs) marked packets as they leave the node.
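A sketch of what these rules look like in an iptables-save listing (illustrative excerpt: the chain name, comment, and port follow the usual kube-proxy layout, with the values from our service):

```shell
# nat table excerpt (illustrative): packets hitting the node port are
# first marked, then dispatched to the service-specific chain.
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/ghost" -m tcp --dport 30966 -j KUBE-MARK-MASQ
-A KUBE-NODEPORTS -p tcp -m comment --comment "default/ghost" -m tcp --dport 30966 -j KUBE-SVC-4XJR4EADNBDQKTKS

# KUBE-MARK-MASQ simply sets the 0x4000 mark on the packet.
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
```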
Next, the packet is handled by the KUBE-SVC-4XJR4EADNBDQKTKS chain (referenced in the KUBE-NODEPORTS chain above). Taking a closer look at that chain, we can see two additional iptables chains:
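An illustrative excerpt of that service chain (the SEP chain hashes are the ones from our cluster; the rule shape follows the usual kube-proxy layout):

```shell
# nat table excerpt (illustrative): the service chain spreads traffic
# across its two endpoint (SEP) chains using a random probability.
-A KUBE-SVC-4XJR4EADNBDQKTKS -m comment --comment "default/ghost" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-7I5NH52DVZSA3QHP
-A KUBE-SVC-4XJR4EADNBDQKTKS -m comment --comment "default/ghost" -j KUBE-SEP-PSCUKR75MU2ULAEX
```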
Because of the statistic mode random probability 0.5 statement, each packet getting into the KUBE-SVC-4XJR4EADNBDQKTKS chain is:
- handled by KUBE-SEP-7I5NH52DVZSA3QHP 50% of the time, and thus ignored 50% of the time
- handled by KUBE-SEP-PSCUKR75MU2ULAEX the other 50% of the time (i.e. whenever it is ignored by the first chain)
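This two-rule pattern can be sketched in plain shell (a toy model, not kube-proxy code): the first backend is picked with probability 0.5, otherwise the packet falls through to the last rule, which always matches:

```shell
#!/usr/bin/env bash
# Toy model of iptables' "statistic mode random probability 0.5":
# rule 1 matches half the time; rule 2 is the unconditional fallback.
pick_backend() {
  if [ $((RANDOM % 2)) -eq 0 ]; then
    echo "10.44.0.3:2368"   # KUBE-SEP-7I5NH52DVZSA3QHP
  else
    echo "10.44.0.4:2368"   # KUBE-SEP-PSCUKR75MU2ULAEX
  fi
}

# Over many "packets", traffic splits roughly 50/50 between the two pods.
for _ in $(seq 1 10); do
  pick_backend
done
```

Over a large number of requests this converges to an even split, which is why a chain of N endpoints uses probabilities 1/N, 1/(N-1), … with a final unconditional rule.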
If we inspect both chains, we can see they define the routing towards one of the underlying pods running the ghost application:
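Each endpoint chain looks roughly like this (illustrative excerpt, using the pod IPs from our Endpoints above): a mark rule for hairpin traffic originating from the pod itself, followed by the DNAT towards that pod:

```shell
# nat table excerpt (illustrative): each SEP chain DNATs to one pod.
-A KUBE-SEP-7I5NH52DVZSA3QHP -s 10.44.0.3/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-7I5NH52DVZSA3QHP -p tcp -m tcp -j DNAT --to-destination 10.44.0.3:2368
-A KUBE-SEP-PSCUKR75MU2ULAEX -s 10.44.0.4/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-PSCUKR75MU2ULAEX -p tcp -m tcp -j DNAT --to-destination 10.44.0.4:2368
```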
Following a handful of iptables chains, we can trace the journey of a request from the moment it hits the node port until it reaches an underlying pod. Pretty cool, right?
In this quick article, I hope I managed to clarify how kube-proxy works under the hood when using the iptables mode (the default one). In an upcoming article, we will see how routing is done when load balancing uses the ipvs mode.