Kubernetes: from load balancer to pod
At work we use Kubernetes. Last Friday I was talking to a colleague and we were wondering how load balancers, services and pods all work together. Actually, everything is pretty well explained in the Services, Load Balancing, and Networking section of the Kubernetes concepts documentation. However, you probably need a couple of reads to make sense of everything, so I really needed to see it for myself and play with an example. The basic question I was trying to answer was: what happens when you define a service as a load balancer, and how do packets end up in my pod?
So, let’s start with the example. Suppose we have this service defined:
kind: Service
apiVersion: v1
metadata:
  name: cloud-nginx
spec:
  type: LoadBalancer
  selector:
    app: cloud-nginx
  ports:
  - port: 80
    name: http-server
  - port: 443
    name: https-server
This is a LoadBalancer service named cloud-nginx. The pods targeted by this service are the ones labeled app: cloud-nginx (see spec.selector), and they will be targeted on ports 80 and 443. Port 80 basically redirects (with a 301 HTTP status code) to port 443. Our LoadBalancer service will automatically create a couple of things: a cluster IP (only accessible inside the Kubernetes cluster) and a service node port. A service node port is exposed on every node in the cluster. This is important.
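If you want to follow along, you can create the service from a file (I'm assuming here it's saved as cloud-nginx-service.yaml, any name works) and then ask Kubernetes what it allocated; the clusterIP and nodePort fields get filled in automatically even though we never set them in the manifest:
$ kubectl create -f cloud-nginx-service.yaml
$ kubectl get service cloud-nginx -o yaml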
The load balancer is external to the cluster, which means it will have an external IP and it will forward packets to the service node ports created above.
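Depending on the cloud provider it can take a minute or two for that external IP to show up (the EXTERNAL-IP column says &lt;pending&gt; in the meantime), so it can be handy to watch the service until it is provisioned:
$ kubectl get service cloud-nginx -w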
When the packets reach the node (nodes used to be called minions), what happens next depends on which kube-proxy mode we are using. There are two modes: userspace and iptables. In our case we use userspace, as we can see from the command running on the node:
# ps auxfww | grep kube-proxy
...
kube-proxy --master=https://30.191.106.12 --kubeconfig=/var/lib/kube-proxy/kubeconfig --cluster-cidr=10.244.0.0/14 --resource-container= --v=2 --proxy-mode=userspace 1>>/var/log/kube-proxy.log 2>&1
...
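Another way to double-check the mode is the kube-proxy log (the path comes from the command line above); at startup the proxier logs which mode it ended up using, although the exact wording depends on the Kubernetes version:
# grep -i proxier /var/log/kube-proxy.log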
In userspace mode, kube-proxy opens a proxy port and installs iptables rules that redirect traffic from the service node port to that proxy port. This means that both ports are opened by kube-proxy (ports 31234 and 32855 are explained further down):
# lsof -i TCP:31234 -i TCP:32855
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
kube-prox 3822 root 22u IPv6 272416 0t0 TCP *:31234 (LISTEN)
kube-prox 3822 root 23u IPv6 269708 0t0 TCP *:32855 (LISTEN)
When packets reach a service node port, the iptables rules kick in. These are the iptables rules set up by kube-proxy that I mentioned above. So, in our case, packets that reach the external load balancer on port 80 will go to one of our nodes (i.e. to the service node port), and once they reach the node the iptables rules will redirect them to the proxy port.
Inside iptables we are interested in the nat table. You can list all the nat rules with:
# iptables -t nat -L -n
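Since the full nat table is pretty long, it helps that kube-proxy adds a comment with the service name to every rule it creates, so you can filter for our service directly:
# iptables -t nat -S | grep cloud-nginx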
Then you can look for particular rules. For example, in one of the nodes of my cluster there are these iptables rules:
# iptables -t nat -S KUBE-NODEPORT-CONTAINER
...
-A KUBE-NODEPORT-CONTAINER -p tcp -m comment --comment "default/cloud-nginx:http-server" -m tcp --dport 31234 -j REDIRECT --to-ports 32855
...
which redirects packets arriving at port 31234 to the local port 32855.
# iptables -t nat -S KUBE-NODEPORT-HOST
...
-A KUBE-NODEPORT-HOST -p tcp -m comment --comment "default/cloud-nginx:http-server" -m tcp --dport 31234 -j DNAT --to-destination 10.240.0.4:32855
...
which rewrites packets going to port 31234 so they are sent to the node's internal IP (10.240.0.4) on port 32855.
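If you want to convince yourself that these are the rules actually being hit, you can check their packet counters (the pkts/bytes columns from -v) before and after sending some traffic to port 31234:
# iptables -t nat -L KUBE-NODEPORT-CONTAINER -n -v
# iptables -t nat -L KUBE-NODEPORT-HOST -n -v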
If you connect to either 31234 or 32855, you can see that it works:
# nc localhost 31234
GET /
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.11.5</center>
</body>
</html>
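If curl is available on the node you can also see the full response headers, including the Location header pointing at the HTTPS site, which is the 301 redirect I mentioned at the beginning:
# curl -sI http://localhost:31234/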
So what are ports 31234 and 32855 that show up in these iptables rules? Port 31234 is the service node port and 32855 is the proxy port. We can also see the mapping from the load balancer port 80 to the service node port 31234 using kubectl:
$ kubectl get services cloud-nginx
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
cloud-nginx 10.0.226.161 101.195.33.220 80:31234/TCP,443:30582/TCP 1d
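And from a machine outside the cluster (assuming the cloud firewall rules for the load balancer are in place), the same 301 should come back through the external IP:
$ curl -sI http://101.195.33.220/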
Going back to iptables, we can see another interesting rule:
# iptables -t nat -S KUBE-PORTALS-HOST
...
-A KUBE-PORTALS-HOST -d 101.195.33.220/32 -p tcp -m comment --comment "default/cloud-nginx:http-server" -m tcp --dport 80 -j DNAT --to-destination 10.240.0.4:32855
...
which rewrites packets addressed to my load balancer (101.195.33.220) on port 80 so they go directly to the proxy port 32855 (no need to go through the service node port).
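For completeness, there is a sibling chain, KUBE-PORTALS-CONTAINER, which (if I read the kube-proxy code correctly) does the same job for packets arriving at the node rather than generated on it, using REDIRECT rules for the cluster IP and the load balancer IP:
# iptables -t nat -S KUBE-PORTALS-CONTAINER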
A final question remains to be answered: how does kube-proxy forward packets from port 32855 to the cloud-nginx container? The missing piece of the puzzle is that each Kubernetes pod has its own IP and kube-proxy knows about them. For our purposes, we can get the pod IP with kubectl:
$ kubectl get po cloud-nginx-3tvcp -o wide
NAME READY STATUS RESTARTS AGE IP NODE
cloud-nginx-3tvcp 1/1 Running 0 2d 10.244.1.5 aleix-minion-group-7f52
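By the way, kube-proxy doesn't watch pods directly; it watches the Endpoints object that Kubernetes keeps in sync with the pods matching the service selector. The same pod IP (with ports 80 and 443) shows up there:
$ kubectl get endpoints cloud-nginx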
This means that kube-proxy will basically forward all packets from port 32855 to 10.244.1.5 on port 80 (which is the port specified in our service file). And from the node we can verify that the pod IP is reachable on that port:
# nc 10.244.1.5 80
GET /
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>
<hr><center>nginx/1.11.5</center>
</body>
</html>
So the whole process would look something like this from a high-level perspective:
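client -> load balancer (101.195.33.220:80)
       -> service node port on one of the nodes (:31234)
       -> iptables REDIRECT / DNAT to the proxy port (:32855, owned by kube-proxy)
       -> kube-proxy -> pod (10.244.1.5:80)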
Above I mentioned that port 31234 was also opened by kube-proxy, but as we can see in the flow sketched above, kube-proxy doesn't actually proxy any traffic on it. If I read the kube-proxy code correctly, this is to make sure the port is reserved so that no other application can grab it.
And I believe that's it. There's the other kube-proxy mode based on iptables, which performs better, I believe because packets don't need to go through userspace. However, since it's all iptables based, if a packet can't reach the pod (e.g. because the pod is down) there will be no retries, as there is no such thing in iptables. But that's for another day.