Kubernetes and Networking from first principles

M Castelino
Published in Kubehells · Jan 30, 2021

It's been some time since I wrote the Kubernetes Resource Management Deep Dive. In this post I want to reduce Kubernetes to its bare bones, ignoring the scaffolding introduced by the various controllers and higher-level abstractions, and focus on the building blocks, with specific attention to how networking works even though networking is not entirely within the purview of core Kubernetes.

What is the core of Kubernetes once you strip everything away? It is just a few basic constructs in etcd:

  • Pod
  • Endpoint
  • Service
  • Node

Managed and accessed by:

  • Kubelet
  • Kube-proxy
  • Scheduler
  • Controller manager
  • Apiserver
  • CoreDNS

Quick overview

etcd is used to create and update Pods, Endpoints, Services, and Nodes. Everything else is built on top of these core objects. etcd is always accessed via the apiserver, which keeps etcd pluggable/replaceable. Networking is not part of core Kubernetes, which shows how flexible and modular it is.
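As a quick illustration of "everything goes through the apiserver" (assuming kubectl is already pointed at your cluster), the same four object types can be listed straight off the apiserver's REST paths:

kubectl get --raw /api/v1/nodes | jq -r '.items[].metadata.name'
kubectl get --raw /api/v1/namespaces/default/pods | jq -r '.items[].metadata.name'
kubectl get --raw /api/v1/namespaces/default/services | jq -r '.items[].metadata.name'
kubectl get --raw /api/v1/namespaces/default/endpoints | jq -r '.items[].metadata.name'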

  • When Kubelet starts, it registers the node in etcd along with its available resources. (At this point the controller may assign the node's Node.Spec.PodCIDR; this is optional, and a global IPAM can handle address allocation independent of Kubernetes.)
  • Once the controller and scheduler decide how and where a Pod should run, the Pod is written to etcd.
  • Each Kubelet watches for Pods that have been assigned to its node and creates them (via a configured container runtime).
  • When the container runtime creates the Pod, it calls into a CNI plugin to set up the Pod's network. The CNI plugin returns the IP address of the Pod (using either host-local or global IPAM). Notice here the decoupling of the actual IP address management from core Kubernetes.
  • The allocated IP address is passed back to Kubelet and registered in etcd against the podIP field (a quick way to inspect this follows this list). Notice that a Pod can have one IPv4 and one IPv6 address registered here, but this does not limit how many interfaces and IPs can exist within a Pod, and we will explore that a bit later.
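To see both sides of this handoff (the node and pod names below are placeholders for whatever your cluster uses), compare the PodCIDR the controller assigned to a node with the address the CNI plugin handed back for a Pod:

kubectl get node <node-name> -o jsonpath='{.spec.podCIDR}'
kubectl get pod <pod-name> -o jsonpath='{.status.podIP} {.status.podIPs}'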

Also note that the cluster DNS IP is set up within each Pod and is typically 10.96.0.10 (conventionally the tenth address of the default service CIDR). This is important because it allows both DNS and CNI to be decoupled from core Kubernetes. It also allows for a bit of inception: CoreDNS, which itself needs networking created by CNI, stays decoupled from CNI.
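A quick way to see this from inside any Pod (the exact values vary with your cluster's service CIDR and search domains) is to look at its resolv.conf:

kubectl exec <pod-name> -- cat /etc/resolv.conf
# on a default kubeadm/kind cluster this typically shows:
# nameserver 10.96.0.10
# search default.svc.cluster.local svc.cluster.local cluster.local
# options ndots:5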

  • At this point we have a Pod with certain labels associated with it, based on who or which controller created it. For example:
kubectl apply -f https://k8s.io/examples/application/deployment.yaml

Examining the Pods that were created, with a few fields of interest:

kubectl get pods -o json | jq -r '.items[] | [.metadata.name, .metadata.labels.app, .status.phase, .status.podIP, .status.qosClass, .spec.restartPolicy, .spec.preemptionPolicy, .spec.priority]'
[
  "nginx-deployment-66b6c48dd5-2xwm2",
  "nginx",
  "Running",
  "10.244.2.66",
  "BestEffort",
  "Always",
  "PreemptLowerPriority",
  0
]
[
  "nginx-deployment-66b6c48dd5-5lrcw",
  "nginx",
  "Running",
  "10.244.1.61",
  "BestEffort",
  "Always",
  "PreemptLowerPriority",
  0
]

Now that a Pod has an IP assigned to it independent of Kubernetes, how do all the other Pods in the cluster, and external entities, access this Pod?

Kube-proxy is optional and pluggable, and it ensures that traffic to and from a Pod is sent to the right Pod. Note that I said Pod, hence PodIP; all other network addresses are constructs outside the scope of CNI.

It is also decoupled from both CoreDNS and CNI. As seen above, the Pod's IP and labels are registered in etcd; this forms the foundation of all networking in the cluster.

kubectl expose deployment nginx-deployment
service/nginx-deployment exposed
kubectl get svc
NAME               TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
kubernetes         ClusterIP   10.96.0.1      <none>        443/TCP   4d19h
nginx-deployment   ClusterIP   10.96.23.127   <none>        80/TCP    6s
kubectl get endpoints
NAME               ENDPOINTS                       AGE
kubernetes         172.18.0.5:6443                 4d19h
nginx-deployment   10.244.1.61:80,10.244.2.66:80   19s
kubectl get svc nginx-deployment -o json | jq -r '.spec'
{
  "clusterIP": "10.96.23.127",
  "ports": [
    {
      "port": 80,
      "protocol": "TCP",
      "targetPort": 80
    }
  ],
  "selector": {
    "app": "nginx"
  },
  "sessionAffinity": "None",
  "type": "ClusterIP"
}
kubectl get endpoints nginx-deployment -o json | jq -r '.metadata.managedFields[].manager, .subsets[].addresses[].ip, .subsets[].addresses[].targetRef.name'
kube-controller-manager
10.244.1.61
10.244.2.66
nginx-deployment-66b6c48dd5-5lrcw
nginx-deployment-66b6c48dd5-2xwm2

Services and Endpoints

Users and controllers normally create Services, which are just abstractions. The controller (or a user) then creates Endpoints in etcd: it matches Pods against the Service's selector and populates the Endpoints object with the IPs of those Pods as found in etcd.
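Because Endpoints are ordinary objects that anyone can write, the pattern also works in reverse: a Service with no selector gets no Endpoints from the controller, so a user can create them by hand (the name and addresses below are purely illustrative):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: manual-backend
spec:
  ports:
  - port: 80
    targetPort: 80
---
apiVersion: v1
kind: Endpoints
metadata:
  name: manual-backend   # must match the Service name
subsets:
- addresses:
  - ip: 10.244.1.61      # illustrative Pod IP
  ports:
  - port: 80
EOF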

In order to ensure that a Service does not expose Pods that are not ready, a Pod's IP is not added to the Endpoints until the Pod's readiness probe reports that the Pod is ready.

Readiness and liveness probes can be used in parallel for the same container. Using both can ensure that traffic does not reach a container that is not ready for it, and that containers are restarted when they fail.
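A minimal sketch of how both probes are declared on a container (the image, path, and port are just placeholders; the endpoints controller only adds the Pod's IP to Endpoints once the readiness probe passes):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo
spec:
  containers:
  - name: web
    image: nginx:1.21
    ports:
    - containerPort: 80
    readinessProbe:        # gates inclusion in Endpoints
      httpGet:
        path: /
        port: 80
      periodSeconds: 5
    livenessProbe:         # restarts the container if it stops responding
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 10
      periodSeconds: 10
EOF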

Now that we know how endpoints are populated, it's time to see how traffic is sent to the Pod, without Kubernetes itself really handling networking.

CoreDNS, Services, and Endpoints

CoreDNS has a plugin that registers watches for Services and Endpoints. It then uses them to resolve a service name to its ClusterIP. (Note: it may resolve a service to a list of Pod IPs in the case of a headless service.) Pods use CoreDNS to obtain the IP to be used for a service.
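To see this in action for the service created above (busybox:1.28 is just a convenient image with nslookup built in; the output is illustrative and will vary by cluster):

kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup nginx-deployment.default.svc.cluster.local
# Server:    10.96.0.10
# Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
# Name:      nginx-deployment.default.svc.cluster.local
# Address 1: 10.96.23.127 nginx-deployment.default.svc.cluster.local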

Kube-proxy and endpoints

Kube-proxy watches etcd (via the apiserver) for Endpoints and Services. Whenever a Service or Endpoints object is created or modified, it updates the iptables/ipvs rules (or whatever other mechanism is in use) so that any traffic sent to the Service IP is load balanced, in a specified manner, across the set of Pod IPs in the Endpoints.

This also shows how Kube-proxy can be decoupled from CNI. In some cases, for highly optimized traffic flows, the two can be tightly coupled and share state.

The dump below shows how a ClusterIP maps to Pod IPs using the iptables rules set up by kube-proxy. (For a quick primer on iptables, see the iptables-cheatsheet on github.com.)

-A KUBE-SERVICES ! -s 10.244.0.0/16 -d 10.96.23.127/32 -p tcp -m comment --comment "default/nginx-deployment cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SERVICES -d 10.96.23.127/32 -p tcp -m comment --comment "default/nginx-deployment cluster IP" -m tcp --dport 80 -j KUBE-SVC-WRNOD73BKRQH4VVX
-A KUBE-SVC-WRNOD73BKRQH4VVX -m comment --comment "default/nginx-deployment" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AJVPBNAAHRKWHBTA
-A KUBE-SVC-WRNOD73BKRQH4VVX -m comment --comment "default/nginx-deployment" -j KUBE-SEP-SFHBWI2PEUUOP232
-A KUBE-SEP-AJVPBNAAHRKWHBTA -s 10.244.1.61/32 -m comment --comment "default/nginx-deployment" -j KUBE-MARK-MASQ
-A KUBE-SEP-AJVPBNAAHRKWHBTA -p tcp -m comment --comment "default/nginx-deployment" -m tcp -j DNAT --to-destination 10.244.1.61:80

-A KUBE-SEP-SFHBWI2PEUUOP232 -s 10.244.2.66/32 -m comment --comment "default/nginx-deployment" -j KUBE-MARK-MASQ
-A KUBE-SEP-SFHBWI2PEUUOP232 -p tcp -m comment --comment "default/nginx-deployment" -m tcp -j DNAT --to-destination 10.244.2.66:80
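(Those rules were collected on a node with something like the command below; the KUBE-SVC/KUBE-SEP chain names will differ per cluster, since kube-proxy derives them from a hash of the service and endpoint.)

iptables-save -t nat | grep nginx-deployment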

Loadbalancers/NodePort

By default a Service is created with type ClusterIP. The other types allow routing of traffic to Pods using load balancers or NodePorts.

Load balancers typically end up using NodePorts, and in some cases the load balancer's own limitations constrain the size of the cluster.
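As a quick sketch (the service name below is made up, and the allocated port will be whatever the apiserver picks from the default 30000-32767 range), the same deployment can be exposed as a NodePort, which is what external load balancers typically forward to:

kubectl expose deployment nginx-deployment --name=nginx-nodeport --type=NodePort --port=80
kubectl get svc nginx-nodeport
NAME             TYPE       CLUSTER-IP   EXTERNAL-IP   PORT(S)        AGE
nginx-nodeport   NodePort   10.96.x.x    <none>        80:3xxxx/TCP   5s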

This shows how just Nodes, Pods, Services, and Endpoints are used to build every higher-level concept in Kubernetes. We also saw how the simple Pod probes are used to actively manage traffic.

Can a Kubernetes Pod have multiple interfaces and expose multiple IPs?

Because of how loosely coupled networking is in Kubernetes, it is possible to use a CNI plugin or plugin chain that creates multiple interfaces in a Pod, each with its own IP address, even though the Pod spec only supports a single IP address per family.

Only the primary IP associated with eth0 is registered as the PodIP. The secondary IPs associated with the Pod can be stored in additional annotations on the Pod. These annotations can then be used by a plugin controller to create Endpoints that map Service IPs to these secondary IP addresses.
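A hypothetical sketch of that pattern (the annotation key example.com/secondary-ips is made up; real multi-interface plugins define their own keys): the CNI layer or a node agent writes the extra addresses onto the Pod, and a companion controller reads them back to build its own Endpoints:

kubectl annotate pod <pod-name> example.com/secondary-ips='["10.245.1.7","10.245.1.8"]'
kubectl get pod <pod-name> -o json | jq -r '.metadata.annotations["example.com/secondary-ips"]'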

As we have seen, kube-proxy is completely decoupled from CNI and Pods. Kube-proxy will set up rules for secondary IP addresses just as it does for primary IP addresses. As long as the networking implementation has set up host routing such that the secondary IPs are routable, either at the Pod level or at the network-fabric level, traffic to and from these secondary IPs and interfaces will flow transparently across the cluster.

This shows how the decoupling allows us to extend Kubernetes while keeping the core constant.

This also allows implementations such as Virtual Kubelet to add nodes to the cluster, as long as they conform to the basic kubelet interface. All the other components we typically expect to see, like kube-proxy and containerd, can be replaced by highly node-specific implementations.
