This post is going to attempt to demystify the several layers of networking operating in a kubernetes cluster. Kubernetes is a powerful platform embodying many intelligent design choices, but discussing the way things interact can get confusing: pod networks, service networks, cluster IPs, container ports, host ports, node ports… I’ve seen a few eyes glaze over. We mostly talk about these things at work, cutting across all layers at once because something is broken and someone wants it fixed. If you take it a piece at a time and get clear on how each layer works it all makes sense in a rather elegant way.
In order to keep things focused I’m going to split the post into three parts. This first part will look at containers and pods. The second will examine services, which are the abstraction layer that allows pods to be ephemeral. The last post will look at ingress and getting traffic to your pods from outside the cluster. A few disclaimers first. This post isn’t intended to be a basic intro to containers, kubernetes or pods. To learn more about how containers work see this overview from Docker. A high level overview of kubernetes can be found here, and an overview of pods specifically is here. Lastly a basic familiarity with networking and IP address spaces will be helpful.
So what is a pod? A pod consists of one or more containers that are collocated on the same host, and are configured to share a network stack and other resources such as volumes. Pods are the basic unit kubernetes applications are built from. What does “share a network stack” actually mean? In practical terms it means that all the containers in a pod can reach each other on localhost. If I have a container running nginx and listening on port 80 and another container running scrapyd the second container can connect to the first as http://localhost:80. But how does that really work? Lets look at the typical situation when we start a docker container on our local machine:
From the top down we have a physical network interface eth0. Attached to that is a bridge docker0, and attached to that is a virtual network interface veth0. Note that docker0 and veth0 are both on the same network, 172.17.0.0/24 in this example. On this network docker0 is assigned 172.17.0.1 and is the default gateway for veth0, which is assigned 172.17.0.2. Due to how network namespaces are configured when the container is launched processes inside it see only veth0, and communicate with the outside world through docker0 and eth0. Now let’s launch a second container:
As shown above the second container gets a new virtual network interface veth1, connected to the same docker0 bridge.* This interface is assigned 172.17.0.3, so it is on the same logical network as the bridge and the first container, and both containers can communicate through the bridge as long as they can discover the other container’s IP address somehow.
[ * Dan Nissenbaum pointed out that this description omits some detail. For background see our brief discussion at the end of the post.]
[ * 12/15/2018: my previous update really got the lower levels of this wrong. The connection between a container and the bridge is established over a pair of linked virtual ethernet devices, one in the container network namespace and the other in the root network namespace. For a great overview of this subject see Kristen Jacobs’ awesome “Container Networking From Scratch” talk given this week at Kubecon 2018 in Seattle (slide link at the bottom). I really enjoyed Kristen’s presentation and I plan to shamelessly copy it in a future post on container networking.]
[ * 9/9/2019: this article by Ifeanyi Ubah does a great job of exploring linux network namespaces and veth pairs]
That’s fine and all but it doesn’t get us to the “shared network stack” of a kubernetes pod. Fortunately namespaces are very flexible. Docker can start a container and rather than creating a new virtual network interface for it, specify that it shares an existing interface. In this case the drawing above looks a little different:
Now the second container sees veth0 rather than getting its own veth1 as in the previous example. This has a few implications: first, both containers are addressable from the outside on 172.17.0.2, and on the inside each can hit ports opened by the other on localhost. This also means that the two containers cannot open the same port, which is a restriction but no different than the situation when running multiple processes on a single host. In this way a set of processes can take full advantage of the decoupling and isolation of containers, while at the same time collaborating together in the simplest possible networking environment.
Kubernetes implements this pattern by creating a special container for each pod whose only purpose is to provide a network interface for the other containers. If you ssh in to a kubernetes cluster node that has pods scheduled on it and run
docker ps you will see at least one container that was started with the
pause command. The
pause command suspends the current process until a signal is received so these containers do nothing at all except sleep until kubernetes sends them a SIGTERM. Despite this lack of activity the “pause” container is the heart of the pod, providing the virtual network interface that all the other containers will use to communicate with each other and the outside world. So in a hypothetical pod-like thing the last picture sort of looks like this:
The Pod Network
That’s all pretty cool, but one pod full of containers that can talk to each other does not get us a system. For reasons that will become even clearer in the next post where I discuss services, the very heart of kubernetes’ design requires that pods be able to communicate with other pods whether they are running on the same local host or separate hosts. To look at how that happens we need to step up a level and look at nodes in a cluster. This section will contain some unfortunate references to network routing and routes, a subject I realize all of humanity would prefer to avoid. Finding a clear, brief tutorial on IP routing is difficult, but if you want a decent review wikipedia’s article on the topic isn’t horrible.
A kubernetes cluster consists of one or more nodes. A node is a host system, whether physical or virtual, with a container runtime and its dependencies (i.e. docker mostly) and several kubernetes system components, that is connected to a network that allows it to reach other nodes in the cluster. A simple cluster of two nodes might look like this:
If you’re running your cluster on a cloud platform like GCP or AWS that drawing pretty well approximates the default networking architecture for a single project environment. For the purposes of illustration I’ve used the private network 10.100.0.0/24 for this example, so the router is 10.100.0.1 and the two instances are 10.100.0.2 and 10.100.0.3 respectively. Given this setup each instance can communicate with the other on eth0. That’s great, but recall that the pod we looked at above is not on this private network: it’s hanging off a bridge on a different network entirely, one that is virtual and exists only on a specific node. To make this clearer let’s drop our pod-like things back into the picture:
The host on the left has interface eth0 with an address of 10.100.0.2, whose default gateway is the router at 10.100.0.1. Connected to that interface is bridge docker0 with an address of 172.17.0.1, and connected to that is interface veth0 with address 172.17.0.2. The veth0 interface was created with the pause container and is visible inside all three containers by virtue of the shared network stack. Because of local routing rules set up when the bridge was created any packet arriving at eth0 with a destination address of 172.17.0.2 will be forwarded to the bridge, which will then send it on to veth0. Sounds ok so far. If we know we have a pod at 172.17.0.2 on this host we can add rules to our router setting the next hop for that address to 10.100.0.2 and they will get forwarded from there to veth0. Dandy! Now let’s look at the other host.
The host on the right also has eth0, with an address of 10.100.0.3, using the same default gateway at 10.100.0.1, and again connected to it is the docker0 bridge with an address of 172.17.0.1. Hmm. That’s going to be an issue. Now this address might not actually be the same as the other bridge on host 1. I’ve made it the same here because that’s a pathological worst case, and it might very well work out that way if you just installed docker and let it do its thing. But even if the chosen network is different this highlights the more fundamental problem which is that one node typically has no idea what private address space was assigned to a bridge on another node, and we need to know that if we’re going to send packets to it and have them arrive at the right place. Clearly some structure is required.
Kubernetes provides that structure in two ways. First, it assigns an overall address space for the bridges on each node, and then assigns the bridges addresses within that space, based on the node the bridge is built on. Secondly it adds routing rules to the gateway at 10.100.0.1 telling it how packets destined for each bridge should be routed, i.e. which node’s eth0 the bridge can be reached through. Such a combination of virtual network interfaces, bridges, and routing rules is usually called an overlay network. When talking about kubernetes I usually call this network the “pod network” because it is an overlay network that allows pods to communicate back and forth on any node. Here is the drawing above with kubernetes in play:
One thing that should jump out is that I’ve changed the name of the bridges from “docker0” to “cbr0.” Kubernetes does not use the standard docker bridge device and in fact “cbr” is short for “custom bridge.” I don’t know everything that is custom about it, but it is one of the important differences between docker running on kubernetes vs. a default installation. Another thing to note is that the address space assigned to the bridges in this example is 10.0.0.0/14. That’s taken from one of our staging clusters in Google Cloud and so is a real-world example. Your cluster may be assigned an entirely different range. Unfortunately there is no way to expose this using the
kubectl utility at the moment, but on GCP you can run
gcloud container clusters describe <cluster> command and look for the
Generally you won’t need to think about how this network functions. When a pod talks with another pod it most often does so through the abstraction of a service, a kind of software-defined proxy that will be the subject of the next post in this series. But pod network addresses will pop up in logs and when debugging and in some scenarios you may need to explicitly route this network. For example traffic leaving a kubernetes pod bound for any address in the 10.0.0.0/8 range is not NAT’d by default, so if you communicate with services on another private network in that range you may need to set up rules to route the packets back to the pods. Hopefully this article will help you take the right steps to make such scenarios work correctly.
[Update] part 2 has been posted: