Overview of Azure Kubernetes Services Networking Models

Vamsi Munagala
Nov 3, 2022


In this article, I would like to discuss AKS Networking models, specifically Kubenet and Azure CNI. I aim to explain the basics in simple language and help you understand the difference between these models.

It is not mandatory but will be helpful if you have a basic understanding of Azure networking, an overview of AKS cluster components, and Linux network namespaces.

AKS Cluster Architecture

Before diving in, let’s understand how the control plane and the worker nodes of an AKS cluster are managed:

Azure manages the control plane components such as kube-apiserver, etcd, kube-scheduler, and kube-controller-manager. In other words, you do not have direct access to the control plane; the entire orchestration layer and its control-level operations are handled by Microsoft.

You (the customer), on the other hand, manage the worker nodes, where you create pods to run your microservices or applications.
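
You can see this split from kubectl itself: listing nodes on an AKS cluster returns only the worker (agent) nodes, because the managed control plane never appears as a node you can inspect. A minimal check, assuming you already have cluster credentials:

```bash
# Only the agent (worker) nodes show up; the Microsoft-managed control plane
# components (kube-apiserver, etcd, ...) are not visible as nodes.
kubectl get nodes -o wide
```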

Azure Kubernetes Cluster Architecture

AKS Networking Models

Every Kubernetes networking implementation must abide by the rules below:

  • All pods can communicate with each other across all worker nodes.
  • Agents or applications running on a node must be able to communicate with all pods on that node. For instance, the kubelet or kube-proxy running on a node should be able to contact its pods.
  • All pods within a cluster can communicate with other pods, and all nodes can communicate with all pods in the cluster, without using NAT (Network Address Translation). A quick way to verify the no-NAT rule follows this list.
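
To see the no-NAT rule in action, here is a minimal sketch (pod names and images are placeholders) that starts an nginx pod and reaches it directly by its pod IP from another pod:

```bash
# Start a test pod and wait for it to become Ready.
kubectl run web --image=nginx:1.25 --port=80
kubectl wait --for=condition=Ready pod/web

# Grab its pod IP and curl it from a second, throwaway pod.
POD_IP=$(kubectl get pod web -o jsonpath='{.status.podIP}')
kubectl run client --rm -it --restart=Never --image=curlimages/curl:8.5.0 -- \
  curl -s -o /dev/null -w '%{http_code}\n' "http://${POD_IP}:80"   # expect 200
```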

There are two main networking plugins that I want to talk about: Kubenet and Azure CNI. Third-party CNI plugins such as Calico and Flannel are not covered in this article.

One fundamental difference between these networking plugins is the IP addressing scheme they use for pod assignment.

Nodes can be either virtual machines or virtual machine scale sets, and they typically get their IP addresses from your existing virtual network.

As discussed above, one of the networking rules is that there is NO Network Address Translation (NAT) between pods, which means every pod has a unique IP address and can communicate with other pods in its cluster using the pod’s IP address.

Kubenet

With the kubenet plugin, the nodes get their IP addresses from your virtual network’s subnet, whereas your pods get their IP addresses from a separate Pod CIDR range (effectively its own address space) that must not overlap with your nodes’ subnet or your on-premises network.

In other words, your pods DO NOT get IP addresses from your virtual network’s subnet.

Let us assume that we want an AKS cluster with 4 nodes in an existing VNet, with a node subnet whose address prefix is 192.168.1.0/24.
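
For reference, a kubenet cluster along these lines could be provisioned roughly as follows; the resource group, cluster name, and subnet ID are placeholders, and the Pod CIDR shown is just the common 10.244.0.0/16 default:

```bash
# Sketch only: create a 4-node AKS cluster with the kubenet plugin,
# attaching the nodes to an existing subnet (192.168.1.0/24).
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --node-count 4 \
  --network-plugin kubenet \
  --vnet-subnet-id "/subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Network/virtualNetworks/myVnet/subnets/nodes" \
  --pod-cidr 10.244.0.0/16 \
  --service-cidr 10.0.0.0/16 \
  --dns-service-ip 10.0.0.10
```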

Node IP Assignment

After provisioning your AKS cluster with the kubenet plugin, you will see that Azure created a separate, non-overlapping Pod CIDR for pods.

Kubenet Plugin: Pod CIDR

Azure carves out a /24 address space from the Pod CIDR for each node for pod assignment, which gives roughly 250 usable pod IP addresses per node, even though the default kubenet limit is 110 pods per node. So why do we need the extra IP addresses? Because Kubernetes needs headroom for pod upgrades and scaling.

As you can see in the diagram below, each node gets its own unique /24 slice of the Pod CIDR for its pods.

Kubenet: Pod CIDR Assignment
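
If you want to verify this on your own cluster, a quick check (assuming kubectl access) lists each node together with the /24 Pod CIDR it was assigned and its pod capacity:

```bash
# Each kubenet node should show a distinct /24 slice of the Pod CIDR
# and a pod capacity of 110 by default.
kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR,MAXPODS:.status.capacity.pods
```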

Let us cover different connectivity scenarios. Before we do that, you should have a basic understanding of how Linux network namespaces work. For example, every pod gets an isolated network namespace (with its own networking stack, firewall rules, routing table, etc.) independent of its host node’s root network namespace.

Pod-to-Pod Communication in the same Node

Once pods are created in the host node, the following happens:

  1. As soon as a pod is created, the container runtime (containerd) creates a network namespace and a special hidden container called a “pause” container that holds the network namespace and enables any new containers to join the network.
  2. Kubenet handles the rest of the network management for the pod, i.e. it creates a local interface (eth0), assigns an IP address to the pod, and attaches the pod’s local interface (eth0) to the root network namespace via a veth pair.
  3. Kubelet takes care of creating the virtual interfaces (veth0, veth1, …) in the node’s root namespace, which get paired up with their respective pod’s local interface (eth0), establishing connectivity between the local pods and their host node.
  4. A Layer 2 virtual ethernet bridge (cbr0) in the node’s root namespace connects all these virtual interfaces (veth0, veth1, …) to facilitate communication between the pods. In the screenshot below, you will see that all the veths are connected to the cbr0 bridge.
Kubenet: cbr0 bridge
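
If you have shell access to a kubenet node (for example via SSH or kubectl debug), you can observe this wiring yourself; a rough sketch, where the bridge and interface names match the screenshot above but may differ between AKS versions:

```bash
# The cbr0 bridge and the host-side veth interfaces attached to it.
brctl show cbr0
ip -br link show type veth

# IP forwarding must be enabled for cross-node pod traffic (covered below).
cat /proc/sys/net/ipv4/ip_forward   # expect 1
```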

Say Pod1 needs to communicate with Pod2 on the same node.

Pod-to-Pod communication on the same node
  1. Since Pod2 is on the same network (10.244.0.0/24) as Pod1, Pod1 sends the packet out of its default eth0 interface. This interface is paired with Node1’s veth0 and serves as a tunnel, so the packet is delivered to the root namespace on the node.
  2. The ethernet bridge (cbr0), acting as an L2 virtual switch, checks its cached lookup table for the destination pod’s address. If it is not found, it broadcasts an ARP request to all connected devices. Pod2 responds with its MAC address, and the bridge updates its ARP cache entries (see the sketch after this list for a way to observe this on the node).
  3. The cbr0 bridge then looks up Pod2’s MAC address in its table and forwards the packet to the correct endpoint, which is Pod2’s veth1.
  4. The packet reaches Pod2’s eth0 interface inside its namespace.
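
You can watch steps 2 and 3 happen by inspecting the MAC addresses the bridge has learned; a rough sketch, assuming the iproute2 bridge tool is available on the node:

```bash
# MAC addresses the cbr0 bridge has learned, and the veth port each one sits
# behind; pod MACs show up here after the ARP exchange described above.
bridge fdb show br cbr0

# The node's own ARP/neighbour cache.
ip neigh show
```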

Container to Container communication in the same Pod

Multi-container apps (such as sidecar or init containers) run in the same pod alongside other containers. Since these containers are local to a pod, they all share the pod’s network namespace. In other words, containers running in a pod share the network and other resources available to that pod. Thus, these containers communicate with each other using the localhost:<port> (or <pod-ip-address>:<port>) endpoints they expose.
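
As an illustration, here is a minimal two-container pod sketch (names and images are placeholders): the sidecar reaches the web container purely over localhost because both share the pod’s network namespace:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo
spec:
  containers:
  - name: web
    image: nginx:1.25
    ports:
    - containerPort: 80
  - name: sidecar
    image: curlimages/curl:8.5.0
    # Containers in a pod share one network namespace (one IP, one port space),
    # so the sidecar can reach nginx on localhost:80.
    command: ["sh", "-c", "while true; do curl -s http://localhost:80 > /dev/null && echo reached web over localhost; sleep 10; done"]
EOF

# The sidecar's log confirms it can reach the web container on localhost.
kubectl logs shared-netns-demo -c sidecar --tail=5
```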

Pod-to-Pod Communication on a Different Node

Let’s consider a use case where Node 1’s Pod1 needs to communicate with Node 2’s Pod2, which lives on a different node in the same cluster.

Pod-to-Pod communication on different nodes
  • Pod1 sends the packet out of its default eth0 interface. This interface is paired with Node1’s veth0 and serves as a tunnel, so the packet is delivered to the root namespace on the node.
  • Since Pod2 is on a different network (10.244.1.0/24) than Pod1 (10.244.0.0/24), the ethernet bridge (cbr0) simply forwards the request (IP forwarding) to the node’s default eth0 interface without broadcasting an ARP request. [Note that IP forwarding is enabled on every node.]
IP Forwarding enabled on each node
Node’s route information
  • Node 1’s Pod1 can now reach the Azure virtual network subnet, but it still cannot reach Node 2’s Pod2 because the Azure virtual network is unaware of the pod subnet (Pod CIDR). So where else does it need to look for pod routing information? The Azure route table to the rescue. This route table (UDR) is attached to the node subnet and has a defined route for every assigned Pod CIDR prefix. As you can see in the screenshot below (and in the CLI sketch after this list), the next hop for 10.244.1.0/24 is Node 2’s IP address, 192.168.1.5.
Kubenet: Azure Route Table
  • The packet is routed to Node 2’s default interface, which is eth0.
  • Everything proceeds in reverse order from here: the packet reaches the cbr0 bridge, which issues an ARP request, identifies Pod2’s MAC address, and sends the packet to the destination endpoint, Pod2’s eth0.
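
If you want to look at this route table yourself, here is a sketch with the Azure CLI; the node resource group (the MC_* group AKS creates) and the route table name below are placeholders:

```bash
# The kubenet route table lives in the cluster's node resource group.
az network route-table list \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --output table

# Each route maps one node's /24 Pod CIDR to that node's private IP
# (e.g. 10.244.1.0/24 -> 192.168.1.5).
az network route-table route list \
  --resource-group MC_myResourceGroup_myAKSCluster_eastus \
  --route-table-name <route-table-name> \
  --output table
```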

A bit deeper into SNAT

Let’s take a look at when Network Address Translation (Pod IP → Node IP) happens. For that, we have to look at iptables. Its NAT table includes the PREROUTING chain (for altering packets as soon as they come in), OUTPUT (for altering locally generated packets before routing), and POSTROUTING (for altering packets as they are about to go out). Let’s look at the POSTROUTING chain on one of our nodes.

So, if any source (0.0.0.0/0) is trying to reach the pods (10.244.0.0/16), the traffic hits the RETURN target, which basically means the packet goes out as is. In other words, when a pod sends traffic to another pod, it retains its pod IP.

However, if the traffic is not destined for pods in the cluster, it hits the MASQUERADE target, which means outgoing connections go through Source NAT (SNAT), i.e. the source IP is rewritten to the node’s IP address. So if a pod has to access other resources in the Azure virtual network, its traffic gets NAT’ed to its node’s IP address.
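
To see these rules on a node, you can dump the POSTROUTING chain of the NAT table yourself; a sketch assuming shell access to the node (the exact chain layout can vary between AKS versions):

```bash
# Expect a RETURN rule for traffic destined to the Pod CIDR (10.244.0.0/16)
# and a MASQUERADE rule for everything else leaving the node.
sudo iptables -t nat -L POSTROUTING -n -v --line-numbers
```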

Azure CNI

CNI (Container Network Interface) is a standard specification with a defined set of rules for building programmable plugins that create and configure network namespaces for containers.

A CNI plugin is built from a set of libraries and specifications (usually written in Go) responsible for defining an interface for configuring the network, provisioning IP addresses, and maintaining connectivity across multiple hosts.

With the Azure CNI plugin, all pods get routable IP addresses from the node’s subnet itself. It is therefore very important to plan your cluster’s IP addressing beforehand, as you may exhaust the subnet if more pods are required in the future or demand for your application grows.

Just as with Kubenet, Azure CNI is responsible for creating the veth pair, linking veth to the container network namespace, and assigning the IP address to the Pod (eth0).

Azure CNI: Pod IP addresses from the Node subnet

As you can see in the image below, there is NO Pod CIDR for pod assignment.

Azure CNI: Networking

By default, you can have 30 pods per node, but you can extend the limit to 250, for example with the help of the newer dynamic IP allocation capability in Azure CNI.
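
For comparison with the kubenet example earlier, an Azure CNI cluster could be created roughly like this (again, the names and subnet ID are placeholders); --max-pods controls how many pod IPs are reserved per node:

```bash
# Sketch only: Azure CNI cluster whose pods draw IPs from the node subnet.
az aks create \
  --resource-group myResourceGroup \
  --name myAKSClusterCNI \
  --node-count 4 \
  --network-plugin azure \
  --vnet-subnet-id "/subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Network/virtualNetworks/myVnet/subnets/nodes" \
  --max-pods 30 \
  --service-cidr 10.0.0.0/16 \
  --dns-service-ip 10.0.0.10
```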

These pod IP addresses are placed on the NIC of the container host’s VM. In the screenshot below, you can see that 1 IP is assigned to the host, and the remaining 30 IPs are reserved for pods on that node.

Azure CNI: Node and Pod NICs

Since all the nodes and pods are directly connected to the Azure VNet, there is no need for any packet translation, and all packets are sent as is.

What happened to the Azure CNI (azure0) bridge network?

As you saw with kubenet, all the host-side pod veth pair interfaces are connected to the cbr0 bridge, so intra-VM Pod-to-Pod communication and all remaining traffic go through this bridge.

Let’s take a look at our Azure CNI plugin. Let’s run the brctl show command to list all the interfaces attached to the node’s software bridge. There are none.

Command to show bridge information

Starting with version 1.2.0, Azure CNI sets Transparent mode as the default for single-tenancy Azure CNI Linux deployments; Transparent mode replaces bridge mode.

So all host-side pod veth pair interfaces are directly added to the host network.

Azure CNI: Getting all IP Addresses on the node

So Pod-to-Pod communication is dictated by the node’s IP route table, which Azure CNI maintains. In other words, Pod-to-Pod communication happens over Layer 3 routing rules. When traffic arrives with a destination pod IP address, it is sent directly to that pod’s host-side veth pair interface.

Azure CNI: Transparent Mode: Node Route Table
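
On a node, the difference is easy to spot; a rough sketch (the host-side interface names are whatever the CNI generated on your node):

```bash
# No interfaces hang off a software bridge any more...
brctl show

# ...instead each pod IP has its own host route pointing at the pod's
# host-side veth pair interface, which is what makes this pure L3 routing.
ip route show
```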

This is good news, as Transparent mode removes much of the network complexity you find with a bridge network. It also performs better for intra-node Pod-to-Pod communication in terms of throughput and latency.

I hope you enjoyed this article. In the next article, I will talk about Services and Load Balancers.
