Kubernetes Networking on AWS, Part I

Vilmos Nebehaj · Elotl blog · Jul 2, 2019

One of the more complex parts of Kubernetes, with a somewhat steep learning curve, is networking. There are several ways, and several network plugins, to implement a scalable and performant network in Kubernetes. To make an informed decision about which one(s) to use, it helps to understand, first, the underlying networking architecture of Kubernetes and, second, how the various network plugins fit into it, so that you can choose the right tool for your environment and use case.

Here we will first look at how networking in Kubernetes works, and then at a simple implementation on AWS.

In the next part of this series we will look at another plugin, one better suited for clusters running on AWS. Finally, we will explain how networking for nodeless Kubernetes works with our nodeless runtime, and how interoperability is achieved in a mixed environment where both regular kubelet workers and our nodeless runtime are present in the cluster.

Kubernetes Networking Model

There are a few well-articulated requirements for various networking scenarios in Kubernetes:

  • Container to container in the same pod: this is easy. Containers in the same pod share the same network namespace, and can communicate directly via localhost. The container runtime needs to ensure that containers in the same pod “see” the same network (share their network namespace), and that they have a working loopback interface in their network namespace.
  • Pod to service communication. Kubernetes uses a special IP range for services, and traffic from pods hitting these IPs gets load balanced across the backend pods. The kube-apiserver flag --service-cluster-ip-range configures this service CIDR (the kube-controller-manager accepts the same flag). Kubernetes uses a service proxy (usually kube-proxy) that redirects traffic going to service cluster IPs to backend pods. When you check a service spec in Kubernetes, clusterIP is this virtual IP address of the service.
  • Pod to pod communication. Every pod in Kubernetes has an IP address that is unique in the cluster. Pods can communicate directly with each other and with nodes in the cluster, without the source or destination IP address getting changed. The IP address a pod sees as its own (when querying its IP address from the operating system) is the same address other pods see its traffic coming from. The address space pods get their IP addresses from is called the pod CIDR or the cluster CIDR; you can configure it via the kube-controller-manager flag --cluster-cidr. A short example follows the list.
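
As a quick illustration of the last two scenarios, you can look at the virtual IP of a service and at the cluster-wide IP addresses of pods with kubectl; the commands below are illustrative and only assume a running cluster:

# The clusterIP of a service is a virtual IP allocated from the service CIDR
kubectl get service kubernetes -o jsonpath='{.spec.clusterIP}'

# Pod IPs are allocated from the pod (cluster) CIDR and are routable across nodes
kubectl get pods --all-namespaces -o wide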

The first scenario is straightforward and somewhat out of scope for this post. We will assume that whatever network plugin is used correctly sets up a loopback interface in the network namespace of the pod.

The second one relies on a service proxy, usually kube-proxy, though there are other service proxy implementations too. Kube-router is another implementation, which can also act as a network plugin and a network policy controller.

The third scenario, and the one we are focusing on in this post, is about enabling pod networking, both for pods running on the same worker, and for pods running on different workers.

Pod Network Plugins in Kubernetes

The internal kubelet API for pod networking is a simple interface that the kubelet calls during various pod lifecycle events to configure networking for the pod.

You can select a pod network plugin via the kubelet flag --network-plugin. If no plugin is selected, the “noop” plugin is used, which does not do much: it just ensures the service proxy works when a bridge interface is used for pods, by setting the net.bridge.bridge-nf-call-iptables sysctl to 1 (since the service proxy most likely uses iptables rules for redirecting service cluster IP traffic). This plugin might be a good choice for a setup with only one worker, since the Docker bridge interface is sufficient in that case, and we don’t have to worry about cross-node pod network traffic.
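
If you want to verify this setting on a node, checking the sysctl is a one-liner (shown here purely for illustration):

# A value of 1 means bridged traffic is passed to iptables, so kube-proxy rules apply to it
sysctl net.bridge.bridge-nf-call-iptables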

A basic network plugin with good performance is “kubenet”, which is used when the kubelet is started with the --network-plugin=kubenet command line argument.

The modern and recommended approach for implementing pod networking is via a CNI plugin, when the kubelet is started with --network-plugin=cni. CNI (Container Network Interface) is a specification for configuring network interfaces in Linux containers. Even though we hear about CNI in the context of Kubernetes, it is not Kubernetes-specific. Container runtimes and other container orchestration frameworks also use CNI. CNI also incorporates the IPAM specification, for managing IP addresses for containers.
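
To make this more concrete, here is a minimal, illustrative CNI network configuration using the reference “bridge” plugin with “host-local” IPAM; the file name, network name, bridge name, and subnet below are just example values, not something kubenet or Kubernetes requires:

cat <<EOF > /etc/cni/net.d/10-mynet.conf
{
  "cniVersion": "0.3.1",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "172.20.1.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
EOF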

Let’s look at kubenet first.

How Kubenet Works

The kube-controller-manager has a controller loop that allocates per-node CIDRs from the cluster CIDR for pod networking. The prefix size for these CIDRs can be configured via the --node-cidr-mask-size kube-controller-manager flag. Allocation can be enabled or disabled via the --allocate-node-cidrs flag, and the algorithm for splitting the cluster CIDR up into chunks can be configured via --cidr-allocator-type. As we already mentioned, --cluster-cidr configures the cluster CIDR.

For example, if the cluster CIDR is 172.20.0.0/16, and the node CIDR mask size is 24 (the default value), the controller will allocate 172.20.0.0/24 for the first kubelet node, 172.20.1.0/24 for the second one, 172.20.2.0/24 for the third one, and so on. The field Node.Spec.PodCIDR of each node will be set to its allocated CIDR. Note that these CIDRs are different from the actual physical network (or, in the case of AWS, the VPC network) where the Kubernetes master(s) and worker(s) reside.
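
You can check the CIDR allocated to each node with a jsonpath query (we will use a variant of this later in the post):

# Print each node name together with its allocated pod CIDR
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'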

Pod networking with kubenet

The kubelet will also set up an iptables MASQUERADE rule for pod network traffic leaving the pod network; that is, when a pod sends packets to a destination outside the pod network (172.20.0.0/16 in the example above), the traffic will be NATed. This way, pods can communicate with the internet, or with services that run outside of the cluster. The way to configure this is via the --non-masquerade-cidr <cidr> kubelet flag; however, this flag is now deprecated, and it is recommended to use ip-masq-agent instead (which is more flexible and able to handle multiple CIDRs). Depending on your network, NATing pod traffic leaving the pod network might or might not be necessary; setting --non-masquerade-cidr to “0.0.0.0/0” and not deploying ip-masq-agent disables NAT for this type of traffic.
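
As an illustration, the masquerade rule is roughly equivalent to the following (the exact match conditions and comments vary between versions; treat this as a sketch, not the literal rule the kubelet installs):

# SNAT traffic whose destination is outside the pod network (172.20.0.0/16 here)
iptables -t nat -A POSTROUTING -m addrtype ! --dst-type LOCAL ! -d 172.20.0.0/16 -j MASQUERADE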

Another piece of the puzzle is cross-node routing (when podA wants to communicate with podB, and they run on different worker nodes). The kubelet will act as a router for its node CIDR; however, kubenet itself does not set up routes. This is another task the kube-controller-manager can handle via the AWS cloud provider, if --configure-cloud-routes is enabled (and --allocate-node-cidrs is also true). There are a few limitations, though, when it comes to routes on AWS (see the example after the list for a way to inspect these routes):

  • VPCs have limits on the number of routes per routing table, by default 50. This can be raised, but routing performance might suffer with a large number of route table entries. This effectively limits the number of nodes in a cluster on AWS if kubenet is used.
  • The AWS cloud provider can only handle one routing table. If you need to use multiple routing tables for your cluster, then kubenet is not an option.
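
With --configure-cloud-routes enabled, the controller manager adds one route per node CIDR to the VPC route table, pointing at the corresponding node. You can inspect these routes with the AWS CLI; the route table ID below is a placeholder you would replace with your own:

# List destination CIDRs and target instances in the cluster's route table
aws ec2 describe-route-tables --route-table-ids rtb-0123456789abcdef0 \
  --query 'RouteTables[0].Routes[].[DestinationCidrBlock,InstanceId]' --output table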

As for the actual implementation, kubenet creates a bridge interface (with the name “cbr0”) on the worker node, and a veth interface pair for each pod. The end of the pair that stays outside the pod will be added to the bridge, and the other end will be moved into the pod’s network namespace.

The veth interfaces inside the pods on a particular worker node will get an IP address from the CIDR allocated to that node (Node.Spec.PodCIDR; remember, this CIDR is allocated from the cluster CIDR).

The veth interface pair and the kubenet bridge

Under the hood, the kubenet plugin relies on the Go CNI library for configuring networking. It uses the “bridge” CNI plugin for creating the bridge interface and adding the veth interfaces to it, the “host-local” CNI plugin for managing the local node CIDR, and the “loopback” CNI plugin for providing a loopback interface to the container.
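
On a worker node you can see these interfaces directly; for example (illustrative commands, run on the node itself):

# The kubenet bridge, with an address from the node's pod CIDR
ip addr show cbr0

# The host side of the per-pod veth pairs attached to the bridge
ip -o link show type veth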

Pod network traffic will traverse the bridge, and depending on the destination (see the illustrative routing table after the list):

  • It will be routed via the main network interface of the worker node (for example, eth0) if it is sent to a pod on another worker node or to a service outside the cluster. See the note above on NAT when the destination is not in the pod network.
  • Or, if the destination is another pod that is running on the same worker node, it will be sent directly to the right container veth interface.
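
The node’s routing table reflects this; on a worker with node CIDR 172.20.1.0/24 in our example network, it would look roughly like the following (illustrative output):

ip route
default via 10.0.1.1 dev eth0
10.0.1.0/24 dev eth0 proto kernel scope link src 10.0.1.71
172.20.1.0/24 dev cbr0 proto kernel scope link src 172.20.1.1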

One last note on kubenet: if Kubernetes network policy support is a requirement, then kubenet is not an option, since it does not support this feature. You will have to use a CNI plugin that comes with network policy support, such as kube-router.

Creating a Pod Network with Kubenet on AWS

Let’s take a look at how this works in practice, and create a Kubernetes cluster with two workers.

We will need a few things: a VPC with a subnet, three EC2 instances (one master and two workers), and IAM instance profiles that give the AWS cloud provider the permissions it needs (for example, to manage VPC routes).

As for the network, we will use the following address spaces:

  • 10.0.0.0/16 for the VPC, 10.0.1.0/24 for the subnet.
  • 172.20.0.0/16 for the pod CIDR (pods will get their IP address from this range).
  • 10.96.0.0/12 for the cluster service CIDR (services will get a virtual IP from this range).

It is recommended that you use some kind of configuration management or infrastructure management tool when setting up a cluster. For a working example, see https://github.com/elotl/kubeadm-aws. Below we will go through the steps with kubeadm manually.

Install the necessary packages on all EC2 instances:

curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
cat <<EOF > /etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-xenial main
EOF
apt-get update -qq; apt-get install -qq -y kubelet kubeadm kubectl kubernetes-cni docker.io

Generate a token for kubeadm (on any node):

python -c 'import random; print "%0x.%0x" % (random.SystemRandom().getrandbits(3*8), random.SystemRandom().getrandbits(8*8))'
b95c19.64db1b9210d46ce0

Start the control plane on the master via kubeadm:

export k8stoken="b95c19.64db1b9210d46ce0" # the token generated above
export name="$(hostname -f)"
export pod_cidr="172.20.0.0/16"
export service_cidr="10.96.0.0/12"
cat <<EOF > /tmp/kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: ${k8stoken}
nodeRegistration:
  name: $name
  kubeletExtraArgs:
    cloud-provider: aws
    network-plugin: kubenet
    non-masquerade-cidr: 0.0.0.0/0
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
networking:
  podSubnet: ${pod_cidr}
  serviceSubnet: ${service_cidr}
apiServer:
  extraArgs:
    enable-admission-plugins: DefaultStorageClass,NodeRestriction
    cloud-provider: aws
controllerManager:
  extraArgs:
    cloud-provider: aws
    configure-cloud-routes: "true"
    address: 0.0.0.0
EOF
kubeadm init --config=/tmp/kubeadm-config.yaml
export KUBECONFIG=/etc/kubernetes/admin.conf; echo "export KUBECONFIG=$KUBECONFIG" >> ~/.bashrc

Check your master IP:

ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 06:71:8a:8e:8d:fc
          inet addr:10.0.1.232  Bcast:10.0.1.255  Mask:255.255.255.0
          inet6 addr: fe80::471:8aff:fe8e:8dfc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9001  Metric:1
          RX packets:236782 errors:0 dropped:0 overruns:0 frame:0
          TX packets:42514 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:343471375 (343.4 MB)  TX bytes:3593383 (3.5 MB)

Set up ip-masq-agent on the master:

export pod_cidr="172.20.0.0/16"
export subnet_cidrs="10.0.1.0/24"
mkdir -p /tmp/ip-masq-agent-config; cat <<EOF > /tmp/ip-masq-agent-config/config
nonMasqueradeCIDRs:
  - ${pod_cidr}
$(for subnet in ${subnet_cidrs}; do echo "  - $subnet"; done)
EOF
kubectl create -n kube-system configmap ip-masq-agent --from-file=/tmp/ip-masq-agent-config/config
kubectl apply -f https://raw.githubusercontent.com/kubernetes-incubator/ip-masq-agent/master/ip-masq-agent.yaml
kubectl patch -n kube-system daemonset ip-masq-agent --patch '{"spec":{"template":{"spec":{"tolerations":[{"effect":"NoSchedule","key":"node-role.kubernetes.io/master"}]}}}}'

Have the workers join the cluster. Don’t forget to update the value of apiServerEndpoint and token in the config file:

export name="$(hostname -f)"
cat <<EOF > /tmp/kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta1
kind: JoinConfiguration
discovery:
  bootstrapToken:
    token: b95c19.64db1b9210d46ce0
    unsafeSkipCAVerification: true
    apiServerEndpoint: 10.0.1.232:6443
nodeRegistration:
  name: $name
  kubeletExtraArgs:
    cloud-provider: aws
    network-plugin: kubenet
    non-masquerade-cidr: 0.0.0.0/0
    node-labels: kubernetes.io/role=worker
EOF
kubeadm join --config=/tmp/kubeadm-config.yaml

Docker sets the policy of the FORWARD chain to DROP; change it back to ACCEPT on all nodes:

iptables -P FORWARD ACCEPT

The cluster should be up and running at this point:

kubectl get nodes
NAME                         STATUS   ROLES    AGE   VERSION
ip-10-0-1-232.ec2.internal   Ready    master   94s   v1.15.0
ip-10-0-1-71.ec2.internal    Ready    worker   67s   v1.15.0
ip-10-0-2-167.ec2.internal   Ready    worker   54s   v1.15.0

Each node has been allocated a /24 CIDR from the 172.20.0.0/16 pod CIDR:

kubectl get nodes -ojsonpath='{.items[*].spec.podCIDR}'
172.20.0.0/24 172.20.1.0/24 172.20.2.0/24

To check that pod-to-pod networking works, launch an echo server:

kubectl run echoserver --image=k8s.gcr.io/echoserver:1.4 --replicas=2

Check the pod IPs:

kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
echoserver-7697bd9fb5-q9zmd   1/1     Running   0          106s
echoserver-7697bd9fb5-r7zwg   1/1     Running   0          106s

kubectl get pods -ojsonpath='{.items[*].status.podIP}'
172.20.1.2 172.20.2.2

And test connectivity:

kubectl exec -ti echoserver-7697bd9fb5-q9zmd bash
root@echoserver-7697bd9fb5-q9zmd:/# apt-get update -qq; apt-get install -qq -y net-tools netcat

root@echoserver-7697bd9fb5-q9zmd:/# ifconfig eth0
eth0      Link encap:Ethernet  HWaddr 5e:c0:ad:8c:bd:9e
          inet addr:172.20.1.2  Bcast:0.0.0.0  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:11267 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7753 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:28559934 (28.5 MB)  TX bytes:517349 (517.3 KB)
root@echoserver-7697bd9fb5-q9zmd:/# nc -v 172.20.2.2 8080
ip-172-20-2-2.ec2.internal [172.20.2.2] 8080 (?) open
^C
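
To also exercise the pod-to-service path described earlier, you can put a ClusterIP service in front of the echo server and connect to its virtual IP from one of the pods. This is an optional check; the clusterIP shown below is only an example value from the 10.96.0.0/12 service CIDR, yours will differ:

kubectl expose deployment echoserver --port=8080
kubectl get service echoserver -o jsonpath='{.spec.clusterIP}'
10.101.187.63
kubectl exec -ti echoserver-7697bd9fb5-q9zmd -- nc -v 10.101.187.63 8080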

Conclusion

Kubenet is a simple network plugin that is a great starting point for creating a network in a small to medium sized Kubernetes cluster on AWS, where the control plane also takes care of allocating pod IP address ranges to nodes and setting up cloud routes. In our next post, we’ll show how to implement a more scalable pod network with the Amazon VPC CNI plugin.
