Kubernetes with Flannel — Understanding the Networking — Part 2

Anil Reddy
6 min read · Apr 11, 2018


The Kubernetes networking model has three basic principles:

  • all containers can communicate with all other containers without NAT
  • all nodes can communicate with all containers (and vice-versa) without NAT
  • the IP that a container sees itself as is the same IP that others see it as

You could achieve these objectives in many different ways. For example, Calico does it by creating a flat L3 network, while Flannel does it by creating an overlay network.

As the heading states, this blog focuses on what I learned from running Flannel as the CNI in my demo. [See part-1]

When the demo is spun up, you’ll see the cluster with a couple of deployments as below.

Pods deployed using github demo.

Once the demo is up, you should be able to ping between the ubuntu and ubuntu2 pods. The ubuntu pod runs on worker01 and the ubuntu2 pod on worker02. As you can see below, they can talk to each other fine.

vagrant@master:~$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
ubuntu-5846f86795-bcbqv 1/1 Running 0 4m 10.244.1.2 worker01
ubuntu2-7dd879dd6b-7gnxb 1/1 Running 0 4m 10.244.2.2 worker02
vagrant@master:~$
vagrant@master:~$ kubectl exec ubuntu-5846f86795-bcbqv -it bash
root@ubuntu-5846f86795-bcbqv:/# ping -c 3 10.244.2.2
PING 10.244.2.2 (10.244.2.2) 56(84) bytes of data.
64 bytes from 10.244.2.2: icmp_seq=1 ttl=62 time=0.825 ms
64 bytes from 10.244.2.2: icmp_seq=2 ttl=62 time=0.488 ms
64 bytes from 10.244.2.2: icmp_seq=3 ttl=62 time=0.478 ms
--- 10.244.2.2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2001ms
rtt min/avg/max/mdev = 0.478/0.597/0.825/0.161 ms
root@ubuntu-5846f86795-bcbqv:/#

Let’s see how the packet traverses from ubuntu -> ubuntu2.

1. On the container, the route for the ping destination 10.244.2.2 is looked up.

root@ubuntu-5846f86795-bcbqv:/# ip route
default via 10.244.1.1 dev eth0
10.244.0.0/16 via 10.244.1.1 dev eth0
10.244.1.0/24 dev eth0 proto kernel scope link src 10.244.1.2
root@ubuntu-5846f86795-bcbqv:/# ip route get 10.244.2.2
10.244.2.2 via 10.244.1.1 dev eth0 src 10.244.1.2
cache
root@ubuntu-5846f86795-bcbqv:/#

The 10.244.0.0/16 route is matched for the query, as it is the longest prefix match (LPM). The next hop is 10.244.1.1 via eth0. When the pod is spun up, a veth pair is created: one end of the pair becomes eth0 of the pod, and the other end [named: vethxxx] is an interface in the root netns. Figuring out the mapping of eth0 ←→ vethxxx was tricky (for me).
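The longest-prefix-match decision can be sketched in a few lines of Python. This is a toy lookup over the pod's routing table shown above, not the kernel's real FIB algorithm:

```python
# Toy longest-prefix-match (LPM): of all routes containing the
# destination, pick the one with the longest prefix.
import ipaddress

# Routes mirror the pod's routing table shown above.
routes = [
    "0.0.0.0/0",       # default via 10.244.1.1
    "10.244.0.0/16",   # cluster CIDR via 10.244.1.1
    "10.244.1.0/24",   # directly connected (scope link)
]

def lookup(dst):
    dst = ipaddress.ip_address(dst)
    best = max(
        (ipaddress.ip_network(net) for net in routes
         if dst in ipaddress.ip_network(net)),
        key=lambda n: n.prefixlen,
    )
    return str(best)

print(lookup("10.244.2.2"))  # 10.244.0.0/16 beats the default route
print(lookup("10.244.1.5"))  # 10.244.1.0/24 is even more specific
```

For 10.244.2.2, both the default route and 10.244.0.0/16 match, but /16 is longer than /0, so the packet heads to 10.244.1.1 rather than straight out the default.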

I logged onto the worker node (worker01, in this case).

Get the pause container's SandboxKey:

root@worker01:~# docker inspect k8s_POD_ubuntu-5846f86795-bcbqv_default_ea44489d-3dd4-11e8-bb37-02ecc586c8d5_0 | grep SandboxKey
"SandboxKey": "/var/run/docker/netns/82ec9e32d486",
root@worker01:~#

Now, using nsenter, you can see the container's interfaces:

root@worker01:~# nsenter --net=/var/run/docker/netns/82ec9e32d486 ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
3: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0a:58:0a:f4:01:02 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.244.1.2/24 scope global eth0
valid_lft forever preferred_lft forever
Identify the peer_ifindex, and finally you can see the veth pair endpoint in root namespace.
root@worker01:~# nsenter --net=/var/run/docker/netns/82ec9e32d486 ethtool -S eth0
NIC statistics:
peer_ifindex: 7
root@worker01:~#
root@worker01:~# ip -d link show | grep '7: veth'
7: veth5e43ca47@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP mode DEFAULT group default
root@worker01:~#

2. The ping packet from the container goes out on the veth pair. This veth interface is part of the cni0 bridge.

root@worker01:~# brctl show
bridge name bridge id STP enabled interfaces
cni0 8000.0a580af40101 no veth5e43ca47
root@worker01:~#
root@worker01:~# ip -d -4 addr show cni0
6: cni0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default qlen 1000
link/ether 0a:58:0a:f4:01:01 brd ff:ff:ff:ff:ff:ff promiscuity 0
bridge forward_delay 1500 hello_time 200 max_age 2000 ageing_time 30000 stp_state 0 priority 32768 vlan_filtering 0 vlan_protocol 802.1Q
inet 10.244.1.1/24 scope global cni0
valid_lft forever preferred_lft forever
root@worker01:~#

Packet put out on the wire on vethxxx:

root@worker01:~# tcpdump -vv -ni veth5e43ca47 icmp
tcpdump: listening on veth5e43ca47, link-type EN10MB (Ethernet), capture size 262144 bytes
23:10:41.314045 IP (tos 0x0, ttl 64, id 53255, offset 0, flags [DF], proto ICMP (1), length 84)
10.244.1.2 > 10.244.2.2: ICMP echo request, id 35, seq 42, length 64

3. The packet destined to 10.244.2.2 is then looked up in the root netns, where it matches the following route.

root@worker01:~# ip route
default via 10.0.2.2 dev enp0s3
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
172.16.1.0/24 dev enp0s8 proto kernel scope link src 172.16.1.101
root@worker01:~#
The route for 10.244.2.2 is an onlink route, so the next hop is resolved in the neighbor table.

root@worker01:~# ip neigh show 10.244.2.0
10.244.2.0 dev flannel.1 lladdr 06:f5:5b:c5:a4:c9 PERMANENT
root@worker01:~# bridge fdb show | grep 06:f5:5b:c5:a4:c9
06:f5:5b:c5:a4:c9 dev flannel.1 dst 172.16.1.102 self permanent
root@worker01:~#
root@worker01:~# ip -d link show flannel.1
5: flannel.1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default
link/ether da:10:15:d7:71:fb brd ff:ff:ff:ff:ff:ff promiscuity 0
vxlan id 1 local 172.16.1.101 dev enp0s8 srcport 0 0 dstport 8472 nolearning ageing 300 addrgenmode eui64
root@worker01:~#
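The three lookups worker01 chains together here (route → neighbor → FDB) can be sketched with plain dictionaries standing in for the kernel tables; the values are copied from the output above:

```python
# Toy model of the route -> neigh -> fdb chain that resolves a pod IP
# to its remote VTEP endpoint. Dictionaries stand in for kernel tables.
import ipaddress

route_table = {"10.244.2.0/24": "10.244.2.0"}      # onlink via flannel.1
neigh_table = {"10.244.2.0": "06:f5:5b:c5:a4:c9"}  # ip neigh (PERMANENT)
fdb_table = {"06:f5:5b:c5:a4:c9": "172.16.1.102"}  # bridge fdb (VTEP IP)

def remote_vtep(dst):
    for net, gateway in route_table.items():
        if ipaddress.ip_address(dst) in ipaddress.ip_network(net):
            mac = neigh_table[gateway]  # onlink: resolve the gateway's MAC
            return fdb_table[mac]       # MAC -> remote VTEP endpoint

print(remote_vtep("10.244.2.2"))  # 172.16.1.102
```

Because the routes are onlink and the neigh/fdb entries are permanent, no ARP or MAC learning ever happens on flannel.1; everything is pre-programmed.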
From the bridge FDB entry, the node can figure out that the remote VTEP for packets destined to 10.244.2.2 is 172.16.1.102. Question: how did this bridge entry ever end up on the worker node? We're not using etcd to store the public-ip/pod-cidr mapping in our flannel manifest file.
The answer lies in the flannel manifest file. The flannel pod has RBAC access to read/patch Node objects, and since kube-flannel is a DaemonSet, it is launched on every node that joins the cluster. When a new node joins, it publishes its own public IP and VTEP MAC as node annotations; every other node reads this information and programs the corresponding FDB entry.
vagrant@master:~$ kubectl describe node worker01 | grep -A3 Annotations
Annotations: flannel.alpha.coreos.com/backend-data={"VtepMAC":"da:10:15:d7:71:fb"}
flannel.alpha.coreos.com/backend-type=vxlan
flannel.alpha.coreos.com/kube-subnet-manager=true
flannel.alpha.coreos.com/public-ip=172.16.1.101
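In effect, when flannel sees another node's annotations, it programs state equivalent to the commands below. This is a hypothetical sketch (flannel actually talks netlink directly rather than shelling out); the values are worker02's from this demo:

```python
# Hypothetical sketch: derive the route/neigh/fdb state a flannel node
# programs when it learns another node's annotations.
node = {
    "pod_cidr": "10.244.2.0/24",          # worker02's pod subnet
    "public_ip": "172.16.1.102",          # flannel.alpha.coreos.com/public-ip
    "vtep_mac": "06:f5:5b:c5:a4:c9",      # VtepMAC from backend-data
}

def flannel_commands(node):
    gw = node["pod_cidr"].replace("/24", "")  # 10.244.2.0, the subnet's VTEP IP
    return [
        f"ip route add {node['pod_cidr']} via {gw} dev flannel.1 onlink",
        f"ip neigh add {gw} lladdr {node['vtep_mac']} dev flannel.1 nud permanent",
        f"bridge fdb add {node['vtep_mac']} dev flannel.1 dst {node['public_ip']}",
    ]

for cmd in flannel_commands(node):
    print(cmd)
```

These three entries are exactly what we saw earlier on worker01 with `ip route`, `ip neigh show`, and `bridge fdb show`.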

4. And now this packet gets encapsulated and sent out to remote VTEP 172.16.1.102.

The packet put out on the wire on enp0s8 is, as you can see, encapsulated. The UDP port 8472 comes from flannel's VXLAN configuration (see ip -d link show flannel.1 above).

root@worker01:~# tcpdump -vv -ni enp0s8 udp
tcpdump: listening on enp0s8, link-type EN10MB (Ethernet), capture size 262144 bytes
23:15:17.390659 IP (tos 0x0, ttl 64, id 27140, offset 0, flags [none], proto UDP (17), length 134)
172.16.1.101.42693 > 172.16.1.102.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
IP (tos 0x0, ttl 63, id 23013, offset 0, flags [DF], proto ICMP (1), length 84)
10.244.1.2 > 10.244.2.2: ICMP echo request, id 35, seq 318, length 64
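The 8-byte VXLAN header itself is simple enough to build by hand. This sketch packs VNI 1 with the "I" flag set, matching tcpdump's `OTV, flags [I] (0x08), ... instance 1` line (port 8472 is the Linux kernel's default VXLAN port, which flannel uses, rather than IANA's 4789):

```python
# Build the 8-byte VXLAN header: 1 byte flags, 3 reserved bytes,
# 3-byte VNI, 1 reserved byte.
import struct

def vxlan_header(vni):
    flags = 0x08  # "I" flag: VNI field is valid (tcpdump's "flags [I]")
    # !BBHI = flags, reserved, reserved, (vni << 8 | reserved)
    return struct.pack("!BBHI", flags, 0, 0, vni << 8)

hdr = vxlan_header(1)
print(hdr.hex())  # 0800000000000100
```

The inner Ethernet frame (with the ICMP echo inside) follows this header in the UDP payload, which is why the capture shows both an outer 172.16.1.x header and an inner 10.244.x.x one.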

5, 6, 7. The packet gets decapsulated, and 10.244.2.2 is looked up in worker02's root netns.

vagrant@worker02:~$ ip route
default via 10.0.2.2 dev enp0s3
10.0.2.0/24 dev enp0s3 proto kernel scope link src 10.0.2.15
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.2.0/24 dev cni0 proto kernel scope link src 10.244.2.1
172.16.1.0/24 dev enp0s8 proto kernel scope link src 172.16.1.102
vagrant@worker02:~$ ip route get 10.244.2.2
10.244.2.2 dev cni0 src 10.244.2.1
cache
vagrant@worker02:~$
vagrant@worker02:~$ brctl show
bridge name bridge id STP enabled interfaces
cni0 8000.0a580af40201 no veth178c3c15
vagrant@worker02:~$
And it reaches the pod via the veth pair.

Flannel IPAM:

Flannel doesn’t allocate or maintain pod IP addresses; the “host-local” CNI IPAM plugin is responsible for that.

If you’re curious, the lease information is stored in: /var/lib/cni/networks/cni0/

root@worker02:/var/lib/cni/networks/cni0# ls
10.244.2.2 last_reserved_ip.0
root@worker02:/var/lib/cni/networks/cni0# cat 10.244.2.2
affa1a0cedc8ba36399fcff60ad509dde3fa3808641e19614d365f48703f5db9
And when you inspect the docker container, you can see it all tied together:

root@worker02:/var/lib/cni/networks/cni0# docker inspect k8s_ubuntu2_ubuntu2-7dd879dd6b-7gnxb_default_ea4ad6f0-3dd4-11e8-bb37-02ecc586c8d5_0 | grep NetworkMode
"NetworkMode": "container:affa1a0cedc8ba36399fcff60ad509dde3fa3808641e19614d365f48703f5db9",
root@worker02:/var/lib/cni/networks/cni0#
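A simplified sketch of how such a lease directory behaves: one file per allocated IP, named after the IP and containing the container ID, so allocations survive restarts. The container IDs below are placeholders, and the real host-local plugin does more (locking, gateway handling, last_reserved_ip):

```python
# Toy host-local-style IPAM: reserve the next free IP in a subnet by
# creating a file named after the IP, containing the container ID.
import ipaddress
import pathlib
import tempfile

def reserve_ip(lease_dir, subnet, container_id):
    lease_dir = pathlib.Path(lease_dir)
    taken = {p.name for p in lease_dir.iterdir()}
    hosts = ipaddress.ip_network(subnet).hosts()
    next(hosts)  # skip .1, which the cni0 bridge itself uses
    for ip in hosts:
        if str(ip) not in taken:
            (lease_dir / str(ip)).write_text(container_id)
            return str(ip)
    raise RuntimeError("subnet exhausted")

d = tempfile.mkdtemp()
print(reserve_ip(d, "10.244.2.0/24", "container-one"))  # 10.244.2.2
print(reserve_ip(d, "10.244.2.0/24", "container-two"))  # 10.244.2.3
```

That's why ubuntu2, the first pod on worker02, got 10.244.2.2: .1 belongs to the cni0 bridge, and .2 was the first free lease.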

Thanks for reading! Please feel free to comment if something’s off, and I’ll learn and correct it. Adios!
