A closer look at Networking with Kubernetes!

Abhishek Mitra
8 min read · Mar 28, 2019


The following article is an attempt to look at what happens under the covers when pods try to talk to each other across nodes. Although this topic has been discussed in various forums, here we try to break down the mechanics of tracing the packet path through the pods, the hosts, and the CNI plugin as well.

I have taken the example of flannel here. I will talk about Calico in my future posts.

Lab Setup for K8 Cluster
Physical Host Connectivity
Virtual Machine Connectivity

Topology Explanation :-

  1. 3 ESX hosts in the same subnet (VLAN 236, 172.26.236.0/24, GW 172.26.236.1)
  2. Each ESX host runs one VM, acting as a node of the Kubernetes cluster
  3. VM01 and VM02 (running on ESX hosts 2 and 3 respectively) are on the same subnet (VLAN 161, 192.161.1.0/24), while VM03 is on a different subnet (VLAN 160, 192.160.0.0/16)
  4. There is a router that enables communication between these nodes (the router on the right in the diagram above)
  5. The second router provides internet connectivity for the physical hosts
  6. There are a total of 3 levels of switching here :-

a. The external physical switch to which the ESX hosts are connected

b. The virtual switch inside each host that transparently switches traffic originating from the VMs (VMware vSwitch)

c. The Linux bridge inside each VM (Kubernetes node) that moves traffic between the pods

K8 Node Distribution

Packet Path between 2 PODS on different Hosts

Before we begin, let us consider how the bridge and pods look inside a single worker node.

Typical POD to bridge networking
Distribution of the pods on the system.

As we see above, each pod is on a different Kubernetes worker node. We will be focusing on pod “nginx-7db9fccd9b-99z4c” on “VM02” and pod “nginx-7db9fccd9b-q4cwz” running on “VM03”.

Note: VM02 and VM03 are on different subnets (192.161.1.0/24, 192.160.0.0/16)
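To reproduce this view on your own cluster, kubectl's wide output lists each pod together with its IP and the node it runs on (output below is trimmed and illustrative; the names and IPs match the pods discussed above):

kubectl get pods -o wide
# NAME                     READY   STATUS    IP           NODE
# nginx-7db9fccd9b-99z4c   1/1     Running   10.244.1.3   vm02
# nginx-7db9fccd9b-q4cwz   1/1     Running   10.244.3.3   vm03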

  1. POD1 (nginx-7db9fccd9b-99z4c) has a MAC of d6:c5:dd:f3:64:36 and an IP of 10.244.1.3. This pod is connected to the bridge (cni0) on VM02. To get more clarity on the packet path, we can find out which interface on the bridge (cni0) the pod's eth0 maps to.

As seen from the POD:

kubectl exec nginx-7db9fccd9b-99z4c -it -- ethtool -S eth0
NIC statistics:
peer_ifindex: 7

As seen on VM02:

ip addr shows the NIC index:

7: veth47db03a0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master cni0 state UP group default
link/ether 56:f3:89:aa:52:8c brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::54f3:89ff:feaa:528c/64 scope link
valid_lft forever preferred_lft forever
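An alternative way to read the same peer index, in case ethtool is not available inside the image, is to read it from /sys (a sketch; it assumes the pod image has a shell and cat):

# inside the pod: ifindex of the host-side veth peer (same value ethtool reported)
kubectl exec nginx-7db9fccd9b-99z4c -- cat /sys/class/net/eth0/iflink
# on VM02: resolve that index to an interface name
ip -o link | grep '^7:'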

This interface is connected to the bridge cni0

Bridge map on VM02

So, in essence, the connectivity of POD1 on VM02 is:

d6:c5:dd:f3:64:36 (eth0) ==> veth47db03a0 (on cni0)

Ignore the numbers starting with 800* (i.e. the bridge ID); they are not needed for this discussion.
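To double-check that this veth is indeed attached to cni0, either of the following can be run on VM02 (assuming iproute2 or bridge-utils is installed on the node):

# list the interfaces enslaved to the cni0 bridge (iproute2)
ip link show master cni0
# or, with bridge-utils:
brctl show cni0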

2. POD1 (nginx-7db9fccd9b-99z4c) wants to ping POD2 (nginx-7db9fccd9b-q4cwz, MAC = 8a:94:fa:4c:3b:f7, IP = 10.244.3.3).
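To generate the traffic traced in the rest of this article, a ping can be started from POD1 (a sketch; it assumes ping is present in the image):

# send ICMP echo requests from POD1 to POD2's IP
kubectl exec nginx-7db9fccd9b-99z4c -- ping -c 3 10.244.3.3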

3. Similar to POD1, POD2 on VM03 is mapped as below:

8a:94:fa:4c:3b:f7 (eth0) ==> vethb1d90a55(on cni0)

4. So, to build the ICMP packet, POD1 has the following info: (a) source MAC, (b) source IP, (c) destination IP. However, it still needs the destination MAC.

5. POD1 looks at its own IP stack and figures out that the destination IP (10.244.3.3) is in a different network. Hence it needs to send the packet to the default gateway, which in this case is cni0 (IP 10.244.1.1). As a result, POD1 sends out an ARP request (a layer-2 broadcast) to get the MAC address of 10.244.1.1.
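This decision is visible in the pod's own routing table (a sketch; it assumes iproute2 is available in the image, like ethtool above, and the exact routes flannel installs may vary slightly):

kubectl exec nginx-7db9fccd9b-99z4c -- ip route
# default via 10.244.1.1 dev eth0                  <- cni0 on VM02 is the gateway
# 10.244.1.0/24 dev eth0 proto kernel scope link src 10.244.1.3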

6. The CNI interface (cni0) responds to the ARP request with its own MAC address, b6:db:26:2d:2f:dc.

7. POD1 now has all the ingredients to create the ICMP packet and send it to cni0. A tcpdump on cni0 confirms the same :-

tcpdump -i cni0 -ne icmp -vv -c 1

11:37:28.564132 d6:c5:dd:f3:64:36 > b6:db:26:2d:2f:dc, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 59874, offset 0, flags [DF], proto ICMP (1), length 84)
10.244.1.3 > 10.244.3.3: ICMP echo request, id 347, seq 34503, length 64

SrcMac :- d6:c5:dd:f3:64:36

DestMac :- b6:db:26:2d:2f:dc (cni0 of the same VM, i.e. VM02)

SrcIP: 10.244.1.3

DstIP : 10.244.3.3
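To confirm that this destination MAC really belongs to cni0, check the bridge interface on VM02 (output illustrative):

# brief view of cni0 on VM02: state and MAC address
ip -br link show cni0
# cni0   UP   b6:db:26:2d:2f:dc <BROADCAST,MULTICAST,UP,LOWER_UP>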

8. Once the packet arrives at the cni0 interface, VM02 looks at its routing table, as shown below:

default via 192.161.1.2 dev ens160 onlink
10.244.0.0/24 via 10.244.0.0 dev flannel.1 onlink
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1
10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink
10.244.3.0/24 via 10.244.3.0 dev flannel.1 onlink ==> Route present
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.161.1.0/24 dev ens160 proto kernel scope link src 192.161.1.5
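Instead of eyeballing the table, we can also ask the kernel directly which route it will pick for POD2's address (output illustrative):

# route lookup for the destination pod IP on VM02
ip route get 10.244.3.3
# 10.244.3.3 via 10.244.3.0 dev flannel.1 src 10.244.1.0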

9. VM02 then routes the packet out of the flannel.1 interface (the VTEP, i.e. the VXLAN tunnel endpoint).

tcpdump -i flannel.1 -ne icmp -vv -c 1

11:39:28.690522 56:65:b2:90:18:f2 > 46:d8:ad:31:bc:47, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 9401, offset 0, flags [DF], proto ICMP (1), length 84)
10.244.1.3 > 10.244.3.3: ICMP echo request, id 347, seq 34623, length 64

SrcMac :- 56:65:b2:90:18:f2 (flannel.1 mac of VM02)

DestMac :- 46:d8:ad:31:bc:47 (flannel.1 mac of VM03)

SrcIP: 10.244.1.3

DstIP : 10.244.3.3

Question: How did 46:d8:ad:31:bc:47 appear here?

Answer: This is VXLAN in motion. The MAC for the remote VTEP IP 10.244.3.0 is learned and recorded as a permanent entry in the ARP table, as seen from the flannel pod running on VM02:

kubectl exec kube-flannel-ds-amd64-dzfvh -n kube-system -it -- arp -a
? (10.244.2.0) at ba:b9:7e:49:ba:b0 [ether] PERM on flannel.1
? (10.244.1.2) at 5e:40:b5:ea:33:af [ether] on cni0
? (10.244.0.0) at 6a:74:99:94:8d:57 [ether] PERM on flannel.1
? (192.161.1.3) at 00:15:5d:ec:32:17 [ether] on ens160
? (10.244.1.3) at d6:c5:dd:f3:64:36 [ether] on cni0
? (192.161.1.2) at 00:15:5d:ec:32:2b [ether] on ens160
? (192.161.1.1) at 00:15:5d:ec:85:0e [ether] on ens160
? (10.244.3.0) at 46:d8:ad:31:bc:47 [ether] PERM on flannel.1
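The ARP entry gives the inner destination MAC; what actually tells the kernel which node to send the VXLAN/UDP packet to is the forwarding database (FDB) of flannel.1, which maps the remote VTEP MAC to the remote node's real IP. Both can be inspected on VM02 (a sketch; outputs are illustrative):

# static FDB entries programmed by flannel: remote VTEP MAC -> remote node IP
bridge fdb show dev flannel.1
# 46:d8:ad:31:bc:47 dst 192.160.0.6 self permanent

# VXLAN details of flannel.1 (flannel's vxlan backend uses VNI 1 and UDP port 8472,
# which matches the 'instance 1' and port 8472 seen in the tcpdump further below)
ip -d link show flannel.1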

10. Now the packet is ready to leave VM02 over ens160. At this stage the packet is encapsulated in a VXLAN packet. A typical VXLAN packet looks like this:

VXLAN Frame header components

To correlate, if we do a tcpdump on the ens160 interface on VM02, we see this:

tcpdump -i ens160 -ne udp -vv -c 1

outer-packet => 11:31:43.208606 00:50:56:90:cc:99 > 00:15:5d:ec:32:2b, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 24057, offset 0, flags [none], proto UDP (17), length 134)
192.161.1.5.55030 > 192.160.0.6.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1
inner packet with vxlan encap => 56:65:b2:90:18:f2 > 46:d8:ad:31:bc:47, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 18473, offset 0, flags [DF], proto ICMP (1), length 84)
10.244.1.3 > 10.244.3.3: ICMP echo request, id 347, seq 34158, length 64

NOTE: We see that the actual POD MAC d6:c5:dd:f3:64:36 is not present in the frame at all!

Breaking down the above packet:

inner packet with vxlan encap => 56:65:b2:90:18:f2 > 46:d8:ad:31:bc:47, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 18473, offset 0, flags [DF], proto ICMP (1), length 84)
10.244.1.3 > 10.244.3.3: ICMP echo request, id 347, seq 34158, length 64

We have already explained how the inner packet was constructed. Let's take a look at the outer packet:

outer-packet => 11:31:43.208606 00:50:56:90:cc:99 > 00:15:5d:ec:32:2b, ethertype IPv4 (0x0800), length 148: (tos 0x0, ttl 64, id 24057, offset 0, flags [none], proto UDP (17), length 134)
192.161.1.5.55030 > 192.160.0.6.8472: [no cksum] OTV, flags [I] (0x08), overlay 0, instance 1

00:50:56:90:cc:99 => SrcMac of ens160/VM02 interface

00:15:5d:ec:32:2b => DstMac used to reach VM03 (more on this below)

192.161.1.5 => SrcIP of VM02

192.160.0.6 => DstIP of VM03

Question :- From where did we get the destination MAC address of VM03?

Answer :- As mentioned in the beginning, there are 3 levels of switching here. When this frame is about to leave VM02 (over ens160/eth0), VM02 has the following data :-

00:50:56:90:cc:99 => SrcMac of ens160/VM02 interface

Unknown => DstMac (the next hop towards VM03)

192.161.1.5 => SrcIP of VM02

192.160.0.6 => DstIP of VM03

Payload => Vxlan encapsulated inner ip packet !

Now VM02's routing table sees that the destination IP (192.160.0.6) is on a different subnet. So it looks up the default gateway/next hop, which is 192.161.1.2 (remember the router on the right side in the diagram). VM02 decides to send the packet to the router, and again it is faced with the same question: where is the MAC? It sends out an ARP request (a broadcast) to determine the MAC of 192.161.1.2. The router responds with its MAC, and it turns out to be ... drumroll ... 00:15:5d:ec:32:2b!

00:50:56:90:cc:99 => SrcMac of ens160/VM02 interface

00:15:5d:ec:32:2b => DstMac of Router

192.161.1.5 => SrcIP of VM02

192.160.0.6 => DstIP of VM03

Payload => Vxlan encapsulated inner ip packet !
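Both pieces of this puzzle can be verified on VM02 with iproute2 (a sketch; outputs are illustrative): the route lookup that picks 192.161.1.2 as the next hop, and the neighbour entry holding the router's MAC learned via ARP.

# which next hop does VM02 use to reach VM03's node IP?
ip route get 192.160.0.6
# 192.160.0.6 via 192.161.1.2 dev ens160 src 192.161.1.5

# MAC resolved for that next hop after the ARP exchange
ip neigh show 192.161.1.2
# 192.161.1.2 dev ens160 lladdr 00:15:5d:ec:32:2b REACHABLE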

11. Finally, the entire VXLAN packet is ready to be sent out of VM02 towards VM03.

12. The packet now leaves VM02 ==> through the ESX vSwitch ==> out of the physical host.

13. Once it leaves the host, the upstream switch transparently switches the packet. (Note: no flooding is required at this stage, since the upstream switch already knows which port leads to the destination MAC, i.e. the router, thanks to the ARP request sent out earlier by VM02 and the router's response.)

14. Once the packet reaches the router, we do a Wireshark capture to validate and get a more definitive picture.

Wireshark capture on router on incoming interface

15. The router now strips the outer Ethernet header while keeping the IP packet (and the VXLAN payload inside it) intact. It then builds a new frame with the source MAC of its outgoing interface and the destination MAC of VM03.

Wireshark capture on router on outgoing interface

Note: Look closely at the MACs of the outer frame at the incoming and outgoing interfaces.

16. Now the packet leaves the router and goes into the physical switch, where it is switched to the port connected to the physical ESX host that is hosting VM03. (Again, this forwarding information was populated thanks to the ARPs sent out earlier by the router.)

17. Once the packet reaches VM03, it traverses the exact reverse of the path on which it was built:

ens160 ==> flannel.1 ==> VM bridge (cni0) ==> veth of POD2 ==> eth0 of POD2.
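The reverse path can be spot-checked on VM03 in the same way (a sketch; outputs are illustrative): the inner destination 10.244.3.3 now falls in a subnet directly attached to cni0, and the decapsulated echo request shows up on POD2's veth.

# on VM03: the destination pod subnet is directly attached to cni0
ip route | grep '10.244.3'
# 10.244.3.0/24 dev cni0 proto kernel scope link src 10.244.3.1

# the decapsulated ICMP echo request arriving on POD2's veth
tcpdump -i vethb1d90a55 -ne icmp -c 1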

18. This completes the packet path for an ICMP echo request between 2 pods on 2 nodes, residing on 2 separate physical hosts and across subnets.

19. Now POD2 will build an ICMP echo reply and respond in the same fashion!

Hope this helps in understanding how frames and packets traverse the Kubernetes networking components, and how Kubernetes solves pod networking (without NAT).
