A Strange Case of TCP Packet Loss in Microsoft Azure Kubernetes Pods

yohan welikala
Published in Geek Culture · 4 min read · Aug 2, 2021

If you have read my previous articles, a word of caution: this one gets very technical.

In years past, when there was an issue such as packet loss, the troubleshooting was quite straightforward: you isolated the component causing the loss by observing the packet loss at each node connecting the two endpoints. Although the principle remains the same in today's deployment architectures, there is so much abstraction going on that you can easily be kept away from the real culprit. To confound that, if you do not have complete control of all the infrastructure components, you might end up spending quite a bit of time sending emails back and forth trying to gather all the diagnostic information. For example, if you are on Microsoft Azure Cloud you won't be able to do a good old-fashioned traceroute, because that is disabled by the Azure network components.

Our story starts with the symptoms: we observe that our application, running on a Kubernetes cluster on Azure, experiences sudden network timeouts. We try to see if there is a connectivity issue between the pod running the application and the target remote server running somewhere else on the internet. We quickly eliminate the possibility that it is a general issue with the remote server by testing from a pod on a different network, so it has to be either an issue in the path our packets are taking or at the origin of the packets. We also observe that bypassing the Azure firewall seems to reduce the packet loss. We check whether we are creating so many connections that we are exhausting our SNAT limit on the firewall, but find it is well within the threshold. We also create a Windows VM behind the firewall and, voilà, it does not have the packet loss issue.
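If you want to run the same kind of elimination test yourself, a minimal sketch is below. The pod name and remote endpoint are placeholders, not the systems from this incident; the loop simply hits the remote endpoint from inside the pod and logs timings, so intermittent timeouts show up as slow or failed requests.

```bash
# Placeholders: substitute your own pod name and remote endpoint
POD=my-app-pod-0
REMOTE=https://remote.example.com/health

# Send one request per second from inside the pod; print the HTTP status
# and total time so that intermittent timeouts stand out as outliers
for i in $(seq 1 300); do
  kubectl exec "$POD" -- curl -s -o /dev/null --max-time 5 \
    -w "%{http_code} %{time_total}\n" "$REMOTE" \
    || echo "request $i failed"
  sleep 1
done
```

Running the same loop from a pod on a different network, and from a VM behind the firewall, is what lets you rule out the remote server and narrow the problem down to the path or the origin.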

Many Layers of Network Stacks

This led me to investigate the packet loss a bit deeper on our Linux VMs. This is where it gets tricky, because we don't have a simple Linux box sending packets out to the internet.

We have many OSes running their own network stacks, some of which are hybrids.

We start by looking at what's happening at the pod OS. The best place to see what's happening with your TCP packets intermittently is netstat -st, as it gives you the overall statistics and lets you determine whether the behaviour is within reasonable limits for your network. If you are not someone willing to go through lots of Linux code, I'm afraid there are limited resources explaining all the counters in netstat, so if you can't find it on Google be prepared to delve into the code. Here is a snippet of the output we got from netstat:

netstat -st command output
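If you want to capture these counters from inside a pod yourself, a minimal sketch is below; the pod name is a placeholder, and it assumes netstat is available in the container image. Taking two snapshots a minute apart and diffing them is the easiest way to see which counters are actually moving while the timeouts occur.

```bash
# Placeholder pod name; adjust for your deployment
POD=my-app-pod-0

# Snapshot the TCP statistics, wait, snapshot again, and diff the two
kubectl exec "$POD" -- netstat -st > /tmp/netstat-before.txt
sleep 60
kubectl exec "$POD" -- netstat -st > /tmp/netstat-after.txt
diff /tmp/netstat-before.txt /tmp/netstat-after.txt
```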

Looking at the stats, one can observe that the network seems to be behaving OK, as counters such as the retransmission rate are within reasonable values (<1% retransmissions). What jumps out immediately is the highlighted line, which tells you something is wrong with the source being able to consume the data arriving at it. Since this is a receive-buffer overrun, you would think the solution is a simple increase of the kernel receive buffer. However, this is not the case, and here is a good article on why it's not.
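For reference, these are the kernel parameters the "simple" fix would touch; a minimal sketch, shown only to illustrate what tweaking the receive buffer looks like, since as noted above it was not the real fix here.

```bash
# Inspect the current receive-buffer limits
sysctl net.core.rmem_max   # absolute cap on a socket receive buffer (bytes)
sysctl net.ipv4.tcp_rmem   # min / default / max for TCP receive buffers

# The naive "fix" would be to raise the caps, e.g.:
# sysctl -w net.core.rmem_max=16777216
# sysctl -w net.ipv4.tcp_rmem="4096 262144 16777216"
```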

While we now seemed to have isolated the problem, and knew there was a way to solve it by tweaking the kernel parameters, I was also curious about the difference in behaviour between the Linux and Windows VMs. One of the major differences I noted was the congestion control algorithm in use. For some incomprehensible reason, the CentOS images in Azure are configured to use cubic as the algorithm. While this was great on 100 Mbps networks, it doesn't cope well with today's gigabit speeds, backing off too soon and too fast. On the Kubernetes "node", which was running Ubuntu, the congestion control algorithm was htcp, and we did not experience the packet loss when we ran the diagnostics from the node. Windows uses dctcp. There are some articles claiming greater bandwidth with Google's BBR; for me it really only seems to work on the client side, and it might be good for your PCs and give marginally better performance than dctcp. We switched the congestion control to htcp and were able to show that this solved the packet loss issue.
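For completeness, here is roughly what that switch looks like on a CentOS VM; a sketch assuming you have root on the machine and that the tcp_htcp module ships with your kernel (H-TCP is part of mainline Linux).

```bash
# See which congestion control algorithms are loaded and which one is active
sysctl net.ipv4.tcp_available_congestion_control
sysctl net.ipv4.tcp_congestion_control

# Load the H-TCP module if it is not already listed above
modprobe tcp_htcp

# Switch the running kernel to htcp
sysctl -w net.ipv4.tcp_congestion_control=htcp

# Persist the setting across reboots
echo "net.ipv4.tcp_congestion_control = htcp" > /etc/sysctl.d/90-tcp-congestion.conf
```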

In conclusion, we learned that the multiple layers of abstraction in today's cloud infrastructure make troubleshooting harder. There is a huge need for proper tooling to diagnose these issues without compromising network security. It's high time someone wrote a new spec in place of ICMP for diagnosis.
