An AKS Performance Journey: Part 2 — Networking It Out

Craig Morten
May 14 · 15 min read
A Helm Chart with Pod DNS configuration templates.

Over the course of 2019 and 2020 my team within ASOS Web started on a journey to migrate some of our front-end micro-services from Azure Cloud Services onto Azure Kubernetes Service (AKS) to explore and utilise the potential for improved service density, scalability, performance and control over our applications.

In Part 1 of this series I covered how we resolved SNAT issues, and evaluated Node SKU and Pod sizing to maximise our application’s performance on AKS. However, despite finding our optimal sizing configuration we were still around 100ms slower than our Cloud Services for upper percentiles.

Observing our application logs, one thing that we were seeing was lots of API timeout errors. Were the APIs not scaled enough for our tests? That couldn’t be it because they were performing beautifully for each control test we ran on Cloud Services.

We were certain that something must have been impacting our outbound request latency made from our Pods to our upstream services — and then we discovered among our custom timeout errors, several consistent errors coming from deep within NodeJS itself: Error: getaddrinfo EAI_AGAIN.

This error means the DNS server replied saying that it cannot currently fulfil the request. This DNS error, paired with the fact that we were seeing consistent 5 second timeouts from the impacted application Pods, nudged us into the rabbit hole of Kubernetes DNS and one of the longest GitHub issues I’ve had the pleasure to read.

Racing around the conntrack

DNAT worries

As a result, each Pod ends up with a nameserver entry populated in its /etc/resolv.conf, which consequently means any DNS lookups from the Pod are sent to the Service ClusterIP of the DNS server. From here the lookup request is load balanced (via a coin flip) across the available DNS servers, and using DNAT the destination IP address of the UDP request packet is updated to the IP address of the DNS server.

The responsibility of DNAT here is to change the destination of outgoing packets (which is used as the source of the reply packets) and ensure that the same modifications are applied to all subsequent packets. The connection tracking mechanism is known as conntrack which is implemented as a kernel module within the Linux netfilter framework.

A request packet is transmitted from the container in the Pod with a src of Pod 1 and dst of coredns svc 1. It exits the Pod via the eth0 interface and travels via the virtual ethernet device to the bridge. The ARP protocol running on the bridge does not know about the Service so it transfers the packet out through the default route, eth0. Before being accepted at eth0 the packet is filtered through iptables, which rewrites the destination of the packet from the Service IP to a specific Pod IP.

An issue arises with DNAT and conntrack when two (or more) UDP packets are sent at the same time via the same socket. UDP is a connectionless protocol, so no packet is sent as a result of the connect() system call, meaning no conntrack table entry is created. Instead, the conntrack entry is only created when the request packet is actually sent. This leads to the two packets racing each other through the stages of DNAT to reach the entry confirmation stage, where the ‘winning’ packet’s translation is added to the conntrack hash table and the other is likely dropped.

  1. Create an entry if it doesn’t already exist and add it to the unconfirmed list.
  2. Find a matching DNAT rule.
  3. Update the reply tuple’s src according to the DNAT rule in a way that is not used by any already confirmed conntrack.
  4. Update the packet destination port and address according to the reply tuple.
  5. Confirm the entry: if there is no existing confirmed entry with either (1) the same source or (2) the same reply tuple then the entry is confirmed. If an entry already exists then the packet is dropped.
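To make the confirmation step more concrete, here is a toy model of the race in plain JavaScript. It is only a sketch and not kernel code: the ToyConntrack class, its fields, and the IP addresses are invented for illustration, and the model only checks for a clash on the original tuple.

```javascript
// Toy model of the conntrack confirmation race. This is NOT kernel code:
// ToyConntrack and its fields are hypothetical names for illustration.
class ToyConntrack {
  constructor(backends) {
    this.backends = backends; // DNS server Pod IPs behind the Service
    this.confirmed = new Map(); // original tuple -> chosen backend
    this.insertFailed = 0; // plays the role of the insert_failed counter
  }

  // Steps 1-4: create an unconfirmed entry and pick a DNAT target.
  prepare(originalTuple, backendIndex) {
    return { originalTuple, backend: this.backends[backendIndex] };
  }

  // Step 5: confirm the entry, or drop if the tuple is already confirmed.
  confirm(entry) {
    if (this.confirmed.has(entry.originalTuple)) {
      this.insertFailed += 1;
      return false; // packet dropped
    }
    this.confirmed.set(entry.originalTuple, entry.backend);
    return true;
  }
}

// Two packets leave the same socket, so they share one original tuple.
const table = new ToyConntrack(["10.244.0.5", "10.244.1.7"]);
const tuple = "udp 10.244.2.9:41000 -> 10.0.0.10:53";

// Both packets finish steps 1-4 before either reaches step 5.
const firstPacket = table.prepare(tuple, 0);
const secondPacket = table.prepare(tuple, 1);

const results = [table.confirm(firstPacket), table.confirm(secondPacket)];
console.log(results, table.insertFailed);
```

In this model the first packet to reach confirm() wins, and the second is dropped and counted as a failed insert.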

But hey, this only occurs when we have two racing UDP packets sent via the same socket at the same time, right? Well… this is exactly what is happening in the case of DNS lookups. Both glibc and musl libc perform A and AAAA DNS lookups in parallel. As a result, you find that one of the UDP packets gets dropped by the kernel, which in turn causes the client to retry after a timeout, which is usually five seconds by default.

Seeing lots of five second timeouts? I would recommend you check out your insert_failed counter using conntrack -S to see if it’s going through the roof!
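If you want to track that counter over time, a small helper can total it across CPUs. The sketch below assumes the usual one-line-per-CPU key=value layout of conntrack -S output. The sample text is made up for illustration, and the exact fields may differ between conntrack versions.

```javascript
// Sum the insert_failed counters from `conntrack -S` output.
// Assumes one line per CPU made up of key=value pairs.
function sumInsertFailed(statsOutput) {
  let total = 0;
  for (const line of statsOutput.split("\n")) {
    const match = line.match(/\binsert_failed=(\d+)/);
    if (match) total += Number(match[1]);
  }
  return total;
}

// Hypothetical sample output, not taken from a real cluster.
const sample = [
  "cpu=0 found=0 invalid=12 insert=0 insert_failed=37 drop=37 early_drop=0",
  "cpu=1 found=0 invalid=9 insert=0 insert_failed=5 drop=5 early_drop=0",
].join("\n");

console.log(sumInsertFailed(sample));
```

On a Node with the conntrack tool installed, the same function could be fed the real command output, for example via child_process.execFileSync, which typically needs root privileges.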

Fortunately, two of the ‘original’ race conditions were mitigated within kernel code back in 2018. But these patches to the netfilter kernel only resolved issues when running a single instance of a DNS server (and simply reduced the impact of the other conditions), so didn’t solve the issue completely.

SNAT issues

A packet originates at the Pod’s namespace and travels through the veth pair connected to the root namespace. Once in the root namespace, the packet moves from the bridge to the default device. Before reaching the root namespace’s eth0 device, iptables mangles the packet, replacing the src of the Pod with a src of the VM IP. The packet leaves the VM and reaches the internet gateway. The internet gateway performs another NAT, rewriting the src IP from the VM IP to an external IP.

The default port allocation algorithm simply increments the port number by one from the last port that was allocated until a new port that is free is found. Because there is a delay between the source port allocation stage and the actual insertion of the connection into the confirmed conntrack table, parallel requests can end up with the same port resulting in one of the packets being dropped.

Luckily netfilter supports two other algorithms to find free ports for SNAT:

  1. Using a small degree of randomness to set the port allocation offset with the flag NF_NAT_RANGE_PROTO_RANDOM.
  2. Using full randomness via the flag NF_NAT_RANGE_PROTO_RANDOM_FULLY, which simply randomises the port search offset every time.

Using the latter of these two algorithms greatly reduces the risk of parallel requests being assigned the same port and packets being dropped.
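The difference between the two behaviours can be sketched with a toy model. This is not netfilter code: the function names, the port range, and the injectable random parameter are assumptions made for illustration, and the loops assume at least one port is free.

```javascript
// Toy model of SNAT source-port selection. Not netfilter code: the
// function names and port range are assumptions for illustration.
const PORT_MIN = 1024;
const PORT_MAX = 65535;
const RANGE = PORT_MAX - PORT_MIN + 1;

// Default behaviour: search upwards from the last allocated port.
function nextSequential(lastPort, inUse) {
  let port = lastPort;
  do {
    port = PORT_MIN + ((port - PORT_MIN + 1) % RANGE);
  } while (inUse.has(port));
  return port;
}

// Fully random behaviour: start the search from a random offset.
function nextRandom(inUse, random = Math.random) {
  let port = PORT_MIN + Math.floor(random() * RANGE);
  while (inUse.has(port)) {
    port = PORT_MIN + ((port - PORT_MIN + 1) % RANGE);
  }
  return port;
}

// Two parallel requests both look at the table before either is confirmed.
const confirmedPorts = new Set([1025, 1026]);
const first = nextSequential(1026, confirmedPorts);
const second = nextSequential(1026, confirmedPorts);
console.log(first, second, first === second);

const randomFirst = nextRandom(confirmedPorts);
const randomSecond = nextRandom(confirmedPorts);
console.log(randomFirst, randomSecond, randomFirst === randomSecond);
```

In the sequential case both requests see the same table state and so arrive at the same port. With a random starting offset, a clash is still possible but far less likely.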

This was implemented in kube-proxy, adding the --random-fully flag to the MASQUERADE rule for iptables, in September 2019, landing in the beta version of Kubernetes 1.16.0. However, the benefits have only been realised as late as mid 2020 for some, when cloud providers upgraded their version of iptables to 1.6.2 or greater.

So how can everyone be a winner in these races?

One at a time please

  1. single-request: forces glibc to perform IPv6 and IPv4 requests sequentially.
  2. single-request-reopen: when the hardware mistakenly sends back only one reply for A and AAAA requests sent over the same socket, makes glibc close the socket in failure scenarios and open a new one before resending the second request.

Configuration can be applied to the resolv.conf using the Pod’s dnsConfig options block introduced in Kubernetes 1.9.0.

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: single-request-reopen

Unfortunately, this isn’t a solution for Alpine users as the options aren’t supported by musl, and it is unlikely that they ever will be as ‘sequential lookups are against their architecture’ according to the #musl IRC channel.

Is it still a race if there’s only one in it?

To force NodeJS to only make IPv4 requests, set the family option to 4 when calling http.request(url [, options][, callback]). If you are using a custom HTTP agent, you can also pass the same family property when creating the agent instance with new Agent([options]).
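As a minimal sketch of both approaches, assuming plain HTTP and a placeholder hostname (api.example.com and the getOverIpv4 helper are not from our codebase):

```javascript
const http = require("http");

// Option 1: pass family on an individual request's options.
// "api.example.com" is a placeholder hostname.
const requestOptions = {
  host: "api.example.com",
  path: "/",
  family: 4, // only resolve and connect over IPv4
};

// Option 2: set family once on a custom agent shared between requests.
const ipv4Agent = new http.Agent({ family: 4 });

// Hypothetical helper that sends a request through the IPv4-only agent.
function getOverIpv4(options, callback) {
  const req = http.request({ ...options, agent: ipv4Agent }, callback);
  req.on("error", (err) => console.error("request failed:", err.message));
  req.end(); // the request is only sent once end() is called
  return req;
}
```

Calling getOverIpv4(requestOptions, (res) => res.resume()) would then perform the lookup for A records only. For HTTPS upstreams the same options apply to the https module and https.Agent.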

Upgrade all the things

To force the use of TCP for DNS resolution set the use-vc resolv.conf option in the Pod’s dnsConfig options block.

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: use-vc

Everyone gets a DNS server

Alternatively, you can opt to run a single DNS server in your cluster, but for any large workload this may be ill-advised due to the pressure it would put on the single server which would have to cater for the entire cluster.

Curious default policies which aren’t default

The default of ClusterFirst means ‘any DNS query that does not match the configured cluster domain suffix is forwarded to the upstream nameserver inherited from the node’ whereas the Default policy results in Pods inheriting ‘the name resolution configuration from the node that the pods run on’.

apiVersion: v1
kind: Pod
spec:
  dnsPolicy: Default

By setting the dnsPolicy to Default we did observe a slight improvement in performance, and there have been reports of improvements in the community — see here and here. This does come at the cost of Kubernetes no longer injecting the search domains for the cluster like svc.cluster.local, meaning Pods can no longer resolve Services using short names. Instead, you are forced to use the fully qualified domain name (FQDN) to reference Services.

Discovering that Kubernetes has a set of default search domains also led us to another avenue of exploration regarding another default configuration that Kubernetes places in the Pods’ resolv.conf: ndots.

The default for the ndots option in Kubernetes is 5 which means any name that contains fewer than five ‘dots’ in it will first be resolved sequentially against all the local search domains before finally attempting to resolve it as an absolute name.

The default list of search domains for Kubernetes consists of four domains:

  • <namespace>.svc.cluster.local, for example kube-system.svc.cluster.local
  • svc.cluster.local
  • cluster.local
  • <cloud_service_provider_specific_domain>, for example example.fx.internal.cloudapp.net

Say we want to make an external request to www.asos.com. Instead of resolving our domain straight away, Kubernetes will instead attempt to resolve the following first:

  1. www.asos.com.<namespace>.svc.cluster.local, for example www.asos.com.kube-system.svc.cluster.local
  2. www.asos.com.svc.cluster.local
  3. www.asos.com.cluster.local
  4. www.asos.com.<cloud_service_provider_specific_domain>, for example www.asos.com.example.fx.internal.cloudapp.net

Only after these four lookups will it then attempt to resolve www.asos.com directly.

Given the default behaviour is also to make requests for both A and AAAA records in parallel, we find that the combination of the default ndots and search options Kubernetes provides results in a total of 10 DNS resolutions. Given what we know about the various conntrack race conditions, this multiplication of requests magnifies the chances of having a request packet dropped and an increased number of timeouts.
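The expansion above can be reproduced with a few lines of JavaScript. This is a simplified sketch of glibc-style ordering, not the resolver’s actual implementation. The candidateNames function is a name made up for this example, and other libc implementations may treat names at or above the ndots threshold differently.

```javascript
// Sketch of how a glibc-style resolver orders candidate names.
function candidateNames(name, ndots, searchDomains) {
  // A trailing dot marks the name as absolute: no search list is applied.
  if (name.endsWith(".")) return [name];

  const dots = name.split(".").length - 1;
  const expanded = searchDomains.map((domain) => `${name}.${domain}`);

  // Fewer dots than ndots: try the search list first, then the name itself.
  // Otherwise the name is tried as-is first, then the search list.
  return dots < ndots ? [...expanded, name] : [name, ...expanded];
}

const search = [
  "kube-system.svc.cluster.local",
  "svc.cluster.local",
  "cluster.local",
  "example.fx.internal.cloudapp.net",
];

const names = candidateNames("www.asos.com", 5, search);
console.log(names);
// Each candidate is queried for both A and AAAA records.
console.log(names.length * 2);
```

With ndots set to 5, www.asos.com comes last in a list of five candidates, which gives the 10 lookups described above. Calling candidateNames("www.asos.com", 2, search) puts the real name first, so a successful answer arrives without touching the search domains.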

We’re qualified to make our own decisions

Fewer dots hit the spot

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"

Applying this configuration to our Pods in our clusters saw a 20% performance improvement across all percentiles.

Journeys into NodeJS

NodeJS is async, except when it isn’t

To ‘maintain’ its async, event loop driven architecture, NodeJS handles these synchronous system API calls by offloading them to a pool of libuv threads. There are four threads by default for a NodeJS process. If a NodeJS process saturates its libuv threads, then subsequent system API calls get blocked and are placed in a queue until a thread becomes free.

What we discovered was our NodeJS micro-frontend application could easily make up to five simultaneous DNS lookups due to a number of requests made in parallel to external, upstream micro-services (APIs, Redis, Blob Storage etc.) which would saturate the four libuv threads, and under high QPS could easily cascade into a large, blocked queue of DNS lookups which would subsequently result in request timeouts from our request library (and the request hadn’t even started really!).
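A back-of-the-envelope model shows why a fifth lookup hurts so much. The sketch below assumes every lookup blocks a thread for the same duration, which is a simplification, and worstCaseWaitMs is a name invented for this example.

```javascript
// Rough model of lookups queueing on the libuv thread pool.
// Assumes every lookup blocks a thread for the same duration.
function worstCaseWaitMs(parallelLookups, poolSize, lookupMs) {
  // Lookups run in "waves" of at most poolSize at a time.
  const waves = Math.ceil(parallelLookups / poolSize);
  return waves * lookupMs;
}

// Five parallel lookups, each stalled for a five second DNS timeout.
console.log(worstCaseWaitMs(5, 4, 5000)); // four threads: two waves
console.log(worstCaseWaitMs(5, 8, 5000)); // eight threads: one wave
```

With four threads, the fifth lookup cannot even start until one of the first four has finished, so in this model a single slow batch doubles the wait for the unlucky request.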

Running a few controlled tests we were able to find that simply doubling the number of libuv threads to eight was sufficient to alleviate this strain and saw a nice performance improvement of around 3%. What was also noticeable was just how much smoother our latency graphs looked after the change — likely due to removing the competition for the libuv threads which could result in sudden lock-ups if there was a transient burst in QPS which saturated the threads.

Graph of latency percentiles vs time. 50 percentile averages 277ms, 75 percentile averages 342ms and 95 percentile averages 467ms. Lower percentiles vary by about 10ms over the course of the 4 minutes, the 95 percentile is more volatile, varying by up to 90ms. Percentile lines are not smooth with lots of small variations which are more apparent in the 95 percentile.
UV_THREADPOOL_SIZE = 4
Graph of latency percentiles vs time. 50 percentile averages 267ms, 75 percentile averages 331ms and 95 percentile averages 442ms. Lower percentiles vary by about 10ms over the course of the 12 minutes, the 95 percentile varies by up to 20ms. In general the percentile profiles are very flat and smooth other than a gradual peak starting at 8 minutes, peaking at 10 minutes and gradually dropping back down.
UV_THREADPOOL_SIZE = 8

To set the number of libuv threads for your NodeJS application, simply set a UV_THREADPOOL_SIZE environment variable in the context where you will be starting the NodeJS process. For example, by using the ENV directive in your NodeJS Dockerfile.
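If you launch NodeJS from another NodeJS process, such as a small wrapper script, the variable can be passed through the child’s environment. The sketch below is one way to do that, with runWithThreadPool being a hypothetical helper name. Setting the variable before the process starts is the dependable option.

```javascript
const { spawnSync } = require("child_process");

// Start a child NodeJS process with a larger libuv thread pool by
// placing UV_THREADPOOL_SIZE in its environment before it starts.
function runWithThreadPool(size, args) {
  return spawnSync(process.execPath, args, {
    env: { ...process.env, UV_THREADPOOL_SIZE: String(size) },
    encoding: "utf8",
  });
}

// Inline script used here only to show that the child sees the value.
const child = runWithThreadPool(8, [
  "-e",
  "console.log(process.env.UV_THREADPOOL_SIZE)",
]);
console.log(child.stdout.trim());
```

In a real deployment the args would point at your application’s entry file rather than an inline script.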

Let the connections live

For HTTP/1, Connection and Keep-Alive headers can be provided on requests, allowing the sender to hint at how the connection may be used, for example by setting a timeout and a maximum number of requests.

By instructing the upstream services that we wish to keep the connection alive and re-use it, we can simply pass multiple requests through a single connection meaning there is no need to open any further SNAT ports. This reduces the overhead of having to do repeated DNS round-trips, handshakes, and connection setups which can save greatly on time. Furthermore, diminishing the number of new connections decreases the DNAT and SNAT load, consequently reducing the chance of encountering one of the race conditions we’ve discussed previously in this article.

In NodeJS you can configure keep-alives using a custom HTTP Agent via multiple settings, for example the keepAlive option. If you’re just looking for sensible defaults, take a look at the agentkeepalive NPM module which sets you up with reasonable defaults out-of-the-box (which you can then fine tune).
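As a starting point, a custom agent with keep-alives enabled might look like the sketch below. The numeric values are illustrative placeholders rather than the settings we run in production, and get is a helper name made up for the example.

```javascript
const http = require("http");

// A custom agent that keeps sockets open for reuse between requests.
// The numeric values are illustrative, not tuned recommendations.
const keepAliveAgent = new http.Agent({
  keepAlive: true, // reuse sockets instead of closing after each response
  keepAliveMsecs: 1000, // initial delay for TCP keep-alive probes
  maxSockets: 50, // cap on concurrent sockets per host
  maxFreeSockets: 10, // idle sockets kept open per host
});

// Pass the agent on each request so connections are pooled per host.
function get(url, callback) {
  return http.get(url, { agent: keepAliveAgent }, callback);
}
```

For HTTPS upstreams, https.Agent accepts the same options. Sharing one agent across the application matters because each agent maintains its own pool of sockets.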

Implementing keep-alives in our NodeJS applications saw dramatic improvements in our latency metrics, cutting our latency in half across percentiles for most scenarios. On top of this fantastic result, we also observed that with the combined changes to our NodeJS applications and Kubernetes configuration we were now around 30% faster on AKS than on Cloud Services, and this was including the various NodeJS changes (libuv threads, keep-alives) on Cloud Services. Comparing the AKS setup to our original Cloud Service benchmark (no NodeJS changes) we were over twice as fast!

But wait there’s more…?

One thing with performance is that it is never over — there is always another way to knock off some time and provide an even better experience for customers! Here I want to briefly mention two developments in the AKS landscape which can further improve your cluster networking performance.

Node-Local DNSCache

The solution is to place a local DNS cache on every Node within the cluster so that Pods on a Node need only interrogate the cache rather than request through to coredns. Because the cache sits on every Node, it also removes the potential for Pod DNS requests to have to reach out to a different Node, improving the round-trip latency. Furthermore, by caching results DNAT can be avoided once the cache is populated reducing the potential for conntrack races.

For the cache to be populated, coredns does need to be interrogated at some point, but Node-Local DNSCache upgrades these requests to TCP meaning conntrack entries are removed on connection close (whereas UDP entries have to time out), and the overall tail latency on DNS queries is reduced due to mitigating dropped UDP packets.

The client Pod makes a DNS request over UDP or TCP to the Local DNS Cache at 169.254.20.10. resolv.conf of Node has rule of cluster.local pointing to 10.0.0.10 corresponding to KubeDNS ClusterIP. In scenario 1, this returns with a cache hit. In scenario 2, there is a cache miss. The request is upgraded to TCP and forwarded to KubeDNS at ClusterIP 10.0.0.10. Via IP Tables the request is forwarded to KubeDNS Pods which respond, and the response is passed back via Local DNS Cache to the Client.
Source: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

In a way, this add-on is providing a combination of the previously discussed solutions out-of-the-box — similar to using the use-vc resolv.conf option combined with running coredns as a daemonset. The main difference being that here we are running a bespoke DNS cache as a daemonset, which acts as a middleware for requests to coredns, and not running coredns itself as a daemonset.

To date (February 2021), Azure doesn’t yet provide Node-Local DNSCache as one of its supported AKS add-ons so we haven’t been able to use this feature in our clusters. It has been stable since Kubernetes 1.18.0, so you might find your cloud service provider offers it or, if you’re running your own setup, that you can install it manually yourself.

Interestingly another feature was released by Microsoft for AKS which may even negate the need for NodeLocal DNSCache!

Azure CNI Transparent Mode

Transparent mode topology
Source: https://docs.microsoft.com/en-us/azure/aks/faq

This change places Pod to Pod communication over layer 3 (L3) instead of the old layer 2 (L2) bridge, using IP routes which the CNI adds. The benefits look very promising, the main ones related to our discussion being:

  • It provides mitigation for the conntrack race conditions we’ve discussed without the need to set up NodeLocal DNSCache.
  • It eliminates some additional 5s DNS latency which the old Bridge mode introduced due to a ‘just in time’ setup.

We are excited to see how this might help further improve our AKS application performance here at ASOS. It is also worth noting that this CNI change doesn’t prevent you from also using NodeLocal DNSCache. Using the caching add-on may still provide additional performance benefits through having a locally situated cache rather than requests potentially being made inter-Node to interrogate coredns.

Conclusion

  • Learnings with optimal Taurus and JMeter configurations in ACI
  • Nuances with requests versus limits for resources
  • Kubernetes scheduling quirks and implementing Descheduler
  • Nginx ingress controller tuning
  • Flags and config options to enable logging and metrics for monitoring cluster networking

We’ve covered how to best configure both Pod DNS options and NodeJS applications for optimal networking performance in AKS. Finally, we have been pleased to share how AKS has delivered on its promise — it has been a great success from a performance perspective, and the journey isn’t over yet with new exciting features becoming available as recently as January this year!


Hi, my name is Craig Morten. I am a senior web engineer at ASOS. When I’m not hunched over my laptop I can be found drinking excessive amounts of tea or running around in circles at my local athletics track.

