An AKS Performance Journey: Part 2 — Networking It Out

Craig Morten · ASOS Tech Blog · May 14, 2021
A Helm Chart with Pod DNS configuration templates.

Over the course of 2019 and 2020 my team within ASOS Web started on a journey to migrate some of our front-end micro-services from Azure Cloud Services onto Azure Kubernetes Service (AKS) to explore and utilise the potential for improved service density, scalability, performance and control over our applications.

In Part 1 of this series I covered how we resolved SNAT issues, and evaluated Node SKU and Pod sizing to maximise our application’s performance on AKS. However, despite finding our optimal sizing configuration we were still around 100ms slower than our Cloud Services for upper percentiles.

Observing our application logs, one thing that we were seeing was lots of API timeout errors. Were the APIs not scaled enough for our tests? That couldn’t be it because they were performing beautifully for each control test we ran on Cloud Services.

We were certain that something must have been impacting the latency of outbound requests made from our Pods to our upstream services. Then, among our custom timeout errors, we discovered several consistent errors coming from deep within NodeJS itself: Error: getaddrinfo EAI_AGAIN.

This error means the DNS server replied saying that it cannot currently fulfil the request. This DNS error, paired with the fact that we were seeing consistent 5 second timeouts from the impacted application Pods, nudged us into the rabbit hole of Kubernetes DNS and one of the longest GitHub issues I’ve had the pleasure to read.
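For anyone hunting for the same signal in their own logs, here is a minimal sketch of how these DNS failures can be picked out in application code. It assumes Node's built-in https module (rather than our actual request library) and a hypothetical upstream URL:

const https = require("https");

// Minimal sketch: distinguishing temporary DNS failures (EAI_AGAIN) from
// ordinary upstream errors when a request fails. The upstream URL here is
// a hypothetical placeholder.
const req = https.get("https://upstream.example.com/health", (res) => {
  res.resume(); // drain the response, we only care about errors here
});

req.on("error", (err) => {
  if (err.code === "EAI_AGAIN" && err.syscall === "getaddrinfo") {
    console.error(`DNS lookup temporarily failed for ${err.hostname}`, err);
  } else {
    console.error("Upstream request failed", err);
  }
});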

Racing around the conntrack

It turns out that both destination network address translation (DNAT) and source network address translation (SNAT) have historically suffered, and still do suffer, from a number of race conditions which culminate in 5s+ request latency. This explained why our team was seeing consistent 5s latencies, impacting the upper percentiles in our monitoring.

DNAT worries

In Kubernetes, Pods commonly access a DNS server (for example, we use coredns) via Services. When in the default iptables mode for the cluster, kube-proxy creates a few rules in the nat table of the host network namespace to enable address translation.

As a result, each Pod ends up with a nameserver entry populated in its /etc/resolv.conf, which consequently means any DNS lookup from the Pod is sent to the Service ClusterIP of the DNS server. From here the lookup request is load balanced (via a coin flip) across the available DNS servers, and using DNAT the destination IP address of the UDP request packet is updated to the IP address of the DNS server.

The responsibility of DNAT here is to change the destination of outgoing packets (which is used as the source of the reply packets) and ensure that the same modifications are applied to all subsequent packets. The connection tracking mechanism is known as conntrack which is implemented as a kernel module within the Linux netfilter framework.

A request packet is transmitted from the container in the Pod with a src of Pod 1 and dst of coredns svc 1. It exits the Pod via the eth0 interface and travels via the virtual ethernet device to the bridge. The ARP protocol running on the bridge does not know about the Service so it transfers the packet out through the default route — eth0. Before being accepted at eth0 the packet is filtered through iptables which rewrite the destination of the packet from the Service IP to a specific Pod IP.

An issue arises with DNAT and conntrack when two (or more) UDP packets are sent at the same time via the same socket. UDP is a connectionless protocol, so no packet is sent as a result of the connect() system call, meaning no conntrack table entry is created. Instead, the conntrack entry is only created when the request packet is actually sent. This leads to the two packets racing each other through the stages of DNAT to reach the entry confirmation stage, where the ‘winning’ packet’s translation is added to the conntrack hash table and the other is likely dropped.

  1. Create an entry if it doesn’t already exist and add it to the unconfirmed list.
  2. Find a matching DNAT rule.
  3. Update the reply tuple’s src according to the DNAT rule, in a way that is not used by any already confirmed conntrack entry.
  4. Update the packet destination port and address according to the reply tuple.
  5. Confirm the entry: if there is no existing confirmed entry with either (1) the same source or (2) the same reply tuple, then the entry is confirmed. If such an entry already exists, the packet is dropped.

But hey, this only occurs when we have two racing UDP packets sent via the same socket at the same time, right? Well… this is exactly what happens in the case of DNS lookups. Both glibc and musl libc perform A and AAAA DNS lookups in parallel. As a result, you find that one of the UDP packets gets dropped by the kernel, which in turn causes the client to retry after a timeout, which defaults to five seconds.
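As a quick illustration of this dual lookup from a NodeJS process, a single call to dns.lookup (which goes through getaddrinfo) requests both address families. A minimal sketch, using www.asos.com purely as an example hostname:

const dns = require("dns");

// Minimal sketch: a single lookup via getaddrinfo asks for both IPv4 (A)
// and IPv6 (AAAA) records, so two DNS queries race through conntrack.
dns.lookup("www.asos.com", { all: true }, (err, addresses) => {
  if (err) throw err;
  // Expect entries for both family 4 and family 6 addresses
  console.log(addresses);
});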

Seeing lots of five second timeouts? I would recommend you check out your insert_failed counter using conntrack -S to see if it’s going through the roof!

Fortunately, two of the ‘original’ race conditions were mitigated within kernel code back in 2018. But these patches to the netfilter kernel only resolved issues when running a single instance of a DNS server (and simply reduced the impact of the other conditions), so didn’t solve the issue completely.

SNAT issues

Sadly, DNAT isn’t alone in its race issues. In order for Pods to communicate with services external to the cluster the host needs to perform SNAT as the Pod IPs are not routable — the remote service wouldn’t know where to deliver the reply to otherwise! The flow is very similar to DNAT: again the netfilter framework is invoked to perform source tuple replacements and tracking via conntrack.

A packet originates in the Pod’s namespace and travels through the veth pair connected to the root namespace. Once in the root namespace, the packet moves from the bridge to the default device. Before reaching the root namespace’s eth0 device, iptables mangles the packet, replacing the src of the Pod with the src of the VM IP. The packet leaves the VM and reaches the internet gateway. The internet gateway performs another NAT, rewriting the src IP from the VM IP to an external IP.

The default port allocation algorithm simply increments the port number by one from the last port that was allocated until a new port that is free is found. Because there is a delay between the source port allocation stage and the actual insertion of the connection into the confirmed conntrack table, parallel requests can end up with the same port resulting in one of the packets being dropped.

Luckily netfilter supports two other algorithms to find free ports for SNAT:

  1. Using a small degree of randomness to set the port allocation offset with the flag NF_NAT_RANGE_PROTO_RANDOM.
  2. Using full randomness via the flag NF_NAT_RANGE_PROTO_RANDOM_FULLY, which simply randomises the port search offset every time.

Using the latter of these two algorithms greatly reduces the risk of parallel requests being assigned the same port and packets being dropped.

This was implemented in kube-proxy, adding the --random-fully flag to the MASQUERADE rule for iptables, in September 2019, landing in the beta version of Kubernetes 1.16.0. However, the benefits have only been realised as late as mid 2020 for some, when cloud providers upgraded their version of iptables to 1.6.2 or greater.

So how can everyone be a winner in these races?

Luckily the SNAT issues have been solved provided you are using a newish version of Kubernetes and your package versions are up to date, but we’re still left with some DNAT issues to resolve.

One at a time please

If you’re not running Alpine, then you’re in luck! glibc supports two options in the resolv.conf which can solve these issues:

  1. single-request — forces glibc to perform IPv6 and IPv4 requests sequentially.
  2. single-request-reopen — when A and AAAA requests are sent over the same socket and the hardware mistakenly sends back only one reply, makes glibc close the socket and open a new one before resending the second request.

Configuration can be applied to the resolv.conf using the Pod’s dnsConfig options block introduced in Kubernetes 1.9.0.

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: single-request-reopen

Unfortunately, this isn’t a solution for Alpine users as the options aren’t supported by musl, and it is unlikely that they ever will be as ‘sequential lookups are against their architecture’ according to the #musl IRC channel.

Is it still a race if there’s only one in it?

An alternative solution to the parallel request race issue is to remove one of the requests from the equation. Provided you don’t need both IPv4 and IPv6 you can simply drop one. For example, if we drop IPv6 then we won’t make the AAAA lookup requests and the chances of a race condition are greatly reduced.

To force NodeJS to only make IPv4 requests you can set the family option to 4 when calling http.request(url[, options][, callback]), or, if you are using a custom http agent, you can pass the same family property when creating the agent instance with new Agent([options]).
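As a rough sketch (using www.asos.com purely as an example, and Node’s built-in https module), that might look something like the following:

const https = require("https");

// Minimal sketch: restricting lookups to IPv4 so no AAAA query is made.
// The family option is forwarded to the underlying dns.lookup() call.
const ipv4Agent = new https.Agent({ family: 4 });

https.get("https://www.asos.com/", { agent: ipv4Agent, family: 4 }, (res) => {
  console.log(`Status: ${res.statusCode}`);
  res.resume();
});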

Upgrade all the things

Another consideration is removing the issues surfaced with UDP DNS queries by replacing UDP with TCP. Unlike UDP, TCP is not lossy, meaning you shouldn’t suffer from the same dropped packet issues. This does come at the expense of a slight slowdown for any individual successful request, due to TCP simply being a ‘heavier’ protocol. However, it can ultimately reduce the overall tail latency by some margin by removing 15s of request time (3 retries x 5s timeouts) in failure cases.

To force the use of TCP for DNS resolution set the use-vc resolv.conf option in the Pod’s dnsConfig options block.

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: use-vc

Everyone gets a DNS server

Further to the above suggestions, another way to mitigate the race conditions in DNAT, caused by having multiple DNS servers, is to run both coredns and dnsmasq as daemonsets. By doing this, each Node gets its own dedicated DNS server, so two racing packets in DNAT will be sent to the same DNS server rather than potentially being routed to different servers.

Alternatively, you can opt to run a single DNS server in your cluster, but for any large workload this may be ill-advised due to the pressure it would put on the single server which would have to cater for the entire cluster.

Curious default policies which aren’t default

One curious finding we had while investigating further into the Kubernetes DNS setup was that the default Kubernetes dnsPolicy is ClusterFirst and not the Default policy!

The default of ClusterFirst means ‘any DNS query that does not match the configured cluster domain suffix is forwarded to the upstream nameserver inherited from the node’, whereas the Default policy results in Pods inheriting ‘the name resolution configuration from the node that the pods run on’.

apiVersion: v1
kind: Pod
spec:
  dnsPolicy: Default

By setting the dnsPolicy to Default we did observe a slight improvement in performance, and there have been reports of improvements in the community — see here and here. This does come at the cost of Kubernetes no longer injecting the search domains for the cluster like svc.cluster.local, meaning Pods can no longer resolve Services using short names. Instead, you are forced to use the fully qualified domain name to reference Services.

Discovering that Kubernetes has a set of default search domains also led us to another avenue of exploration regarding another default configuration that Kubernetes places in the Pods’ resolv.conf: ndots.

The default for the ndots option in Kubernetes is 5, which means any name containing fewer than five ‘dots’ will first be resolved sequentially against all the local search domains before finally being attempted as an absolute name.

The default list of search domains for Kubernetes consists of four domains:

  • <namespace>.svc.cluster.local, for example kube-system.svc.cluster.local
  • svc.cluster.local
  • cluster.local
  • <cloud_service_provider_specific_domain>, for example example.fx.internal.cloudapp.net

Say we want to make an external request to www.asos.com. Instead of resolving our domain straight away, Kubernetes will instead attempt to resolve the following first:

  1. www.asos.com.<namespace>.svc.cluster.local, for example www.asos.com.kube-system.svc.cluster.local
  2. www.asos.com.svc.cluster.local
  3. www.asos.com.cluster.local
  4. www.asos.com.<cloud_service_provider_specific_domain>, for example www.asos.com.example.fx.internal.cloudapp.net

Only after these four lookups will it then attempt to resolve www.asos.com directly.

Given the default behaviour is also to make requests for both A and AAAA records in parallel, we find that the combination of the default ndots and search options Kubernetes provides results in a total of 10 DNS resolutions. Given what we know about the various conntrack race conditions, this multiplication of requests magnifies the chances of having a request packet dropped and an increased number of timeouts.

We’re qualified to make our own decisions

One way to mitigate these excessive lookups is to make use of fully qualified domain names (FQDN) in your requests. To signify that a domain is fully qualified, simply add a trailing dot . to the domain, for example www.asos.com.. When faced with a FQDN the local search domains are ignored and the provided FQDN is used as-is for DNS resolution.

Fewer dots hit the spot

An alternative to updating all of your application code, environment variables etc. to make use of FQDN everywhere is to modify the ndots option in the dnsConfig for your Pods. For our previous example, setting the ndots option to 2 means that only domains with one or no ‘dots’ will be resolved against the local search domains, so www.asos.com will be immediately used for resolution.

apiVersion: v1
kind: Pod
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"

Applying this configuration to our Pods in our clusters saw a 20% performance improvement across all percentiles.

Journeys into NodeJS

Having settled on some Kubernetes DNS configuration that seemed to be working well for us (it placed us roughly around the same latency profile as we had for Cloud Services), we decided to start looking at whether there was more fine-tuning we could do with the NodeJS applications themselves. We were still seeing a fair number of timeouts with our outbound requests, and our SNAT usage, though under control, felt like it was running quite ‘hot’.

NodeJS is async, except when it isn’t

Despite NodeJS being generally described as having an async single-threaded event loop model architecture, there are a select few OS operations that are fully synchronous which even NodeJS can’t get around — DNS lookup is one of them!

To ‘maintain’ its async, event loop driven architecture, NodeJS handles these synchronous system API calls by offloading them to a pool of libuv threads. There are four threads by default for a NodeJS process. If a NodeJS process saturates its libuv threads, then subsequent system API calls get blocked and are placed in a queue until a thread becomes free.

What we discovered was that our NodeJS micro-frontend application could easily make up to five simultaneous DNS lookups, due to a number of requests made in parallel to external, upstream micro-services (APIs, Redis, Blob Storage etc.). These would saturate the four libuv threads and, under high QPS, could easily cascade into a large, blocked queue of DNS lookups, which would subsequently result in request timeouts from our request library (before the request had even really started!).

Running a few controlled tests we were able to find that simply doubling the number of libuv threads to eight was sufficient to alleviate this strain and saw a nice performance improvement of around 3%. What was also noticeable was just how much smoother our latency graphs looked after the change — likely due to removing the competition for the libuv threads which could result in sudden lock-ups if there was a transient burst in QPS which saturated the threads.

Graph of latency percentiles vs time. The 50th percentile averages 277ms, the 75th percentile averages 342ms and the 95th percentile averages 467ms. Lower percentiles vary by about 10ms over the course of the 4 minutes; the 95th percentile is more volatile, varying by up to 90ms. Percentile lines are not smooth, with lots of small variations which are more apparent in the 95th percentile.
UV_THREADPOOL_SIZE = 4
Graph of latency percentiles vs time. The 50th percentile averages 267ms, the 75th percentile averages 331ms and the 95th percentile averages 442ms. Lower percentiles vary by about 10ms over the course of the 12 minutes; the 95th percentile varies by up to 20ms. In general the percentile profiles are very flat and smooth, other than a gradual peak starting at 8 minutes, peaking at 10 minutes and gradually dropping back down.
UV_THREADPOOL_SIZE = 8

To set the number of libuv threads for your NodeJS application, simply set a UV_THREADPOOL_SIZE environment variable in the context where you will be starting the NodeJS process. For example, by using the ENV directive in your NodeJS Dockerfile.
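If you would rather do it from within the application, the variable can also be assigned at the very top of the entry point before anything touches the thread pool. A minimal sketch under that assumption (server.js is a hypothetical entry module, and the environment variable approach above remains the more robust option):

// Minimal sketch: bumping the libuv thread pool size from inside the app.
// This relies on the pool being initialised lazily, so it must run before
// anything uses it (DNS lookups, fs, crypto). Setting UV_THREADPOOL_SIZE
// via the Dockerfile ENV directive is the safer, recommended approach.
process.env.UV_THREADPOOL_SIZE = process.env.UV_THREADPOOL_SIZE || "8";

// ...the rest of the application bootstraps after this point, for example:
// require("./server"); // hypothetical application entry module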

Let the connections live

Following up on our high SNAT usage, our liaison at Microsoft pointed out that one way to side-step the need to create so many SNAT ports would be to get better re-use out of the existing connections we made. This turned our attention to keep-alives.

For HTTP/1, Connection and Keep-Alive headers can be provided on requests, allowing the sender to hint at how the connection should be used by setting a timeout and a maximum number of requests.

By instructing the upstream services that we wish to keep the connection alive and re-use it, we can simply pass multiple requests through a single connection meaning there is no need to open any further SNAT ports. This reduces the overhead of having to do repeated DNS round-trips, handshakes, and connection setups which can save greatly on time. Furthermore, diminishing the number of new connections decreases the DNAT and SNAT load, consequently reducing the chance of encountering one of the race conditions we’ve discussed previously in this article.

In NodeJS you can configure keep-alives using a custom HTTP Agent via multiple settings, for example the keepAlive option. If you’re just looking for sensible defaults, take a look at the agentkeepalive NPM module which sets you up with reasonable defaults out-of-the-box (which you can then fine tune).
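A minimal sketch using the built-in https.Agent; the values below are illustrative rather than our production settings, and agentkeepalive exposes similar options:

const https = require("https");

// Minimal sketch: a shared keep-alive agent so sockets are re-used across
// requests instead of opening a new connection (and SNAT port) every time.
const keepAliveAgent = new https.Agent({
  keepAlive: true,     // hold sockets open for re-use
  maxSockets: 50,      // illustrative cap on concurrent sockets per host
  maxFreeSockets: 10,  // illustrative cap on idle sockets kept in the pool
  timeout: 60000,      // illustrative socket inactivity timeout in ms
});

https.get("https://www.asos.com/", { agent: keepAliveAgent }, (res) => {
  res.resume();
});

The important part is creating the agent once and sharing it across requests, so the socket pool actually gets re-used rather than being recreated per request.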

Implementing keep-alives in our NodeJS applications saw dramatic improvements in our latency metrics, cutting our latency in half across percentiles for most scenarios. On top of this fantastic result, we also observed that, with the combined changes to our NodeJS applications and Kubernetes configuration, we were now around 30% faster on AKS than on Cloud Services, even with the various NodeJS changes (libuv threads, keep-alives) also applied to Cloud Services. Comparing the AKS setup to our original Cloud Service benchmark (no NodeJS changes) we were over twice as fast!

But wait there’s more…?

Following the dramatic performance improvements, we were able to go live with our application on AKS and have been successfully serving production traffic for the past few months.

One thing with performance is that it is never over — there is always another way to knock off some time and provide an even better experience for customers! Here I want to briefly mention two developments in the AKS landscape which can further improve your cluster networking performance.

Node-Local DNSCache

In May 2019 Pavithra Ramesh & Blake Barnett gave a great talk at the Europe Cloud Native Computing Foundation (CNCF) KubeCon event in which they introduced the Node-Local DNSCache as a solution to the various race conditions we’ve discussed in this article.

The solution is to place a local DNS cache on every Node within the cluster so that Pods on a Node need only interrogate the cache rather than request through to coredns. Because the cache sits on every Node, it also removes the potential for Pod DNS requests to have to reach out to a different Node, improving the round-trip latency. Furthermore, by caching results, DNAT can be avoided once the cache is populated, reducing the potential for conntrack races.

For the cache to be populated, coredns does need to be interrogated at some point, but Node-Local DNSCache upgrades these requests to TCP meaning conntrack entries are removed on connection close (whereas UDP entries have to time out), and the overall tail latency on DNS queries is reduced due to mitigating dropped UDP packets.

The client Pod makes a DNS request over UDP or TCP to the Local DNS Cache at 169.254.20.10. resolv.conf of Node has rule of cluster.local pointing to 10.0.0.10 corresponding to KubeDNS ClusterIP. In scenario 1, this returns with a cache hit. In scenario 2, there is a cache miss. The request is upgraded to TCP and forwarded to KubeDNS at ClusterIP 10.0.0.10. Via IP Tables the request is forwarded to KubeDNS Pods which respond, and the response is passed back via Local DNS Cache to the Client.
Source: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

In a way, this add-on is providing a combination of the previously discussed solutions out-of-the-box — similar to using the use-vc resolv.conf option combined with running coredns as a daemonset. The main difference being that here we are running a bespoke DNS cache as a daemonset, which acts as a middleware for requests to coredns, and not running coredns itself as a daemonset.

To date (February 2021), Azure doesn’t yet provide Node-Local DNSCache as one of its supported AKS add-ons, so we haven’t been able to use this feature in our clusters. It has been stable since Kubernetes 1.18.0, so you might find your cloud service provider offers it, or, if you’re running your own setup, you can install it manually yourself.

Interestingly another feature was released by Microsoft for AKS which may even negate the need for NodeLocal DNSCache!

Azure CNI Transparent Mode

In January 2021 Microsoft released an update to the default settings for Azure CNI which changes the default to use Transparent Mode. This new default is applied to all new clusters, and is automatically applied to any existing clusters when they are upgraded.

Transparent mode topology
Source: https://docs.microsoft.com/en-us/azure/aks/faq

This change places Pod to Pod communication over layer 3 (L3) instead of the old layer 2 (L2) bridge, using IP routes which the CNI adds. The benefits look very promising, the main ones related to our discussion being:

  • It provides mitigation for the conntrack race conditions we’ve discussed without the need to set up NodeLocal DNSCache.
  • It eliminates some additional 5s DNS latency which the old Bridge mode introduced due to a ‘just in time’ setup.

We are excited to see how this might help further improve our AKS application performance here at ASOS. It is also worth noting that this CNI change doesn’t prevent you from also using NodeLocal DNSCache. Using the caching add-on may still provide additional performance benefits through having a locally situated cache rather than requests potentially being made inter-Node to interrogate coredns.

Conclusion

It’s been quite a journey for our team to navigate the complexities of high performance on AKS! This series has been anything but brief, and yet still doesn’t cover everything that the team explored during this process, for example:

  • Learnings with optimal Taurus and JMeter configurations in ACI
  • Nuances with requests versus limits for resources
  • Kubernetes scheduling quirks and implementing Descheduler
  • Nginx ingress controller tuning
  • Flags and config options to enable logging and metrics for monitoring cluster networking

We’ve covered how to best configure both Pod DNS options and NodeJS applications for optimal networking performance in AKS. Finally, we have been pleased to share how AKS has delivered on its promise — it has been a great success from a performance perspective, and the journey isn’t over yet with new exciting features becoming available as recently as January this year!

Hi, my name is Craig Morten. I am a senior web engineer at ASOS. When I’m not hunched over my laptop I can be found drinking excessive amounts of tea or running around in circles at my local athletics track.

ASOS are hiring across a range of roles. If you love Kubernetes and are excited by AKS, we would love to hear from you! NodeJS? Same again! See our open positions here.
