An AKS Performance Journey: Part 1 — Sizing Everything Up

Craig Morten
Apr 13 · 9 min read

Over the course of 2019 and 2020 my team within ASOS Web started on a journey to migrate some of our front-end micro-services from Azure Cloud Services onto Azure Kubernetes Service (AKS) to explore and utilise the potential for improved service density, scalability, performance and control over our applications.

As part of any major architectural change at ASOS, we scrutinise the new setup in a series of gates to ensure that we can deliver the best experience for our customers. In this 2-part series I will be sharing my team’s learnings in signing off our Product Listing Page (PLP) application’s performance on AKS.

An iPhone with the ASOS website open on the Topshop PLP category page.

The PLP is an application that serves all of the category, search, and brands pages on the ASOS website, and is an important part of the customer journey — allowing them to search through our product ranges. It has the second highest load profile of our front-end applications, needing to respond to hundreds of queries per second (QPS) and elastically respond to sharp peaks in visits during sales such as Black Friday.

In the beginning

In the Web Team we historically have a good grasp of our application performance and resource profiles, through extensive and regular testing, and continuous monitoring. Following some calculations and initial low scale performance tests using Taurus and JMeter in non-production environments we were confident we had a good handle on the number of replicas and the CPU and memory request and limits we would need to meet our requirements for performance on AKS.

Unfortunately, it wasn’t that simple! When running applications on new infrastructure, you have to remember that it’s not just the applications that need tuning — your infrastructure does as well.

SNAT that again…

In our non-production performance testing environment the first few ‘peak load’ tests (where we simulate Black Friday traffic levels) failed in a big way. We would see a really encouraging build up as our load agents gradually increased the QPS against our application, and then suddenly disaster would strike, and our latency metrics would jump to our max timeout levels across all percentiles.

From our application performance monitoring it was clear that the issue was arising during the egress of our application’s requests to upstream services. Jumping on the Azure Portal metrics view for the cluster’s load balancer the root cause was clear — we had hit SNAT port exhaustion.

A graph showing total SNAT port usage over time with a threshold line drawn at the max number of SNAT ports available. The SNAT port usage quickly rises from zero to the max SNAT line and stays there for the duration of a test before resetting to zero. SNAT port usage then rises sharply a second time to just below the max SNAT line and stays there for the duration of a test before dropping back to zero and remaining there.

Source network address translation (SNAT) is the mechanism used in Kubernetes to rewrite the private source IP address and port of the back-end application’s outbound traffic to the public IP address and port of your cluster’s public load balancer.

In AKS, when an outbound connection is made from within the cluster the standard load balancer creates a SNAT port — an ephemeral (short-lived) port available for a particular public IP source address. For each public IP address associated with a load balancer, there are 64,000 ports available for SNAT.

SNAT ports are created based on the following five tuple for a connection:

  1. Protocol
  2. Source Address
  3. Source Port
  4. Destination Address
  5. Destination Port

If a second connection is made with a different, unique five tuple (for example, to a different destination address than the first connection), then it can share the SNAT port with the first connection. If a further connection is made to the same destination as one of the previous connections, then it will be allocated a new SNAT port.

This means if connections are highly unique, you will see good port reuse within the load balancer. However, if you are making a lot of outbound connections that aren’t very unique — they are all coming from the same source and targeting the same upstream destination — then you will see a large volume of SNAT port creation.
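To make the port reuse rule concrete, here is a toy Python model of a SNAT allocator. It is only a sketch under a simplifying assumption (a port can be shared by connections only when they go to different destinations), not a description of how Azure actually allocates ports. The class and method names, and the documentation-range IP addresses, are made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Flow(NamedTuple):
    """The five tuple that identifies a connection."""
    protocol: str
    src_addr: str
    src_port: int
    dst_addr: str
    dst_port: int


class ToySnatAllocator:
    """Hypothetical allocator: one flow per destination on each SNAT port."""

    def __init__(self, port_count):
        self.port_count = port_count
        # SNAT port index -> destinations currently using that port
        self.in_use = defaultdict(set)

    def allocate(self, flow):
        destination = (flow.protocol, flow.dst_addr, flow.dst_port)
        for port in range(self.port_count):
            # Assumption: a port is reusable only for a destination it is
            # not already talking to.
            if destination not in self.in_use[port]:
                self.in_use[port].add(destination)
                return port
        raise RuntimeError("SNAT port exhaustion")

    def ports_used(self):
        return sum(1 for destinations in self.in_use.values() if destinations)


if __name__ == "__main__":
    allocator = ToySnatAllocator(port_count=2)
    api_a = ("203.0.113.10", 443)  # made-up upstream addresses
    api_b = ("203.0.113.20", 443)
    print(allocator.allocate(Flow("tcp", "10.0.0.4", 50001, *api_a)))  # new port
    print(allocator.allocate(Flow("tcp", "10.0.0.4", 50002, *api_a)))  # same destination, another new port
    print(allocator.allocate(Flow("tcp", "10.0.0.4", 50003, *api_b)))  # different destination, reuses a port
```

In this model, many connections to one upstream each consume their own port, while a connection to a second upstream can ride on a port that is already in use.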

A Kubernetes cluster with 4 VMs. The first VM has used its port allocation and indicates additional connections fail as the VM has exhausted its ports. The second, third and fourth VMs have used some of their port allocation. The fourth VM’s ports are annotated to highlight two ports that are in use: one has connections to API 2 and API 3, as the port can be reused for unique connections. The other port has a connection to just API 3.

Because SNAT ports are a finite resource, given enough requests you can find yourself running out of ports, which is known as port exhaustion. Ports have a four minute timeout by default until they can be re-used, so once you hit exhaustion it can take up to four minutes to recover, and only if your QPS drops far enough that you’re no longer exhausting the ports!

Our PLP application makes up to five different outbound requests per inbound request, and though a few of these are to different destination IPs, the sheer QPS volume was sufficient for us to quickly hit the SNAT port limits. Once this happened, requests would be queued waiting for a new port to become available. Our application level timeouts being far shorter than the four minute SNAT port timeout meant we saw a very high failure rate.

One thing we had missed was that although 64,000 ports are available for SNAT per public IP on the load balancer, Azure only allocates 1,024 ports per Node by default (for your first 50 Nodes). Naively this means an outbound connection load profile for a Node of 1,024 QPS to a single external IP would be sufficient to exhaust ports in a single second for the default configuration!
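The back-of-the-envelope arithmetic here can be sketched in a few lines of Python. This is a deliberately naive model: it assumes every outbound request opens a fresh connection to the same destination, and that each port stays unavailable for the full four minute timeout. The function names are hypothetical.

```python
DEFAULT_PORTS_PER_NODE = 1_024  # default allocation per Node quoted above
PORT_TIMEOUT_SECONDS = 4 * 60   # four minute default timeout quoted above


def seconds_to_exhaustion(ports_per_node, new_connections_per_second):
    """Time until all ports are taken, assuming none are released meanwhile."""
    return ports_per_node / new_connections_per_second


def sustainable_connection_rate(ports_per_node, timeout_seconds=PORT_TIMEOUT_SECONDS):
    """New connections per second a Node can sustain if every port is held
    for the whole timeout before it can be reused (naive assumption)."""
    return ports_per_node / timeout_seconds


if __name__ == "__main__":
    print(seconds_to_exhaustion(DEFAULT_PORTS_PER_NODE, 1_024))
    print(round(sustainable_connection_rate(DEFAULT_PORTS_PER_NODE), 2))
```

Under these assumptions the default allocation only sustains a little over four new connections per second per Node to a single destination, which is why a peak load test exhausts it so quickly.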

Fortunately this issue is easily resolved: you just need to increase the number of SNAT ports allocated per Node to a level that can cater for your load profile. Check out the Azure Load Balancer documentation for details on setting the allocated ports.
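As a rough guide to picking those numbers, the Python sketch below estimates a per-Node port allocation and the number of public IPs needed to back it. The traffic figures and the headroom factor are hypothetical, not our production values. Rounding up to a multiple of 8 reflects my understanding of the Azure constraint on allocated outbound ports, so treat it as an assumption and check the documentation.

```python
import math

PORTS_PER_PUBLIC_IP = 64_000  # SNAT ports available per public IP, as quoted above


def ports_needed_per_node(peak_connections_per_second, port_hold_seconds, headroom=1.5):
    """Estimate ports per Node: connection rate x time each port is held,
    padded by a headroom factor and rounded up to a multiple of 8
    (assumed allocation granularity)."""
    raw = peak_connections_per_second * port_hold_seconds * headroom
    return math.ceil(raw / 8) * 8


def public_ips_needed(node_count, ports_per_node):
    """Public IPs required so that every Node gets its full allocation."""
    return math.ceil(node_count * ports_per_node / PORTS_PER_PUBLIC_IP)


if __name__ == "__main__":
    # Hypothetical cluster: 20 Nodes, 10 new outbound connections/s per Node,
    # each port held for the four minute timeout.
    ports = ports_needed_per_node(10, 240)
    print(ports, public_ips_needed(20, ports))
```

For this made-up cluster the estimate comes out at 3,600 ports per Node, which needs two public IPs on the load balancer.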

Having provisioned some more public IPs for our load balancer and upped the allocated ports per Node, we were able to run a full peak load test against our application with no sweat. Awesome! That is, until we noticed that our latency percentiles were well off where we wanted them: we were 2.5x slower than on Cloud Services, which was not ideal at all!

Rise of the latency

In order to determine the root cause of the slow latency, our team went back to the drawing board with some small-scale controlled experiments to determine the bottleneck. The questions we asked were:

  1. What was the impact of the Node SKU (stock keeping unit, i.e. VM type) on application performance?
  2. What was the optimal trade-off between Pod CPU and replica counts?

The cores clause

For Cloud Services we were using D4_v2 instances, which support 8 vCPUs, 28 GiB memory, 8 NICs and 6000 Mbps bandwidth. For our first attempts on AKS, however, we were running D4_v3 instances for our Nodes, which only support 4 vCPUs, 2 NICs and 2000 Mbps bandwidth.

What’s more, the v3 series are based on Intel® Hyper-Threading Technology, where the four vCPUs are actually backed by only two cores on the bare metal, whereas the v2 series have a 1:1 vCPU:core ratio. This has large implications for the compute performance of the Node: our choice of D4_v3 limited us to around 33% of the compute power we used to have on Cloud Services.

See below for an abridged extract from Azure’s performance benchmarks:

A comparison table of D4_v3 with 4 vCPUs and an average base rate of 77.8 for the v3 2.40GHz CPU and 82.7 for the v4 2.30GHz CPU, D8_v3 with 8 vCPUs and an average base rate of 146.7 for the v3 2.40GHz CPU and 159.9 for the v4 2.30GHz CPU, D16_v3 with 16 vCPU and an average base rate of 274.1 for the v3 2.40GHz CPU and 300.7 for the v4 2.30GHz CPU, and D4_v2 with 8 vCPUs and an average base rate of 238.7 for the v3 2.40GHz CPU and 248.9 for the v4 2.30GHz CPU.

We can see that in order to guarantee at least the same compute profile on the v3 generation, we needed to use the D16_v3 instances, which support 16 vCPUs and are backed by the same number of ‘real’ cores (eight) as our existing Cloud Service instances.
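The comparison can be checked with a few lines of Python using the average base rates for the v3 2.40GHz CPU from the benchmark extract above. The helper names are hypothetical; only the numbers come from the table.

```python
# Average base rates on the v3 2.40GHz CPU, copied from the benchmark extract.
BASE_RATE = {
    "D4_v3": 77.8,
    "D8_v3": 146.7,
    "D16_v3": 274.1,
    "D4_v2": 238.7,
}


def relative_compute(sku, baseline="D4_v2"):
    """Benchmark score of a SKU as a fraction of the baseline SKU."""
    return BASE_RATE[sku] / BASE_RATE[baseline]


def smallest_sku_matching(baseline="D4_v2"):
    """Lowest-scoring SKU that still meets or beats the baseline."""
    candidates = [
        sku for sku in BASE_RATE
        if sku != baseline and BASE_RATE[sku] >= BASE_RATE[baseline]
    ]
    return min(candidates, key=BASE_RATE.get)


if __name__ == "__main__":
    for sku in ("D4_v3", "D8_v3", "D16_v3"):
        print(sku, f"{relative_compute(sku):.0%}")
    print(smallest_sku_matching())
```

On these figures the D4_v3 scores roughly a third of the D4_v2, the D8_v3 still falls short, and the D16_v3 is the first v3 size to exceed the Cloud Services baseline.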

A line graph of QPS vs 50 percentile latency per SKU. D4_v3 SKU with one core per Pod performs worst with linear growth of latency from around 140 ms at 32 QPS to around 330 ms at 64 QPS. D16_v3 SKU with one core per Pod performs better with linear growth of latency from around 140 ms at 32 QPS to around 230 ms at 64 QPS. The best performing setup is the D16_v3 SKU with two cores per Pod which was only tested at 56 QPS and 64 QPS going from around 130 ms to 140 ms.

In our non-production test-rig we were able to confirm that upgrading to the larger SKU greatly improved the latency results, knocking 100ms off our application latency across all percentiles — approximately a 33% performance improvement. It was curious to see that our NodeJS application performance was almost linear with the underlying CPU performance — perhaps not surprising given the NodeJS event-loop architecture.

Small and many vs few and large?

A simple assumption we had was that more replicas would result in improved latency, and we were able to quickly show that the relationship between replicas and latency is linear (until each replica is serving < 1 QPS).

Line chart of replicas vs latency on D4_v3 with reference data from Cloud Services. Cloud Services: 50 percentile is around 220ms, 75 percentile is around 270ms, and 95 percentile is around 380ms. AKS: 50 percentile drops linearly from around 400ms at 160 replicas to around 310ms at 280 replicas and then flat, 75 percentile drops linearly from around 550ms at 160 replicas to around 400ms at 280 replicas and then flat, 95 percentile drops linearly from around 770ms to around 570ms and then flat.

So, we knew there was always the fallback plan of throwing in more replicas to improve our performance. However, we were confident that we could do better than blindly increasing the number of running Pods.
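To illustrate why more replicas alone would not close the gap, here is a small Python sketch that reads the 50 percentile line from the replica chart as a piecewise-linear function. The endpoints are the approximate values from the chart; treating the curve as exactly linear and then exactly flat is an assumption, and the function name is hypothetical.

```python
CLOUD_SERVICES_P50_MS = 220.0  # approximate reference value from the chart


def aks_p50_latency_ms(replicas):
    """Approximate p50 latency on D4_v3: linear from 160 to 280 replicas,
    flat beyond 280 (a simplified reading of the chart)."""
    low_replicas, low_latency = 160, 400.0
    high_replicas, high_latency = 280, 310.0
    if replicas < low_replicas:
        raise ValueError("below the tested range of replicas")
    clamped = min(replicas, high_replicas)
    slope = (high_latency - low_latency) / (high_replicas - low_replicas)
    return low_latency + slope * (clamped - low_replicas)


if __name__ == "__main__":
    for replicas in (160, 220, 280, 400):
        latency = aks_p50_latency_ms(replicas)
        print(replicas, latency, latency <= CLOUD_SERVICES_P50_MS)
```

In this reading each extra replica buys about 0.75ms until the curve flattens at around 310ms, which is still well above the Cloud Services reference however many replicas are added.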

Our next question was to challenge our assumptions around our original choice of Pod sizing. NodeJS is generally a one-thread-per-process kind of gig (though some findings to the contrary are discussed later on!) so the go-to setup is to run a single NodeJS process per core which provides excellent CPU affinity and scales linearly with core count. Indeed, on our Cloud Service instances we were running eight NodeJS processes, one for each of the instances’ cores.

With this logic in mind, our original Pod sizing was to have one core per Pod. Attempting to decrease this CPU request, for example to half a core per Pod, resulted in a latency degradation of up to, and in some cases over, 50%, even when we compensated by doubling the replica count. It was apparent that we certainly didn’t want fewer than one core per Pod!

We followed up this experiment with one in which we increased the Pod size. For two core Pods we generally observed just under a 10% performance improvement on the 75 and 95 percentiles for latency. This was surprising at first, but when we recall that the v3 architecture’s 16 hyper-threaded vCPUs are actually backed by only eight cores, it is clear that a two core per Pod setup aligns us more closely with our one NodeJS process per core philosophy, because the two vCPUs we’re provisioning for the Pod are actually backed by a single core on the box.

Separate to aligning processes with cores, doubling the CPU request per Pod also has the side-effect of halving the number of Pods per Node on average, which in turn halves the pressure on the NICs, SNAT and other Node fundamentals. As a result we observed improvements in our application’s performance as well as slight improvements on our network requests.
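The Pod sizing arithmetic can be sketched in Python as follows. It is a simplification: it ignores the CPU that Kubernetes reserves on each Node for system components, and it assumes two hyper-threads per physical core. The function names are hypothetical.

```python
def physical_cores(vcpus, threads_per_core=2):
    """Physical cores backing a number of hyper-threaded vCPUs
    (assumes two threads per core, as on the v3 series)."""
    return vcpus / threads_per_core


def pods_per_node(node_vcpus, pod_cpu_request):
    """Upper bound on Pods per Node by CPU request alone,
    ignoring CPU reserved for system components."""
    return int(node_vcpus // pod_cpu_request)


if __name__ == "__main__":
    node_vcpus = 16  # D16_v3
    for cpu_request in (1, 2):
        print(
            cpu_request,
            physical_cores(cpu_request),
            pods_per_node(node_vcpus, cpu_request),
        )
```

With a one vCPU request each Pod is backed by half a physical core and up to 16 Pods share a Node; with a two vCPU request each Pod maps to one physical core and at most eight Pods share the Node’s NICs and SNAT ports.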

Sizing sorted, what’s next?

In this article I’ve covered how we resolved SNAT issues, and evaluated Node SKU and Pod sizing to maximise our NodeJS application’s performance when running on AKS.

Having found our desired Node SKU and Pod sizes, our latency metrics were starting to look quite desirable. However, interrogating our latency profile metrics we found that we were still around 100ms slower than our Cloud Services for upper percentiles.

In Part 2 of this series, I will be covering our team’s ventures into Kubernetes and NodeJS networking, and how we were not only able to close this 100ms gap, but actually improve our application’s speed by around 30% — stay tuned!

Hi, my name is Craig Morten. I am a senior web engineer at ASOS. When I’m not hunched over my laptop I can be found drinking excessive amounts of tea or running around in circles at my local athletics track.

ASOS are hiring across a range of roles. If you love Kubernetes and are excited by AKS, we would love to hear from you! NodeJS? Same again! See our open positions here.

The ASOS Tech Blog

A collective effort from ASOS's Tech Team, driven and directed by our writers. Learn about our engineering, our culture, and anything else that's on our mind.
