Network Timeouts and CPU Limitations in Azure Kubernetes Service

How under-provisioned applications behave, and the kinds of errors that are caused by limited CPU

Danielle Fenske
PayScale Tech
Sep 3, 2019 · 4 min read



Intro

A mysterious issue plagued our production environment for many weeks. Now that we have finally solved it, we want to document the solution for anyone else out there who is struggling with a similar problem.

Application Details

We built a Next.js application running on a custom Express server, in Docker, on Azure Kubernetes Service (AKS). The application relies heavily on data fetched from an API, which runs as an Azure Web App and returns large (~100 KB) JSON blobs. The app sits behind Azure Application Gateway.

How the issue manifested

We were puzzled by the combination of symptoms: at each level of the pipeline we were seeing a different problem.

  • After the app had been spun up, as it got more and more traffic, we started to see lots of network timeouts while fetching data from the API (visible in an Azure Log Analytics query).
  • The network timeouts caused the readiness probes for our pods in AKS to fail constantly, because the probes themselves were timing out. AKS took those pods out of rotation, pushing more traffic onto the remaining pods and overloading them. The failure cascaded across all our pods: at any given time roughly half of them were unhealthy, and they kept cycling between healthy and unhealthy. (A sketch of a readiness probe like ours is shown after this list.)
  • An increase in 5XX errors in Azure Application Gateway, visible as a spike in failed requests.
  • A decline in the total number of requests actually reaching the application (again, seen in an Azure Log Analytics query).
  • No errors and no elevated response times in the API logs.
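For context, the readiness probe is configured per container in the Kubernetes deployment manifest. Here is a minimal sketch of such a probe; the endpoint path, port, and thresholds are illustrative placeholders, not our exact production values:

readinessProbe:
  httpGet:
    path: /healthz          # hypothetical health-check endpoint on the Express server
    port: 3000              # hypothetical container port
  initialDelaySeconds: 10   # give the app time to boot before probing
  periodSeconds: 10         # probe every 10 seconds
  timeoutSeconds: 2         # a probe that takes longer than this counts as a failure
  failureThreshold: 3       # after 3 consecutive failures the pod is marked NotReady and pulled from rotation

In our case the probes were failing simply because the CPU-starved app could not answer even a trivial health check within the timeout.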

What we tried

At first we thought the event loop was being blocked. We realized we were using console.log, which can write synchronously and is not well suited to a production environment, so we replaced console.log with Winston, an asynchronous logging framework. For a few days, we thought this had solved the problem.
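The change amounted to something like the following sketch; the logger configuration (level, format, transport) is illustrative rather than our exact production setup:

const winston = require('winston');

// A Winston logger that writes JSON lines to stdout via the console transport.
const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Before: console.log('fetched data from API');
// After (the metadata fields are made up for illustration):
logger.info('fetched data from API', { status: 200, durationMs: 132 });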

When we did another deployment, everything started to crash again. This time we tried increasing the number of pods — all the way up to 64. This did not help, and instead we now had 32 pods unhealthy at any given time…

We created a load tester that replays CloudFront logs against a specified domain, so we were able to reproduce the issue locally. We tried many different performance optimizations and commented out code bit by bit while running the load tester, but nothing seemed to prevent the network timeouts.

We noticed that CPU usage was high and spiky while load testing, so we tried to increase the CPU requests per pod. There are two CPU values you need to configure in your YAML file when provisioning an application on AKS: requests and limits. The names may sound vague, but requests is the amount of CPU the Kubernetes scheduler reserves for each pod (a pod is only scheduled onto a node that still has that much CPU to give), and limits is the maximum CPU a pod is allowed to use.

So we started with this configuration:

resources:
  requests:
    cpu: 0.5
    memory: 1Gi
  limits:
    cpu: 2
    memory: 2Gi

This means each pod is guaranteed at least half a CPU and may use at most 2 CPUs. When we scaled out from 8 pods to 64, we were therefore requesting 0.5 * 64 = 32 CPUs from the cluster.

Then we wanted to try bumping the CPU request up to 1 CPU. With 64 pods that meant requesting 64 CPUs, and when we tried it we got an "Insufficient CPU" error, meaning we were requesting more CPU than the cluster had available.

This led us to increase the number of nodes in our cluster from 7 to 15. We were then able to deploy our application with a CPU request of 1 and a CPU limit of 2, since the cluster now had enough resources for that configuration. This solved our problem! It even allowed us to reduce the number of pods back to 8, and the app stayed healthy.
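For reference, the resource block that ended up working looks roughly like this (the memory values are simply carried over from our earlier configuration; only the CPU request changed):

resources:
  requests:
    cpu: 1        # reserved for each pod and used by the scheduler to place it
    memory: 1Gi
  limits:
    cpu: 2        # hard per-pod cap
    memory: 2Gi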

Solution (and Explanation)

Solution: Increase the number of nodes in your AKS cluster.
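If you manage the cluster with the Azure CLI, scaling the default node pool looks roughly like this (the resource group and cluster names are placeholders; a cluster with multiple node pools would use az aks nodepool scale instead):

# Scale the cluster's node pool to 15 nodes.
az aks scale --resource-group <myResourceGroup> --name <myAKSCluster> --node-count 15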

Explanation: Scaling our AKS cluster up to 15 nodes allowed each pod to get enough CPU to run smoothly. One question I had was: why did it struggle before, even when we said each pod could take up to 2 CPUs for itself? The answer is that a limit is not a reservation. Before we added nodes, every pod was allowed to burst up to 2 CPUs, but the cluster did not have 2 * (number of pods) CPUs to hand out, and only the 0.5 CPU request was actually guaranteed. Under load the pods had to fight over whatever CPU was left, so no pod ever got close to its upper limit, and the app slowed down to the point of timing out.



Danielle Fenske
PayScale Tech

I am a Software Engineer at PayScale, a crowd-sourced compensation software company. I focus on creating attractive and reliable front-end web apps.