CPU Usage Demystified

Remitly · Jul 20, 2021


Author: Łukasz Marszał, Infrastructure Engineer

We all monitor the CPU usage of our applications. When we see low CPU usage, we are tempted to decrease resources or put more workloads on the machine. But is that the right approach? In this article I’ll attempt to answer this question.

What is CPU usage?

When talking about CPU usage you usually see a nice graph saying “your app is using 50% CPU”. But what does that actually mean?

If you zoom in close enough you’ll see that CPU usage is in fact either 0 or 1. Your app is either scheduled on a CPU (e.g., processing an HTTP request) or idle. So how is that nice graph calculated? Operating systems provide a metric called “cpu usage seconds” (the cumulative number of seconds your process has been executing). You calculate the rate of that metric over time and you get the percentage.
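
To make that concrete, here is a minimal Python sketch of the same calculation a monitoring agent performs: sample the cumulative CPU-seconds counter twice and divide the delta by the wall-clock window. The half-busy, half-idle workload inside is made up purely for demonstration.

```python
import time

def sample_cpu_percent(window_s: float = 0.1) -> float:
    # Cumulative "cpu usage seconds" of this process, plus wall-clock time.
    cpu0, wall0 = time.process_time(), time.monotonic()

    # Made-up workload: busy for ~half the window, idle for the rest.
    busy_until = time.monotonic() + window_s / 2
    while time.monotonic() < busy_until:
        pass                      # scheduled on a CPU: usage is "1"
    time.sleep(window_s / 2)      # idle: usage is "0"

    cpu1, wall1 = time.process_time(), time.monotonic()
    # Rate of cpu-seconds over wall-clock seconds -> the familiar percentage.
    return 100.0 * (cpu1 - cpu0) / (wall1 - wall0)

print(f"{sample_cpu_percent():.0f}% CPU")  # roughly 50%
```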

So does this mean that unless I’m using 100% CPU, my app is performing at the optimal level? No.

Inter-arrival rate

Let’s assume you have an app, say an HTTP server, running on a single-CPU machine, and each request takes 20ms of CPU time to execute.

What happens if your app receives 2 requests (A and B) within a 100ms window? Each of them will take 20ms and your CPU consumption will be 40%. That’s great; that’s optimal performance.

But what if request B arrives only 10ms after request A? Well, your CPU is still busy processing request A, so request B will have to wait an extra 10ms. Your CPU consumption is still 40% (40ms out of the 100ms window), but now your response times are 20ms and 30ms (sic!).

30ms is not so bad. But what if yet another request arrives 10ms later? You theoretically have plenty of headroom, yet it’ll take 40ms to process it. And what if all three arrive at the same time? It’ll take 60ms to process the slowest one.
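
A tiny single-server, first-in-first-out sketch (illustrative only, not from the original article) reproduces these numbers exactly:

```python
def response_times(arrivals_ms, service_ms=20):
    """Single CPU, FIFO: each request starts once it has arrived
    and the server is free."""
    free_at = 0.0
    out = []
    for t in arrivals_ms:
        start = max(t, free_at)
        free_at = start + service_ms
        out.append(free_at - t)   # response time = queueing wait + service
    return out

print(response_times([0, 10]))      # [20.0, 30.0]: B waits an extra 10ms
print(response_times([0, 10, 20]))  # [20.0, 30.0, 40.0]
print(response_times([0, 0, 0]))    # [20.0, 40.0, 60.0]: slowest takes 60ms
```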

Waiting time

This all reminds me of waiting for a bus: you never know if it’ll be late and, if so, by how much. Yet everyone makes it to work on time. People are sometimes 5 minutes late, sometimes 5 minutes early, but rarely is someone an hour late. It’s the same with the inter-arrival rate. Unless you have a very busy server (e.g., during a natural catastrophe or the first snow) it’s quite unlikely that incoming requests will have to wait long.

But “unlikely” and “long” do not sound very scientific.

Assuming your inter-arrival times and processing times are random variables, you should be able to describe their distributions. Let’s assume arrivals form a Poisson process, which is likely for human-facing interfaces, and, as in the previous example, that a request needs on average 20ms of a single CPU to be processed. Now let’s use an M/M/c queueing model to do the math. For those who want to read more, I recommend A.O. Allen’s (IBM) 1980 article “Queueing Models of Computer Systems”.

Simulation of 1 instance, 1 processor:
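
The chart in the original post is an image; as a closed-form sketch of the numbers behind it (assuming the single-server M/M/1 case, where the response time is exponentially distributed with rate μ−λ), the percentiles can be computed directly:

```python
import math

def mm1_percentile(util, p, service_ms=20.0):
    """M/M/1 response (sojourn) time is Exp(mu - lambda),
    so its p-th percentile is -ln(1 - p) / (mu - lambda)."""
    mu = 1.0 / service_ms          # service rate, requests per ms
    lam = util * mu                # arrival rate giving this utilization
    return -math.log(1.0 - p) / (mu - lam)

print("util    P50       P90")
for util in (0.1, 0.3, 0.6, 0.8, 0.9):
    p50 = mm1_percentile(util, 0.5)
    p90 = mm1_percentile(util, 0.9)
    print(f"{util:4.0%} {p50:7.1f}ms {p90:7.1f}ms")
# 10% ->  15.4ms   51.2ms
# 60% ->  34.7ms  115.1ms
# 90% -> 138.6ms  460.5ms
```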

As you can see, even at 60% CPU utilization your median response time (P50) grows significantly above its minimum. P90 grows even faster. And as CPU utilization gets closer to 100%, response times grow without bound.

When you compare this with your service-level objectives (SLOs), you’ll actually see what “not so busy”, “unlikely” and “long” mean.

More resources or more replicas?

Let’s play with M/M/c models a little more and check whether it’s better to add more cores to your machine or to add more replicas of the same size. The assumption here is that your load balancer distributes requests round-robin.

Simulation of 40 RPS, multiple instances, 1 core per instance:

Simulation of 40 RPS, 1 instance, multiple cores:
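
These charts are also images in the original post. As an illustrative sketch of the math behind them (approximating round-robin as a blind random split, so each 1-core replica behaves like an independent M/M/1 queue), mean response times follow from the Erlang C formula:

```python
import math

def erlang_c(c, a):
    """Probability an arrival has to wait in an M/M/c queue,
    where a = lambda/mu is the offered load in erlangs (a < c)."""
    idle = sum(a**k / math.factorial(k) for k in range(c))
    busy = a**c / (math.factorial(c) * (1.0 - a / c))
    return busy / (idle + busy)

def mmc_mean_response_ms(c, lam, mu):
    """Mean sojourn time: service + expected queueing delay."""
    a = lam / mu
    return 1.0 / mu + erlang_c(c, a) / (c * mu - lam)

lam = 40 / 1000.0   # 40 RPS expressed in requests per ms
mu = 1 / 20.0       # 20ms mean service time -> 0.05 requests per ms

# n 1-core replicas, each receiving lam/n (random split ~ independent M/M/1)
for n in (2, 3, 5):
    print(f"{n} x 1-core: {mmc_mean_response_ms(1, lam / n, mu):5.1f}ms mean")
# One bigger instance with c cores sharing a single queue (M/M/c)
for c in (2, 3):
    print(f"1 x {c}-core: {mmc_mean_response_ms(c, lam, mu):5.1f}ms mean")
```

Under these assumptions, one 2-core instance achieves roughly the same mean response time (about 24ms) as five 1-core replicas, while two 1-core replicas sit at about 33ms.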

Assuming your SLO for P90 is 35ms, you’ll need either five 1-CPU instances or one 2-CPU instance. So the bigger instance wins. Also, bigger instances will be better utilized.

To explain this phenomenon, imagine a queue in a store with two checkouts. If there are two separate queues (two instances behind a load balancer) you may choose the queue that moves slower. If there’s only one queue for both checkouts, you’ll be served faster (variance in processing time and arrival time matters less).
More detailed explanations can be found in Allen’s paper or other queueing theory publications.
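
The same effect shows up in a small Monte Carlo sketch (again an approximation of my own: the load balancer is modeled as a blind random split rather than strict round-robin):

```python
import random

def mean_response_ms(pooled, n=200_000, lam=0.04, mu=0.05, seed=1):
    """Exponential inter-arrivals (rate lam) and service times (rate mu).
    pooled=True : one shared FIFO queue feeding two servers (like M/M/2)
    pooled=False: each request is blindly sent to one of two single-server
                  queues (a rough stand-in for two replicas behind a LB)."""
    random.seed(seed)
    free = [0.0, 0.0]          # time at which each server becomes free
    t = total = 0.0
    for _ in range(n):
        t += random.expovariate(lam)
        service = random.expovariate(mu)
        if pooled:
            i = 0 if free[0] <= free[1] else 1   # next server to free up
        else:
            i = random.randrange(2)              # blind split: queue may be busy
        start = max(t, free[i])
        free[i] = start + service
        total += free[i] - t                     # response = wait + service
    return total / n

print(f"one shared queue : {mean_response_ms(True):5.1f}ms")   # ~24ms
print(f"two blind queues : {mean_response_ms(False):5.1f}ms")  # ~33ms
```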

Summary

Unfortunately for us, the model used for the calculations above is too simple for almost all current computer software. There’s a lot more than the CPU that affects processing time, limits your throughput and causes queueing: databases and other dependencies, memory and networking, to name a few. Nevertheless, it’s close enough to reality that the symptoms presented above (though not the exact numbers) actually do occur in production.

The important takeaways are that (1) your app may be underperforming even though its CPU consumption is well within the norms, and (2) a P90 increase may be one of the first symptoms of application performance problems.
