Under the Hood: CPU Utilization, IPC, and Load Averages

aarti gupta
software under the hood
4 min read · Feb 8, 2018

What really is CPU Utilization?

The metric we call CPU utilization is really “non-idle time”: the time the CPU was not running the idle thread. Your operating system kernel (whatever it is) usually tracks this during context switches. If a non-idle thread begins running, then stops 100 milliseconds later, the kernel considers that CPU utilized for that entire time.

%CPU can be broken into two components: instruction-retired cycles and stalled cycles, eg, %INS and %STL.
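As a rough sketch of this split, perf exposes generic stalled-cycle counters alongside cycles and instructions (these events are not supported on every CPU or perf build, so treat this as an approximation):

    # Rough split of cycles into retired-instruction work vs stalls,
    # measured system-wide for 10 seconds
    perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -a -- sleep 10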

IOWAIT != Stalling

It is wrong to interpret a high %CPU to mean that the processing unit is the bottleneck.

With hyperthreads, however, those stalled cycles can now be used by another thread, so %CPU may count cycles as utilized that are in fact available.

A CPU showing high utilization is often just waiting on bus traffic (loading caches, loading RAM, loading instructions, decoding instructions); only rarely is the CPU _doing_ useful work.

Whether this counts as useful work or not depends on the context of what you are measuring. The initial access of a buffer almost universally stalls (unless you prefetched it 100+ instructions ago), but starting to stream that data into L1 is useful work.

Aiming for 100%+ IPC is _beyond_ difficult even for simple algorithms and critical hot-path functions. You not only need assembler cooperation (to assure decoder alignment), but you also need to know _what_ processor you are running on to know the constraints of its decoder, uOP cache, and uOP cache alignment.

The idea here is that you appear to be limited by the CPU and want to make things faster.

Instructions per cycle (IPC) (note: the maximum IPC is set by how wide the processor is, as explained below)

Low-level performance can be measured more accurately using IPC, instructions per cycle. IPC shows how many instructions, on average, were completed for each CPU clock cycle. The higher, the better (a simplification). An IPC of 0.78, for example, sounds not bad (78% busy?) until you realize that the processor’s top speed is an IPC of 4.0. This is also known as 4-wide, referring to the instruction fetch/decode path: the CPU can retire (complete) four instructions with every clock cycle. So an IPC of 0.78 on a 4-wide system means the CPU is running at 19.5% of its top speed. Newer Intel processors may move to 5-wide.
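You can measure IPC directly with perf; modern versions print the ratio as "insn per cycle" (output formatting varies by perf version and CPU):

    # Count cycles and instructions system-wide for 10 seconds;
    # perf reports the instructions/cycles ratio as the IPC
    perf stat -a -e cycles,instructions -- sleep 10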

If your IPC is < 1.0, you are likely memory stalled, and software tuning strategies include reducing memory I/O, and improving CPU caching and memory locality, especially on NUMA systems. Hardware tuning includes using processors with larger CPU caches, and faster memory, busses, and interconnects.
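A minimal sketch of confirming and tuning a memory stall, assuming a workload binary ./myapp (a placeholder) and noting that LLC-* event support varies by processor:

    # Check last-level cache miss rate for a workload
    perf stat -e LLC-loads,LLC-load-misses -- ./myapp

    # On a NUMA system, pin CPU and memory to the same node to improve locality
    numactl --cpunodebind=0 --membind=0 ./myapp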

If your IPC is > 1.0, you are likely instruction bound. Look for ways to reduce code execution: eliminate unnecessary work, cache operations, etc. CPU flame graphs are a great tool for this investigation. For hardware tuning, try a faster clock rate, and more cores/hyperthreads.
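One common recipe for CPU flame graphs uses perf together with Brendan Gregg's FlameGraph scripts (https://github.com/brendangregg/FlameGraph); the script paths and output file name below are just examples:

    # Sample on-CPU stacks at 99 Hz, system-wide, for 30 seconds
    perf record -F 99 -a -g -- sleep 30

    # Fold the stacks and render the flame graph SVG with the FlameGraph scripts
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > cpu-flamegraph.svg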

When optimizing code, the question “should I optimize for memory or for computing?” comes up often. Should I cache results? Should I use a more complex data structure in order to save memory or improve locality?

IPC is a good indicator of how you should tackle the problem. High IPC means you may be doing too many calculations, while low IPC means you should look at your memory usage. By the way, most of the time, memory is the problem.

Linux Load averages

Load average is the average system load calculated over three time windows: 1, 5, and 15 minutes.
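You can read the three values with uptime or straight from /proc/loadavg (the numbers below are illustrative; the last two /proc/loadavg fields are running/total scheduling entities and the last PID used):

    $ uptime
     10:14:32 up 12 days,  3:02,  1 user,  load average: 1.73, 1.10, 0.92

    $ cat /proc/loadavg
    1.73 1.10 0.92 2/1045 28734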

You probably have a system with multiple CPUs or a multi-core CPU. The load average numbers work a bit differently on such a system. For example, if you have a load average of 2 on a single-CPU system, your system was overloaded by 100 percent: over the entire period, on average one process was using the CPU while another process was waiting. On a system with two CPUs, this would be complete usage: two different processes were using two different CPUs the entire time. On a system with four CPUs, this would be half usage: two processes were using two CPUs, while two CPUs were sitting idle.

To understand the load average number, you need to know how many CPUs your system has. A load average of 6.03 would indicate a system with a single CPU was massively overloaded, but it would be fine on a computer with 8 CPUs.
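A quick way to put a load average in context is to divide it by the CPU count (the idea that roughly 1.0 per CPU means fully loaded is only a rule of thumb):

    # 1-minute load average divided by the number of CPUs;
    # values well above 1.0 per CPU suggest sustained overload
    awk -v cpus="$(nproc)" '{ printf "load per CPU = %.2f\n", $1 / cpus }' /proc/loadavg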

Deeper metrics

When Linux load averages increase, you know you have higher demand for resources (CPUs, disks, and some locks), but you aren’t sure which. You can use other metrics for clarification. For example, for CPUs:

  • per-CPU utilization: eg, using mpstat -P ALL 1
  • per-process CPU utilization: eg, top, pidstat 1, etc.
  • per-thread run queue (scheduler) latency: eg, in /proc/PID/schedstat, delay accounting, perf sched
  • CPU run queue latency: eg, in /proc/schedstat, perf sched, the runqlat bcc tool.
  • CPU run queue length: eg, using vmstat 1 and the ‘r’ column, or the runqlen bcc tool.

The first two are utilization metrics; the last three are saturation metrics. Utilization metrics are useful for workload characterization, and saturation metrics are useful for identifying a performance problem. The best CPU saturation metrics are measures of run queue (or scheduler) latency: the time a task/thread was in a runnable state but had to wait its turn. These allow you to calculate the magnitude of a performance problem, eg, the percent of time a thread spent waiting on the scheduler. Measuring the run queue length instead can suggest that there is a problem, but it’s more difficult to estimate the magnitude.
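For example, on kernels with scheduler stats enabled, /proc/PID/schedstat reports a task's on-CPU time and run queue wait time in nanoseconds, so you can estimate that percentage directly (12345 is a placeholder PID):

    # /proc/PID/schedstat fields: on-CPU time (ns), run queue wait time (ns), timeslices
    awk '{ printf "on-CPU %.2fs, runqueue wait %.2fs, waiting %.1f%% of runnable time\n", $1/1e9, $2/1e9, 100*$2/($1+$2) }' /proc/12345/schedstat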


References

http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html

http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

https://www.howtogeek.com/194642/understanding-the-load-average-on-linux-and-other-unix-like-systems/


aarti gupta
software under the hood

Distributed computing enthusiast, staff engineer at VMware Inc.