Linux troubleshooting: CPU analysis

Published in

Saltside Engineering

5 min read · May 3, 2022

In this article we’ll walk through a few things to do when debugging CPU issues on a Linux server, using standard tools available on (almost) all Linux machines that can help you figure out the root cause. Note that mpstat and pidstat come from the sysstat package, which may need to be installed first.

To start with, if you don’t already know how many CPUs the server has, you can find out with lscpu.

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          8
...
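If you only need the CPU count, for example to compare it with other numbers in a script, a minimal sketch could look like the following. The function name cpu_count is made up for this example; nproc (coreutils) and getconf are the real commands behind it.

```shell
# cpu_count: print the number of CPUs available to the current process.
# nproc is part of coreutils; getconf _NPROCESSORS_ONLN is used as a
# fallback on systems where nproc is missing.
cpu_count() {
    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN
}

echo "CPUs available: $(cpu_count)"
```

Keep in mind that nproc reports the CPUs the current process may use, which can be lower than the CPU(s) value from lscpu if CPU affinity has been restricted.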

Load average over time

A good first thing to check when you log on to the server is how the load has developed over time, using the uptime command.

$ uptime
 11:19:45 up  1:43, 37 users,  load average: 2.44, 2.35, 2.40

The last three numbers are the load averages for the last 1, 5, and 15 minutes.

What to look for?

Numbers increasing from left to right (the 1 minute value is the lowest, the 15 minute value the highest) mean the load was higher before than it is now. You may be too late to see the issue in real time.
Numbers decreasing from left to right (the 1 minute value is the highest) mean the load is higher now than earlier, which may indicate an issue with CPU load or high I/O wait times.
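The comparison above can be sketched as a small shell helper. The function names load_trend and load_per_cpu, and the 10% band used to call the load "steady", are assumptions made for this example and not something uptime itself provides.

```shell
# load_trend ONE FIVE FIFTEEN
# Compares the 1 minute and 15 minute load averages and prints
# "rising", "falling" or "steady" (within 10% of each other; the 10%
# band is an arbitrary choice for this sketch).
load_trend() {
    awk -v one="$1" -v fifteen="$3" 'BEGIN {
        if (one > fifteen * 1.10)      print "rising"
        else if (one < fifteen * 0.90) print "falling"
        else                           print "steady"
    }'
}

# load_per_cpu LOAD CPUS
# Prints the load divided by the number of CPUs, with two decimals.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# On a live system, /proc/loadavg starts with the same three numbers
# that uptime prints.
if [ -r /proc/loadavg ]; then
    read -r one five fifteen rest < /proc/loadavg
    cpus=$(nproc 2>/dev/null || echo 1)
    echo "trend: $(load_trend "$one" "$five" "$fifteen"), 1 min load per CPU: $(load_per_cpu "$one" "$cpus")"
fi
```

With the numbers from the uptime example (2.44, 2.35, 2.40) the trend comes out as steady, and on the 8 CPU machine from lscpu the load per CPU is roughly 0.3, well below one runnable task per CPU.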

High level overview of CPU consumption

vmstat -w 1 is a great way to understand what type of load is hogging the processor. Let’s dive into the specific columns and what to look for below.

$ vmstat -w 1
--procs-- ... ---swap-- -----io---- -system-- --------cpu--------
   r    b ... si   so      bi    bo   in   cs  us  sy  id  wa  st
   8    0 ...  0    0     159   116 1049  520  31  10  58   0   0

The memory columns have been cropped from the output for the sake of readability.

What to look for?

Application level “slowness” can appear before the CPU reaches 100%. Once usage reaches around 80%, I think it makes sense to be a little worried about the CPU.

vmstat -w 1 gives you a per-second update on CPU usage. What is important is not any single second, but the overall trend.

r : Processes running or waiting for run time. If this number is consistently higher than the number of CPUs, then you have saturated the CPU. Take a look at the cpu columns for details.

us : Percentage of time spent executing user space instructions. A high number here means an application is using a lot of CPU.

sy : Percentage of time spent executing system / kernel space instructions. Typically this number should be lower than 20%. A high number here could indicate issues in the kernel or more likely in a driver.

wa : Percentage of time spent waiting for I/O. A consistently high number here indicates an issue with an I/O device. Take a look at the disk I/O analysis in the follow-up article.

st : Percentage of time stolen from the CPU. This happens in environments with virtual machines, where one machine is stealing CPU cycles from another. If that happens, you may have a noisy neighbor issue and, depending on your infrastructure, you could move the VM to another physical host.
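As a rough sketch, the checks above can be applied to vmstat output automatically. The function name vmstat_flags is made up for this example. The 80% and 20% thresholds come from the text, while the 10% I/O wait limit and the "any steal at all" rule are assumptions you will want to tune.

```shell
# vmstat_flags CPUS
# Reads `vmstat -w` output on stdin and prints one line for every
# sample value that crosses a threshold. Columns are located by the
# names in the header row, so extra or cropped columns do not matter.
vmstat_flags() {
    awk -v cpus="$1" '
        # Header row: remember the position of each column by name.
        $1 == "r" && $2 == "b" {
            for (i = 1; i <= NF; i++) col[$i] = i
            next
        }
        # Data rows start with a number; the banner row of dashes does not.
        ("r" in col) && $1 ~ /^[0-9]+$/ {
            busy = $(col["us"]) + $(col["sy"])
            if ($(col["r"]) > cpus)  print "run queue " $(col["r"]) " exceeds " cpus " CPUs"
            if (busy >= 80)          print "CPU " busy "% busy (us + sy)"
            if ($(col["sy"]) >= 20)  print "system time " $(col["sy"]) "%"
            if ($(col["wa"]) >= 10)  print "I/O wait " $(col["wa"]) "% (assumed limit)"
            if ($(col["st"]) > 0)    print "steal time " $(col["st"]) "% (assumed limit)"
        }
    '
}
```

A typical invocation would be vmstat -w 1 10 | vmstat_flags "$(nproc)". The first row vmstat prints shows averages since boot, so a flag on that row alone says little about what is happening right now.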

Per CPU analysis

Sometimes the total load on the CPU that we saw using vmstat -w 1 is acceptable, but you still have applications that struggle to get enough CPU time. One such case is single threaded applications, since they can only run on one CPU at a time.

mpstat -P ALL 1 will give you statistics similar to vmstat but with a per CPU breakdown.

$ mpstat -P ALL 1
Linux 5.13.0-40-generic (tsunami)       05/03/2022      _x86_64_        (8 CPU)

CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %idle
all    6.67    0.00    2.52    0.00    0.00    0.13    0.00  90.69
  0    4.00    0.00    2.00    0.00    0.00    0.00    0.00  94.00
  1   10.10    0.00    4.04    0.00    0.00    0.00    0.00  85.86
  2    8.08    0.00    1.01    0.00    0.00    0.00    0.00  90.91
  3    4.04    0.00    1.01    0.00    0.00    0.00    0.00  94.95
  4    4.04    0.00    3.03    0.00    0.00    1.01    0.00  91.92
  5   10.00    0.00    1.00    0.00    0.00    0.00    0.00  89.00
  6    7.07    0.00    3.03    0.00    0.00    0.00    0.00  89.90
  7    6.00    0.00    5.00    0.00    0.00    0.00    0.00  89.00

Output cropped in the interest of readability.

What to look for?

We have already looked at the overall data using vmstat -w 1, so the key thing to focus on here is whether a single CPU seems to be struggling more than the others. If there is one, check whether any single PID is using a lot of CPU, or think about whether one of “your” applications on the server might be using extra resources for a specific reason.

You might wonder why I’m using vmstat -w 1 instead of just mpstat 1 to get the overall CPU information. One reason is that vmstat also gives me a quick picture of memory, swap and process execution, so I get a more holistic view of what’s going on.
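To spot a struggling CPU without reading every row, the mpstat output can be filtered with a small helper. This is only a sketch: the function name hot_cpus is made up for this example, and "busy" here simply means 100 minus %idle, so it also includes I/O wait and steal time.

```shell
# hot_cpus LIMIT
# Reads `mpstat -P ALL` output on stdin and prints every individual CPU
# whose non-idle percentage (100 - %idle) is above LIMIT.
hot_cpus() {
    awk -v limit="$1" '
        # Header row: locate the CPU and %idle columns by name, so the
        # optional leading timestamp columns do not matter.
        /%idle/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "CPU")   cpu_col = i
                if ($i == "%idle") idle_col = i
            }
            next
        }
        # Per CPU rows have a number in the CPU column; "all" is skipped.
        cpu_col && idle_col && $cpu_col ~ /^[0-9]+$/ {
            busy = 100 - $idle_col
            if (busy > limit) printf "CPU %s is %.2f%% non-idle\n", $cpu_col, busy
        }
    '
}
```

For example, mpstat -P ALL 1 5 | hot_cpus 80 would list CPUs that are more than 80% busy in any of the five samples. With decimal comma locales the numbers may not parse, so running mpstat with LC_ALL=C is a reasonable precaution.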

Process CPU analysis

Alright, we’ve gotten this far by looking at how the load is distributed across the CPUs. But if we’ve identified that some process is hogging resources, how do we find out which one it is? A common (and pretty good) answer is to use top. But there is one issue with top: it’s difficult to copy the output to a scratch pad where you keep details about your debugging, or to send the output to someone else.

To get output that is easy to read and copy friendly, run pidstat | head -n 20 or e.g. pidstat -u 5 60 to get an update every 5 seconds, 60 times (5 minutes in total). Note that pidstat without an interval reports averages since the system was started.

$ pidstat | head -n 20
Linux 5.13.0-40-generic (tsunami)       05/03/2022      _x86_64_        (8 CPU)

UID       PID    %usr %system  %wait    %CPU   CPU  Command
  0         1    0.00    0.01   0.00    0.01     7  systemd
  0         2    0.00    0.00   0.00    0.00     1  kthreadd
  0        12    0.00    0.00   0.00    0.00     0  ksoftirqd/0
  0        13    0.00    0.06   0.02    0.06     7  rcu_sched
  0        14    0.00    0.00   0.00    0.00     0  migration/0
  0        19    0.00    0.00   0.00    0.00     1  migration/1
  0        20    0.00    0.00   0.00    0.00     1  ksoftirqd/1

What to look for?

What you’d want to find out here is if there are some processes that over time use a lot of CPU. It’s worth looking at:

%usr to find high user space time allocation. This would indicate an application in need of scaling for example.

%system to find high system / kernel space time allocation. Remember that this could be due to driver issues or other similar kernel level problems.

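%wait shows the percentage of time the task spent waiting to run on a CPU. A high value points to CPU contention (more runnable tasks than available CPUs) rather than I/O latency.

For what it’s worth, you could also match the CPU number here with the CPU number from mpstat -P ALL to get an idea of how that CPU is faring.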
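Since pidstat lists processes in PID order, a small filter helps to surface the heaviest ones. The following is a minimal sketch; the function name top_cpu_procs is made up for this example, and it relies only on the %CPU column name in the pidstat header.

```shell
# top_cpu_procs N
# Reads `pidstat -u` output on stdin and prints the N rows with the
# highest %CPU value, highest first.
top_cpu_procs() {
    awk '
        # Skip the summary block that pidstat prints after the last
        # interval, so every process is only counted once.
        /^Average:/ { next }
        # Header row: find the %CPU column by name.
        /%CPU/ { for (i = 1; i <= NF; i++) if ($i == "%CPU") col = i; next }
        # Data rows have a number in that column; prepend it as a sort key.
        col && $col ~ /^[0-9.]+$/ { print $col "\t" $0 }
    ' | sort -t "$(printf '\t')" -k1,1nr | head -n "$1" | cut -f2-
}
```

For example, pidstat -u 5 1 | top_cpu_procs 10 shows the ten busiest processes over a 5 second sample. The CPU column in those rows is the number you can compare with the per CPU rows from mpstat -P ALL.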

The Linux troubleshooting series:

Linux troubleshooting: CPU analysis

Written by Sebastian Dahlgren