Linux’s SRE Golden Signals
Part of the How to Monitor the SRE Golden Signals Series
Most people are interested in the performance and stability of their application and its underlying services, such as those covered in these articles. But many of those services rely on Linux underneath them, as their core resource for CPU, RAM, Network, and IO.
Thus it’s probably wise to get the Linux SRE Signals themselves, especially in cases where the upper-level service is very sensitive to underlying resource use (especially IO).
Even if you don’t alert on these things, they provide valuable details for observed changes in higher-level metrics, such as when MySQL Latency suddenly rises — did we have changes in the SQL, the Data, or the underlying IO system?
For Linux, we are mostly interested in CPU and Disk, a bit of Memory (for saturation), and, for completeness, the Network. Networks are rarely overloaded these days, so most people can ignore them, at least in terms of SRE Signals and alerting.
Though you could monitor network IN/OUT as a network-level Request Rate Signal, since network flow anomalies are interesting, and correlating them to other request metrics can be useful in troubleshooting.
For CPUs, the only real signals are Utilization & Saturation; Latency is somewhat interesting, while Errors and Request Rate aren’t very relevant. Thus we have:
- Latency — It’s not clear what this means in Linux, i.e. it could be the time taken to do something, or the time spent waiting in the CPU queue to do it (setting aside swapping or IO delays).
Unfortunately we have no way to measure the time it takes to do the many varied things Linux does, nor can we measure the time all tasks spend waiting in the CPU queue. We had hoped we could get this from /proc/schedstat, but after lots of work we have concluded we cannot.
- Saturation — We want the CPU Run Queue length, which becomes > 0 when things have to wait on the CPU. It’s a little hard to get accurately.
We’d love to directly use the Load Average, but unfortunately in Linux it also includes both CPU and IO, which really pollutes the data. However, it’s easy to get and can be useful for Saturation if your service uses CPU only, with little/no IO, such as an App Server (but watch out for large logs being written to disks). A Load Average > 1–2 x CPU count is usually considered saturated.
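The load-average rule of thumb above can be sketched as a tiny check. This is a hypothetical helper, not a standard tool; on a live host you’d feed it `os.getloadavg()[0]` and `os.cpu_count()`:

```python
def load_saturated(load1: float, ncpus: int, factor: float = 2.0) -> bool:
    """Rule of thumb: saturated when the 1-minute load average
    exceeds factor x CPU count (factor of 1-2 per the text)."""
    return load1 > factor * ncpus

# Hypothetical sample values: an 8-CPU host with a 1-minute load of 20.
print(load_saturated(20.0, 8))  # True -> saturated by the 2x rule
print(load_saturated(6.0, 8))   # False
```

Remember that this load average includes tasks waiting on IO, not just on CPU, so it only maps cleanly to CPU saturation on low-IO services.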
If we want the real queue, i.e. exclude any IO, it’s harder. The first problem is that all the tools that show this data do not average it over time, so it’s an instantaneous measurement and very noisy. This applies to vmstat, sar -q, and /proc/stat’s procs_running, etc.
The next challenge is that what Linux calls the run queue actually includes what’s currently running, so you have to subtract the number of CPUs to get the real queue. Any result > 0 means the CPU is saturated.
The final issue is how to get this data from the kernel. The best way is in the /proc/stat file. Get the “procs_running” count and subtract the number of cpus (grep -c processor /proc/cpuinfo) to get the queue.
That’s still an instantaneous count, so it’s very noisy. There seems to be no tool to sample and aggregate this over time, so we wrote one in Go, called runqstat — see it on GitHub (where it’s still under development).
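The procs_running calculation above can be sketched as follows (a minimal sketch; the /proc/stat excerpt and CPU count are hypothetical sample values, and on a live host you’d read the file directly):

```python
def cpu_queue(proc_stat: str, ncpus: int) -> int:
    """Instantaneous CPU queue: procs_running minus the CPUs doing the running."""
    for line in proc_stat.splitlines():
        if line.startswith("procs_running"):
            running = int(line.split()[1])
            return max(running - ncpus, 0)
    raise ValueError("procs_running not found in /proc/stat contents")

# Hypothetical /proc/stat excerpt: 6 runnable tasks on a 4-CPU box.
sample = "cpu  10 0 5 1000 2 0 1 0 0 0\nprocs_running 6\nprocs_blocked 0\n"
print(cpu_queue(sample, 4))  # 2 -> two tasks waiting for a CPU
```

As noted above, a single reading is noisy, so sample frequently (e.g. every second) and average over your monitoring interval before alerting.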
Steal — You can also track CPU Steal %, as this also means the CPU is saturated, from the perspective of the Hypervisor. You should usually see an immediate rise in the CPU Run Queue as well, unless you have a single-threaded load such as Nginx, HAProxy, or Node.js, where this is hard to see.
Note for single-threaded loads, this whole Saturation metric is less useful (as you can’t have a queue if there is nothing else to run). In that case, you should look at Utilization, below.
- Utilization — We want the CPU %, but this also needs a bit of math, so get and sum CPU %: User + Nice + System + (probably) Steal. Do not add iowait % and obviously don’t add idle %.
It’s critical to track CPU Steal % on any system running under virtualization. Do not miss this, and usually you want to alert if this is much above 10–25%.
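The utilization math above can be sketched from two samples of the aggregate "cpu" line in /proc/stat (the jiffy values here are hypothetical; the field order is the kernel’s standard user/nice/system/idle/iowait/irq/softirq/steal):

```python
def cpu_percent(prev: list[int], curr: list[int]) -> float:
    """CPU % = (user + nice + system + steal) / total jiffies, between
    two samples of /proc/stat's "cpu" line. As the text advises, iowait
    and idle are excluded from the busy count."""
    d = [c - p for c, p in zip(curr, prev)]
    user, nice, system, idle, iowait, irq, softirq, steal = d[:8]
    busy = user + nice + system + steal
    total = sum(d[:8])
    return 100.0 * busy / total if total else 0.0

# Hypothetical jiffy counters sampled a few seconds apart.
prev = [100, 0, 50, 800, 20, 5, 5, 20]
curr = [160, 0, 70, 850, 30, 5, 5, 30]
print(cpu_percent(prev, curr))  # 60.0
```

Most monitoring agents do this delta for you; the sketch just makes the "sum busy, divide by total" arithmetic explicit.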
For Docker and other containers, the best saturation metric is harder to determine, and it depends on whether any CPU caps are in place. You have to read the cgroup filesystem (under /sys/fs/cgroup) for each container and, from its cpu.stat file, get the nr_throttled counter, which increments each time the container was CPU-throttled. You probably should delta this to get throttles/second.
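The throttle-rate delta can be sketched like this (the cpu.stat contents and the 10-second interval are hypothetical; the real file lives under the container’s cgroup directory):

```python
def nr_throttled(cpu_stat: str) -> int:
    """Pull the nr_throttled counter from a cgroup cpu.stat file's contents."""
    for line in cpu_stat.splitlines():
        key, _, value = line.partition(" ")
        if key == "nr_throttled":
            return int(value)
    raise ValueError("nr_throttled not found")

def throttles_per_sec(prev: int, curr: int, interval_s: float) -> float:
    """Delta the counter between two samples to get throttles/second."""
    return (curr - prev) / interval_s

# Hypothetical cpu.stat samples taken 10 seconds apart.
before = "nr_periods 400\nnr_throttled 120\nthrottled_time 9000000\n"
after = "nr_periods 500\nnr_throttled 150\nthrottled_time 12000000\n"
print(throttles_per_sec(nr_throttled(before), nr_throttled(after), 10.0))  # 3.0
```

Any sustained nonzero throttle rate means the container is hitting its CPU cap, i.e. saturated from the container’s point of view even if the host has idle CPU.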
For Disks, we map the SRE signals to the following metrics, all from the iostat utility, though your monitoring system may extract these items directly from /proc/diskstats (most agents do this).
You can get this data for all disks, just the busiest disk, or the disk where your service’s data sits. The last is often easiest, since you should know where the data lives, and focusing your alerting & dashboards on that disk avoids having to look at the other, less informative disks in the system.
- Request Rate — The IOPS to the disk system (after merging), which in iostat is r/s and w/s.
- Error Rate — There are no meaningful error rates you can measure on disks, unless you want to count some very high latencies. Most real errors, such as ext4 or NFS errors, are nearly fatal and you should alert on them ASAP, but finding them is hard, as usually you have to parse the kernel logs.
- Latency — The read and write times, which in iostat is Average Wait Time (await), which includes queue time, as this will rise sharply when the disk is saturated.
It’s best to monitor read and write response time separately if you can (r_await & w_await) as these can have very different dynamics, and you can be more sensitive to actual changes without the noise of the ‘other’ IO direction, especially on databases.
You can also measure iostat Service Time as a way to look at the health of the disk subsystem, such as AWS EBS, RAID, etc. This should be consistent under load; IO-sensitive services will have real problems when it rises.
- Saturation — Best measured by IO Queue depth per device, which if you get it from iostat is the aqu-sz variable. Note this includes IO requests being serviced, so it’s not a true queue measurement (the math is a bit complex).
Note the iostat util % is not a very good measure of true saturation, especially on SSD & RAID arrays which can execute multiple IO requests simultaneously, as it’s actually the % of time at least one request is in-progress. However, if util % is all you can get, it can be used for simple monitoring to see if there are real changes compared to baselines.
- Utilization — Most people use the iostat util%, but as noted above, this is not a good measure of real utilization for other than single-spindle magnetic disks.
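If your agent reads /proc/diskstats rather than running iostat, the Latency signal above can be derived by hand. A minimal sketch with hypothetical counter samples (the fields after the device name are: reads, reads merged, sectors read, read ms, writes, writes merged, sectors written, write ms, IOs in flight, IO ms, weighted IO ms):

```python
def disk_awaits(prev: list[int], curr: list[int]) -> tuple[float, float]:
    """r_await and w_await in ms: time spent on reads/writes divided by
    reads/writes completed, between two /proc/diskstats samples."""
    d = [c - p for c, p in zip(curr, prev)]
    reads, read_ms, writes, write_ms = d[0], d[3], d[4], d[7]
    r_await = read_ms / reads if reads else 0.0
    w_await = write_ms / writes if writes else 0.0
    return r_await, w_await

# Hypothetical counters for one device, sampled a few seconds apart.
prev = [1000, 10, 8000, 4000, 500, 5, 4000, 2500, 0, 3000, 6500]
curr = [1100, 12, 8800, 4800, 560, 6, 4480, 3400, 0, 3300, 7700]
print(disk_awaits(prev, curr))  # (8.0, 15.0)
```

This matches what iostat reports as r_await and w_await, including queue time, which is exactly why it rises sharply when the disk saturates.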
For memory, we only care about Saturation and Utilization.
- Saturation — Any swapping at all means RAM saturation, though today most cloud VMs don’t use any swap file, so this is slowly becoming useless. The next best metric is really just Utilization, below.
You should track OOM (Out-of-Memory) errors if at all possible, which are issued by the kernel itself when it is totally out of RAM or swap for a process. Unfortunately this can only be seen in the kernel logs (though in kernel 4.13+ it’s also in /proc/vmstat, the oom_kill counter), so ideally you can send these to a centralized ELK or SaaS-based log analysis system that can alert on this.
Any OOM error implies serious RAM saturation, and is usually an emergency, as it usually kills the largest RAM user, such as MySQL.
Note that if there is swap available and kernel swappiness is not set to 1, you will often see swapping long before RAM is fully utilized. This is because the kernel will not minimize the file cache before it starts swapping, and thus you can saturate RAM long before RAM is fully used by work-producing processes.
- Utilization — Calculate RAM use as (Total RAM - free - buffers - cache), which is the classic measure of RAM in use. Dividing by the total RAM gives the % in use.
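The classic calculation above, applied to /proc/meminfo (the sample values are hypothetical; on kernels 3.14+ the MemAvailable field is a simpler alternative):

```python
def ram_used_percent(meminfo: str) -> float:
    """(MemTotal - MemFree - Buffers - Cached) / MemTotal, as a percentage."""
    kb = {}
    for line in meminfo.splitlines():
        key, _, rest = line.partition(":")
        kb[key] = int(rest.split()[0])  # /proc/meminfo values are in kB
    used = kb["MemTotal"] - kb["MemFree"] - kb["Buffers"] - kb["Cached"]
    return 100.0 * used / kb["MemTotal"]

# Hypothetical /proc/meminfo excerpt for an 8 GB host.
sample = ("MemTotal: 8000000 kB\nMemFree: 1000000 kB\n"
          "Buffers: 500000 kB\nCached: 2500000 kB\n")
print(ram_used_percent(sample))  # 50.0
```

Subtracting buffers and cache matters because the kernel will happily fill "free" RAM with file cache; counting that as used would make every busy host look saturated.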
For networks, we are interested in throughput, which is our ‘requests per second’ in the sense of bandwidth.
- Rate — Read and Write bandwidth in bits-per-second. Most monitoring systems can get this easily.
- Utilization — Use Read and Write bandwidth divided by the maximum bandwidth of the network. Note maximum bandwidth may be hard to determine in mixed VM/Container environments when traffic may use both the host’s localhost network & the main wired network. In that case, just tracking Read and Write bandwidth may be best.
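A sketch of the utilization calculation, using byte counters such as those in /proc/net/dev (the counters, interval, and the 1 Gbit/s link speed are hypothetical sample values):

```python
def link_utilization(bytes_prev: int, bytes_curr: int,
                     interval_s: float, link_bps: float) -> float:
    """Percent of link bandwidth used: bytes moved between two samples,
    converted to bits/second, divided by the link's maximum bits/second."""
    bits_per_sec = (bytes_curr - bytes_prev) * 8 / interval_s
    return 100.0 * bits_per_sec / link_bps

# Hypothetical: 125 MB moved in 10 s on a 1 Gbit/s link -> 100 Mbit/s.
print(link_utilization(0, 125_000_000, 10.0, 1_000_000_000))  # 10.0
```

Run this separately for the receive and transmit counters, since a link can saturate in one direction while the other is idle.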