CPU Frequency Scaling

Emmanuel Stephan
8 min read · Apr 9, 2016


Modern CPUs on high-end, general-purpose servers usually support “CPU frequency scaling”. This feature lets the operating system control the CPU frequency to achieve various goals, such as performance or power saving. Recently, I optimized the latency of a large query on our in-house graph database, which is currently under development. I was not very familiar with CPU frequency scaling, as my previous code-optimization work had not involved servers with that feature. The effort turned into a solid learning experience.

The setup

I focused exclusively on optimizing the latency of a single-threaded C++ process on RHEL. The server had 2 NUMA nodes, each with 8 cores and 16 CPUs (in Intel’s nomenclature, each “core” hosts two hyper-threaded “CPUs”), for a total of 32 CPUs, and each NUMA node had 256 GB of RAM. With this setup, we hoped to track the effect of each of our code improvements on the latency. Ideally, as we introduced code improvements, the latency would go down until we reached our target of 0.5s. When we started, the latency was about 2.6s.

Here are some details about the server used, obtained by running lscpu and numactl -H:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Stepping: 2
CPU MHz: 2601.000
BogoMIPS: 5200.57
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 262036 MB
node 0 free: 49732 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 262144 MB
node 1 free: 241687 MB
node distances:
node 0 1
0: 10 21
1: 21 10

Measuring the query latency

My first task was to take stock of what the query’s latency actually was. One of my colleagues had previously reported a single-point measurement of 2.65s, without more details. I wanted to get some idea of the variability of that measurement, knowing from previous experience that a single number can be deceptive. I collected batches of 480 measurements and started looking at the data. I was quite surprised when my first batch returned with a minimum latency of 2.6s, but a maximum of 3.75s!
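
For reference, here is a minimal sketch of such a measurement loop, with a hypothetical run_query binary standing in for our actual query driver:

# Run the query 480 times back to back, recording one latency (in seconds) per line.
for i in $(seq 1 480); do
  start=$(date +%s.%N)
  ./run_query > /dev/null
  end=$(date +%s.%N)
  echo "$end - $start" | bc
done > latencies.txt

# Quick summary of the batch: min, max and mean latency.
sort -n latencies.txt | awk 'NR==1 {min=$1} {sum+=$1; max=$1} END {printf "min=%s max=%s mean=%.3f\n", min, max, sum/NR}'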

I wanted to reduce the variability, which seemed too large, since the code improvements we intended to introduce would each be on the order of 0.1–0.2s. Also, our final target was 0.5s, while the initial latency variability was above one full second. My first thought was that I simply didn’t have enough data points, and that collecting more would reduce the variability. Alas, that proved to be a dead end: the difference between the max and the min latencies didn’t change.

Suspecting the distribution of the latencies had interesting features, I plotted histograms. I was expecting a long tail to the right, reasoning that the server was running the query in 2.6s most of the time, but sometimes much slower because it was busy doing other things. To my surprise, the distribution had a long tail, but to the left! Other histograms had 2 or even 3 modes! Here are two typical histograms. The solid black line is the mean, and the dotted lines are one standard deviation away from the mean.

At that point, someone on the team suggested that I plot the latency over time. This revealed a pattern: the server would sometimes be fast, sometimes slow, and sometimes quite erratic, but it seemed to stay with one behavior for a while, then switch to another. It was not random, as is apparent in the plots below. The first plot corresponds to the histogram on the left just above, and the second plot to the histogram on the right. The horizontal axis counts the number of times the query was run, 480 times back to back in this case.

Suspecting that the slowdowns were caused by other processes running on the server, I turned to top and ps. There were actually quite a few processes, mostly telemetry processes that are standard issue on our production servers. I killed as many as I could, but the dips, troughs and plateaus in the latency plots did not go away. We then thought that the variations might be due to the OS migrating our reader process from one CPU to another. To control for that, I used taskset -c 21 to pin the C++ process to a single CPU, choosing 21 rather than 0 in case CPU 0 had a privileged role for the OS. This was the first time I saw the variability actually drop, and it was cut in half! In the graph below, the standard deviation (the dotted lines) is clearly much smaller when the process is pinned to CPU 21. I later discovered that it seems to be enough to pin the process to one of the two NUMA nodes. Pinning to a single CPU might be slightly better, but in our case the additional benefit seemed negligible.
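
For the record, both forms of pinning boil down to a one-line wrapper around the process, again with a hypothetical run_query binary standing in for ours:

# Pin the process to logical CPU 21 only.
taskset -c 21 ./run_query

# Or pin it to all the CPUs of NUMA node 0 and allocate its memory there.
numactl --cpunodebind=0 --membind=0 ./run_query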

Here are the histograms of the latency, with and without CPU affinity. We can clearly see again that the variability is reduced when pinning the process to a specific CPU.

I still wanted to reduce the variability further. That’s when someone suggested that the “CPU governor” might be the cause of the variability we were still seeing, and we started getting interested in the cpupower command. It turns out that the CPU governor has several settings that trade off the performance and the power consumption of a server. In addition, the CPUs have “boost” modes that can temporarily over-clock a CPU. Here is a typical example of the output of cpupower when used to monitor the frequencies of the CPUs on a 32-CPU box. As you can see, CPU 21 is very busy, currently on an excursion at 3.395 GHz, whereas CPU 25 is idling at 1.51 GHz.
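
The inspection itself is straightforward with something like:

# Report the cpufreq driver, the current governor and the hardware frequency limits.
cpupower frequency-info

# Sample the frequency of every CPU over 1-second intervals.
cpupower monitor -i 1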

Cpupower also allows setting the frequency of the CPUs. I ran a bunch of experiments to study the variability of our query’s latency at various CPU frequencies. I varied the frequency from 1.2 GHz to 2.6 GHz (the manufacturer’s recommended operating range) in increments of 0.2 GHz, and finally set the CPU governor to the “perf” setting, the mode in which over-clocking is allowed for short bursts. I obtained the following distributions at the different frequency settings, and they show that the higher the frequency, the more variable the latency. The dotted lines connect the min and the max of the observed latency (outliers excluded). The blue shape is the actual shape of the distribution, on its side.
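
The sweep can be driven with something like the following (run as root; the available governors depend on the cpufreq driver in use):

# Clamp every CPU to 1.2 GHz by pinning both ends of the allowed range.
cpupower frequency-set --min 1200MHz --max 1200MHz

# ... run the 480 measurements, then move on to the next frequency step ...

# Finally, allow the full range again and select the performance governor,
# which lets the CPUs boost for short bursts where the hardware supports it.
cpupower frequency-set --min 1200MHz --max 2600MHz
cpupower frequency-set --governor performance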

Here are the individual histograms at each frequency, separated out, with the usual convention that the dotted lines mark one standard deviation away from the mean. The increase in the standard deviation is quite obvious from this series of histograms.

My interpretation is that as the CPU frequency is set higher, the CPU heats up more, causing its actual speed to vary more. When the CPU is “cold” (around 1.2 GHz), the variability is minimal. In the “perf” setting, the variability is the greatest, and the distribution has a long tail to the left, because the CPU tries to run as fast as it can, but constantly has to fall back to a slower setting, under penalty of burning up… At the “hot” end of the settings (in “perf” mode), the variability, measured as (max - min)/mean, is about 12%, but at the “cold” end it is only about 2%. We couldn’t eliminate that residual variability, but the margin of error on our latency measurements is now much more manageable. I was told that this “intrinsic” variability is apparently due to the infernal details of the CPU, which is divided into multiple “clock domains” that are not necessarily synchronous. That might be the topic of a follow-up blog post.

One question remained, though. We could set the server to 1.2 GHz to estimate the improvements introduced by code checkins accurately (within 2%), but would a measurement at 1.2 GHz be “predictive”, in some sense, of the performance in the production setting, where the CPUs run at 2.6 GHz? To explore this question, I compared 2 versions of the code that differed by one checkin, and plotted the percentage latency reduction as I varied the frequency of the CPU. The results indicate that the percentage improvement is magnified at 1.2 GHz compared to 2.6 GHz, at least for the improvement made in this checkin; it’s probably hard to generalize, since a random checkin could change many, many different things.
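
As a reminder of the arithmetic, the percentage reduction for a drop from, say, 2.60s to 2.45s (made-up numbers) is about 5.8%:

# Illustrative only: percentage latency reduction between two builds at one frequency setting.
awk 'BEGIN { before=2.60; after=2.45; printf "reduction = %.1f%%\n", 100*(before-after)/before }'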

One more note about CPU frequency scaling: the feature also prevents the CPUs from over-heating, so the hotter the server (and its environment), the more variability there will be in the latencies. We looked at the temperatures on the box with sensors; in our case, the temperature stayed at 53C, with a “high” threshold at 77C and a critical threshold of 85C, which we never reached. Colleagues told me that, since hot air rises, boxes towards the top of a rack in the colo usually run hotter than those at the bottom.
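
Keeping an eye on those temperatures while the measurement loop runs is as simple as (assuming the lm_sensors package is installed):

# Refresh the sensor readings, including the per-core CPU temperatures, every 5 seconds.
watch -n 5 sensors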
