The Java CPU usage observer effect

Milan Mimica
Infobip Developers
Published in
3 min readJul 23, 2020

A mere observation of a phenomenon inevitably changes that phenomenon.

System CPU Usage graph

One of the basic metrics of your service goes berserk. System CPU usage graph suddenly becomes all spiky. It’s switching between 0 and 100% most of the time. No business metrics show anomalies so you continue with your daily routine. Ignoring the crazy metric is a proven strategy to get things delivered and not getting stressed out for nothing. It wouldn’t be the first time things somehow solve by themselves. But not this time.

Nothing on the surface

You dig deeper and find that the metric is produced by OperatingSystemMXBean::SystemCpuLoadMXBean. The value is exposed to prometheus by jmx_exporter, together with all other java.lang<type=OperatingSystem>metrics. To your amazement, other metrics look fine. Even ProcessCpuLoad plots a nice-looking line. You ssh to the VM, run htop and CPU usage also looks fine. Development is fast-paced. There have been many recent changes in both the service and the running environment. It’s impossible to track down which change exactly broke the System CPU graph. Besides, it’s Java — you were promised platform independence and backward compatibility!

Deep dive into Hotspot JVM

System CPU usage is calculated in OperatingSystem.c as follows: it reads the first line from /proc/stat, where the system keeps its ticks. A tick is an arbitrary unit for measuring internal system time. There are independent counters of ticks passed doing “idle” time, “user” time, “system” time, and so on. More ticks, more CPU time was spent in the category. Each time you ask Java for an estimation of, as they put it “recent CPU usage”, it reads current tick counters, subtracts the ticks from the last read, and gives you the ratio to total ticks. In pseudo code:

ticks = read("proc/stat")user_diff = ticks["user"] - prev_ticks["user"]
system_diff = ticks["system"] - prev_ticks["system"]
total_diff = sum(ticks) - sum(prev_ticks)
if (total_diff != 0)
cpu_load = (user_diff + system_diff) / total_diff
else
cpu_load = 0
prev_ticks = ticks

There is an obvious flaw in this implementation: it calculates average CPU usage since the last invocation of the method. If you call it twice in a row, the second invocation is likely to return a value skewed by low-resolution: a zero, a one, or 0.5 if you’re lucky. Exactly what the graph started to plot!

Look Once, Look Again

The hidden cause of this discrepancy is the rename of the metric in JDK 14 from SystemCpuLoad to CpuLoad. The old metric was not removed, only deprecated in the API. JmxCollector pulls and publishes all available metrics, effectively calculating the same CPU usage twice in a row. The value was corrupted by merely measuring it — twice.

In MBean API there is no attribute for deprecation, and that’s a real problem. A generic tool has no way of ignoring deprecated metrics. I see no workarounds for this. Even Oracle’s own Mission Control is displaying different values for SystemCpuLoadand CpuLoad. Any generic tool will have the same issue. It’s not safe to assume which metric is going to be right or wrong. It’s random, though seems consistent. Unfortunately, jmx_exporter does not support the blacklisting of individual MBean values.

--

--