The Kubernetes CPU Manager: a Ghost Story

Michael Vigilante
Redbubble
Jul 21, 2020

For a while now, our Kubernetes cluster has been home to a couple of ghosts.

Sure, for the most part everything has been working OK — applications generally perform well, requests get routed through to our website, and people are able to go about their day-to-day lives in peace.

But, out of the corners of our eyes, strange things have been happening. Pods performing erratically, serving most requests quickly but veering towards timeout around the 95th percentile. Odd latency spikes in parts of our network that just didn’t make sense. Pods with plenty of space in their CPU-request budget experiencing short waves of poor performance.

It was all a bit… Spooky.

Undaunted, we paced the cluster looking for clues. Somebody may have uttered those fateful words: “there must be a rational explanation for this somewhere”. We went through all our metrics, followed a couple of red herrings, and eventually came to the conclusion that there was really only one correlation we could make:

Every ghost sighting corresponded to a period of high CPU throttling on the affected host.

CPU throttling occurs in Linux when the scheduler determines a process has used up its entire CPU quota. Kubernetes uses quotas to implement CPU limits. The scheduler breaks real time up into periods (100ms long by default); once a process has used up its quota of CPU time for the current period, it's not allowed to do any more computation until the next period comes around. This has been known to introduce latency into highly-threaded applications like web servers, and might explain the erratic performance of some pods.
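To make that concrete, here's a minimal, hypothetical pod fragment (the name and image are placeholders, not one of our real workloads) showing how a CPU limit maps onto a CFS quota under the default 100ms period:

```yaml
# Illustrative only: a CPU limit of 500m gives this container's cgroup a quota
# of roughly 50ms of CPU time per 100ms period (cpu.cfs_quota_us=50000 against
# cpu.cfs_period_us=100000 with the defaults). Once that 50ms is spent, every
# thread in the container is throttled until the next period starts.
apiVersion: v1
kind: Pod
metadata:
  name: throttle-demo              # placeholder name
spec:
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      resources:
        limits:
          cpu: 500m                # half a CPU's worth of time each period
```

A server with many busy threads can burn through that quota early in a period and then stall for the remainder, which is one way the kind of tail latency described above can arise.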

This discovery was a start, but it didn’t explain all the mischief under way — there’s no reason CPU quotas should affect pods that aren’t ever running up against their CPU limits. So, we looked again.

Something else we noticed about our paranormal events was that they tended to occur more frequently on nodes on which CPU time was heavily overcommitted.

We’ve had a habit for a while now of setting our Kubernetes resource limits way higher than our resource requests. This has allowed us to pack more pods onto our nodes, but means that sometimes the containers on a host will try to use more resources than are available. This isn’t great for a lot of reasons — it means there isn’t enough CPU power (or memory) to go around. But we think it’s also specifically very bad for applications with bursty, high-IO workloads — like web servers — with low CPU requests.
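As an illustration (the numbers are made up, not our actual settings), an overcommitted spec looks something like this: the scheduler places the pod based on the small request, but at runtime the container is allowed to burst to ten times that.

```yaml
# Illustrative only: Kubernetes schedules against the 200m request, but the
# container may legally burn up to 2 full CPUs. A node packed with pods like
# this has promised far more CPU than it can actually deliver at once.
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: "2"
    memory: 1Gi
```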

Kubernetes implements CPU requests using another scheduler control called CPU shares. Shares are a way to ensure that a process has access to some proportion of all the available CPU power on a system. Unlike quotas, though, they’re designed such that a process’s unused cycles are released back to the system to be used for other things.
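Roughly speaking (this is the cgroup v1 behaviour as we understand it), the kubelet converts a CPU request into cpu.shares at about 1024 shares per full CPU, so a 250m request works out to around 256 shares. Shares are relative weights rather than hard caps; they only bite when CPUs are contended.

```yaml
# Approximate mapping: cpu.shares ~= requested millicores * 1024 / 1000.
# An idle container's share of the CPU is handed out to whoever else wants it.
resources:
  requests:
    cpu: 250m   # -> cpu.shares of roughly 256
  limits:
    cpu: "1"    # -> CFS quota of one full CPU per period
```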

Shares are a much older feature and aren’t nearly as well-documented as quotas — some of what follows here is conjecture — but it appears that the amount of power released to the system is based on a rolling average of a process’s load over time. Usually, this wouldn’t be a problem — if an application needs to pick up a couple of its lost cycles from the global pool, it can do that. Unless, of course, another process is already using them. Eventually, things even out — but periods like this, we think, are responsible for the short bursts of poor performance we’ve been seeing in some of our systems.

We thought we were hunting for ghosts in our network, but it turns out greater darkness lurks in the hearts of computers themselves. Having learned this, we’ve changed a couple of things about the way we run our clusters. First, we now reserve CPU space for our system processes — including our network layer — by adding the --reserved-cpus option to our invocation of kubelet, so that we’re sure our overcommitment isn’t starving them of resources.
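For reference, the same thing can be expressed in the kubelet's config file rather than as a command-line flag; here's a sketch (the CPU IDs are arbitrary examples, and the field name is per the KubeletConfiguration API as we understand it):

```yaml
# Sketch: reserve two CPUs for system and Kubernetes daemons.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
reservedSystemCPUs: "0,1"   # equivalent to --reserved-cpus=0,1 on the command line
```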

Second, we enabled a Kubernetes feature: the static CPU manager policy. This allows containers within our pods to take over one or more CPUs for themselves, without needing to worry about shares or throttling. In order to take advantage of this, a container needs to meet two conditions: it must be in a pod with a “guaranteed” Quality of Service class (all this really means is that the pod’s resource requests must be identical to its resource limits) and the container must request an integer number of CPUs. Any container meeting these requirements will now run on its own dedicated CPU, free from the wiles of the scheduler.
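Here's a hypothetical pod that qualifies (the name and image are placeholders): requests identical to limits for every resource, and a whole number of CPUs. The node's kubelet also has to be started with the static policy turned on, for example via --cpu-manager-policy=static or cpuManagerPolicy: static in its config file.

```yaml
# Guaranteed QoS: requests == limits for every resource in every container,
# and the CPU count is an integer, so the static CPU manager pins this
# container to two exclusive CPUs, with no shares or quota throttling to worry about.
apiVersion: v1
kind: Pod
metadata:
  name: pinned-server                # placeholder name
spec:
  containers:
    - name: server
      image: example.com/server:latest   # placeholder image
      resources:
        requests:
          cpu: "2"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 2Gi
```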

When we rolled this out, several of our applications already met the requirements and began to see benefits. We helped a couple of teams get our other large applications across, and with that done we immediately saw improvements within our network. Latency is reliably low, and unexpected errors have reduced to near-zero. Since we moved most of our larger applications over to dedicated CPUs, we’re also much less overcommitted now.

This is how we managed to scare away the spectres that lurked in our network — a whole lot of time spent measuring and searching and reading. I hope, wherever we’ve sent them, they are at peace.
