When I started working with kubernetes at scale I began encountering something that didn’t happen when I was just running experiments on it: occasionally a pod would get stuck in pending status because no node had sufficient cpu or ram available to run it. You can’t add cpu or ram to a node, so how do you un-stick the pod? The simplest fix is to add another node, and I admit to resorting to that answer more than once. Eventually it became clear that this strategy fails to leverage one of kubernetes’ greatest strengths: its ability to efficiently utilize compute resources. The real problem in many of these cases was not that the nodes were too small, but that we had not accurately specified resource limits for the pods.
Resource limits are the operating parameters that you provide to kubernetes that tell it two critical things about your workload: what resources it requires to run properly; and the maximum resources it is allowed to consume. The first is a critical input to the scheduler that enables it to choose the right node on which to run the pod. The second is important to the kubelet, the daemon on each node that is responsible for pod health. While most readers of these posts probably have at least a basic familiarity with the concept of resources and limits, there is a lot of interesting detail under the hood. In this two-part series I’m going to first look closely at memory limits, and then follow up with a second post on cpu limits.
Resource limits
Resource limits are set on a per-container basis using the resources property of a containerSpec, which is a v1 api object of type ResourceRequirements. Each object specifies both “limits” and “requests” for the types of resources that can be controlled. Currently that means cpu and memory. A third type of resource, ephemeral storage, is in beta but I will come back to that in some future post. For most of us the place we will encounter resource limits is in the specification of a deployment, statefulset or daemonset, each of which contains a podSpec with one or more containerSpecs. Here’s an example of a complete v1 resources object in yaml:
resources:
  requests:
    cpu: 50m
    memory: 50Mi
  limits:
    cpu: 100m
    memory: 100Mi
This object makes the following statement: in normal operation this container needs 5 percent of cpu time, and 50 mebibytes of ram (the request); the maximum it is allowed to use is 10 percent of cpu time and 100 mebibytes of ram (the limit). I’m going to talk a lot more below about the difference between requests and limits, but generally speaking requests are important at schedule time, and limits are important at run time. Although resource limits are set on each container, you can think of the limits for a pod as being the sum of the limits of all the containers in it, and as we’ll see that relationship is maintained lower down in the system as well.
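Since, as noted above, you will usually set these values inside a deployment, statefulset or daemonset rather than on a bare pod, here is a minimal sketch of how that same resources object sits inside a deployment’s containerSpec. The name, image and command match the examples used later in this post; the labels and selector are just illustrative:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: limit-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: limit-test
  template:
    metadata:
      labels:
        app: limit-test
    spec:
      containers:
      - name: limit-test
        image: busybox
        command: ["/bin/sh", "-c", "while true; do sleep 2; done"]
        resources:
          requests:
            cpu: 50m
            memory: 50Mi
          limits:
            cpu: 100m
            memory: 100Mi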
Memory limits
I’m tackling memory limits first because in many ways they are the simpler of the two. One of my goals here is to show how limits are implemented in the system: kubernetes delegates to the container runtime (docker/containerd in this case), and the container runtime delegates to the linux kernel. Showing that with memory first may make it clearer when we talk about cpu limits later. First let’s revisit the example above, with just the memory limits:
resources:
  requests:
    memory: 50Mi
  limits:
    memory: 100Mi
The unit suffix Mi stands for mebibytes, and so this resource object specifies that the container needs 50 Mi and can use at most 100 Mi. There are a number of other units in which the amount of memory can be expressed, such as Ki, Gi, M and G. To see how these values are used to control the container process, let’s first create a really simple pod with no memory limits at all:
$ kubectl run limit-test --image=busybox --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created
Using kubectl we can verify that kubernetes created the pod with no limits:
$ kubectl get pods limit-test-7cff9996fc-zpjps -o=jsonpath='{.spec.containers[0].resources}'
map[]
One of the cool things about kubernetes is that you can always jump outside the system and look at things from that perspective. So let’s ssh to the node and see how docker is running that container:
$ docker ps | grep busy | cut -d' ' -f1
5c3af3101afb
$ docker inspect 5c3af3101afb -f "{{.HostConfig.Memory}}"
0
The container .HostConfig.Memory field corresponds to the --memory argument to docker run, and a 0 value means no limit has been set. What does docker do with that value? In order to control the amount of memory that a container process can access docker configures a property of a control group, or cgroup for short. Cgroups were added to linux in kernel version 2.6.24, released in January of 2008. They are a big topic, so for the moment let’s say that a cgroup is a container for a set of related properties that control how the kernel runs a process. There are specific cgroups to control memory, cpu, devices, etc. Cgroups are hierarchical, meaning that each cgroup has a parent from which it inherits properties, all the way up to the root cgroup which is created at system start.
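If you want to see which cgroup controllers are available on a node, you can list the top of the cgroup file system. This is only an illustrative listing from a typical cgroup v1 node; the exact set of controllers will vary by distribution and kernel:
$ ls /sys/fs/cgroup
blkio  cpu  cpuacct  cpuset  devices  freezer  memory  net_cls  pids  systemd  ...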
Cgroups are easy to inspect using the /proc and /sys pseudo file systems, so it’s a simple exercise to see how docker has configured the memory cgroup for our container. Inside the pid namespace of a container the root process has pid 1, but outside that namespace it has a system-level pid that we can use to find its cgroups:
$ ps ax | grep /bin/sh
9513 ? Ss 0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/9513/cgroup
...
6:memory:/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574
I’ve left out everything but the memory cgroup, which is the one we care about. As you can see it’s a path: there’s the hierarchy I mentioned above. A few things should stand out here: first, the path begins with the kubepods cgroup, so our process will inherit everything in that group, as well as stuff from the burstable group (where kubernetes places processes from pods in the burstable QOS class) and a group representing our pod that is mostly used for accounting. The last component of the path is the actual memory cgroup of our process. To see the details we have to append the path above to /sys/fs/cgroup/memory, which leads to:
$ ls -l /sys/fs/cgroup/memory/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574
...
-rw-r--r-- 1 root root 0 Oct 27 19:53 memory.limit_in_bytes
-rw-r--r-- 1 root root 0 Oct 27 19:53 memory.soft_limit_in_bytes
Again I’ve left out a lot of stuff to keep this focused. We’ll ignore memory.soft_limit_in_bytes for now, and instead zoom in on the memory.limit_in_bytes property, which is the one that sets a memory limit. It is the cgroup equivalent of the --memory docker run argument, and of the memory resource limit in kubernetes. Let’s take a look:
$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/podfbc202d3-da21-11e8-ab5e-42010a80014b/0a1b22ec1361a97c3511db37a4bae932d41b22264e5b97611748f8b662312574/memory.limit_in_bytes
9223372036854771712
This is the value on my system when no limit is set: it is the largest 64-bit signed integer rounded down to the nearest multiple of the 4096-byte page size, which is the kernel’s way of saying “no limit.” (For a fuller explanation see this brief Stackoverflow post.) So we can see that not setting a memory limit in kubernetes caused docker to create the container with HostConfig.Memory set to 0, which resulted in the container process being placed into a memory cgroup with the default “no limit” value for memory.limit_in_bytes. Now let’s create a pod with a memory limit of 100 mebibytes:
$ kubectl run limit-test --image=busybox --limits "memory=100Mi" --command -- /bin/sh -c "while true; do sleep 2; done"
deployment.apps "limit-test" created
And again we can use kubectl to verify that the pod was created with our specified limit:
$ kubectl get pods limit-test-5f5c7dc87d-8qtdx -o=jsonpath='{.spec.containers[0].resources}'
map[limits:map[memory:100Mi] requests:map[memory:100Mi]]
You’ll note right away that in addition to the limit we set, the pod now has a memory request. When you set a limit but not a request, kubernetes defaults the request to the limit. If you think about it from the scheduler’s perspective this makes sense. We’ll talk more about the request below. Once the pod is up we can see how docker has configured the container and the process’s memory cgroup:
$ docker ps | grep busy | cut -d' ' -f1
8fec6c7b6119
$ docker inspect 8fec6c7b6119 --format '{{.HostConfig.Memory}}'
104857600
$ ps ax | grep /bin/sh
29532 ?     Ss     0:00 /bin/sh -c while true; do sleep 2; done
$ sudo cat /proc/29532/cgroup
...
6:memory:/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11
$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11/memory.limit_in_bytes
104857600
As you can see docker set up the process’s memory cgroup with the appropriate limit based on our containerSpec. But what does this actually mean at run time? Linux memory management is a complex topic, but what’s important for kubernetes engineers to know is this: when a host comes under memory pressure the kernel may elect to kill processes. If a process is using more memory than its limit it moves toward the top of the list of potential victims [not really, see update below]. Since kubernetes’ job is to pack as much stuff onto a node as possible, memory pressure on those hosts is not uncommon. If your container is using too much memory it is likely to be oom-killed. If it is, docker will be notified by the kernel, kubernetes will find out from docker, and depending on settings may try to restart the pod.
[UPDATE, part 1] Reader Ej Campbell correctly pointed out that the above paragraph is wrong on a couple of important points, and that led me to do some additional research on the implementation of the oomkiller in linux. Documentation on the topic is sparse and sometimes out of date, but I’ll give a brief summary of how I believe it works. Processes that are not in memory cgroups are handled by the global oomkiller, and when the kernel is unable to allocate pages it will essentially kill the process using the most physical ram, scaled by a factor called the oom adjust score that is used to protect important processes. Processes that are in memory cgroups are affected by the cgroup oomkiller, which will always kill them if they set a limit and then exceed it. In these cases you’ll see a log message from the oomkiller that begins Memory cgroup out of memory: Kill process ... [/UPDATE]
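If you want to watch the cgroup oomkiller in action, one way is to run a container whose limit is lower than the amount of memory it tries to allocate. This is only a sketch: the polinux/stress image and the stress flags are my own choice and not part of the walkthrough above, and the kubectl flags assume the same (now deprecated) --limits syntax used earlier in this post:
$ kubectl run limit-test-oom --image=polinux/stress --limits "memory=100Mi" --command -- stress --vm 1 --vm-bytes 150M
$ kubectl get pods | grep limit-test-oom
...
$ kubectl describe pod <limit-test-oom pod name> | grep -A 3 "Last State"
...
You should see the container repeatedly terminated with reason OOMKilled, and the node’s kernel log (dmesg) should contain the Memory cgroup out of memory message quoted above.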
So what about the memory request that kubernetes created by default in our pod? Does having a 100Mi memory request affect the cgroup? Perhaps it sets the memory.soft_limit_in_bytes property that we saw earlier? Let’s look:
$ sudo cat /sys/fs/cgroup/memory/kubepods/burstable/pod88f89108-daf7-11e8-b1e1-42010a800070/8fec6c7b61190e74cd9f88286181dd5fa3bbf9cf33c947574eb61462bc254d11/memory.soft_limit_in_bytes
9223372036854771712
As you can see the soft limit is still set to the default “no limit” value. Even though docker supports setting the soft limit through the --memory-reservation argument to docker run, kubernetes does not make use of it. Does that mean specifying a memory request for your container is not important? No it doesn’t. If anything requests are more important than limits. Limits tell the linux kernel when to consider your process a candidate for freeing up memory. Requests help the kubernetes scheduler figure out where it can run your pod. Not setting them, or setting them artificially low, can have bad effects.
For example, suppose you run a pod with no memory request and a high limit. As we just saw kubernetes will default the request to the limit, and if no node has that much ram available the pod will fail to schedule even though its actual requirements might be much less. On the other hand if you run a pod with an artificially low request you just encourage the kernel to oom-kill it. Why? Let’s assume your pod normally uses 100 Mi of ram but you run it with a 50 Mi request. If you have a node with 75 Mi free the scheduler may choose to run the pod there. When pod memory consumption later expands to 100 Mi this puts the node under pressure, at which point the kernel may choose to kill your process [not exactly, see update below]. So it is important to get both memory requests and memory limits right.
[UPDATE, part 2] Continuing with the update inspired by reader Ej Campbell’s note: the paragraph above should have made clear that while requests are a very important input to pod scheduling decisions, they also play a role at execution time. As noted in the update above, a container that sets and exceeds a cgroup hard memory limit will be oomkilled regardless of the global memory availability on the node. If the node does come under memory pressure then the kubelet may decide to evict pods that are using more than their requested amount of memory, whether or not the pod has hit the hard limit. [/UPDATE]
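Putting those points together, a reasonable starting point is a request that reflects what the container actually uses in normal operation and a limit that leaves it some headroom. The numbers below are purely illustrative and should come from observing your own workload:
resources:
  requests:
    memory: 100Mi   # roughly the normal working set; used by the scheduler and by kubelet eviction decisions
  limits:
    memory: 200Mi   # hard ceiling enforced by the memory cgroup; exceeding it gets the container oom-killed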
Hopefully this post has helped to clarify how kubernetes container memory limits are set and implemented, and why it is important that you set these limits on containers in your own pods. Kubernetes can do an impressive job of scheduling your pods intelligently and maximizing the utilization of your cloud compute resources, provided that you give it the information it needs. In the next post we’ll look at how cpu limits work, and briefly touch on how to set default requests and limits per namespace.