Kubernetes resource limits and kernel cgroups

Resource limits provide a way to manage the resource consumption of application pods. In this article we first look at how users can set memory resource limits for the containers in a pod. Then we look at how the container runtime collaborates with Kubernetes to make this happen. Following this, we run some experiments with cgroups and bpftrace to get insights into how the Linux kernel enforces the memory limits.

Kubernetes resource limit

The following is an example pod yaml, modelled after the sample busybox.yaml, with resource limits set:
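A minimal manifest along these lines (the pod and container names here are just placeholders) looks like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: busybox-mem-demo
    spec:
      containers:
      - name: busybox
        image: busybox
        command: ["sh", "-c", "sleep 3600"]
        resources:
          limits:
            memory: "20Mi"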

The resource limit for memory here is set to 20MiB. How does Kubernetes establish these limits for the containers? The answer lies in the collaboration between Kubernetes and the underlying container runtime, such as containerd.

Container runtimes like containerd are ultimately responsible for running the containers on the nodes. The Container Runtime Interface (CRI) provides a standardised way for Kubernetes to communicate with the container runtimes. The diagram below depicts the calls from the Kubelet to the container runtime via CRI.

When a container is to be created, the Kubelet on the node calls the underlying container runtime via the CreateContainer call in the ContainerManager interface:
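The relevant part of that interface, abridged from the CRI API's Go definition (other methods are omitted and the exact signature can vary between versions), looks roughly like this:

    // ContainerManager contains methods to manipulate containers managed by a
    // container runtime (abridged).
    type ContainerManager interface {
        // CreateContainer creates a new container in the specified PodSandbox.
        CreateContainer(podSandboxID string, config *runtimeapi.ContainerConfig, sandboxConfig *runtimeapi.PodSandboxConfig) (string, error)
        // StartContainer starts the created container.
        StartContainer(containerID string) error
        // ... other lifecycle methods elided
    }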

The resource limits are filled into an OS-specific container config and then sent to the container runtime.

For example, in a Linux environment, the LinuxContainerResources struct embedded within the ContainerConfig contains a MemoryLimitInBytes field, which is used to communicate the memory limit to be set. CreateContainer is called with this ContainerConfig as a parameter. These values are interpreted by the runtime, and the resource limits are set using cgroups.
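Abridged from the generated CRI types (only a few fields are shown here; struct tags and the remaining fields are omitted), the shape is roughly:

    // LinuxContainerResources specifies Linux-specific container resources (abridged).
    type LinuxContainerResources struct {
        CpuPeriod          int64 // CPU CFS period in microseconds
        CpuQuota           int64 // CPU CFS quota in microseconds
        CpuShares          int64 // CPU shares (relative weight)
        MemoryLimitInBytes int64 // memory limit in bytes; 0 means unlimited
        // ... other fields elided
    }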

Container runtime

In order to understand the container runtime’s operations, let’s spin up the busybox pod using the yaml shown above.
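Assuming the manifest above is saved as busybox.yaml (the filename, like the pod name, is just a placeholder), the pod can be created and located with:

    kubectl apply -f busybox.yaml
    kubectl get pod busybox-mem-demo -o wide    # shows which node the pod landed on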

When the containers are created on the node, a corresponding path is created in the cgroup filesystem. For example, on an aks-engine cluster, the following is what the path looks like for the specific container:
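The exact path depends on the pod’s QoS class, the pod UID and the container ID; with the cgroupfs cgroup driver it has roughly this shape (the IDs below are placeholders, and for Guaranteed pods the QoS directory is omitted):

    /sys/fs/cgroup/memory/kubepods/<qos-class>/pod<pod-uid>/<container-id>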

Now let’s check the memory limit set here:
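Inside that container’s directory, memory.limit_in_bytes holds the limit. For a 20MiB limit it should read 20 * 1024 * 1024 bytes:

    cat memory.limit_in_bytes
    20971520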

As can be seen, the runtime has recorded the memory limit it wants the kernel to enforce via the cgroup entries.

Cgroups memory limits and Linux kernel

The resource consumption of containers is enforced by the cgroups feature in the Linux kernel. bpftrace is a tool for kernel performance analysis and tracing. It leverages the BPF (Berkeley Packet Filter) subsystem in the Linux kernel. Let’s see how the memory limits are enforced by the kernel by running an experiment involving memory allocation, cgroups and some bpftrace magic :-)

Let’s consider the following simple C program, which allocates memory by calling malloc and then writes to the allocated locations. As shown in the diagram above, when data is written to the allocated locations, the kernel takes page faults. These page faults result in real physical pages being allocated and mapped to the memory corresponding to the malloc buffer.
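The original listing is not reproduced here; a minimal sketch along these lines (call it mem.c) captures the same behaviour: print the pid, wait at a prompt, then allocate and touch memory one page at a time.

    /* mem.c: allocate and touch memory one page at a time (sketch) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        printf("pid: %d\n", (int)getpid());
        printf("press enter to start allocating...\n");
        getchar();

        for (int i = 1; ; i++) {
            char *buf = malloc(PAGE_SIZE);   /* reserve one page worth of memory */
            if (buf == NULL) {
                perror("malloc");
                return 1;
            }
            memset(buf, 'x', PAGE_SIZE);     /* write to it so the kernel faults in a real page */
            printf("touched page %d\n", i);  /* the buffer is intentionally never freed */
            sleep(1);                        /* slow down so the progress is visible */
        }
        return 0;
    }

It can be compiled and started with something like gcc -o mem mem.c && ./mem.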

For ease of understanding, we just allocate memory a page at a time. On running the above program, we get a prompt which tells us the pid of the process.

After some code travel (;-)) in the linux/mm/ directory of the Linux kernel source, we can see that the function which gets called to check whether a cgroup is within its limits is named try_charge. Let’s use bpftrace to trace the try_charge function in the kernel. We request bpftrace to build two maps: one which stores the return values (@ret) and one which stores the kernel stack traces (@[kstack]). Let’s start bpftrace by running the following:
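An invocation along these lines does the job (this is a sketch, not necessarily the exact command used originally; it needs root, and on newer kernels the traceable symbol may be try_charge_memcg instead of try_charge):

    bpftrace -e '
    kprobe:try_charge    { @[kstack] = count(); }
    kretprobe:try_charge { @ret[retval] = count(); }
    '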

We are using cgroups v1 for these experiments. In this case, memory that is swapped out is not charged against the memory limit of the process’s cgroup, hence we switch off swap before proceeding further.
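Swap can be turned off for the duration of the experiment with:

    swapoff -a        # disable all swap areas
    swapon --show     # prints nothing once swap is off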

Now let’s create a cgroup for the corresponding process and set a resource limit of 20 pages (note that this was run on the x86_64 architecture, hence 4K pages). The following commands are run in the /sys/fs/cgroup/memory directory. First we create a directory and then write the pid of our ‘mem’ process into its cgroup.procs file, which moves the process under the cgroup-mem-demo cgroup. Then we set the memory limit by writing the value to memory.limit_in_bytes.
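Concretely, with <pid> standing in for the pid printed by the mem program, the steps are:

    cd /sys/fs/cgroup/memory
    mkdir cgroup-mem-demo
    echo <pid> > cgroup-mem-demo/cgroup.procs             # move the mem process into the new cgroup
    echo 81920 > cgroup-mem-demo/memory.limit_in_bytes    # 20 pages * 4096 bytes = 81920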

We then proceed past the prompt in the mem program so that it continues to call malloc and write to the allocated locations. The following appears on the screen as a result:
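Along these lines (the exact output will differ from run to run; the important part is the shell reporting that the process was killed):

    $ ./mem
    pid: <pid>
    press enter to start allocating...

    touched page 1
    touched page 2
    ...
    Killed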

The mem process is killed since it tried to use more memory than the limit we set (20 pages). Now we can exit the bpftrace session we started earlier, which gives the following output.

As can be seen above, try_charge returned 0 (success) for 20 allocations, which matches the 20 single-page allocations that mem.c made. But it failed to return success on the 21st allocation, since the memory limit set via the cgroup’s limit_in_bytes is 20 pages (20 * 4 * 1024 = 81920 bytes). Voila!! The resource limits are thus enforced by the kernel :-).

Summary

In this article we looked at how Kubernetes works in conjunction with the container runtime to enforce memory resource limits. We ran experiments using a sample program which allocates and accesses memory, and looked at how the Linux kernel enforces the memory limit.

Until next time, Bye !! and take care :-)

Acknowledgements

Thank you — Brian, Anish and Kal for reviewing drafts of this article and providing their valuable feedback.

Follow Krishnakumar on Twitter at https://twitter.com/kkwriting .
