Kubernetes resource limits and kernel cgroups

Krishnakumar R
Nov 27, 2019


Resource limits provide a way to manage the resource consumption of application pods. In this article we first look at how users can set memory resource limits for containers in a pod. Then we look at how the container runtimes collaborate with Kubernetes to make this happen. Following this, we run some experiments with cgroups and bpftrace to get insights into how the Linux kernel handles the memory limits.

Kubernetes resource limit

The following is an example pod yaml, modelled after the sample busybox.yaml, with resource limits set:

apiVersion: v1
kind: Pod
metadata:
  name: busybox1
  labels:
    app: busybox1
spec:
  containers:
  - image: busybox
    command:
    - sleep
    - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
    resources:
      limits:
        memory: "20Mi"
  restartPolicy: Always

The resource limit for memory here is set to 20MiB. How does Kubernetes establish these limits for the containers? The answer lies in the collaboration between Kubernetes and the underlying container runtime, such as containerd.
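Assuming the manifest above is saved as busybox1.yaml (the file name here is just for illustration), the pod can be created and its limits inspected with the usual kubectl flow:

% kubectl apply -f busybox1.yaml
% kubectl describe pod busybox1 | grep -A 2 Limits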

Container runtimes like containerd are ultimately responsible for running the containers on the nodes. The Container Runtime Interface (CRI) provides a standardised way for Kubernetes to communicate with the container runtimes: the Kubelet on each node calls into the container runtime via CRI.

When a container is to be created the Kubelet on the node will call the underlying container runtime via the CreateContainer call in the ContainerManager interface:

CreateContainer(podSandboxID string,
    config *runtimeapi.ContainerConfig,
    sandboxConfig *runtimeapi.PodSandboxConfig) (string, error)

The resource limits are filled into OS specific container config and then sent to the container runtime.

type LinuxContainerResources struct {
    <<..>>
    // Memory limit in bytes. Default: 0 (not specified).
    MemoryLimitInBytes int64
    <<..>>
}

For example, in a Linux environment, the LinuxContainerResources struct embedded within the ContainerConfig contains the MemoryLimitInBytes field, which communicates the memory limit to be set. CreateContainer is called with this ContainerConfig parameter; the runtime interprets these values and sets the resource limits using cgroups.
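As a minimal sketch of the unit conversion involved (this is not the kubelet's actual code, and it assumes the k8s.io/apimachinery module is available), the "20Mi" quantity from the pod spec parses to the byte value that ends up in MemoryLimitInBytes:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

func main() {
	// Parse the quantity exactly as it appears in the pod spec.
	q := resource.MustParse("20Mi")
	// Value() returns the canonical int64 byte count: 20 * 1024 * 1024.
	fmt.Println(q.Value()) // 20971520
}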

Container runtime

To understand the container runtime's side of the operation, let's spin up the busybox container with the following yaml:

% cat busybox-resources.yaml
apiVersion: v1
kind: Pod
metadata:
  name: busybox0
  labels:
    app: busybox0
spec:
  containers:
  - image: busybox
    command:
    - sleep
    - "3600"
    imagePullPolicy: IfNotPresent
    name: busybox
    resources:
      requests:
        memory: "10Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
  restartPolicy: Always

When the container is created on the node, a path is created in the cgroup sysfs. For example, on an aks-engine cluster, the path for this specific container looks like:

% pwd
/sys/fs/cgroup/memory/kubepods/burstable/pod2d42976a-6d2c-4d1e-aa52-c0fc8e3964a5/9aa8bbbe72708633daf2ee74246be0d3b965a3a2878e46e12fe0b29df34fb3db

Now let’s check the memory limit set here:

% cat memory.limit_in_bytes
67108864

As can be seen, the runtime has applied the memory limit from the pod spec via the cgroup entry: 64 MiB = 64 * 1024 * 1024 = 67108864 bytes.
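The same directory exposes other useful files from the cgroup v1 memory controller; for example (the values will vary with the pod's activity):

% cat memory.usage_in_bytes   # memory currently charged to this cgroup
% cat memory.failcnt          # number of times the limit was hit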

Cgroup memory limits and the Linux kernel

The resource consumption of containers is enforced by the cgroups feature in the Linux kernel. bpftrace is a tool for kernel performance analysis and tracing; it leverages the BPF (Berkeley Packet Filter) subsystem in the Linux kernel. Let's see how the memory limits are enforced by the kernel by running an experiment involving memory allocation, cgroups and some bpftrace magic :-)
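As a quick sanity check that the kernel functions we want are visible for tracing, bpftrace can list the matching kprobes (the exact set depends on the kernel version):

% bpftrace -l 'kprobe:mem_cgroup_*'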

Consider the following simple C program, which allocates memory by calling malloc and then writes to the allocated locations. When data is written to those locations, the kernel takes page faults, and these page faults result in real physical pages being allocated and mapped into the memory backing the malloc buffer.

#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE (4*1024)

/* Wait for a keypress so we can set up cgroups and tracing
 * between the phases of the program. */
int get_key() {
    printf("\nPress any key to continue...\n");
    return getc(stdin);
}

int main(int argc, char **argv) {
    char c __attribute__((unused));
    unsigned char *p;
    int count = 0;

    if (argc < 2) {
        printf("Usage: mem <pages to allocate>\n");
        return -1;
    }

    int alloc_mem = atoi(argv[1]) * PAGE;
    printf("Pid: %d.\n", getpid());
    printf("Page allocation requested: %d.\n", alloc_mem);
    printf("Yet to call malloc.\n");
    c = get_key();

    /* Virtual allocation only; no physical pages back p yet. */
    p = malloc(alloc_mem);
    printf("Malloc called. No writes yet.");
    c = get_key();

    /* Writing each byte faults in the physical pages. */
    for (int i = 0; i < alloc_mem; i++) {
        p[i] = 1;
    }
    for (int i = 0; i < alloc_mem; i++) {
        if (p[i] == 1) count++;
    }
    printf("Alloc in bytes: %d\n", alloc_mem);
    printf("Page count: %d\n", count / PAGE);
    c = get_key();
}
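Assuming the program is saved as mem.c (the file name is just for illustration), it can be built with gcc; -O0 keeps the compiler from optimizing away the page-touching loops:

% gcc -O0 -o mem mem.c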

For ease of understanding, the program allocates memory in whole pages. On running it, we get a prompt along with the pid of the process:

% ./mem 21
Pid: 3088.
Page allocation requested: 86016.
Yet to call malloc.
Press any key to continue…

After some code travel (;-)) in the linux/mm/ directory of the Linux kernel source, we can see that the function called to check whether an allocation fits within the cgroup's limits is named try_charge. Let's use bpftrace to trace try_charge in the kernel, requesting two maps: one which counts the return values (@ret) and one which counts the stack traces (@[kstack]). Start bpftrace by running the following:

% bpftrace -e 'kretprobe:try_charge /pid == 3088/ { @ret[retval] = count(); @[kstack] = count(); }'
Attaching 1 probe…

We are using cgroups v1 for these experiments. Under cgroups v1, memory that has been swapped out is not charged against the cgroup's memory.limit_in_bytes, so we switch off swap before proceeding further.
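Swap can be turned off for the whole node with:

% swapoff -a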

Now let's create a cgroup for the process and set a resource limit of 20 pages (note that this was run on the x86_64 architecture, hence the 4 KiB pages). The following commands are run in the /sys/fs/cgroup/memory directory. First we create a directory, then write the pid of our 'mem' process into cgroup.procs, which moves the process under the cgroup-mem-demo cgroup. Then we set the memory limit by writing the value to memory.limit_in_bytes:

% mkdir cgroup-mem-demo
% cd cgroup-mem-demo/
% echo 3088 > cgroup.procs
% echo 81920 > memory.limit_in_bytes
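Reading the file back confirms the limit took effect (the kernel rounds the limit down to a multiple of the page size; 81920 is already exactly 20 pages):

% cat memory.limit_in_bytes
81920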

We then proceed past the prompts in the memory program so that it calls malloc and then writes to the allocated locations. The following appears on the screen as a result:

Malloc called. No writes yet.
Press any key to continue…
Killed

The mem process is killed because it tried to use more memory than the limit set: 20 pages. Now we can exit the bpftrace session we started earlier, which gives the following output:

@[
kretprobe_trampoline+0
__handle_mm_fault+2270
handle_mm_fault+177
__do_page_fault+641
do_page_fault+46
do_async_page_fault+81
async_page_fault+69
]: 21
@ret[4294967284]: 1
@ret[0]: 20

As can be seen above, try_charge returned 0 (success) for 20 allocations, matching the 20 single-page allocations mem.c made. It failed on the 21st allocation, since the cgroup limit_in_bytes is set to 20 pages (20 * 4096 = 81920 bytes); the value 4294967284 is -12, i.e. -ENOMEM, printed as an unsigned integer. Voila!! The resource limits are thus enforced by the kernel :-).

Summary

In this article we looked at how Kubernetes works in conjunction with the container runtime to enforce memory resource limits. We ran experiments using a sample program that allocates and accesses memory, and saw how the Linux kernel enforces the memory limit.

Until next time, bye!! And take care :-)

Acknowledgements

Thank you — Brian, Anish and Kal for reviewing drafts of this article and providing their valuable feedback.

Follow Krishnakumar on Twitter at https://twitter.com/kkwriting .
