Memory Limit of POD and OOM Killer

https://unsplash.com/photos/BFRdqVAMAhU

In the past few days, some of my Pods kept on crashing and OS Syslog shows the OOM killer kills the container process. I did some research to find out how these thing works.

Pod memory limit and cgroup memory settings

Let's test on K3s. Create a pod setting the memory limit to 123Mi, a number that can be recognized easily.

kubectl run --restart=Never --rm -it --image=ubuntu --limits='memory=123Mi' -- sh
If you don't see a command prompt, try pressing enter.
root@sh:/#

In another shell, find out the uid of the pods,

kubectl get pods sh -o yaml | grep uid
uid: bc001ffa-68fc-11e9-92d7-5ef9efd9374c

At the server where the pod is running, check the cgroup settings based on the uid of the pods,

cd /sys/fs/cgroup/memory/kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c
cat memory.limit_in_bytes
128974848

The number 128974848 is exact 123Mi (123*1024*1024). So its more clear now, Kubernetes set the memory limit through cgroup. Once the pod consume more memory than the limit, cgroup will start to kill the container process.

Stress test

Let's install the stress tools on the Pod through the opened shell session.

root@sh:/# apt update; apt install -y stress

In the meantime, monitoring the Syslog by running dmesg -Tw

Run the stress tool with the memory within the limit 100M first. It’s launched successfully.

root@sh:/# stress --vm 1 --vm-bytes 100M &
[1] 271
root@sh:/# stress: info: [271] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd

Now trigger the second stress test,

root@sh:/# stress --vm 1 --vm-bytes 50M
stress: info: [273] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [271] (415) <-- worker 272 got signal 9
stress: WARN: [271] (417) now reaping child worker processes
stress: FAIL: [271] (451) failed run completed in 7s

The first stress process (process id 271) was killed immediately with the signal 9.

In the meantime, the syslogs shows

[Sat Apr 27 22:56:09 2019] stress invoked oom-killer: gfp_mask=0x14000c0(GFP_KERNEL), nodemask=(null), order=0, oom_score_adj=939
[Sat Apr 27 22:56:09 2019] stress cpuset=a2ed67c63e828da3849bf9f506ae2b36b4dac5b402a57f2981c9bdc07b23e672 mems_allowed=0
[Sat Apr 27 22:56:09 2019] CPU: 0 PID: 32332 Comm: stress Not tainted 4.15.0-46-generic #49-Ubuntu
[Sat Apr 27 22:56:09 2019] Hardware name: BHYVE, BIOS 1.00 03/14/2014
[Sat Apr 27 22:56:09 2019] Call Trace:
[Sat Apr 27 22:56:09 2019] dump_stack+0x63/0x8b
[Sat Apr 27 22:56:09 2019] dump_header+0x71/0x285
[Sat Apr 27 22:56:09 2019] oom_kill_process+0x220/0x440
[Sat Apr 27 22:56:09 2019] out_of_memory+0x2d1/0x4f0
[Sat Apr 27 22:56:09 2019] mem_cgroup_out_of_memory+0x4b/0x80
[Sat Apr 27 22:56:09 2019] mem_cgroup_oom_synchronize+0x2e8/0x320
[Sat Apr 27 22:56:09 2019] ? mem_cgroup_css_online+0x40/0x40
[Sat Apr 27 22:56:09 2019] pagefault_out_of_memory+0x36/0x7b
[Sat Apr 27 22:56:09 2019] mm_fault_error+0x90/0x180
[Sat Apr 27 22:56:09 2019] __do_page_fault+0x4a5/0x4d0
[Sat Apr 27 22:56:09 2019] do_page_fault+0x2e/0xe0
[Sat Apr 27 22:56:09 2019] ? page_fault+0x2f/0x50
[Sat Apr 27 22:56:09 2019] page_fault+0x45/0x50
[Sat Apr 27 22:56:09 2019] RIP: 0033:0x558182259cf0
[Sat Apr 27 22:56:09 2019] RSP: 002b:00007fff01a47940 EFLAGS: 00010206
[Sat Apr 27 22:56:09 2019] RAX: 00007fdc18cdf010 RBX: 00007fdc1763a010 RCX: 00007fdc1763a010
[Sat Apr 27 22:56:09 2019] RDX: 00000000016a5000 RSI: 0000000003201000 RDI: 0000000000000000
[Sat Apr 27 22:56:09 2019] RBP: 0000000003200000 R08: 00000000ffffffff R09: 0000000000000000
[Sat Apr 27 22:56:09 2019] R10: 0000000000000022 R11: 0000000000000246 R12: ffffffffffffffff
[Sat Apr 27 22:56:09 2019] R13: 0000000000000002 R14: fffffffffffff000 R15: 0000000000001000
[Sat Apr 27 22:56:09 2019] Task in /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c/a2ed67c63e828da3849bf9f506ae2b36b4dac5b402a57f2981c9bdc07b23e672 killed as a result of limit of /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c
[Sat Apr 27 22:56:09 2019] memory: usage 125952kB, limit 125952kB, failcnt 3632
[Sat Apr 27 22:56:09 2019] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[Sat Apr 27 22:56:09 2019] kmem: usage 2352kB, limit 9007199254740988kB, failcnt 0
[Sat Apr 27 22:56:09 2019] Memory cgroup stats for /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c: cache:0KB rss:0KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
[Sat Apr 27 22:56:09 2019] Memory cgroup stats for /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c/79fae7c2724ea1b19caa343fed8da3ea84bbe5eb370e5af8a6a94a090d9e4ac2: cache:0KB rss:48KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:48KB inactive_file:0KB active_file:0KB unevictable:0KB
[Sat Apr 27 22:56:09 2019] Memory cgroup stats for /kubepods/burstable/podbc001ffa-68fc-11e9-92d7-5ef9efd9374c/a2ed67c63e828da3849bf9f506ae2b36b4dac5b402a57f2981c9bdc07b23e672: cache:0KB rss:123552KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:123548KB inactive_file:0KB active_file:0KB unevictable:0KB
[Sat Apr 27 22:56:09 2019] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[Sat Apr 27 22:56:09 2019] [25160] 0 25160 256 1 28672 0 -998 pause
[Sat Apr 27 22:56:09 2019] [25218] 0 25218 4627 872 77824 0 939 bash
[Sat Apr 27 22:56:09 2019] [32307] 0 32307 2060 275 57344 0 939 stress
[Sat Apr 27 22:56:09 2019] [32308] 0 32308 27661 24953 253952 0 939 stress
[Sat Apr 27 22:56:09 2019] [32331] 0 32331 2060 304 53248 0 939 stress
[Sat Apr 27 22:56:09 2019] [32332] 0 32332 14861 5829 102400 0 939 stress
[Sat Apr 27 22:56:09 2019] Memory cgroup out of memory: Kill process 32308 (stress) score 1718 or sacrifice child
[Sat Apr 27 22:56:09 2019] Killed process 32308 (stress) total-vm:110644kB, anon-rss:99620kB, file-rss:192kB, shmem-rss:0kB
[Sat Apr 27 22:56:09 2019] oom_reaper: reaped process 32308 (stress), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The process id 32308 on the host is OOM killed. The more interesting stuff is at the last part of the log,

For this pod, there are processes that are the candidates that the OOM killer would select to kill. The basic process pause , which holds the network namespaces is having a oom_score_adj value of -998, is guaranteed not be killed. The rest of the processes in the container are all having the oom_score_adj value of 939. We can validate this value based on the formula from the Kubernetes document as below,

min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

Find out the node allocatable memory by

kubectl describe nodes k3s | grep Allocatable -A 5
Allocatable:
cpu: 1
ephemeral-storage: 49255941901
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 2041888Ki

The request memory is by default the same as limit value if its not set. So we have the oom_score_adj value as 1000–123*1024/2041888=938.32, which is close to the value 939 in the syslog. (I am not sure how exact the 939 is obtained in the OOM killer)

It’s noticed that all the process in the container has the same oom_score_adj value. The OOM killer will calculate the OOM value based on the memory usage and fine tuned with the oom_score_adj value. Finally it kill the first stress process which use the most of the memory, 100M whose oom_score value is 1718.

Conclusion

Kubernetes manages the Pod memory limit with cgroup and OOM killer. We need to be careful to separate the OS OOM and the pods OOM.