Tracking Down “Invisible” OOM Kills in Kubernetes

Richard Durso
7 min read · May 17, 2022


An “Invisible” OOM Kill happens when a child process in a container is killed rather than the container’s init process. It is “invisible” to Kubernetes and goes undetected. What is OOM? Well… not a good thing.

Recently I started rebuilding my personal self-hosted Kubernetes cluster using ArgoCD. During my experimentation with ArgoCD, I noticed several “cgroup out of memory” messages on the node’s Linux console (tty device).

Memory cgroup out of memory: killed process (helm) console message

Console error messages will not be shown in SSH sessions. They are not even shown on Linux graphical desktops. They are normally only visible on the tty device (typically the monitor physically attached to a Linux server). Unless you have access to the tty device, you likely will not know processes are being OOM killed. Of course, you can find this buried within the Linux kernel logs, but why would you be looking there unless you already knew something was wrong?

These console OOM messages would appear at seemingly random times, but they did not seem to impact the deployment testing in any noticeable way. After ignoring them for a day or two, the frequency of the console messages started to increase rapidly.

Some observations

  • It was always a “helm” process being killed
  • I was not executing helm commands when these appeared
  • The ArgoCD dashboard would unexpectedly show Helm-based applications as “Out of Sync” or “Unknown”, though the state would eventually clear on its own

I had dealt with messages like this before with Docker, when using Docker Compose to set memory limits on a container, such as:

mem_limit: 256m

With Docker it was pretty straightforward: figure out which container needed a bit more RAM, increase the mem_limit value, and restart the container. Kubernetes is a bit different.
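For context, that Compose limit sits in a service definition something like this (a hypothetical example; the service name and image are made up):

# docker-compose.yml (hypothetical example)
services:
  webapp:
    image: nginx:alpine
    # Hard memory cap; exceeding it triggers the kernel OOM killer for this container
    mem_limit: 256m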

The Prime Suspect

Let me state up front that this is no fault of the ArgoCD application. When installing ArgoCD via its Helm chart, I reviewed the values.yaml file and found resource limits and requests defined but commented out, such as:

# -- Resource limits and requests for the repo server pods
resources: {}
#   limits:
#     cpu: 50m
#     memory: 128Mi
#   requests:
#     cpu: 10m
#     memory: 64Mi

So I enabled these… and not just this set; the values.yaml file has a LOT of these.
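Uncommented, the repo server block looked something like this (the same numbers the chart suggested):

resources:
  limits:
    cpu: 50m
    memory: 128Mi
  requests:
    cpu: 10m
    memory: 64Mi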

It may have been an unreasonable assumption on my part that these were sane, usable numbers refined by years of community testing and feedback.

I expected that enabling these might get some processes Out of Memory (OOM) killed, but I also expected that Kubernetes would know it happened.

Based on the inconsistent application states I was observing in the ArgoCD dashboard, and the fact that ArgoCD was the only application expected to be using the Helm client, it seemed like a reasonable place to start.

But start where? ArgoCD had 7 pods running, none of them showed they were being terminated:

$ kubectl get pods -n argocd 
NAME READY STATUS RESTARTS AGE
argocd-application-controller-0 1/1 Running 0 7d20h
argocd-applicationset-controller-7... 1/1 Running 0 3h48m
argocd-dex-server-6fd8b59f5b-ntr2r... 1/1 Running 0 3h48m
argocd-notifications-controller-55... 1/1 Running 0 7d20h
argocd-redis-79bdbdf78f-cd2mn 1/1 Running 0 7d20h
argocd-repo-server-754cc95f5d-2n9vt 1/1 Running 0 7d20h
argocd-server-57579bcb8b-khsjd 1/1 Running 0 7d20h

NOTE: Output format above modified to fit.

I was hoping that one of the pods listed above would show status OOMKilled or at least have a non-zero restart count.
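One quick way to check for that (a sketch; adjust the namespace to your own) is to ask each pod for its containers’ last terminated state, which is where an OOMKilled reason would normally appear:

kubectl get pods -n argocd -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'

In my case the reason came back empty for every pod, consistent with the zero restart counts above.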

Troubleshooting

I started with the excellent K8s Lens application to review pod logs and check for any references to Exit Code: 137, which is historically returned when a process is terminated by the OOM killer.

I found nothing obvious that would point to which pod had a problem.

If you see OOMKilled or Exit Code: 137, there are plenty of guides on how to troubleshoot it.

But I was observing OOM console messages without containers being terminated. None of the guides I could find referenced this scenario, and I was not exactly sure how to proceed on the Kubernetes side. I’ve been using Linux for a long time, way before the Cloud was even a thing (well, it actually had a different technical meaning back then). I had some ideas.

OOM is a Kernel Issue, not a Kubernetes Issue

Looking for clues in the Linux system logs was the next logical step. My Kubernetes cluster was running on Ubuntu 20.04, which is a systemd-based Linux distribution. The tool to use is journalctl.

$ sudo journalctl --utc -ke

Where:
--utc shows time in Coordinated Universal Time (UTC).
-ke shows only kernel messages (-k) and jumps to the end of the log (-e).
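If you would rather not page through the log interactively, a quick grep over the same kernel messages also works (a simple sketch):

sudo journalctl -k --utc --no-pager | grep -i "out of memory"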

Within journalctl I can search for the most recent OOM message. By default, journalctl uses the less command in the background for paging through the kernel logs. Since we instructed journalctl to jump to the end of the kernel log, the ? key is used to search backwards towards the front of the log (the / key is used to search forward). I searched for the keyword “Killed”:

? Killed

Upon pressing the ENTER key, it found:

kernel: Memory cgroup out of memory: Killed process 18661 (helm) total-vm:748664kB, anon-rss:41748kB, file-rss:0kB, shmem-rss:0kB, UID:999 pgtables:244kB oom_score_adj:992

kernel: oom_reaper: reaped process 18661 (helm), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The process ID (PID) 18661 matches the one in the server console error:

Memory cgroup out of memory: killed process (helm) console message

This is likely the incident we are looking for. Pressing the Up arrow key in journalctl to scroll up a little more, I found this table:

kernel: Tasks state (memory values in pages):
kernel: [pid ] tgid total_vm rss pgtables oom_score_adj name
kernel: [14764] 14764 689 23 40960 992 tini
kernel: [15058] 15058 198502 8297 307200 992 argocd-repo-ser
kernel: [16381] 16381 19580 78 53248 992 gpg-agent
kernel: [18661] 18661 187166 10437 249856 992 helm
kernel: [18796] 18796 187198 7531 221184 992 helm
kernel: [18972] 18967 187055 4934 204800 992 helm

NOTE: Output format above modified to fit.

The key line in the table above is the third from the bottom. That line has a reference to pid 18661 and helm. This is the process which got OOM killed. It appears that two other instances of helm (pid 18796 and pid 18972) may have been running as well.

Perhaps even more important, we have a partial process name, argocd-repo-ser (the kernel truncates task names to 15 characters), pointing at the ArgoCD repo server. We can dig deeper.
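Turning that truncated name into a concrete pod is a one-liner (a simple sketch):

kubectl get pods -n argocd | grep repo-server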

Confirming the Memory Limit

Scrolling up about 20 lines in the journalctl kernel messages, I found a reference to the memory limits applied:

kernel: memory: usage 131072kB, limit 131072kB, failcnt 6809
kernel: memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
kernel: kmem: usage 4604kB, limit 9007199254740988kB, failcnt 0
kernel: Memory cgroup stats for /kubepods/burstable/podaf9...3279ae:
kernel: anon 129208320

A usage of 131072kB and a limit of 131072kB are shown. When memory usage reaches the memory limit, the Linux OOM killer is activated.
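You can also read that limit straight from the node’s cgroup filesystem. This sketch assumes cgroup v1 (which matches the /kubepods/burstable/... path in the log above) and uses a hypothetical pod UID:

# Run on the node itself; <pod-uid> is a placeholder
cat /sys/fs/cgroup/memory/kubepods/burstable/pod<pod-uid>/memory.limit_in_bytes
# 134217728 bytes = 131072 kB = 128 MiB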

Let’s confirm the memory limit assigned to the container in the argocd-repo-server pod:

$ kubectl get pod argocd-repo-server-5569c7b657-kg94z -n argocd \
-o jsonpath="{.spec.containers[*].resources.limits}"
{"cpu":"50m","memory":"128Mi"}
  • The container has a memory limit of 128Mi, where Mi is mebibytes.
  • The value from the kernel log was 131072kB, where the kernel’s kB is really kibibytes.

Are these the same value? A simple Google search of “kibibytes to mebibytes” will find a converter:

Google Converter for the win!
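Or skip the search and do the math by hand: 1 mebibyte is 1024 kibibytes, so 128 × 1024 = 131072. A quick shell check:

echo $((128 * 1024))   # 131072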

They are the same! Now we have a value in the kernel log that exactly matches the memory limit applied to our suspect container.

I can now increase this container’s memory limit with some confidence. I bumped the memory limit by 64Mi, from 128Mi to 192Mi (and increased the memory request by 64Mi, from 64Mi to 128Mi).
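In the Helm values this looked roughly like the following (assuming the chart exposes it under a repoServer.resources key; adjust for whichever component you are tuning):

repoServer:
  resources:
    limits:
      cpu: 50m
      memory: 192Mi
    requests:
      cpu: 10m
      memory: 128Mi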

I have not seen an OOM Killed console message since increasing this value.

Conclusion

When the Linux OOM killer activated, it selected a process within the container to be killed. Apparently, only when the OOM killer selects the container’s init process (PID 1) will the container itself be killed and show a status of OOMKilled. If that container’s restart policy allows it, Kubernetes will then restart the container and you will see the restart count increase.

As I’ve seen, when PID 1 is not selected, some other process inside the container is killed instead. This makes it “invisible” to Kubernetes. If you are not watching the tty console device or scanning kernel logs, you may not know that parts of your containers are being killed. Something to consider when you enable container memory limits.

Starting with Kubernetes 1.28, the behavior of OOM kills appears to have changed: on cgroup v2, the kubelet enables the memory.oom.group setting, which should kill the entire container instead of individual processes within it.


Richard Durso

Deployment Engineer & Infrastructure Architect. Coral Reef Aquarist. Mixologist.