How to Monitor OOM Kills on GKE

Martin Beranek
Published in Ackee
4 min read · Apr 21, 2022

TL;DR: skip to the section "Ok, so what did you do?"

You might be wondering why we even bother with this “how to”. You can just go to Google Cloud Monitoring, find a correctly labeled metric, and move on. That might be true by the time you are reading this article, but it wasn’t in my case. For some reason, monitoring Out Of Memory (OOM) kills on GKE doesn’t seem to be important. The situation might have changed since then, so go and check. If it hasn’t, let’s continue.

Using metrics

As I was trying to explain, my situation currently looks like this:

There is no metric called OOMKilled, only metrics related to memory.

You can watch your memory metrics and try to detect sudden spikes. That’s not going to help you when the spike happens too fast: monitoring doesn’t have the sampling frequency needed to notice it. If the memory grows slowly over hours instead, you won’t notice either, because there is no spike at all. And any indirect measurement loses its meaning when you can simply ask the Kubernetes control plane whether any of the pods reported OOMKilled.
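Just to make that limitation concrete, here is roughly what the indirect, metrics-based check could look like against the Cloud Monitoring API in Python. This is a sketch of my own, not part of the final solution; the project ID and the 50 MiB "jump" threshold are arbitrary placeholders.

import time

from google.cloud import monitoring_v3

PROJECT_ID = "my-project"  # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 3600}, "end_time": {"seconds": now}}
)

# Pull the last hour of container memory usage; you could narrow this
# further with resource label filters (namespace, pod, and so on).
results = client.list_time_series(
    request={
        "name": f"projects/{PROJECT_ID}",
        "filter": 'metric.type = "kubernetes.io/container/memory/used_bytes"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    values = [p.value.int64_value for p in series.points]  # newest point first
    # A crude "spike" heuristic: a fast OOM kill can happen entirely between
    # two samples and never show up here, which is exactly the problem.
    if len(values) >= 2 and values[0] - values[1] > 50 * 2**20:
        print("memory jumped for", series.resource.labels["pod_name"])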

How about checking logs?

The thing is that OOM kills do not even get logged properly; the pod-level state shows up only when the deployment changes. Let’s use the following deployment for testing:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-tester
  namespace: development
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-tester
  template:
    metadata:
      labels:
        app: oom-tester
    spec:
      containers:
        - name: test
          image: ubuntu
          command:
            - "perl"
            - "-wE"
            - "my @xs; for (1..2**20) { push @xs, q{a} x 2**20 }; say scalar @xs;"
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"

You might think this generates a log message in Logs Explorer. Well, no, not exactly. If you create a query in Logs Explorer with OOMKill, you get the following:

"I0417 15:15:46.250863    1833 log_monitor.go:160] New status generated: &{Source:kernel-monitor Events:[{Severity:warn Timestamp:2022-04-17 15:15:45.770989804 +0000 UTC m=+117.097198737 Reason:OOMKilling Message:Memory cgroup out of memory: Killed process 4817 (perl) total-vm:138468kB, anon-rss:130352kB, file-rss:4548kB, shmem-rss:0kB, UID:0 pgtables:308kB oom_score_adj:983}] Conditions:[{Type:KernelDeadlock Status:False Transition:2022-04-17 15:13:53.38722946 +0000 UTC m=+4.713438426 Reason:KernelHasNoDeadlock Message:kernel has no deadlock} {Type:ReadonlyFilesystem Status:False Transition:2022-04-17 15:13:53.387229627 +0000 UTC m=+4.713438553 Reason:FilesystemIsNotReadOnly Message:Filesystem is not read-only}]}"

This log comes from node-problem-detector, so the system knows perl is doing something wrong, but there is no mention of which pod is having the issue. Furthermore, the log has the resource type Kubernetes Node, not Kubernetes Cluster, Container, or Pod. It wouldn’t be the first place I’d go to check.

Let’s change the deployment so it doesn’t generate OOM kills in a loop and just sleeps for a while instead:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: oom-tester
  namespace: development
spec:
  replicas: 1
  selector:
    matchLabels:
      app: oom-tester
  template:
    metadata:
      labels:
        app: oom-tester
    spec:
      containers:
        - name: test
          image: ubuntu
          command:
            - "sleep"
            - "999999"
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "128Mi"
              cpu: "500m"

Now, among many other logs, we can see the message we were hoping for:

terminated: {
containerID: "containerd://19fca44bab42c657219189547d841fe31c2d4a6aced9df0036adc88f0418c760"
exitCode: 137
finishedAt: "2022-04-17T15:31:07Z"
reason: "OOMKilled"
startedAt: "2022-04-17T15:31:06Z"
}

It’s now obvious that the pod’s state is reported only once it is changed by kubectl (here, by applying the new deployment), not by the failure of the pod itself.

Well, Martin, you are not the first person who had this issue

Duh! I am aware of that. I was googling for anything reasonable, and all I could find was the approach of monitoring messages from the container socket. One example is kubernetes-oomkill-exporter. It checks the Docker socket and exports the OOM kills as a Prometheus metric. It also contains a DaemonSet manifest to make it work for you.

Kubernetes-oomkill-exporter seems cool. It might have a few security issues, but since it is a monitoring tool with public source code, that shouldn’t be such a deal-breaker. The problem is that Google Cloud Monitoring doesn’t support Prometheus out of the box. You would need to install workload monitoring, which is not directly designed for this purpose, and the representation of Prometheus metrics in Google Cloud Monitoring is not as nice as in Prometheus itself.

Ok, so what did you do?

Glad you finally asked. Well, my solution is super simple: scrape the Kubernetes control plane, check whether any pod has the state OOMKilled, and report it to Cloud Monitoring. That’s it.

Pod statuses are available from the Kubernetes API under the following URL:

import requests

# List all pods in the namespace; the response includes each pod's status.
response = requests.get(
    f"{APISERVER}/api/v1/namespaces/{NAMESPACE}/pods",
    verify=f"{SERVICEACCOUNT}/ca.crt",
    headers={'Authorization': f"Bearer {TOKEN}"}
)
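For context, here is how those variables are typically populated when the script runs inside a pod, using the standard in-cluster service account paths. This is a minimal sketch of the usual setup, not necessarily the exact code from the module:

# Standard paths mounted into every pod that uses a service account.
SERVICEACCOUNT = "/var/run/secrets/kubernetes.io/serviceaccount"
APISERVER = "https://kubernetes.default.svc"

with open(f"{SERVICEACCOUNT}/namespace") as f:
    NAMESPACE = f.read().strip()

with open(f"{SERVICEACCOUNT}/token") as f:
    TOKEN = f.read().strip()

The service account also needs RBAC permission to list pods in the namespace it scrapes.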

The check just compares the last known state of each container in the pod:

# container_statuses is one entry of the pod's status.containerStatuses list.
last_state = container_statuses.get('lastState', {})
if not last_state:  # skip containers with no recorded previous state
    continue
for k, v in last_state.items():
    if k == "terminated":
        if v.get('reason') == "OOMKilled":
            report_pod = True
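The reporting side isn’t shown above; conceptually, whenever report_pod is set, one data point gets written to a custom metric in Cloud Monitoring. Here is a rough sketch using the google-cloud-monitoring client, where the metric type custom.googleapis.com/oom_kill and the labels are my placeholders rather than what the module actually uses:

import time

from google.cloud import monitoring_v3

def report_oom_kill(project_id: str, pod_name: str) -> None:
    """Write a single data point to a custom OOM-kill metric."""
    client = monitoring_v3.MetricServiceClient()

    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/oom_kill"  # placeholder name
    series.metric.labels["pod_name"] = pod_name
    series.resource.type = "global"
    series.resource.labels["project_id"] = project_id

    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1}})
    series.points = [point]

    client.create_time_series(name=f"projects/{project_id}", time_series=[series])

Custom metric descriptors under custom.googleapis.com are created automatically on the first write, so no extra setup is needed for the metric itself.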

If you are interested, you can deploy the whole thing with the Terraform module. It also installs a Grafana dashboard with an alert. Once an OOM kill happens, you will receive an alert with the pod name. The alert closes once the pod changes to a state without an OOM kill.

Final notes

Even my solution doesn’t seem to be good enough. If you don’t have Prometheus in your cluster, your hands may be tied and you have to rely only on Google Cloud Monitoring. Maybe Google will create a new metric one day and my work will become worthless. And that would be fine!

If you find anything better, please let me know. I will gladly mention you in the introduction of this article with a link to the better solution.

Originally published at https://www.ackee.agency on April 21, 2022.
