How to detect OOMKilled containers on the GKE environment

Takashi Kagimoto
The Zeals Tech Blog
3 min read · Oct 20, 2020

Introduction

When the kernel of an operating system detects an out-of-memory (OOM) condition, it terminates one or more of the processes responsible for the memory shortage. Those processes get OOMKilled. Docker containers are no exception: when a container's memory usage exceeds the limit configured in its manifest, it gets OOMKilled. OOMKilled containers that belong to Deployment or DaemonSet resources are restarted automatically, while those belonging to CronJob resources are not recovered.

The CronJob resource is one of the important components in our service, Zeals, which runs on the Google Kubernetes Engine (GKE) environment as described in this document (Japanese only). It regularly enqueues lists of chatbot users to whom messages will be sent. If a cron job fails because of an OOMKilled condition, we need to re-enqueue the list so that the messages are sent properly. In other words, we need to detect which containers get OOMKilled.

This post describes how we detect OOMKilled containers in our GKE environment.

Candidate applications to detect OOMKilled containers

We found several applications that might detect the OOMKilled condition:

- kubernetes-event-exporter
- kubernetes-oom-event-generator
- kubernetes-oomkill-exporter

Unfortunately, the above applications are NOT suitable for our purpose. Our reasons are as follows:

kubernetes-event-exporter

This can notify us only of the name of the node on which one or more containers were OOMKilled, so we cannot identify which container was OOMKilled.

kubernetes-oom-event-generator

This can notify us that a container's previous status was OOMKilled, which means we cannot detect an event the first time it happens. That works for Deployment or DaemonSet resources, whose containers are restarted and therefore have a previous status, but not for CronJob resources, whose containers are never restarted.

kubernetes-oomkill-exporter

This sends messages to Cloud Logging when containers get OOMKilled. Looking at the messages carefully, however, we realized that most of them were not related to the OOMKilled condition we want to detect. What were they? Because some of our Docker containers are burstable, OOMKilled conditions seem to be recorded whenever memory overcommitment happens on the node, not only when a container exceeds its own limit.

What to do next?

We developed a Go application using kubernetes-oomkill-exporter as a reference. The main processing flow is as follows:

  1. Obtain a container list by calling the Docker API through the socket file, /var/run/docker.sock, which is mounted into the container of our application.
  2. Pick up containers whose exit code is 137.
  3. Obtain detailed information on those containers.
  4. Send messages to the standard output (stdout) stream if State.OOMKilled is true.
  5. Repeat steps 1 to 4 at regular intervals.

Now we are going to briefly explain several major functions of our application.

Obtain container list

We use the ContainerList function to obtain the container list. Because we need information about containers that have already finished, the optional flag All is set to true.
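
A minimal sketch of this step, assuming the Docker Go SDK (github.com/docker/docker); the dockerMonitor wrapper type and the listContainers method name are illustrative, not our exact code:

```go
package main

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// dockerMonitor wraps the Docker API client used by our exporter.
type dockerMonitor struct {
	client *client.Client
}

// listContainers returns every container known to the Docker daemon,
// including ones that have already exited (All: true), so that finished
// CronJob containers are not missed.
func (dm *dockerMonitor) listContainers(ctx context.Context) ([]types.Container, error) {
	return dm.client.ContainerList(ctx, types.ContainerListOptions{All: true})
}
```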

The dm.client object is created roughly as follows.
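
The snippet below is an equivalent sketch rather than our exact code; it assumes the daemon is reachable through the mounted unix:///var/run/docker.sock socket, and the newDockerMonitor constructor name is hypothetical:

```go
// newDockerMonitor creates a Docker client that talks to the daemon
// through the mounted /var/run/docker.sock socket file.
func newDockerMonitor() (*dockerMonitor, error) {
	cli, err := client.NewClientWithOpts(
		client.WithHost("unix:///var/run/docker.sock"),
		client.WithAPIVersionNegotiation(), // avoid client/daemon API version mismatches
	)
	if err != nil {
		return nil, err
	}
	return &dockerMonitor{client: cli}, nil
}
```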

Obtain detailed information on containers

We use the ContainerInspect function to obtain detailed information on a specific container.
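
Continuing the sketch above, an illustrative version of this step combined with the exit-code-137 filter from step 2 of the flow (again, the inspectKilled method name is hypothetical):

```go
// inspectKilled inspects the given containers and returns detailed
// information for those that exited with code 137 (SIGKILL), which is
// the usual signature of an OOM kill.
func (dm *dockerMonitor) inspectKilled(ctx context.Context, containers []types.Container) ([]types.ContainerJSON, error) {
	var killed []types.ContainerJSON
	for _, c := range containers {
		info, err := dm.client.ContainerInspect(ctx, c.ID)
		if err != nil {
			return nil, err
		}
		if info.State != nil && info.State.ExitCode == 137 {
			killed = append(killed, info)
		}
	}
	return killed, nil
}
```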

Create messages to be sent to stdout

By looking at the value of containerInspect.State.OOMKilled, we can tell whether the container was OOMKilled or not.
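
As an illustrative continuation of the sketch (the message format below is a placeholder rather than the exact one we emit, and it additionally needs the fmt package):

```go
// reportOOMKilled writes one line to stdout for every container whose
// State.OOMKilled flag is set; on GKE, Cloud Logging collects stdout.
func reportOOMKilled(killed []types.ContainerJSON) {
	for _, info := range killed {
		if info.State != nil && info.State.OOMKilled {
			fmt.Printf("OOMKilled container detected: name=%s finished_at=%s\n",
				info.Name, info.State.FinishedAt)
		}
	}
}
```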

Our application monitors the condition of containers at a fixed time interval, MONITOR_INTERVAL. Any OOMKilled event that occurs within the interval is reported.
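
Putting the sketched pieces together, the polling loop looks roughly like this (it also needs the log and time packages; in the sketch, MONITOR_INTERVAL is assumed to be parsed elsewhere, for example with os.Getenv and time.ParseDuration):

```go
// run repeats the list/inspect/report cycle every MONITOR_INTERVAL.
func (dm *dockerMonitor) run(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			containers, err := dm.listContainers(ctx)
			if err != nil {
				log.Printf("failed to list containers: %v", err)
				continue
			}
			killed, err := dm.inspectKilled(ctx, containers)
			if err != nil {
				log.Printf("failed to inspect containers: %v", err)
				continue
			}
			// In the real application, State.FinishedAt is also checked so that
			// only events that occurred within the last interval are reported.
			reportOOMKilled(killed)
		}
	}
}
```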

Summary

Since we could not find any existing application that completely served our needs, we developed our own application to detect OOMKilled containers. Now that messages associated with OOMKilled events are available in Cloud Logging, we can easily build a system that notifies us of them using a combination of the Cloud Log Router, Cloud Pub/Sub, and Cloud Functions.
