An experiment on how to detect OOMKilled events on Container-Optimized OS with containerd

Takashi Kagimoto
The Zeals Tech Blog

--

Introduction

Detecting containers that are OOMKilled by the operating system is an important concern in the Google Kubernetes Engine (GKE) environment. In a previous post entitled “How to detect OOMKilled containers on the GKE environment”, I explained how to monitor OOMKilled events with a golang program. That program works only on the Docker runtime environment, which is widely used by GKE users.

According to the official GKE documentation, the Docker runtime environment has been deprecated since GKE node version 1.19, is not recommended as of version 1.20, and will be removed in version 1.23 (see ChangeLog of Kubernetes). Google recommends that GKE users migrate to “the Container-Optimized OS with containerd (cos_containerd) variant, which is now the default GKE node image”. After migrating to that environment, we will no longer be able to detect the OOMKilled events of containers using the program described in the previous post. That is, we need to develop another program to detect these events in the cos_containerd environment.

In this article, we not only show sample program code that detects the OOMKilled events of containers but also explain, through several experiments, how we arrived at the method, because few documents are available about the cos_containerd environment.

Experiments

The experiments are based on a Japanese blog post. We also refer to the source code of containerd.

The program code and GKE manifest that we use in the experiments are available from the author’s repository.

Node information

The node that we use is as follows:

Preparation

To run the experiments, we need two pods: one to execute the golang programs, and another that will be OOMKilled. For the first, we adopt a Docker image of nginx, which is easy to use. The second is a simple web server written in Python, which allocates memory for an array whose size is specified by a query parameter. Note that if we instead chose a simple pod that requires more memory than specified in the manifest, the pod would fail to be spawned, but it would not be OOMKilled (the exit code is 0).

Let’s deploy both.

Note that app-test must be deployed to the same node on which nginx is running so that we can capture events occurring in the app-test pod. Therefore, after deploying nginx, we need to edit the manifest for app-test and set the value of nodeAffinity appropriately.

Check the namespace for cos_containerd

When we run the program listup, described in the Japanese blog, on the nginx pod, we get two namespaces: moby and k8s.io. We should select the latter for the GKE environment.

Capture events of containers

Let’s make app-test OOMKilled. First, a golang program, subscriber1, is copied to the nginx pod and executed.

Let’s see what happens when we execute the following commands:

The following log messages appear in the terminal in which subscribe is executed. The event /tasks/oom is printed as expected. The only other information that we obtain is the ContainerID, so the next step is to convert the ContainerID to the name of a pod.
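The filtering step can be sketched in plain Go as follows. This is only an illustration under stated assumptions: the event struct is a hypothetical stand-in for the envelopes delivered by containerd's event stream (the real client library is not imported, so the snippet stays self-contained), and the function name oomContainerIDs is made up for this example. Only the namespace k8s.io and the topic /tasks/oom come from the experiment itself.

```go
package main

import "fmt"

// event is a hypothetical stand-in for an envelope received from containerd's
// event stream. Only the fields discussed in the text are modelled.
type event struct {
	Namespace   string
	Topic       string
	ContainerID string
}

// oomTopic is the topic observed in the experiment.
const oomTopic = "/tasks/oom"

// oomContainerIDs reads events until the channel is closed and returns the
// ContainerIDs of OOM events that belong to the given namespace.
func oomContainerIDs(events <-chan event, namespace string) []string {
	var ids []string
	for e := range events {
		if e.Namespace == namespace && e.Topic == oomTopic {
			ids = append(ids, e.ContainerID)
		}
	}
	return ids
}

func main() {
	// Made-up events: only the second one is an OOM event in k8s.io.
	ch := make(chan event, 3)
	ch <- event{Namespace: "k8s.io", Topic: "/tasks/start", ContainerID: "aaa"}
	ch <- event{Namespace: "k8s.io", Topic: "/tasks/oom", ContainerID: "bbb"}
	ch <- event{Namespace: "moby", Topic: "/tasks/oom", ContainerID: "ccc"}
	close(ch)

	fmt.Println(oomContainerIDs(ch, "k8s.io")) // prints [bbb]
}
```

In a real subscriber, the channel would be fed by the containerd client rather than filled by hand, and the loop would run for the lifetime of the process instead of returning a slice.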

Convert ContainerID to the name of the pod

A list of containers running on a node can be obtained with the Containers function. Looking at the Container type, we see that Labels provide metadata extension for a container.

Let’s display the list of containers using this program.

The program containers shows the following association between ContainerID and the name of the pod.

The first and third lines correspond to the container that was OOMKilled, and the second line corresponds to the newly created container. We are not sure which line, the first or the third, should be selected for the OOMKilled container, but we adopt the line in which the value of io.cri-containerd.kind is “container”.
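The selection rule described here can be sketched as a small lookup over container labels. This is a minimal, self-contained illustration, not the author's program: containerInfo and podFor are hypothetical names, and the label keys io.kubernetes.pod.name and io.kubernetes.pod.namespace are assumptions that should be checked against the actual output of the containers program. Only io.cri-containerd.kind and its value “container” are taken from the text.

```go
package main

import "fmt"

const (
	// kindLabel is the label named in the text.
	kindLabel = "io.cri-containerd.kind"
	// The two keys below are assumptions; verify them against real label output.
	podNameLabel   = "io.kubernetes.pod.name"
	namespaceLabel = "io.kubernetes.pod.namespace"
)

// containerInfo is a hypothetical stand-in for one listed container:
// its ID plus the labels attached to it.
type containerInfo struct {
	ID     string
	Labels map[string]string
}

// podFor returns "namespace/pod" for the container with the given ID.
// Entries whose kind label is not "container" are skipped.
func podFor(containers []containerInfo, id string) (string, bool) {
	for _, c := range containers {
		if c.ID != id || c.Labels[kindLabel] != "container" {
			continue
		}
		return c.Labels[namespaceLabel] + "/" + c.Labels[podNameLabel], true
	}
	return "", false
}

func main() {
	// Made-up IDs and pod names, for illustration only.
	containers := []containerInfo{
		{ID: "sandbox-1", Labels: map[string]string{
			kindLabel: "sandbox", podNameLabel: "app-test-0", namespaceLabel: "default"}},
		{ID: "container-1", Labels: map[string]string{
			kindLabel: "container", podNameLabel: "app-test-0", namespaceLabel: "default"}},
	}

	pod, ok := podFor(containers, "container-1")
	fmt.Println(pod, ok)
}
```

Returning a boolean alongside the name lets the caller distinguish an unknown ContainerID from a container that simply has empty labels.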

A sample program that detects OOMKilled events

Based on the above experiments, we finally arrive at a sample program that detects OOMKilled events, oomkill.go. A key part of this program is as follows:

Existing problem

Looking at the values of Spec in the Container type of a container, we see Hostname as a member. Unfortunately, its value is not set in the GKE environment. Therefore, we cannot identify the node on which an OOMKilled event occurs.

Summary

We successfully developed a golang program that detects OOMKilled events in the cos_containerd environment of the Google Kubernetes Engine. After polishing this sample program a bit more and deploying it as a DaemonSet resource, we can monitor which containers are OOMKilled even in the cos_containerd environment.
