Photo by Nilantha Ilangamuwa on Unsplash

Kubernetes events — how to keep historical data of your cluster

Andrzej Kaczynski
5 min read · Mar 4, 2020


Recently, one of my production application pods crashed for unknown reasons. I started the investigation a couple of hours later, when I came to the office. It took me almost an hour to find the root cause, which turned out to be OOMKilled. I couldn’t find anything in the Kubernetes events. I tried many of the available options, but in the end I had to log in to the worker node and find the event message directly in Docker:

docker events -f 'event=oom' --since '10h'

What I found in the Kubernetes documentation is that the kube-apiserver keeps the event history for just one hour. This can be customized by adding a dedicated flag to the kube-apiserver (but most clusters use the default value):

--event-ttl duration     Default: 1h0m0s
Amount of time to retain events.
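If you want to check (or change) this on your own cluster and it was set up with kubeadm, the flag lives in the kube-apiserver static Pod manifest. A quick hedged check — the path below is the usual kubeadm location and may differ on your setup:

# on a control-plane node: see whether --event-ttl is set (no match means the 1h default applies)
grep event-ttl /etc/kubernetes/manifests/kube-apiserver.yaml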

Then I realized that my cluster setup was missing a way to keep historical data for all events. Having it would save me a lot of time in the future, and it should be considered good practice.

I started digging around the Internet for a solution. I usually have an EFK stack installed to keep the logs of my applications, so it seemed like a good place to start: extend it with events and logs from Kubernetes itself.

Then I found Metricbeat, which is an official Elasticsearch beat (a family of shippers for different use cases and sets of data). It comes with a dedicated Kubernetes module and is also available as a Helm chart from the official stable repository. It requires kube-state-metrics to be installed in your cluster.

Installation was quite easy, but I ran into some challenges when connecting to my Elasticsearch instance (AWS Managed Elasticsearch). By default it listens on port 443 (not 9200 as usual) and has no basic authentication (username & password). Also, Metricbeat used in conjunction with AWS Managed Elasticsearch must be the Open Source (OSS) distribution. The official documentation does not mention this, so I had to search for similar issues on GitHub and StackOverflow.
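A quick way to confirm those differences before wiring up Metricbeat is to call the endpoint directly. A minimal sketch, assuming the VPC endpoint placeholder used later in values.yaml and that you are running inside the same VPC:

# AWS Managed Elasticsearch answers on 443 and needs no basic auth from inside the VPC
curl -s "https://vpc-elasticsearch-xxxvvvbbbzzzssss.eu-west-1.es.amazonaws.com:443/_cluster/health?pretty"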

I used Helm v3 to install this on my cluster. This is the installation procedure:

# Add stable repository to your helm
helm repo add stable https://kubernetes-charts.storage.googleapis.com
# Install kube-state-metrics
helm install kube-state-metrics -n kube-system stable/kube-state-metrics
# Install metricbeat
helm install metricbeat -n kube-system stable/metricbeat --values values.yaml
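To confirm that kube-state-metrics came up before Metricbeat starts scraping it, a quick check like this should do (the resource name follows the release name above, so adjust if yours differs):

# confirm kube-state-metrics is up and serving; metricbeat will scrape it on port 8080
kubectl -n kube-system get deploy,svc kube-state-metrics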

The default Helm configuration is sufficient for most deployments. However, the event metricset (the one I am most interested in) is missing, so I had to make a small adjustment in values.yaml to enable it:

image:
  repository: docker.elastic.co/beats/metricbeat-oss
  tag: 6.7.0
daemonset:
  overrideConfig:
    metricbeat.config.modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false
    processors:
      - add_cloud_metadata:
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
deployment:
  overrideConfig:
    metricbeat.config.modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false
    processors:
      - add_cloud_metadata:
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      ssl:
        verification_mode: "none"
  modules:
    kubernetes:
      enabled: true
      config:
        - module: kubernetes
          metricsets:
            - state_node
            - state_deployment
            - state_replicaset
            - state_pod
            - state_container
            - event
          period: 10s
          hosts: ["kube-state-metrics:8080"]
extraEnv:
  - name: ELASTICSEARCH_HOST
    value: "https://vpc-elasticsearch-xxxvvvbbbzzzssss.eu-west-1.es.amazonaws.com"
  - name: ELASTICSEARCH_PORT
    value: "443"

The Metricbeat Helm chart installs a DaemonSet, a Deployment and a couple of Secrets with configuration. It also installs the RBAC objects that allow the application to read metrics from the cluster.
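You can list what the chart created with something like this (the label selector is an assumption about the stable chart; adjust it if your labels differ):

# objects created by the metricbeat chart (label selector is an assumption)
kubectl -n kube-system get daemonset,deployment,secret -l app=metricbeat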

When it is ready, open the Kibana dashboard and add a new index pattern. In the case of AWS Managed Elasticsearch, the URL looks like this:

https://vpc-elasticsearch-xxxvvvbbbzzzssss.eu-west-1.es.amazonaws.com/_plugin/kibana/

Go to Management -> Index Patterns -> Create index pattern

metricbeat-*

Now you can search for events in Kibana.
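For example, to narrow the results down to Kubernetes events only, a filter like this in the Kibana search bar works (metricset.name is a standard Metricbeat field):

metricset.name: "event"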

For testing purposes, I created the pod from the official Kubernetes documentation that is used to demonstrate the OOMKilled behavior:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo-2
spec:
  containers:
  - name: memory-demo-2-ctr
    image: polinux/stress
    resources:
      requests:
        memory: "50Mi"
      limits:
        memory: "100Mi"
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "250M", "--vm-hang", "1"]
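I saved the manifest above as memory-demo-2.yaml (the file name is my own choice) and applied it:

# create the test pod and watch it go from Running to OOMKilled/CrashLoopBackOff
kubectl apply -f memory-demo-2.yaml
kubectl get pod memory-demo-2 --watch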

The pod started and was killed due to its memory limit (OOMKilled). I tried to find the relevant event in Kibana… and boom! I couldn’t find it! I double-checked the events from Kubernetes using the command below (see the REASON column):

kubectl get events
LAST SEEN   TYPE      REASON      OBJECT              MESSAGE
16s         Normal    Scheduled   pod/memory-demo-2   Successfully assigned default/memory-demo-2 to worker1.eu-west-1.compute.internal
13s         Normal    Pulling     pod/memory-demo-2   Pulling image "polinux/stress"
12s         Normal    Pulled      pod/memory-demo-2   Successfully pulled image "polinux/stress"
11s         Normal    Created     pod/memory-demo-2   Created container memory-demo-2-ctr
11s         Normal    Started     pod/memory-demo-2   Started container memory-demo-2-ctr
10s         Warning   BackOff     pod/memory-demo-2   Back-off restarting failed container

There is no “OOMKilled” pod event, just the “BackOff” one; the event I am most interested in is missing. What to do? When you describe the pod (kubectl describe pod memory-demo-2), you can see something like:

State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    1
  Started:      Wed, 04 Mar 2020 09:14:24 +0100
  Finished:     Wed, 04 Mar 2020 09:14:25 +0100
Ready:          False
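The same reason can also be pulled out directly with a jsonpath query, assuming the pod runs in the default namespace:

# print the last termination reason of the first container (expected output: OOMKilled)
kubectl get pod memory-demo-2 -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'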

So the information is there, but how do we ship it to Elasticsearch? I went back to the documentation and found that there is a field kubernetes.container.status.reason for the apiserver metricset. According to it, we can expect either a Waiting reason (ContainerCreating, CrashLoopBackoff, ErrImagePull, ImagePullBackoff) or a Terminated reason (Completed, ContainerCannotRun, Error, OOMKilled). The last one is what we are looking for! More details can be found in the Metricbeat Kubernetes module documentation (exported fields).

In order to enable the apiserver metricset, a small adjustment to the configuration is needed. Of course, we can do it through the values.yaml file. The changes in the deployment section necessary to enable it are shown below:

deployment:
  overrideConfig:
    metricbeat.config.modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false
    processors:
      - add_cloud_metadata:
    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      ssl:
        verification_mode: "none"
  modules:
    kubernetes:
      enabled: true
      config:
        - module: kubernetes
          metricsets:
            - state_node
            - state_deployment
            - state_replicaset
            - state_pod
            - state_container
            - event
          period: 10s
          hosts: ["kube-state-metrics:8080"]
        - module: kubernetes
          metricsets:
            - apiserver
          hosts: ["https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}"]

Then refresh the Metricbeat index pattern in Kibana. After that, you can apply a filter to see OOMKilled pods in your cluster:

kubernetes.container.status.reason: OOMKilled
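The same filter can also be run against Elasticsearch directly, for example with a hedged curl sketch reusing the VPC endpoint placeholder from earlier:

# ask Elasticsearch for documents reporting an OOMKilled container
curl -s "https://vpc-elasticsearch-xxxvvvbbbzzzssss.eu-west-1.es.amazonaws.com:443/metricbeat-*/_search?q=kubernetes.container.status.reason:OOMKilled&size=1&pretty"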

Example log:

{
  "_index": "metricbeat-6.7.0-2020.03.04",
  "_type": "_doc",
  "_id": "ZnnCpHABFngIVEsjbpV9",
  "_version": 1,
  "_score": null,
  "_source": {
    "@timestamp": "2020-03-04T08:56:09.755Z",
    ...
    "container": {
      "name": "memory-demo-2-ctr",
      "status": {
        "restarts": 6,
        "ready": false,
        "phase": "terminated",
        "reason": "OOMKilled"
      },
    ...

Thanks to that, I now have the logs related to OOMKilled in my Elasticsearch. On top of that, I also have Kubernetes events and many more metrics and cluster logs stored in a persistent way :-)

It will certainly save me a lot of time the next time one of my pods crashes and I am looking for the reason. Please keep this in mind in your cluster deployments!

I hope you enjoyed reading my post.

Thank you!
