Using delta in Prometheus, differences over a period of time

David O'Dell
2 min read · Jul 15, 2019


Photo by Julian Hochgesang on Unsplash

At some point you will need to figure out how much a metric’s value has changed over a certain period of time. We once had a container that kept hitting its memory limit and then flapping, restarting about once every 10 minutes. Because the pods restarted so quickly, monitoring wasn’t catching the failures directly; we were only noticing other issues. Once we spotted the memory issue, an immediate “get pods” told us that the pod had restarted well over 10 times. So we wanted to create an alert that could evaluate pod restart counts.

But… there’s a problem. We noticed that one of our other pods that delivered a schema to a database had a number of historically old and acceptable restarts — the pod was waiting to deliver the schema but couldn’t deliver due to latency in another process, so it would go into a restart cycle until it could get delivered. The delivery was successful, but the historical count of restarts was still there.

We needed a way to count the restarts in the past hour or, more precisely, to detect whether the restart count had grown by more than some threshold in the last hour. That’s where delta comes in.

Surprisingly, a delta expression is super easy to set up in Prometheus. I didn’t have to fight it or go find Mr. Brazil. It’s ready to use and does a great job!

The expression can be tested like this:

delta(kube_pod_container_status_restarts_total[1h]) > 5

and then placed into a Prometheus alerting rule (which Alertmanager then routes) like this:

- alert: Container status restarts total over 5 within the past hour
  labels:
    cluster: TEST
    severity: critical
    service: container
  annotations:
    description: "Pod {{ $labels.pod_name }}, container {{ $labels.container_name }} restarts total over 5 within the past hour"
    action: "Contact support"
  expr: delta(kube_pod_container_status_restarts_total[1h]) > 5
  for: 5m
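
For context, Prometheus loads alerting rules from a rules file, where each rule sits inside a named group. Below is a minimal sketch of what such a file might look like for this alert. The group name `container-restarts` is hypothetical and not from our setup. The comment about `increase()` reflects the Prometheus documentation, which describes `delta` as intended for gauges and `increase` as intended for counters; treat it as something to test, not as what we ran.

```yaml
# Sketch of a rules file, assuming it is listed under rule_files in prometheus.yml.
groups:
  - name: container-restarts   # hypothetical group name
    rules:
      - alert: Container status restarts total over 5 within the past hour
        # Same expression as above. Because this metric is a counter, an
        # alternative worth testing is:
        #   increase(kube_pod_container_status_restarts_total[1h]) > 5
        # which also accounts for the counter resetting to zero.
        expr: delta(kube_pod_container_status_restarts_total[1h]) > 5
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m
        labels:
          cluster: TEST
          severity: critical
          service: container
        annotations:
          description: "Pod {{ $labels.pod_name }}, container {{ $labels.container_name }} restarts total over 5 within the past hour"
          action: "Contact support"
```

A file like this can be checked for syntax errors with `promtool check rules` before Prometheus is reloaded.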
