Using Prometheus Offset to alert based on gradient

Our story

Rachel Newman
Go City Engineering
Feb 23, 2024


We were frequently being woken up by business processes that produced a large volume of Kafka messages in a short burst, which inevitably caused lag on the consumer end. Our alerts were set up so that if the lag exceeded a threshold, a P1 alert was triggered in the team, as high lag was deemed urgent. That threshold was somewhat arbitrary and didn't really account for real-world cases, namely when a business process triggered it. We decided that high lag on its own didn't mean something was urgently wrong in the system, especially when it was caused by the rapid production of messages in a short time, because in those cases the lag was always steadily decreasing. Even though the lag was being dealt with properly by our system, the alert kept firing for as long as the lag stayed above our threshold. We realised that what we actually needed to know was: 'is the lag increasing, at any rate, and has it been like this for too long?'

Why did our alert need fixing?

  • A high consumer lag isn’t urgent if it’s going down
  • The threshold for ‘critical’ consumer lag was an arbitrary number (800)
  • Not infrequently, our critical alert woke up an engineer for something that did not need technical support

What did we want instead?

  • Alert if the lag is going up for a long time — this would imply a bigger software issue
  • Engineers not being woken up for non-critical issues

Our Solution

We want to know when the gradient is positive for a period of time

We can use the equation for the gradient of a straight line:

Gradient of a straight line: (y₂ - y₁) / (x₂ - x₁)

We decided that for our use case we only needed the DIRECTION of the line: is it increasing? We don't care about the steepness of the slope (the magnitude of the gradient), only about the sign of the gradient (whether it is positive or negative).

This means we only need y₂ - y₁ (because we know that x₂ - x₁ is positive… given the nature of time and graphs).
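Written as an equation (where m is just a name for the gradient):

m = \frac{y_2 - y_1}{x_2 - x_1}, \qquad x_2 - x_1 > 0 \implies \operatorname{sign}(m) = \operatorname{sign}(y_2 - y_1)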

Example of high Kafka lag which is steadily decreasing

With a graph like this, we’d expect our alert not to trigger because the graph is going down — it has a negative gradient.

How do we get the value at y₂ in Prometheus?

Introducing offset

We know we need y₂ - y₁…

We could think about this in two ways: either y₁ is now and y₂ is in the future, or y₂ is now and y₁ is in the past. The first would be quite tricky, as we don't know the value of the graph in the future, so we must view it as y₂ being now and y₁ being in the past.

We can get this using the Prometheus offset modifier.

[your_metric] - [your_metric] offset 60s
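As a concrete illustration (the metric name and the numbers are hypothetical), if a gauge was 100 a minute ago and is 120 now, the expression returns 20, which is positive, so the metric is increasing:

# hypothetical gauge: 100 one minute ago, 120 now
my_queue_depth - my_queue_depth offset 60s
# => 20 (positive, so the value is going up)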

For the example graph above, the gradient would look like this:

The parts of the graph below 0 are where the lag was steadily going down; these are the parts we don't care about.

(sum by (topic, application) (kafka_consumer_fetch_manager_records_lag) - 
sum by (topic, application) (kafka_consumer_fetch_manager_records_lag offset 60s)) > 0

We chose to make our offset as close to 0 as possible, as this would give us the best approximation of the true gradient of the graph (an offset much shorter than the scrape interval would often just compare a sample with itself and return 0). Keeping the scrape interval of these metrics in mind, we trialled a few values and settled on 60s.

In our alerting rules we can combine this with the for clause:

expr: sum by (topic, application) (kafka_consumer_fetch_manager_records_lag) - sum by (topic, application) (kafka_consumer_fetch_manager_records_lag offset 60s) > 0
for: 30m

which means we will only get alerted if the lag has been increasing continuously for at least 30 minutes.
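For context, a complete rule in a Prometheus rule file could look something like the sketch below (the group name, alert name, severity label and annotation text are illustrative choices rather than anything Prometheus prescribes):

groups:
  - name: kafka-consumer-lag                # hypothetical group name
    rules:
      - alert: KafkaConsumerLagIncreasing   # hypothetical alert name
        expr: |
          sum by (topic, application) (kafka_consumer_fetch_manager_records_lag)
            - sum by (topic, application) (kafka_consumer_fetch_manager_records_lag offset 60s) > 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Consumer lag on {{ $labels.topic }} has been increasing for 30 minutes"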

We can use this gradient calculation for any metric going forwards when we want to detect a change in either direction. There may also be cases where we care more about the steepness of the slope, in which case we could use the offset length as the denominator in our gradient fraction.
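As a sketch of that idea, dividing the same difference by the offset length (60 seconds here) turns it into an approximate rate of change in messages per second, which could then be compared against a threshold expressed as a rate rather than a bare count:

(sum by (topic, application) (kafka_consumer_fetch_manager_records_lag)
  - sum by (topic, application) (kafka_consumer_fetch_manager_records_lag offset 60s)) / 60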

Edit: We found that tending the offset towards zero was the best way to get an accurate gradient for the graph, but due to the nature of the metric and how sporadic it can be, we realised we really only needed the general direction. For this reason we increased our offset to 2m, based on experiments with different values.
