How to set a single good threshold for an alert?

Gauthier François
Published in Production. care
3 min read · Feb 19, 2024

Have you ever asked yourself what the right threshold is for your alert setup?

I have worked on alerting systems for more than 10 years, in e-commerce and healthcare. Setting good thresholds for an alert is very difficult and contentious.

I worked on alert noise reduction, and it appeared that one of the solutions was to have only two alert statuses: “OK” and “Critical”.

Before talking about the good threshold for an alert, I would like to explain why an alert should have only one threshold. As mentioned in this post, the warning status of alerts is anxiety-provoking and inefficient; I will explain why.

Why is the alert warning status useless?

In several companies, I saw that alert management was very tricky, mostly because of the warning and critical alert statuses. Most of the time, operators wait for warnings to turn into critical alerts.

At the time, I thought the situation was normal. I handled alerts like the rest of my teammates. It wasn’t until I understood what alarm fatigue syndrome was that I realized the alert warning status is not a good thing.

The warning state is mainly used in the following cases:

  1. Avoid the critical state (the alert is handled as soon as possible)
  2. Signal that the alert will soon switch to critical status (often ignored/muted until the status changes)
  3. Prioritize which alert should be tackled first (a critical alert is more of an emergency)

In the first two cases, it is not necessary to have two different alert states if alerts are processed as they are triggered. Keeping in mind that all alerts are important is the best way to take care of the production.

For the last one, warning and critical states should not be used to prioritize alerts. Almost all alert management tools include a dedicated feature to prioritize alerts. Moreover, I’m not convinced of the value of alert prioritization, but that is certainly a topic for another blog post 😉.

The most important thing to understand in order to set a good threshold is that alerts have to be tackled rigorously, as soon as they pop up. Once this level of excellence is reached, the threshold can be challenged.

The threshold should be set in relation to:

  • the risk for the production
  • the time to investigate/resolve
  • the complexity

Needless to say, alerts are set to prevent production issues. However, the threshold has to be set ahead of the risk, and above all the alert has to be quickly actionable.

When the alert pops up, the investigation, the resolution, or both take time. The threshold must not be set too close to the issue, so that the operator has time to do their job. But it should not be too far from it either, to avoid alarm fatigue syndrome (more detail about this phenomenon in this blog post).
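As a rough illustration of this trade-off, the sketch below derives a threshold from the time the operator needs. It assumes a resource that fills up at a roughly constant rate (a disk, for example); the function name, its parameters, and the numbers are hypothetical and not taken from any specific monitoring tool.

```python
def threshold_for_lead_time(capacity, growth_per_hour, hours_needed, safety_factor=1.5):
    """Return the usage level at which the alert should fire.

    capacity:        level at which the production issue occurs (e.g. disk full)
    growth_per_hour: observed growth of the metric, same unit as capacity
    hours_needed:    estimated time to investigate and resolve
    safety_factor:   margin on top of hours_needed (an assumption, tune per alert)
    """
    if growth_per_hour <= 0:
        raise ValueError("metric is not growing; a lead-time threshold does not apply")
    # Headroom the operator needs between the alert and the actual issue.
    headroom = growth_per_hour * hours_needed * safety_factor
    # Never return a negative threshold, even if the headroom exceeds the capacity.
    return max(capacity - headroom, 0.0)


# Hypothetical 500 GB disk growing by 10 GB per hour, 4 hours needed to fix.
threshold_gb = threshold_for_lead_time(500, 10, 4)
print(threshold_gb)  # 440.0
```

A larger `safety_factor` moves the threshold further from the issue and gives more time, at the cost of more alerts that could have waited.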

Some alerts can be complex to understand or investigate because they can have multiple root causes.

The best way could be to split such an alert into as many alerts as there are root causes, to help solve the issue faster. However, if the alert is relevant enough, a good threshold can be an indicator that helps the operator understand what the alert is about.

Finding the good threshold is not easy and may take multiple short feedback loops.

Depending on the case, the threshold can first be set by the most appropriate operator, or by whoever has the most ownership of the alert topic. In fact, operators are the most relevant people to create relevant alerts. The first iteration should be set in a very preventive mode. After the first alert iteration, the threshold needs to be challenged again with questions like:

  • Was the production at risk when the alert triggered?
  • How much longer could I have waited before tackling the alert?

Depending on the answers to these questions, you’ll decide how to fine-tune the threshold.
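One way to picture this feedback loop is the sketch below. It is only an assumption about how the two questions could translate into an adjustment: the `Review` fields, the `step` size, the `max_idle_hours` limit, and the rule itself are hypothetical, and it assumes an alert that fires when the metric goes above the threshold.

```python
from dataclasses import dataclass


@dataclass
class Review:
    production_at_risk: bool        # Was the production at risk when the alert triggered?
    hours_could_have_waited: float  # How much longer could the operator have waited?


def tune_threshold(threshold, review, step=5.0, max_idle_hours=2.0):
    """Return the threshold to use for the next iteration."""
    if review.production_at_risk:
        # Fired too late: lower the threshold so the alert triggers earlier.
        return threshold - step
    if review.hours_could_have_waited > max_idle_hours:
        # Fired too early: raise the threshold to reduce noise.
        return threshold + step
    # Neither too late nor too early: keep the threshold.
    return threshold


threshold = 70.0  # very preventive first iteration (hypothetical percentage)
for review in [Review(False, 6.0), Review(False, 3.0), Review(False, 1.0)]:
    threshold = tune_threshold(threshold, review)
print(threshold)  # 80.0
```

Small steps and frequent reviews keep each loop short; the answers still come from the operators, not from the code.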

Don’t be afraid to change your monitoring and alerting methods. Taking care of production means adapting the monitoring and alerting stack to its needs. Abandoning the “warning” status is a first step towards making the alert and its threshold more relevant.

The good threshold must be halfway between “dropping the alerting tool” and “too late, the issue is already here”, while challenging the alert along the way: “is this alert relevant/actionable?”

Gauthier François
Production. care

Pragmatic SRE, production alerts killer, legacy maintainer, operational excellence advocate.