How we avoided alarm fatigue syndrome by reducing alerting noise

Published in Doctolib, Apr 15, 2022

Have you ever been so overwhelmed by alerts that you became desensitized to them? At Doctolib, as the company grows, we run into this issue as well. Here’s the methodology we found to reduce alert noise, based on a KPI (key performance indicator).

The most important thing we discovered while setting up the noise reduction process is to define the most representative KPI and, above all, not to change it along the way. We can’t use human feelings as a KPI, because noise is not equally painful for everyone. A numeric KPI is the most relevant guide for tackling the noise.

I’ve been working at Doctolib for 6 years now, and we have used different monitoring and alerting tools. In the beginning it was easy: we didn’t have any noise because the infrastructure was pretty small and there weren’t many services. That small size allowed us to handle all alerts as they came in.

Alarm fatigue syndrome

Over these last 6 years, as the infrastructure scaled up and the number of Doctolib services multiplied, I saw the number of problems increase. To avoid Doctolib downtime, we developed a lot of monitoring and alerts warning us about each issue on our infrastructure and services.

However, this tool set made our monitoring and alerting somewhat noisy, and we started to experience alarm fatigue syndrome.

The risk of noise is simple to understand: the more noise we face, the less attention we pay to our alerts. Other risks related to this syndrome are:

Missing alerts
Handling alerts more slowly
Abandoning the alerting tool

It seemed obvious that the solution to our issue, and the way to tackle the problem, was already written down in the description of alarm fatigue syndrome. But where to start? With the noisiest alerts? And what does “noisiest” mean?

Before taking any action, we answered the following questions, which helped us define how to reduce the noise:

What is the definition of noise?
What is the current state?
What is the target we would like to reach?

Before defining noise and deciding how to tackle it, we had to export statistics about our alerts and their status throughout their lifecycle.

At Doctolib we use PagerDuty to route our alerts to the on-call person. We export PagerDuty alert statistics every day.

This is a good starting point for defining our KPIs.

Noise definition, KPI and target

The definition of noise is directly related to the alert handling process and the acceptable acknowledgement time.

We tried to pin down a definition of noise and turn it into a KPI that lets us follow the progress of our noise reduction.

We decided that a self-resolved alert lasting less than two minutes is a noisy alert. The duration is the time elapsed between the triggering and the resolution of the alert.

How did we arrive at this definition of noise?
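As an illustration, the rule above can be written as a small predicate over exported alert records. This is only a minimal sketch: the `Alert` record, its field names and the `self_resolved` flag are hypothetical and do not reflect the actual PagerDuty export format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Threshold taken from the article: alerts lasting under two minutes.
NOISE_THRESHOLD = timedelta(minutes=2)


@dataclass
class Alert:
    """Hypothetical shape of one exported alert record."""
    title: str
    triggered_at: datetime
    resolved_at: Optional[datetime]  # None while the alert is still open
    self_resolved: bool              # assumed flag: resolved without human action


def is_noisy(alert: Alert) -> bool:
    """Return True for a self-resolved alert that lasted under two minutes."""
    if alert.resolved_at is None or not alert.self_resolved:
        return False
    duration = alert.resolved_at - alert.triggered_at
    return duration < NOISE_THRESHOLD


# Made-up sample data, for illustration only.
alerts = [
    Alert("cpu high", datetime(2021, 9, 1, 10, 0), datetime(2021, 9, 1, 10, 1, 30), True),
    Alert("disk full", datetime(2021, 9, 1, 11, 0), datetime(2021, 9, 1, 11, 45), False),
    Alert("latency", datetime(2021, 9, 1, 12, 0), datetime(2021, 9, 1, 12, 5), True),
]
noisy_titles = [a.title for a in alerts if is_noisy(a)]
```

Counting the records for which `is_noisy` returns True, day by day, gives a numeric KPI of the kind described here.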

We have a daily ritual called “duty”, during open hours, where one of the team members is in charge of looking after production. This role mainly consists of the following missions (summarized):

Handle production alerts
Handle inbound requests

We consider that in two minutes, the person on duty is not able to:

Be aware of the alert,
Go to the alerting and monitoring tools to understand it,
Investigate the issue,
Solve it.

This is why we chose “alerts lasting under two minutes” as our definition of noise.

Below is the number of noisy alerts we faced in Q3 2021 (The graph has been truncated to 100 for better visibility):

With the KPI known, we defined a target for the noise reduction. We kept the following in mind:

Reaching the target depends on the time we are willing to spend/allocate
The effect of alert fixes will not be visible immediately (we kept some budget in reserve)
We could face spikes if a misconfigured alert that hasn’t been triggered yet starts spamming the alerting tool

In our first round of signal noise reduction, we decided to set the target as the number of total alerts in Q3 divided by 3. In our case, we wanted to reduce from around 2400 alerts to 800.
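The target arithmetic, and one possible way to check progress against it, can be sketched as follows. The straight-line daily budget is an assumption made for this example, not necessarily how the tracking graph was built, and all function names are hypothetical.

```python
from itertools import accumulate
from typing import List


def quarterly_target(previous_total: int, divisor: int = 3) -> int:
    """Target for the next quarter: the previous quarter's total divided by 3."""
    return previous_total // divisor


def budget_so_far(target: int, days_elapsed: int, days_in_quarter: int) -> float:
    """Assumption: the quarterly budget is spent evenly, day by day."""
    return target * days_elapsed / days_in_quarter


def is_on_track(daily_noisy_counts: List[int], target: int, days_in_quarter: int) -> bool:
    """Compare the running total of noisy alerts with the budget spent so far."""
    if not daily_noisy_counts:
        return True
    running_totals = list(accumulate(daily_noisy_counts))
    allowed = budget_so_far(target, len(daily_noisy_counts), days_in_quarter)
    return running_totals[-1] <= allowed


target = quarterly_target(2400)  # around 2400 alerts in Q3, as stated in the article
calm_days = [8, 9, 7]            # made-up daily counts
spike_days = [20, 20]            # made-up daily counts with a spike
```

With a 92-day quarter, `is_on_track(calm_days, target, 92)` holds while `is_on_track(spike_days, target, 92)` does not, which is the kind of signal a tracking graph makes visible at a glance.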

We drew a dedicated graph to track how the number of noisy alerts evolved:

It’s very important not to redefine or change the KPI or its target along the way. Otherwise, it would be hard to check whether we are on track to achieve what we want.

Methodology: how do we reduce the noise?

With the KPI and the target defined, we worked on a method to help us tackle the noisy alerts.

To make it easier to decide which alert to tackle first, we created a table showing the number of occurrences of each alert (sorted by name and by occurrence). However, an issue appeared in our case: almost every alert had exactly 1 occurrence. Indeed, many of our alert titles include specifics such as the name of the server/pod or the environment.

To work around this title constraint, we decided to truncate the titles to remove those specifics. That let us aggregate alert titles and display the real number of occurrences of the noisy ones (sorted by count).

Occurrences with and without truncated title
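The effect of truncating titles can be sketched with a simple counter. The alert titles and the prefix length of 18 characters are made up for this example; in practice the right length depends on how the alert titles are built.

```python
from collections import Counter


def truncate_title(title: str, prefix_len: int = 18) -> str:
    """Keep only the start of the title, dropping server/pod/environment details."""
    return title[:prefix_len].rstrip()


# Hypothetical alert titles with server and environment specifics at the end.
titles = [
    "High error rate on web-01 (production)",
    "High error rate on web-02 (production)",
    "High error rate on web-03 (staging)",
    "Disk almost full on db-01",
]

# Without truncation, every title is unique and counts once.
raw_counts = Counter(titles)

# With truncation, the three web server alerts collapse into one entry.
truncated_counts = Counter(truncate_title(t) for t in titles)
loudest = truncated_counts.most_common()  # sorted by number of occurrences
```

A fixed-length prefix is crude, since it can also merge unrelated alerts that happen to start the same way, but it is cheap and was enough here to surface the loudest alert.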

As a daily ritual, we looked at this chart every morning and set aside time during the day to tackle the loudest alert from the previous day.

We used the following steps to decide what we should do for a noisy alert:

1 — Challenge the relevance by these questions: Is this alert actionable? What is its value? Is it still relevant?

If we decided to remove an alert for lack of relevance but it was still useful for investigation, we kept it as a metric chart in our main dashboard.

2 — Check the redundancy: We only kept the most valuable alert.

3 — Check the algorithm of evaluation.

3.1 — We checked the “group by” aggregation for the metrics. Is it relevant to have 1 alert per web server instead of an error ratio?

3.2 — We checked the time frame of the alert’s evaluation. Is it relevant to evaluate every alert over 1 minute? The main change we made in this case was to replace the 1-minute evaluation window with a 5-minute one.

4 — Check the threshold.

If none of the previous checks helped us tackle the alert, we adapted or fine-tuned the threshold.

This method is pragmatic and flexible, so we sometimes changed the priorities or the order of the steps when we identified a quick win for an alert that was in the previous day’s top 5 noisiest.

We applied this method multiple times to reduce the noise step by step, lowering the target with each quarterly iteration.

In the last quarter of 2021, we successfully decreased the amount of noise compared to the previous quarter.
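Steps 3.1 and 3.2 can be illustrated with a small evaluator that looks at an error ratio aggregated across all web servers over a sliding window. This is a sketch under assumptions: the `WindowedErrorRatio` class, the 5% threshold and the sample values are hypothetical, and real monitoring tools express this through their own query language.

```python
from collections import deque
from typing import List, Tuple


class WindowedErrorRatio:
    """Fire when the error ratio over the last `window` samples exceeds `threshold`."""

    def __init__(self, window: int, threshold: float) -> None:
        self.samples = deque(maxlen=window)  # oldest samples drop out automatically
        self.threshold = threshold

    def observe(self, requests: int, errors: int) -> bool:
        """Add one per-minute sample (summed over all servers) and evaluate."""
        self.samples.append((requests, errors))
        total_requests = sum(r for r, _ in self.samples)
        total_errors = sum(e for _, e in self.samples)
        if total_requests == 0:
            return False
        return total_errors / total_requests > self.threshold


def evaluate(samples: List[Tuple[int, int]], window: int, threshold: float) -> List[bool]:
    """Replay the samples through a fresh evaluator and record each decision."""
    evaluator = WindowedErrorRatio(window, threshold)
    return [evaluator.observe(requests, errors) for requests, errors in samples]


# Made-up traffic: a one-minute blip of errors in the second minute.
blip = [(1000, 10), (1000, 60), (1000, 10), (1000, 10), (1000, 10)]
one_minute = evaluate(blip, window=1, threshold=0.05)
five_minutes = evaluate(blip, window=5, threshold=0.05)

# Made-up traffic: a sustained problem still fires with the longer window.
sustained = evaluate([(1000, 100)] * 5, window=5, threshold=0.05)
```

With these samples, the 1-minute evaluation fires once for the blip and resolves a minute later, while the 5-minute evaluation stays quiet and still fires on the sustained problem. The trade-off is that a longer window can delay detection of a real incident.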