How we adapted our business alerts to make them more relevant and avoid false alerts

Gauthier François
Published in Doctolib
7 min read · Dec 9, 2022

Have you ever suffered from business alerts that are not relevant enough and cause too many false alerts?

At Doctolib, we want to know whether our features are working as expected, so we work hard to make our business monitoring as efficient as possible.

Alarm fatigue syndrome representation

Four years ago, Doctolib decided to create its own SLI (Service Level Indicator) implementation to make sure our features work as intended. Concretely, after a few functional issues, we wanted features to be more observable and measurable, especially so that alerts are triggered when they are in bad shape.

Our SLI implementation relies on our monitoring tool, which is event based: the application pushes events to the provider, which we then query to display the usage of our features.

Doctolib applications are mainly used by practitioners to manage their appointments during office hours. Outside these hours, especially at night and on weekends, the platform is not under much stress.

In this article, we will discuss an SLI that tells us whether our booking system works as expected: we would like to be alerted when the usage of this feature is unusual. The metric we rely on is the number of appointments booked per minute, broken down by user type (mainly practitioners in this example, and patients).
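As a rough illustration of how such events are produced (the event client, event name and attributes below are hypothetical, not our actual instrumentation), each confirmed booking could push an event along these lines:

```ruby
# Hypothetical event client, for illustration only; the real event-based
# provider and its API are not named in this article.
class EventTracker
  def self.push(name, attributes = {})
    # In reality this would send the event to the event-based monitoring provider.
    puts({ name: name, at: Time.now.utc }.merge(attributes))
  end
end

# Called from the code path that confirms a booking.
EventTracker.push("appointment_booked", user_type: "practitioner")
```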

The Issue

We decided to create an alert based on a metric generated by events sent from our application.

Without going into too much detail (TL;DR): every minute, we export data from the event-based provider and insert it as a metric in Datadog.
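For illustration, here is a minimal sketch of such an exporter, assuming the dogstatsd-ruby client and an invented metric name (the article does not say which client or metric names Doctolib actually uses):

```ruby
require "datadog/statsd"

# Hypothetical helper: in reality this would query the event-based provider for
# the number of appointments booked in the last minute, per user type.
def fetch_appointment_counts
  { "practitioner" => 42, "patient" => 17 }
end

statsd = Datadog::Statsd.new("localhost", 8125)

# Runs every minute (cron, scheduled job, etc.) and pushes one point per user type.
# The metric name is illustrative, not Doctolib's real metric. Depending on the
# dogstatsd-ruby version, metrics may be buffered; flush or close the client as needed.
fetch_appointment_counts.each do |user_type, count|
  statsd.gauge("booking.appointments.per_minute", count, tags: ["user_type:#{user_type}"])
end
```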

We wanted to detect whether our booking system was in good shape, so we created a simple metric-based Datadog monitor to detect sudden drops: it compared the last 10 minutes with the 10 minutes before that. Here is the absolute evaluation in Datadog:

Datadog query with absolute approach
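For readers without the screenshot, an absolute evaluation of this kind can be expressed as a Datadog "change" query roughly like the one below; the metric name and threshold are assumptions, not our real values:

```ruby
# Illustrative only: alert when the number of appointments booked over the last
# 10 minutes drops by more than 50 (in absolute terms) compared with the
# 10 minutes before that.
ABSOLUTE_DROP_QUERY =
  "change(sum(last_10m),last_10m):sum:booking.appointments.per_minute{*} < -50"
```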

However, we faced false alerts because traffic is sporadic during the night. The number of appointments booked is not as steady as during office hours, which made our alerts irrelevant outside those hours.

We gathered some statistics about how many of the triggered alerts were actually relevant.

Percentage of real vs false alerts before adaptation

First iterations

In Datadog, it is possible to create a "composite" monitor: a meta-monitor that combines the results of several monitors through a logical expression.

A Datadog composite monitor is composed of the ID of each monitor

The query juxtaposes the IDs of the monitors with the "&&" or "||" operators. Remember that if a monitor's status is red (threshold violated), its value is true. In the example above, if both statuses are red, the result of the logical expression is true (true && true) and the composite monitor is triggered.
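For readers without the screenshot, here is a minimal sketch of what such a query looks like, with hypothetical monitor IDs:

```ruby
# Hypothetical monitor IDs, for illustration only.
first_monitor_id  = 12_345_678
second_monitor_id = 23_456_789

# A composite query is just monitor IDs combined with boolean operators.
# This expression is true (so the composite triggers) only when both
# underlying monitors are red at the same time.
composite_query = "#{first_monitor_id} && #{second_monitor_id}"
```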

Correlate with other monitors

We decided to correlate the previous monitor with another one that detected abnormal traffic conditions. We tried different mathematical approaches based on derivatives and medians.

Datadog query with proportional approach
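As an illustration of the proportional idea (not the exact query we used), Datadog also supports percentage-based change queries; the metric name and threshold below are assumptions:

```ruby
# Illustrative only: alert when traffic on the booking endpoint drops by more
# than 60% compared with the previous 10 minutes.
PROPORTIONAL_DROP_QUERY =
  "pct_change(sum(last_10m),last_10m):sum:booking.requests.per_minute{*} < -60"
```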

Consequently, we created a composite monitor formed by a logical expression combining these two monitors.

Example of Datadog Composite monitor

The creation of this composite monitor somewhat improved the quality of the alerts; however, false positives kept appearing. These monitor compositions were not enough, and we continued to face false alerts.

We tried to fine-tune the SLI thresholds and the expression but ended up over-tuning them. The SLIs became so desensitized that they were no longer able to trigger an alert when an actual issue appeared.

Downtime and seasonality

What about the Datadog downtime functionality?

The Datadog downtime feature lets you mute alerts, either as a one-off or on a recurring schedule.

We tested the Datadog downtime functionality and, on its own, it worked as expected. However, industrializing downtimes turned out to be complex and did not work as we hoped. Moreover, downtimes were no longer reliable after monitor changes, and they did not handle public holidays.

What about comparing with the same period one week earlier instead of with the previous 10 minutes?

We tested weekly seasonality: we compared the metric with its value one week earlier instead of with the previous 10 minutes. However, we had to deal with public holidays (in France, we have 11 each year), during which the comparison is no longer relevant.

Verdict

The main issue is that traffic is not predictable in the following cases:

  1. During the night, traffic is completely sporadic.
  2. School holidays are periods with varying traffic.
  3. Traffic is different for each day of the week, especially on Saturday and Sunday.

The hacky solution

We decided to reconsider everything we had built, focusing on our needs: we wanted to align our business alerts with their context. To help us define the solution, we asked ourselves:

  1. What is the goal of an SLI?
  2. What are our expectations about the “relevant alert” definition?

The goal of our SLIs is to answer this question: Does my feature work as intended?

Having relevant alerting means, on the one hand, providing an alerting mechanism that warns us when our features are in bad shape and, on the other hand, making sure the whole organization is 100% confident in the alerts.

To make our SLIs as relevant as possible, we decided to use the composite monitor functionality to define the alert context. The context is mainly defined by answering the following questions:

  • Should the SLI trigger an alert if the feature usage is representative enough?
  • Should the SLI trigger an alert during the weekend?
  • Should the SLI trigger an alert during a public holiday?
  • Should the SLI trigger an alert during non-business hours?

We called these new monitors "helpers" because they make our SLIs more relevant by defining the context in which we want to be alerted.

Should the SLI trigger an alert if the feature usage is representative enough?

As previously mentioned, our metrics are event based. These events come from endpoints in our application, which push them to our provider.

As a metric, we export the number of unique users who called the endpoint during the last minute, and we created a monitor on this metric with a threshold above which the traffic is considered significant enough.
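As a rough sketch of what such an export could look like (the data source, helper and metric name below are assumptions, not our actual exporter query shown in the screenshot):

```ruby
require "datadog/statsd"

# Hypothetical helper: in reality this would list the distinct user IDs that
# called the appointment-booking endpoint during the last minute.
def recent_booking_caller_ids
  ["user-1", "user-2", "user-2", "user-3"]
end

statsd = Datadog::Statsd.new("localhost", 8125)

# The metric name is an assumption, not Doctolib's real metric.
statsd.gauge("booking.endpoint.unique_users", recent_booking_caller_ids.uniq.size)
```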

Exporter query to get the number of unique users who called the appointment endpoint
Monitor displaying if the traffic on the endpoint is significant enough

Should the SLI trigger an alert during the weekend?

Function to get the day number of the week

We created a metric monitor with a threshold of "<= 5" that tells us whether we are on a weekday (Monday to Friday) or not.
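A minimal sketch of such a function, assuming Ruby's Date#cwday (the metric name is an assumption):

```ruby
require "date"
require "datadog/statsd"

# Date#cwday returns 1 for Monday through 7 for Sunday, so values 6 and 7
# correspond to the weekend.
def day_of_week_metric
  Date.today.cwday
end

Datadog::Statsd.new("localhost", 8125).gauge("context.day_of_week", day_of_week_metric)
```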

Monitor displaying if today is a weekend day

Should the SLI trigger an alert during a public holiday?

We inserted 1 or 0 to indicate whether today is a public holiday, using a Ruby gem called "holidays".

Method to get if today is a public holiday
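A minimal sketch of what this method might look like, assuming the holidays gem's Holidays.on API with the :fr region (the metric name and the 0/1 mapping are assumptions, chosen to match the "<1" threshold mentioned just below):

```ruby
require "date"
require "holidays"
require "datadog/statsd"

# Returns 1 on a regular day and 0 on a French public holiday. The 0/1 mapping
# is an assumption chosen so that a monitor with a "< 1" threshold turns red on
# public holidays; flip it if your monitor is oriented the other way.
def working_day_metric
  Holidays.on(Date.today, :fr).any? ? 0 : 1
end

Datadog::Statsd.new("localhost", 8125).gauge("context.working_day", working_day_metric)
```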

We created a monitor with a threshold of "<1" that tells us whether or not today is a public holiday.

Monitor displaying if today is a public holiday

Should the SLI trigger an alert during non-business hours?

We inserted the concatenation of the hours and minutes as a metric.

Method to insert the time as a value

In get_time_metrics, the time is represented by values ranging from 0 for midnight to 2359 for 11:59 pm.
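A minimal sketch of get_time_metrics under that convention (the metric name is an assumption):

```ruby
require "datadog/statsd"

# Concatenates hours and minutes into a single number: 0 at midnight, 800 at
# 8:00 am, 2200 at 10:00 pm, 2359 at 11:59 pm.
def get_time_metrics
  now = Time.now
  (now.hour * 100) + now.min
end

Datadog::Statsd.new("localhost", 8125).gauge("context.time_of_day", get_time_metrics)
```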

For example, to handle the night, we combine these two monitors with "&&" (each of them negated in the final composite, as explained below):

  • one to cover from 10 pm to 11:59 pm: “>2200”
  • one to cover from midnight to 8 am: “<800”
Monitor covering from 10:00 pm to 11:59 pm
Monitor covering from 00:00am to 8:00 am

New SLI format

Now that we have "helpers" to make our SLIs more relevant, we defined two possible formats for our new SLIs.

The first one is composed of the feature alert and the feature usage:

SLI related to the significant traffic on the endpoint
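With hypothetical monitor IDs, such a composite could look like this (the direction of the usage helper's threshold is an assumption):

```ruby
# Hypothetical monitor IDs, for illustration only: 111111 is the booking feature
# monitor, 666666 the usage helper (assumed to be red when traffic on the
# endpoint is below the significance threshold).
USAGE_AWARE_SLI_QUERY = "111111 && !666666"
```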

The second one is composed of the feature monitor correlated with hours, public holidays and weekends:

SLI related to the context with week days, hours and public holidays

It is worth explaining why we decided to use exclamation marks (!). As you can see, we prefix all "helper" monitors with "!" but not the feature monitor. The mark means that the result is true when the monitor is green. In our case, the condition to trigger an alert is that the feature monitor is red and all of the "helpers" are green. If one of the helpers is red, the expression is false and cannot trigger the alert.

We did this mainly for visual purposes: when an alert is triggered, identifying the problematic monitor is much simpler.

Moreover, you can read the composite monitor like a sentence by replacing "!" with "not" and "&&" with "and":

A Datadog composite monitor is composed of the ID of each monitor

The feature monitor is broken and not “Hour is before 8am” and not “Hour is after 10pm” and not “Today is a public holiday” and not “Today is weekend”:

Datadog composite monitor is more readable with “!”
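Concretely, with hypothetical monitor IDs, the full query reads like this:

```ruby
# Hypothetical monitor IDs, for illustration only.
#   111111: booking feature monitor (red when bookings drop abnormally)
#   222222: "Hour is before 8am"        333333: "Hour is after 10pm"
#   444444: "Today is a public holiday" 555555: "Today is weekend"
# The composite triggers only when 111111 is red and every helper is green.
CONTEXT_AWARE_SLI_QUERY = "111111 && !222222 && !333333 && !444444 && !555555"
```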

After several attempts, we stopped trying to have perfect SLIs because of our sporadic traffic. We simply recognized that we need to take the context of our SLIs into account. We would rather have relevant but imperfect alerts than too many false positives or no alerts at all: done is better than perfect.

Percentage of real vs false alerts after adaptation

We developed this in-house solution because no tool on the market provides such helpers.


Gauthier François
Doctolib

Pragmatic SRE, production alerts killer, legacy maintainer, operational excellence advocate.