Using Machine Learning and Time Series Forecasting for Alerting

Jos van de Wolfshaar
MessageBird
Aug 22, 2019


As any developer will tell you, incidents are a pain — and the later you spot them, the more damage they can cause. Alerts can help identify incidents quickly, but configuring them, choosing the right threshold, and finding the time to get it right is not always so straightforward. At MessageBird, we built a tool that solves some of these difficulties to help us detect incidents sooner and avoid the pesky setbacks. 🚀

Setting The Right Thresholds

For a given subset of metrics in our monitoring ecosystem, we can more-or-less differentiate between regular and irregular behavior, but setting thresholds for alerts is tricky nonetheless. Normally, we’d study our dashboards, examine how our service behaves over time and how metrics can vary during the day, and then magically pick an alerting threshold number.✨

But this process can be a pain. Sometimes, you pick a number that will not fire an alert on time. Sometimes, you pick a number that makes alerts fire every few hours, creating ‘alert noise’ that makes it hard to spot the true issues.

With Nostradamus, our internal service, we use machine learning to minimize the risk of choosing poor numbers and provide a more advanced alerting system. 📈

Why Nostradamus?

Nostradamus (1503-12-14 to 1566-07-02) was a French astrologer, physician and reputed seer. He’s best known for his book Les Prophéties, in which he allegedly predicts future events in 942 poetic pieces. We accidentally discovered that some of his powers were implemented in the Prophet library by Facebook. We figured we could hand over these powers to all engineers at MessageBird 🐦 by creating a Nostradamus service that augments our Prometheus monitoring ecosystem with Prophet’s machine learning powers. We’ll now go over what these powers are and what you can use them for.

What Does Prophet Do?

Facebook’s library enables us to fit so-called “generalized additive models” (GAMs) onto any one-dimensional continuous time series we provide. GAMs are statistical models from which we can obtain confidence intervals based on samples of the posterior predictive distribution. They can be seen as shallow machine learning models (as opposed to e.g. deep neural networks) with only a single input: time. Many expressions we put into our dashboards and AlertManager actually yield a one-dimensional time series. For example, if we take the number of birds flying over our office per second, sum(rate(birds_flyover_counter_total[1m])), we see the following observations for 2019-05-28 until 2019-06-11:

The above observations form a single series of values over time. Hence, Prophet can be used to show us a forecast of the number of birds per second. Think of it as a weather forecast ⛅️, but rather than predicting temperature, we can now predict any kind of expression in Prometheus. We could predict the number of birds per second with the expression sum(rate(birds_flyover_counter_total[1m])) or the median weight of birds with histogram_quantile(0.50, sum(rate(birds_weight_bucket[1m])) by (le)). Here’s what the forecast of the number of birds per second looks like with default Prophet settings, using 7 days of history and predicting 2 days ahead:

The plot visualizes the following:

  • Black dots ⚫️: historic observations which are fetched from Prometheus and fall within the high-confidence area of the Prophet algorithm. These can be seen as regular observations;
  • Red dots 🔴: historic observations which are fetched from Prometheus and fall outside the high-confidence area of the Prophet algorithm. These can be seen as irregular or anomalous observations;
  • Blue shaded area: The high-confidence area of the Prophet algorithm. This area is where the model is confident about observing data. In other words, (future) observations should be mostly within this region. The region is enclosed by two solid blue lines and also contains a solid blue line in the middle. The middle line shows the value for which the algorithm’s confidence is highest;
  • Red solid line: The global trend of the time series as predicted by the model. This gives you a crude sense of whether your metrics are going up or down. It is usually not the most informative thing to look at.

You can see that our model can actually predict where future observations are likely to be. To be honest, the predictions here are not very satisfying. In fact, if we were to predict even further ahead, the model would suggest that we get to zero birds per second! 🙀
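The forecast and the black/red classification described above can be sketched in a few lines. This is a minimal sketch, not the Nostradamus implementation: it assumes the open-source `prophet` Python package (older releases were published as `fbprophet`) and a pandas DataFrame with the `ds` (timestamp) and `y` (value) columns Prophet expects. The function names and the 5-minute sampling step are made up for illustration.

```python
import pandas as pd


def fit_and_forecast(history: pd.DataFrame, days_ahead: int = 2,
                     freq: str = "5min") -> pd.DataFrame:
    """Fit Prophet with default settings and forecast `days_ahead` days ahead.

    `history` must have the columns `ds` (timestamps) and `y` (values).
    """
    # Imported lazily so flag_irregular below also works without Prophet installed.
    from prophet import Prophet

    model = Prophet()  # defaults, including an 80% uncertainty interval
    model.fit(history)

    # Number of future steps needed to cover `days_ahead` days at `freq`.
    periods = int(pd.Timedelta(days=days_ahead) / pd.Timedelta(freq))
    future = model.make_future_dataframe(periods=periods, freq=freq)

    # The result holds yhat (middle line), yhat_lower / yhat_upper
    # (edges of the shaded area) and trend (the global trend line).
    return model.predict(future)


def flag_irregular(history: pd.DataFrame, forecast: pd.DataFrame) -> pd.DataFrame:
    """Mark historic observations that fall outside the forecast interval."""
    bounds = forecast[["ds", "yhat_lower", "yhat_upper"]]
    merged = history.merge(bounds, on="ds", how="left")
    # True corresponds to a red dot, False to a black dot.
    merged["irregular"] = (
        (merged["y"] < merged["yhat_lower"]) | (merged["y"] > merged["yhat_upper"])
    )
    return merged
```

With a 7-day `history` frame, calling `forecast = fit_and_forecast(history)` and then `flag_irregular(history, forecast)` would give one row per observation with a boolean `irregular` column.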

In practice, machine learning algorithms often still require some manual help to make their outputs useful. So far, we have just provided the raw data to the algorithm with default parameter settings. Apparently, this is not giving us a credible forecast. We can improve the quality of the predictions significantly by telling the model what is certainly irregular in the given history of observations. The Nostradamus UI enables the user to provide this information to the algorithm. For the number of birds per second, it makes sense to treat everything above 6 birds per second as irregular. All observations that contain more than 6 birds are ‘cut off’ (i.e. they are completely ignored by the Prophet algorithm), resulting in the following graph with the cut off observations shown as purple dots:

Wonderful, now that looks like it could’ve been made by Michel de Nostredame himself! 🎉 Because we omitted the extremely high values during the 27th of June, the algorithm’s high-confidence area becomes narrower and looks more natural at first sight. Clearly, there are still outliers that fall above the blue area, but this is just a higher number of birds than expected, which is only a good thing: the more birds the merrier.
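The cut-off step described above can be sketched as follows. Prophet tolerates missing values in the history, so one way to make it ignore a point is to blank out its `y` while keeping the timestamp. Whether Nostradamus does exactly this is an assumption, and `cut_off` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd


def cut_off(history: pd.DataFrame, upper: float) -> pd.DataFrame:
    """Return a copy of `history` where values above `upper` are blanked out.

    Rows keep their timestamp (`ds`); only `y` becomes NaN, which Prophet
    skips when fitting. The extra `cut_off` column records which points
    were removed (the purple dots).
    """
    cleaned = history.copy()
    cleaned["cut_off"] = cleaned["y"] > upper
    cleaned.loc[cleaned["cut_off"], "y"] = np.nan
    return cleaned


# Made-up sample: birds per second, with two values above the limit of 6.
history = pd.DataFrame({
    "ds": pd.date_range("2019-06-01", periods=5, freq="D"),
    "y": [2.0, 3.5, 41.0, 6.0, 7.2],
})
cleaned = cut_off(history, upper=6)
print(cleaned)
```

The cleaned frame can then be passed to Prophet in place of the raw history, while the `cut_off` column keeps track of the points that were left out.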

So How Is This Useful?

One thing we cannot do with this type of machine learning is predict when issues will occur. This is because the model will tell you what’s regular and what isn’t — not when or to what extent irregularities will occur. By definition, an irregularity is something you cannot predict when the only input variable is time (unless you manage to reverse engineer Michel’s brain).

Adaptive Alert Thresholds 🚨 🆒

Nevertheless, we can still use the high-confidence areas that Prophet spits out as adaptive alert thresholds. Imagine that you have a metric that hovers around 900 during the day and around 100 during the night. Observing values of 100 during the night would be OK, but observing this value during the day could be a sign of an issue. With the constant thresholds that are common in alerting rules (e.g. sum(rate(birds_flyover_counter_total[2m])) < 100), it’s hard to express this kind of rule. With the high-confidence areas given by Prophet, setting up adaptive thresholds becomes super easy 😍. So whereas the model cannot directly predict the occurrence of an irregular event, it can be of great help in defining smart alerting. 😎
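To make the contrast concrete, here is a small sketch using the day/night numbers from the example above. The bands per hour are invented for illustration; in practice they would come from the lower and upper bounds of the forecast.

```python
from dataclasses import dataclass


@dataclass
class Band:
    lower: float
    upper: float


def band_for_hour(hour: int) -> Band:
    """Made-up bands for a metric around 900 by day and 100 by night."""
    if 8 <= hour < 20:
        return Band(lower=700.0, upper=1100.0)
    return Band(lower=50.0, upper=150.0)


def constant_alert(value: float, threshold: float = 100.0) -> bool:
    """Constant threshold: fire when the value drops below a fixed number."""
    return value < threshold


def adaptive_alert(value: float, hour: int) -> bool:
    """Adaptive threshold: fire when the value leaves the band for this hour."""
    band = band_for_hour(hour)
    return value < band.lower or value > band.upper


# A value of 100 is fine at 03:00 but suspicious at 14:00.
print(adaptive_alert(100.0, hour=3))   # False
print(adaptive_alert(100.0, hour=14))  # True
# The constant rule gives the same answer at any hour and misses the daytime drop.
print(constant_alert(100.0))           # False
```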

Below is an example of a situation where an alert could have been fired based on the lower bound of the blue shaded area on the 12th of June:

To make it easy for our engineers to integrate this within their current monitoring ecosystem, our challenge was to bring the outputs from Prophet to Prometheus directly, so that they could use the built-in AlertManager the way they were used to. We created a service with a UI that enables anyone to quickly add Prometheus expressions, after which the Nostradamus service exposes the bounds of the high-confidence area and the predicted mean to Prometheus.
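How the values are exposed is not described here, so the following is only a guess at the general shape: a sketch that renders the two bounds and the mean as gauges in the Prometheus text exposition format. The metric names (`nostradamus_forecast_*`) and the `forecast` label are hypothetical, and a real service would more likely rely on a client library such as `prometheus_client` than format the lines by hand.

```python
def escape_label(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_forecast_metrics(name: str, lower: float, mean: float, upper: float) -> str:
    """Render three gauges in the Prometheus text exposition format."""
    label = f'{{forecast="{escape_label(name)}"}}'
    lines = []
    for suffix, value in (("lower", lower), ("mean", mean), ("upper", upper)):
        metric = f"nostradamus_forecast_{suffix}"  # hypothetical metric name
        lines.append(f"# TYPE {metric} gauge")
        lines.append(f"{metric}{label} {value}")
    return "\n".join(lines) + "\n"


# Made-up values for the current moment of a "birds per second" forecast.
print(render_forecast_metrics("birds_per_second", lower=1.2, mean=2.5, upper=3.9))
```

Once Prometheus scrapes series like these, an alerting rule could compare the live expression against the lower-bound series instead of a fixed number.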

So there you have it, alerts based on statistical confidence intervals, neato!


Jos van de Wolfshaar
MessageBird

Machine Learning Practitioner, Music Enthusiast, building next-gen communication technologies at MessageBird.