Applying machine learning techniques to real-time KPI monitoring

Vivien Marcault
Lumen Engineering Blog
9 min read · Sep 1, 2020


Introduction: strong KPI monitoring

As a startup, Streamroot was always strongly rooted in using data to improve our software and our customer support. We often shared insights into how our data pipeline works and how we have refined our statistical analyses to better understand how our software interacts with the millions of devices it runs on every day.

As our business grows following our acquisition by CenturyLink, this vision has only been strengthened. As a provider of device-side video delivery to premier media companies, we strive to ensure that our service works as expected 24/7. This means not only using data to improve our software, but also using it to detect incidents quickly and accurately so that our support teams know when and how to respond.

That is why we created an in-house tool composed of modules that help us monitor our entire data pipeline. One module monitors the evolution of our customers’ Key Performance Indicators (KPIs), such as the number of playback errors, the number of dropped frames, or the amount of re-buffering. It aims to detect anomalies in these time series. [Time series are sequences of data points sampled in time. Here, they can be interpreted as the values a KPI took over the last two hours, days, etc.]

After a couple of iterations, we ended up using a deviation-based approach built on Holt-Winters’ seasonal forecasting method [ref: https://otexts.com/fpp2/holt-winters.html]. This model predicts the next data point to be observed; then, based on how far off the prediction was, the point is labelled as an anomaly or not. A minimal sketch of this kind of envelope check is given after the screenshot below.

Screenshot of the tool interface. The blue curve shows the real data points observed, while the other two curves delineate the prediction envelope (computed before the points were observed). In other words, an anomaly is detected as soon as the blue curve leaves the envelope.
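To make this concrete, here is a minimal sketch of such an envelope check using statsmodels’ ExponentialSmoothing (Holt-Winters). The seasonal period and the envelope width are illustrative assumptions, not the parameters of our actual tool.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def holt_winters_envelope(history: pd.Series, seasonal_periods=24, width=3.0):
    """Fit Holt-Winters on past KPI values and return a (lower, upper)
    prediction envelope for the next point. `width`, in residual standard
    deviations, is an illustrative tuning parameter."""
    fit = ExponentialSmoothing(
        history, trend="add", seasonal="add",
        seasonal_periods=seasonal_periods,
    ).fit()
    forecast = fit.forecast(1).iloc[0]
    resid_std = float(np.std(fit.resid))
    return forecast - width * resid_std, forecast + width * resid_std

# A new observation falling outside the envelope is flagged as an anomaly.
```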

Unfortunately, this has proven to be impractical for several reasons:

  • The false positive rate (FPR: percentage of detections that were not anomalies) was too high, which led support teams to distrust the tool.
  • The parameters used to tune the quality of detection were difficult to choose.
  • The detection delay was several hours (due to the mechanisms we added to alleviate the high FPR).

We therefore decided to start fresh and explore machine learning approaches to detect anomalies. After lots of dead-ends, modifications, and optimization, we arrived at two designs that we tested extensively and wanted to share with you.

This article is an introduction to anomaly detection and to the types of machine learning techniques that can be used to identify outliers.

General thoughts about anomaly definition

Before digging in, we would like to emphasize the definition of an anomaly. One of the biggest challenges in anomaly detection is to clearly define what we mean by an “anomaly” in the first place.

What companies often really want is to detect well-defined incidents — in the video streaming industry, for instance, an increase in re-buffering or playback errors. However, given the number of different devices, network conditions, or even software versions, it has become increasingly difficult to keep track of all the possible anomalies that can occur. Having a broad definition of “anomaly” allows us to consider the many forces that are at work behind the simple KPIs. We might end up detecting an incident that doesn’t justify human intervention, but at least we’ve been notified. Similarly, a data analyst could notice that the traffic at midnight was higher today than what it was for the past two weeks. The very fact that she detected a change means it’s an anomaly — just not one that is going to keep support teams up at night! This broad definition is here to ensure a good sensitivity of the system. The higher the sensitivity, the fewer False Negatives (FN) will occur.

The word “anomaly” has many synonyms in the literature: “outlier,” “deviation,” and “irregularity.” All these words refer to the same thing: “Something unusual, abnormal.” But as always, an answer raises 10 times more questions: “But what is normal?” and “Where do we draw the line between normal and abnormal?” … (And the answers to those are: “It is what we decide.” and “We don’t!”).

What is normal?

The key question to ask isn’t “What is an anomaly?” but “What is normal?”. It is far easier to describe an anomaly by what it is not. What’s normal is what’s expected, what we observe the most. This means that an anomaly is what we observe the least, or what’s unexpected: the rarer the event, the more anomalous it is. Intrinsically, if we manage to make a system learn what normal is, we should be able to present it with new samples and measure how far they are from what’s expected. In our case, what is normal are the patterns we observe daily or weekly.

Where do we draw the line between normal and abnormal?

We expect an anomaly detection system to alert support teams of anomalies. Either an alert is sent, or it is not, which means the system has a binary output: normal or abnormal. However, we claim that this limits its usability. If, instead of declaring a point anomalous, we gave it an anomaly score (say, from 0 to 1), we would gain flexibility. First, we could define different levels of severity, with severity increasing with the anomaly score. Then, we could combine the scores of different KPIs to raise alerts on more general concepts. For example, we could group the playback error and re-buffering KPIs to detect anomalies on a broader concept: QoS. This way we would be able to raise QoS anomaly alerts for a given customer, as sketched below.
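Here is a minimal sketch of such a score combination. The weighted-mean rule and the KPI names are illustrative assumptions; taking the maximum instead would make the combined alert more sensitive.

```python
def qos_anomaly_score(scores, weights=None):
    """Combine per-KPI anomaly scores (each in [0, 1]) into a single
    QoS-level score using a weighted mean. The combination rule is
    only one possible choice."""
    weights = weights or {kpi: 1.0 for kpi in scores}
    total = sum(weights[kpi] for kpi in scores)
    return sum(scores[kpi] * weights[kpi] for kpi in scores) / total

# Example: a strong playback-error anomaly and a mild re-buffering one.
print(qos_anomaly_score({"playback_errors": 0.7, "rebuffering": 0.3}))  # 0.5
```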

Machine learning approaches to anomaly detection

When we started to investigate the recent advances in anomaly detection and the current works in progress, we landed on an interesting paper from An and Cho [1]. In it, they classify the different approaches to anomaly detection into three categories: statistical, proximity-based, and deviation-based. The first and last echo the classical opposition between the Bayesian and frequentist points of view: should we compute the probability of the observation, or compare it to some prediction? To these is added an approach based on clustering-like techniques. This classification is a good framework for analyzing what is present in the literature.

Statistical anomaly detection

From a statistical (Bayesian) point of view, the hypothesis is that the observed data follows a specific probability distribution, and thus an underlying probabilistic model. The aim is to choose a model or family of models and learn its parameters from the available data.

Then, when a new data point is received, its probability is computed from the learned parameters. If this new data point is an anomaly and the model was trained on normal data, the probability should be low. This assumes that anomalies are rare events, i.e. that they represent a small proportion of the dataset; in an unsupervised scenario, we have no way to keep anomalies out of the training data.

To decide if a data point is an anomaly, we compare the computed probability with a threshold. Finding the right threshold would probably require having some labelled data.
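As a concrete illustration, here is a minimal sketch of this process with a multivariate Gaussian fitted to mostly-normal data. Both the model choice and the log-probability threshold are assumptions made for the example, not the models discussed in [1].

```python
import numpy as np
from scipy.stats import multivariate_normal

# Fit a simple probabilistic model (a Gaussian) on mostly-normal data.
train = np.random.default_rng(0).normal(loc=10.0, scale=2.0, size=(1000, 2))
model = multivariate_normal(mean=train.mean(axis=0),
                            cov=np.cov(train, rowvar=False))

def is_anomaly(point, log_prob_threshold=-10.0):
    """Flag a point whose log-probability under the learned model falls
    below a threshold; the threshold would ideally be tuned on some
    labelled data."""
    return model.logpdf(point) < log_prob_threshold

print(is_anomaly([10.5, 9.8]))   # near the training mean -> normal
print(is_anomaly([30.0, -5.0]))  # far from the mean -> anomaly
```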

Diagram of a process using the statistical approach

Proximity-based anomaly detection

Proximity-based detection is based on clustering-like methods. When a new data point needs to be classified, we compare it to the dataset. To decide whether it is an anomaly, we look at the local density, the average distance to the k-nearest neighbors, or the size of the closest cluster. The assumption here is that an anomalous point will not be distributed the same way as the normal ones. When working with high-dimensional data points (sequences of values, for example), we can apply a dimension reduction method such as PCA (Principal Component Analysis) in order to end up with more meaningful clusters.
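Here is a minimal sketch of the k-nearest-neighbors variant using scikit-learn; the number of components, the value of k, and the idea of treating each row as a window of KPI values are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
windows = rng.normal(size=(500, 48))  # stand-in for windows of KPI values

# Optional dimension reduction before the proximity search.
pca = PCA(n_components=8)
reduced = pca.fit_transform(windows)
knn = NearestNeighbors(n_neighbors=5).fit(reduced)

def anomaly_score(window):
    """Average distance to the k nearest neighbors of the projected
    point; the larger the distance, the more anomalous the window."""
    projected = pca.transform(window.reshape(1, -1))
    distances, _ = knn.kneighbors(projected)
    return float(distances.mean())

print(anomaly_score(rng.normal(size=48)))        # typical window
print(anomaly_score(rng.normal(size=48) + 10))   # shifted -> higher score
```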

Diagram of a process using the proximity approach

Deviation-based anomaly detection

A final approach, and maybe the most popular in anomaly detection systems, is deviation-based detection. In their paper, An and Cho only talk about auto-encoder based methods, but we will also include prediction-based methods in this category.

They describe a method in which dimension reduction is applied to a data point, which is then reconstructed from this condensed representation. This can be done with auto-encoders but also with PCA. Anomaly detection is then performed by comparing the reconstruction error to some threshold.
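Here is a minimal sketch of the PCA variant; the low-rank toy data, the number of components, and the use of mean squared error are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
mixing = rng.normal(size=(3, 24))
profiles = rng.normal(size=(500, 3)) @ mixing  # low-rank "normal" data

pca = PCA(n_components=3).fit(profiles)

def reconstruction_error(x):
    """Project onto the learned components and back, then measure the
    mean squared reconstruction error."""
    x = np.atleast_2d(x)
    reconstructed = pca.inverse_transform(pca.transform(x))
    return float(np.mean((x - reconstructed) ** 2))

print(reconstruction_error(rng.normal(size=3) @ mixing))  # small -> normal
print(reconstruction_error(rng.normal(size=24)))          # large -> anomaly
```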

Prediction-based methods, on the other hand, are specific to time-dependent data such as time series. Using a prediction model like ARIMA (or a more advanced one), we compute a predicted point, compare it to the real data point observed, and decide whether the gap constitutes an anomaly.
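Here is a minimal sketch of this idea using statsmodels’ ARIMA; the model order and the deviation threshold are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
history = pd.Series(np.sin(np.arange(200) / 10)
                    + rng.normal(scale=0.1, size=200))

fit = ARIMA(history, order=(2, 0, 1)).fit()
prediction = fit.forecast(1).iloc[0]
resid_std = float(fit.resid.std())

def is_anomaly(observed, n_std=4.0):
    """Flag the new point if it deviates from the one-step forecast by
    more than n_std residual standard deviations."""
    return abs(observed - prediction) > n_std * resid_std

print(is_anomaly(history.iloc[-1]))  # a typical value -> normal
print(is_anomaly(5.0))               # far from the forecast -> anomaly
```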

Diagram of a process using the deviation approach

Using auto-encoders to learn normality

To decide which strategy to adopt, we explored the different uses of these techniques in the literature and experimented. The models used in the statistical approach are quite sophisticated, making them a bit ambitious for our project. We then tried projection methods for a proximity-based approach; however, the projections we obtained did not allow us to separate time series containing anomalies from normal ones. We therefore settled on a deviation-based approach, and after examining the state of the art, auto-encoders emerged as an appropriate technique.

Auto-encoders (AE) and variational auto-encoders (VAE) are the most popular approaches [2][3][4][5]. They are used to learn what “normal” time series look like. Once trained, the AE is tasked with reconstructing time series; we then compare the reconstruction to the real time series to measure the error and decide whether the series contains an anomaly.

Auto-encoders are a type of artificial neural network that aims to extract useful features from high-dimensional data. They can also be seen as a dimension reduction method. The most basic auto-encoder is composed of three layers: an input layer x and an output layer x′ of the same dimension n, as well as a hidden layer h of dimension m < n. The dependence between the layers can be expressed with the following equations:

h = 𝜎(Wx + b)
x′ = 𝜎′(W′h + b′)

W, W′, b, b′ are the weights and biases that will be tuned. 𝜎 and 𝜎′ are activation functions (e.g. ReLU, sigmoid, etc.).

The first part of the network condenses the information from x into h; the second does the opposite, reconstructing from h something as similar as possible to the input x. Auto-encoders are optimized by minimizing the reconstruction error, usually the mean squared error, and training is done via classic error back-propagation.
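Here is a minimal sketch of such a three-layer auto-encoder in Keras, trained on fixed-length KPI windows. The window length, the hidden dimension, and the training data are illustrative assumptions, not the architecture we ultimately deployed.

```python
import numpy as np
import tensorflow as tf

WINDOW, HIDDEN = 48, 8  # n = 48 inputs, m = 8 hidden units (m < n)

rng = np.random.default_rng(0)
x_train = rng.normal(size=(1000, WINDOW)).astype("float32")  # stand-in for
                                                             # "normal" windows
autoencoder = tf.keras.Sequential([
    tf.keras.layers.Dense(HIDDEN, activation="relu",
                          input_shape=(WINDOW,)),        # h = 𝜎(Wx + b)
    tf.keras.layers.Dense(WINDOW, activation="linear"),  # x′ = 𝜎′(W′h + b′)
])
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction error
autoencoder.fit(x_train, x_train, epochs=10, batch_size=32, verbose=0)

def anomaly_score(window):
    """Mean squared reconstruction error; higher means more anomalous."""
    window = np.asarray(window, dtype="float32").reshape(1, -1)
    reconstruction = autoencoder.predict(window, verbose=0)
    return float(np.mean((window - reconstruction) ** 2))
```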

Auto-encoders can also be composed of more layers: these are called deep auto-encoders. They typically stack simple auto-encoders and usually display a symmetric architecture with regard to the central hidden layer. The architecture isn’t limited to fully connected layers like the previous example; many researchers have explored different ways to build such models based on other network techniques such as RNNs [3] and CNNs [2].

Looking ahead: applying the auto-encoder approach to our video time series data

In this first article, we introduced the problem of anomaly detection in our KPI monitoring and explained why it is important for our operational teams to have accurate assessments of what constitutes an issue with our CDN Mesh Delivery and CDN Orchestrator products. We also defined anomalies and presented several approaches that we examined for this project. Finally, we explained why at CenturyLink we chose to adopt a deviation-based method using auto-encoders.

This blog is provided for informational purposes only and may require additional research and substantiation by the end user. In addition, the information is provided “as is” without any warranty or condition of any kind, either express or implied. Use of this information is at the end user’s own risk. CenturyLink does not warrant that the information will meet the end user’s requirements or that the implementation or usage of this information will result in the desired outcome of the end user.

Sources

[1] J. An and S. Cho, “Variational autoencoder based anomaly detection using reconstruction probability”, 2015.

[2] T. Wen and R. Keyes, “Time series anomaly detection using convolutional neural networks and transfer learning”, 2019.

[3] T. Kieu, B. Yang, C. Guo, and C. S. Jensen, “Outlier detection for time series with recurrent autoencoder ensembles”, 2019.

[4] R.-Q. Chen, G.-H. Shi, W.-L. Zhao, and C.-H. Liang, “Sequential VAE-LSTM for anomaly detection on time series”, 2019.

[5] S. Russo, A. Disch, F. Blumensaat, and K. Villez, “Anomaly detection using deep autoencoders for in-situ wastewater systems monitoring data”, 2020.
