A practical guide to anomaly detection for video streaming analytics

Sergey Arsenyev
Lumen Engineering Blog
8 min read · May 10, 2023


At Lumen, we assist video streaming platforms in enhancing user experience and resource utilization through our client-side products. To ensure optimal performance, platforms need a reliable alerting system capable of detecting anomalies swiftly and accurately. In this article, we will outline our approach to creating a time series anomaly detector using analytics data from our client-side products.

This guide is applicable to any time series data, not just streaming analytics data. However, the choice of the model will depend on the nature of the data.

Step 1: Define anomaly detector requirements

From the outside, an anomaly detector looks like a black box that takes real-time metric sequences as input and sends anomaly alerts. We aggregate critical metrics (such as session duration) by broadcaster, user device, and stream type, and feed the aggregated data to the anomaly detector hourly.

The heart of an anomaly detector is its model, which can be a neural network, a statistical profile, or one of many other approaches. In its most general form, the interface between a detector and a model can be summed up in one question:

Given a past sequence of points and a current sequence of (not necessarily consecutive) points, how strange is the current sequence?

This interface fits models of different kinds: forecasting models that attempt to predict current values and compare them with the observations (e.g. seasonal decomposition), and models that directly score the strangeness of the current sequence (clustering methods, auto-encoders).
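
To make this concrete, here is a minimal sketch of what such an interface could look like in Python. The class and function names are illustrative, not from any particular library:

```python
import numpy as np

class AnomalyModel:
    """Hypothetical interface: given past data, score how strange
    the current sequence is. Higher score = stranger sequence."""

    def fit(self, past_values: np.ndarray, past_sample_sizes: np.ndarray) -> None:
        raise NotImplementedError

    def score(self, current_values: np.ndarray,
              current_sample_sizes: np.ndarray) -> float:
        raise NotImplementedError


def should_alert(model: AnomalyModel, past, past_sizes,
                 current, current_sizes, threshold: float) -> bool:
    # The threshold is the detector's sensitivity knob (see Step 4).
    model.fit(past, past_sizes)
    return model.score(current, current_sizes) > threshold
```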

To properly define an anomaly detector, we need to address the following questions:

How to handle data points with insufficient sample size? In the real world, we don’t always have confidence in the observed values due to a low sample size. Removing them from the model’s input is often not an option, as the input size must be fixed for some models (e.g. neural networks). We decided to add the sample size information to the model’s input as an additional dimension. We also ignore “anomalies” that are the result of a low sample size in the current sequence.

How long should the past sequence be? The answer depends on what seasonal patterns you want to account for. In our case, we take one month of past data to capture daily and weekly seasonal effects.

How long should the current sequence be? Analysing only a single point (the latest observed value) gives the most rapid alerts but suffers from unavoidable random variations, even after accounting for the sample size. On the other hand, if an anomaly persists for n points, an alert will be much more reliable, but we will only be notified n hours after the start of the anomaly. In our case, we decided to use n=2 as a good trade-off.

What should the time gap between the past and current sequences be? For many models, including the points just before the current data might not be a good idea, as the model may rely too heavily on the last point and predict the current sequence too well. This would mean that our anomaly detector is never “surprised”, even when an actual anomaly happens. Instead, a detector needs to be robust and not overweigh last-minute data. For some of the models listed below, we applied a 6-point gap.

How to introduce variable sensitivity? An anomaly detector must have variable sensitivity, which lets us find the right balance between alerting too often and not capturing some actual anomalies. It also helps us trace the detector’s precision-recall curve and compare different detectors (more on this in Step 4).
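
Putting these answers together, the input preparation might look like the following sketch. The window parameters mirror the choices described above, while MIN_SAMPLES and the function names are hypothetical:

```python
import numpy as np

PAST_LEN = 30 * 24   # one month of hourly points (daily + weekly seasonality)
GAP = 6              # points skipped between past and current sequences
CURRENT_LEN = 2      # n=2: alert after two suspicious consecutive points
MIN_SAMPLES = 100    # hypothetical minimum sample size for a trustworthy point

def split_window(values: np.ndarray, sample_sizes: np.ndarray):
    """Split the tail of a series into (past, current) windows, stacking
    the sample size as an additional input dimension."""
    end_past = -(GAP + CURRENT_LEN)
    start_past = end_past - PAST_LEN
    past = np.stack([values[start_past:end_past],
                     sample_sizes[start_past:end_past]], axis=-1)
    current = np.stack([values[-CURRENT_LEN:],
                        sample_sizes[-CURRENT_LEN:]], axis=-1)
    return past, current

def trustworthy(current: np.ndarray) -> bool:
    """Ignore 'anomalies' caused by a low sample size in the current sequence."""
    return bool((current[:, 1] >= MIN_SAMPLES).all())
```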

Step 2: Label some data

One of the most important questions in building an anomaly detector is whether or not to venture into manual labelling of anomalies. By definition, an anomaly is a rare event, and labelling a dataset large enough for fully supervised model training is often too time-consuming. However, it is still important to have some labelled anomalies for scoring models. Otherwise, comparing different models boils down to “it seems to do fine on this time series” and is a lot less scientific.

This is why we decided to manually label anomalies over two months of real time series data. Points with a sample size below the required threshold could not be labelled as anomalies.

Each time series was labelled by 3 different people using the Trainset tool. The final labelled dataset was created by considering a data point an anomaly if at least 2 of the 3 labellers counted it as such. Labellers were instructed to keep the number of consecutive anomaly points between 2 and 6. This ensures that we do not annotate single-point anomalies, and that we stop the anomaly sequence as soon as the anomaly itself becomes the new normal.

Examples of manually labelled anomalies
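
The aggregation of labels reduces to a simple majority vote; a toy example with made-up labels:

```python
import numpy as np

# One boolean row of labels per labeller (toy data).
labels = np.array([
    [0, 0, 1, 1, 0, 0],   # labeller 1
    [0, 1, 1, 1, 0, 0],   # labeller 2
    [0, 0, 1, 0, 0, 0],   # labeller 3
], dtype=bool)

# A point is an anomaly if at least 2 of the 3 labellers flagged it.
consensus = labels.sum(axis=0) >= 2
print(consensus.astype(int))  # -> [0 0 1 1 0 0]
```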

The final post-processing of the labelled data included:

  • adding or removing neighbouring points to respect the consecutive points requirement
  • adding one month of preceding data without annotations used for the initial fit of the models
  • final sanity check and subsequent exclusion of about a third of the annotated time series. Exclusions were made when anomalies, despite being voted for by the majority, made little sense, or when clearly anomalous patterns were left unlabelled.

Step 3: Choose a model

The choice of a model was already partially discussed in the previous article. Here we will list several models that we tried recently, from the most basic to the most complex.

MinMax model: compares the observed values with the rolling min and max of the past data. We used a 7-day window for the past data and a 6-hour gap between the past and current sequences.
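
A possible pandas implementation of such a scorer; the exact scoring formula is an illustrative choice (any monotone measure of distance outside the [min, max] band would do):

```python
import pandas as pd

def minmax_score(series: pd.Series, window: int = 7 * 24, gap: int = 6) -> pd.Series:
    """Score each point by how far it falls outside the rolling min/max
    of the past `window` points, skipping the last `gap` points."""
    past = series.shift(gap)
    lo = past.rolling(window).min()
    hi = past.rolling(window).max()
    span = (hi - lo).replace(0, 1e-9)   # avoid division by zero on flat series
    above = (series - hi).clip(lower=0)
    below = (lo - series).clip(lower=0)
    return (above + below) / span       # 0 for points inside the band
```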

Statistical profiling: calculates the mean and the standard deviation of the past data. The mean is used as a forecast for the current data, while the standard deviation sets upper and lower bounds.
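
For example, as a rolling z-score (the window length and gap values are illustrative):

```python
import pandas as pd

def profile_score(series: pd.Series, window: int = 30 * 24, gap: int = 6) -> pd.Series:
    """Deviation of each point from the rolling mean of the past data,
    in units of the rolling standard deviation."""
    past = series.shift(gap)
    mean = past.rolling(window).mean()
    std = past.rolling(window).std()
    return (series - mean).abs() / std

# Thresholding the score at e.g. 3 is equivalent to mean +/- 3 std bounds.
```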

Seasonal decomposition: breaks down time series data into trend, seasonal, and residual components, enabling the identification of anomalies with the residual component.
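
A sketch using the seasonal_decompose function from statsmodels, scoring each point by its normalized residual (period=24 assumes hourly data with daily seasonality):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

def residual_score(series: pd.Series, period: int = 24) -> pd.Series:
    """Decompose the series into trend + seasonal + residual components
    and score each point by the size of its normalized residual."""
    result = seasonal_decompose(series, model="additive", period=period)
    resid = result.resid.dropna()
    return (resid - resid.mean()).abs() / resid.std()
```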

Facebook Prophet: an open-source Python library that adds advanced features on top of seasonal decomposition, including the detection of trend breakpoints, the handling of outliers, and accounting for holiday effects.
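
A minimal Prophet sketch that flags points outside the model’s uncertainty interval; interval_width doubles as the sensitivity knob (the seasonality settings and the 0.99 width are illustrative):

```python
import pandas as pd
from prophet import Prophet

def prophet_flags(history: pd.DataFrame, current: pd.DataFrame,
                  width: float = 0.99) -> pd.Series:
    """Fit on past data (DataFrame with 'ds' and 'y' columns) and flag
    current points falling outside the uncertainty interval."""
    m = Prophet(interval_width=width,
                daily_seasonality=True, weekly_seasonality=True)
    m.fit(history)
    forecast = m.predict(current[["ds"]])
    y = current["y"].to_numpy()
    return pd.Series((y < forecast["yhat_lower"].to_numpy())
                     | (y > forecast["yhat_upper"].to_numpy()))
```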

LSTM forecaster: a type of recurrent neural network trained to forecast future data. Such a model can be trained on unlabelled data, as all it cares about is value prediction, not anomalies. However, in addition to the predicted value, we also need its confidence interval to assign an anomaly score. Techniques such as Monte-Carlo dropout can be applied to estimate it.
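
A Keras sketch of such a forecaster with Monte-Carlo dropout; the architecture and hyperparameters are illustrative, not production settings:

```python
import numpy as np
import tensorflow as tf

def build_forecaster(past_len: int, n_features: int = 2) -> tf.keras.Model:
    """LSTM that forecasts the next value from the past window."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(past_len, n_features)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dropout(0.2),  # kept active at inference for MC dropout
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

def mc_dropout_interval(model: tf.keras.Model, x: np.ndarray, n_runs: int = 50):
    """Monte-Carlo dropout: run the net many times with dropout enabled and
    use the spread of the forecasts as a confidence interval."""
    preds = np.stack([model(x, training=True).numpy() for _ in range(n_runs)])
    return preds.mean(axis=0), preds.std(axis=0)
```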

LSTM auto-encoder: in this case, an LSTM net is used to encode and decode a sequence in a way that only captures its essential information. An anomaly score is determined by how well the reconstructed sequence resembles the observed data.
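
A compact Keras sketch of the same idea (layer sizes are again illustrative):

```python
import tensorflow as tf

def build_autoencoder(seq_len: int, n_features: int = 2) -> tf.keras.Model:
    """Compress the sequence to a small latent vector and reconstruct it;
    a large reconstruction error suggests an anomaly."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(seq_len, n_features)),
        tf.keras.layers.LSTM(16),                         # encoder
        tf.keras.layers.RepeatVector(seq_len),            # latent -> sequence
        tf.keras.layers.LSTM(16, return_sequences=True),  # decoder
        tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(n_features)),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

# Anomaly score: per-sequence reconstruction error, e.g.
# score = ((model.predict(x) - x) ** 2).mean(axis=(1, 2))
```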

Stack of several models: a combination of several models, which often performs better on unseen data. We used a mix of Prophet with a simple MinMax model.
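
The exact combination rule is a design choice not detailed here; one simple hypothetical scheme scores a point as anomalous only when both models are surprised:

```python
import numpy as np

def stacked_score(prophet_score: np.ndarray, minmax_score: np.ndarray) -> np.ndarray:
    """Hypothetical AND-style stacking: take the minimum of two comparably
    scaled scores, so both models must agree before an alert fires."""
    return np.minimum(prophet_score, minmax_score)
```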

Step 4: Score anomaly detector performance

Our goal is to achieve the highest possible accuracy in anomaly prediction, which entails two essential requirements:

  • alerting every time an actual anomaly occurs (100% recall)
  • avoiding false alerts when observed values are not genuine anomalies (100% precision)

Individually attaining each of these requirements by adjusting a model’s sensitivity is relatively straightforward. For instance, a model classifying all data points as anomalies would achieve 100% recall but suffer from low precision.

The true challenge lies in simultaneously achieving high precision and high recall.

For each anomaly detector, we trace the precision-recall curve, shown below for one particular metric. Note that in our case of heavily imbalanced classes, the precision-recall curve is more informative than the often-used ROC curve (since the false positive rate is always close to 0 for any practical anomaly detector).

Precision-recall curves for different anomaly detectors. Every point on a curve corresponds to a specific cautiousness (sensitivity) level of the model. An effective model should have its curve positioned close to the ideal point, the upper right corner.
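
Tracing such a curve is a one-liner with scikit-learn (the arrays below are toy data):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

# Toy data: manual labels per point and the detector's anomaly scores.
y_true = np.array([0, 0, 1, 1, 0, 0, 0, 1, 1, 0])
scores = np.array([0.1, 0.3, 0.8, 0.7, 0.2, 0.1, 0.4, 0.9, 0.6, 0.1])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
print(f"PR-AUC: {auc(recall, precision):.3f}")  # area under the curve
```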

In addition to the curves for the anomaly detectors, we included a point for “unbiased human” anomaly detection. For this, we asked an individual who was not involved in creating the labeled dataset (hence unbiased) to classify the data in the same manner as in Step 2.

One might argue that in practice, the real value of an anomaly detector lies not in the correct identification of individual anomaly points but in the detection of the anomalies themselves. This means that as long as a prediction overlaps with a true anomaly for some points, it should count as a correct prediction. Below we modify the traditional precision-recall chart to look at predictions per anomaly rather than per individual point:

Same plot as above but considering detection of entire anomalies instead of individual anomaly points.
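
A sketch of this per-anomaly counting; the segment extraction and the overlap rule are hypothetical choices:

```python
import numpy as np

def segments(mask: np.ndarray):
    """(start, end) index pairs of consecutive True runs."""
    edges = np.diff(np.concatenate([[0], mask.astype(int), [0]]))
    return list(zip(np.where(edges == 1)[0], np.where(edges == -1)[0]))

def per_anomaly_recall(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """A true anomaly counts as detected if the prediction overlaps it anywhere."""
    segs = segments(y_true)
    return sum(y_pred[a:b].any() for a, b in segs) / len(segs) if segs else 1.0

def per_anomaly_precision(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """A predicted anomaly counts as correct if it overlaps any true anomaly."""
    segs = segments(y_pred)
    return sum(y_true[a:b].any() for a, b in segs) / len(segs) if segs else 1.0
```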

Conclusion

Following this guide helped us build an anomaly detector that performs at least as well as the human eye. The best performance for the metric shown above is achieved by the stacked model of Prophet and MinMax, which outperforms both of its components. The example below shows how an anomaly was flagged differently by different anomaly detectors.

Example of anomaly detection for several models tuned to their best-performing sensitivities. The shaded area is the confidence interval of a model.

This article is based on work done together with Karim Abdouli and Lumen CDN R&D team.

This content is provided for informational purposes only and may require additional research and substantiation by the end user. In addition, the information is provided “as is” without any warranty or condition of any kind, either express or implied. Use of this information is at the end user’s own risk. Lumen does not warrant that the information will meet the end user’s requirements or that the implementation or usage of this information will result in the desired outcome of the end user. All third-party company and product or service names referenced in this article are for identification purposes only and do not imply endorsement or affiliation with Lumen. This document represents Lumen products and offerings as of the date of issue.
