Time Series Anomaly Detection: The Detective’s Toolbox

Published in Georgian Impact Blog · 9 min read · Nov 11, 2022

By: Benjamin Ye and Angeline Yasodhara

This is the second post of our series on time series outlier detection. You can read the first post here.

In our last post, we introduced the main families of scoring algorithms and described how Georgian, together with cybersecurity companies, used these techniques to detect threats. In this post, we will look into the detective’s toolbox for a closer examination of these techniques.

I. Detection Methods

Recall from the last post that the first step in a time series anomaly detection system is to use a model that transforms time series into anomaly scores at each time step. Common methods include prediction-based, reconstruction-based, and distance-based models. Let’s examine each one in detail.

Prediction-Based Methods

Previously, we defined time series as an ordered set:

X = {x₀, x₁, …, xₙ₋₁, xₙ}

such that

xₜ = f(x₀, x₁, …, xₜ₋₁) + ϵ

This indicates that the value at a given time point, xₜ, is a function of the values before it (x₀, x₁, …, xₜ₋₁), plus a noise term ϵ.

Prediction-based methods work as follows:

1. During training, fit an estimator to model xₜ for t = 0, 1, 2, …, n. Ideally, only non-anomalous data should be used for fitting. In production, the models are periodically rebuilt to adapt to changes in the time series.
2. At inference time, compute the anomaly score as the difference between the predicted and actual value at each time point.
3. Classify values that fall outside the prediction’s confidence interval as anomalous directly. Alternatively, take the anomaly score and tune the threshold, as discussed later in this post.

Outside the realm of anomaly detection, this estimator can also be used to predict future values of the time series: two birds with one stone!

*Example of anomaly detection with ARIMA, a prediction-based method. Values outside of prediction confidence intervals (in grey) are marked as outliers. Credit:* *stackexchange.com*
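
To make the prediction-based recipe concrete, here is a minimal sketch that uses a plain least-squares autoregressive model as the estimator. The function names, the lag order `p=3` and the synthetic sine series are our own illustrative assumptions, not part of any particular library; a production system would more likely use one of the forecasting models discussed in this post.

```python
import numpy as np

def lag_matrix(x, p):
    """Rows of p consecutive values plus an intercept column; row j precedes x[j + p]."""
    cols = [x[i:len(x) - p + i] for i in range(p)]
    return np.column_stack(cols + [np.ones(len(x) - p)])

def ar_anomaly_scores(train, test, p=3):
    """Fit AR(p) on `train` by least squares, score `test` by absolute forecast error."""
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    coef, *_ = np.linalg.lstsq(lag_matrix(train, p), train[p:], rcond=None)
    # One-step-ahead forecasts; the last p training values seed the lag window.
    history = np.concatenate([train[-p:], test])
    preds = lag_matrix(history, p) @ coef
    return np.abs(test - preds)

# Hypothetical data: a noisy sine wave with one injected point anomaly.
rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 50) + 0.1 * rng.normal(size=t.size)
train, test = series[:300], series[300:].copy()
test[40] += 3.0
scores = ar_anomaly_scores(train, test)
```

Note that the few points right after the spike can also receive high scores, because the anomalous value enters the lag window used to forecast them.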

The estimator can be fitted with a variety of algorithms, from classical autoregressive models to state-of-the-art neural networks such as transformers. Note that some models make strong assumptions about the temporal relationships between variables. For example, an Autoregressive (AR) model assumes that the current value is a linear combination of past values. Autoregressive Integrated Moving Average (ARIMA) combines AR with differencing (the “Integrated” part) and a Moving Average term over past forecast errors, which lets it handle trends and other forms of non-stationarity, although the model remains linear in the differenced series (see this blog for a detailed explanation). ARIMA-based models are powerful, stable and theoretically sound tools for forecasting. However, they rely on certain assumptions about the data and require more in-depth analysis to be used properly.

Machine learning and deep learning models, on the other hand, impose very few prior assumptions on the underlying structure of the time series and are able to learn non-trivial, non-linear temporal dependencies. DeepAR, Prophet, NeuralProphet, N-BEATS, N-HiTS (a more scalable version of N-BEATS) and Temporal Fusion Transformer (TFT) are some machine learning models for forecasting. However, they tend to demand much more data to achieve good results and can be unstable.

Reconstruction-Based Methods

Another general approach for finding anomalies in time series is by using reconstruction-based methods. These methods work as follows:

1. Train a model to encode time series sub-sequences into a lower-dimensional latent space. As with prediction-based methods, only non-anomalous sub-sequences should be included, and models are periodically rebuilt in production.
2. The model then attempts to reconstruct the original sub-sequence from the latent representation.
3. The reconstruction loss is used as the anomaly score. Intuitively, common sub-sequences should be reconstructed well by the model, with little to no reconstruction loss, whereas anomalous sub-sequences are uncommon and therefore have a higher reconstruction loss.

Common models used here are from the autoencoder family. Examples include VAE and its variants such as LSTM-VAE and Donut.

*Schematic of how LSTM-VAE uses reconstruction prediction error as anomaly score. Credit:* *Lin et al. (2020)*
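
The encode, reconstruct and score loop can be sketched without a neural network. The example below swaps in PCA (computed with an SVD) as a lightweight linear stand-in for an autoencoder; the helper names, the window length and the two-dimensional latent space are illustrative assumptions on our part.

```python
import numpy as np

def sliding_windows(x, w):
    """Stack every length-w sub-sequence of x as a row."""
    x = np.asarray(x, dtype=float)
    return np.stack([x[i:i + w] for i in range(len(x) - w + 1)])

def fit_pca(windows, k):
    """Learn a k-dimensional linear latent space from (ideally normal) windows."""
    mean = windows.mean(axis=0)
    _, _, vt = np.linalg.svd(windows - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruction_scores(windows, mean, components):
    centered = windows - mean
    latent = centered @ components.T   # encode into the latent space
    recon = latent @ components        # decode back to the window space
    return np.mean((centered - recon) ** 2, axis=1)  # reconstruction loss per window

# Hypothetical data: train on a clean noisy sine, test on a copy that flatlines.
rng = np.random.default_rng(0)
w = 25
t = np.arange(500)
series = np.sin(2 * np.pi * t / 25) + 0.05 * rng.normal(size=t.size)
train, test = series[:300], series[300:].copy()
test[100:120] = 0.0  # injected pattern anomaly
mean, components = fit_pca(sliding_windows(train, w), k=2)
scores = reconstruction_scores(sliding_windows(test, w), mean, components)
```

Each score belongs to a window rather than a single time point, so windows that overlap the flat segment are the ones that stand out.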

Distance-Based Methods

Lastly, distance-based methods directly compute a distance measure and use it as the anomaly score of a sub-sequence. The internals of these distance functions vary. For example, some calculate the distance from a sub-sequence to every other sub-sequence, while others calculate the distance to an embedding of non-anomalous sub-sequences. Because implementations are so diverse, it is worth researching the underlying model to get a sense of what’s happening under the hood.

For example, Matrix Profile takes the minimum Euclidean distance between the current sub-sequence and every other sub-sequence, while One-Class SVM (OCSVM) fits a boundary around normal samples and uses the distance to that boundary as the anomaly score.

*OCSVM draws a boundary along fitted non-anomalous samples. Credit:* *Li et al. (2020)*
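
The Matrix Profile idea mentioned above can be written as a brute-force sketch. We assume here that sub-sequences are z-normalized before comparison and that neighbours within half a window are skipped to avoid trivial self-matches; both are common conventions rather than something stated in this post, and dedicated libraries use much faster algorithms than this quadratic loop.

```python
import numpy as np

def znorm(w):
    """Z-normalize a sub-sequence; a constant window maps to all zeros."""
    s = w.std()
    return (w - w.mean()) / s if s > 0 else np.zeros_like(w)

def matrix_profile_naive(x, m):
    """Distance from each length-m sub-sequence to its nearest non-trivial neighbour."""
    x = np.asarray(x, dtype=float)
    subs = np.stack([znorm(x[i:i + m]) for i in range(len(x) - m + 1)])
    n = len(subs)
    exclusion = m // 2
    profile = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        lo, hi = max(0, i - exclusion), min(n, i + exclusion + 1)
        d[lo:hi] = np.inf  # ignore the sub-sequence itself and its close overlaps
        profile[i] = d.min()
    return profile

# Hypothetical data: a noisy sine wave with a short flat segment.
rng = np.random.default_rng(0)
t = np.arange(300)
series = np.sin(2 * np.pi * t / 20) + 0.05 * rng.normal(size=t.size)
series[150:160] = 0.0
profile = matrix_profile_naive(series, m=20)
discord_start = int(np.argmax(profile))  # start index of the most unusual window
```

A high profile value means the sub-sequence has no close match anywhere else in the series, which is why the profile itself can serve as the anomaly score.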

Other models approximate a distance measure. Isolation Forest, for example, isolates anomalous sub-sequences with multiple decision trees. Sub-sequences that lie outside of the normal range tend to require fewer tree splits to be isolated, so the anomaly score is derived from the inverse of the number of splits required to isolate the sub-sequence.

*Isolation Forest isolates anomalies with an ensemble of trees. Anomalous points are isolated earlier on. Credit:* *QuantDare*

With distance-based methods, not only can we pass in subsequences as is, we can also pass in engineered time series features such as seasonality and number of peaks to boost the performance of the model. Helpful libraries that calculate time series features include tsfresh and tsfel.

Unlike previous models, distance-based models are sometimes fitted (in the broadest sense) in an unsupervised fashion with the whole dataset, as in the case of Matrix Profile. Others, like OCSVM and Isolation Forest, still need to be periodically retrained like models from prediction-based and reconstruction-based families.

Ensemble

Amongst the large variety of scoring models, some perform well for specific types of anomalies but poorly for others. In an example from Schmidl et al. (2022), Sub-LOF generates a good response when facing a pattern anomaly (an anomaly that spans multiple time points) but produces a score of zero for a point anomaly.

Ensembling is, therefore, a natural extension for an anomaly detection system. Ensembling techniques saw great success in the time series anomaly detection competition hosted by UCR during KDD 2021, where four of the top five teams used ensemble techniques.

In addition, as discussed earlier, sub-sequence size matters a great deal. If we can combine the scores generated from different sub-sequence lengths, not only can we forgo the tuning of window sizes, we can also incorporate longer-range features.
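
One simple way to combine detectors is sketched below: convert each detector’s scores to ranks so that different scales become comparable, then average them per time point. This is only one of many possible combination rules, and the function names and toy scores are our own assumptions. The sketch also assumes that all score series are already aligned to the same time index.

```python
def rank_normalize(scores):
    """Map scores to (0, 1] by rank; ties are broken by position, not averaged."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / len(scores)
    return ranks

def ensemble_scores(score_lists):
    """Average the rank-normalized scores of several detectors per time point."""
    normalized = [rank_normalize(s) for s in score_lists]
    return [sum(column) / len(column) for column in zip(*normalized)]

# Hypothetical detectors on very different scales, each catching a different anomaly.
detector_a = [0.1, 0.2, 9.0, 0.3, 0.1]       # reacts at index 2
detector_b = [10.0, 12.0, 11.0, 90.0, 13.0]  # reacts at index 3
combined = ensemble_scores([detector_a, detector_b])
top_two = sorted(range(len(combined)), key=lambda i: -combined[i])[:2]
```

Averaging can dilute an anomaly that only one detector sees, so taking the maximum of the normalized scores instead of the mean is a reasonable alternative to try.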

II. Thresholding Methods

After we have obtained a series of anomaly scores, we need to decide on a threshold to separate the scores into anomalous and non-anomalous.

Quantile

For this step, the most common approach is to use a quantile: we label the top p percent of anomaly scores as anomalies. The downside of this simplicity is that the method always predicts anomalies, even when the data contains none, because the top p percent of scores are flagged regardless of their values.
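
A minimal sketch of quantile thresholding, with a hypothetical function name and synthetic scores, shows the guaranteed-anomaly effect: even scores drawn from pure noise produce flags.

```python
import numpy as np

def quantile_threshold(scores, p=1.0):
    """Flag the top p percent of scores; returns (boolean flags, threshold)."""
    scores = np.asarray(scores, dtype=float)
    threshold = np.percentile(scores, 100.0 - p)
    return scores > threshold, threshold

# Hypothetical scores from a series that contains no anomalies at all.
rng = np.random.default_rng(0)
clean_scores = rng.normal(size=1000)
flags, threshold = quantile_threshold(clean_scores, p=1.0)
n_flagged = int(flags.sum())  # 1 percent of 1000 points is flagged regardless
```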

Z-Normalization

One way to avoid this problem is Z-normalization, where we standardize the anomaly scores to zero mean and unit variance and pick a threshold based on how many standard deviations a score lies from the mean.
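
As a rough sketch (the function name, the cut-off of three standard deviations and the synthetic scores are our own assumptions), Z-normalization thresholding can flag nothing when no score stands out:

```python
import numpy as np

def zscore_threshold(scores, k=3.0):
    """Flag scores lying more than k standard deviations above the mean."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    if std == 0:
        return np.zeros(scores.shape, dtype=bool)  # constant scores: nothing stands out
    return (scores - scores.mean()) / std > k

# Hypothetical scores: bounded noise, then the same noise with one extreme value.
rng = np.random.default_rng(0)
clean = rng.uniform(size=1000)
contaminated = clean.copy()
contaminated[123] = 5.0

flags_clean = zscore_threshold(clean)
flags_contaminated = zscore_threshold(contaminated)
```

Keep in mind that extreme scores inflate the mean and standard deviation they are compared against, so very large or very frequent anomalies can mask smaller ones.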

Peaks over Threshold (POT)

Another distribution-based thresholding method is Peaks over Threshold (POT), where we separately model the tail of the anomaly scores with a heavy-tailed distribution (most often the Generalized Pareto Distribution) and label scores whose tail probability under the fitted distribution falls below a chosen p-value threshold as positives.
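
The sketch below illustrates the POT idea under simplifying assumptions that are ours: the Generalized Pareto parameters are estimated with a method-of-moments fit (implementations usually rely on maximum likelihood), the initial threshold is the 98th percentile, and the function name is hypothetical. It returns the score level whose estimated tail probability equals `q`.

```python
import numpy as np

def pot_threshold(scores, q=1e-3, init_quantile=0.98):
    """Score level z with estimated P(score > z) = q, via a GPD fit to the tail."""
    scores = np.asarray(scores, dtype=float)
    t = np.quantile(scores, init_quantile)   # initial (high) threshold
    excesses = scores[scores > t] - t        # the "peaks" over that threshold
    m, v = excesses.mean(), excesses.var()   # assumes several distinct peaks (v > 0)
    ratio = m * m / v
    shape = 0.5 * (1.0 - ratio)              # method-of-moments GPD shape
    scale = 0.5 * m * (1.0 + ratio)          # method-of-moments GPD scale
    r = q * scores.size / excesses.size
    if abs(shape) < 1e-6:                    # exponential-tail limit of the GPD
        return float(t - scale * np.log(r))
    return float(t + (scale / shape) * (r ** (-shape) - 1.0))

# Hypothetical scores with an exponential tail, where P(score > z) = exp(-z).
rng = np.random.default_rng(0)
scores = rng.exponential(size=20000)
z = pot_threshold(scores, q=1e-3)
flags = scores > z
```

For this synthetic example the exact level is -ln(0.001) ≈ 6.9, which gives a reference point for how close the fitted threshold lands.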

Clustering

We can also approach thresholding using clustering. By assuming that scores from non-anomalous and anomalous data points form distinct clusters, we can use methods such as Jenks Natural Breaks (essentially a fast K-Means in 1-D) and Gaussian Mixture Models to come up with a boundary. However, this suffers from the same guaranteed-anomaly problem as the Quantile approach, because we assume the number of clusters is k = 2. To tackle this problem, methods such as the Hopkins statistic for clustering tendency can be used to check whether the scores do, indeed, form distinct clusters.
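
A minimal version of the clustering approach is a two-cluster K-Means on the 1-D scores, sketched below with a hypothetical function name and toy data. It is a plain Lloyd-style iteration, not the Jenks algorithm itself.

```python
import numpy as np

def two_means_threshold(scores, max_iter=100):
    """Split 1-D scores into two clusters; return the midpoint of the two centroids."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()  # initialise centroids at the extremes
    if lo == hi:
        return float(lo)                 # all scores identical: nothing to split
    for _ in range(max_iter):
        boundary = (lo + hi) / 2.0
        new_lo = scores[scores <= boundary].mean()
        new_hi = scores[scores > boundary].mean()
        if new_lo == lo and new_hi == hi:
            break                        # centroids stopped moving
        lo, hi = new_lo, new_hi
    return float((lo + hi) / 2.0)

# Hypothetical scores: 95 low values and 5 clearly separated high values.
normal = 0.1 + 0.002 * np.arange(95)
anomalous = np.array([4.8, 5.0, 5.2, 5.1, 4.9])
scores = np.concatenate([normal, anomalous])
threshold = two_means_threshold(scores)
flags = scores > threshold
```

Because the number of clusters is fixed at two, the same function would also split a series with no anomalies into a “low” and a “high” group, which is why a clustering-tendency check is worth running first.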

Of course, if we have labelled data, things become easier. For starters, we can generate a threshold that includes all known anomalies, or optimize for an acceptable precision and recall rate with a grid or Bayesian search.
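
With labels available, the simplest search is a grid over the observed score values. The sketch below picks the threshold that maximizes F1; the function name and toy data are illustrative assumptions, and the same loop could optimize any other trade-off between precision and recall.

```python
def best_f1_threshold(scores, labels):
    """Try every observed score as a threshold; return (threshold, F1) with the best F1."""
    best_thr, best_f1 = None, -1.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

# Hypothetical scores with two labelled anomalies (label 1).
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
labels = [0, 0, 0, 1, 0, 1]
threshold, f1 = best_f1_threshold(scores, labels)  # (0.8, 1.0) for this toy data
```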

III. Evaluation Metrics

Subtleties in Pattern Anomaly Evaluation

There are some subtleties when it comes to evaluating the performance of a time series anomaly detection system. For pattern anomalies, models are often unfairly penalized by traditional supervised metrics such as precision and recall, because some models flag only a subset of the anomalous span. Competent models can therefore show artificially low performance even though they perform well in real-life situations, where we are often only interested in whether the model has detected an anomaly at all rather than whether it has recovered the anomaly’s full length.

To correct this discrepancy, most of the time series anomaly detection literature advocates point adjustment, where we consider the whole span of anomalous data to be correctly labelled as long as some part of it is correctly labelled. A more advanced variant imposes a penalty for delayed detection. For example, in the figure below, predictor #2 should in certain cases be rewarded more because it catches the anomaly at an earlier time point.

*Example of point adjustment in pattern anomaly.*
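
Point adjustment itself is only a few lines of code. The sketch below is a basic version without the delay penalty, and the function names and toy labels are our own.

```python
def point_adjust(pred, truth):
    """Mark a whole true anomalous span as detected if any point in it was flagged."""
    adjusted = list(pred)
    n = len(truth)
    i = 0
    while i < n:
        if truth[i]:
            j = i
            while j < n and truth[j]:
                j += 1                    # j is one past the end of this span
            if any(pred[i:j]):
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def recall(pred, truth):
    tp = sum(1 for p, y in zip(pred, truth) if p and y)
    return tp / sum(truth)

# Hypothetical labels: two anomalous spans, a detector that hits one point of the first.
truth = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
adjusted = point_adjust(pred, truth)
recall_before, recall_after = recall(pred, truth), recall(adjusted, truth)
```

The missed second span and the false positive at index 7 are left untouched, so the adjustment only changes how partially detected spans are counted.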

Dealing With Unlabelled Data

When we have no labelled data, it becomes difficult to come up with sensible performance metrics. In practice, we have used heuristic measures such as the density of anomalies and changes in the trend of surrounding values when tuning with unlabelled data. For example, we find combinations of the following measures helpful:

- Average distance between the values of predicted anomalies and the mean
- Number of time points between anomalies
- Maximum range of non-anomalous sub-sequences
- Difference between the current slope and that of surrounding sequences
- Difference between the current value and the mean of surrounding values

IV. Recent Advances

Transformers / Attention Mechanism

Recent advances in attention mechanisms have enabled researchers to incorporate both short-range and long-range dependencies in time series data. Examples can be found in the Autoformer (predictive), Temporal Fusion Transformer (predictive) and TranAD (reconstruction) models.

Synthetic Data / Data Augmentation

The proliferation of deep learning methods brings with it a demand for large quantities of data. Microsoft has had success training its SR-CNN model on synthetic anomalies to achieve superior performance over classical models. Wen et al. (2022) have also observed an uplift in detection performance after augmenting time series data.

V. Georgian’s pyoats toolkit

At Georgian, we recently published an easy-to-use toolkit called pyoats that supports quick calculation of anomaly scores for both univariate and multivariate data, with support for over 15 models and various thresholding techniques.

Thank you for reading our overview of the time series anomaly detection toolbox! In the next post, we will go over the usage of the package for real-world data. In the meantime, please check out the pyoats GitHub repo and documentation should you wish to experiment with it right away. 🔎