A Review of Anomaly Detection Metrics

Iurii Katser
Jul 12, 2022


Since we are going to look into anomaly detection metrics, let’s define anomaly and anomaly detection first. In general, an anomaly is a deviation from expected behavior, and anomaly detection denotes the problem (and often the technique) of finding odd patterns that do not match expected behavior. Anomaly detection is important in many disciplines and applications: for instance, it is actively used to trace computer network penetration, predict equipment failure, detect banking fraud, observe climate change, and monitor patients’ health.

Another definition that we will need is that of a time series, because I will talk about anomaly detection in time series data. Here it is:

A time series is a chronological sequence of points or samples representing a process behavior.

Example of time series data

It isn’t an easy task to evaluate anomaly detection (AD) algorithms and select one of them for a specific purpose, because there exist different mathematical problems, various metrics suitable for specific tasks, and other factors that may affect your choice of technique. Researchers and data scientists often select a standard metric, such as F1, only because it is recommended for classification tasks in general. I offer this overview of anomaly detection metrics to help with this choice.

Disclaimer: This overview is a bit simplified and does not claim to be completely exhaustive.

Confusion matrix

The confusion matrix is a crucial concept for understanding the different outcomes of a prediction. This is how it looks in a table:

Confusion matrix

And this is how to read the table:

  • tp (true positive): the prediction is ‘anomalous value’ and it is correct;
  • tn (true negative): the prediction is ‘normal value’ and it is correct;
  • fp (false positive): the prediction is ‘anomalous value’ and it is wrong;
  • fn (false negative): the prediction is ‘normal value’ and it is wrong.
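These four counts can be computed directly from label arrays. Here is a minimal sketch; the variable names are illustrative and not taken from any particular library:

```python
import numpy as np

true = np.array([0, 0, 1, 1, 0, 1])  # 1 = anomalous value, 0 = normal value
pred = np.array([0, 1, 1, 0, 0, 1])  # model output

tp = int(np.sum((pred == 1) & (true == 1)))  # predicted anomalous, truly anomalous
tn = int(np.sum((pred == 0) & (true == 0)))  # predicted normal, truly normal
fp = int(np.sum((pred == 1) & (true == 0)))  # predicted anomalous, actually normal
fn = int(np.sum((pred == 0) & (true == 1)))  # predicted normal, actually anomalous

print(tp, tn, fp, fn)  # 2 2 1 1
```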

Anomaly detection (AD) problems

Before I describe the issues that are most frequently identified as anomaly detection problems, we should pinpoint the anomaly categories. Anomalies are often split into single point and collective ones by the number of points in the anomaly. There is also a contextual anomaly type, and sometimes the collective and contextual types are combined into a range-based anomaly type, but I will refer to it as ‘collective’ for the sake of simplicity and give a definition from here:

A collective anomaly is one that occurs over a consecutive sequence of time points, where no non-anomalous data points exist between the beginning and the end of the anomaly.

Difference between point and collective anomalies

I suggest you read my previous work for more information on the characteristics of anomaly detection problems.

Anomaly detection is often reduced to a binary classification problem (also known as an outlier detection problem) or a change point detection problem. The figure below shows the relationship between these problems and anomaly types:

Relationship between problems and anomaly types

What does it mean? It means that you may want to employ binary classification algorithms for point anomalies, and both binary classification and change point detection algorithms for collective anomalies. In other words, change point detection algorithms are applicable only to collective anomalies, because we want to find the exact point where the collective anomaly begins. However, you can also think of a collective anomaly as a set of point anomalies, which is why binary classification algorithms are applicable to both anomaly types.


Classes of metrics

Anomaly detection problems can be linked with classes of metrics that are used to evaluate anomaly detection algorithms. Look at the figure below to see the relationship between anomaly detection problems and classes of metrics:

Relationship between problems and classes of metrics

The metrics are categorized into the following classes:

  • Binary classification metrics compare the predicted class of each point (normal or anomalous) with its true label.

Here, it is good to remember that for change point detection problems, binary classification metrics are applied in a window-based way: we check whether a predicted change point falls into the detection window, or we compare the overlap of the predicted and true windows (by position, size, etc.).

For binary classification problems, binary classification metrics are applied in a point-based way: we check if the predicted label is correct for each point.

  • Window-based detection metrics (except binary classification ones) help match the predicted change point with the detection window around the actual change point in a way different from binary classification.
  • Detection time (or point #) error metrics help evaluate the difference in time (or point #/index) between the predicted point and the actual change point.

Taxonomy of AD metrics

Anomaly detection metrics can be grouped, and such a grouping is often called a ‘taxonomy’, that is, a classification.

The figure here shows the taxonomy of anomaly detection metrics, and below I give brief information on each metric, together with a link to the paper where the metric is presented in more detail, with formulas and additional analysis.

Window-based detection (except binary classification) metrics

  • NAB scoring algorithm: This algorithm rewards early and correct detections and penalizes false positives and false negatives. A special scoring function is applied to anomaly windows, where each window represents a range of data points centered around a ground truth anomaly. More details can be found in this paper.
  • RandIndex: Its value indicates whether two segmentations (predicted and ground truth) agree or disagree on pairs of points. More details are given in this paper.

Metrics for detection time (or point #) error

  • ADD (average detection delay) = MAE (mean absolute error) = AnnotationError: This metric denotes the difference between the predicted change point and the true change point and can be measured in time or in number of points. The absolute values of the differences between predicted and actual change points are summed and averaged over the number of change points. This paper has further information.
  • MSD (mean signed difference): Unlike MAE, this measure also considers the direction of the error, indicating whether the prediction occurred before or after the actual change point. More details are given in this paper.
  • MSE (mean squared error), RMSE (root mean squared error), NRMSE (normalized root mean squared error): These are alternatives to MAE. However, the errors are squared, so the resulting measure will be very large if a few dramatic outliers exist in the classified data. More details can be found in this paper.
  • ADD at a particular average run length to false alarm (or ADD at a particular level of probability of false alarm): This is a performance criterion for quantifying the propensity of a detection algorithm to generate false alarms. Check out this paper for more information.
  • Hausdorff: This metric is equal to the biggest temporal separation between a change point and its prediction. More details are in this paper.
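Several of these metrics can be sketched in a few lines, assuming each true change point has already been paired with its prediction (the Hausdorff line below is a simplification of the full two-sided metric under this pairing assumption; the data is illustrative):

```python
import numpy as np

true_cps = np.array([50, 120, 300])  # true change point indices
pred_cps = np.array([53, 117, 306])  # paired predictions

diff = pred_cps - true_cps                # signed error, in points
add = np.mean(np.abs(diff))               # ADD / MAE / AnnotationError
msd = np.mean(diff)                       # MSD keeps the sign (late > 0)
rmse = np.sqrt(np.mean(diff ** 2))        # penalizes large misses harder
hausdorff = np.max(np.abs(diff))          # worst temporal separation

print(add, msd, hausdorff)  # 4.0 2.0 6
```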

The fundamental book “Sequential Analysis: Hypothesis Testing and Changepoint Detection” gives more metrics that are similar but useful for specific purposes and change point detection procedures, such as worst-case mean detection delay, integral average detection delay, maximal conditional average delay to detection, mean time between false alarms, and many more. This book includes advice on how to select criteria or metrics for change point detection in various applications, for example, quality control:

… the best solution is the quickest detection of the disorder with as few false alarms as possible. … Thus, an optimal solution is based on a tradeoff between the speed of detection or detection delay and the false alarm rate, using a comparison of the losses implied by the true and false detections.

Binary classification metrics

  • FDR (fault detection rate) = TPR (true positive rate) = Recall = Sensitivity: This is the ratio of true positive data points (change points) to a total number of true anomalous (or change) points (i.e. true positive plus false negative results in the confusion matrix). More details are presented in this paper and this paper.
  • MAR (missed alarm rate) = 1 − FDR: This is the ratio of false negative data points (change points) to the total number of true anomalous (or change) points (i.e. true positive plus false negative results in the confusion matrix).
  • Specificity: This is the ratio of true negative data points (change points) to a total number of normal points (i.e. true negative plus false positive results in the confusion matrix). More details can be found in this paper.
  • FAR (false alarm rate) = FPR (false positive rate) = 1 − Specificity: This is the ratio of false positive data points (change points) to the total number of normal points (i.e. true negative plus false positive results in the confusion matrix). It serves as a measure of how often false alarms occur.
  • G-mean: This is a combination of Sensitivity and Specificity. Refer to this paper for further information.
  • Precision: This is the ratio of true positive data points (change points) to a total number of points classified as anomalous or change points (i.e. true positive plus false positive results in the confusion matrix). More details are presented in this paper.
  • F-measure: This is a weighted combination of Precision and Recall (the F1 measure is the harmonic mean of precision and recall). More details can be found in this paper.
  • Accuracy: This is the ratio of correctly classified data points (or change points) to the total number of data points (change points). Refer to this paper for more details.
  • ROC-AUC (Receiver Operating Characteristic, area under the curve), PRC-AUC (Precision-Recall curve, area under the curve): These are useful tools when predicting the probability of a binary outcome. More details are presented in this paper and this article.
  • MCC (Matthews correlation coefficient): This is the measure to identify the quality of binary classification, which considers all of true positives, true negatives, false positives and false negatives. More details are presented in this paper.
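Most of these metrics reduce to simple arithmetic on the confusion-matrix counts. A minimal sketch with illustrative counts (the numbers below are made up for the example):

```python
import math

tp, tn, fp, fn = 8, 85, 5, 2  # illustrative confusion-matrix counts

recall = tp / (tp + fn)                      # FDR / TPR / Sensitivity
mar = 1 - recall                             # missed alarm rate
specificity = tn / (tn + fp)
far = 1 - specificity                        # FAR / FPR
g_mean = math.sqrt(recall * specificity)     # geometric mean of the two
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

print(round(recall, 2), round(precision, 2))  # 0.8 0.62
```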

Difference in binary classification metrics for specific AD problems

A change point detection problem can be described as:

  • TP: the number of correctly detected change points (# of tp).
  • FP: the number of points incorrectly identified as change points (# of fp).
  • TN: the number of normal points correctly identified as normal (# of tn).
  • FN: the number of missed change points (# of fn).

Description for outlier detection problem:

  • TP: the number of data points correctly identified as anomalous (# of tp).
  • FP: the number of normal data points incorrectly identified as anomalous (# of fp).
  • TN: the number of normal data points correctly identified as normal (# of tn).
  • FN: the number of data points incorrectly identified as normal (# of fn).

When we identify tp, fp, tn, fn for outlier detection and change point detection problems, the main difference is that for outlier detection, each data point is identified as either anomalous or normal, but for change point detection, each change point (or each true anomaly) is identified as detected or missed.
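The change point case can be sketched with a simple window-matching routine; the ±window tolerance and the greedy matching below are illustrative choices for the example, not a standard from the cited papers:

```python
def cp_confusion(true_cps, pred_cps, window=5):
    """Count tp, fp, fn for change point detection.

    A prediction counts as tp if it falls within +/- window points of a
    still-unmatched true change point; leftover predictions are fp and
    leftover true change points are fn (tn is rarely used here).
    """
    matched_true, tp = set(), 0
    for p in pred_cps:
        hit = next((t for t in true_cps
                    if abs(p - t) <= window and t not in matched_true), None)
        if hit is not None:
            matched_true.add(hit)
            tp += 1
    fp = len(pred_cps) - tp
    fn = len(true_cps) - len(matched_true)
    return tp, fp, fn

print(cp_confusion([50, 120], [48, 90, 123]))  # (2, 1, 0)
```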

More binary classification metrics for anomaly detection in time series are given in this paper.

Loss functions vs anomaly detection criterion vs anomaly detection metric: Example

To make it clear, I would like to use an example to explicitly explain which part of the anomaly detection pipeline this article focuses on. Refer to the figure here to see the anomaly detection pipeline, with the metric part highlighted in red.

Simplified scheme of a common anomaly detection pipeline

Step 1. Prediction (Process model): use a machine learning model, trained on typical real-world data, to predict the signal one point ahead at each moment of time. If the model fails to match the signal, it indicates that the values deviate from the typical data.

Step 2. Anomaly detection criterion: the anomaly detection algorithm compares the error function (absolute error) with a threshold established earlier, during the model training step. The absolute error grows when the values deviate from the typical data used for training. When the threshold is exceeded, an anomaly is detected.

Step 3. Evaluation: this is the step in the pipeline that this article focuses on.

The article doesn’t cover loss functions (cost functions, error functions) that are optimized during the training of machine learning algorithms in Step 1. However, a function (criterion or metric) from Step 2 or 3 may be selected as a loss function to maximize the model’s quality.
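To make Steps 1 and 2 concrete, here is a deliberately naive sketch: a persistence “model” (the last value predicts the next one) stands in for the trained predictor, and the threshold is an arbitrary multiple of the error on an assumed training window. Both choices are placeholders, not the method of any particular library:

```python
import numpy as np

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.05, 200)
signal[150] += 1.5                       # inject a point anomaly

# Step 1: predict x[t] as x[t-1] (persistence "model")
prediction = signal[:-1]
abs_error = np.abs(signal[1:] - prediction)

# Step 2: threshold from the (anomaly-free) "training" part of the errors
threshold = 5 * abs_error[:100].mean()
anomalies = np.flatnonzero(abs_error > threshold) + 1

print(150 in anomalies)  # True
```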

Python implementation of AD metrics in TSAD

First of all, let’s import some libraries:

import sys

import pandas as pd
import numpy as np

try:
    import tsad
except ImportError:
    sys.path.insert(1, '../')

from tsad.evaluating.evaluating import evaluating

Next, let’s init the data with true labels and predicted labels:

true = pd.Series(0, pd.date_range('2020-01-01', '2020-01-20', freq='D'))
true.iloc[[6, 14]] = 1

prediction = pd.Series(0, pd.date_range('2020-01-01', '2020-01-20', freq='D'))
prediction.iloc[[4, 10]] = 1

pd.concat([true, prediction], axis=1).reset_index().head()

This is the output:

Input for evaluating using the default (out-of-the-box) NAB metric in TSAD:

results = evaluating(true=true, prediction=prediction)
print(results)

And the output:

Since you don't choose numenta_time and portion, then portion will be 0.1
Standart - -5.5
LowFP - -11.0
LowFN - -3.67
{'Standart': -5.5, 'LowFP': -11.0, 'LowFN': -3.67}

I hope this article was helpful. We are currently developing an open-source library named TSAD (Time Series Analysis for Diagnostics). TSAD’s major goal is to make life easier for researchers who use machine learning (and deep learning) techniques for time series data. This repository contains further information. If you want to learn more about evaluating anomaly detection algorithms with TSAD, you may want to check out this page.

References

This overview refers to the following papers:

  1. Ahmed, Mohiuddin, et al. “An investigation of performance analysis of anomaly detection techniques for big data in scada systems.” EAI Endorsed Trans. Ind. Networks Intell. Syst. 2.3 (2015): e5.
  2. Aminikhanghahi, Samaneh, and Diane J. Cook. “A survey of methods for time series change point detection.” Knowledge and information systems 51.2 (2017): 339–367.
  3. Truong, Charles, Laurent Oudre, and Nicolas Vayatis. “Selective review of offline change point detection methods.” Signal Processing 167 (2020): 107299.
  4. Artemov, Alexey, and Evgeny Burnaev. “Ensembles of detectors for online detection of transient changes.” Eighth International Conference on Machine Vision (ICMV 2015). Vol. 9875. International Society for Optics and Photonics, 2015.
  5. Tatbul, Nesime, et al. “Precision and recall for time series.” arXiv preprint arXiv:1803.03639 (2018).


Iurii Katser

Lead DS | Ph.D. alumnus | Researcher | Lecturer. Time-series analysis, Anomaly detection, Industrial data processing