Time series anomaly detection — in the era of deep learning

Part 3 of 3

Sarah Alnegheimish
Data to AI Lab | MIT
9 min read · Oct 21, 2020


In the previous articles, we looked at the problem of time series anomaly detection and at how we can use GANs to solve it (if you haven’t already, you can read part 1 and part 2). In part 3, we discuss how to evaluate the performance of an anomaly detection pipeline against the ground truth.

How can we tell how good a pipeline is?

Compared to supervised learning, this is a tricky question, but an important one to ask. Unlike in supervised learning, we do not train a pipeline to optimize for finding any “known anomalies”. Rather, the pipeline learns the signal’s patterns and detects anomalies, and only afterwards do we check whether it identified the “known anomalies”.

So where do these “known anomalies” come from?

When evaluating the efficacy of a pipeline, we rely on signals for which we have annotations, i.e. known anomalies. Although we treat these annotations as the “ground truth”, in reality they come from experts who understand the signals, examined them, and classified certain intervals as problems or not.

Dataset Summary

Several signal datasets with corresponding “known anomalies” exist. Here is a quick summary of these datasets:

  • NASA, a spacecraft telemetry signals dataset.
  • Yahoo, consisting of real metrics of Yahoo services as well as synthetic signals.
  • NAB, consisting of signals from various application domains.

While we cannot attest to the process that produced these “known anomalies”, they are nevertheless a good starting point.

Are all "known anomalies" a problem?

In our first post, we identified the reasons why one might be interested in time series anomaly detection. To reiterate, anomaly detection’s sole purpose is to bring anomalies to the attention of a human operator: something may be wrong here. However, not every anomaly is a problem; some can be explained away. At other times, what is problematic to the annotators might not be problematic to you (and may not be to a statistical model either). Therefore, when we claim efficacy of a pipeline based on the datasets above, we have to be careful about declaring it the best possible pipeline for a signal from a real problem, whose characteristics we do not know a priori. Anomalies vs. problematic anomalies, a taxonomy of anomalies, and how to establish ground truth for a new (real world) problem are topics we will cover in a later article.

How do we evaluate a pipeline against a signal with "known anomalies"?

At this point we have the signals with known anomalies and a fitted pipeline (trained on the signal), so how do we go about evaluating efficacy? Perhaps the easiest way is to take each anomalous data point in time, treat it as a positive class, and use the standard binary classification metrics: true positives, false positives, false negatives, and so on (these metrics are reviewed at the end of the article). This is probably the most widely adopted mechanism, but it ignores the end use of time series anomaly detection, and the very fact that it is a time series after all.

Ignoring the notion of time?

Regularly and irregularly sampled signals. The blue series (triangular markers) denotes one time series and the green series (circular markers) denotes another. Both span the same duration, but the green signal is regularly sampled, meaning there is a data point at every time step t, whilst the blue signal is irregularly sampled.

Let’s take the figure above as an example: we have two signals that span the same duration, namely 2 days. One is irregularly sampled (blue) and one is regularly sampled (green), i.e. it has a data point at every time step t. When evaluating the irregular signal point by point, the earlier part (day 1) carries far more weight in the metric than the latter part, since we have 3 samples in the first day vs. 1 in the second day. This problem does not exist for regularly sampled signals. Instead of relying on the number of samples for evaluation, we opt for an assessment with respect to time.
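To make the skew concrete, here is a toy computation with hypothetical timestamps chosen to match the figure (3 samples on day 1, 1 on day 2), contrasting sample-based and time-based weighting:

```python
# Toy numbers matching the figure: an irregularly sampled signal with
# 3 samples on day 1 and only 1 sample on day 2 (timestamps in hours).
import numpy as np

timestamps = np.array([2, 10, 20, 36])        # hours since the start of day 1
day = timestamps // 24                        # which day each sample belongs to

# Sample-based evaluation: day 1 carries 3 of the 4 samples, i.e. 75% of the score.
sample_weight = np.bincount(day) / len(day)   # array([0.75, 0.25])

# Time-based evaluation: each day spans 24 hours, so both carry equal weight.
time_weight = np.array([24, 24]) / 48         # array([0.5, 0.5])

print(sample_weight, time_weight)
```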

Evaluation Approaches

In Orion, we provide comprehensive support for comparing predicted anomalies to ground truth anomalies. We use two main approaches for this objective:

  1. Assess every segment of the detected anomalies against its counterpart in the ground truth; we refer to this approach as weighted segment.
  2. Assess each detected anomaly segment by whether it overlaps with a correct (ground truth) anomaly; we refer to this approach as overlapping segment.

Each of these approaches assesses a different objective, and we describe each one separately. But before introducing them, let’s look at an example of a signal with its ground truth and detected anomalies. In the ground truth we have three anomalies, and within the detected anomalies we have two.

Example signal with its ground truth and detected labels. Notice how one anomaly was perfectly detected, another one was partially detected, and the last one was not detected.

1. Weighted Segment

Weighted segment based evaluation is a strict approach that weighs each segment by its actual time duration. It is valuable when you want to detect the exact extent of the anomaly, without any slack. It first partitions the signal based on the boundaries of the ground truth and detected sequences, then makes a segment-to-segment comparison and records TP/FP/FN/TN accordingly. The overall score is weighted by the duration of each segment. Visually, this operation is summarized by the illustration below. An interesting edge case: when your signal is regularly sampled, this approach is equivalent to a sample-based evaluation.

Weighted segment approach. Each vertical line depicts a partition. Each partitioned segment will be evaluated into a TP/FP/FN/TN based on the comparison of ground truth and detected segments.
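To make the mechanics concrete, here is a minimal from-scratch sketch of the weighted segment idea; it is an illustration rather than Orion’s actual implementation, and the function name and midpoint classification trick are mine:

```python
# A minimal sketch of the weighted segment idea (not Orion's implementation).
# Segments are (start, end) tuples in seconds; `span` is the (start, end) of the signal.

def weighted_segment_counts(ground_truth, detected, span):
    """Return TP/FP/FN/TN durations by partitioning the timeline at every boundary."""
    def covered(t, segments):
        return any(start <= t < end for start, end in segments)

    # Every segment edge (plus the signal span) defines a partition boundary.
    edges = sorted({span[0], span[1],
                    *[t for seg in ground_truth + detected for t in seg]})

    counts = {"TP": 0.0, "FP": 0.0, "FN": 0.0, "TN": 0.0}
    for start, end in zip(edges[:-1], edges[1:]):
        midpoint = (start + end) / 2          # each partition is uniformly in or out
        truth = covered(midpoint, ground_truth)
        pred = covered(midpoint, detected)
        key = {(True, True): "TP", (False, True): "FP",
               (True, False): "FN", (False, False): "TN"}[(truth, pred)]
        counts[key] += end - start            # weight by the partition's duration
    return counts

# Example: a 10-second signal with one true anomaly and a shifted detection.
print(weighted_segment_counts([(2, 6)], [(4, 8)], span=(0, 10)))
# {'TP': 2.0, 'FP': 2.0, 'FN': 2.0, 'TN': 4.0}
```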

2. Overlapping Segment

This is a more lenient evaluation approach. It takes the perspective of rewarding the model if it manages to alert the user to a subset of an anomaly. The idea is that behind an anomaly detector there are domain experts monitoring the signals. If an alarm is raised, the user investigates it. If the model partially identifies the anomaly, the user can still examine the entire anomaly, because the model drew attention to its location; hence the anomaly is considered detected and the model should be rewarded. This approach records: (1) TP, if a ground truth segment overlaps with any detected segment; (2) FN, if a ground truth segment does not overlap any detected segment; (3) FP, if a detected segment does not overlap any labeled anomalous region. This is summarized by the illustration below.

Overlapping segment. Each detected anomaly will be assigned to either TP/FP, and missed ground truth anomalies will be assigned to FN.
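Likewise, a minimal sketch of the overlapping segment counting (again an illustration, not Orion’s implementation) could look like this:

```python
# A minimal sketch of the overlapping segment idea (not Orion's implementation).

def overlapping_segment_counts(ground_truth, detected):
    """Count TP/FN over ground truth segments and FP over detected segments."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    tp = sum(any(overlaps(truth, det) for det in detected) for truth in ground_truth)
    fn = len(ground_truth) - tp
    fp = sum(not any(overlaps(det, truth) for truth in ground_truth) for det in detected)
    return {"TP": tp, "FP": fp, "FN": fn}   # TN is not defined in this approach

# Same example as above: the detection partially overlaps the true anomaly.
print(overlapping_segment_counts([(2, 6)], [(4, 8)]))
# {'TP': 1, 'FP': 0, 'FN': 0}
```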

Each method has its own advantages. Notice that the overlapping segment approach does not account for true negatives (TN) and is invariant to time. You might also be thinking that, under this approach, detecting the entire sequence as one big anomaly will give you a better score, which is true. That’s why we provide multiple approaches for evaluation.

We can then use the definitions of true positive, false positive, false negative, true negative to calculate relevant metrics (see end of the post for definitions).

False positives matter!

You might be inclined to use metrics that focus only on identifying the anomalies, such as just true positives, but is proper identification really the only thing that matters? In our anomaly detection problem, a high number of false alarms (false positives) could overwhelm the monitoring team. Worse, these false alarms could end up burying the true anomalies, increasing the likelihood that real problems go unaddressed. Detecting the correct anomalies while not raising many false alarms both matter; therefore a metric such as the F1 Score gives us a broader understanding of how the pipeline is performing.

Evaluation in Orion

Now you might be wondering: how can I put what was just explained into code? We already handle that for you; we can use Orion’s evaluation subpackage to address this problem.

You can work directly with the code and example from the notebook; simply follow the installation steps in Orion. Alternatively, you can launch Binder to access the notebook directly.

Visually, this example is similar to the one we have previously seen within our explanation.

So how can we use the weighted segment approach and the overlapping segment approach to evaluate? In Orion, we provide a collection of metrics, such as precision, recall, and F1 Score, that support both options for evaluation. We distinguish between the two approaches using the weighted flag within the metric function.
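As a sketch of how this might look in code, the contextual metrics in the evaluation subpackage take the ground truth and detected segments along with the weighted flag; the exact signatures and input formats may differ between Orion versions, so treat this as an assumption and refer to the notebook. The segments below mirror the running example, so the scores should line up with the numbers discussed next.

```python
# Hedged sketch of Orion's evaluation subpackage; check the notebook for real usage.
import pandas as pd
from orion.evaluation import contextual_precision, contextual_recall, contextual_f1_score

# Segments mirroring the example above: three ground truth anomalies,
# two detected (one exact, one partial), measured in seconds.
ground_truth = pd.DataFrame({'start': [2, 10, 16], 'end': [6, 12, 18]})
detected = pd.DataFrame({'start': [2, 10], 'end': [6, 14]})

# weighted=True -> weighted segment approach, weighted=False -> overlapping segment.
for weighted in (True, False):
    print('weighted' if weighted else 'overlapping',
          contextual_precision(ground_truth, detected, weighted=weighted),
          contextual_recall(ground_truth, detected, weighted=weighted),
          contextual_f1_score(ground_truth, detected, weighted=weighted))
```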

Using the weighted segment approach, we can see that we detected the majority of the anomalous intervals. Looking at the problem from a time point of view, we managed to detect (4 + 2) = 6 seconds of true positives, 2 seconds of false positives, and 2 seconds of false negatives. Putting these together into our equations yields precision = 0.75, recall = 0.75, and an F1 Score of 0.75.

Using the overlapping segment approach, the result will be a bit different.

In the overlapping segment approach, we can see that we get higher scores. Since we detected a partial segment of one ground truth anomaly, we record that segment as a true positive. Concretely, in this approach we record 2 true positives and 1 false negative, which yields precision = 1, recall = 0.667, and an F1 Score of 0.8.

End-to-end pipeline evaluation

In the previous post, we saw how to perform anomaly detection end-to-end, but how do we evaluate? We integrate the evaluation suite into the Orion API, so that you can evaluate the pipeline on a dataset (with its labels) end-to-end. You can use orion.evaluate, which supports the following arguments:

  • data, a pandas.DataFrame containing two columns: timestamp and value.
  • ground_truth, a pandas.DataFrame containing two columns: start timestamp and end timestamp of ground truth labels.
  • fit, a flag denoting whether to train the pipeline before evaluating it.
  • train_data, a pandas.DataFrame containing two columns: timestamp and value, used to train the pipeline. If this dataframe is not given, the function will use data to train the pipeline.
  • metrics, a list of metrics used to evaluate the pipeline.

metrics is a list of the names of functions that compare ground truth labels against detected labels and return a metric value. We have already seen functions of that sort, such as contextual_f1_score.
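A minimal end-to-end sketch might look like the following; the pipeline name, signal name, helper loaders, and metric names are illustrative assumptions, so check the Orion documentation and notebook for the options available in your version.

```python
# A minimal end-to-end sketch. The pipeline name, signal name, and metric names
# are illustrative assumptions; check your Orion version for available options.
from orion import Orion
from orion.data import load_signal, load_anomalies

data = load_signal('S-1')             # pandas.DataFrame with timestamp, value
ground_truth = load_anomalies('S-1')  # pandas.DataFrame with start, end

orion = Orion(pipeline='tadgan')      # the GAN-based pipeline from part 2

scores = orion.evaluate(
    data=data,
    ground_truth=ground_truth,
    fit=True,                         # train the pipeline before evaluating
    metrics=['f1'],                   # list of metric (function) names
)
print(scores)
```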

In this post, we looked at evaluating detected anomalies vs. the ground truth, and we saw how to use Orion to evaluate pipelines in an end-to-end fashion. In the upcoming post, we will explore how to compare multiple anomaly detection pipelines from an end-to-end perspective, and we will visit Orion’s benchmarking suite.

Review of well-known terminology in binary classification, for a positive class P and a negative class N:

  • True Positive (TP). We classify a segment as a true positive if the model predicted it as positive (anomaly) and the ground truth is also positive (anomaly).
  • False Positive (FP). If the model predicted it as positive (anomaly) and the ground truth is negative (“normal”).
  • False Negative (FN). If the model predicted it as a negative class (“normal”) and the ground truth is positive (anomaly).
  • True Negative (TN). If both the model and the ground truth suggest that the segment is negative.

Classification metrics:

  • Precision (Pre): the fraction of correctly labeled instances over the total instances labeled as positive. You can think of it as: from the positively classified instances, how many did I get right?

Pre = TP / (TP + FP)

  • Recall (Rec): the fraction of correctly labeled instances over the total number of truly positive instances. You can think of it as: from the truly positive instances, how many of them did I identify?

Rec = TP / (TP + FN)

  • F1 Score: the harmonic mean of precision and recall

F1 Score = 2 · Pre · Rec / (Pre + Rec)
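As a quick sanity check, here is the same arithmetic in code, applied to the weighted segment example from earlier in the post (6 seconds of TP, 2 of FP, 2 of FN):

```python
# The three formulas above as code, checked against the weighted segment
# example from earlier in the post (6 s TP, 2 s FP, 2 s FN).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    pre, rec = precision(tp, fp), recall(tp, fn)
    return 2 * pre * rec / (pre + rec)

print(precision(6, 2), recall(6, 2), f1_score(6, 2, 2))   # 0.75 0.75 0.75
```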
