Time series anomaly detection — in the era of deep learning

Part 2 of 3

Aug 28, 2020

by Sarah Alnegheimish

In the previous post, we looked at time series data and anomalies. (If you haven’t done so already, you can read the article here.) In part 2, we will discuss time series reconstruction using generative adversarial networks (GAN)¹ and how reconstructing time series can be used for anomaly detection².

Time Series Anomaly Detection using Generative Adversarial Networks

Before we introduce our approach for anomaly detection (AD), let's discuss one of today's most interesting and popular models for deep learning: generative adversarial networks (GAN). The idea behind a GAN is that a generator (G), usually a neural network, attempts to construct a fake image from random noise in order to fool a discriminator (D), also a neural network, whose job is to distinguish "fake" examples from "real" ones. The two networks compete to be best at their respective jobs. How powerful is this approach? Well, the figure below depicts some fake images generated by a GAN.

Karras, Tero, Samuli Laine, and Timo Aila. “A style-based generator architecture for generative adversarial networks.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. 2019.

In this project, we leverage the same approach for time series. We adopt a GAN structure to learn the patterns of signals from an observed set of data and train the generator “G”. We then use “G” to reconstruct time series data, and calculate the error by finding the discrepancies between the real and reconstructed signal. We then use this error to identify anomalies. You can read more about time series anomaly detection using GAN in our paper.

Enough talking — let’s look at some data.

Tutorial

In this tutorial, we will use a Python library called Orion to perform anomaly detection. After following the installation instructions available on GitHub, we can get started and run the notebook. Alternatively, you can launch Binder to access the notebook directly.

Load Data

In this tutorial, we continue examining the NYC taxi data maintained by Numenta. Their repository, available here, is full of AD approaches and labeled datasets, each organized as a series of timestamps and corresponding values. Each timestamp corresponds to the time of observation in Unix Time Format.

To load the data, simply pass the signal name into the load_signal function. (If you are loading your own data, pass the file path.)
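As a minimal sketch, this is roughly what that looks like, using the signal name from the Orion tutorials; load_anomalies, which fetches the labeled ground truth we plot against below, follows the same convention:

    from orion.data import load_signal, load_anomalies

    # load the NYC taxi signal as a table of timestamps and values
    df = load_signal('nyc_taxi')

    # load the ground-truth labels used for plotting later on
    known_anomalies = load_anomalies('nyc_taxi')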

Though tables are powerful data structures, it's hard to visualize time series through numerical values alone. So, let's go ahead and plot the data using plot(df, known_anomalies).

As we saw in the previous post, this data spans almost 7 months between 2014 and 2015. It contains five anomalies: NYC Marathon, Thanksgiving, Christmas, New Year’s Eve, and a major snow storm.

The central question of this post is: Can GANs be used to detect these anomalies? To answer this question, we have developed a time series anomaly detection pipeline using TadGAN, which is readily available in Orion. To use the model, pass the pipeline JSON name or path to the Orion API.

The Orion API is a simple interface that allows you to interact with anomaly detection pipelines. To train the model on the data, we simply use the fit method; to do anomaly detection, we use the detect method. In our case, we wanted to fit the data and then perform detection; therefore we used the fit_detect method. This might take some time to run. Once it’s done, we can visualize the results using plot(df, [anomalies, known_anomalies]).
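A minimal sketch of that workflow follows; depending on the Orion version, the pipeline can be referenced by name ('tadgan') or by the path to its JSON file, and plot here is the helper function from the tutorial notebook:

    from orion import Orion

    # instantiate the API with the TadGAN pipeline, then fit and detect in one call
    orion = Orion(pipeline='tadgan')
    anomalies = orion.fit_detect(df)

    # overlay the detected anomalies on top of the known ones
    plot(df, [anomalies, known_anomalies])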

Detected anomalies (red) vs. ground truth (green)

The red intervals depict detected anomalies, with green intervals showing ground truth. The model was able to detect 4 out of 5 anomalies. We also see that it detected some other intervals that were not included in the ground truth labels.

Although we jumped straight to the results, let’s backtrack and look at what the pipeline actually did.

Under the hood

The pipeline performs a series of transformations on the data, including preprocessing, model training, and post-processing, to obtain the result you have just seen. These functions, which we refer to as primitives, are specified within the model's JSON file. More specifically, if we look at the TadGAN pipeline, we find these primitives applied sequentially to the data:
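The exact primitive names differ slightly across Orion releases, but the "primitives" list in the TadGAN pipeline JSON looks roughly like this:

    "primitives": [
        "mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate",
        "sklearn.impute.SimpleImputer",
        "sklearn.preprocessing.MinMaxScaler",
        "mlprimitives.custom.timeseries_preprocessing.rolling_window_sequences",
        "orion.primitives.tadgan.TadGAN",
        "orion.primitives.tadgan.score_anomalies",
        "orion.primitives.timeseries_anomalies.find_anomalies"
    ]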

Each primitive is responsible for a single task; each one is described over the course of this tutorial.

Preprocessing

Before we can use the data, we need to preprocess it. The preprocessing primitives are:

  • time_segments_aggregate divides the signal into intervals and applies an aggregation function — producing an equally spaced, aggregated version of the time series.
  • SimpleImputer imputes missing values with a specified value.
  • MinMaxScaler scales the values between a specified range.
  • rolling_window_sequences divides the original time series into signal segments.

Prepare Data — First, we make the signal equally spaced in time. Second, we impute missing values using the mean. Third, we scale the data to the range [-1, 1].

If we go back to the source of the NYC Taxi data, we find that it records a value every 30 minutes. Since timestamps are defined in seconds, we set the interval to 1800. We also opt for the default aggregation method, which in this case is taking the mean value of each interval, and we impute missing values with the mean. In this specific example, we could safely remove the time_segments_aggregate and impute primitives, since the data is already equally spaced and does not contain missing values (of course, not all data is this pristine). Next, we scale the data to [-1, 1] so that it is properly normalized for modeling.
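To make these steps concrete, here is a plain pandas/scikit-learn sketch of what the three primitives do. The pipeline runs the primitives for you; this is only for illustration and assumes df has timestamp and value columns:

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    # aggregate into equally spaced 30-minute (1800-second) intervals using the mean
    ts = df.set_index(pd.to_datetime(df['timestamp'], unit='s'))['value']
    ts = ts.resample('30T').mean()

    # impute any missing values with the mean of the signal
    ts = ts.fillna(ts.mean())

    # scale the values to the range [-1, 1]
    scaler = MinMaxScaler(feature_range=(-1, 1))
    y = scaler.fit_transform(ts.values.reshape(-1, 1))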

After this, we need to prepare the input for training the TadGAN model. To obtain the training samples, we introduce a sliding window to divide the original time series into signal segments. The following illustration depicts this idea.

Generating training examples using sliding window

Here, X represents the input used to train the model. It is an np.array of shape (number of training examples, window_size). In our case, X has 10222 training examples, and 100 is the window_size. Using plot_rws(X, k=4) we can visualize X.

Segmented signal with respect to window_size. For visualization purposes, we only show k=4 windows; for example, between window 150 and window 225 there are 75 other windows, since we used a step_size of 1. (Notice that window 0 and window 1 look almost identical, differing by only one datapoint, indicated by the green arrow.)

This makes the input ready for our machine learning model.
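Conceptually, the rolling_window_sequences primitive does something like the following numpy sketch, where y is the scaled signal from the preprocessing sketch above (the real primitive also returns target sequences and index arrays):

    import numpy as np

    def make_windows(y, window_size=100, step_size=1):
        """Slide a window over the signal and stack the segments into rows of X."""
        starts = range(0, len(y) - window_size + 1, step_size)
        return np.asarray([y[i:i + window_size] for i in starts])

    X = make_windows(y.flatten(), window_size=100, step_size=1)
    X.shape   # (number of training examples, window_size)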

Modeling

Orion provides a suite of ML models that can be used for anomaly detection, such as ARIMA, LSTM, GAN, and more.

In this tutorial, we will focus on using GAN. In case you are not familiar with GANs, there are many tutorials that help you implement one using different packages, such as TensorFlow or PyTorch.

To select a model of interest, we specify its primitive within the pipeline. To use the GAN model, we will be using the primitive:

  • TadGAN trains a custom time series GAN model.

Training — The core idea of a reconstruction-based anomaly detection method is to learn a model that can generate (construct) a signal with similar patterns to what it has seen previously.

GAN training for signal reconstruction. X represents the real dimension and Z represents the latent dimension.

The general training procedure of GANs is based on the idea that we want to reconstruct the signal as well as possible. To do this, we learn two mapping functions: an encoder (E) that maps the signal to its latent representation, "z", and a generator (G) that recovers the signal from the latent variable. The discriminator (Dx) measures the realness of the signal. Additionally, we introduce a second discriminator (Dz) to distinguish between random latent samples "z" and encoded samples E(x). The intention behind Dz is to force E to encode features into a representation that is as close to white noise as possible. This acts as a way to regularize the encoder E and avoid overfitting. The intuition behind using GANs for time series anomaly detection is that an effective model should not be able to reconstruct anomalies as well as "normal" instances.

To use the TadGAN model, we specify a number of parameters including model layers (structure of the previously mentioned neural networks). We also specify the input dimensions, the number of epochs, the learning rate, etc. All the parameters are listed below.
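The values below are illustrative rather than the exact defaults; parameter names can vary slightly between Orion versions, and the layer definitions are omitted for brevity:

    # illustrative TadGAN hyperparameters (layer definitions omitted)
    tadgan_hyperparameters = {
        'input_shape': (100, 1),    # window_size x number of channels
        'latent_dim': 20,           # size of the latent representation z
        'epochs': 35,               # number of training epochs
        'batch_size': 64,
        'learning_rate': 0.0005,
    }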

It might take a bit of time for the model to train.

Reconstruction — After the GAN finishes training, we use the trained encoder (E) and generator (G) to reconstruct the signal.

Reconstructing time series using the GAN architecture

We pass the segment of the signal (same as the window) to the encoder and transform it into its latent representation, which then gets passed into the generator for reconstruction. We call the output of this process the reconstructed signal. We can summarize it for a segment s as: s → E(s) → G(E(s)) ≈ ŝ. When s is normal, s and ŝ should be close. On the other hand, if s is anomalous, then s and ŝ should deviate.

The process above reconstructs one segment (window). We can get all the reconstructed segments by using the predict method in our API — X_hat, critic = tgan.predict(X). We can use plot_rws(X_hat, k=4) to view the result.

Reconstructed windows. The reconstructed windows overlap in regions depending on the window_size and step_size.

Per the figure above, we notice that a reconstructed datapoint may appear in multiple windows, based on the step_size and window_size that we chose in the preprocessing step. To get the final value of a datapoint for a particular time point, we aggregate its multiple reconstructed values, which gives a single value for each timestamp and thus a fully reconstructed version of the original signal in df.

Each timestamp will have multiple values based on window_size and step_size.

To reassemble or “unroll” the signal, we can choose among different aggregation methods. In our implementation, we use the median value.
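As a rough sketch of this unrolling, assuming 2-D windows and a step_size of 1 (the unroll_ts helper used below takes care of this for you):

    import numpy as np

    def unroll(X_hat, step_size=1):
        """Median-aggregate overlapping reconstructed windows into a single signal."""
        num_windows, window_size = X_hat.shape
        length = (num_windows - 1) * step_size + window_size
        values = [[] for _ in range(length)]
        for i, window in enumerate(X_hat):
            for j, value in enumerate(window):
                values[i * step_size + j].append(value)
        return np.array([np.median(v) for v in values])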

We can then use y_hat = unroll_ts(X_hat) to flatten the reconstructed samples X_hat and plot([y, y_hat], labels=['original', 'reconstructed']) for visualization.

Reconstructed signal using GAN overlaid on top of the original signal

We can see that the GAN model did well in reconstructing the signal. We can also see what the model expected the signal to look like, compared to what it actually is.

Post-processing

The next step in the pipeline is post-processing, which includes calculating an error and then using it to locate the anomalies. The primitives we will use are:

  • score_anomalies calculates the error between the real and reconstructed signals; this primitive is specific to the GAN model.
  • find_anomalies identifies anomalous intervals based on the error obtained.

Error Scores — We use the discrepancies between the original signal and the reconstructed signal as the reconstruction error score. There are many methods to calculate this error, such as point and area difference.

Point difference between original and reconstructed signal

Analyzing the data, we noticed a large deviation between the two signals, present in some regions more than others. For a more robust measure, we use dynamic time warping (DTW) to account for signal delays and noise. This is the default approach for error calculation in the score_anomalies method, but it can be overridden using the rec_error_type parameter.

During the training process, the discriminator has to distinguish between real input sequences and constructed ones; thus, we refer to its output as the critic score. Come to think of it, this score is also relevant for distinguishing anomalous sequences from normal ones, since we assume that anomalies will not be reconstructed well. score_anomalies leverages this critic score by first smoothing it through kernel density estimation (KDE) on the collection of critics and then taking the maximum value as the smoothed value. The final error score combines the reconstruction error and the critic score.
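A much-simplified version of that combination might look like the sketch below, where rec_errors and critic_scores are placeholder arrays standing in for the per-timestamp reconstruction error and smoothed critic score; the actual primitive additionally smooths the scores and supports other ways of combining them:

    import numpy as np
    from scipy import stats

    rec_errors = np.random.rand(10000)      # placeholder: per-timestamp reconstruction error
    critic_scores = np.random.rand(10000)   # placeholder: per-timestamp smoothed critic score

    # standardize both scores, then combine them multiplicatively
    rec_z = stats.zscore(rec_errors)
    critic_z = stats.zscore(critic_scores)
    error = rec_z * critic_z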

Error score using reconstruction and critic score

Now we can visually see where the error reaches a substantially high value. But how should we decide whether an error value indicates a potential anomaly? We could use a fixed threshold that says if error > 10, then the datapoint should be classified as anomalous.

Detected anomalies (red) vs. ground truth (green), threshold = 10

While a fixed threshold caught two correct anomalies, it missed the other three. If we look back at the error plot, we notice that some deviations are abnormal only within their local regions. So, how can we incorporate this information into our thresholding technique? We can use window-based methods to detect anomalies in context.

We first define the window of errors that we want to analyze. We then find the anomalous sequences in that window by looking at the mean and standard deviation of the errors. For errors that fall far from the mean (such as four standard deviations away), we classify their indices as anomalous. We store the start/stop index pairs that correspond to each anomalous sequence, along with its score. We then move the window and repeat the procedure.
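A simplified sketch of this idea follows; window_size, step, and the four-standard-deviation cutoff are illustrative, and the actual find_anomalies primitive additionally merges consecutive flagged indices into start/stop intervals and assigns each a severity score:

    import numpy as np

    def flag_anomalous_indices(errors, window_size=2000, step=200, k=4):
        """Flag indices of a numpy array of error scores whose value exceeds
        the local window mean by more than k standard deviations."""
        flagged = set()
        for start in range(0, len(errors), step):
            window = errors[start:start + window_size]
            if len(window) == 0:
                break
            mean, std = window.mean(), window.std()
            for i, e in enumerate(window, start=start):
                if e > mean + k * std:
                    flagged.add(i)
        return sorted(flagged)   # consecutive indices form the anomalous intervals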

Detected anomalies (red) vs. ground truth (green)

We now have results similar to those we saw previously. The red intervals depict the detected anomalies, and the green intervals show the ground truth. 4 out of 5 anomalies were detected. We also see that the pipeline detected some other intervals that were not included in the ground truth labels.

Orion API

Using the Orion API and pipelines, we simplified this process while still allowing flexibility in pipeline configuration.

How to configure a pipeline?

Once primitives are stitched together, we can identify anomalous intervals in a seamless manner. This serial process is easy to configure in Orion.

To configure a pipeline, we adjust the parameters of the primitive of interest within the pipeline.json file or directly by passing the dictionary to the API.

In the following example, I changed the aggregation level as well as the number of epochs for training. These changes override the parameters specified in the JSON file. To learn more about API usage and primitive design, please refer to the documentation. How to set the model and change hyperparameter values is explained in the mlprimitives library; you can refer to its documentation here.
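As a sketch, the override looks roughly like this; the primitive keys follow the naming inside the pipeline JSON and may differ slightly between versions, and the epoch count is deliberately small just to keep the run short:

    from orion import Orion

    hyperparameters = {
        'mlprimitives.custom.timeseries_preprocessing.time_segments_aggregate#1': {
            'interval': 1800,   # aggregation level: 30-minute buckets
        },
        'orion.primitives.tadgan.TadGAN#1': {
            'epochs': 5,        # fewer epochs for a quick demonstration
        },
    }

    orion = Orion(pipeline='tadgan', hyperparameters=hyperparameters)
    anomalies = orion.fit_detect(df)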

Now anomalies holds the detected anomalies.

In this tutorial, we looked at using time series reconstruction to detect anomalies. In the next post (part 3), we will explore evaluating pipelines and how to measure the performance of a pipeline against the ground truth. We will also look at comparing multiple anomaly detection pipelines from an end-to-end perspective.

  1. In addition to the vanilla GAN, we also introduce other neural networks, including an encoding network to reduce the feature space and a secondary discriminator.
  2. This tutorial walks through the different steps taken to perform anomaly detection using the TadGAN model. The particulars of TadGAN and how it was architected will be detailed in another post.
