Anomaly Detection Workshop

Anomaly Detection: An Overview

Introduction by Willey Rempel

Yenson Lau
Aggregate Intellect

--


The posts in this series are the collective work of the participants of the Anomaly Detection workshop organized by Aggregate Intellect. This post serves as a proof of work and covers some of the concepts discussed in the workshop, in addition to advanced concepts pursued by students.

Contributors: Alec Robinson, Eric Djona Fegnem, Harriet Yu, Lindsey Peng, Michael Enquist, Peter Bacalso, Ramya Balasubramaniam, Samantha Cassar, Victor Reyes, Willy Rempel, Yenson Lau

Editors: Chris Bobotsis, Susan Shu Chang

An anomaly is, by definition, something that is outside the norm or what is expected. For data this can mean rare individual outliers or distinct clusters. Anomaly detection is an important capability with broad applicability in many domains, such as medical diagnostics or the detection of intrusions, fraud, or false information. All three categories of model training are used for anomalous data: supervised, semi-supervised, and unsupervised. Typically, the first go-to methods are statistical and classical machine learning techniques, which can be low cost and yet very effective. When the data is more complex, deep learning approaches are required to extract the relevant latent features.

Basic statistics of the data distribution, such as the standard deviation, can be used immediately to identify outliers (e.g. anything further than 3σ from the mean). Boxplots visually show possible anomalies beyond the ends of the plot whiskers. More advanced methods such as Gaussian mixture models (GMM) provide a generative model that can discriminate outliers from more complicated distributions. These methods assume a parametric distribution. Non-parametric techniques include the histogram-based outlier score (HBOS) and distance-based (clustering) techniques such as k-nearest neighbours (k-NN) and the local outlier factor (LOF).
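As a quick illustration, here is a minimal sketch (not from the workshop; the data and thresholds are made up) showing the 3σ rule and GMM log-likelihood scoring with NumPy and scikit-learn:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 1000), [8.0, -7.5]])  # mostly normal data plus two outliers

# 3-sigma rule: flag anything further than 3 standard deviations from the mean
z = np.abs(x - x.mean()) / x.std()
print("3-sigma outliers:", x[z > 3])

# GMM: low log-likelihood under the fitted mixture suggests an outlier
gmm = GaussianMixture(n_components=2, random_state=0).fit(x.reshape(-1, 1))
log_lik = gmm.score_samples(x.reshape(-1, 1))   # per-sample log-likelihood
threshold = np.percentile(log_lik, 1)           # e.g. flag the lowest 1%
print("GMM outliers:", x[log_lik < threshold])
```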

The HBOS is produced by taking the product of the inverse histogram values for each feature of a sample. It is fast, at the cost of some precision, and is better suited to detecting global outliers than local ones. There are several options for using k-NN, starting with a simple 1-NN, where we score based on the distance to the closest neighbour. We could also score based on the average distance to the k nearest neighbours. LOF elaborates on k-NN by using the neighbourhood density of samples. First, the local reachability density (LRD) of a sample x, the inverse of its average reachability distance to its neighbours, is computed as

LRDₖ(x) = |Nₖ(x)| / Σ_{o ∈ Nₖ(x)} reach-distₖ(x, o)

The LRD is computed for all k neighbours o ∈ Nₖ(x) of x and used in the final LOF score, which is the ratio of the average density of the k neighbours to the density of x; values well above 1 indicate that x lies in a sparser region than its neighbours.
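Both ideas are available off the shelf in scikit-learn. A minimal sketch, with illustrative parameter choices, of a k-NN distance score and the LOF score:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[6.0, 6.0]]])   # one obvious outlier

# k-NN score: average distance to the k nearest neighbours (drop the self-distance)
k = 5
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
knn_score = dist[:, 1:].mean(axis=1)

# LOF: compares each point's local density to that of its neighbours
lof = LocalOutlierFactor(n_neighbors=k)
labels = lof.fit_predict(X)                     # -1 marks outliers
lof_score = -lof.negative_outlier_factor_       # larger => more anomalous
print(X[labels == -1], knn_score.max(), lof_score.max())
```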

Other clustering techniques used for anomalies are:

  • k-means
  • hierarchical clustering
  • density-based spatial clustering of applications with noise (DBSCAN), which treats points assigned to no cluster as anomalies (see the sketch after this list)
  • cross interaction based outlier score (XBOS)
  • cluster-based local outlier factor (CBLOF, uCBLOF), a clustered variant of LOF.
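Of these, DBSCAN gives anomaly labels almost for free, since points that belong to no cluster are marked as noise. A minimal sketch, with illustrative eps and min_samples values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (100, 2)),
               rng.normal(5, 0.3, (100, 2)),
               [[2.5, 2.5]]])                   # a point far from both clusters

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(X[labels == -1])                          # noise points (label -1) treated as anomalies
```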

Isolation forests (iForest) are an ensemble of isolation trees in which instances are isolated based on random selection of features and feature values for decision splits. Anomalies are more susceptible to isolation and will have shorter average path lengths (averaged over all the decision trees). As the name suggests, a one-class SVM (support vector machine) creates a decision boundary around only one class: the set of all normal data. Any anomaly will lie outside the boundary. The advantage of one-class modelling is that we do not need to know about and account for every possible anomaly. Instead, we model as accurately as possible what ‘normal’ means for the data. Additionally, we can use semi-supervised techniques such as data shuffling to generate fake/shuffled samples and better learn the decision boundary.
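A minimal sketch of both one-class approaches using scikit-learn; the contamination and nu values are illustrative knobs, not recommendations from the workshop:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, (500, 2))            # "normal" data only
X_test = np.vstack([rng.normal(0, 1, (5, 2)), [[5.0, 5.0]]])

# Isolation forest: anomalies are isolated after fewer random splits
iforest = IsolationForest(contamination=0.01, random_state=0).fit(X_train)
print(iforest.predict(X_test))                  # -1 => anomaly, 1 => normal

# One-class SVM: learns a boundary around the normal class only
ocsvm = OneClassSVM(nu=0.01, kernel="rbf", gamma="scale").fit(X_train)
print(ocsvm.predict(X_test))
```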

One-class classification also works very well for deep learning models. Autoencoders (AE) trained on normal data will produce high reconstruction errors for anomalous instances. The same is true for any generative model that learns the distribution of normal data. Variational autoencoders (VAE) can also use the KL-divergence term of an instance to indicate how unlikely the data is under the learned distribution. Adversarial autoencoders (AAE) add a discriminator to a VAE that distinguishes the instance's latent embedding from samples of a previously selected, task-specific prior distribution.
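A minimal PyTorch sketch of the reconstruction-error idea: train a small autoencoder on normal data only, then score new samples by how badly they are reconstructed. The architecture and training loop are placeholders:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x_normal = torch.randn(512, 20)                 # stand-in for normal training data

model = nn.Sequential(                          # tiny encoder-decoder
    nn.Linear(20, 8), nn.ReLU(),
    nn.Linear(8, 20),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):                            # train on normal data only
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x_normal), x_normal)
    loss.backward()
    opt.step()

def anomaly_score(x):
    """Per-sample reconstruction error; high values suggest anomalies."""
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=1)

print(anomaly_score(x_normal[:5]))                 # low scores for normal-like data
print(anomaly_score(torch.randn(5, 20) * 5 + 10))  # shifted data => high scores
```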

GANs are another obvious choice, where the discriminator from a trained GAN is used to detect outliers. Conditional GANs refine this by providing conditional class labels to both the generator and the discriminator. This is preferable as it favours the generator in the game, resulting in better discriminators. The first example of a conditional GAN is a repurposed VAE: the VAE is used as the generator, and the condition is the input data itself. Both the actual and reconstructed data are sent to a discriminator. The generator is still used during inference, as any anomalous data will be poorly reconstructed, increasing the likelihood of a true positive. Adversarial Dual Autoencoders (ADA) take this a step further and use another VAE as the discriminator. GANomaly combines a VAE with the encoder half of another VAE, plus a separate discriminator. This leaves us with three scores to work with (a sketch of how they can be combined follows the list):

  1. L_enc = ∥z − ẑ∥₁ : the loss between the two latent embeddings produced by the generator VAE and the second encoder
  2. L_con = ∥x − x̂∥₁ : the normal VAE reconstruction loss
  3. L_adv = ∥f(x) − f(x̂)∥₂ : a real/fake score from the discriminator, computed as a feature-matching loss where f(·) denotes the discriminator's intermediate features
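A minimal sketch of how these three terms might be combined into a single anomaly score; the weights and all input tensors are placeholders, not values from the GANomaly paper:

```python
import torch

def ganomaly_score(z, z_hat, x, x_hat, f_x, f_x_hat, w_enc=1.0, w_con=1.0, w_adv=1.0):
    """z, z_hat: latent codes from the generator and the second encoder;
    x, x_hat: input and reconstruction; f_x, f_x_hat: discriminator features."""
    l_enc = torch.norm(z - z_hat, p=1, dim=1)       # latent-embedding loss
    l_con = torch.norm(x - x_hat, p=1, dim=1)       # reconstruction loss
    l_adv = torch.norm(f_x - f_x_hat, p=2, dim=1)   # feature-matching loss
    return w_enc * l_enc + w_con * l_con + w_adv * l_adv

# Toy usage with random tensors standing in for real model outputs
b = 4
print(ganomaly_score(torch.randn(b, 16), torch.randn(b, 16),
                     torch.randn(b, 64), torch.randn(b, 64),
                     torch.randn(b, 32), torch.randn(b, 32)))
```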

Deep learning has the additional benefit of being able to work with sequential data using auto-regressive models and RNNs. Such models assign probabilities to sequence elements and can spot single-element or whole-sequence anomalies. Lastly, deep hybrid models combine deep learning with the prior techniques. For example, the latent code book of an autoencoder can be extracted and k-NN applied to cluster the codes.
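A minimal sketch of that hybrid idea: a stand-in encoder maps data to latent codes, and k-NN distances in latent space serve as the anomaly score. The encoder here is untrained and purely illustrative:

```python
import torch
import torch.nn as nn
from sklearn.neighbors import NearestNeighbors

torch.manual_seed(0)
encoder = nn.Sequential(nn.Linear(20, 8), nn.ReLU())   # stand-in (untrained) encoder
x_normal = torch.randn(512, 20)
x_query = torch.randn(5, 20) * 4 + 6                   # shifted, likely anomalous

with torch.no_grad():
    codes_train = encoder(x_normal).numpy()
    codes_query = encoder(x_query).numpy()

nn_index = NearestNeighbors(n_neighbors=5).fit(codes_train)
dist, _ = nn_index.kneighbors(codes_query)
print(dist.mean(axis=1))        # average latent-space distance as the anomaly score
```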

In this workshop series, we discuss advanced techniques presented in three papers:

  1. The first paper, Temporal Cycle-Consistency Learning, uses the cycle-consistency concept to align sequential data streams with some form of natural correspondence between sequences, e.g. videos of water being poured from a pitcher.
    (Reviewed by Ramya Balasubramaniam, Yenson Lau, and Alec Robinson.)
  2. The second paper covered is GLAD: GLocalized Anomaly Detection via Active Feature Space Suppression. A neural network is used to modulate the output of LODA, an ensemble method for anomaly detection. Additionally, active learning is used in the training loop.
    (Reviewed by Victor Reyes, Lindsey Peng, Michael Enquist, and Harriet Yu.)
  3. Lastly, we investigate Unsupervised Anomaly Detection with Generative Adversarial Networks. A one-class technique is applied to medical images using AnoGAN, a generative adversarial network that learns the distribution of healthy tissue images.
    (Reviewed by Eric Djona Fegnem, Peter Bacalso, and Samantha Cassar.)
