Deep Semi-Supervised Anomaly Detection

Manmeet Singh
Published in Analytics Vidhya
6 min read · Oct 31, 2020


Anomaly detection (AD) is the task of identifying outliers in a given dataset. Existing shallow supervised and deep unsupervised techniques are limited in either their scalability or their ability to make use of labeled anomalous data.

In this article, we will be discussing deep semi-supervised anomaly detection (Deep SAD) as originally published by Ruff, et al. [1].

Motivation behind Deep SAD

Figure 1. (a) shows the distribution of the training data, along with labels; (b-e) show four existing techniques and the decision contours learned by each model; (f) shows deep semi-supervised anomaly detection (Deep SAD)

At a high level, we can compare the performance of the existing techniques with Deep SAD, and look at the representation learned by each model. The training data consists of i) labeled normal, ii) labeled outlier, and iii) unlabeled points.

Unsupervised AD (b) learns only the distribution of the normal data. This leads to blurry boundaries between normal and anomalous data, and consequently to low-confidence detection of anomalies.

A supervised classifier (SVM) (c) successfully creates a maximal-margin separating hyperplane between the normal and anomalous classes, but would fail to detect anomalies that are out-of-distribution (OOD).

The semi-supervised classifier (d) learns only the labeled anomalous class, and would very likely fail to detect anomalies drawn from distributions it was not trained on.

Semi-supervised LPUE (Learning from Positive and Unlabeled Examples) (e) is an existing method that does not consider labeled anomalous data for training. The shortcoming is evident in the lack of a clear separation between the two distributions: labeled anomalous data (part of the training set) and normal data.

Deep SVDD

An existing unsupervised technique, Deep Support Vector Data Description (Deep SVDD) [2], serves as the motivation for Deep SAD. The objective of this technique is to train a neural network φ to learn a transformation that minimizes the volume of a data-enclosing hypersphere centered on a pre-determined point c.

Equation 1: The Deep SVDD objective, an unsupervised AD framework
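In text form, the Deep SVDD objective from [1] reads (the second term is standard weight decay on the network parameters W, with λ controlling the regularization strength):

```latex
\min_{\mathcal{W}} \;
\frac{1}{n} \sum_{i=1}^{n} \left\| \phi(\mathbf{x}_i; \mathcal{W}) - \mathbf{c} \right\|^2
\;+\; \frac{\lambda}{2} \sum_{\ell=1}^{L} \left\| \mathbf{W}^{\ell} \right\|_F^2
```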

Penalizing the mean squared distance to the center forces the network to extract the common factors of variation that are most stable across the dataset. As a consequence, normal data points get mapped near the hypersphere center, while anomalies are mapped further out.

Minimizing the Deep SVDD objective is equivalent to minimizing the empirical variance of the latent distribution, which is an upper bound on the entropy of a latent Gaussian. In simple terms, this technique minimizes entropy around the point c, within the hypersphere.
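To make the objective concrete, here is a minimal sketch (plain Python, no deep-learning framework, and with the weight-decay term omitted) of the Deep SVDD data term; `embeddings` stands in for the network outputs φ(x; W), and the function name is an illustrative placeholder:

```python
def svdd_loss(embeddings, c):
    """Mean squared distance of embedded points to the fixed center c.

    embeddings: list of latent vectors phi(x; W), each a list of floats.
    c: the pre-determined hypersphere center (list of floats).
    """
    n = len(embeddings)
    total = 0.0
    for z in embeddings:
        # Squared Euclidean distance from the latent point to the center.
        total += sum((zi - ci) ** 2 for zi, ci in zip(z, c))
    return total / n

# Points mapped near c yield a small loss; far points dominate it.
loss = svdd_loss([[0.1, 0.0], [0.0, 0.2]], c=[0.0, 0.0])  # ~0.025
```

Minimizing this over the network weights shrinks the latent point cloud toward c, which is exactly the hypersphere-volume minimization described above.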

Deep SAD

Equation 2: The Deep SAD objective, with a supervised learning expression as the second term. This second term optimizes the network weights based on the labels
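In text form, the Deep SAD objective from [1] reads (the first and last terms match Deep SVDD; the middle term handles the m labeled samples x̃_j with labels ỹ_j):

```latex
\min_{\mathcal{W}} \;
\frac{1}{n+m} \sum_{i=1}^{n} \left\| \phi(\mathbf{x}_i; \mathcal{W}) - \mathbf{c} \right\|^2
\;+\; \frac{\eta}{n+m} \sum_{j=1}^{m}
\left( \left\| \phi(\tilde{\mathbf{x}}_j; \mathcal{W}) - \mathbf{c} \right\|^2 \right)^{\tilde{y}_j}
\;+\; \frac{\lambda}{2} \sum_{\ell=1}^{L} \left\| \mathbf{W}^{\ell} \right\|_F^2
```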

This equation is the same as Deep SVDD, except for the second expression.

In the second expression, η (eta) is a hyperparameter controlling the emphasis placed on labeled versus unlabeled data.

The parameter m is the number of labeled samples, and ỹ (y-tilde) takes the value -1 or +1 depending on whether a sample belongs to the anomalous or the normal distribution, respectively.

We can summarize the Deep SAD objective as modeling the latent distribution of normal data to have low entropy, and the latent distribution of anomalies to have high entropy. As discussed in the previous section, Deep SAD pulls the representations of normal samples closer to c, and pushes those of anomalies further out.
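The labeled term can be sketched the same way as before. The trick is the exponent ỹ: for a labeled normal (ỹ = +1) the squared distance itself is penalized, while for a labeled anomaly (ỹ = -1) its reciprocal is penalized, so minimizing the loss pushes anomalies away from c. This is a minimal plain-Python sketch on precomputed latent vectors, with the weight-decay term omitted and all names illustrative:

```python
def sad_loss(unlabeled, labeled, c, eta=1.0):
    """Deep SAD-style loss on latent vectors (weight decay omitted).

    unlabeled: list of latent vectors for unlabeled samples.
    labeled:   list of (latent vector, y) pairs, y = +1 (normal)
               or y = -1 (anomalous).
    c: fixed hypersphere center; eta: labeled-vs-unlabeled weight.
    """
    def sqdist(z):
        return sum((zi - ci) ** 2 for zi, ci in zip(z, c))

    n, m = len(unlabeled), len(labeled)
    loss = sum(sqdist(z) for z in unlabeled) / (n + m)
    # (d^2) ** y: y = +1 pulls the point toward c; y = -1 penalizes
    # 1/d^2, i.e. minimizing the loss maximizes an anomaly's distance.
    loss += eta / (n + m) * sum(sqdist(z) ** y for z, y in labeled)
    return loss
```

For example, with one unlabeled point at distance 1 from c and one labeled anomaly at distance 2, the loss is 1/2 + (1/2)·(1/4) = 0.625; moving the anomaly further out reduces it.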

Experiments

Anomaly detection tasks are performed on 3 datasets: MNIST, Fashion-MNIST and CIFAR-10.

For the multi-class datasets, one class is chosen as the normal distribution, while the others are deemed anomalous.
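This one-vs-rest relabeling can be sketched as follows (the function name is an illustrative placeholder):

```python
def to_anomaly_labels(labels, normal_class):
    """Relabel a multi-class dataset for one-vs-rest anomaly detection:
    the chosen class becomes normal (+1), every other class anomalous (-1)."""
    return [1 if y == normal_class else -1 for y in labels]

# E.g. with MNIST digit 0 chosen as the normal class:
relabeled = to_anomaly_labels([0, 3, 0, 7], normal_class=0)  # [1, -1, 1, -1]
```

Repeating the experiment with each class taking a turn as the normal class gives ten AD tasks per dataset.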

We discuss 3 distinct scenarios below for performance comparison between various techniques.

Scenario 1: Adding labeled anomalies

In this scenario, the effect of the ratio of labeled anomalies on various techniques is examined. The labeled anomalies are sampled from only one of the nine anomaly classes; that is, out of the ten classes, one serves as the normal distribution and one provides the labeled anomalies. For the unlabeled part of the training set, only normal-class samples are provided (the key difference between scenarios 1 and 2). Although only one class is represented in the labeled data, all nine anomaly classes are used for testing. The ratio of the one labeled anomalous class is increased up to 20%, as shown on the x-axis of Fig. 2.
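A sketch of this scenario-1 training-set construction might look like the following (the function and its signature are illustrative, not taken from the authors' code):

```python
import random

def build_scenario1_trainset(data, normal_class, anomaly_class,
                             labeled_ratio, seed=0):
    """Scenario-1 setup sketch: the unlabeled pool contains only the
    normal class; labeled anomalies come from a single anomaly class.

    data: list of (x, original_class) pairs.
    Returns (unlabeled_x, labeled), where labeled is a list of (x, -1).
    """
    rng = random.Random(seed)
    normal = [x for x, y in data if y == normal_class]
    anomalies = [x for x, y in data if y == anomaly_class]
    # Number of labeled anomalies as a fraction of the normal pool.
    n_labeled = int(labeled_ratio * len(normal))
    labeled = [(x, -1) for x in rng.sample(anomalies, n_labeled)]
    return normal, labeled
```

At test time, in contrast, anomalies from all nine anomaly classes are presented, so the model must generalize beyond the single labeled class.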

Figure 2. Ratio of labeled anomalies in the training set (x-axis)

Results: This test demonstrates the benefit of Deep SAD (pink color) as a semi-supervised technique for AD, especially on the more complex CIFAR-10 dataset where Deep SAD performs the best. Overall, Deep SAD generalizes better as more labeled anomalies are presented for training.

Scenario 2: Polluted training data

In this scenario, the robustness of different methods to an increased number of unlabeled anomalies in the training set is examined. To do this, the number of unlabeled anomalies (pollution) in the training data is increased, as compared with scenario 1. Labeled training samples are fixed at 5% of the total training data. This scenario is expected to favor unsupervised learning over supervised learning due to the noise in the training data.

Figure 3. Ratio of unlabeled to labeled training data (x-axis)

Figure 3 shows the results for a varying ratio of pollution (unlabeled anomalies) in the training data.

Results: The performance of all methods decreases with increasing data pollution. Learning from labeled anomalies in a semi-supervised AD approach alleviates the negative impact that pollution has on detection performance, because anomalies similar to the unknown ones in the unlabeled data may have been encountered during training. Deep SAD proves to be the most robust.

Scenario 3: Number of known anomaly classes

In this scenario, the performance of the various algorithms is compared across different numbers of known anomaly classes. In scenarios 1 and 2, only one labeled anomalous class (out of nine) was provided as part of the training data. Here, we examine the effect of increasing that number.

Figure 4. Number of known anomaly classes

Results: As the number of anomaly classes increases, Deep SAD performs better overall than other techniques. The more diverse the labeled anomalies in the training set, the better the detection performance becomes.

Conclusion and Future Work

Deep SAD is a generalization of the unsupervised Deep SVDD method. The goal of Deep SAD is to extend the Deep SVDD approach to train with labeled anomalies. In doing so, it is better able to anticipate anomalies that may be sampled from various distributions.

The results discussed in the experiments section suggest that general semi-supervised anomaly detection should be preferred whenever some labeled information on the anomalous distribution is available.

Potential future work includes more rigorous theoretical analysis, for example studying deep anomaly detection under the rate-distortion curve.

References

[1] Lukas Ruff, et al. "Deep Semi-Supervised Anomaly Detection." In International Conference on Learning Representations (ICLR), 2020.

[2] Lukas Ruff, Robert A. Vandermeulen, Nico Görnitz, Lucas Deecke, Shoaib A. Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. "Deep One-Class Classification." In ICML, volume 80, pp. 4390–4399, 2018.

Reference paper: https://arxiv.org/abs/1906.02694

Code for Experiments section: https://github.com/lukasruff/Deep-SAD-PyTorch
