Image Anomaly Detection / Novelty Detection Using Convolutional Autoencoders in Keras & TensorFlow 2.0

Jude Wells
7 min read · Jan 29, 2020


In many computer vision systems the goal is to detect when something out of the ordinary has occurred: the anomaly. Often, we do not know in advance what the anomalous image will look like, and it may be impossible to obtain image data that represents all of the anomalies we wish to detect. This lack of suitable data rules out conventional image classification as a means to solve the problem. In this article, I explain how autoencoders combined with kernel density estimation can be used for image anomaly detection even when the training set contains only normal images.

I will outline how to create a convolutional autoencoder for anomaly detection/novelty detection in colour images using the Keras library. The code, available on GitHub, also demonstrates how to train Keras models using generator functions that load and preprocess batches of images from disk, early stopping (ending training when loss on the validation set no longer improves), model checkpointing (saving the best model to disk), and extracting values from the hidden layers of a pre-trained Keras model.
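To make the training workflow concrete, here is a minimal sketch of that setup. The directory paths, image size and batch size are placeholder assumptions, not the values from the original repository; `autoencoder` refers to the model defined further down.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

img_size = (100, 100)  # assumed input resolution

# Generators that load and preprocess batches of images from disk.
# class_mode="input" yields (image, image) pairs, which is what an
# autoencoder needs as input and target.
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_directory(
    "data/train", target_size=img_size, class_mode="input", batch_size=32)
val_gen = datagen.flow_from_directory(
    "data/val", target_size=img_size, class_mode="input", batch_size=32)

callbacks = [
    # Stop training when validation loss stops improving.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # Save the best model seen so far to disk.
    ModelCheckpoint("autoencoder_best.h5", monitor="val_loss", save_best_only=True),
]

# `autoencoder` is the compiled Keras model defined in the sketch below.
# autoencoder.fit(train_gen, validation_data=val_gen, epochs=100, callbacks=callbacks)
```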

The basic architecture of an autoencoder is an input layer followed by one or more hidden layers that compress the representation of the data into progressively fewer dimensions (the encoder part of the network), followed by one or more hidden layers that grow progressively larger until the final output layer is the same size as the input layer (the decoder). After passing through the encoder, the data has effectively been compressed into a lower-dimensional form from which it must be reconstructed. The training objective is for the network to reconstruct the image as accurately as possible.

A feed-forward autoencoder model where each square at the input and output layers represents one image pixel and each square in the middle layers represents a fully connected node.

For our purposes we will use a convolutional autoencoder, which differs slightly from the representation shown above. Colour images are represented as a 3D tensor: this can be thought of as three matrices, one for each colour channel of the image (red, green and blue), where each matrix has height and width equal to the pixel dimensions of the original image.

Dimensionality reduction by max-pooling.

In this convolutional autoencoder, dimensionality reduction is achieved using ‘max-pooling’. Imagine taking one 2x2 square of the image matrix; this contains 4 numbers, and in max-pooling we reduce the square to a single number by taking the maximum of the 4. This is done for each 2x2 square in the image, thereby reducing the height and width of the image by a factor of two. The size of the area that is downsampled is a hyperparameter that can be changed; for example, setting the pool size to 4x4 will reduce the height and width by a factor of 4.
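The sketch below shows what such a convolutional autoencoder can look like in Keras. The filter counts and layer sizes are illustrative assumptions rather than the exact architecture from the original repository; the encoder halves the height and width twice with 2x2 max-pooling, and the decoder restores them with upsampling.

```python
from tensorflow.keras import layers, models

def build_autoencoder(input_shape=(100, 100, 3)):
    inputs = layers.Input(shape=input_shape)

    # Encoder: convolutions followed by 2x2 max-pooling halve height and width.
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D((2, 2), padding="same")(x)
    x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
    encoded = layers.MaxPooling2D((2, 2), padding="same")(x)  # latent representation

    # Decoder: upsampling restores the original height and width.
    x = layers.Conv2D(16, (3, 3), activation="relu", padding="same")(encoded)
    x = layers.UpSampling2D((2, 2))(x)
    x = layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
    x = layers.UpSampling2D((2, 2))(x)
    outputs = layers.Conv2D(3, (3, 3), activation="sigmoid", padding="same")(x)

    autoencoder = models.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="mse")
    # Separate encoder model, used later to extract latent vectors.
    encoder = models.Model(inputs, encoded)
    return autoencoder, encoder

autoencoder, encoder = build_autoencoder()
```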

For this experiment, I decided to use the Fruits 360 dataset, which is available on Kaggle. For our purposes, we will treat images of apples as normal and, after training the system using solely images of apples, we will test its ability to identify any other fruit or vegetable as anomalous.

Having been trained on images of apples alone, the convolutional autoencoder becomes specialised in reconstructing images of apples. Therefore, when the network is tested on a non-apple image, the reconstruction error will be noticeably higher, and we use this metric as a signal to identify images that differ from the training set.

On the left we have some samples of the original training data; the middle shows a low-dimensional representation of the input image, with the same dimensionality as the most compressed layer in the autoencoder. On the right is the output of the autoencoder: the attempt to reconstruct the original image from the highly compressed form.

Because the model has been trained only on apples, when it tries to recreate the aubergine images it creates something that looks more like an apple. As a result, the reconstruction error is, on average, higher for aubergine images than for apple images. This is reflected in the histogram below.

The reconstruction error is significantly higher for the images that the autoencoder has not been trained on. We can use this as a signal to detect anomalies.

Of course, we could achieve much better separation if we had simply trained a classifier to distinguish between apples and aubergines, but the goal here is to build a system that works when we don't have data showing what the anomaly will look like. An anomaly threshold for the reconstruction error is decided by looking at the distribution of the reconstruction errors on the normal image data and choosing some percentile.
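A minimal sketch of this thresholding step is shown below, assuming `autoencoder` is the trained model from the earlier sketch and that `normal_images` and `test_images` are placeholder arrays of shape (n, 100, 100, 3) scaled to [0, 1].

```python
import numpy as np

def reconstruction_errors(model, images):
    reconstructions = model.predict(images)
    # Mean squared error per image, averaged over height, width and channels.
    return np.mean(np.square(images - reconstructions), axis=(1, 2, 3))

normal_errors = reconstruction_errors(autoencoder, normal_images)
# Flag anything worse than the 95th percentile of errors on normal images.
error_threshold = np.percentile(normal_errors, 95)

test_errors = reconstruction_errors(autoencoder, test_images)
is_anomaly = test_errors > error_threshold
```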

Kernel density estimation as a metric of anomalousness/novelty

Beggel et al. (2019), in their paper “Robust Anomaly Detection in Images using Adversarial Autoencoders”, propose an interesting addition to this autoencoder model. Instead of relying solely on the reconstruction error, they also consider the likelihood of an image in latent space (the most compressed layer of the autoencoder network).

In order to model the likelihood that a particular image is from the normal class, we use a technique called kernel density estimation. Viewing our image dataset as a collection of vectors, kernel density estimation measures how densely each part of the vector space is occupied by the training data. Given a new observation, a density estimate is made for this point by considering how far the observation lies from the data observed during training.

The problem with this approach, if applied directly to unprocessed images, is that the dimensionality of the image is extremely high, making distinctions between distances much more difficult. A small colour image of 100 x 100 pixels already lives in a vector space with 30,000 dimensions. Due to the ‘curse of dimensionality’, measuring distance or density in this space becomes almost impossible, with all data points appearing equally far from each other. (If you enjoy thinking about the abstract geometry of high-dimensional hypercubes and hyperspheres, I recommend reading this thread about the topic.)

Given that our autoencoder has learned to generate compressed versions of the images in latent space, we will use these low-dimensional representations to model the probability density of the latent space. The assumption here is that anomalous images will occupy areas of the latent space that were seen less frequently in the normal training data and will therefore have lower density. One way to think about density estimation is to imagine all the possible images plotted on a map, where the map is defined by the lower-dimensional latent space that the encoder converts the images to. The density of any point on the map can be calculated by considering how crowded that particular part of the map is when all of our normal training examples are plotted. We hope that anomalous and novel images will be far away from the normal examples, in uncrowded parts of the map.
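One way to implement this is with scikit-learn's KernelDensity, fitted on the flattened latent vectors. This is a sketch under assumptions: it reuses the `encoder` model and the placeholder image arrays from the earlier sketches, and the Gaussian bandwidth of 0.5 is illustrative and would normally be tuned (for example by cross-validation).

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def encode_flat(encoder, images):
    latent = encoder.predict(images)
    # Flatten each latent tensor into a single vector per image.
    return latent.reshape(len(images), -1)

latent_normal = encode_flat(encoder, normal_images)
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(latent_normal)

# score_samples returns log-density; lower values mean less crowded regions
# of the latent space.
normal_densities = kde.score_samples(latent_normal)
test_densities = kde.score_samples(encode_flat(encoder, test_images))
```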

After giving a density score to each compressed image, we can see that anomalous images exist in parts of the latent space far away from the normal examples.

We now have two metrics by which to measure the novelty or anomalousness of an image. If the reconstruction error is higher than 95% of our normal images, or if the probability density of the encoded image vector is lower than 99% of the normal images, then we classify the image as an anomaly. These thresholds are set in a heuristic manner and would vary depending on the task at hand. Using these thresholds we find that the system classifies apples as normal with accuracy around 0.88 and aubergines as anomalous with accuracy around 0.97. When tested with the more difficult task of detecting onions as anomalies, with apples as the normal case, the system could still correctly identify onions as anomalous with an accuracy of 0.95. When peppers were used as the anomalous class the accuracy was 0.98. This suggests that the system performs relatively well in detecting any images that are not apples.
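Combining the two signals is then a one-liner, as sketched below using the scores and threshold computed in the earlier sketches.

```python
import numpy as np

# Bottom 1% of normal latent densities, i.e. lower than 99% of normal images.
density_threshold = np.percentile(normal_densities, 1)

# Flag an image if its reconstruction error is unusually high OR its latent
# density is unusually low relative to the normal training images.
is_anomaly = (test_errors > error_threshold) | (test_densities < density_threshold)
```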

Ideas for improving the model

In cases where it is possible to obtain representative samples of anomalous images, it is likely that a conventional image classification approach will outperform this approach, which uses only normal images. Synthetic data and data augmentation may be used to construct a training set sufficient to treat this as a binary classification problem with a 'normal' and an 'anomalous' class. Other interesting approaches to anomaly detection and novelty detection are proposed by Perera et al. 2019, "Learning Deep Features for One-Class Classification", and Pidhorskyi et al. 2018, "Generative Probabilistic Novelty Detection with Adversarial Autoencoders".
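If you do go down the binary classification route, Keras makes the augmentation side straightforward. The sketch below shows a hypothetical setup; the augmentation parameters and directory path are illustrative, not taken from the original project.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=30,       # random rotations
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    zoom_range=0.2,
    horizontal_flip=True,
)

# class_mode="binary" frames the task as normal vs. anomalous classification,
# assuming one subdirectory per class under data/binary_train.
binary_train_gen = augmenter.flow_from_directory(
    "data/binary_train", target_size=(100, 100), class_mode="binary", batch_size=32)
```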

Full code and data are available on GitHub.


Jude Wells

Data scientist with research interests in natural language processing and computer vision.