GAN for unsupervised anomaly detection on X-ray images.

An attempt at using Generative Adversarial Networks to do more than just generate cool images.

Why anomaly detection on X-ray images

Machine Learning (ML) and Deep Learning (DL) for healthcare are very active areas of research in both academia and industry. ML and DL are promising because they can help doctors and researchers find new cures for diseases that are currently incurable, and they can help physicians work faster and better.

However, ML and DL are hungry for data, which is a particular problem with medical data.

  • Abnormal medical cases are usually much rarer than normal cases, so datasets are heavily skewed toward normal cases (negative samples). Collecting a reasonable number of samples for every case is very time-consuming.
  • For a supervised ML/DL approach, the collected data also needs to be labeled by qualified physicians, so creating labels for medical datasets is very expensive.

A DL model that performs relatively well (not necessarily better than doctors) without requiring a large amount of balanced data and high-quality labels could bring tremendous advantages in cost and speed to any healthcare system.

In this blog post, we explore the idea of building such a model with a Generative Adversarial Network (GAN).

About GANs

GAN-generated dog-ball. Source: BigGAN

GANs are a family of Neural Network (NN) models in which two or more NN components (such as a Generator and a Discriminator) compete adversarially with each other, so that the component NNs get better over time. A GAN can also be viewed as a learned loss function.

As a result of adversarial training, GANs can perform cool tasks that were previously not possible, such as generating realistic images.

Please refer to this excellent article for a gentle introduction to GANs.

Why GANs

To generate realistic samples from a multi-modal distribution of images, a GAN’s Generator and Discriminator must have learned a hierarchy of high-level features of the input data; the Discriminator relies on them to distinguish between real and synthetic samples.

More importantly, the labels needed to train the Discriminator come for free (each sample is either real or fake), so the learned features would be extremely useful when training labels are hard to obtain.

But can these learned features from GANs be used for something else, such as unsupervised anomaly detection? GANs can certainly do cool stuff, but can they do useful stuff too?

In the next part, we will explore the idea of using these learned features to build an unsupervised anomaly detection model on X-ray images.

Anomaly Detection strategy:

  1. Train GAN to generate only normal X-ray images (negative samples).
  2. At prediction time, use the GAN to reconstruct the input images, both normal and abnormal (negative and positive samples).
  3. Compute reconstruction, feature matching and discrimination losses.
  4. Discriminate between normal and abnormal cases using these statistics.

  • Reconstruction losses are the differences between original and reconstructed images.
  • Feature-matching losses are the differences between the features encoded by the hidden layers of the Encoder and of the Discriminator.
  • Discrimination loss is simply the output of the Discriminator.
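
The three statistics can be sketched with NumPy as follows. This is a minimal illustration rather than the code used for the experiments: the function names, the array shapes, and the choice to compare hidden-layer features of an image with those of its reconstruction are all assumptions made for the example.

```python
import numpy as np

def reconstruction_loss(x, x_rec, order=1):
    """Per-image L1 (order=1) or L2 (order=2) error between x and its reconstruction."""
    diff = (x - x_rec).reshape(len(x), -1)
    if order == 1:
        return np.abs(diff).mean(axis=1)
    return np.square(diff).mean(axis=1)

def feature_matching_loss(feat_x, feat_rec):
    """Per-image mean absolute difference between two sets of hidden-layer features.

    Assumption: feat_x and feat_rec are activations of the same hidden layer
    (of the Encoder or the Discriminator) for the input and its reconstruction.
    """
    diff = (feat_x - feat_rec).reshape(len(feat_x), -1)
    return np.abs(diff).mean(axis=1)

def discrimination_loss(d_out):
    """The Discriminator output itself, flattened to one score per image."""
    return np.asarray(d_out, dtype=float).reshape(-1)

# Toy arrays standing in for a batch of images and their reconstructions.
x = np.zeros((2, 4, 4))
x_rec = np.full((2, 4, 4), 0.5)
l1 = reconstruction_loss(x, x_rec, order=1)
l2 = reconstruction_loss(x, x_rec, order=2)
```

Each function returns one value per image, so the outputs can be plotted as histograms or thresholded to separate normal from abnormal cases.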

Because the GAN is trained only on normal images, we hypothesize that these visual and statistical losses will differ between normal images (similar to the training data) and abnormal images (outside the training distribution).

Even though the Discriminator’s original task is to discriminate between real-normal and fake-normal images, we will attempt to repurpose it to discriminate between real-normal and real-abnormal data.


Vanilla GAN architecture. Source: Mihaela Rosca 2018

The strategy above requires a model that is trained adversarially and can both encode and reconstruct images. The vanilla GAN architecture only has:

  • A Generator that outputs image samples from random latent vectors.
  • A Discriminator that classifies samples as real or fake.

In order to have the encoding and reconstructing abilities, we need to modify the vanilla architecture:

Bi-directional GAN

Generator G, Encoder E and joint Discriminator D. Source: BiGAN

BiGAN (or ALI) extends the vanilla GAN model with an Encoder NN component. At the optimum of the training objective, the Encoder is mathematically proven to be the inverse function of the Generator, even though the two are not directly connected with each other during training.

The Discriminator in BiGAN is now a joint Discriminator: instead of discriminating between input images and generated samples, it tries to discriminate between pairs of data X and latent variable Z.
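
For reference, the joint Discriminator corresponds to the following minimax objective, as given in the BiGAN paper and written here in the notation of this post (x is an image drawn from the data distribution p_X, z a latent vector drawn from the prior p_Z):

```latex
\min_{G,E} \max_{D} \; V(D, E, G) =
  \mathbb{E}_{x \sim p_X}\big[\log D(x, E(x))\big]
  + \mathbb{E}_{z \sim p_Z}\big[\log\big(1 - D(G(z), z)\big)\big]
```

The Discriminator only ever sees the pairs (x, E(x)) and (G(z), z), so the Encoder and the Generator interact solely through D.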


Encoder, Generator, Discriminator D and Code Discriminator C. Source: Mihaela Rosca 2018

Alpha-GAN is an attempt at combining the Auto-Encoder (AE) family with the GAN architecture. It starts with the Encoder and Decoder/Generator components from an AE and takes advantage of the GAN as a learned loss function in addition to the traditional L1/L2 loss.

The biggest difference between Alpha-GAN and BiGAN is the direct connection from Encoder to Generator.

In BiGAN, the gradients flow in parallel to the Encoder and the Generator, while in Alpha-GAN the gradients flow sequentially through the Generator and then back to the Encoder.
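
The difference in data flow can be illustrated with toy NumPy stand-ins for the two networks. The linear maps, dimensions and function names below are hypothetical placeholders, not the architectures used in the experiments; the sketch only shows which tensors each model feeds to its losses.

```python
import numpy as np

rng = np.random.default_rng(0)
IMG_DIM, LATENT_DIM, BATCH = 16, 4, 8

# Hypothetical toy weights standing in for the real networks.
W_enc = rng.normal(size=(IMG_DIM, LATENT_DIM))
W_gen = rng.normal(size=(LATENT_DIM, IMG_DIM))

def encode(x):
    """Toy Encoder E: image -> latent vector."""
    return x @ W_enc

def generate(z):
    """Toy Generator G: latent vector -> image."""
    return np.tanh(z @ W_gen)

x = rng.normal(size=(BATCH, IMG_DIM))     # batch of flattened "images"
z = rng.normal(size=(BATCH, LATENT_DIM))  # batch of random latent vectors

# BiGAN: E and G are applied in parallel; the joint Discriminator
# receives (image, latent) pairs and is the only link between them.
bigan_real_pair = np.concatenate([x, encode(x)], axis=1)
bigan_fake_pair = np.concatenate([generate(z), z], axis=1)

# Alpha-GAN: E and G are chained, so a reconstruction loss on G(E(x))
# sends gradients through the Generator and then back into the Encoder.
x_rec = generate(encode(x))
l1_reconstruction = np.abs(x - x_rec).mean()
```

In the BiGAN branch nothing ever compares x with G(E(x)) directly, whereas the Alpha-GAN branch is trained on exactly that comparison.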

Generative results:

The figures below show the outputs of the two best models trained on the MURA X-ray image dataset:

  • The reconstructed samples are generated from the latent variables z encoded from an image: x -> E(x) -> G(E(x))
  • The generated samples are created from randomly sampled latent variables z: z -> G(z)


BiGAN samples.
BiGAN generated samples from the same latent variables over time.
  • The reconstructed samples of BiGAN do not closely resemble the original input images on the left. This can be explained by the architecture of BiGAN, in which the Encoder and Generator do not interact with each other directly during training.
  • The generated samples also lack diversity in posture and brightness when compared with the original data distribution.


Alpha-GAN samples.
Alpha-GAN reconstructed samples over training epochs.

As the Encoder and Generator are chained together during training, Alpha-GAN can reconstruct samples almost perfectly, with only a little blurriness due to the L1 loss.

However, when generating images from randomly sampled latent vectors, the results are much worse than those of BiGAN and do not resemble the training data.

It appears that the learned latent space is highly dependent on the Encoder’s outputs. The latent space generalizes less well than BiGAN’s, so random sampling yields far worse results.

Discriminative results:

In this section, we use the many features from the Encoder and Discriminator to discriminate between normal and abnormal cases.


BiGAN statistics on validation set.

As can be seen from the histograms above, both the L1/L2 reconstruction losses and the feature-matching losses from the Encoder and Discriminator show very little difference between negative and positive images.

When we attempt to use these statistics to classify between normal and abnormal cases, the Receiver Operating Characteristic (ROC) curves in the lower right show that most features follow the diagonal baseline with the area under the curve fluctuating mildly around the random mark of 0.5.
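
An area under the ROC curve can be computed directly from per-image scores and labels. The sketch below uses the pairwise (Mann-Whitney) formulation in NumPy; the function name and the toy scores are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def roc_auc(scores, labels):
    """Area under the ROC curve for anomaly scores.

    Equals the fraction of (abnormal, normal) pairs in which the abnormal
    image gets the higher score; tied pairs count as half.
    labels: 1 for abnormal (positive), 0 for normal (negative).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy example: 3 of the 4 (abnormal, normal) pairs are ranked correctly.
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])  # 0.75

# A statistic that carries no information gives the random mark of 0.5.
auc_uninformative = roc_auc([1.0, 1.0, 1.0, 1.0], [0, 0, 1, 1])  # 0.5
```

The pairwise form is quadratic in the number of images, which is fine for a validation set of modest size; a rank-based implementation would scale better.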

Unfortunately, these unsupervised features from BiGAN cannot be used for detecting anomalous X-ray images.


Alpha-GAN statistics on validation set.

Similarly for Alpha-GAN, we plot the histograms of the L1/L2 losses and the high-level feature losses. In this case, the reconstruction losses display a difference between negative and positive samples. This can be attributed to Alpha-GAN’s ability to reconstruct images, as observed above.

The ROC curves show that while the other features follow the diagonal baseline closely, the L1 and L2 losses deviate visibly, with an area under the ROC curve of 0.65.

Problems and challenges

The experimental results do not agree with the hypothesis proposed above, but why?

Could it be that the hypothesis is correct, but the experiments were not good enough to achieve the desired results? Or are there still some missing pieces in the current approach of using GANs for anomaly detection?

Below are some challenges that I think are blocking progress:

  • Mode collapse: The GAN generates only a small subset of the possible data space. This happens when the Generator sacrifices diversity for quality.
Losses of the Encoder and Generator suddenly drop during training.
  • Discriminator assumption: In this recent work, the authors show that the same Discriminator architecture arrives at different discriminative boundaries depending on its random initialization. While these arbitrary boundaries are good enough for the original purpose of generating realistic images, they become a problem for other purposes such as detecting anomalies.
  • Sparse latent space: The input data can be reconstructed from the encoded latent vectors, but GANs still lack the incentive to learn a well-generalized latent space. This phenomenon can be observed in the randomly generated samples from Alpha-GAN.


Even though we could not confirm the hypothesis that a GAN’s unsupervised features can detect anomalies, the experiments have given us some insight into the internal operation of GANs. This is still an interesting approach, and we will likely see more research in this direction in the future.

With this article, you now have a better idea of how not to build an anomaly detector with a GAN, and hopefully a better idea of how to do it.

If you want to reproduce the results or have an idea for improvement, check out the GitHub repo: