Image Classification Using the Variational Autoencoder

Samson Afolabi
Jan 2, 2020 · 5 min read
Former Super Eagles (Nigeria national football team) coach, Stephen Keshi (source: Al Jazeera)

The Code for this project is available on Github.

Conventional image identification using neural networks requires image labeling, which can be a tedious and expensive activity. Imagine running a company with large datasets of images: every time we need to build an image identification algorithm for a particular kind of image, we need to label the images as ‘instances’ and ‘not instances’, for example ‘cats’ and ‘not cats’ or ‘dogs’ and ‘not dogs’. The problem is that there are many kinds of ‘not instances’. To build a cat image identifier, we would label all cats as cats, and dogs, goats, cars, humans, aeroplanes, etc. as not cats. If we then wanted to build a human image identifier with the same dataset, we would have to label all human images as humans and every other image as non-human.

This is obviously time-consuming and expensive. The question then arises: can we build an image identifier without having to go through the rigour of image labeling? Or, better put, can we build a neural network that can learn the specific distribution an image comes from? Aside from image identification, this idea can be used to separate images according to their distribution, for example separating unethical images from videos or separating ads from videos.

In this project, I used the Variational Autoencoder (VAE) to solve these problems. I trained the VAE on images from a particular distribution so that, when images from a different distribution are fed into it, the reconstruction loss is expected to be higher. Through the reconstruction loss, one can then judge whether or not an image belongs to a particular distribution. In this example, I used this idea to separate football images from ads.

The Variational Autoencoder

The Structure of the Variational Autoencoder

The VAE is a deep generative model, just like Generative Adversarial Networks (GANs). Deep generative models have shown an incredible ability to produce highly realistic pieces of content, like images. The Variational Autoencoder consists of an encoder, a latent space, and a decoder. The encoder and decoder are basically neural networks. The Variational Autoencoder is also well explained in this article.

The VAE takes an input through the encoder and produces a much smaller, dense representation (the encoding) in the latent space. This encoding contains enough information for the next part of the network (the decoder) to process it into the desired output format, which, in the optimal case, is exactly the input fed into the encoder. Through the training process, the latent space of the VAE is by design continuous, allowing easy random sampling and interpolation.
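The continuous latent space comes from the encoder outputting a mean and a (log) variance per latent dimension, and sampling via the reparameterization trick. A minimal NumPy sketch of that sampling step (the shapes and toy values here are illustrative, not the ones used in this project):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, 1).

    Writing the sample this way keeps the path from the encoder outputs
    (mu, log_var) to z differentiable, which is what makes the VAE
    trainable end to end with gradient descent.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Toy encoder outputs for a batch of 2 images and a 4-dimensional latent space.
mu = np.zeros((2, 4))
log_var = np.zeros((2, 4))  # log variance of 0 means unit variance
z = sample_latent(mu, log_var)
print(z.shape)  # (2, 4)
```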

The VAE is optimized over two losses: the KL loss and the reconstruction loss (the difference between the input image and the reconstructed image). The KL loss is the Kullback–Leibler divergence between the distribution of the latent space and a standard Gaussian with mean 0 (zero) and standard deviation 1 (one). It compresses the distribution of the latent space towards the standard distribution, which helps the decoder map from every area of the latent space when decoding an image.
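For a Gaussian latent, the KL term has a well-known closed form, so the two losses can be sketched in a few lines of NumPy. This is a generic illustration (squared-error reconstruction is one common choice), not the exact loss used in this project:

```python
import numpy as np

def vae_loss(x, x_hat, mu, log_var):
    """Total VAE loss per image = reconstruction loss + KL loss.

    Reconstruction: squared error between input x and reconstruction x_hat.
    KL: closed-form divergence between N(mu, sigma^2) and N(0, 1):
        KL = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))
    """
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)
    return recon + kl, recon, kl

x = np.array([[0.0, 1.0]])
x_hat = np.array([[0.0, 1.0]])  # perfect reconstruction
mu = np.zeros((1, 2))
log_var = np.zeros((1, 2))      # latent already matches N(0, 1)
total, recon, kl = vae_loss(x, x_hat, mu, log_var)
print(total)  # [0.] — both terms vanish in this ideal case
```

When the reconstruction is poor or the latent distribution drifts away from the standard Gaussian, the corresponding term grows, which is exactly the signal used later to tell distributions apart.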

In my case (using the Variational Autoencoder to separate football images from ads), I had to break videos into frames (images). The VAE was then trained on images from this distribution (football images) only. It is important to note that the Variational Autoencoder is a computationally expensive algorithm to run, so I decided to run it on an EC2 instance, with the data stored in an S3 bucket. In ordered steps, we have:

  1. Convert football videos into frames.
  2. Build the architecture of the VAE and write all the necessary functions.
  3. Move the training data to the S3 bucket, and train on the AWS EC2 instance.
  4. Repeat any steps until the reconstruction loss reduces considerably.
Reconstruction loss with 99,872,576 parameters

The image above shows the reconstruction loss of the VAE after some iterations of the steps listed above. I also have the following reconstructed images to show. Note that I had to reduce the size of the images fed into the encoder. This was to increase the speed and reduce the training time of the VAE, while avoiding a too-large EC2 instance, which would not have been pocket-friendly for me. The image size fed into the encoder was 60 x 80.

Input Image (non-reduced version) — Size is 1280 x 720
Reconstructed Image — Size is 60 x 80 (Loss 139.55)

The algorithm does fairly well for football images. In the case of advertisements, the reconstruction loss of the VAE was higher. Of course, this was expected considering the VAE is somewhat familiar with only football images.

Since many football images had large areas containing the football field (the green turf), I presumed the algorithm might also learn this, such that when images with large green areas are fed into the VAE, it might take them for soccer images even when they are not. I tested this with images from American football, which also has a green turf, and as expected the algorithm thought they were still soccer images.

Reconstruction Loss of different Image types

From the results, the VAE has a true positive rate of 0.93. It struggles to separate soccer images from American football images, and it has a false positive rate of 0.31. In all, it is safe to conclude that through the reconstruction losses, we can classify images according to their respective distributions without prior labeling.
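The decision rule behind these rates is simply a threshold on the reconstruction loss. A minimal sketch with made-up loss values and a hypothetical threshold (the 0.93 and 0.31 above come from the actual experiment, not from this toy data):

```python
import numpy as np

def classify_by_loss(losses, threshold):
    """Flag an image as in-distribution (football) if its loss is below threshold."""
    return losses < threshold

# Hypothetical per-image reconstruction losses; threshold is illustrative only.
football_losses = np.array([120.0, 140.0, 135.0, 150.0])  # should be flagged football
ad_losses = np.array([310.0, 280.0, 150.0])               # should be flagged not-football

threshold = 200.0
tpr = classify_by_loss(football_losses, threshold).mean()  # true positive rate
fpr = classify_by_loss(ad_losses, threshold).mean()        # false positive rate
print(tpr, fpr)
```

Sweeping the threshold trades the true positive rate against the false positive rate, so in practice it would be chosen on a held-out set of labeled losses.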

As a next step, I would like to train the model on a larger EC2 instance. This would allow training with larger images, and possibly with more images, to further reduce the reconstruction loss.


I want to say a big thank you to Adam Green, Technical Director at Data Science Retreat, Berlin for his support on this project. A special thank you to my sweetheart, Olamide for her emotional support and to Wuraola for her help in editing this article.

For questions or personal messages, feel free to contact me on LinkedIn.

Further Reading

  1. Understanding Variational Autoencoders by Joseph Rocca. link
  2. Intuitively Understanding Variational Autoencoders by Irhum Shafkat. link
