The Variational Autoencoder

Aditya Mehndiratta · IITG.ai · Oct 9, 2019

This is the first article in the ‘Generative Models’ series, which tries to decode the technology that lets machines do things long considered exclusively human endeavours: drawing, painting, writing, music and so on. Can we teach machines these abilities?

So let’s begin… Perhaps the most basic and fundamental of all generative networks is the variational autoencoder. But before we dive in, we must first understand what an autoencoder really is.

The Autoencoder

The autoencoder is a neural network architecture made up of two sub-structures.

The Encoder:

The encoder is responsible for converting the high-dimensional input data into a lower-dimensional representation vector.

The Decoder:

The decoder’s job is to take the lower-dimensional representation vector and convert it back into the input image domain. The output image is known as the reconstructed image, and the closer it is to the input the better!

The Autoencoder Structure

The network is trained well if it learns weights that minimise the loss between the reconstructed image and the original image. Now consider this amazing idea: suppose our lower-dimensional representation vector lives in the 2D plane. If we pick any random point from the 2D plane and pass it through the decoder, we get an output image which may or may not be in the original set! Voilà, there you have your first generative model!

Let us also understand it in code in PyTorch. The entire code can be found using the GitHub links at the end of the article. The outline of the model is shown below; feel free to clone the repository and play around with the parameters. I initially worked with convolutional layers but found that fully connected layers work better.

The Model Outline

The dimensions of the simple end-to-end fully connected model are shown.

Model Dimensions
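For readers who want something concrete to tinker with right away, here is a minimal sketch of such a fully connected autoencoder in PyTorch. The layer sizes (a 784-dimensional flattened MNIST input, a single 512-unit hidden layer, and a 2D representation vector) are assumptions for illustration, not necessarily the exact dimensions used in the repository.

```python
# Minimal fully connected autoencoder sketch (layer sizes are assumptions).
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=512, latent_dim=2):
        super().__init__()
        # Encoder: compresses the 784-dimensional input to a 2D representation vector
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # Decoder: maps the 2D representation vector back to the image domain
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),  # pixel values in [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)
```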

The model is trained on the MNIST dataset and the representation vector lies in the 2D plane. The autoencoder output for 64 digits generated from points drawn from a random normal distribution is shown. I encourage you to try out various points, pass them through the decoder and see the results!

Generated images from Autoencoder
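If you want to reproduce this kind of figure, the sketch below shows the idea, assuming `model` is a trained instance of the hypothetical `Autoencoder` class above and images are reshaped back to 28×28.

```python
# Decode 64 random 2D points into images (assumes a trained `model` as sketched above).
import torch

model.eval()
with torch.no_grad():
    z = torch.randn(64, 2)                     # 64 points from a normal distribution
    generated = model.decoder(z)               # shape: (64, 784)
    generated = generated.view(-1, 1, 28, 28)  # reshape back into image form
```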

Further, there is another interesting plot on the test set: a scatter plot of the output representation vector’s x and y co-ordinates (remember, the output dimension of the encoder is 2D). Each colour represents the label of the image.

The fascinating thing is that the plot is made for the test data and the labels are unseen, yet a pattern emerges with similar digits clustered together!

What is unwanted, however, is:

  • The high degree of overlap
  • The low level of continuity in the space, i.e. if we select a point (x, y) that decodes to, say, a digit 5, then a point neighbouring (x, y) should also decode to an image similar to 5
  • As a generative model, the spread of the digits should be nearly uniform; here, if we sample a point randomly, it is more likely to be a 5 than an 8!

We have chosen only two dimensions for the representation space, but this problem becomes greater in larger dimensions, when we have to encode data like faces. This is the motivation for studying a better architecture: the Variational Autoencoder!

The Variational Autoencoder

Let us now make a few important changes. We need to incentivise our model to group similar images together. The solution is as follows: instead of the encoder output representing a single point in the 2D plane, we will now match it to a multivariate normal probability distribution.

6 is encoded with the help of a normal distribution

Remember that we do not assume any correlation between the two dimensions. So essentially a VAE will take an input image and map it to two vectors, i.e. the mean vector (mu) and the logarithm of the variance (log_var)… we don’t have a covariance term as it is assumed to be 0!

Also, instead of the variance we use the log of the variance, because the variance is always positive and we would rather have a value in (-inf, +inf), which is the natural output range of a neural network.

Since our encoder now outputs two vectors, mu (the mean vector) and log_var (the log of the variances, i.e. the diagonal of a covariance matrix whose off-diagonal terms are 0), how do we represent the image on a 2D plane?

We map it to a point z = mu + sigma * epsilon, where sigma = exp(log_var / 2) and epsilon is a point sampled from a standard normal distribution.

So why are we doing this? Remember that in the autoencoder case there was no incentive for our model to make the space continuous. Here, since we are sampling a random point around mu, the reconstruction loss ensures that even if we sample a point near the original encoding we should get a similar reconstructed image!

Let’s get down to some code!

VAE model outline

Notice the additions we have now… we have also defined a function ‘sampling’ that takes the output from the encoder and returns a point z in the 2D latent space using the formula discussed above.
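As a rough illustration of those additions, here is a hedged sketch of what the VAE model and the ‘sampling’ function look like. The layer sizes mirror the autoencoder sketch above and are assumptions, not the exact dimensions from the repository; what matters is the two heads producing mu and log_var and the reparameterisation z = mu + sigma * epsilon.

```python
# VAE sketch: the encoder outputs mu and log_var, and `sampling` applies
# z = mu + sigma * epsilon with sigma = exp(log_var / 2). Layer sizes are assumptions.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=512, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
        )
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)       # mean vector
        self.fc_log_var = nn.Linear(hidden_dim, latent_dim)  # log of the variance
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),
        )

    def sampling(self, mu, log_var):
        sigma = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
        epsilon = torch.randn_like(sigma)  # epsilon ~ standard normal
        return mu + sigma * epsilon        # z = mu + sigma * epsilon

    def forward(self, x):
        h = self.encoder(x)
        mu, log_var = self.fc_mu(h), self.fc_log_var(h)
        z = self.sampling(mu, log_var)
        return self.decoder(z), mu, log_var
```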

The Loss Function

The decoder is the same as in the vanilla autoencoder, and we have already seen the changes in the encoder; the other difference is the loss function. Previously the loss function was the binary cross-entropy loss between the individual pixels of the reconstructed and input images. Now we have an additional loss known as the Kullback–Leibler (KL) divergence loss. In a nutshell, the KL divergence loss is a measure of how different two probability distributions are. In our case we want to measure how different our distribution is from the standard normal distribution, hence the KL divergence loss takes the form below. (Interested readers can refer to this article on KL divergence for a more rigorous explanation.)

kl_loss = -0.5 * sum(1 + log_var - mu^2 - exp(log_var))

Observe that when log_var = 0 and mu = 0 our distribution is identical to the standard normal and the kl_loss is 0! Hence the total loss function is the sum of the reconstruction loss and the KL divergence loss.

The Loss function of VAE
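A minimal sketch of this combined loss in PyTorch is given below. The use of a summed binary cross-entropy term is an assumption consistent with the article; the exact reduction used in the repository may differ.

```python
# VAE loss sketch: reconstruction loss (binary cross-entropy over pixels)
# plus the KL divergence term from the formula above.
import torch
import torch.nn.functional as F

def vae_loss(reconstructed, original, mu, log_var):
    recon_loss = F.binary_cross_entropy(reconstructed, original, reduction='sum')
    kl_loss = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + kl_loss
```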

As in our previous case, we will now look at the outputs of the VAE on the test set.

VAE output

Unimpressed? Wait up! Let us plot the encoder output, which after sampling gives us a point in the 2D plane…

The VAE output

This is much better! The distribution is much more closely spaced and also more uniform, i.e. one digit isn’t spread out far more than another. Remember, the main role of a generative network is, first, to give good outputs, i.e. ones similar to the input domain, and second, to give us a reliable way to sample new points so that we know in advance roughly what we are going to generate, i.e. a new point close to, say, an image of a 1 should also look like a 1.

Thus we have now found a way to make the space more continuous, and we have also found a reliable way of sampling, i.e. from a normal distribution (see above for the formula for the z vector, where epsilon was taken from a standard normal distribution).
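To make that concrete, here is a small illustration of sampling from the trained VAE, assuming `vae` is a trained instance of the hypothetical `VAE` class sketched earlier. It draws random points from the standard normal prior and also decodes a regular grid over the latent plane, which is a handy way to see the continuity for yourself.

```python
# Sampling from the VAE prior and decoding a latent grid (assumes a trained `vae`).
import torch

vae.eval()
with torch.no_grad():
    # random samples from the standard normal prior
    z = torch.randn(64, 2)
    samples = vae.decoder(z).view(-1, 1, 28, 28)

    # a regular 8x8 grid over the latent plane to inspect continuity
    xs = torch.linspace(-2, 2, 8)
    grid = torch.cartesian_prod(xs, xs)              # shape: (64, 2)
    grid_images = vae.decoder(grid).view(-1, 1, 28, 28)
```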

Here we have focused on the neural network approach, characterised by the encoder, decoder and loss function. However, I would recommend this link if you want to really understand why it is called a ‘variational’ autoencoder in the first place. The ideas covered there belong to the probability model perspective and are currently out of the scope of this article.

The real benefit of the VAE can be further exemplified with a more complex dataset; the celebrity faces dataset is a good exercise. Since we have only used a 2D representation vector for easy visualisation, we may not get the best-resolution images, but feel free to try increasing the representation vector’s dimension. I would love to hear about these experiments in the comments. Also, don’t forget to clap if you found the article helpful, and follow IITG.ai for more articles and updates.

Links and References:
