Variational Autoencoders - EXPLAINED

Shivang Mistry
Published in Analytics Vidhya · Jan 3, 2020

See these faces below?

Look real, right? What if I told you none of these people actually exist?

I’ll let you have your 🤯 moment.

What you see are results from a generative deep learning model. That’s right, a computer made these! These are the same kinds of models behind those fake videos you see of Obama and Trump. However, generative models are also used for really cool applications, such as creating music, recolouring black and white photos, creating art, and even drug discovery. Many models fall under the category of generative models, and the two most popular are Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs).

“Where there is a will, there is a VAE”

Variational Autoencoders are a popular, older type of generative model based on the structure of standard autoencoders. A VAE consists of an encoder, a decoder, and a loss function. VAEs let us learn latent representations of an input, and with them we can represent and even synthesize complex data in the context of simpler models. VAEs have already shown promise in generating many types of complex data, such as handwritten digits, faces, and new molecules.

🤖 Autoencoders

Before learning about how Variational Autoencoders work, let’s first understand how standard autoencoders work. An autoencoder is a neural network that learns to copy its input to its output. It is an unsupervised learning technique, which means the network only receives the input, not the input’s label.

An autoencoder has 3 parts:

  • Encoder
  • Latent Space
  • Decoder

The encoder network compresses the input and the decoder network decompresses it to produce an output.

The encoder network is composed of convolutional layers. If you’re familiar with CNNs, you’ll know that they compress the input with convolutional and pooling layers to create a much more compact, dense representation of the input, which the fully connected layers then use to make a prediction. The encoder network works the same way, and outputs a dense representation of the input, called the encoding. The decoder network uses deconvolutional layers, which are pretty much the reverse of convolutional layers.

Deconvolutional Layers [source]
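To make this concrete, here’s a minimal sketch of a convolutional autoencoder in PyTorch. The 28x28 single-channel input, the layer sizes, and the use of strided convolutions in place of pooling are my own assumptions for illustration, not details from any specific implementation.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal convolutional autoencoder for 28x28 grayscale images (illustrative sizes)."""
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: convolutions compress the image down to a dense encoding
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, latent_dim),                      # the bottleneck
        )
        # Decoder: transposed ("deconvolutional") layers decompress the encoding
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 7 * 7),
            nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),   # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def forward(self, x):
        encoding = self.encoder(x)     # compress
        return self.decoder(encoding)  # reconstruct
```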

“Oh no, what a big loss!”

When the data is compressed, some information may be lost, which means it can’t be recovered in the decoding process. This is called a lossy encoding. If no information is lost, it is a lossless encoding. The encoder learns to reduce the data to only the important information, and the decoder learns to take that compressed data and decode it to produce the final output.

An autoencoder’s goal is to make its output similar to its input. By training it on a library of samples and adjusting the model parameters, it is able to produce samples that are similar to the input.

Sounds easy right?

The hard part is compressing all the information through the bottleneck. Imagine trying to stuff a pillow into a purse. You’d have to press and squeeze it in. The bigger the purse, the less you’ll have to squeeze and press the pillow, and the less its shape will change. In the same way, the more variables the bottleneck has to represent the information, the closer the output will be to the input.

Because of their ability to encode and decode data efficiently, autoencoders are great for:

  • Dimensionality Reduction
  • Data Denoising
  • Watermark Removal
  • Feature Variation
  • Image Segmentation

The problem with standard autoencoders

Though autoencoders have many applications, they can still be limiting. Autoencoders are only able to generate compact representations of the input and reconstruct the input, which is great for data denoising and dimensionality reduction, but not for generation. When generating new images, we don’t want to replicate the input; instead, we want to generate variations of it.

This is because the latent space, where the compressed inputs or encodings lie, is not continuous and cannot be easily interpolated. The latent space of an autoencoder groups the encodings into discrete clusters, which makes sense, as it makes decoding easier for the decoder.

To illustrate this point, let’s say we have a box, and in it are three distinct piles of candy: candy canes, lollipops, and jellybeans.

These piles of candy represent the clusters of encodings in the latent space. Now you are told to pick the candy you like while blindfolded! You can’t see where the candy groups are, and the groups are small, so there is a higher chance of grabbing nothing than of actually getting a candy.

Similarly, it is hard to generate new data because there are huge gaps between the distinct groups. Every input has a vector in the space, but not every vector in the space corresponds to an input. The decoder has no idea what to do with vectors from those gaps, because it has never seen them during training. Even though it will produce an output for every vector, most of those outputs are not recognizable.

“Cue the Variational Autoencoders”

Variational Autoencoders are great for generating completely new data, just like the faces we saw at the beginning. They are able to do this because of fundamental changes in the architecture, which has four parts:

1. Encoder

2. Latent Distribution, which includes:

  • Mean Vector
  • Standard Deviation Vector

3. Sampled Latent Representation

4. Decoder

The difference

The primary difference between an autoencoder and a Variational Autoencoder is that an autoencoder clusters the encodings into distinct groups, whereas a VAE’s encoding clusters are not distinct and are more continuous. The encodings are mapped to a distribution, which makes it much easier to sample from the latent space and generate new images.

A VAE is able to do this because the encoder outputs two vectors: a mean vector and a standard deviation vector. Using these two outputs, the variational autoencoder can sample across a continuous space based on what it has learned from the input data.

Intuitively, the mean is where the encoding should be in the latent space, and the standard deviation defines the area around that point. While training, the decoder learns not only from that single point in the latent space, but also from the vectors surrounding it.
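Here’s a rough sketch of that idea in PyTorch (the layer sizes, and the use of a log-variance output instead of a raw standard deviation, are assumptions on my part): the encoder produces a mean and a log-variance, and a latent vector is sampled with the reparameterization trick so the sampling step stays differentiable.

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    """Illustrative VAE encoder: maps an input to a mean vector and a log-variance vector."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.fc_mean = nn.Linear(hidden_dim, latent_dim)    # mean vector
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)  # log of the variance (more stable than raw std)

    def forward(self, x):
        h = self.hidden(x)
        return self.fc_mean(h), self.fc_logvar(h)

def sample_latent(mean, logvar):
    """Reparameterization trick: z = mean + std * epsilon, with epsilon ~ N(0, 1)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mean + std * eps
```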

“Just a regular(ization) guy”

Only mapping the vectors to a distribution is not enough to generate new data. Since there are no limits on what values the mean and standard deviation vectors can take, the encoder can return distributions with very different means for different classes or clusters, each with a tiny variance, so that the encodings of a sample barely vary. The result is encoded distributions that are far apart from each other, which just lets the decoder recreate the input. Sound familiar?

We want all the clusters to be continuous and to overlap somewhat. So, to make sure our VAE does not collapse into a standard autoencoder, we need a regularization term: we introduce the Kullback-Leibler divergence (KL divergence) into our loss function.

The regularization term forces the encoder to output distributions close to a standard normal distribution (a mean of 0 and a standard deviation of 1). By doing so, it ensures the distributions are much closer together and actually overlap, which gives us some assurance that the latent space is truly continuous.

If we optimized only for the KL divergence, the encoder would plot points randomly near the centre. There wouldn’t be any clusters, just a mess of points. We want the space to stay organized by class, so we also need the reconstruction loss. These two terms make up the loss function of a Variational Autoencoder.
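As a sketch, assuming the mean/log-variance encoder outputs from the snippet above, the loss can be written as a reconstruction term plus the closed-form KL divergence between the encoder’s Gaussian and a standard normal:

```python
import torch
import torch.nn.functional as F

def vae_loss(reconstruction, target, mean, logvar):
    """Reconstruction term + KL divergence to a standard normal (illustrative)."""
    # How far the output is from the input (binary cross-entropy works for pixel values in [0, 1])
    recon_loss = F.binary_cross_entropy(reconstruction, target, reduction="sum")
    # Closed-form KL divergence between N(mean, std) and N(0, 1)
    kl_div = -0.5 * torch.sum(1 + logvar - mean.pow(2) - logvar.exp())
    return recon_loss + kl_div
```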

Future of VAEs (Especially in healthcare)

Sure generating faces and handwritten numbers are exciting, but what excites me the most is generative models being used in healthcare and drug discovery.

Insilico Medicine, a biotech company, has been able to synthesize a new molecule within 21 days and validate it in only 25, compared to the 2–3 years typically required by the pharmaceutical industry! That’s crazy! And this is just the start; generative models are at the forefront of innovation. Rapid discoveries will result in wider accessibility to people across the world. Using our genome data, we can have personalized medicines that are much more effective. Imagine if we could synthesize new drugs in the time it takes you to make a hamburger. The possibilities are truly endless!

Acknowledgements

Hey, hey, hey! If you are reading this, thank you 🙏 🙏 for making it to the end!

I’d love to connect through LinkedIn, and learn about your thoughts on the future of Generative models!
