Building Intuition: Variational Autoencoders (VAEs)

How Variational Autoencoders are able to generate new data

Harshita Sharma
Accredian
7 min read · Sep 25, 2023


Introduction

If we talk about neural networks in general, deep learning has become the basis on which we can solve so many problems today, be it object detection, language translation, audio classification, and much more.

Even though the inputs to these problems are different and focus on different things, there is a common pattern they all follow: if you think about it, the trained neural network gives additional information along with the result.

For example, in object detection you provide a set of images, and the output gives you additional information about which objects are present and where they are in the image.

An object detection layout

There is a different type of neural network that does not give us any additional information; instead, it generates new samples of whatever kind of data it was trained on, be it images, text, or something else. These neural networks are known as Generative Models.

Generative Models example

Discriminative models, such as logistic regression, decision trees, support vector machines, and deep neural networks for classification and regression tasks, aim to describe or classify data into predefined categories or make predictions based on input features.

These models focus on modeling the conditional probability distribution of the target variable given the input features. That is Supervised Learning.

Generative Models, on the other hand, belong to Unsupervised Learning, where models are designed to generate new data samples that resemble the training data. They learn the underlying data distribution and can create entirely new instances that are statistically similar to the training data.

Generative Models

Variational Autoencoders (VAEs) belong to this class of neural networks, along with many other famous ones like Generative Adversarial Networks (GANs), Diffusion models, etc.

Autoencoders

As already established, Variational Autoencoders are a type of generative model based on another neural network architecture known as the Autoencoder, which is mainly focused on data compression and representation learning.

An Autoencoder architecture

Autoencoders are made up of two parts, an Encoder and a Decoder, just like VAEs; the only difference lies in the loss function.

Autoencoders minimise only the reconstruction loss, whereas VAEs have to minimise a latent loss along with the reconstruction loss.

What the encoder does is convert the input provided to the network into a lower-dimensional representation (data compression) that the rest of the network can work with. This representation is known as the latent vector (or latent variable).

The decoder then takes this vector and expands the information back out in order to reconstruct the original input sample.
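To make this concrete, here is a minimal sketch of an autoencoder in PyTorch; the layer sizes, the 2-dimensional latent space, and the use of MSE as the reconstruction loss are illustrative assumptions, not choices taken from the article:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        # Encoder: compresses the input into a low-dimensional latent vector
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder: expands the latent vector back to the original input size
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)      # the latent vector
        return self.decoder(z)   # the reconstruction of x

# A plain autoencoder is trained only on the reconstruction loss,
# e.g. mean squared error between input and reconstruction.
reconstruction_loss = nn.MSELoss()
```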

The Latent Vector

Intuitively, the first question that comes to mind is: what is the point of trying to generate an output that is the same as the input?

Generator for GANs and Decoder for VAEs

The answer is that, with autoencoders, we don’t care about the output itself but about the latent vector created in the process.

This vector is important because it represents the input data, be it text, an image, etc., as a numerical vector. It can then be fed into other, more complex architectures to solve bigger problems.

But this also means that the latent vector contains information solely about the data it was trained on. It does not have the flexibility to produce entirely new combinations or variations of the data. In other words, something we cannot do with autoencoders is generate new data, and this is where VAEs come in.

The Challenge: Sampling from Unknown Distributions

Let’s take an intuitive approach to understand this.

We already gave the latent vector a great deal of importance earlier, and that actually explains a lot. Since autoencoders are able to reproduce the same image, the values in the latent vector must correspond exactly to the input data.

During training, autoencoders learn to map input data to a specific point in the latent space, such that the decoder can accurately reconstruct the input from that point.

So what if, at test time, we fill this latent vector with random values? You guessed it: we end up with garbage output, which shows just how important this vector really is.
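As a quick illustration, and continuing the hypothetical autoencoder sketch from above, decoding a randomly chosen latent vector will usually produce an unrecognizable image, because that point most likely falls outside any region the encoder actually uses:

```python
model = Autoencoder()
# ... assume the model has been trained on images ...

# Pick an arbitrary point in the latent space and decode it.
random_z = torch.randn(1, 2) * 10   # probably far from any learned "pool"
garbage = model.decoder(random_z)   # typically an unrecognizable output
```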

Now, how is the vector able to create the same image, but not any new combinations?

To understand this let’s see what really happens inside of this latent space through which the vector is formed.

To get the right output, we need some method to determine this latent vector, and the idea behind this method is sampling from a distribution. We’ll take a look at how exactly this happens in order to build up our intuitive understanding.

Distributions as pools (Idea: Code Emporium)

Think of the latent space as a vast space containing different pools. Taking distributions first, assume that a distribution is simply a pool containing similar information. For example, suppose we have a “cat pool”; this pool will hold all the information about cats from the input, like eyes, nose, whiskers, etc., represented in the form of vectors. A distribution, therefore, is just a pool of vectors.

Sampling, as we commonly understand it, means randomly selecting some values from a large group of values. That’s exactly what it means here too: sampling is simply selecting vectors from these pools.

But that seems too good to be true, right? That’s because we can only select values from a pool when we know where the pool is.

The problem is that we don’t know where these pools are located within the latent space.

Sampling from a completely unknown, random pool will produce a nonsense output

Consider this vast latent space: it can contain different pools built according to the data that was fed in during training, i.e. a dog pool, a cat pool, a panda pool, and so on. These are learnt internally by the autoencoder, and there is no way for us to find them.

Therefore, at test time there is a pretty high chance of sampling a completely useless vector with no relevant values from the relevant pools (distributions), which answers our question of why autoencoders are not generative in nature.

We simply cannot generate a dog image if we can’t reach the dog pool and can’t assign the correct values to the vector in the generation phase.

Variational Autoencoders

But what if we did know where to find these distributions? This is where Variational Autoencoders come into action.

In VAEs, we constrain the space that contains all the distributions we want to draw samples from. This is done during the training phase. The constraint makes sure that the distributions (representing different animals in our example) are organized in a structured manner.

Variational Autoencoder with a constrained latent space

Also, in VAEs, instead of treating the latent vector as a single point, we consider it as following a probability distribution. Think of this as turning each “pool” into a well-defined statistical distribution. So, the “dog pool” is not just a point but a distribution of possible dog representations.
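In code, this usually means the encoder predicts the parameters of a Gaussian, a mean and a log-variance, for each input, and a latent vector is then sampled from that distribution using the reparameterization trick. A minimal sketch, with the dimensions again being illustrative assumptions:

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=2):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU())
        # Instead of a single point, the encoder outputs a distribution:
        self.mu = nn.Linear(128, latent_dim)        # mean of the Gaussian
        self.log_var = nn.Linear(128, latent_dim)   # log-variance of the Gaussian

    def forward(self, x):
        h = self.hidden(x)
        mu, log_var = self.mu(h), self.log_var(h)
        # Reparameterization trick: z = mu + sigma * epsilon,
        # which keeps the sampling step differentiable during training.
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        z = mu + std * eps
        return z, mu, log_var
```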

As the area we have to sample from is constrained, the chances of sampling vectors that can produce valid-looking images are much higher.

For example, suppose we train our VAE on the MNIST handwritten digits dataset. Doing so will create pools for the digits 0 to 9. As the pools lie in a defined region, and this region is continuous (i.e. a range of values within the distribution that includes an infinite number of possible values), sampling values from them, and even slightly changing a value in the vector, will produce valid but different results.
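Once training has organized these digit “pools” around a standard Gaussian, generation reduces to sampling a point from that Gaussian and decoding it; nearby points decode to similar but not identical digits. A hypothetical usage sketch, assuming a trained decoder module `decoder` that maps 2-dimensional latent vectors back to images:

```python
# Sample a point from the prior (a standard normal) and decode it.
z = torch.randn(1, 2)            # latent_dim = 2 in this sketch
new_digit = decoder(z)           # a plausible digit, not a copy of any training image

# Nudging z slightly gives a valid but slightly different digit.
similar_digit = decoder(z + 0.1)
```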

Latent space for 0–9 digits
Images generated based on the sampled vector

VAEs use a combination of loss functions during training. The primary components are the reconstruction loss and the latent loss (often the Kullback-Leibler (KL) divergence).

Structure of a VAE

The reconstruction loss measures how well the decoder can reconstruct the input data from a point in the latent space. The latent loss ensures that the latent space follows a well-structured Gaussian distribution. Balancing these two losses is essential for a VAE’s effectiveness.
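For a Gaussian latent space, the KL term has a simple closed form, so the total training objective is usually written as the reconstruction loss plus the KL divergence. A hedged sketch of how the two terms are commonly combined, with the function name and the unweighted sum being illustrative choices:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_reconstructed, mu, log_var):
    # Reconstruction loss: how faithfully the decoder reproduces the input.
    recon = F.binary_cross_entropy(x_reconstructed, x, reduction="sum")
    # Latent (KL) loss: pushes each predicted Gaussian toward the standard
    # normal prior, keeping the latent space well-structured and easy to sample.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl
```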

This is how Variational Autoencoders are able to generate data, which gets better with training.

Conclusion

Autoencoders and VAEs share the goal of learning compact representations of data, with VAEs adding a layer of probabilistic modeling to make data generation more flexible and controlled.

By modeling distributions and constraining the latent space, VAEs enable the generation of novel and meaningful data samples, making them a valuable tool in various generative tasks, such as image generation, text synthesis, and more.

Hopefully this explanation was able to help with your understanding of these concepts as they help us appreciate the importance and capabilities of VAEs in the world of machine learning and artificial intelligence.
