The Amazing VAE (Variational Autoencoder)

(Part 1) An overview of the problem and the solution

David Daeschler
5 min read · Jul 7, 2024

Over the past 18 months, I have been obsessed with various machine learning topics, but none so far has captured my attention more than the Variational Autoencoder. In this series, I describe the theory behind the VAE, the challenges that led to its design, and some potential shortcomings of the implementation. I do this from the perspective of an engineer who has dug deep into the mathematics and Bayesian statistics employed by the architecture. I assume you have knowledge of basic algebra, calculus, probability, and statistics. You should also have a good knowledge of Python programming, neural networks, and PyTorch. I’ll do my best to convey the most important takeaways, and I hope you find it useful.

Reality is complicated

In machine learning, we often desire to learn the underlying distribution of a dataset. This can allow us to generate new data that looks like our dataset, detect anomalies and unexpected patterns in new data, and more. Unfortunately, given a dataset with sufficient complexity (like most data from the real world), this quickly becomes intractable. Trying to fit our classic, simple parameterized models won’t work. It can be beneficial in these situations to employ latent variables and form a joint probability distribution.

The rules of marriage in the world of probability

Combining two probability distributions is a powerful concept, but it can be difficult to visualize. To understand what is happening, check out the image below:

By IkamusumeFan - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=30432580

We are pairing two Gaussian distributions (red and blue) side by side to produce a joint distribution. Where both distributions have high density, the joint distribution also has high density, as shown by the scatter of points at the center of the plane on the bottom.

The most interesting part to me is that these distributions (described by their probability density functions, or PDFs) don’t have to be correlated with one another to be laid out side by side like this. We just need something common that joins the two distributions, some way to align them, as in the figure above. As an example, pretend the two distributions above are daily stock prices and the total global rainfall in a day. These two are not at all correlated (unless the entire world is flooded, but I digress), but we can capture the two values on the same day. Thus, we can form a joint distribution with them.
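As a rough sketch of that pairing step, we can line up the two records by day and estimate the joint density with a 2D histogram. The array names and distributions below are hypothetical stand-ins, not real data:

# Sketch: build an empirical joint distribution from paired daily records.
# daily_prices and daily_rainfall are hypothetical arrays, one entry per day,
# aligned so that index i refers to the same day in both.
import numpy as np

daily_prices = np.random.lognormal(mean=5.0, sigma=0.2, size=10_000)   # stand-in data
daily_rainfall = np.random.gamma(shape=2.0, scale=3.0, size=10_000)    # stand-in data

# density=True normalizes the counts so the histogram approximates the joint PDF
joint_density, price_edges, rain_edges = np.histogram2d(
    daily_prices, daily_rainfall, bins=50, density=True
)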

Now, let’s say we had a daily record of the stock market on its own and a daily record of stock prices and global rainfall together, but we didn’t have a record of the global rainfall on its own. We could construct the probability distribution (PDF) for global rainfall alone by integrating out the stock prices from the joint distribution.

If the integration part of it is confusing, think of it like this: For each given amount of rainfall, we sum the joint probability densities over all possible stock prices, and then scale this sum by the width of the interval between stock price values.

# Marginal PDF for rainfall by integrating out stock prices,
# where S_values is an evenly spaced grid covering the full
# range of stock prices and joint_pdf(S, R) is the joint density.
# This demonstrates the idea; there are more accurate ways of
# doing this (e.g. proper numerical integration).
def marginal_pdf_rainfall(R):
    sum_density = 0.0
    for S in S_values:
        sum_density += joint_pdf(S, R)
    # Scale the sum by the grid spacing (Riemann-sum approximation)
    return sum_density * (S_values[1] - S_values[0])

This process effectively removes the influence of the stock prices. We’re removing the dots in the graph above that are contributed by the stock prices, line by line. This leaves only the density of dots from the rainfall. By leveraging the joint distribution data this way, we can approximate or compute the marginal distribution of global rainfall even without direct observations of rainfall alone.

While this example explains the process, it’s also working in a very low dimensional space. Imagine the case of an image where we have to iterate over 100, 1000, 10000 dimensions, or even more. It quickly becomes infeasible. Intractable.
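To make the blow-up concrete, here is a back-of-the-envelope count of how many joint-density evaluations a grid-based integration would need. The 100 points per dimension is an arbitrary choice for illustration:

# Grid-based marginalization needs points_per_dim ** num_dims evaluations.
points_per_dim = 100
for num_dims in (1, 2, 3, 10, 100):
    evaluations = float(points_per_dim) ** num_dims
    print(f"{num_dims:>3} dimensions -> {evaluations:.0e} evaluations")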

I came here for VAEs, not stocks and rain

That’s fair. So, what does this have to do with a VAE? As I mentioned earlier, most of the data we’ll want to work with is complex and isn’t likely to have a simple, known PDF. By forming a joint distribution between our data p(x) and a much simpler distribution of latent variables p(z) that we can sample from, we can marginalize (integrate) out z to get p(x), just as explained above. This lets us approximate the wild distributions that come from the real world, including things like image data.
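In code form, this marginalization is the same trick as the rainfall example, just written as an expectation over z. The sketch below assumes a hypothetical decoder_likelihood(x, z) that returns p(x|z) and a standard Gaussian prior for p(z):

# Monte Carlo estimate of p(x) = integral of p(x|z) p(z) dz,
# approximated by averaging p(x|z_i) over samples z_i drawn from p(z).
import numpy as np

def estimate_p_x(x, decoder_likelihood, latent_dim, num_samples=1_000):
    # Draw latent samples from the simple prior p(z) = N(0, I)
    z_samples = np.random.randn(num_samples, latent_dim)
    # Average the decoder's likelihood of x over those samples
    return np.mean([decoder_likelihood(x, z) for z in z_samples])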

The architecture of a VAE is designed to:

  • Enable the construction of the joint distribution, p(x, z)
  • Enable an efficient search for values of z that produce good values of x, and the construction of a distribution over those values, q(z|x)
  • Provide a way to optimize the parameters of the neural networks that balances reconstructing the data x against keeping q(z|x) close to a Gaussian, ensuring we can sample an x given a value of z
  • Enable sampling from a distribution approximating p(x) by choosing a value of z and passing it through pθ(x|z) (the distribution of our data given a value of z), which amounts to integrating z out of the joint distribution
  • Work around the intractability of all of the above

The VAE accomplishes all of this with an encoder/decoder design.
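As a preview of how the pieces in that list map onto code, here is a minimal PyTorch sketch of the encoder/decoder pairing with the reparameterization trick and an ELBO-style loss. The layer sizes and the 784-dimensional input are arbitrary choices for illustration, not a reference implementation; later parts of this series go into the details.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, data_dim=784, latent_dim=16, hidden=256):
        super().__init__()
        # Encoder: maps x to the parameters of q(z|x), a diagonal Gaussian
        self.enc = nn.Linear(data_dim, hidden)
        self.enc_mu = nn.Linear(hidden, latent_dim)
        self.enc_logvar = nn.Linear(hidden, latent_dim)
        # Decoder: maps a sampled z back to the parameters of p(x|z)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, data_dim),
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization: z = mu + sigma * eps, so gradients flow through mu and sigma
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps
        return self.dec(z), mu, logvar

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term: how well the decoder reproduces x
    recon = F.binary_cross_entropy_with_logits(x_recon, x, reduction="sum")
    # KL term: keeps q(z|x) close to the standard Gaussian prior p(z)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

The KL term in vae_loss is what keeps q(z|x) roughly Gaussian, which is exactly the property the third bullet above asks for.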

A crude drawing of how all of this is put together

In the parts that follow in this series, I will cover:

  • The problem with sampling high-dimensional spaces and the role the encoder plays to help with this. While the latent space is modeled to be simpler than the data space, it is still high dimensional.
  • Reproducing images in the decoder, how to measure our goodness of fit, and how the training loop and backpropagation bring it all together.
  • Issues inherent in the design, some improvements to the training process, and a conclusion.

Stick around for Part 2 next week!
