Mathematical Prerequisites For Understanding Autoencoders and Variational Autoencoders (VAEs): Beginner Friendly, Intermediate Exciting, and Expert Refreshing.

Victor E. Irekponor
Published in Analytics Vidhya · 9 min read · May 28, 2020

A few days ago, my brother walked into my room and saw me writing code while working through a math tutorial. With a puzzled expression he asked, "Are you studying maths now or building AI models? Which is it?" With a smile, I replied, "AI is maths!!"

Photo by Franck V. on Unsplash

Well, not totally maths per se, but to be really grounded one needs a very good grasp of the mathematical underpinnings, because AI isn’t just about training ConvNets or Transformer models with PyTorch, TensorFlow, or Keras. As a matter of fact, reading, understanding, and replicating a research paper requires at least a fair understanding of some mathematical concepts, including but not limited to linear algebra, calculus, probability theory, and statistics.

AI is free for everyone, so whether or not you come from a CS, maths, or engineering background doesn’t matter much. In fact, it hardly counts at all, because the internet is now a plethora of information, especially with the advent of MOOCs; you can pretty much learn all these basics on your own and get good at them. I know that for a fact because in the last five years as an undergrad studying Urban Planning, the most complex maths I did was population mean, skewness, kurtosis, and moments, and that was in my first year! I am about to round up now, and I have never taken a course that required maths one bit. My last serious maths was in high school; everything else I know now was self-taught. I just picked it up along the way.

I know a lot of people are in these shoes: you want to dive deep into AI, but you feel you don’t have the right background, so you stay put or go into web development instead. If you fall into this category, nothing should limit you!!!

In this post, we are going to cover some of the basic mathematics required to understand Autoencoders, Variational Autoencoders (VAEs), and Vector Quantised Variational Autoencoders (VQ-VAEs). Specifically, we will be looking at:

  1. A review of Autoencoders
  2. Basics of probability
  3. Expectation maximization
  4. Kullback-Leibler divergence and its significance

An Autoencoder is essentially a neural network designed to learn an identity function in an unsupervised way, such that it can compress and then reconstruct an original input; by doing so, it discovers a more efficient, compressed representation of the original input data. It is worth noting that the idea originated in the 1980s and was later promoted in a seminal paper by Hinton and Salakhutdinov (2006).

Autoencoders are widely used for image compression and reconstruction. Image reconstruction basically means that the Autoencoder network tries to generate whatever image we pass into it at the input stage. We will soon come to understand how this works as we progress.

Illustration of the Autoencoder architecture. Source: Lilianweng’s blog

The image above portrays a typical autoencoder network architecture. To understand intuitively how it works, we should first note that, leaving out the input and the output (the reconstructed input), the autoencoder network consists of:

  1. The Encoder block, denoted by gφ
  2. The Bottleneck, denoted by z
  3. The Decoder block, denoted by fθ

The Encoder block takes in an input vector of images and passes it to the bottleneck as a compressed vector z; the decoder block then tries to reconstruct the input image from that compressed representation. For a better understanding, say we have an input image of size (28 x 28). By convention, this image has to be flattened before feeding it into a neural network. A flattened representation of this image would have shape (784,), and this is passed into the encoder (the first block). The output of the encoder is then fed to the bottleneck or latent space, which should be a reduced version; for instance, if the number of nodes in the latent space is 8, or 16, or any other number, it simply means we have succeeded in compressing an image of size 784 down to just 8 or 16 nodes. The decoder network then tries to recreate the original (28 x 28) input image from the compressed state in the bottleneck. That’s how it works.
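To make this concrete, here is a minimal sketch of such a network in PyTorch. The layer sizes (a 784-dimensional flattened input, a 128-unit hidden layer, and a 16-node bottleneck) are illustrative choices of mine, not anything canonical:

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        # Encoder g_phi: compresses the flattened image down to the bottleneck z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Decoder f_theta: reconstructs the image from the bottleneck z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, input_dim),
            nn.Sigmoid(),  # pixel values assumed scaled to [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)      # (batch, 784) -> (batch, 16)
        x_hat = self.decoder(z)  # (batch, 16)  -> (batch, 784)
        return x_hat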

As soon as the image is reconstructed, you compare the reconstructed image with the original image, compute the difference, and calculate the loss, which can then be minimized.

The loss is calculated by:

Autoencoder Loss Function
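Since Medium renders equations as images, here is the loss written out in LaTeX notation, using the same gφ and fθ notation as above; it is simply the mean squared error between each input and its reconstruction:

L_{\mathrm{AE}}(\theta, \phi) = \frac{1}{n} \sum_{i=1}^{n} \left( x^{(i)} - f_\theta\!\left( g_\phi\!\left( x^{(i)} \right) \right) \right)^{2}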

Don’t get alarmed at the loss function, I’ll break it down.

As seen above, the loss function depends on ‘theta’ (θ) and ‘phi’ (φ), which are the parameters that define the decoder and the encoder respectively. As explained earlier, and as shown in the autoencoder image above, the encoder is represented by gφ while the decoder is represented by fθ, and these parameters are simply the weights and biases of the neural network.

So in the equation, we are summing up the squared difference between the original image x and the reconstructed image fθ(gφ(x)).

This mathematical representation also shows the flow of data from the encoder to the decoder (i.e. from input to output).
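A hypothetical training step for this flow, again in PyTorch and reusing the Autoencoder class sketched earlier, could look like this (the batch of random numbers simply stands in for real flattened images):

import torch

model = Autoencoder()           # the class sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()  # squared difference between x and its reconstruction

x = torch.rand(64, 784)         # a dummy batch of 64 flattened 28 x 28 images
x_hat = model(x)                # encoder -> bottleneck -> decoder
loss = criterion(x_hat, x)      # compare the reconstruction with the original input

optimizer.zero_grad()
loss.backward()                 # gradients with respect to theta and phi
optimizer.step()                # minimize the reconstruction loss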

Variational Autoencoder

Illustration of the VAE model. Source: Lilianweng’s blog

The basic idea behind the VAE, proposed by Kingma et al. in 2013, is that instead of mapping an input to a fixed vector, the input is mapped to a distribution. The autoencoder and the variational autoencoder are similar in many ways; as a matter of fact, the only fundamental difference between them is that the bottleneck of the VAE is continuous and is replaced by two separate vectors: one representing the means of the distribution, and the other representing the standard deviations of the distribution.
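In code, that change to the bottleneck might look roughly like the sketch below: the encoder features are mapped to a vector of means and a vector of log-variances, and z is then sampled from that distribution using the reparameterization trick from Kingma et al. The layer sizes are again arbitrary choices of mine:

import torch
import torch.nn as nn

class VAEBottleneck(nn.Module):
    """Maps encoder features to a distribution rather than a fixed vector."""
    def __init__(self, hidden_dim=128, latent_dim=16):
        super().__init__()
        self.to_mu = nn.Linear(hidden_dim, latent_dim)       # means of the distribution
        self.to_log_var = nn.Linear(hidden_dim, latent_dim)  # log-variances, from which the std devs are derived

    def forward(self, h):
        mu = self.to_mu(h)
        log_var = self.to_log_var(h)
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)  # reparameterization trick: z = mu + sigma * epsilon
        z = mu + eps * std
        return z, mu, log_var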

The loss function of the VAE is defined by two terms: the reconstruction loss and the regularizer, which is essentially a KL divergence between the encoder’s distribution and the prior over the latent space.

VAE Loss Function, basically the reconstruction loss + KL divergence
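Written out in LaTeX notation, the standard form of this loss for a single input x is:

L_{\mathrm{VAE}}(\theta, \phi) = -\,\mathbb{E}_{z \sim q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) \right] + D_{\mathrm{KL}}\!\left( q_\phi(z \mid x) \,\|\, p_\theta(z) \right)

where the first term is the reconstruction loss and the second term is the KL regularizer.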

Moving forward, understanding the deeper mathematical underpinnings of the autoencoder and the VAE network, such as their loss functions, requires a fair understanding of concepts like expectation maximization, conditional probability, maximum likelihood estimation, and the Kullback-Leibler divergence, which is the crux of this article.

The first prerequisite for understanding the VAE is probability theory. I will not go into too much detail, but I will explain as much as is needed to understand these concepts.

You will be coming across the following terms quite often:

P(x): This denotes the probability of a random variable X taking the value x.

P(x|y): Also known as a conditional probability, this gives the probability of a random variable x given that y has occurred. It is read as "P of x given y."

This probability concept can also be written as:
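In symbols (written out here in LaTeX notation, since Medium renders equations as images):

P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}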

The representation above comes from Bayes’ theorem, where:

P(y|x) is the posterior probability

P(y) is the prior probability

P(x|y) / P(x) is the likelihood ratio

We need to also understand the theorem of total probability, which goes thus:
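For mutually exclusive events y1, …, yn that together cover the whole sample space, it can be written in LaTeX notation as:

P(x) = \sum_{i=1}^{n} P(x \mid y_i)\, P(y_i)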

Mutually exclusive events (also known as disjoint events) have no overlap between them, which is why their intersection is equal to zero. We’ll be seeing more of these as we get to the more complicated stuff, so it’s best to get it out of the way now.

Expectation of a Random Variable X, i.e. E(X)

The expected value of a random variable X is a weighted average of all the possible values that X can take, where each value is weighted according to the probability of that outcome. It is defined as:
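For a discrete random variable, in LaTeX notation:

E[X] = \sum_{x} x \, P(X = x)

(For a continuous random variable, the sum becomes an integral over the density.)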

This might be getting a little too involved, but you should just know that this concept is similar to the mathematical average. For better intuition, you should check this out

Let us look at some simple examples:

Q1. When a die is tossed once, what is the probability of getting a three?

Answer: You guessed right! Given the sample space = {1,2,3,4,5,6},

P(x) = P(3) = 1/6.

Q2. In tossing a fair die, what is the probability that 3 has occurred conditioned on the toss being odd?

Answer: This is a conditional probability, i.e. P(x|y), and what we are after is the probability of getting x = 3 given that y = odd; that is, P(3 | y is odd).

Since the condition is that the toss is "odd", the sample space has to shrink from 6 outcomes to 3, because there are just 3 odd numbers between 1 and 6. Therefore, {1,2,3,4,5,6} ==> {1,3,5}. Hence, the probability of having a 3 in this reduced sample space is 1/3, where 3 is the total number of outcomes in the reduced sample space. This is about the simplest example I could give to explain conditional probability.

If we observe closely, in the first example the probability of having a 3, i.e. P(3), was 1/6, but in the second it became 1/3. What does that show?

Again, you guessed right! It shows that the probability of getting a 3 increased when it was conditioned on Y, the event that the toss is odd. Therefore it is important to know that P(x) and P(x|y) can take very different values. That is the concept of conditional probability: the probability of x can be affected a lot by y, as seen in the examples.
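If you would rather verify these two numbers programmatically, a tiny enumeration over the die’s sample space does the trick (this snippet is just my own sanity check, not part of any library):

from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]

# P(3): favourable outcomes over all outcomes
p_three = Fraction(sum(1 for s in sample_space if s == 3), len(sample_space))

# P(3 | odd): restrict the sample space to the odd outcomes first
odd_space = [s for s in sample_space if s % 2 == 1]
p_three_given_odd = Fraction(sum(1 for s in odd_space if s == 3), len(odd_space))

print(p_three)            # 1/6
print(p_three_given_odd)  # 1/3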

KL Divergence

Kullback-Leibler divergence (D_KL for short) is a measure of how one probability distribution differs from another. For discrete probability distributions P and Q, the KL divergence between P and Q is defined as:
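In LaTeX notation, using the natural logarithm (which gives a result in nats, a unit we will meet shortly):

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \, \ln\!\frac{P(x)}{Q(x)}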

For example:

Suppose we have two probability distributions, P and Q, and we want to find how different they are from each other. We can simply apply the KL divergence as shown below.

Medium doesn’t actually support mathematical symbols, which is why the equations have to be solved elsewhere and pasted here as images. To explain the calculation above: we have a distribution Q with a uniform probability of 1/3 (about 0.333), and a distribution P with a probability of 0.36 when the random variable x is equal to 0, 0.48 when x is equal to 1, and 0.16 when x is equal to 2.

The difference between these two distributions can be calculated using the KL divergence, so what we do is substitute the values into the KL divergence equation as above and solve for the answer, which is 0.09673 nats.

“nats” is simply the unit of information obtained by using the natural logarithm (ln(x)).
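As a quick sanity check, here is a minimal snippet that plugs the two distributions from the example into the definition. Note that the KL divergence is not symmetric, so D_KL(P||Q) and D_KL(Q||P) give different values; the second one lands close to the 0.09673 nats quoted above, with small differences coming down to rounding in a hand calculation:

import math

P = [0.36, 0.48, 0.16]  # distribution P from the example above
Q = [1/3, 1/3, 1/3]     # the uniform distribution Q from the example above

def kl_divergence(p, q):
    """D_KL(p || q) in nats, using the natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(kl_divergence(P, Q))  # ~0.0853 nats
print(kl_divergence(Q, P))  # ~0.0975 nats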

Understanding the KL divergence is integral to grasping the VAE loss function because it plays a key role there, acting as the regularizer term. That is why I went the extra mile to illustrate it graphically and mathematically.

Now that we have gone over some of the mathematical prerequisites that will come in handy moving forward, you should take a look at this expository article on deriving the VAE loss function entirely from scratch by Dr. Stephen Odaibo.

Maths is Fun!
