Put Simply: (Variational) Autoencoders

Zheng Jie
5 min read · Apr 13, 2023


Welcome to a new series I’m planning to start: “Put Simply”. Ever had those articles that are actually well written but you’ve never really understood?

For the average Medium reader, the jumble of technical and mathematical jargon may not faze you that much, but the typical peons like me (and probably you given that you’re reading this) shudder at the sight of confusing mathematical symbols and complex words. Obviously some basic contextual knowledge is required, but for now the basics will suffice.

For this article, I will be covering the topic of Autoencoders and Variational Autoencoders, which I’ll abbreviate as AEs and VAEs. Besides making yourself sound cool when you say it, they are useful for dimensionality reduction (a job similar to what Principal Component Analysis (PCA) does), as well as in generative network architectures.

With that said, you would need to know basic neural networks architecture and components, some machine learning concepts and basic statistics.

Purpose

Put simply (geddit?), an autoencoder does what it says on the tin and then a little bit more. Encoding simply means using distinct symbols or notation to represent a bigger idea. Likewise, the typical AE does two things:

  1. Encode data
  2. Decode data

Let us say we have data consisting of 50 variables to predict housing prices. Clearly this is excessive, and some variables are definitely irrelevant to us.

The AE thus learns to encode the data such that we can reduce 50 variables to a smaller number (let us say 8) of relevant efficient variables without losing representation of the trend or idea. Think of it as a summary generator which can extract topics and brief insights from a long paragraph of text.

The autoencoder can also decode data. Using the summary example, it uses the insights and topics to formulate a paragraph that either resembles or matches the original paragraph we had earlier. Intuitively, the AE uses the 8 variables we established earlier to try to reconstruct the 50 variables.

This is powerful. We can now draw insights from noisy, convoluted data by condensing them into concise variables, and vice versa. AEs and VAEs can also, after training, generate new, synthetic and plausible data like images of faces for us based on, but not from, the training set.

AE? More like EA Sports

Treating AEs as a black box would suffice for the average layman, but I suspect (again because you’re reading this) you’d like to understand how it works. I will attempt to explain it simply, but details will be sacrificed if necessary to keep the explanation concise and simple.

Basic Diagram

An AE at its core is actually two separate neural networks. The encoder learns the most efficient way to condense the input into a specified number of variables, outputting h. The decoder learns to use the input h to reconstruct the original input as closely as possible. If we assume h is N-dimensional and the input X is M-dimensional, an example of an AE could be:

AE layer size: M → 40 → N → 40 → M

Note the decrease from M to N nodes per layer in the encoder, and the corresponding increase back from N to M in the decoder.

It is also possible to introduce non-linearity into the AE if, for example, the dataset is not linear. This means applying an activation function (usually ReLU) to the nodes in the hidden layers.
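To make the layer sizes concrete, here is a minimal NumPy sketch of a single forward pass through the M → 40 → N → 40 → M architecture above, with ReLU on the hidden layers. The weights are random and untrained; M = 50 and N = 8 reuse the housing example, and all names are illustrative, not a real library API.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

M, N = 50, 8  # input size and bottleneck size (numbers borrowed from the text)
sizes = [M, 40, N, 40, M]  # encoder: M -> 40 -> N, decoder: N -> 40 -> M

# One random (untrained) weight matrix per layer transition.
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]

def autoencoder(x):
    h = x
    code = None
    for i, W in enumerate(weights):
        h = h @ W
        if i < len(weights) - 1:  # ReLU on hidden layers, not the output
            h = relu(h)
        if i == 1:  # after the second transition we are at the bottleneck h
            code = h
    return code, h

x = rng.normal(size=(1, M))
code, x_hat = autoencoder(x)
print(code.shape, x_hat.shape)  # the 8-variable code and the 50-variable reconstruction
```

Training would then adjust the weights so that x_hat matches x as closely as possible; here the reconstruction is of course meaningless, since the weights are random.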

Is this Loss?

As any machine learning practitioner would ask, “What’s the loss to backpropagate?”. The loss function for AEs is not very complicated. If we notate D(x) as the decoder and E(x) as the encoder, and assume continuous variables:

Loss = (1/n) * Σᵢ (xᵢ − D(E(xᵢ)))², where n is the size of the training data and the sum runs over all training examples xᵢ

This is essentially the Mean Squared Error (MSE) averaged over all training data. The loss compares how far the reconstructed data (from the decoder) differs from the original.

This loss is usually referred to as the reconstruction error and can take the form of binary or (sparse) categorical cross-entropy for discrete variables if needed. As long as the metric measures the difference between reconstruction and original appropriately, the reconstruction error will be valid.
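As a quick sketch of the reconstruction error for continuous variables, with toy numbers assumed purely for illustration:

```python
import numpy as np

x = np.array([[1.0, 2.0], [3.0, 4.0]])      # original inputs (n = 2 examples)
x_hat = np.array([[1.1, 1.9], [2.5, 4.5]])  # decoder reconstructions D(E(x))

# (1/n) * sum over examples of the squared reconstruction error
loss = np.mean(np.sum((x - x_hat) ** 2, axis=1))
print(loss)  # ≈ 0.26
```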

VAEs

Variational Autoencoders are slightly different. Instead of simply condensing data to variables, VAEs condense data into a lower dimensional latent space.

VAE Architecture

This latent space is like a 2D diagram (but with more dimensions) that groups data according to their similarities, so closer points in this latent space represent similar data points. The variables that make up the latent space are known as latent variables (this will be important later).

The encoder learns the distribution of the latent space, which is usually assumed to be a normal distribution, and outputs the parameters of that distribution, typically the mean and standard deviation. A sample h’ is then drawn from this distribution, and the decoder decodes h’ to form X’.

Reparame-what?

Remember how a sample is drawn from the latent distribution? Turns out this sampling step is not differentiable, meaning backpropagation is not possible (recall how neural networks learn). We thus have to reparameterize to make it so.

This means that we sample from a standard normal distribution instead, then multiply that sample by the standard deviation and add the mean. This occurs just before the decoder transforms the latent sample into the output.

A brief explanation (not necessary to understand, IMO) as to why this works: sampling according to a distribution is stochastic, i.e. it involves randomness. If we instead use an external standard normal distribution, sample from there and transform that sample using the mean and standard deviation, the gradients can now be calculated. This works because:

  1. Normal distributions (and samples) can be transformed (due to their properties) using multiplication of the standard deviation and addition of the mean. This allows us to “simulate” the distribution learned by the encoder
  2. Using an external distribution to sample from eliminates the stochastic variable in the computational graph (plenty of examples on sites like Wikipedia and Baeldung)
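The two points above can be sketched in a few lines. This assumes the encoder outputs a mean and a log-variance per latent dimension (a common convention, not something the text mandates):

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, log_var):
    # Draw eps from a fixed standard normal (external to the network),
    # then shift and scale it: z = mu + sigma * eps.
    # The randomness lives entirely in eps, so gradients can flow
    # through mu and sigma during backpropagation.
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(size=np.shape(mu))
    return mu + sigma * eps

mu = np.full(2, 2.0)       # learned mean of the latent distribution
log_var = np.zeros(2)      # log-variance 0, i.e. sigma = 1
z = reparameterize(mu, log_var)
print(z.shape)
```

Drawing many samples this way reproduces the learned distribution: their average converges to mu, exactly as if we had sampled from N(mu, sigma²) directly.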

I’m Losing my mind

The loss here is also different. Besides the reconstruction loss we discussed earlier, there is another (weighted) penalization term known as the Kullback–Leibler divergence loss (or KL loss).

In layman’s terms, it penalizes the model when the learned distribution (what the encoder learns) strays from the prior distribution (what we assume the latent distribution should be; usually a standard normal distribution).

Those familiar with simple linear regression can think of this KL loss as a form of regularisation.
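For a learned diagonal Gaussian measured against a standard normal prior, the KL loss has a well-known closed form: −½ Σ (1 + log σ² − μ² − σ²), summed over latent dimensions. A sketch (the beta weight on the penalty is an assumption; the article only says the term is weighted):

```python
import numpy as np

def kl_divergence(mu, log_var):
    # Closed-form KL between the learned N(mu, sigma^2) and the
    # standard normal prior N(0, 1), summed over latent dimensions.
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

# When the learned distribution matches the prior exactly, the penalty is 0.
print(kl_divergence(np.zeros(4), np.zeros(4)))  # 0.0

beta = 1.0  # weight on the KL penalty (beta-VAEs tune this knob)
# total_loss = reconstruction_error + beta * kl_divergence(mu, log_var)
```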

Put simply, all of that is my attempt at explaining autoencoders. It is difficult to condense such a neural-network-heavy and mathematically heavy model, but I hope that you’ve learnt something new about AEs and VAEs today.

Without a doubt there is more to AEs and VAEs that I have not mentioned, so feel free to Google it or browse the excellently-written articles on Medium.

Until the next Put Simply comes out, Zao-skis.
