VAE Careful Walkthrough

Igor Kuznetsov
May 8, 2019 · 8 min read


Variational Auto-Encoder (VAE) is one of the most famous and foundational models in generative modeling. In this post I’ll go through the math behind it, with a small example of the model implemented in PyTorch.

Generative Modeling

Generative modeling is the branch of machine learning concerned with models capable of generating new data that possesses some desired features. It started to gain popularity in 2015, when Ian Goodfellow’s famous Generative Adversarial Network (GAN) started to give outstanding results. Generative modeling does not imply using some concrete type of neural network (RNN, CNN, etc.) but rather employs these basic building blocks to formulate an approach to generating data (GAN, VAE, etc.).

Let’s briefly outline the difference between discriminative and generative models. A typical discriminative model requires some input 𝑋 to return the probability of some signal 𝑌. The process that generated 𝑋 is not taken into account; the only thing we are interested in is detecting patterns in the input to predict the value of the signal. In the most common scenario this gives outcome probabilities conditioned on 𝑋: 𝑃(𝑌|𝑋)

In the case of generative modeling we have datapoints that emerge from some generating process. This process can be, for example, views of an object under different angles, speech produced by different people, etc. The generating process can be described in terms of distributions. For example, an image of a cat looks the way it does because each pixel is distributed in its own specific way. We want our model to “understand” this distribution well enough to generate new images that contain cats. Each datapoint 𝑋𝑛 lies in some M-dimensional space. We may also want labels 𝑌𝑛 and use this knowledge during training to tell the model what exactly it generated and how well. The generating process can then be seen as 𝑃(𝑋,𝑌). The two main steps concerning generative models are

  1. We train our model w.r.t. parameters θ so that samples from Pmodel (our generative model) match the distribution of the training data Pdata (𝑃(𝑋))
  2. Later on we sample points from Pmodel that look as if they were drawn from Pdata

The learned generative model should make up new samples from the given distribution, not just copy and paste existing ones. To achieve this we want a stochastic component in the picture. The most general way to introduce it is to add some noise to the training process (we’ll see where below).

A few scenarios that can employ generative models are

  • Modeling complex and high-dimensional distributions
  • Performing data augmentation by generating realistic synthetic samples
  • Filling in blanks in the data

The reason for the popularity of generative models is that we now have enough data and computational power to build generative models complex enough to give very promising results.

Models that learn a probability density function can be roughly divided into two categories. The first category learns the pdf explicitly, so we impose a known loss function. The second learns the pdf implicitly, and the loss function can be unknown. VAE is an example of the first approach, and GAN is the best-known example of the second.

Auto-Encoder Neural Network

Before the VAE walkthrough, let’s start with a simpler model, the plain autoencoder. An autoencoder predicts its own input at the output. It consists of two parts: the encoder accepts input data and encodes it into a latent representation, whose dimension is typically much smaller than that of the original data. The decoder maps the produced latent representation back to a space with the same dimension as the input, in such a way that the output is as close as possible to the input sample. Autoencoders do not need labels and can be used for improving image quality (removing noise), dimensionality reduction, etc. We can say that this model regenerates data, but it is not a generative model. The reason is that the model has no stochastic component to generate novel samples; it is completely deterministic and just reproduces the input signal.
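As a minimal sketch (the sizes here, a flattened 784-dimensional input such as MNIST, a 256-unit hidden layer and a 32-dimensional latent code, are arbitrary example choices), such an autoencoder could look like this:

```python
import torch.nn as nn


class AutoEncoder(nn.Module):
    """Deterministic autoencoder: compress the input to a latent code, then reconstruct it."""

    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=32):
        super().__init__()
        # Encoder: input -> latent code of much smaller dimension
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )
        # Decoder: latent code -> reconstruction with the input's dimension
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid(),  # assumes inputs scaled to [0, 1]
        )

    def forward(self, x):
        z = self.encoder(x)     # latent representation, fully deterministic
        return self.decoder(z)  # reconstruction of the input
```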

Variational Auto-Encoder

The key idea behind the VAE concerns the component the autoencoder is missing. To introduce the missing stochasticity we add a restriction on 𝑧, such that our samples are distributed in the latent space following a specified probability density function over 𝑍.

In other words, we enforce 𝑧 to have some specific shape, use the encoder to map the original sample into high-level features under some distribution, and use the decoder to reconstruct the image, taking the knowledge of 𝑧 into account. Instead of sampling in the original large pixel dimensions we will sample in the small 𝑧 space (i.e. sample from some distribution, usually normal) and decode our 𝑧 into a realistic sample. Then we can sample 𝑧 to generate new 𝑋 data points.

In such a model we have a deterministic neural network as the decoder 𝑓 (deterministic once the training process is done), and a non-deterministic 𝑧 to sample new images. The goal of the neural network is to learn parameters θ that maximise the probability of each 𝑋 under the generative process: 𝑃(𝑋) = ∫𝑃(𝑋|𝑧; θ)𝑃(𝑧)𝑑𝑧

From the previous equation we have a maximum likelihood problem, where we need to know 𝑃(𝑋|𝑧) and 𝑃(𝑧). We don’t want to calculate the integral, so let’s introduce 𝑃(𝑧|𝑋) to sample only those values of 𝑧 that are likely to produce 𝑋, rather than the whole space of 𝑃(𝑧) possibilities, which is too hard from a computational perspective. But 𝑃(𝑧|𝑋) is unknown too. The role of variational inference is to approximate 𝑃(𝑧|𝑋) with 𝑄(𝑧|𝑋). So the key idea is to find an approximating function that is good enough to represent the real one (that defines our optimization problem).

The approximating function will be our neural encoder, which maps training datapoints 𝑋 to the likely 𝑧 points following 𝑄(𝑧|𝑋), which in turn models 𝑃(𝑧|𝑋).

To get 𝑄(𝑧|𝑋) we compute its KL divergence from the true distribution 𝑃(𝑧|𝑋).
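Written as an expectation over 𝑧∼𝑄, this is

$$D\big[Q(z|X)\,\|\,P(z|X)\big] = \mathbb{E}_{z\sim Q}\big[\log Q(z|X) - \log P(z|X)\big]$$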

Applying Bayes’ rule to 𝑃(𝑧|𝑋):
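Since 𝑃(𝑧|𝑋) = 𝑃(𝑋|𝑧)𝑃(𝑧)/𝑃(𝑋), the divergence becomes

$$D\big[Q(z|X)\,\|\,P(z|X)\big] = \mathbb{E}_{z\sim Q}\big[\log Q(z|X) - \log P(X|z) - \log P(z) + \log P(X)\big]$$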

The last term under the expectation, log 𝑃(𝑋), does not depend on 𝑧, so it comes out of the expectation and moves to the left side of the equation.
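This leaves

$$D\big[Q(z|X)\,\|\,P(z|X)\big] - \log P(X) = \mathbb{E}_{z\sim Q}\big[\log Q(z|X) - \log P(X|z) - \log P(z)\big]$$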

Let’s change the sign and rearrange the terms on the right side a bit:
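$$\log P(X) - D\big[Q(z|X)\,\|\,P(z|X)\big] = \mathbb{E}_{z\sim Q}\Big[\log P(X|z) - \big(\log Q(z|X) - \log P(z)\big)\Big]$$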

Now split the expectation over the two groups of terms and notice a new KL divergence emerging:
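$$\log P(X) - D\big[Q(z|X)\,\|\,P(z|X)\big] = \mathbb{E}_{z\sim Q}\big[\log P(X|z)\big] - \mathbb{E}_{z\sim Q}\big[\log Q(z|X) - \log P(z)\big]$$

The second expectation is, by definition, exactly 𝐷[𝑄(𝑧|𝑋)||𝑃(𝑧)].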

As a result, the VAE objective function is
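$$\log P(X) - D\big[Q(z|X)\,\|\,P(z|X)\big] = \mathbb{E}_{z\sim Q}\big[\log P(X|z)\big] - D\big[Q(z|X)\,\|\,P(z)\big]$$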

which means that to model the log-likelihood of our data (log 𝑃(𝑋)) we have to take into account a non-computable and non-negative approximation error (𝐷[𝑄(𝑧|𝑋)||𝑃(𝑧|𝑋)]). The right side of the equation is the reconstruction term for our data given the latent space (the neural decoder’s reconstruction loss), minus a regularization of our latent representation (the neural encoder is pushed towards the prior).

The next thing to do is to define the shape of 𝑄(𝑧|𝑋) so that we can compute its divergence against the prior (we want to map 𝑋 samples over the surface of 𝑧 in a proper way). The proposed way to do so is to use a normal distribution with predicted moments 𝜇(𝑋) and 𝛴(𝑋). In such a setup 𝑄(𝑧|𝑋) turns into 𝛮(𝜇(𝑋), 𝛴(𝑋)) and 𝑃(𝑧) into 𝛮(0, 1), which allows us to compute the KL divergence in closed form.
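For a diagonal 𝛴(𝑋), summing over the 𝑘 latent dimensions, this is

$$D\big[\mathcal{N}(\mu(X), \Sigma(X))\,\|\,\mathcal{N}(0, I)\big] = \frac{1}{2}\sum_{k}\Big(\Sigma_k(X) + \mu_k^2(X) - 1 - \log \Sigma_k(X)\Big)$$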

With that in mind, our VAE setup looks as follows.

The model makes the encoder predict means and standard deviations that are close to the prior distribution (the features extracted from our sample should follow that distribution); we then sample 𝑧, give it to the decoder and reconstruct the sample (with some error). Later we will be able to remove the encoder and sample just from the prior distribution, assuming the decoder can regenerate plausible samples. Summarizing, the variational methodology tells us that we can decode 𝑧 into new samples, and also recover the original samples from 𝑧 with some error. The quantity we actually optimize is a lower bound on the data log-likelihood, known as the ELBO (Evidence Lower BOund).

In the setup above, however, one thing does not yet allow us to make it real. As a neural network is trained with the backpropagation algorithm, it requires components that gradients can flow through. At the same time, it is not clear how to propagate gradients through the step where 𝑧 is sampled from 𝛮(𝜇(𝑋), 𝛴(𝑋)), since sampling is not a differentiable operation. The workaround is to make the sampling process differentiable with the following trick
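We sample an auxiliary noise variable ε, which carries all the randomness, and shift and scale it with the predicted moments so that gradients can flow through 𝜇(𝑋) and 𝛴(𝑋):

$$z = \mu(X) + \Sigma^{1/2}(X) \odot \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I)$$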

which is called the reparameterization trick. It allows us to train the whole system end to end within the variational methodology.

After the model is trained we can generate new samples by drawing a random 𝑧 from the prior 𝛮(0, 1) and decoding it (discarding the encoder component).

PyTorch VAE example

Implementing a VAE model in a modern framework is not too hard, as shown below.
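A sketch of such a model follows; the concrete sizes (a 784-dimensional flattened input, a 400-unit hidden layer and a 20-dimensional 𝑧) are example choices in the spirit of the official PyTorch example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VAE(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=400, latent_dim=20):
        super().__init__()
        # Encoder: input -> hidden -> (mu, log-variance) of Q(z|X)
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc_mu = nn.Linear(hidden_dim, latent_dim)
        self.fc_logvar = nn.Linear(hidden_dim, latent_dim)
        # Decoder: z -> hidden -> reconstruction of the input
        self.fc2 = nn.Linear(latent_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, input_dim)

    def encode(self, x):
        h = F.relu(self.fc1(x))
        return self.fc_mu(h), self.fc_logvar(h)

    def reparam(self, mu, logvar):
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h = F.relu(self.fc2(z))
        return torch.sigmoid(self.fc3(h))  # pixel values in [0, 1]

    def forward(self, x):
        mu, logvar = self.encode(x.view(x.size(0), -1))  # flatten the input
        z = self.reparam(mu, logvar)
        return self.decode(z), mu, logvar
```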

The snippet above defines a model that takes a flattened data representation as input and maps it to the latent representation 𝑧 through one hidden layer of a given size. The conversion from 𝑧 to the decoder output is also performed with one hidden layer of the same size as the hidden layer in the encoder. The forward method first gets the mean and the logarithm of the variance, then calculates 𝑧 via the reparameterization trick implemented in the reparam method.

The loss function of the model looks like this:
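A sketch of it, using the closed-form KL term derived above:

```python
import torch
import torch.nn.functional as F


def vae_loss(recon_x, x, mu, logvar):
    # Reconstruction term: binary cross-entropy between the output and the input
    bce = F.binary_cross_entropy(recon_x, x.view(x.size(0), -1), reduction='sum')
    # Regularization term: closed-form KL divergence between N(mu, sigma^2) and N(0, I)
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```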

where we first calculate the reconstruction loss with binary cross-entropy and then the KL divergence term. The full code of the script can be found here or in the official PyTorch repo.
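Once a model of this kind is trained, generating new digits amounts to sampling 𝑧 from the prior and decoding it; for instance (assuming a trained instance of the sketched VAE class above, with a 20-dimensional latent space):

```python
import torch

# model: a trained instance of the VAE class sketched above
with torch.no_grad():
    z = torch.randn(64, 20)                # 64 latent codes drawn from the prior N(0, I)
    samples = model.decode(z)              # decode into 64 flattened images
    samples = samples.view(64, 1, 28, 28)  # reshape to 28x28 MNIST-like images
```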

As a result, the model learns to produce realistic images (MNIST examples) after about 25 epochs.

References

[1] “Auto-Encoding Variational Bayes” paper: https://arxiv.org/abs/1312.6114
[2] Tutorial on VAE: https://arxiv.org/abs/1606.05908
