Latent Variable Models: Variational Autoencoders

What are latent variable models? What is their objective, and how can it be formulated so that they can be trained efficiently?

Raja Parikshat
QLU.ai
Jun 15, 2022 · 13 min read


Introduction
Deep Generative Modeling is used to model real-world data distribution. This type of learning has many applications in synthesizing new data, anomaly detection, learning semantically rich representations of data, and more. There are various types of deep generative models, based on how they are trained, and how they learn the data distribution. In this particular article, we will focus on Variational Autoencoders that belong to a class of latent variable models. We will go in-depth into what are latent variables models, how to formulate their objectives, and how they are trained.
(The code for this article is available on the following link: code)

Note: The article assumes a background understanding of probability theory (random variables, expectations, joint and conditional probability, etc.), KL divergence, autoencoders, and Maximum Likelihood Estimation.

Let’s suppose we have a collection of human faces, for example, the CelebA (CelebFaces Attributes) dataset. Below are some samples from the CelebA-HQ dataset.

CelebA HQ Dataset Samples

Now suppose that, looking at the above faces, we want to make a new face. What would be the intuitive process? Well, we would first draw a rough sketch of the structure of the face, then decide on the facial attributes, hair color, etc., and finally come up with a new face. So, intuitively, there are some variables that we settle on before an image of a face is drawn.
Such factors or variables are called latent variables. They don’t appear explicitly, but they are important in the data-generation process. So, when drawing a new face image, we go from a low-dimensional representation of the data (the latent variables) to a high-dimensional one (the image). Now, let’s define the above scenario mathematically. We have high-dimensional data x ∈ X^D (face images, D-dimensional vectors), and for each image we have some low-dimensional latent variables z ∈ Z^M (pose, face color, hair, etc., an M-dimensional vector). The generative process of a face image can be described as:
z ∼ p(z),   x ∼ p(x|z)
The above notation denotes that z and x are sampled from their respective probability distributions. It means that we first obtain some latent vector z (for example, by deciding on the facial attributes), and then we generate a face image x based on that latent vector. One thing to note is that both the image and the latent variables are sampled from probability distributions. To understand why distributions come into play, think about real-world variables such as people’s heights or average plant heights. All these variables have a certain valid range, and some values of each variable are more probable than others. Hence, real-world variables can be thought of as random variables following some probability distribution. Similarly, to understand the conditional distribution above: if I pick a random height, say 174 cm, the probability of an individual of that height being male is higher than being female, as males are generally taller. Hence, the occurrence of some variables affects the likelihood of other variables.
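To make the two-step generative process concrete, here is a minimal NumPy sketch; the linear map W standing in for the decoder is entirely made up for illustration:

    import numpy as np

    M, D = 32, 64 * 64                      # latent / data dimensionality (illustrative)
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((D, M))  # a made-up stand-in for the decoder

    # Step 1: sample a latent vector z ~ p(z) = N(0, I).
    z = rng.standard_normal(M)

    # Step 2: sample an observation x ~ p(x|z), here a Gaussian whose mean
    # is a deterministic function of z and whose std is fixed at 0.1.
    x = W @ z + 0.1 * rng.standard_normal(D)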

Model and the Objective

The idea of a latent variable model is that if we have some distribution of latent variables z, and we know the conditional distribution P(x|z), we can get P(x) from probability theory as follows:
P(x) = ∫ P(x|z) P(z) dz    (1)
Now, P(z) is the prior. P(x|z) is a distribution whose parameters can be learned to maximize the likelihood of the data under that distribution. For the sake of this article, let’s take the prior to be the standard normal distribution N(0, I), and P(x|z) to be a Gaussian whose parameters (mean and sigma) are learned by neural networks as functions of z.

Hence, the marginal likelihood is now given as:
P(x) = ∫ N(x; μ_θ(z), σ_θ(z)) N(z; 0, I) dz    (2)
The objective is to maximize P(x) given the data x ∈ X^D (x is D-dimensional). We have n data points (training points) that we assume are independent of each other, so the total likelihood of the dataset is given by the product of the likelihoods of the individual data points x^(i). Normally, we maximize the log-likelihood, which turns the product into a sum, so the total log-likelihood is the sum of the log-likelihoods of the individual data points x^(i):
P(X) = ∏_{i=1}^n p_θ(x^(i))    (3)
log P(X) = Σ_{i=1}^n log p_θ(x^(i)) = Σ_{i=1}^n log ∫ p_θ(x^(i)|z) p(z) dz    (4)
The θ in the above equation comes from the fact that we define P(x|z) as a distribution parameterized by θ, which can be optimized to fit the distribution to the dataset.
Let’s first look at how to solve the above integral; once we have a solution for it, we can derive the equation for maximizing P(x).
The integral inside the log is not tractable (it has no analytical solution), so we would have to integrate it numerically. But we normally work in high-dimensional spaces, where the curse of dimensionality makes numerical integration infeasible too. What else can we do? We can rewrite the above equation as an expectation with respect to P(z).
log p_θ(x^(i)) = log E_{z∼p(z)}[ p_θ(x^(i)|z) ]    (5)
Intuitively, the expectation is the average value of a random variable as the sample size goes to infinity. (To see this in action, write a Python program that simulates rolling a die, draw 10, 100, 1000, and 10000 samples respectively, and take the average of each list individually. The averages will converge to 3.5, the expected value of a die roll; a sketch follows below.) From this knowledge, we can approximate the above equation as:
log p_θ(x^(i)) ≈ log ( (1/K) Σ_{k=1}^K p_θ(x^(i)|z^(k)) ),   z^(k) ∼ p(z)    (6)
So, we draw K samples z^(1), …, z^(K) from p(z) for each data point x^(i) and approximate the expectation by the sample average. The larger K is, the more accurate the approximation. But in practice, when dealing with images, this also fails because of dimensionality: the number of samples needed to cover the space grows exponentially with the number of dimensions. What this means is that, given a data point x^(i), only some latent representations z are meaningful, and if we are sampling from a very large space, it is very unlikely that we hit those latent representations; ultimately, most of the samples are wasted and nothing meaningful is learned.
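Here is the die-roll experiment suggested above, a minimal NumPy sketch of Monte Carlo convergence:

    import numpy as np

    rng = np.random.default_rng(0)
    for n in (10, 100, 1000, 10000):
        rolls = rng.integers(1, 7, size=n)   # n rolls of a fair six-sided die
        print(n, rolls.mean())               # sample mean -> E[X] = 3.5 as n grows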

Importance Sampling

Above, we concluded that we can approximate equation (4) by Monte Carlo sampling, but that in high dimensions this fails, as we would have to take a huge number of samples to approximate it well. What if we had a family of distributions q_ϕ(z) that could give us the important z samples for a given data point x^(i)? If we sample from q_ϕ(z), we can approximate equation (4) more accurately.

Mathematically, starting from equation (4), we can derive:
∫ p_θ(x^(i)|z) p(z) dz = ∫ p_θ(x^(i)|z) p(z) (q_ϕ(z)/q_ϕ(z)) dz = E_{z∼q_ϕ(z)}[ p_θ(x^(i)|z) p(z) / q_ϕ(z) ]
Hence, the approximation is given as:
log p_θ(x^(i)) ≈ log ( (1/K) Σ_{k=1}^K p_θ(x^(i)|z^(k)) p(z^(k)) / q_ϕ(z^(k)) ),   z^(k) ∼ q_ϕ(z)    (7)
This approximation is better than equation (6), as we sample meaningful z’s from q_ϕ(z) given a data point x^(i). So what should the proposal distribution q_ϕ(z) be? It can be shown that the optimal choice of q_ϕ(z) is the posterior distribution P(z|x), which tells us which z’s are likely given a data point x^(i). From Bayes’ rule, we can write the posterior as:
P(z|x) = P(x|z) P(z) / P(x)    (8)
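
Before moving on to how such a q_ϕ(z) is actually found, here is a toy 1-D sketch (with made-up numbers, assuming NumPy and SciPy) contrasting the naive estimator (6) with the importance-sampled estimator (7). The likelihood is peaked far from where the prior puts its mass, so naive sampling from p(z) almost never lands anywhere useful, while a well-placed proposal fixes this:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    lik = lambda z: norm.pdf(z, loc=4.0, scale=0.1)    # toy p(x|z), peaked near z = 4

    # Naive Monte Carlo, z ~ p(z) = N(0, 1): almost no sample lands near z = 4.
    z = rng.standard_normal(10_000)
    print("naive MC:  ", lik(z).mean())

    # Importance sampling, z ~ q(z) = N(4, 0.1), reweighted by p(z)/q(z).
    zq = rng.normal(4.0, 0.1, size=10_000)
    w = norm.pdf(zq, 0.0, 1.0) / norm.pdf(zq, 4.0, 0.1)
    print("importance:", (lik(zq) * w).mean())         # stable, ~1.4e-4 (true value)
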
Variational Approach

The posterior in equation (8) cannot be computed, as it requires P(x), which is exactly what we are trying to estimate. Hence, we cannot sample directly from the posterior. Instead, we can propose q_ϕ(z) as some parameterized distribution (e.g., a Gaussian with some mean and variance) that is easy to work with, and try to find a parameter setting that makes it as close as possible to the posterior P(z|x). The closer the learned distribution q_ϕ(z) is to the actual posterior P(z|x), the more meaningful and accurate the drawn samples will be. How can we make q_ϕ(z) close to P(z|x)? We do this by minimizing the KL divergence between q_ϕ(z) and P(z|x). (The KL divergence measures the dissimilarity between two distributions.)
The KL divergence between two distributions q and p is given as:
KL( q ‖ p ) = E_{z∼q}[ log q(z) − log p(z) ]
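
As a quick sanity check of this definition (a toy example with made-up Gaussians, assuming NumPy and SciPy), the Monte Carlo average of log q − log p under q matches the closed-form KL between two univariate Gaussians:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu_q, s_q, mu_p, s_p = 1.0, 0.5, 0.0, 1.0

    z = rng.normal(mu_q, s_q, size=100_000)   # z ~ q
    mc_kl = (norm.logpdf(z, mu_q, s_q) - norm.logpdf(z, mu_p, s_p)).mean()
    exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5
    print(mc_kl, exact)                       # both ~0.82
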
Now, writing the expression for minimizing the KL divergence:
min_ϕ KL( q_ϕ(z) ‖ p(z|x^(i)) ) = min_ϕ E_{z∼q_ϕ(z)}[ log q_ϕ(z) − log p(z|x^(i)) ]
By expanding the posterior with Bayes’ rule and simplifying, we get:
min_ϕ E_{z∼q_ϕ(z)}[ log q_ϕ(z) − log p_θ(x^(i)|z) − log p(z) ] + log p_θ(x^(i))    (9)
The above equation can be optimized using SGD or any other optimizer. The last term, log p_θ(x^(i)), does not depend on z or on ϕ, so it can be dropped from the minimization. All the remaining terms can be computed, and the expectation can be approximated using sample averages. In this way, for every data sample x^(i) (i.e., every image), we can learn a distribution q_ϕ(z) that gives the important z samples needed to approximate equation (7) accurately.

Amortized Inference

From the above, for every x^(i) we would learn a separate q_ϕ(z) to get good z samples. Instead of learning q_ϕ(z) separately for each x^(i), we can parameterize a neural network that takes x^(i) as input and returns q_ϕ(z|x) as output, minimizing the objective of equation (9) over all samples.

Mathematically, the amortized formulation is given as:
min_ϕ Σ_{i=1}^n E_{z∼q_ϕ(z|x^(i))}[ log q_ϕ(z|x^(i)) − log p_θ(x^(i)|z) − log p(z) ]
This allows us to obtain the distribution q_ϕ(z|x) for every x^(i) with a single forward pass of a neural network. This is faster, but less precise, as we now use two approximations: first, the variational distribution q_ϕ(z|x) itself, and second, a single network that predicts q_ϕ(z|x) for every x^(i).
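A sketch of what such an amortized inference network could look like in PyTorch (the layer sizes are made up for illustration): a single shared network maps any x to the parameters (μ, log σ²) of its own q_ϕ(z|x):

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, x_dim=784, h_dim=256, z_dim=32):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
            self.mu = nn.Linear(h_dim, z_dim)        # mean of q(z|x)
            self.logvar = nn.Linear(h_dim, z_dim)    # log-variance of q(z|x)

        def forward(self, x):
            h = self.body(x)
            return self.mu(h), self.logvar(h)

    enc = Encoder()
    mu, logvar = enc(torch.randn(16, 784))   # 16 inputs -> 16 posteriors, one pass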

Variational Lower Bound (VLB) / Evidence Lower Bound (ELBO)

To derive the expression for the VLB, or ELBO, we can rewrite expression (9) in terms of log P(x) as:
log p_θ(x) = E_{z∼q_ϕ(z|x)}[ log p_θ(x|z) + log p(z) − log q_ϕ(z|x) ] + KL( q_ϕ(z|x) ‖ p(z|x) ) = ELBO + KL    (10)
So, now we have the whole picture. The log-likelihood of the data equals the Variational Lower Bound (VLB) plus a KL divergence term. Since the KL divergence is always greater than or equal to zero, the log-likelihood is always greater than or equal to the VLB/ELBO, and the gap between them is exactly the KL divergence. If our variational distribution q_ϕ(z|x) equals the posterior p(z|x), then the KL term is zero and the log-likelihood equals the VLB/ELBO.

Our objective is to maximize the VLB/ELBO term, since we know the optimal q_ϕ(z|x) that maximizes it is q_ϕ(z|x) = p(z|x). So, by maximizing the ELBO, we automatically push the KL divergence term toward zero. One might worry that shrinking the KL term lowers the overall objective, but note that for a fixed θ the left-hand side log p_θ(x) does not depend on ϕ: whatever the KL term loses, the ELBO gains, so maximizing the ELBO tightens the bound and, through θ, increases the likelihood.

Variational Autoencoder

Now, let’s put the above equation into an autoencoder setting. Rewriting the ELBO term from above:
ELBO(θ, ϕ; x) = E_{z∼q_ϕ(z|x)}[ log p_θ(x|z) ] − KL( q_ϕ(z|x) ‖ p(z) )
We want to maximize the ELBO, but in deep learning we normally work with minimization objectives. Maximizing the ELBO is equivalent to minimizing the negative ELBO. Hence, our loss is:
L(θ, ϕ; x) = −ELBO = − E_{z∼q_ϕ(z|x)}[ log p_θ(x|z) ] + KL( q_ϕ(z|x) ‖ p(z) )    (11)
This is what the flow of computation looks like.
You take a batch of data points x^(1), x^(2), …, x^(n) and map them to z’s by sampling from q_ϕ(z|x^(1)), q_ϕ(z|x^(2)), …, q_ϕ(z|x^(n)). We are encoding x into some latent representation z using q_ϕ(z|x), so we can think of q_ϕ(z|x) as a kind of encoder. (In an autoencoder, the encoder maps x into some latent representation z.) A VAE’s encoder differs from a normal encoder in that it outputs a distribution over the latent vectors z given a data sample x.
Normally, we take q_ϕ(z|x) to be a multivariate Gaussian with mean vector μ and diagonal covariance matrix Σ. So, if our z has 32 dimensions, the encoder returns a 32-dimensional mean vector μ and a 32-dimensional vector holding the diagonal of Σ.
To get the latent representations, we sample from q_ϕ(z|x). Also, to evaluate the objective in equation (11), we approximate the expectation with respect to q_ϕ(z|x) using Monte Carlo.
Once we have our z’s, we pass them to a decoder p_θ(x|z). The decoder takes the latent vectors z as input and tries to reconstruct our original samples x.
If we look at the objective, term 1 tries to make the reconstruction x̂ close to the original x. This becomes clearer if we take p_θ(x|z) to be a Gaussian: the log of a Gaussian gives a simple MSE loss with some scaling factors.
The second term pushes the posterior q_ϕ(z|x) toward the prior p(z), while we hope it still learns interesting latent representations of the input. When generating new data, we sample from the prior p(z) and then use the decoder to get a new sample x. The closer the posterior q_ϕ(z|x) is to the prior p(z), the more meaningful a latent sample from p(z) will be.
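To connect the two terms to code, here is a sketch of the loss under the assumptions above: a Gaussian decoder with fixed variance (so term 1 reduces to MSE up to constants) and a diagonal-Gaussian encoder with an N(0, I) prior (so term 2 has a well-known closed form):

    import torch

    def vae_loss(x, x_hat, mu, logvar):
        # Term 1: reconstruction. For a fixed-variance Gaussian p(x|z),
        # -log p(x|z) is the squared error up to additive/scaling constants.
        recon = ((x - x_hat) ** 2).sum(dim=1)
        # Term 2: KL( N(mu, diag(exp(logvar))) || N(0, I) ), in closed form.
        kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(dim=1)
        return (recon + kl).mean()   # average over the batch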

Gradients in the Network

Let’s look at how the gradients are computed in the network. Here, I use θ to denote all the parameters of the decoder network and ϕ to denote all the parameters of the encoder network. Written out term by term, the loss is:
L(θ, ϕ) = E_{z∼q_ϕ(z|x)}[ log q_ϕ(z|x) ] − E_{z∼q_ϕ(z|x)}[ log p(z) ] − E_{z∼q_ϕ(z|x)}[ log p_θ(x|z) ]
Gradients with respect to decoder parameters.
The only term in the above equation that depends on θ is term 3, so the gradients with respect to the decoder parameters are given as:
∇_θ L = − E_{z∼q_ϕ(z|x)}[ ∇_θ log p_θ(x|z) ] ≈ − (1/K) Σ_{k=1}^K ∇_θ log p_θ(x|z^(k)),   z^(k) ∼ q_ϕ(z|x)
The above gradients are relatively simple to compute. The only approximation is the expectation, which we estimate by taking K samples of z; in practice, we take only one or two z samples and then compute the above gradients. Note that we moved the gradient inside the expectation. We can do this because q_ϕ(z|x) does not depend on θ and can be treated as a constant when differentiating with respect to θ.

Gradients with respect to encoder parameters.
When we took gradients with respect to θ, we moved the gradient inside the expectation and then approximated it easily. For gradients with respect to ϕ (the encoder parameters), we cannot do the same, because this time q_ϕ(z|x) cannot be treated as a constant: it depends on the parameters of the encoder. Moreover, all three terms in the loss depend on ϕ. Suppose we somehow knew how to take these gradients; there would still be another problem. We pass an input to the encoder and it returns the distribution q_ϕ(z|x). We sample some z’s from that distribution and pass them to the decoder. When we take gradients with respect to ϕ, the gradients flow back from the decoder and reach the point where the z’s were sampled from the distribution. Sampling is not differentiable, so we cannot pass gradients back to our encoder. How do we tackle these two problems?
Reparameterization
Suppose we have a function f(x), and we want to compute its expected value with respect to p(x, θ):
E_{x∼p(x,θ)}[ f(x) ] = ∫ p(x, θ) f(x) dx
Suppose p(x, θ) is a Gaussian with parameters θ = (μ, σ).
Now, if we want to sample from p(x, θ), we have two ways. One way is to sample directly from p(x, θ). The other way is to first sample epsilon from the standard normal, ϵ ∼ N(0, 1), and then define x as a transformation of ϵ:
x = μ + σϵ,   ϵ ∼ N(0, 1)
Using this equivalence, we can compute the gradient with respect to θ as:
∇_θ E_{x∼p(x,θ)}[ f(x) ] = ∇_θ E_{ϵ∼N(0,1)}[ f(μ + σϵ) ] = E_{ϵ∼N(0,1)}[ ∇_θ f(μ + σϵ) ]
From the above, the standard normal does not depend on θ, so we can push the gradient operator inside the expectation and approximate it using Monte Carlo.
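A quick numeric check of this identity (a toy example with f(x) = x² under N(μ, σ²), where the exact gradient ∂/∂μ E[x²] = 2μ is known):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 1.5, 0.7
    eps = rng.standard_normal(100_000)   # eps ~ N(0, 1), independent of (mu, sigma)
    x = mu + sigma * eps                 # reparameterized samples from N(mu, sigma^2)

    # d/dmu of f(mu + sigma*eps) = 2 * (mu + sigma*eps); average over eps.
    grad_mu = (2 * x).mean()
    print(grad_mu)                       # close to the exact value 2*mu = 3.0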

Applying the above reparameterization trick to our problem: instead of sampling from q_ϕ(z|x), we sample ϵ from a standard normal and then apply a transformation to get our z’s, with the μ’s and Σ’s coming from the encoder. Just as above, we can then push the gradient inside the expectation. Reparameterization also means we never sample from q_ϕ(z|x) directly: we sample from a standard normal (ϵ can be treated as an auxiliary input), apply a deterministic transformation to get z, and pass it to the decoder. So, the flow from the encoder to the decoder is completely differentiable, and we can pass decoder gradients back to the encoder.

Encoder to Decoder, smooth deterministic flow
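Putting all the pieces together, here is a hedged end-to-end sketch in PyTorch (layer sizes are illustrative; the loss would be the vae_loss sketch from earlier): the encoder outputs (μ, log σ²), z is built deterministically from ϵ, and gradients flow through μ and σ back into the encoder:

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        def __init__(self, x_dim=784, h_dim=256, z_dim=32):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, 2 * z_dim))
            self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                     nn.Linear(h_dim, x_dim))

        def forward(self, x):
            mu, logvar = self.enc(x).chunk(2, dim=1)   # parameters of q(z|x)
            eps = torch.randn_like(mu)                 # eps ~ N(0, I), auxiliary input
            z = mu + (0.5 * logvar).exp() * eps        # deterministic transform of eps
            return self.dec(z), mu, logvar

    vae = VAE()
    x_hat, mu, logvar = vae(torch.rand(16, 784))   # fully differentiable end to end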

Conclusion

In this article, we briefly looked at the concept of latent variable models, derived their objective functions, and discussed the challenges in evaluating those objectives. We then arrived at the strategy of variational inference and saw how it helps us train latent variable models effectively. This article was a mathematical introduction to the concept of the VAE; there are many modified variants of the VAE, but the fundamental concept behind all of them is the vanilla VAE. For the code, follow this link: code

