Variational Autoencoder (VAE)

Roger Yong · Geek Culture · Jul 8, 2021

As a generative model, the basic idea of the VAE is easy to understand: a real sample is transformed by the encoder network into an idealized data distribution, and this distribution is then passed to a decoder network to obtain a generated sample. If the generated samples are close enough to the real samples, the autoencoder model has been trained.

Dimensionality reduction, PCA and autoencoders (AE)

PCA

As shown in the figure, x is a matrix that can be turned into a low-dimensional matrix c through a transformation W. Because this process is linear, the transpose of W can be used to restore a reconstruction x hat. PCA finds W through SVD (singular value decomposition) so that x and x hat are as consistent as possible.
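A minimal sketch of this view of PCA (the toy data and the number of components k below are illustrative assumptions, not from the article):

```python
import numpy as np

X = np.random.randn(100, 20)        # toy data: 100 samples, 20 dimensions
Xc = X - X.mean(axis=0)             # PCA assumes centered data

# SVD of the centered data gives the principal directions; keep the top k as W.
k = 5
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:k]                          # shape (k, 20)

c = Xc @ W.T                        # low-dimensional code:        c = W x
X_hat = c @ W                       # linear reconstruction:   x_hat = W^T c
mse = np.mean((Xc - X_hat) ** 2)    # PCA minimizes this reconstruction error
```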

The difference between an AE and PCA is that the AE uses neural networks instead of SVD to perform this transformation.

Autoencoder
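A minimal fully connected autoencoder sketch in Keras (the layer sizes and the 784-dimensional flattened input, e.g. MNIST images, are illustrative assumptions):

```python
import tensorflow as tf
from tensorflow.keras import layers

code_dim = 32

autoencoder = tf.keras.Sequential([
    layers.Dense(256, activation="relu"),    # encoder
    layers.Dense(code_dim),                  # low-dimensional code c
    layers.Dense(256, activation="relu"),    # decoder
    layers.Dense(784, activation="sigmoid"), # reconstruction x_hat
])
autoencoder.compile(optimizer="adam", loss="mse")  # minimize ||x - x_hat||^2
# autoencoder.fit(x_train, x_train, epochs=10)     # trained to reproduce its input
```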

Variational Autoencoder (VAE)

Encoder

The encoder defines the approximate posterior distribution q(z|x): it takes an observation x as input and outputs the parameters of the conditional distribution over the latent representation z. In this example we simply model q(z|x) as a diagonal (factorized) Gaussian, so the network outputs its mean and log-variance. The log-variance is output instead of the variance directly for numerical stability.
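A rough encoder sketch in Keras (the 784-dimensional flattened input, the layer sizes and the 2-dimensional latent space are assumptions for illustration):

```python
import tensorflow as tf
from tensorflow.keras import layers

latent_dim = 2   # illustrative choice

# Encoder network: maps x to the mean and log-variance of the diagonal Gaussian q(z|x).
encoder = tf.keras.Sequential([
    layers.Dense(256, activation="relu"),
    layers.Dense(2 * latent_dim),   # first half: mean, second half: log-variance
])

def encode(x):
    mean, logvar = tf.split(encoder(x), num_or_size_splits=2, axis=1)
    return mean, logvar
```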

Decoder

The decoder defines the conditional distribution of the observation P(x|z): it takes a latent sample z as input and outputs the parameters of the distribution over the observation. The prior over the latent variable, P(z), is modeled as a unit Gaussian.
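A matching decoder sketch (again with assumed sizes, and assuming Bernoulli pixel likelihoods so the network outputs logits):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Decoder network: maps a latent sample z to the parameters (here, logits) of P(x|z).
decoder = tf.keras.Sequential([
    layers.Dense(256, activation="relu"),
    layers.Dense(784),   # logits of the 784 Bernoulli pixel distributions
])

# Since the prior P(z) is a unit Gaussian, new data can be generated by decoding
# a sample drawn from it:
z = tf.random.normal(shape=(1, 2))   # z ~ N(0, I), with 2 = latent_dim
x_generated = tf.sigmoid(decoder(z)) # pixel probabilities
```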

Reparameterization trick

To generate a sample z for the decoder during training, you can sample from the latent distribution defined by the parameters outputted by the encoder, given an input observation x. However, this sampling operation creates a bottleneck because backpropagation cannot flow through a random node.

To address this, we use the reparameterization trick. In our example, z is approximated using the encoder's outputs and another parameter ε as follows:

z = μ + σ ⊙ ε

where μ and σ represent the mean and standard deviation of the Gaussian distribution, respectively; both are derived from the encoder output. The ε can be thought of as random noise used to maintain the stochasticity of z, and is generated from a standard normal distribution.
The latent variable z is now a deterministic function of μ, σ and ε, which enables the model to backpropagate gradients into the encoder through μ and σ while maintaining stochasticity through ε.
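A short sketch of this trick, reusing the mean/log-variance parameterization from the hypothetical encoder above:

```python
import tensorflow as tf

# Reparameterization trick: instead of sampling z ~ N(mean, sigma^2) directly,
# draw eps ~ N(0, I) and compute z deterministically from mean and logvar, so
# gradients can flow back into the encoder through mean and logvar.
def reparameterize(mean, logvar):
    eps = tf.random.normal(shape=tf.shape(mean))
    return mean + tf.exp(0.5 * logvar) * eps   # z = mu + sigma * eps
```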

How does a VAE work?

The theoretical basis of the VAE is the Gaussian mixture model (GMM). The difference is that the discrete mixture code is replaced by a continuous variable z, and z follows the standard normal distribution N(0,1).
For each sampled z there are two values, μ(z) and σ(z), which determine the mean and standard deviation of the Gaussian distribution corresponding to that z; accumulating all of these Gaussians over the integration domain gives the original distribution P(x):

P(x) = ∫ P(z) P(x|z) dz

where z ~ N(0,1) and x|z ~ N(μ(z), σ(z)). P(z) is known, but P(x|z) is not: what we really need to solve for are the functions μ(z) and σ(z). However, P(x) is so complex that μ and σ are difficult to calculate directly, so we introduce two neural networks (the Decoder for P(x|z) and the Encoder for q(z|x)) to help us solve it.
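As a toy numerical illustration of this generative process (the particular mu and sigma functions below are arbitrary stand-ins, not anything the article specifies; in a VAE they are learned by the Decoder):

```python
import numpy as np

# Stand-in functions for mu(z) and sigma(z).
mu_fn = lambda z: 2.0 * z + 1.0
sigma_fn = lambda z: np.exp(0.1 * z)

z = np.random.randn(10_000)                    # z ~ N(0, 1)
x = np.random.normal(mu_fn(z), sigma_fn(z))    # x | z ~ N(mu(z), sigma(z))
# Pooled over all z, the samples x follow P(x) = ∫ P(z) P(x|z) dz,
# an (infinite) mixture of Gaussians indexed by the continuous variable z.
```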

We want P(x) to be as large as possible for the observed data; that is, we maximize the log-likelihood of the training samples:

L = Σ_x log P(x)

For any distribution q(z|x), log P(x) can be decomposed as

log P(x) = ∫ q(z|x) log P(x) dz    (since ∫ q(z|x) dz = 1)
         = ∫ q(z|x) log( P(z,x) / q(z|x) ) dz + ∫ q(z|x) log( q(z|x) / P(z|x) ) dz

The second term of the above formula is exactly KL(q(z|x)||P(z|x)), which is always greater than or equal to 0, so the first term is a lower bound of log P(x). We denote this lower bound as ELBO (Evidence Lower BOund):

ELBO = ∫ q(z|x) log( P(z,x) / q(z|x) ) dz = ∫ q(z|x) log( P(x|z) P(z) / q(z|x) ) dz

So we can rewrite the decomposition above as:

log P(x) = ELBO + KL( q(z|x) || P(z|x) )

By adjusting q(z|x) we make ELBO larger and larger while the KL divergence gets smaller and smaller; when q(z|x) is adjusted to match P(z|x) exactly, the KL divergence vanishes to 0 and ELBO coincides with log P(x). Since we can always push ELBO up to log P(x) in this way, and ELBO is a lower bound of log P(x), maximizing log P(x) is equivalent to maximizing ELBO.
Adjusting P(x|z) means adjusting the Decoder, and adjusting q(z|x) means adjusting the Encoder. Every time the Decoder improves, the Encoder is adjusted to stay consistent with it, so the Decoder can only get better in the next training epoch.

Expanding ELBO gives

ELBO = ∫ q(z|x) log( P(x|z) P(z) / q(z|x) ) dz = -KL( q(z|x) || P(z) ) + ∫ q(z|x) log P(x|z) dz

Therefore, maximizing ELBO is equivalent to minimizing KL(q(z|x)||P(z)) and maximizing the second (integral) term.

First, look at the second term:

∫ q(z|x) log P(x|z) dz = E_{z~q(z|x)}[ log P(x|z) ]

Maximizing this expectation means that, for z sampled from q(z|x) (the Encoder's output), the probability P(x|z) that the Decoder assigns to the original input x should be as high as possible. This plays the same role as the AutoEncoder's loss function (reconstruction error).

Next, let's discuss -KL(q(z|x)||P(z)). Since q(z|x) is the diagonal Gaussian N(μ, σ²) output by the Encoder and P(z) is the unit Gaussian N(0, I), this term has a closed form:

-KL( q(z|x) || P(z) ) = 1/2 Σ_j ( 1 + log σ_j² − μ_j² − σ_j² )
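Putting the pieces together, here is a hedged sketch of the resulting training loss (the negative ELBO), reusing the hypothetical encode, decoder and reparameterize sketches above and assuming a batch of flattened pixel values in [0, 1] with Bernoulli likelihoods:

```python
import tensorflow as tf

def vae_loss(x):
    mean, logvar = encode(x)             # parameters of q(z|x)
    z = reparameterize(mean, logvar)     # one sample z ~ q(z|x)
    x_logits = decoder(z)                # parameters (logits) of P(x|z)

    # Reconstruction term: a one-sample estimate of E_{z~q(z|x)}[log P(x|z)].
    log_px_z = -tf.reduce_sum(
        tf.nn.sigmoid_cross_entropy_with_logits(labels=x, logits=x_logits), axis=1)

    # -KL(q(z|x) || P(z)) in closed form: 1/2 * sum(1 + log(sigma^2) - mu^2 - sigma^2).
    neg_kl = 0.5 * tf.reduce_sum(1.0 + logvar - tf.square(mean) - tf.exp(logvar), axis=1)

    return -tf.reduce_mean(log_px_z + neg_kl)   # minimize the negative ELBO
```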

Conclusion

Both EM and VAE are machine learning techniques/algorithms for finding the latent variables z. However, although the overall goal and even the objective function are the same, there are differences due to the complexity of the model.
There are two issues where EM (and its variants) has limitations; these are mentioned in the original VAE paper by Kingma.

In the EM algorithm we can compute the posterior probability exactly, but in the problems solved by a VAE the posterior is intractable, i.e. it cannot be calculated. So in a VAE we have to approximate this posterior, which is why the KL divergence appears in our formula; this method is in fact a variational approximation to the posterior.
