Variational Deep Embedding

Andrew Elkommos
Machine Intelligence and Deep Learning
11 min read · Apr 28, 2022

Variational Deep Embedding Presentation

Clustering is a central task in computer vision and machine learning. The proposed framework, Variational Deep Embedding (VaDE), is a novel unsupervised generative clustering approach built within the Variational Auto-Encoder framework.

An Overview of the VaDE architecture from the paper, Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering[1]

Overview

VaDE models the data generative procedure with a Gaussian Mixture Model and a deep neural network by the following steps:

  1. The Gaussian Mixture Model picks a cluster
  2. From this picked cluster a latent embedding is generated
  3. The DNN decodes the latent embedding into observables

The inference step in VaDE is done using a variational method, in which a second DNN encodes observables into latent embeddings, so that the evidence lower bound (ELBO) can be optimized using the Stochastic Gradient Variational Bayes (SGVB) estimator.

The motivation behind this work is to do the following:

  1. Learn good representations that capture the statistical structure of the data while remaining capable of generating samples.
  2. Leverage the capabilities of VaDE, for example, to generate the face of a person based on certain features that we want in our generated sample.

A sample application of VaDE in which features of faces are combined to generate a new unique sample [2]

Generative models are capable of producing unique samples once sufficiently trained; however, they typically expose little information about the statistical structure of the data, and this is where VaDE really shines. It combines a generative architecture with the ability to cluster data points.

Background into VaDE

To begin to understand the architecture behind VaDE we must start with a few key concepts and the Variational Autoencoder network (VAE).

Gaussian Mixture Model

A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. [3] It performs soft clustering, which means each data point can belong to multiple probability density functions with different probabilities.

The algorithm behind the Gaussian Mixture Model [4]

Each Gaussian k in the mixture is comprised of the following parameters:

  • A mean μ that defines its center.
  • A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.
  • A mixing probability π that defines how much weight the Gaussian component carries in the mixture.

We can display these parameters graphically as shown below.

Three Gaussian Distributions shown by clusters in 1 dimension [5]

From the three Gaussian functions we can see that K = 3, and each Gaussian explains the data contained in one of the clusters.
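To make these parameters concrete, below is a minimal sketch using scikit-learn's GaussianMixture (the library described in [3]); the toy 1-D data and the choice K = 3 are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy 1-D data drawn from three well-separated Gaussians (illustrative only).
rng = np.random.default_rng(0)
x = np.concatenate([
    rng.normal(-5.0, 1.0, 300),
    rng.normal(0.0, 0.5, 300),
    rng.normal(4.0, 1.5, 300),
]).reshape(-1, 1)

# Fit a mixture with K = 3 components via Expectation-Maximization.
gmm = GaussianMixture(n_components=3, random_state=0).fit(x)

print(gmm.weights_)        # mixing probabilities pi_k
print(gmm.means_)          # means mu_k (centers)
print(gmm.covariances_)    # covariances Sigma_k (widths)

# Soft clustering: each point gets a probability of belonging to each Gaussian.
print(gmm.predict_proba(x[:5]))
```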

To derive the Gaussian Mixture Model we need the probability that a data point x comes from Gaussian k. This can be expressed as follows.

Probability that a given data point x comes from Gaussian k [5]

The variable z is called a latent variable. Our latent variable is useful in determining the Gaussian mixture parameters.
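Since the equation image is not reproduced here, a standard way to write this quantity, often called the responsibility of Gaussian k for point x (following [5]), is:

```latex
p(z = k \mid x)
\;=\;
\frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)}
     {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \Sigma_j)},
\qquad
p(z = k) = \pi_k .
```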

Latent Variables

A latent variable model aims to model the probability distribution of the data by way of hidden (latent) variables.

Inference is the inverse of generation and vice versa [6]

  • Prior distribution p(z): models the behavior of the latent variables
  • Likelihood p(x|z): defines how to map latent variables to data points
  • Joint distribution p(x,z) = p(x|z)p(z): the product of the likelihood and the prior, which fully describes the model
  • Marginal distribution p(x): the distribution of the original data; it tells us how likely the model is to generate a given data point
  • Posterior distribution p(z|x): describes the latent variables that could have produced a specific data point

To generate a data point, we can sample z from p(z) and then sample the data point x from p(x|z).

Generation of a data point using conditional probability [6]

Alternatively, given a data point x, we can infer the corresponding latent variable z by sampling from the posterior p(z|x).

This leads us to the question of how we can find all of these distributions over the latent variables. To begin answering it we turn to Bayes' rule, which tells us that we can build each of these distributions as a combination of the others.

Formulation of Bayes' rule
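For reference, since the equation image is not shown, Bayes' rule in this latent-variable setting reads:

```latex
p(z \mid x) \;=\; \frac{p(x \mid z)\, p(z)}{p(x)},
\qquad
p(x) \;=\; \int p(x \mid z)\, p(z)\, dz .
```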

This leads us to the model upon which VaDE is based: the Variational Autoencoder (VAE).

Variational Autoencoder

The variational autoencoder is structured like an autoencoder network but uses the concept of a latent space, i.e., a vector of latent variables that encodes information into the model. To train the latent variables we use Maximum Likelihood Estimation (MLE). MLE is a technique for estimating the parameters of a probability distribution such that the distribution fits the observed data. [6] The MLE objective can be described mathematically as:
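Reconstructed in standard notation (the original equation image is not reproduced), the MLE objective and the marginal likelihood it involves are:

```latex
\theta^{*} \;=\; \arg\max_{\theta} \sum_{i=1}^{N} \log p_{\theta}\big(x^{(i)}\big),
\qquad
p_{\theta}(x) \;=\; \int p_{\theta}(x \mid z)\, p(z)\, dz .
```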

The MLE for the marginal distribution pθ(x) cannot be solved analytically; however, we can repose the problem and solve it using gradient descent. Once solved, we obtain model parameters θ that model the desired probability distribution. In order to apply gradient descent, we need to calculate the gradient of the marginal log-likelihood function. Using calculus and Bayes' rule we can write out the marginal log-likelihood function.

Now we need to solve for the posterior distribution, in other words the inference portion of our model. Since the posterior distribution pθ(z|x) is intractable, we must use variational inference to approximate it. We introduce another distribution qϕ(z|x), called the variational posterior, to approximate the actual posterior.

Evidence Lower Bound

Since the marginal log-likelihood is intractable, we instead optimize a lower bound Lθ,ϕ(x), called the variational lower bound. As a result, we maximize this lower bound with respect to both the model parameters θ and the variational parameters ϕ. The lower bound is given by the ELBO:

Evidence Lower Bound (ELBO)

E denotes the expected value. We can expand the ELBO equation further.
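Since the equation image is missing, here is the standard form of the bound from the VAE literature, consistent with the surrounding discussion:

```latex
\log p_{\theta}(x)
\;\geq\;
\mathcal{L}_{\theta,\phi}(x)
\;=\;
\mathbb{E}_{q_{\phi}(z \mid x)}\big[\log p_{\theta}(x \mid z)\big]
\;-\;
\mathrm{KL}\big(q_{\phi}(z \mid x)\,\|\,p(z)\big).
```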

KL denotes the Kullback-Leibler divergence, which measures how one probability distribution differs from a second one. The Kullback-Leibler divergence can be expressed as:

Kullback-Leibler equation
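For two distributions q and p over the latent variable, a standard way to write the divergence (the equation image is not reproduced) is:

```latex
\mathrm{KL}\big(q(z)\,\|\,p(z)\big)
\;=\;
\mathbb{E}_{q(z)}\!\left[\log \frac{q(z)}{p(z)}\right]
\;=\;
\int q(z)\,\log \frac{q(z)}{p(z)}\, dz .
```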

The goal of the model is to maximize the ELBO, which in turn increases log pθ(x). This means we need to compute the gradients of:

Using Monte Carlo sampling we can draw a handful of samples from the variational posterior and average them. This gives an estimate of the gradients instead of calculating them in closed form.
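Concretely, for any function f of the latent variable, the Monte Carlo estimate with L samples is:

```latex
\mathbb{E}_{q_{\phi}(z \mid x)}\big[f(z)\big]
\;\approx\;
\frac{1}{L} \sum_{l=1}^{L} f\big(z^{(l)}\big),
\qquad
z^{(l)} \sim q_{\phi}(z \mid x).
```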

Evidence Lower Bound can be visualized from the following animation.

Reparameterization Trick

We cannot backpropagate through a random sampling operation directly, so we move the parameters of the probability distribution out of the distribution and into the expectation. In other words, we want to rewrite the expectation so that the distribution we sample from is independent of the parameters ϕ.

Instead of a fully stochastic node, we introduce an auxiliary noise term so that the sample becomes a deterministic function of the parameters, which allows backpropagation.
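As a concrete illustration, here is a minimal PyTorch-style sketch of the trick for a diagonal Gaussian posterior; the tensor shapes are arbitrary choices for the example.

```python
import torch

def reparameterize(mu, logvar):
    """Draw z ~ N(mu, diag(exp(logvar))) as a deterministic function of (mu, logvar)."""
    std = torch.exp(0.5 * logvar)      # standard deviation
    eps = torch.randn_like(std)        # auxiliary noise, independent of the parameters
    return mu + eps * std              # gradients flow through mu and logvar

# Example: gradients reach the parameters even though z is random.
mu = torch.zeros(4, 10, requires_grad=True)
logvar = torch.zeros(4, 10, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()
print(mu.grad.shape, logvar.grad.shape)
```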

Now we can discuss the variational autoencoder network diagram. For the main model, we choose a neural network to parameterize the likelihood pθ(x|z); this network is known as the decoder.

Variational Autoencoder Network Architecture.

We train the model using variational inference, for which we use another neural network, known as the encoder. The encoder parameterizes the variational posterior qϕ(z|x). In order to generate latent samples from the encoder and pass them into the decoder we must utilize the reparameterization trick mentioned above. Since we use Gaussians, the encoder outputs the mean and log variance of the approximate posterior. Both the encoder and decoder are trained jointly by maximizing the ELBO, so the negative ELBO (a reconstruction term plus a KL term) serves as the loss function of the model.

To summarize VAE:

  1. Pass a datapoint to the encoder which outputs the mean and log-variance of the approximate posterior
  2. Apply the reparametrization trick
  3. Pass the reparameterized samples to the decoder to output the likelihood
  4. Compute the ELBO and backpropagate the gradients

We can also generate new data points with a trained VAE by following the steps below; a minimal code sketch covering both training and generation follows the list:

  1. Sample a set of latent vectors from the standard normal prior distribution (or, to reconstruct a specific input, obtain its latent variables from the encoder)
  2. Use the decoder to transform the latent variables into new data points
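As referenced above, here is a minimal PyTorch sketch of both procedures. The MLP encoder/decoder, 784-dimensional inputs, layer sizes, and Bernoulli (binary cross-entropy) likelihood are illustrative assumptions, not a specific published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, x_dim=784, h_dim=400, z_dim=20):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mean of q(z|x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log-variance of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        x_logits = self.dec(z)
        return x_logits, mu, logvar

def negative_elbo(x, x_logits, mu, logvar):
    # Reconstruction term: Bernoulli log-likelihood of the data.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(32, 784)                        # stand-in batch; real data would go here
loss = negative_elbo(x, *model(x))             # steps 1-3: encode, reparameterize, decode
opt.zero_grad(); loss.backward(); opt.step()   # step 4: maximize the ELBO (minimize its negative)

# Generation: sample z from the prior and decode it into new data points.
with torch.no_grad():
    z = torch.randn(16, 20)
    samples = torch.sigmoid(model.dec(z))
```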

Variational Deep Embedding

Now that we are familiar with the Variational Autoencoder we can take a dive into the Variational Deep Embedding Network.

The main difference in the Variational Deep Embedding framework is that it acts as a clustering framework, which models the data generative process by:

  1. Choosing a Cluster picked by the Gaussian Mixture Model
  2. Sampling a latent representation z
  3. Using a Deep Neural Network to decode z to an observation.

This model generalizes the VAE by replacing the single Gaussian prior with a Gaussian Mixture Model prior, which makes VaDE much better suited to clustering tasks.
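A minimal NumPy sketch of this generative process, with a toy linear "decoder" standing in for the DNN; all parameter values and dimensions here are placeholders for illustration, not trained values.

```python
import numpy as np

rng = np.random.default_rng(0)
K, z_dim, x_dim = 10, 2, 4          # number of clusters, latent and observed dims (toy sizes)

pi = np.full(K, 1.0 / K)            # mixing probabilities of the GMM prior
mu_c = rng.normal(size=(K, z_dim))  # per-cluster means
sigma_c = np.ones((K, z_dim))       # per-cluster (diagonal) standard deviations
W, b = rng.normal(size=(z_dim, x_dim)), np.zeros(x_dim)  # stand-in for the decoder DNN

def generate_sample():
    c = rng.choice(K, p=pi)                    # 1. the GMM prior picks a cluster
    z = rng.normal(mu_c[c], sigma_c[c])        # 2. sample a latent embedding z ~ N(mu_c, sigma_c^2 I)
    x = 1.0 / (1.0 + np.exp(-(z @ W + b)))     # 3. the "DNN" decodes z into observables (sigmoid output)
    return c, z, x

c, z, x = generate_sample()
print(c, z.shape, x.shape)
```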

Below is an overview of how the VaDE network operates. Refer to the numbers in the figure below:

  1. A cluster is selected from the Gaussian Mixture Model, and its mean and log variance are fed into the network
  2. A latent embedding is generated based on the picked cluster
  3. A DNN decodes the latent embedding into an observable x
  4. An encoder network is used to maximize the ELBO of VaDE

The model architecture of VaDE [1]

Let’s dive a bit deeper into the details of VaDE.

Generative Process of VaDE

VaDE is an unsupervised generative approach to clustering; we describe its generative process below.

Generative Process using VaDE [1]

According to the Generative process above, the joint probability can be factorized as:

Joint probability of generating the sampled data x from the latent variable z and the Gaussian distribution of the chosen cluster c [1]

The probabilities can be defined as

Probabilities that make up joint probability listed above for generative process [1]
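Restated here in LaTeX, since the equation images are not reproduced, the factorization and the component distributions take roughly the following form in [1] (for binary observations x):

```latex
p(x, z, c) \;=\; p(x \mid z)\, p(z \mid c)\, p(c),
\qquad
p(c) = \mathrm{Cat}(c \mid \boldsymbol{\pi}), \quad
p(z \mid c) = \mathcal{N}\!\big(z \mid \boldsymbol{\mu}_c, \boldsymbol{\sigma}_c^{2}\mathbf{I}\big), \quad
p(x \mid z) = \mathrm{Ber}\!\big(x \mid \boldsymbol{\mu}_x\big), \;\; \boldsymbol{\mu}_x = f(z; \theta).
```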

Variational Lower Bound

Similar to the VAE, we use maximum likelihood and write the log-likelihood of the VaDE model, but we need to add an additional stochastic variable c for the cluster.

Log-Likelihood equation for VaDE [1]

In VaDE, q(z, c|x) is the variational posterior used to approximate the true posterior p(z, c|x). We can factorize the variational posterior as:

Variational Posterior in VaDE [1]

The loss function for the ELBO can then be written as:

Similar to the VAE, we use a neural network g to model q(z|x):
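In the notation of [1] (restated since the equation image is missing), g outputs the mean and log-variance of a diagonal Gaussian posterior:

```latex
[\tilde{\boldsymbol{\mu}}, \log \tilde{\boldsymbol{\sigma}}^{2}] = g(x; \phi),
\qquad
q(z \mid x) = \mathcal{N}\!\big(z \mid \tilde{\boldsymbol{\mu}}, \tilde{\boldsymbol{\sigma}}^{2}\mathbf{I}\big).
```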

Stochastic Gradient Variational Bayes estimator (SGVB)

By substituting the terms into the ELBO loss and applying the SGVB estimator and the reparameterization trick, we can rewrite the ELBO loss:

ELBO after using SGVB estimator and Reparameterization Trick

Now we can work out how q(c|x) in the formula above should be chosen to maximize the ELBO, rewriting the ELBO loss such that:

From the equation above, we can see that the first term has no relationship with c and that the second term is non-negative. Therefore, to maximize the ELBO the KL divergence term must satisfy the following condition.

The information loss induced by the mean-field approximation can be mitigated, since p(c|z) captures the relationship between c and z, and we can use the following equation to compute q(c|x):
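Concretely, VaDE sets the cluster posterior equal to the GMM posterior over clusters given z (restated from [1], since the equation image is missing):

```latex
q(c \mid x) \;=\; p(c \mid z)
\;\equiv\;
\frac{p(c)\, p(z \mid c)}{\sum_{c'=1}^{K} p(c')\, p(z \mid c')} .
```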

Once the training is done by maximizing the ELBO with respect to the parameters, we can capture a latent representation z for each observed sample x, and the clustering assignment can be obtained.

Parameters used to maximize ELBO

The plot below shows clustering accuracy versus the number of training epochs for VaDE and various other clustering methods: Deep Embedded Clustering, Adversarial Autoencoder, Local Discriminant Models and Global Integration (LDMGI), and GMM. The comparison is on MNIST.

Clustering accuracy over number of epochs during training on MNIST [1]

Results

Let's take a look at the performance of VaDE over a number of epochs on the clustering task.

Demonstration of VaDE for Clustering during Training [1]

We can compare VaDE against different clustering algorithms to see the performance boost that this Unsupervised Model has over other algorithms.

Clustering accuracy (%) performance comparison on all datasets [1]

Now we can visualize how VaDE performs when it comes to functioning as a generative model.

Side by side comparison of different generated samples for MNIST Dataset [1]

Additionally, the latent space can be used to target specific features when it comes to generation of new samples.

Demonstration of different generated faces based on features, by row: 1. black, short hair; 2. black, long hair; 3. gold, long hair, woman; 4. bald, sunglasses, man; 5. left side of face, woman; 6. right side of face, woman

Finally, let us look at how we can interpolate in the latent space between samples using the latent variables. The VaDE model transitions smoothly from one set of features on the left side of the diagram to another set of features on the right.

Interpolation of Latent Space [2]

Conclusion

Variational Deep Embedding shows strong capability in solving the clustering problem using an architecture similar to the VAE. It models the data generative process using a GMM and a deep neural network. VaDE is optimized by maximizing the evidence lower bound of the log-likelihood of the data with the SGVB estimator and the reparameterization trick. VaDE outperforms state-of-the-art methods at both clustering and generation, and it can generate highly realistic samples conditioned on cluster information without any supervised information during training.

Works Cited

Here is a link to the paper on VaDE by the authors Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, and Hanning Zhou

[1] Variational Deep Embedding: An Unsupervised and Generative Approach to Clustering by Zhuxi Jiang, Yin Zheng, Huachun Tan, Bangsheng Tang, Hanning Zhou

[2] slim1017/VaDE: Python code for the paper "Variational Deep Embedding: A Generative Approach to Clustering" (github.com)

[3] 2.1. Gaussian mixture models — scikit-learn 1.0.2 documentation

[4] Gaussian Mixture Model

[5] Gaussian Mixture Models Explained

[6] Latent Variable Models
