Implementing a Variational Autoencoder (VAE) in PyTorch

The aim of this post is to implement a variational autoencoder (VAE) that trains on words and then generates new words. Note that to get meaningful results you have to train on a large number of words; training on a small number of words just produces garbage. Also note that the implementation uses a 1-layer GRU for both the encoder and the decoder, so the results could be significantly improved with more sophisticated architectures. The aim is to understand how a typical VAE works, not to obtain the best possible results.

What is a VAE?

There are tons of blogs and video lectures that explain VAEs in great detail, so here I will only give a brief sketch. The VAE is now one of the most popular generative models (the other being the GAN), and like any other generative model it tries to model the data. For example, VAEs could be trained on a set of images (data) and then used to generate more images like them. If X is the given data, we would like to estimate P(X), the true distribution of X. We assume that X depends on some latent variable z, and that a data point x is sampled from P(X|z). Typically, we would like to learn what the good values of z are, so that we can use them to generate more data points like x. So the inference problem is to estimate P(z|X), or in other words, to estimate the latent variables that generate X from the data provided. According to Bayes’ rule —

P(z|X) = P(X|z)P(z)/P(X)

Now P(X) = ∫ P(X|z)P(z)dz, which in many cases is intractable. The way out is to introduce a distribution Q(z|X) to approximate P(z|X) and to measure how good the approximation is using the KL divergence. In this post we take Q to be from the Gaussian family, so each data point is described by a mean and a standard deviation. What we end up with is an encoder Q(z|X) and a decoder P(X|z); X* denotes the generated data.

We will use deep neural networks to learn Q(z|X) and P(X|z). For a detailed review of the theory (the loss function and the reparameterisation trick), look here, here and here. To summarise, we use the training data to estimate the parameters of z (in our case, means and standard deviations), sample from the resulting distribution, and then use the sample to generate X*.
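To make the loss concrete (this simply restates the standard derivation from those references, in the same notation as above): the training objective is the evidence lower bound, which splits into a reconstruction term and a KL penalty that keeps Q(z|X) close to the prior P(z),

log P(X) ≥ E[log P(X|z)] − KL(Q(z|X) || P(z)), with z sampled from Q(z|X).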

The implementation:

To start with, we take a set of reviews and extract the words from them; the idea is to generate similar words. Each word is converted to a tensor, with each letter represented by a unique integer.
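A minimal sketch of this preprocessing step might look like the following (the vocabulary, the SOS/EOS tokens, and the helper name word_to_tensor are illustrative assumptions, not the original code):

```python
import torch

# Illustrative character vocabulary: index 0 is a start-of-word token and
# index 1 an end-of-word token; the letters get the remaining indices.
SOS, EOS = 0, 1
letters = "abcdefghijklmnopqrstuvwxyz"
char2idx = {c: i + 2 for i, c in enumerate(letters)}
n_chars = len(letters) + 2

def word_to_tensor(word):
    # Map a word to a 1-D LongTensor of letter indices, terminated by EOS.
    idxs = [char2idx[c] for c in word.lower()] + [EOS]
    return torch.tensor(idxs, dtype=torch.long)

print(word_to_tensor("good"))  # tensor([ 8, 16, 16,  5,  1])
```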

Each word is now mapped to a tensor (e.g., [1, 3, 4, 23]). Now let's consider the encoder module —
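As a sketch, an encoder along these lines could be written as below (the class name Encoder and the sizes hidden_size=64 and latent_size=16 are assumptions made for illustration):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    # Q(z|X): a 1-layer GRU over the letter sequence, followed by linear
    # layers that give the mean and log-variance of the latent Gaussian.
    def __init__(self, n_chars, hidden_size=64, latent_size=16):
        super().__init__()
        self.embed = nn.Embedding(n_chars, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=1, batch_first=True)
        self.to_mu = nn.Linear(hidden_size, latent_size)
        self.to_logvar = nn.Linear(hidden_size, latent_size)

    def forward(self, word_tensor):
        emb = self.embed(word_tensor).unsqueeze(0)   # (1, seq_len, hidden_size)
        _, h = self.gru(emb)                         # h: (1, 1, hidden_size)
        h = h.squeeze(0)                             # (1, hidden_size)
        return self.to_mu(h), self.to_logvar(h)      # each (1, latent_size)
```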

We use a 1-layer GRU (gated recurrent unit) whose input is the letter sequence of a word, and then use linear layers to obtain the means and standard deviations of the latent state distribution. We then sample from the resulting distribution (using the reparametrization trick so that backpropagation works; you can read about it in the references provided earlier), and the sample is fed as input to the decoder module —
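A corresponding sketch of the sampling step and the decoder, under the same assumptions as above, might be:

```python
def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); keeps gradients flowing
    # through mu and logvar (the reparametrization trick).
    std = torch.exp(0.5 * logvar)
    return mu + torch.randn_like(std) * std

class Decoder(nn.Module):
    # P(X|z): z initialises the hidden state of a 1-layer GRU, which then
    # emits logits over the characters one step at a time.
    def __init__(self, n_chars, hidden_size=64, latent_size=16):
        super().__init__()
        self.latent_to_hidden = nn.Linear(latent_size, hidden_size)
        self.embed = nn.Embedding(n_chars, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_size, n_chars)

    def forward(self, z, target_tensor):
        # Teacher forcing: feed SOS followed by all but the last target letter.
        h = torch.tanh(self.latent_to_hidden(z)).unsqueeze(0)   # (1, 1, hidden)
        inp = torch.cat([torch.tensor([SOS]), target_tensor[:-1]])
        emb = self.embed(inp).unsqueeze(0)                      # (1, seq, hidden)
        out, _ = self.gru(emb, h)
        return self.out(out.squeeze(0))                         # (seq_len, n_chars)
```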

The decoder module is again a 1-layer GRU, and a softmax is applied over its output to obtain the letters. We can train the network in the following way —
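A minimal training loop over single words, assuming the sketched Encoder and Decoder above, could look like this (the word list and hyperparameters are placeholders):

```python
import random
import torch.nn.functional as F

encoder = Encoder(n_chars)
decoder = Decoder(n_chars)
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

def train_step(word_tensor):
    optimizer.zero_grad()
    mu, logvar = encoder(word_tensor)
    z = reparameterize(mu, logvar)
    logits = decoder(z, word_tensor)
    # Reconstruction term: cross-entropy between predicted and actual letters
    # (the softmax over letters is applied inside cross_entropy).
    recon = F.cross_entropy(logits, word_tensor, reduction="sum")
    # KL divergence between Q(z|X) = N(mu, sigma^2) and the prior N(0, I).
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    loss = recon + kld
    loss.backward()
    optimizer.step()
    return loss.item()

words = ["good", "great", "awful"]  # placeholder; use the words extracted from the reviews
for step in range(5000):
    train_step(word_to_tensor(random.choice(words)))
```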

To summarize the training process: we randomly pick a word from the training set, obtain estimates of the parameters of the latent distribution, sample from it, and pass the sample through the decoder to generate letters. The loss is then backpropagated through the network. For a detailed derivation of the loss function, please look into the resources mentioned earlier.

Once the network is trained, you can generate new words with the code below —
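As a sketch, generation samples z from the prior N(0, I) and decodes one letter at a time, again assuming the modules defined above (the sampled latent size of 16 matches the assumption made earlier):

```python
idx2char = {i: c for c, i in char2idx.items()}

def generate(max_len=12):
    # Sample z from the prior and decode step by step, drawing each next
    # letter from the softmax over the decoder output.
    with torch.no_grad():
        z = torch.randn(1, 16)                            # latent_size = 16 assumed
        h = torch.tanh(decoder.latent_to_hidden(z)).unsqueeze(0)
        inp = torch.tensor([SOS])
        letters_out = []
        for _ in range(max_len):
            emb = decoder.embed(inp).unsqueeze(0)         # (1, 1, hidden)
            out, h = decoder.gru(emb, h)
            probs = F.softmax(decoder.out(out.squeeze(0)), dim=-1)
            idx = torch.multinomial(probs, 1).item()
            if idx == EOS:
                break
            letters_out.append(idx2char.get(idx, ""))
            inp = torch.tensor([idx])
        return "".join(letters_out)

print(generate())
```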

The whole code can be found here.