Variational Autoencoders: An Intuitive Explanation & Some Keras Code

Kevin Y. Guo
A.I./Machine Learning Tutorials
8 min read · Jan 28, 2020

Introduction

A twist on normal autoencoders, variational autoencoders (VAEs), introduced in 2013, exploit the statistical structure of training samples to compress data and then reconstruct it.¹ Before diving into VAEs, it’s important to understand a normal autoencoder first. The big picture of an autoencoder is to take an input, compress it, and finally attempt to reconstruct the original input from the compressed representation. For example, we would take an image of an apple, compress it into a latent vector, and from that latent vector attempt to reproduce the original image of the apple.

Fig 1. The basic framework of an autoencoder. An encoder takes input data and converts it into a latent vector that serves as a representation of the input data. The decoder must take that latent vector and produce the original input data.

The encoder is typically a neural network, often convolutional, with one goal: produce a smaller representation of the data. The decoder is also a network, with the opposite goal: reproduce the original data from the encoder’s smaller representation. The overall objective is to learn encodings that are compact yet rich enough to reconstruct from; the latent vector must hold all the information needed for reconstruction.

Fig 2. Intuitive understanding of latent vector. The latent vector simply holds variables that are essential to the image and can be used by the decoder to reconstruct the original image. The dog has a number of characteristics (defined as variables) that the decoder will use.
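
To make the framework concrete, here is a minimal sketch of a plain (non-variational) autoencoder in Keras. It assumes flattened 28×28 images; the 32-dimensional latent vector and the layer sizes are illustrative choices, not part of this article’s model.

from keras.layers import Input, Dense
from keras.models import Model
#encoder: squeeze the 784 input pixels down to a 32-dimensional latent vector
inputs = Input(shape=(784,))
latent = Dense(32, activation='relu')(inputs)
#decoder: try to rebuild the original 784 pixels from the latent vector alone
outputs = Dense(784, activation='sigmoid')(latent)
autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
#training simply asks the network to reproduce its own input, e.g.:
#autoencoder.fit(X_train, X_train, epochs=10, batch_size=100)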

The generative property of VAEs is used in a myriad of applications. VAEs can be used for feature extraction, which allows large amounts of data to be processed in a much smaller space.² Another potential application is behavioral prediction, both human and artificial.³ As a relatively new concept, VAEs have great potential as new applications emerge. Even if your experience is limited, I still encourage you to think of ways to use VAEs to accomplish the seemingly impossible.

*This tutorial assumes knowledge of introductory linear algebra, basic concepts behind convolutional neural networks, and moderate proficiency in Python.

Variational Aspect

Variational autoencoders differ from normal autoencoders in one key property: a continuous latent space. By encoding each input as a vector of means (μ) and a vector of standard deviations (σ) rather than as a single point, VAEs allow for easy random sampling and interpolation.⁴

Fig 3. The basic framework of a variational autoencoder. VAEs have an additional layer containing a mean vector and standard deviation vector.

In a normal autoencoder, the latent space may not be continuous: the encodings can form distinct clusters, making it difficult to interpolate between examples. This is highly unfavorable, since the decoder would not be truly generative; its ability to reconstruct images would come from those distinct, memorized encodings rather than from a useful representation. Essentially, the autoencoder attempts to cheat. To build a genuinely generative model, we need to close this loophole and force the model to train the hard way.

Fig 4. Example latent space of a normal autoencoder vs. a variational autoencoder for images of apples and oranges. In the normal autoencoder latent space, both categories have formed distinct clusters that do not allow easy interpolation of data. In contrast, the latent space of the variational autoencoder is distributed around a common center, allowing for a true generative model.

Loss Function, Reparameterization Trick, and Kullback–Leibler Divergence

In order to train our VAE, we need a loss function that tells the model how to adjust its weights. However, backpropagation hits a fatal problem: there is no way to backpropagate through the sampling node because it is stochastic, so any gradient we computed would itself be a random variable. The fix? The reparameterization trick: instead of treating the sampling node as a single random variable, we rewrite it as z = μ + σ ⊙ x, where x is a newly added variable drawn from a standard normal distribution.¹ ⁵ ⁶ The variables we are learning (μ and σ) are now deterministic, and all of the randomness has been pushed into the extra variable x. Another intuitive way to think about it: the sampling node (z) was a source of randomness, and we are shifting that randomness into the new variable (x) while keeping every variable we need gradients for deterministic.

Fig 5. The reparameterization trick. A clever way to enable backpropagation in a VAE.
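
To see the trick in isolation, here is a tiny NumPy sketch; the μ and σ values are made up purely for illustration and are not part of the article’s model.

import numpy as np
#illustrative values for a 2-dimensional latent vector
mu = np.array([0.3, -1.2])      #learned mean vector
sigma = np.array([0.5, 0.8])    #learned standard deviation vector
#all randomness lives in x, which is sampled independently of mu and sigma
x = np.random.normal(loc=0.0, scale=1.0, size=2)
#z is now a deterministic function of mu and sigma, so gradients can flow through them
z = mu + sigma * x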

Before introducing the loss function, we also need to view the VAE as a pair of conditional probability distributions, since the loss function is ultimately built from them. The encoder can be thought of as the probability of producing a latent vector given the input data, and the decoder as the probability of producing the original input data given a latent vector. With that in mind, the loss function is made of two components: the reconstruction loss and a regularizer.

Fig 6. VAE as conditional probabilities and the loss function. The 𝜃 and ɸ represent the individual sets of weights that will be adjusted during training for the encoder and decoder, respectively.
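
Written out in full (this is simply the standard objective from the references, with q(z|x) denoting the encoder’s distribution and p(x|z) the decoder’s), the loss for a single datapoint x is:

loss(x) = −E_{z∼q(z|x)}[ log p(x|z) ] + KL( q(z|x) ‖ p(z) )

The first term is the reconstruction loss and the second is the regularizer, with the prior p(z) taken to be a standard normal distribution.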

The derivation of the standard VAE loss function is fairly involved, running to nearly thirty steps.⁷ For interested readers, it has been well documented in other papers.⁷

The reconstruction loss, also known as the generative loss, is intuitive: it is the negative log-likelihood of the data under the decoder, measuring how well the VAE was able to reconstruct the original input. For a decoder with a sigmoid output interpreted as a Bernoulli likelihood over pixel values, this term becomes a binary cross-entropy, which is what the code later in this tutorial uses.

On top of the reconstruction loss, we add a regularizer: the Kullback–Leibler (KL) divergence between the distribution of our latent vector and a normal distribution with a mean of 0 and a standard deviation of 1. The KL divergence measures how far the encoder’s distribution is from that zero-mean, unit-variance Gaussian. Adding it keeps the encoded latent vectors packed around a common center and stops the model from giving similar inputs wildly different representations. This prevents the VAE from cheating by memorizing inputs and scattering their representations across arbitrary regions of the latent space.
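
For an encoder that outputs a diagonal Gaussian with means μⱼ and standard deviations σⱼ, this regularizer has a simple closed form (restated here from the references; it is what the kl_loss term in the code below computes):

KL( N(μ, σ²) ‖ N(0, 1) ) = −½ Σⱼ ( 1 + log σⱼ² − μⱼ² − σⱼ² )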

Coding an MNIST-trained VAE

Let’s take what we just learned and code a VAE trained on the MNIST dataset, containing 60,000 training examples and 10,000 test examples of handwritten digits. First, we need to import key functions and the MNIST dataset itself to use.

import numpy as np
from keras.layers import Input, Dense, Lambda, Layer, Add, Multiply
from keras.models import Model, Sequential
import matplotlib.pyplot as plt
from scipy.stats import norm
from keras import backend as K
#import and load MNIST database
from keras.datasets import mnist

Next, we need to load and format the MNIST dataset. We don’t need the training labels since the loss function does not require them in order to penalize our model. The test labels will be used later to plot the latent space on a graph.

#leave out the training labels, as they are not necessary
(X_train, _), (X_test, y_test) = mnist.load_data()
#scale pixels to [0, 1] to match the sigmoid decoder output and binary cross-entropy loss
X_train = X_train.astype(np.float32) / 255.
X_train = X_train.reshape(60000, 784)
X_test = X_test.astype(np.float32) / 255.
X_test = X_test.reshape(10000, 784)

The hyperparameters used for our model are ultimately adjustable, allowing for different results on each run. Therefore, we want to decide on them early and keep them at the top of the script for experimentation and empirical observation.

#for reproducible results
seed = 1
np.random.seed(seed)
#hyperparameters
epochs = 200
batch_size = 100
starting_dim = 784
hidden_dim = 256
latent_dim = 2

Now, everything is in place to start building a VAE. Let’s build it in the most intuitive way possible: step by step, starting with the encoder. The encoder will be simple and without convolutions. After setting up the input dimensions, we need a hidden layer that leads into our mean vector and standard deviation vector (in practice, the network outputs the log of the standard deviation, which is numerically more stable).

#encoder
e = Input(batch_shape=(batch_size, starting_dim))
hidden_layer = Dense(hidden_dim, activation='relu')(e)
latent_mean = Dense(latent_dim)(hidden_layer)
latent_log_sigma = Dense(latent_dim)(hidden_layer)

Next, we need to make a sampling function in order to produce our sampling node from the mean and standard deviation vectors. Remember that we’re allowed to do this process due to the reparameterization trick. We use a Keras function called Lambda, which allows us to create a specialized layer just for our sampling node.

#sampling function
def sampling(args):
    z_mean, z_log_sigma = args
    x = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=1.0)
    #reparameterization trick: z = μ + σ⊙x, where σ = exp(log σ)
    return z_mean + K.exp(z_log_sigma) * x
#wrap z as a layer
z = Lambda(sampling)([latent_mean, latent_log_sigma])

The last piece of the VAE is the decoder, which can intuitively be thought of as the exact opposite of the encoder. To keep the complexity down, this decoder also does not contain deconvolutions.

#decoder
decoder_hidden_layer = Dense(hidden_dim, activation='relu')
decoder_mean = Dense(starting_dim, activation='sigmoid')
decoded_hidden = decoder_hidden_layer(z)
decoder_output = decoder_mean(decoded_hidden)

Now it’s time to connect the dots and create the VAE. Using the Keras functional API, we’ll construct not only the full VAE but also a standalone encoder and a standalone decoder (also known as a generator). We will explain why we need the latter two in a bit.

#end-to-end VAE
VAE = Model(e, decoder_output)
#standalone encoder: input image -> latent mean
encoder = Model(e, latent_mean)
#standalone generator: latent vector -> image, reusing the trained decoder layers
gen_start = Input(shape=(latent_dim,))
gen_hidden_layer = decoder_hidden_layer(gen_start)
gen_output = decoder_mean(gen_hidden_layer)
generator = Model(gen_start, gen_output)

With the VAE model constructed, we can now train it. However, every network needs a loss function first. Ours splits into the reconstruction loss and the KL divergence, exactly as described above. After defining the loss function, we have all the proper tools to start training the VAE.

def vae_loss(before, after):
    #reconstruction loss: how well the decoder rebuilt the input
    reconstruction_loss = K.sum(K.binary_crossentropy(before, after), axis=1)
    #KL divergence of N(μ, σ²) from N(0, 1); latent_log_sigma holds log σ, so log σ² = 2·log σ
    kl_loss = - 0.5 * K.sum(1 + 2 * latent_log_sigma - K.square(latent_mean) - K.exp(2 * latent_log_sigma), axis=-1)
    return reconstruction_loss + kl_loss
VAE.compile(optimizer='rmsprop', loss=vae_loss)
VAE.fit(X_train, X_train,
        batch_size=batch_size,
        epochs=epochs,
        verbose=1)

Finally, to see our results come through, let’s plot the latent space and generate some images from the model. The standalone encoder and generator built earlier handle these two tasks, respectively: the encoder maps test images into the latent space, and the generator decodes latent points back into images.

#plot latent space
points = encoder.predict(X_test, batch_size=batch_size)
plt.figure(figsize=(8, 8))
plt.scatter(points[:, 0], points[:, 1], c=y_test)
plt.colorbar()
plt.show()
#plot example generated images from the decoder on a 15x15 grid
dimensions = 15
image_dimension_size = 28
figure = np.zeros((image_dimension_size * dimensions, image_dimension_size * dimensions))
#spread the grid over the latent space using Gaussian percentiles
x_axis = norm.ppf(np.linspace(0.05, 0.95, dimensions))
y_axis = norm.ppf(np.linspace(0.05, 0.95, dimensions))
for i, yi in enumerate(x_axis):
    for j, xi in enumerate(y_axis):
        latent_sample = np.array([[xi, yi]])
        x_decoded = generator.predict(latent_sample)
        gen_num = x_decoded[0].reshape(image_dimension_size, image_dimension_size)
        figure[i * image_dimension_size: (i + 1) * image_dimension_size,
               j * image_dimension_size: (j + 1) * image_dimension_size] = gen_num
plt.figure(figsize=(10, 10))
plt.imshow(figure, cmap='Greys_r')
plt.show()
Fig 7. Latent plot as created by the encoder. The different digit classes occupy their own regions while remaining roughly normally distributed around a common center.
Fig 8. Generator (decoder) results.
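
One nice way to see the continuous latent space in action is to interpolate between two encoded test digits and decode each intermediate point. The quick sketch below reuses the points array, the generator model, and the plotting imports from above; the names start, end, steps, and z_interp are just for illustration.

#pick two encoded test digits from the latent plot above
start, end = points[0], points[1]
#linearly interpolate between their latent vectors and decode each intermediate point
steps = 10
for t in np.linspace(0.0, 1.0, steps):
    z_interp = (1 - t) * start + t * end
    decoded = generator.predict(np.array([z_interp]))
    plt.imshow(decoded[0].reshape(28, 28), cmap='Greys_r')
    plt.show()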

Check out the full code: https://github.com/kev-guo/MNIST-VAE

Acknowledgments

This paper was written as part of a RETINA-AI Health Inc. summer internship.

References

[1]: Diederik P. Kingma and Max Welling, “Auto-Encoding Variational Bayes,” 2013; arXiv:1312.6114, http://arxiv.org/abs/1312.6114.

[2]: H. Nishizaki, “Data augmentation and feature extraction using variational autoencoder for acoustic modeling,” 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, 2017, pp. 1222–1227.

[3]: Haowen Xu, Wenxiao Chen, Nengwen Zhao, Zeyan Li, Jiahao Bu, Zhihan Li, Ying Liu, Youjian Zhao, Dan Pei, Yang Feng, Jie Chen, and Zhaogang Wang, “Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications,” 2018; arXiv:1802.03903, http://arxiv.org/abs/1802.03903. DOI: 10.1145/3178876.3185996.

[4]: Irhum Shafkat, “Intuitively Understanding Variational Autoencoders,” Towards Data Science, Feb 4, 2018; https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf.

[5]: D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.

[6]: Ming Xu, Matias Quiroz, and Robert Kohn, “Variance reduction properties of the reparameterization trick,” 2018; arXiv:1809.10330, http://arxiv.org/abs/1809.10330.

[7]: Stephen Odaibo, “Tutorial: Deriving the Standard Variational Autoencoder (VAE) Loss Function,” 2019; arXiv:1907.08956, http://arxiv.org/abs/1907.08956.
