A Gentle Introduction to Variational Autoencoders

Imagine this: you’ve spent forever scouring the internet for images, and you’ve finally found the perfect one for your presentation. You save the image and drop it into your slides, when you realize: the image has a watermark! You angrily pick up your water bottle to throw at the computer when you remember a program you built a while back in your AI class: the perfect way to remove watermarks, using autoencoders.
Well, that’s a bit of an understatement of what autoencoders can do, but an important application nonetheless! Autoencoders are used for a wide variety of tasks, from dimensionality reduction to image generation to feature extraction. They let you replicate the works of Picasso, scale down terabytes of data, and denoise grainy footage from security cameras. Let’s first look at how general autoencoders work, and then we’ll talk about variational autoencoders.
The Basics of Autoencoders

Above is the simplest representation of an autoencoder. It consists of three major parts: the encoder, the bottleneck, and the decoder.
The encoder is the part of the model that learns to compress the input data into an encoded representation that the network can later use to reconstruct the image. The encoder generally takes the form of a simple CNN with convolution and dropout layers. When coding an encoder, I find that a Leaky ReLU activation function also works better than a normal ReLU activation function. A sample encoder that takes in a 28x28 image, returns a bottleneck layer of size 8, and uses a Leaky ReLU activation function is shown below:
def lrelu(x, alpha=0.3):
    # Leaky ReLU: pass positive values through, scale negative values by alpha
    return tf.maximum(x, tf.multiply(x, alpha))

def encoder(X_in, keep_prob=0.8):
    activation = lrelu
    with tf.variable_scope("encoder", reuse=tf.AUTO_REUSE):
        # Treat the input as a batch of single-channel 28x28 images
        X = tf.reshape(X_in, shape=[-1, 28, 28, 1])
        # Three convolution blocks, each followed by dropout for regularization
        x = tf.layers.conv2d(X, filters=64, kernel_size=4, strides=2, padding='same', activation=activation)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=2, padding='same', activation=activation)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d(x, filters=64, kernel_size=4, strides=1, padding='same', activation=activation)
        x = tf.nn.dropout(x, keep_prob)
        # Flatten and project down to the 8-unit bottleneck
        x = tf.contrib.layers.flatten(x)
        x = tf.layers.dense(x, units=8)
        return x
The bottleneck, or the area between the encoder and the decoder, is the compressed form of the input data. The data is encoded into a latent space of n dimensions, where n is the number of outputs in the bottleneck layer. It is important to remember that n is a hyperparameter that you set: the larger n is, the more closely the bottleneck can represent the actual image, but the more storage its representation requires. Bottlenecks can be used for feature extraction and image compression, as the original image is squeezed into far fewer dimensions, thereby requiring less storage to hold.
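To make the compression concrete: with the 8-unit bottleneck above, each 28x28 image (784 pixel values) is summarized by just 8 numbers. A hypothetical usage sketch, assuming a trained tf.Session named sess and a NumPy batch named images (both names are mine, not part of the code above):

# Compress a batch of 28x28 images into 8-dimensional latent codes
X_in = tf.placeholder(tf.float32, shape=[None, 28, 28])
codes = sess.run(encoder(X_in), feed_dict={X_in: images})
print(codes.shape)  # (batch_size, 8): 784 pixels squeezed into 8 numbers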
The decoder takes this compressed input and tries to reconstruct the original image from the encoded representation. The decoder once again takes the form of a simple CNN with convolution and dropout layers. The model is trained by comparing the original image to the reconstructed image; the difference between them is the reconstruction loss, which is minimized as the network is updated. A sample decoder that takes in a bottleneck layer of 8 inputs, returns a 28x28 image, and uses a Leaky ReLU activation function can be seen below:
def decoder(z, keep_prob=0.8):
    with tf.variable_scope("decoder", reuse=tf.AUTO_REUSE):
        # Expand the bottleneck vector and reshape it into a small 7x7 feature map
        x = tf.layers.dense(z, units=8, activation=lrelu)
        x = tf.layers.dense(x, units=49, activation=lrelu)
        x = tf.reshape(x, [-1, 7, 7, 1])
        # Transposed convolutions upsample back toward the original resolution
        x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=2, padding='same', activation=lrelu)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=1, padding='same', activation=lrelu)
        x = tf.nn.dropout(x, keep_prob)
        x = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4, strides=1, padding='same', activation=lrelu)
        # Flatten and map back to 784 pixels, with a sigmoid to keep values in [0, 1]
        x = tf.contrib.layers.flatten(x)
        x = tf.layers.dense(x, units=28*28, activation=tf.nn.sigmoid)
        img = tf.reshape(x, shape=[-1, 28, 28])
        return img
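To tie the two halves together, here is a minimal training sketch under my own assumptions (the placeholder name X_in, the choice of mean squared error, and the stand-in get_next_batch data function are all mine):

import tensorflow as tf

X_in = tf.placeholder(tf.float32, shape=[None, 28, 28], name='X')
reconstructed = decoder(encoder(X_in))

# Reconstruction loss: how far the rebuilt image is from the original
loss = tf.reduce_mean(tf.squared_difference(reconstructed, X_in))
train_op = tf.train.AdamOptimizer(learning_rate=0.0005).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(2000):
        batch = get_next_batch()  # hypothetical: a (64, 28, 28) array of MNIST images
        sess.run(train_op, feed_dict={X_in: batch})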
A representation below shows how watermarks and noise can be removed using autoencoders. Instead of computing the reconstruction loss between the input image and the decoded image, we compute it between the clean, noise-free image and the decoded image.
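A sketch of that idea, reusing the X_in placeholder from the training sketch above and Gaussian noise as the corruption (both choices are mine):

# Corrupt the input, but score the reconstruction against the clean image
noisy = X_in + 0.3 * tf.random_normal(tf.shape(X_in))
denoised = decoder(encoder(noisy))
denoise_loss = tf.reduce_mean(tf.squared_difference(denoised, X_in))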

Variational Autoencoders
Note: Variational autoencoders are slightly more complex than general autoencoders and require knowledge of concepts such as normal distributions, sampling, and some linear algebra.

Variational autoencoders build on the concept of general autoencoders, but instead of the decoder taking in the bottleneck vector directly, it now takes in a sample drawn from the bottleneck vector. This helps prevent overfitting, because the decoder is never trained on the exact encoded points themselves, only on samples scattered around them. To further reduce overfitting, the distribution being sampled from is pushed toward a standard normal distribution, N(0, 1), during training.
Simply put, a variational autoencoder is one whose training is regularized to avoid overfitting and to ensure that the latent space has properties that enable the generative process. It samples points from the latent space of an encoded vector and passes those samples to the decoder as inputs.
On a deeper level, the encoded vector is split into two vectors: a mean vector and a standard deviation vector. Backpropagation is run through these two vectors to update the weights of both the encoder and the decoder.
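The reason backpropagation can flow through a random sampling step at all is the reparameterization trick: instead of sampling z directly, we sample noise ε from N(0, 1) and shift and scale it using the two vectors. A minimal sketch, with names of my own choosing (features would be the encoder’s flattened output; I parameterize the spread as a log-variance, a common choice for numerical stability):

def sample_z(features, n_latent=8):
    # Two heads on top of the encoder output: a mean and a log-variance vector
    mn = tf.layers.dense(features, units=n_latent)
    log_var = tf.layers.dense(features, units=n_latent)
    # Reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)
    epsilon = tf.random_normal(tf.shape(mn))
    z = mn + tf.exp(log_var / 2.0) * epsilon
    return z, mn, log_var

The sampled z is then fed to the decoder in place of the plain bottleneck vector. You may be wondering: does the loss function used to train the network remain the same as for a general autoencoder?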
Not quite. Although the reconstruction loss is still part of the total loss, another term is added to it: a regularization loss, the Kullback-Leibler (KL) divergence, which pushes the distribution returned by the encoder (defined by the mean vector and the standard deviation vector) toward a standard normal distribution. Assuming a decoder d and a sample z drawn from the encoded distribution N(μ, σ), the loss function is as follows:

loss = ||x − d(z)||² + KL(N(μ, σ), N(0, 1))
In less mathematical terms, the regularity that this loss encourages in the latent space can be described with two properties: continuity and completeness. The latent space strives to satisfy both while training. Continuity is the condition that two close points in the latent space should not decode to two completely different contents. Completeness is the condition that any point sampled from the latent space should decode to meaningful content.
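In code, the loss above might look like the following sketch, reusing the hypothetical mn and log_var heads from earlier and assuming decoded is the decoder’s 28x28 output:

# Reconstruction term: pixel-wise squared error between input and output
recon_loss = tf.reduce_sum(
    tf.squared_difference(tf.reshape(decoded, [-1, 28 * 28]),
                          tf.reshape(X_in, [-1, 28 * 28])), axis=1)
# Regularization term: closed-form KL divergence between N(mn, sigma) and N(0, 1)
kl_loss = -0.5 * tf.reduce_sum(
    1.0 + log_var - tf.square(mn) - tf.exp(log_var), axis=1)
vae_loss = tf.reduce_mean(recon_loss + kl_loss)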
Project Idea
Now that you’ve learned the theory behind variational autoencoders, it’s time to put it to the test by coding one up yourself. Your first project will be generating digits that resemble those from the MNIST dataset using TensorFlow. The final code can be seen here:
Good luck, and I hope this project shows you the incredible power of variational autoencoders! From applications in film to security, variational autoencoders will undoubtedly be a driving force in AI for years to come.
TL;DR
- Autoencoders serve a variety of functions, from removing noise to generating images to compressing images.
- General autoencoders consist of three parts: an encoder, a bottleneck, and a decoder. The bottleneck is the compressed form of your image in n dimensions, where n is the number of outputs in the bottleneck layer.
- General autoencoders are trained using a reconstruction loss, which measures the difference between the reconstructed and original image.
- Variational autoencoders are mostly the same, but the decoder takes in a sample of the bottleneck vector, drawn from a distribution pushed toward a standard normal, to reduce overfitting.
Further Reading
Note: These are listed in order from easiest to hardest to understand. I recommend starting with the resource at the top!
- Article: Autoencoders: Its Basics and Uses
- Video: Variational Autoencoders
- Article: Understanding Variational Autoencoders
- Article: Using a Variational Autoencoder to Draw MNIST Characters