Implementing Variational Autoencoders from scratch

Konstantin Sofeikov
11 min read · Apr 25, 2023


In this article we will implement variational autoencoders from scratch, in Python.

What are autoencoders and what purpose do they serve

An autoencoder is a neural architecture that consists of two parts: an encoder and a decoder. The decoder follows the encoder, and in the middle sits a so-called hidden layer that goes by various names: bottleneck layer, latent space, encoding layer or code layer. It looks something like this:

The architecture of an autoencoder

Autoencoders serve various purposes. Probably the most obvious one is data compression: when the input signal is passed through the encoder part, the resulting latent representation is much smaller in size. For example, in the image above, while the input signal is represented with 8 values, its compressed representation only takes 3.

Autoencoders can serve various other purposes too: data denoising, feature learning, anomaly detection and, as we shall see later in the article, generative modelling. Here are a few papers that you may find interesting to read:

  1. Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. doi: 10.1126/science.1127647. Link: http://www.cs.toronto.edu/~hinton/absps/science.pdf
  2. Bengio, Y., Lamblin, P., Popovici, D., & Larochelle, H. (2007). Greedy Layer-Wise Training of Deep Networks. In Advances in Neural Information Processing Systems 19 (pp. 153–160). Link: https://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf
  3. Le, Q. V., Ngiam, J., & Coates, A. (2013). Building high-level features using large scale unsupervised learning. In International Conference on Machine Learning (pp. 91–99). Link: http://proceedings.mlr.press/v28/le13.pdf
  4. Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In International Conference on Learning Representations. Link: https://arxiv.org/abs/1312.6114

Autoencoder implementation

We will be working with the well-known MNIST dataset. To download MNIST into your local folder, run this:

# Download the files (imports needed for this snippet)
import gzip
from urllib import request

import numpy as np

url = "http://yann.lecun.com/exdb/mnist/"
filenames = ['train-images-idx3-ubyte.gz', 'train-labels-idx1-ubyte.gz',
             't10k-images-idx3-ubyte.gz', 't10k-labels-idx1-ubyte.gz']
data = []
for filename in filenames:
    print("Downloading", filename)
    request.urlretrieve(url + filename, filename)
    with gzip.open(filename, 'rb') as f:
        if 'labels' in filename:
            # Load the labels as a one-dimensional array of integers
            data.append(np.frombuffer(f.read(), np.uint8, offset=8))
        else:
            # Load the images as a two-dimensional array of flattened 28x28 pixels
            data.append(np.frombuffer(f.read(), np.uint8, offset=16).reshape(-1, 28 * 28))

# Split into training and testing sets
X_train, y_train, X_test, y_test = data

# Normalize the pixel values to [0, 1]
X_train = X_train.astype(np.float32) / 255.0
X_test = X_test.astype(np.float32) / 255.0

# Convert labels to integers
y_train = y_train.astype(np.int64)
y_test = y_test.astype(np.int64)

So now we have our training and testing sets, and we are ready to implement an architecture. But first, let's take a look at our images with this helper:

import matplotlib.pyplot as plt


def show_images(images, labels):
    """
    Display a set of images and their labels using matplotlib.
    `images` holds the flattened 28x28 pixel rows, and `labels`
    holds the corresponding label for each image.
    """
    # Reshape the flattened pixels into 28x28 images
    pixels = images.reshape(-1, 28, 28)

    # Create a figure with one subplot per image
    fig, axs = plt.subplots(
        ncols=len(images), nrows=1, figsize=(10, 3 * len(images))
    )

    # Loop over the images and display them with their labels
    for i in range(len(images)):
        # Display the image and its label
        axs[i].imshow(pixels[i], cmap="gray")
        axs[i].set_title("Label: {}".format(labels[i]))

        # Remove the tick marks and axis labels
        axs[i].set_xticks([])
        axs[i].set_yticks([])
        axs[i].set_xlabel("Index: {}".format(i))

    # Adjust the spacing between subplots
    fig.subplots_adjust(hspace=0.5)

    # Show the figure
    plt.show()
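For example, we can preview the first few training digits together with their labels:

# Show the first five training images together with their labels
show_images(X_train[:5], y_train[:5])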

The autoencoder architecture is really simple:

import torch.nn as nn


class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()

        # Set the number of hidden units
        self.num_hidden = 8

        # Define the encoder part of the autoencoder
        self.encoder = nn.Sequential(
            nn.Linear(784, 256),  # input size: 784, output size: 256
            nn.ReLU(),  # apply the ReLU activation function
            nn.Linear(256, self.num_hidden),  # input size: 256, output size: num_hidden
            nn.ReLU(),  # apply the ReLU activation function
        )

        # Define the decoder part of the autoencoder
        self.decoder = nn.Sequential(
            nn.Linear(self.num_hidden, 256),  # input size: num_hidden, output size: 256
            nn.ReLU(),  # apply the ReLU activation function
            nn.Linear(256, 784),  # input size: 256, output size: 784
            nn.Sigmoid(),  # apply the sigmoid activation function to compress the output to a range of (0, 1)
        )

    def forward(self, x):
        # Pass the input through the encoder
        encoded = self.encoder(x)
        # Pass the encoded representation through the decoder
        decoded = self.decoder(encoded)
        # Return both the encoded representation and the reconstructed output
        return encoded, decoded

In order to train this model, we do not actually need the image labels, because we are dealing with an unsupervised approach. We use a simple Mean Squared Error loss, since we want to reconstruct our images as precisely as possible. Let's do some preparatory work:

import torch
import torch.optim as optim

# Hyperparameters
learning_rate = 1e-3
batch_size = 32
num_epochs = 10

# Set the device to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Convert the training data to PyTorch tensors
X_train = torch.from_numpy(X_train)

# Create the autoencoder model and optimizer
model = AutoEncoder()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Define the loss function
criterion = nn.MSELoss()

# Move the model to the chosen device
model.to(device)

# Create a DataLoader to handle batching of the training data
train_loader = torch.utils.data.DataLoader(
    X_train, batch_size=batch_size, shuffle=True
)

Finally, the training loop is really simple, with nothing unexpected:

# Training loop
for epoch in range(num_epochs):
    total_loss = 0.0
    for batch_idx, data in enumerate(train_loader):
        # Get a batch of training data and move it to the device
        data = data.to(device)

        # Forward pass
        encoded, decoded = model(data)

        # Compute the loss and perform backpropagation
        loss = criterion(decoded, data)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Update the running loss
        total_loss += loss.item() * data.size(0)

    # Print the epoch loss
    epoch_loss = total_loss / len(train_loader.dataset)
    print(
        "Epoch {}/{}: loss={:.4f}".format(epoch + 1, num_epochs, epoch_loss)
    )

To calculate the loss, we simply take the input images and compare them to their reconstructed versions. The training should only take a few seconds, and if we compare the output and the input images, this is what we see:

The top row shows the original images, with the reconstructed images depicted in the bottom row.
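One possible way to produce such a comparison, reusing `show_images` (a rough sketch, assuming the trained `model` and the tensors prepared above):

import torch

# Reconstruct a handful of training digits with the trained autoencoder
model.eval()
with torch.no_grad():
    originals = X_train[:5].to(device)   # a few flattened 28x28 images
    _, reconstructed = model(originals)

# First call shows the originals, the second their reconstructions
show_images(originals.cpu().numpy(), y_train[:5])
show_images(reconstructed.cpu().numpy(), y_train[:5])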

There are a few interesting observations to make:

  1. The reconstructed images are blurry. The compression is lossy, so the reconstruction is not perfect.
  2. The last column demonstrates that this type of compression is not free: it comes at the price of mistakes made during decoding.
  3. We only use 8 hidden units; increasing the number of hidden units improves image quality, whereas decreasing it makes the blur worse.

Here it is with 2 hidden units instead of 8:

Not only do we see substantially blurrier images, but the autoencoder also makes some obvious mistakes when it tries to decode the latent space.

With 32 elements in the middle:

Interestingly, we do not see much improvement over using just 8 neurons in the middle. That is not very surprising, albeit hard to predict; in this sort of system, only experiments can show what works best. It seems that for this specific task, using 32 units for the latent space is overkill.

The concept above works well enough. However, there is something this architecture is very weak at: it is very hard to generate new data. Think about it: if we throw away the encoder part and start with just the latent layer, we should be able to arrive at a meaningful image at the end. However, it is unclear how to sample the latent space in a meaningful way, that is, how to come up with a reliable sampling strategy that would ensure that the output images are readable and not just random noise.

While it may actually be possible to sample a 2- or 3-dimensional latent space by hand, for more complicated objects it becomes much harder. Let's try to do that. First of all, here are the latent embeddings of all images in the training set.

Embeddings of all images in the training set. Given how some of them overlap, it is not very surprising that certain images are blurred and look alike (say, 3 vs 8).
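One way to produce such a scatter plot is to push the training set through the encoder of an autoencoder trained with `num_hidden = 2` and plot the two latent coordinates, coloured by digit label (a sketch under those assumptions):

import matplotlib.pyplot as plt
import torch

# Encode the whole training set with the 2-dimensional autoencoder
model.eval()
with torch.no_grad():
    embeddings, _ = model(torch.as_tensor(X_train).to(device))
embeddings = embeddings.cpu().numpy()

# Scatter the two latent coordinates, coloured by digit label
plt.figure(figsize=(8, 6))
plt.scatter(embeddings[:, 0], embeddings[:, 1], c=y_train, cmap="tab10", s=2)
plt.colorbar(label="Digit label")
plt.xlabel("Latent dimension 1")
plt.ylabel("Latent dimension 2")
plt.show()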

Now what we do is generate a bunch of samples from this latent space. It is a bit hard to generate samples from this exact latent-space distribution, so we generate samples from its bounding rectangle instead. This is what we get:

Sampling the learned latent space of the autoencoder.
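One way to sample that rectangle, reusing the `embeddings` from the sketch above and the decoder of the 2-dimensional autoencoder (again a sketch, not the article's exact code):

import numpy as np
import torch

num_samples = 10

# Bounding rectangle of the learned 2-D embeddings
lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)

# Draw uniform samples inside that rectangle and decode them into images
z = np.random.uniform(lo, hi, size=(num_samples, 2)).astype(np.float32)
with torch.no_grad():
    generated = model.decoder(torch.from_numpy(z).to(device)).cpu().numpy()

# Reuse the plotting helper; there are no true labels for generated digits
show_images(generated, labels=["?"] * num_samples)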

While some of the samples look quite alright, it becomes much more difficult to sample the same space in a meaningful way if the dimensionality of that space is higher. For example, if we increase the dimensionality to 32, this is what we get out of a similar experiment:

Digits are not recognisable anymore, and you have to sample far longer to get something that makes visual sense. Is there a better way?

Variational Autoencoders

The original paper introducing Variational Autoencoders (VAEs) is titled “Auto-Encoding Variational Bayes” and was published by Diederik P. Kingma and Max Welling in 2014. You can find the paper on the arXiv preprint server at the following link: https://arxiv.org/abs/1312.6114

VAEs give us a more flexible way to learn a latent representation of a domain. The basic idea is very simple: we learn the parameters of a distribution over the latent space, and sample from that distribution. Here is how it looks on a diagram:

We use input data and learn vectors of means and variances to later use them to sample the latent variable that is to be decoded with the Decoder

This is great, but there is a problem with this setup, specifically with the block that samples the latent variables: the sampling operation is not differentiable. To get around this problem, there is something called the re-parametrization trick. It works like this: instead of having samples come directly out of that block, we learn the two vectors of means and variances explicitly, and then have an independent block that just samples noise from a standard normal distribution.

So instead of directly sampling those distributions, we compute

L_i = μ_i + σ_i · ε_i,  where ε_i ~ N(0, 1),

where L_i represents a component of the latent representation. So our diagram now looks like this:

Now our process is differentiable. Let’s code it:

class VAE(AutoEncoder):
    def __init__(self):
        super().__init__()
        # Add mu and log_var layers for reparameterization
        self.mu = nn.Linear(self.num_hidden, self.num_hidden)
        self.log_var = nn.Linear(self.num_hidden, self.num_hidden)

    def reparameterize(self, mu, log_var):
        # Compute the standard deviation from the log variance
        std = torch.exp(0.5 * log_var)
        # Generate random noise using the same shape as std
        eps = torch.randn_like(std)
        # Return the reparameterized sample
        return mu + eps * std

    def forward(self, x):
        # Pass the input through the encoder
        encoded = self.encoder(x)
        # Compute the mean and log variance vectors
        mu = self.mu(encoded)
        log_var = self.log_var(encoded)
        # Reparameterize the latent variable
        z = self.reparameterize(mu, log_var)
        # Pass the latent variable through the decoder
        decoded = self.decoder(z)
        # Return the encoded output, decoded output, mean, and log variance
        return encoded, decoded, mu, log_var

    def sample(self, num_samples):
        with torch.no_grad():
            # Generate random noise
            z = torch.randn(num_samples, self.num_hidden).to(device)
            # Pass the noise through the decoder to generate samples
            samples = self.decoder(z)
            # Return the generated samples
            return samples

Now the key is how we train this model. Let's define our loss function:

import torch.nn.functional as F

# Define a loss function that combines binary cross-entropy and Kullback-Leibler divergence
def loss_function(recon_x, x, mu, logvar):
    # Compute the binary cross-entropy loss between the reconstructed output and the input data
    BCE = F.binary_cross_entropy(recon_x, x.view(-1, 784), reduction="sum")
    # Compute the Kullback-Leibler divergence between the learned latent variable distribution and a standard Gaussian distribution
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Combine the two losses by adding them together and return the result
    return BCE + KLD

The first component we are already familiar with: it is just the reconstruction error. The second component penalizes the learned distribution for deviating too much from the prior distribution. Our prior is just a set of zero-centred, unit-variance normal distributions.
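For reference, when the learned posterior is a diagonal Gaussian with parameters μ and σ² and the prior is a standard Gaussian, this KL term has the well-known closed form

KLD = -0.5 · Σ_i (1 + log σ_i² − μ_i² − σ_i²),

which is exactly what the KLD line in the code above computes, with log σ_i² stored as `logvar`.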

So to train the model we can simply use:

def train_vae(X_train, learning_rate=1e-3, num_epochs=10, batch_size=32):
    # Convert the training data to PyTorch tensors
    X_train = torch.from_numpy(X_train).to(device)

    # Create the autoencoder model and optimizer
    model = VAE()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Define the loss function
    criterion = nn.MSELoss(reduction="sum")

    # Set the device to GPU if available, otherwise use CPU
    model.to(device)

    # Create a DataLoader to handle batching of the training data
    train_loader = torch.utils.data.DataLoader(
        X_train, batch_size=batch_size, shuffle=True
    )

    # Training loop
    for epoch in range(num_epochs):
        total_loss = 0.0
        for batch_idx, data in enumerate(train_loader):
            # Get a batch of training data and move it to the device
            data = data.to(device)

            # Forward pass
            encoded, decoded, mu, log_var = model(data)

            # Compute the loss and perform backpropagation
            KLD = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
            loss = criterion(decoded, data) + 3 * KLD
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Update the running loss
            total_loss += loss.item() * data.size(0)

        # Print the epoch loss
        epoch_loss = total_loss / len(train_loader.dataset)
        print(
            "Epoch {}/{}: loss={:.4f}".format(epoch + 1, num_epochs, epoch_loss)
        )

    # Return the trained model
    return model

Finally, we can generate some images:

Notice that the images look blurry. That happens because we use MSE for reconstruction quality control.
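A minimal sketch of how these samples can be produced and displayed, reusing the pieces defined above (assuming `X_train` is still the normalized NumPy array that `train_vae` expects):

# Train the VAE and draw new digits from the standard normal prior.
# Note: train_vae expects a NumPy array and converts it to a tensor internally.
vae = train_vae(X_train, num_epochs=10)

# Sample a few latent vectors and decode them into images
samples = vae.sample(num_samples=5)

# Reuse the plotting helper; the labels here are just placeholder captions
show_images(samples.cpu().numpy(), labels=["generated"] * 5)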

One last thing that is quite fun to do is to generate a random latent vector, then gradually change one of its components while keeping the others fixed. That way we can see how the output of the decoder changes. Here are a few examples:

Ignore the label captions. Notice how we generated a vector, then by changing one of its components we moved from 0 to 9, and then to 1 or 7.

And sometimes there is no change.
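A rough sketch of such a latent traversal, assuming the trained `vae` from the sketch above (the dimension index and the sweep range are arbitrary choices):

import torch

dim_to_vary = 0                          # which latent component to sweep (arbitrary)
steps = torch.linspace(-3.0, 3.0, 10)    # values to assign to that component

with torch.no_grad():
    # Start from a single random latent vector and repeat it for every step
    z = torch.randn(1, vae.num_hidden).to(device).repeat(len(steps), 1)
    # Vary only the chosen component, keeping the others fixed
    z[:, dim_to_vary] = steps.to(device)
    # Decode the whole sweep at once
    traversal = vae.decoder(z).cpu().numpy()

# Caption each image with the value of the varied component
show_images(traversal, labels=["{:.1f}".format(s.item()) for s in steps])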

Conclusion

In conclusion, implementing autoencoders from scratch is a useful exercise to understand the basics of unsupervised learning and data compression. While simple autoencoders can reconstruct images, they struggle with generating new data. Variational autoencoders (VAEs) offer a more flexible approach by learning parameters of a distribution of the latent space that can be sampled to generate new data. The re-parametrization trick is used to make the sampling operation differentiable. VAEs offer an improvement over simple autoencoders and have many applications, such as generative modelling, anomaly detection, and feature learning.
