Autoencoders: Basics and Beyond

Hello fam! In this blog, our main goal is to learn about autoencoders and build one from scratch. This implementation uses PyTorch, so if you are not familiar with it, I recommend this YouTube video by Daniel Bourke; it's extensive but worth it. Let's get right into it.

What is an Autoencoder?

Autoencoders are a type of artificial neural network used primarily for unsupervised learning. They are designed to learn a compressed, or latent, representation of data and then reconstruct the original data from this compressed form.

To keep it simple, think of an autoencoder as a neural network that learns to recreate the images it was trained on. For example, given the MNIST dataset, an autoencoder can learn to recreate images of handwritten digits from their latent/compressed representations.

Autoencoder Architecture

The basic architecture of an autoencoder consists of the following components:

  1. Input Layer: Takes the original data as input.
  2. Encoder: A series of layers that compress the input data into a smaller, latent representation.
  3. Latent Space: The compressed representation of the input data, also known as the bottleneck layer.
  4. Decoder: A series of layers that reconstruct the original data from the latent representation.
  5. Output Layer: Produces the reconstructed data, ideally similar to the original input.
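
To make these five pieces concrete, here is a minimal sketch in PyTorch. The class name and layer sizes are just placeholders I picked for illustration; we will build a deeper version later in this post.

import torch
from torch import nn

class TinyAutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(   # (1)+(2): takes the input and compresses it
            nn.Linear(28*28, 64),
            nn.ReLU(),
            nn.Linear(64, 9),           # (3): latent space / bottleneck
        )
        self.decoder = nn.Sequential(   # (4): reconstructs from the latent code
            nn.Linear(9, 64),
            nn.ReLU(),
            nn.Linear(64, 28*28),       # (5): output, same size as the input
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))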

Below I have attached an example of an original image and the result after it was passed through the autoencoder model. Before we start coding, I want you to see what we are trying to achieve.

An image of the digit "8" before and after being passed through an autoencoder

Objective

The primary objective of an autoencoder is to minimize the difference between the input and the reconstructed output. This difference is often measured using a loss function such as Mean Squared Error (MSE) or Binary Cross-Entropy (BCE).
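
To see what that means in code, here is a quick illustration of both loss functions in PyTorch; the two tensors are random placeholders standing in for an input image and its reconstruction.

import torch
from torch import nn

original = torch.rand(1, 28*28)        # stand-in for an input image, values in [0, 1]
reconstruction = torch.rand(1, 28*28)  # stand-in for the model's output

mse = nn.MSELoss()(reconstruction, original)
bce = nn.BCELoss()(reconstruction, original)  # BCE expects values in [0, 1]
print(mse.item(), bce.item())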

Types of Autoencoders

  1. Vanilla Autoencoder: The simplest form with a single encoder and decoder network.
  2. Denoising Autoencoder: Trained to remove noise from the input data, improving robustness.
  3. Sparse Autoencoder: Encourages sparsity in the latent representation, making the model learn more meaningful features.
  4. Variational Autoencoder (VAE): Incorporates probabilistic approaches for generating new data similar to the training data.
  5. Convolutional Autoencoder: Uses convolutional layers, suitable for image data, to capture spatial hierarchies.

These are all advancements that have been made on the basic autoencoder architecture. You don't need to worry about them for now.

Applications of Autoencoders

Autoencoders have a wide range of applications:

  1. Dimensionality Reduction: Reducing the number of features in data while preserving important information, similar to Principal Component Analysis (PCA).
  2. Image Denoising: Removing noise from images to improve their quality.
  3. Anomaly Detection: Identifying unusual patterns or outliers in data, useful in fraud detection and network security.
  4. Data Compression: Compressing data into a smaller size for efficient storage and transmission.
  5. Feature Learning: Automatically learning useful features from raw data for tasks like classification or clustering.

Building an Autoencoder from scratch

Now let's move into action and build an autoencoder!

Import libraries

Let's start by importing the necessary libraries. In this example, we will be using the MNIST dataset and, in the end, compare the reconstructed output with the original image.

import torch
from torch import nn
from torchvision import datasets, transforms
from torch.utils.data import Dataset, DataLoader
import numpy as np
import matplotlib.pyplot as plt

# Use the GPU if one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'

Downloading dataset

Torchvision has a large collection of datasets; we will download the MNIST dataset directly from there, as separate train and test sets.

train_dataset = datasets.MNIST('./ae/data', train=True, download=True, transform=transforms.ToTensor())
test_dataset = datasets.MNIST('./ae/data', train=False, download=True, transform=transforms.ToTensor())
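
As a quick sanity check, you can print the dataset sizes and the shape of a single image:

print(len(train_dataset), len(test_dataset))  # 60000 and 10000
image, label = train_dataset[0]
print(image.shape, label)  # torch.Size([1, 28, 28]) and the digit's label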

Preparing Data Loaders

DataLoaders simplify data handling by automatically batching, shuffling, and loading data in parallel, which boosts training efficiency. They support on-the-fly transformations, ensuring consistent and reproducible preprocessing, and by loading data in manageable chunks they prevent memory overload when training on large datasets. So let's prepare them as well to make our lives easier.

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=True)
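
If you want to see what a batch looks like, pull one from the loader; this is optional but helps confirm the shapes before training:

images, labels = next(iter(train_loader))
print(images.shape, labels.shape)  # torch.Size([32, 1, 28, 28]) and torch.Size([32])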

Creating the Autoencoder Neural Network

Let's move on to my favorite part, creating the neural network. To implement an autoencoder, we discussed that we need to reduce the image to a latent space, but how do we do that? In this approach we will reduce the number of neurons per layer as we move forward. The standard size of an MNIST image is 28x28 = 784 pixels; as we move through the encoder, we reduce the number of neurons per layer until we reach 9 neurons in the last layer. These numbers are not fixed and you can play around with them as you like; just make sure your decoder is exactly the opposite of your encoder. Imagine the encoder as shrinking the image at every layer and the decoder as growing it back at every layer. The architecture we are going to build is visually represented below.

class AutoEncoder(nn.Module):
    def __init__(self):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 302),
            nn.ReLU(),
            nn.Linear(302, 124),
            nn.ReLU(),
            nn.Linear(124, 84),
            nn.ReLU(),
            nn.Linear(84, 60),
            nn.ReLU(),
            nn.Linear(60, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 9),  # latent space representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(9, 16),
            nn.ReLU(),
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, 60),
            nn.ReLU(),
            nn.Linear(60, 84),
            nn.ReLU(),
            nn.Linear(84, 124),
            nn.ReLU(),
            nn.Linear(124, 302),
            nn.ReLU(),
            nn.Linear(302, 512),
            nn.ReLU(),
            nn.Linear(512, 28*28),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

We defined the class and built the autoencoder architecture. Note that this architecture reduces the size of the image (the number of neurons per layer) at every encoder layer and does exactly the opposite (increasing neurons every layer) in the decoder. Also note that we use a sigmoid activation in the last layer. The sigmoid function maps input values to a range between 0 and 1. Since the pixel values of MNIST images are normalized to this range (0 to 1), using sigmoid ensures that the output of the autoencoder matches this range, making the reconstructed images consistent with the original data.
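
As a quick shape check, you can pass a random tensor through a fresh instance of the model and confirm that the latent code is 9-dimensional and the output matches the 784-pixel input:

check = AutoEncoder()
dummy = torch.rand(1, 28*28)
print(check.encoder(dummy).shape)  # torch.Size([1, 9])
print(check(dummy).shape)          # torch.Size([1, 784])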

Setting up the Model, Loss function and Optimizer

model = AutoEncoder().to(device)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

Here we created a model object of the AutoEncoder class, set the loss function to Mean Squared Error (MSE), and the optimizer to everyone's favorite, Adam.

Training Loop

epochs = 15
losses = []
for epoch in range(epochs):
    epoch_losses = []
    for data in train_loader:
        inputs, _ = data                 # we only need the images, not the labels
        inputs = inputs.view(-1, 28*28)  # flatten each 28x28 image into a 784-vector
        inputs = inputs.to(device)
        outputs = model(inputs)
        loss = loss_fn(outputs, inputs)  # compare the reconstruction against the input itself
        epoch_losses.append(loss.item())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    losses.extend(epoch_losses)
    print(f'Epoch {epoch+1} | loss: {sum(epoch_losses)/len(epoch_losses):.4f}')

Here the training loop runs for 15 epochs. One full pass through every batch in the training set counts as one epoch.
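
Since we stored every batch loss in losses, you can also plot the training curve to watch the reconstruction error fall:

plt.plot(losses)
plt.xlabel('Batch')
plt.ylabel('MSE loss')
plt.title('Training loss')
plt.show()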

Test Loop

model.eval()
for images, labels in test_loader:
    plt.imshow(images[0].squeeze(), cmap='gray')
    plt.title("Original Image")
    plt.show()
    with torch.inference_mode():
        outputs = model(images[0].view(-1, 28*28).to(device))  # flatten and move to the model's device
    outputs = outputs.view(28, 28).cpu().numpy()  # back to the CPU for plotting
    plt.imshow(outputs.squeeze(), cmap='gray')
    plt.title("Reconstructed Image")
    plt.show()
    break

In this test loop, the code takes an image from one of the batches in the test loader, plots the original image, passes it through the model, and plots the reconstructed output.
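
If you want a number rather than a picture, one way is to compute the average reconstruction error over the entire test set, something like:

model.eval()
total_loss, num_batches = 0.0, 0
with torch.inference_mode():
    for images, _ in test_loader:
        images = images.view(-1, 28*28).to(device)
        total_loss += loss_fn(model(images), images).item()
        num_batches += 1
print(f'Average test MSE: {total_loss / num_batches:.4f}')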

Here are the results:

Original and Reconstructed Image

Well, that is a pretty decent result. It looks like these architectures are really good at reconstructing images, and the results will only get better if you look up more advanced architectures. If you followed this blog till now, congratulations! You may have just created your first autoencoder! That's it for this blog; if it helped you in any way, I'm happy ;).

I'll see you soon.

Arigato!
