Implementing Auto Encoder from Scratch

Aug 24, 2020

According to Wikipedia, an autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.

In this tutorial, we’ll implement a very basic auto-encoder architecture on the MNIST dataset in PyTorch. We’ll also use PyTorch Lightning, which makes our life easier by handling most of the boilerplate code.

Step 0. Install the necessary libraries

pip install pytorch_lightning
pip install torchvision

Importing the libraries

import torch
from torch import nn, optim
from torch.utils.data import random_split, DataLoader
from torch.nn import functional as F
from torchvision import transforms
import pytorch_lightning as pl
from torchvision.datasets import MNIST

Step 1. Setting Up LightningDataModule

LightningDataModule is an abstraction in the pytorch_lightning library that handles dataset-specific work. In this tutorial we’ll focus on the model architecture rather than the data module; DataModules will be explained in detail in a future tutorial.

The duties of a DataModule, in overview:

Download the dataset, if not already downloaded
Split it into train, validation and test sets
Make batches of these splits and create a DataLoader for each split.

NOTE: A DataLoader is PyTorch’s way of feeding data into the model, batch by batch, during training.

Code for Lightning Data Module

Step 2. Create the Model Architecture

To avoid all the boilerplate code needed to train a PyTorch model, we’ll build the model as a PyTorch Lightning LightningModule.

Basic Idea of Auto Encoders

As mentioned earlier, in this tutorial we’ll create a very basic auto-encoder.

We try to generate a lower-dimensional representation of an image that can be “decoded” to reconstruct the original image. That is it!

Auto-encoder architectures differ from one another in two ways:

Method of creating a lower-dimensional representation
Method of reconstruction.

Encoder-Decoder put together gives an Auto-Encoder — Basic Auto-Encoder Architecture

Basic Architecture

Q1. Method of Creating lower-dimensional representation:

Flatten the image, i.e., if the image is of size 100x100 it is flattened to a vector of 10,000 values.
Send it to a Dense layer, which maps the flattened vector down to the size of the compressed representation.

Q2. Decoding Method:

Nothing Fancy, just apply the same transforms but in reverse order.
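To make the shapes concrete, here is a small sketch of the 100x100 example from Q1 and Q2. The batch size of 8, the random “images” and the representation size of 64 are arbitrary choices for illustration only.

```python
import torch
from torch import nn

batch = torch.rand(8, 1, 100, 100)   # 8 hypothetical grayscale 100x100 images
flattened = batch.view(8, -1)        # each image becomes a vector of 10,000 values

encode = nn.Linear(10_000, 64)       # Q1: Dense layer down to the representation
decode = nn.Linear(64, 10_000)       # Q2: the same transform in reverse

representation = encode(flattened)
reconstructed = decode(representation).view(8, 1, 100, 100)

print(flattened.shape, representation.shape, reconstructed.shape)
```

The layers here are untrained, so the reconstruction is meaningless; the point is only that the tensor goes from 10,000 values per image down to 64 and back up again.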

You can ask these same two questions of any auto-encoder architecture to understand what is new or novel about it.

Q1. How are they creating lower-dimensional representations?

Q2. How are they reconstructing the images back?

For Encoding a batch of images

flattened = image_batch.view(-1, self.flattened_size)
# flattened: [batch_size, flattened_size]
representation = F.relu(self.input_to_representation(flattened))
# representation: [batch_size, representation_size]

With this small snippet of code, we have converted the image batch to a dense, compressed form, which is exactly what the encoder half of an auto-encoder needs to do.

Now we have to be able to reconstruct an image from this condensed representation. As this is a basic architecture, there is nothing fancy here: we just do the same operations in reverse order.

flat_reconstructed = F.relu(self.representation_to_output(representation))
# flat_reconstructed: [batch_size, flattened_size]
reconstructed = flat_reconstructed.view(-1, *self.input_shape)
# reconstructed is same shape as the original image_batch

Our model has only about 201K parameters, far fewer than the latest models. For reference, the hyped GPT-3 that people are going crazy about has 175 BILLION PARAMETERS!!
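The 201K figure can be checked by counting the weights and biases of the two Dense layers. This sketch assumes 28x28 MNIST images and the representation size of 128 used when the model is created for training.

```python
from torch import nn

flattened_size = 1 * 28 * 28   # 784 values per MNIST image
representation_size = 128

encoder = nn.Linear(flattened_size, representation_size)   # 784*128 weights + 128 biases
decoder = nn.Linear(representation_size, flattened_size)   # 128*784 weights + 784 biases

n_params = sum(p.numel() for layer in (encoder, decoder) for p in layer.parameters())
print(n_params)  # 201616
```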

But we are going to show (wait for it…) that in less than 50 seconds of training, our model gives decent results.

Step 3. Train the model

We use 16-bit precision while training to use less memory (a 16-bit float takes half the space of a standard 32-bit float). Most recent GPUs have dedicated hardware that gives better performance for 16-bit float operations; on an Nvidia V100 GPU the speedup is reported to be 3x to 5x.

Half the memory
3 times faster

Why NOT! Let’s use it...

mnist_dm = MNISTDataModule()
model = SimpleAutoEncoder(input_shape=mnist_dm.size(), representation_size=128)
trainer = pl.Trainer(gpus=1, max_epochs=5, precision=16)
trainer.fit(model, mnist_dm)
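The “half the memory” point is easy to check for a single tensor. The sketch below compares the storage of one hypothetical batch of 64 MNIST-sized images in 32-bit and 16-bit floats. It says nothing about speed, and since `precision=16` in Lightning means mixed precision, the saving over a whole training run may be smaller than this per-tensor figure.

```python
import torch

x32 = torch.zeros(64, 1, 28, 28, dtype=torch.float32)  # one hypothetical batch
x16 = x32.half()                                        # same values as 16-bit floats

bytes32 = x32.element_size() * x32.numel()
bytes16 = x16.element_size() * x16.numel()
print(bytes32, bytes16)  # 200704 100352
```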

After training the model for 5 epochs these are the results:

Validation Loss: 0.00927

Train Loss: 0.0102

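The validation loss is slightly below the training loss, so the model is certainly not overfitting. If anything this hints at underfitting, but we’re not complaining because this is a very basic auto-encoder architecture.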
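To get a feel for what these numbers mean, assume the loss is a mean squared error over pixels scaled to the range 0 to 1 (an assumption, since the loss function is not shown here). Taking the square root then gives a typical per-pixel error:

```python
import math

val_loss = 0.00927
train_loss = 0.0102

# Root of a mean squared error = typical per-pixel error on a 0-to-1 scale.
val_rmse = math.sqrt(val_loss)
train_rmse = math.sqrt(train_loss)
print(f"{val_rmse:.3f} {train_rmse:.3f}")  # 0.096 0.101
```

Under that assumption, the reconstructed pixels are off by roughly 10% of the full intensity range on average.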

Training Visualisations from tensorboard

Step 4. Visualising the results

This is very important! It is often overlooked, but it is exactly how you know how well the model is performing.

This is like tasting food after cooking:

You get to taste your results.
If something is wrong, you make changes.

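import matplotlib.pyplot as plt

# Assumption: `trans` converts a [C, H, W] tensor to a PIL image.
trans = transforms.ToPILImage()

for batch in mnist_dm.val_dataloader():
  original_imgs = batch[0]
  with torch.no_grad():  # no gradients needed for visualisation
    outputs = model(original_imgs)
  for i in range(len(outputs)):
    # Original Image
    plt.figure()
    plt.imshow(trans(original_imgs[i]).convert("RGB"))
    # Reconstructed
    plt.figure()
    plt.imshow(trans(outputs[i]).convert("RGB"))
    if i == 3:  # stop after four image pairs
      break
  break  # only look at the first batch
plt.show()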

The reconstructions are very similar to the originals, which is a very good sign that the model is working as intended.

Conclusion

Not bad for a very basic model trained for 40 seconds, right? Congrats! You’re done with your first Auto-Encoder.

The full code is available in Jupyter Notebook format at this repo. You can open it in Google Colab to run and check it then and there without

Star the repo and follow me on GitHub. If you find any problems with the code or have suggestions for improvement, open a pull request or leave a comment.

In the next part of the blog, I’ll implement a Convolutional Auto-Encoder from scratch.