Autoencoders. Practical use for image denoising, image recovery and new image generation

Olga Mindlina
8 min read · Nov 28, 2023


Autoencoders are a type of deep learning model that encodes an input into a compressed representation and decodes the compressed representation back into the same input or a modification of it. In this article I discuss picture-to-picture autoencoders and their practical use. A general scheme of an autoencoder is shown in the picture below:

The input image with resolution (m x n) is encoded into its code — the latent representation. The size of the code should be significantly smaller than the image size (for RGB format the source image size is equal to 3 * m * n values). The encoder is expected to extract the essential image features and represent them in the code. The decoder decompresses the code into the output image. When working with images, it is reasonable to use CNN architectures for the encoder and decoder parts.

Autoencoder for image denoising

The goal of a denoising autoencoder is to transform a noisy image into its “ideal” form. I’ve seen many examples of denoising autoencoders for handwritten digits from MNIST (Modified National Institute of Standards and Technology). Earlier, I had augmented a license-plate dataset from Kaggle and implemented a working prototype of a license plate number recognition system. Here I use these augmented and clean images of license plate numbers (LPN) to train my autoencoder to transform the augmented images into the corresponding clean images. The pictures below show examples of noisy and target images:

One “ideal” image may be the target for several “noisy” images with the same LPN and different types of augmentation. The link between “noisy” and “ideal” images is set by the LPN sub-string in the images’ unique file names. The resolution of the LPN images for my model is 50x200 pixels. The encoder contains 2 down-sampling blocks, the decoder contains 2 up-sampling blocks, and the dimension of the latent representation is 12x50 (600 values). Here is my autoencoder model (PyTorch is used) which I trained for denoising:

import torch.nn as nn

class Auto_Encoder(nn.Module):
    def __init__(self):
        super(Auto_Encoder, self).__init__()
        nc = 128
        nc4 = int(nc / 4)  # 32

        self.enc = nn.Sequential(
            # 50x200, 3 channels
            nn.Conv2d(3, nc, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # 25x100
            nn.Conv2d(nc, nc4, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),
            # 12x50
            nn.Conv2d(nc4, 1, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
        )
        self.dec = nn.Sequential(
            # 12x50
            nn.Conv2d(1, nc4, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
            # 24x100
            nn.Conv2d(nc4, nc, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),
            # 48x200; the asymmetric padding (3, 2) restores the 50x200 resolution
            nn.Conv2d(nc, 3, kernel_size=5, stride=1, padding=(3, 2)),
            # 50x200
            nn.Sigmoid()
        )

    def forward(self, x):
        encoded = self.enc(x)
        decoded = self.dec(encoded)
        return decoded
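The training pairs themselves are built by matching each noisy image to the clean image with the same LPN sub-string in its file name, as described above. A minimal PyTorch Dataset sketch of this pairing follows; the file-name pattern and the matching helper are hypothetical, since the actual naming scheme is not shown:

import os
import re
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class LPNPairDataset(Dataset):
    def __init__(self, noisy_dir, clean_dir):
        self.to_tensor = transforms.ToTensor()  # RGB pixels scaled to [0, 1]
        # hypothetical pattern: the LPN is a letters/digits sub-string of the name
        get_lpn = lambda f: re.search(r'[A-Z0-9]{6,8}', f).group(0)
        clean = {get_lpn(f): os.path.join(clean_dir, f) for f in os.listdir(clean_dir)}
        # several noisy files may map to one clean target with the same LPN
        self.pairs = [(os.path.join(noisy_dir, f), clean[get_lpn(f)])
                      for f in os.listdir(noisy_dir) if get_lpn(f) in clean]

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, i):
        noisy_path, clean_path = self.pairs[i]
        noisy = self.to_tensor(Image.open(noisy_path).convert('RGB'))
        clean = self.to_tensor(Image.open(clean_path).convert('RGB'))
        return noisy, clean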

I used the Adam optimizer with a variable learning rate that started at 0.001 and was reduced to 0.0001 during training. The loss function is binary cross-entropy. The training set contains 9904 LPN images, the test set 1104. I trained the model for 200 epochs.
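A minimal training-loop sketch under these settings might look as follows; the batch size, the directory paths, and the exact epoch at which the learning rate drops are assumptions, since they are not stated above:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = Auto_Encoder().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()  # valid because the final Sigmoid keeps outputs in [0, 1]
train_set = LPNPairDataset('train/noisy', 'train/clean')  # hypothetical paths
loader = DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(200):
    if epoch == 100:  # assumed switch point for the learning rate
        for g in optimizer.param_groups:
            g['lr'] = 0.0001
    for noisy, clean in loader:
        noisy, clean = noisy.to(device), clean.to(device)
        loss = criterion(model(noisy), clean)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()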

The pictures below show the results of the trained model on some test images — the input noisy image and the image reconstructed by the trained model:

I used binary cross-entropy both as the loss function and as a quality measure. On the test batch shown in the pictures above, the mean binary cross-entropy is 0.2009, while on the whole test set (1104 images) it is 0.2016, so the test batch has the same quality as the whole set.
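Using the same criterion as a quality measure is a straightforward evaluation loop; a sketch, assuming a test_set built like the training set:

# Mean BCE over the test set, reusing criterion from the training sketch.
test_set = LPNPairDataset('test/noisy', 'test/clean')  # hypothetical paths
model.eval()
total, count = 0.0, 0
with torch.no_grad():
    for noisy, clean in DataLoader(test_set, batch_size=32):
        out = model(noisy.to(device))
        total += criterion(out, clean.to(device)).item() * noisy.size(0)
        count += noisy.size(0)
print(f'mean BCE on the test set: {total / count:.4f}')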

Autoencoder for pure image reconstruction

The same autoencoder model as described above may be used for pure image reconstruction, i.e. encoding/decoding the input image to itself. At first glance it looks strange — we make an effort to get an output that is identical to the input. The purpose of this kind of autoencoder is to serve as part of a new-image generator. With the model described above, I used two clean LPN datasets from Kaggle — LV and GB license plate number images — as training/test data. Examples of the input=target pairs are shown in the picture:

The training set contains 2552 LPN images, the test set 200. I trained the model for 200 epochs.

The pictures below show the results of the trained model on some test images — the input LPN image and the LPN image reconstructed by the trained model:

I used binary cross-entropy both as the loss function and as a quality measure. On the test batch shown in the pictures above, the mean binary cross-entropy is 0.2030, while on the whole test set (200 images) it is 0.2128, so the test batch has roughly the same quality as the whole set.

Autoencoder as a generator in generative adversarial network (GAN)

I used the trained model described in the previous part for the following steps:

1. I applied the model.enc part to the input LPN image and got the tensor L — the latent representation containing 12x50 = 600 values.

2. I computed the mean and the standard deviation of L (L.mean, L.std).

3. I made relatively small random changes to the values of the tensor L (based on its mean and std) and got a new tensor L1.

4. I applied the model.dec part to the new tensor L1 and got an output image.

So, the pipeline of the pure autoencoder looks like

model.enc(input) -> L -> model.dec(L) -> output

and the pipeline of the autoencoder with the latent representation variation looks like

model.enc(input) -> L -> L1 -> model.dec(L1) -> output_var

An example of this variation follows:

In this example, an autoencoder trained for pure reconstruction is used to generate a variation of an input image and thus obtain another image of the same class. The image variation is forced by small changes to its latent representation.
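A sketch of these four steps in code follows; the perturbation scheme (Gaussian noise scaled by the std of L) is one possible choice, and the 0.1 scale is an assumption:

# x: a 3x50x200 input tensor; model: the trained reconstruction autoencoder
model.eval()
with torch.no_grad():
    L = model.enc(x.unsqueeze(0))                 # 1. latent code, 600 values
    mu, sigma = L.mean(), L.std()                 # 2. statistics of L
    L1 = L + 0.1 * sigma * torch.randn_like(L)    # 3. small random shift of L
    output_var = model.dec(L1)                    # 4. decoded variation of x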

An example of a GAN which uses an autoencoder as a generator is VQGAN. It also contains another part, which uses the pre-trained autoencoder and is trained to change the latent representation for image transformation. With an emphasis on autoencoders, we can say: we need an autoencoder that reconstructs any image with high quality in order to generate new images which are recognized as adequate representations of particular classes.

How to recover any image with high quality?

In the model which I implemented for my LPN autoencoder, the input image is encoded into 600 values, then the 600-value representation is decoded into the output image. The model is trained with a loss function that tries to minimize the difference between the input and output images. As we can see in the pictures above, the quality of reconstruction is not perfect even for the simple LPN dataset. What should we expect when reconstructing an arbitrary image?

Looking into the characteristics of the VQGAN autoencoder, I found that it uses a 256x256 input/output image resolution and has 3 or 4 down-sampling blocks in the encoder.

I implemented an autoencoder with 3 down-sampling blocks in the encoder and 3 up-sampling blocks in the decoder and tried to train it to reconstruct 256x256 images from a Kaggle dataset with cars — see the examples:

The latent representation contains 32x32 = 1024 values. The model is trained with a loss function that tries to minimize the difference between the input and output images.
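A sketch of this deeper model, extending the 2-block architecture above to 3 blocks; the intermediate channel counts are assumptions, since the exact configuration is not given:

import torch.nn as nn

class Auto_Encoder_256(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),   # 256 -> 128
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),   # 128 -> 64
            nn.Conv2d(64, 1, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),   # 64 -> 32: latent code 1x32x32 = 1024 values
        )
        self.dec = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),   # 32 -> 64
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),   # 64 -> 128
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=True),   # 128 -> 256
            nn.Conv2d(128, 3, kernel_size=5, padding=2),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.dec(self.enc(x))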

The results are far from perfect:

The quality of reconstruction is not good even for a simple dataset containing only cars. This prompted me to study VQGAN more deeply to find the secret: how to reconstruct any image with high quality.

I found a good explanation of the VQGAN architecture here and here. The latent space is not small: each entry of the down-sampled code with resolution 16x16 or 32x32 is a vector of length 256, i.e. the encoded output contains 256 planes with resolution 16x16 or 32x32. An interesting trick is the use of a codebook, which is the same for all images, in addition to the latent representation obtained by encoding a particular image. The codebook entries are vectors of length 256. After the encoding step, each vector of the latent representation is replaced by the nearest vector from the codebook, and this new latent representation built from the chosen codebook entries is used in the decoding step. The process of selecting codebook entries to replace the latent representation vectors is called vector quantization.
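A self-contained sketch of the vector quantization step (not the actual VQGAN code): each 256-dimensional latent vector is replaced by its nearest codebook entry.

import torch

def quantize(z, codebook):
    # z: (B, 256, H, W) encoder output; codebook: (K, 256) learned entries
    B, C, H, W = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, C)      # one 256-d vector per position
    idx = torch.cdist(flat, codebook).argmin(dim=1)  # nearest codebook entry
    z_q = codebook[idx].reshape(B, H, W, C).permute(0, 3, 1, 2)
    return z_q, idx

codebook = torch.randn(1024, 256)   # e.g. the small model below: 1024 entries
z = torch.randn(1, 256, 16, 16)     # a 16x16 plane of 256-d latent vectors
z_q, idx = quantize(z, codebook)    # z_q is passed to the decoder instead of z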

The picture below shows this process in the VQGAN autoencoder (the picture is from the article):

It is important that the training for encoding/decoding goes simultaneously with the training of the codebook entries. The loss function minimizes both the difference between the input and output images and the difference between the latent representation vectors just after encoding and the codebook vectors chosen in the vector quantization step (a sketch of such a combined loss follows the list below). Instead of the encoder and decoder being trained to extract unique features of each image, they are trained to find the image features closest to common centroid values, and these centroid values are also found during the training. VQGAN autoencoder models with the following different settings were trained:

1. A small model with a 16x16x256 latent representation and a codebook containing 1024 entries

2. A model with a 16x16x256 latent representation and a codebook containing 16384 entries

3. A model with a 32x32x256 latent representation and a codebook containing 8192 entries

The resolution of the input/output images is 256x256.
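A sketch of the combined loss described above — the reconstruction term plus the two terms pulling the encoder output and the chosen codebook entries toward each other. The weight 0.25 for the commitment term is the value commonly used in the VQ-VAE/VQGAN papers; the full VQGAN loss additionally contains perceptual and adversarial terms not shown here.

import torch.nn.functional as F

def vq_training_loss(x, x_rec, z, z_q, beta=0.25):
    rec = F.mse_loss(x_rec, x)                   # input vs reconstructed output
    codebook = F.mse_loss(z_q, z.detach())       # move codebook entries toward encodings
    commit = beta * F.mse_loss(z, z_q.detach())  # keep encodings near their entries
    return rec + codebook + commit

# In practice gradients are passed through the non-differentiable quantization
# step with the straight-through trick: z_q = z + (z_q - z).detach() before decoding.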

The pictures below show the results of image reconstruction using the smallest VQGAN model (the codebook with 1024 entries). To get these results I used parts of the code from this Google Colab.

The larger the codebook, the better the similarity.

The landscapes:

Conclusion

1. An autoencoder for image reconstruction implemented as a simple “compressor + decompressor” with a small latent space works well only with sets of very similar images.

2. The VQGAN autoencoder uses a codebook, the vector quantization technique and a large latent space. All of this allows it to reconstruct any image with high quality. Training the codebook leads to clusterization of the latent representations, which is important for a variational autoencoder.

3. By perturbing the latent representation it is possible to generate new images. The latent representation might also be used in transformer training.
