Improving PewDiePie’s camera quality with Autoencoders

Let’s take a look at how we can use Deep Learning for Image Super-Resolution with Autoencoders.

DG AI Team
deepgamingai
5 min read · Jun 30, 2020


Comparison of the 480p input (left) with the higher-quality output, at the same resolution, of an Autoencoder trained for image super-resolution (right).

Recently, I have been reading about various image super resolution techniques that utilize Deep Learning for improving image clarity. Some very impressive results have been achieved using techniques like GANs and Autoencoders for this task. It is safe to presume that most smartphone cameras and other image processing software these days make use of such AI to “enhance” images.

In this article, I would like to explore and detail how effective Autoencoders are for this task and demonstrate some results on a recent video of PewDiePie.

Why PewDiePie?

If you have been following the most-subscribed YouTuber recently, you would know that he faces a lot of criticism and ridicule for the low visual quality of his videos, in spite of using expensive camera gear for recording them.

This made me think it would be the perfect use case to play with super-resolution AI algorithms and see how much we can improve video quality with them.

What is Image Super-Resolution?

Single-image super-resolution is the technique by which a low-resolution (LR), blurry image is up-scaled to output a sharper, more detailed, higher-resolution (SR) image. The aim is to recover information about the objects in the image that has been lost due to poor camera quality or poor lighting conditions.

An example of image super resolution using Neural Networks (ESRGAN). [Source]

Convolutional Neural Networks (CNNs) have proven to be rather good at such tasks, especially compared to the more traditional techniques of interpolation. With their ability to learn about shapes and textures of common objects, CNNs are very effective in recovering information that may otherwise not even be present in the LR images. So, let’s take a look at how we can train a CNN based autoencoder for this task.
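For reference, the traditional baseline looks like the following minimal sketch using Pillow. The file names and the 4x scale factor are illustrative assumptions, not details from the original article:

```python
from PIL import Image

# Traditional baseline: plain bicubic interpolation. This only blends
# existing pixels smoothly; it cannot recover lost detail the way a CNN can.
lr = Image.open("frame_lr.png")  # hypothetical low-resolution frame
sr_baseline = lr.resize((lr.width * 4, lr.height * 4), Image.BICUBIC)
sr_baseline.save("frame_bicubic_x4.png")
```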

Gathering Training Data

First, let’s take a look at our training data. We will use pairs of Low-Resolution/High-Resolution (LR-HR) images of Pewds for training our Autoencoder network.

Generating training data

Even though this is Supervised Learning, this type of data is fairly easy to gather. We only need high-resolution images; their low-resolution counterparts can be generated with a simple downscaling followed by an upscaling operation, as described in the figure above. This gives us our input-output pairs for network training.
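As a rough sketch of this pairing step in code, using Pillow: the `frames` directory, the 4x factor, and the 480x480 frame size are illustrative assumptions rather than details from the article.

```python
from pathlib import Path
from PIL import Image

def make_lr_hr_pair(path, factor=4, size=(480, 480)):
    """Build one (LR, HR) training pair from a single high-quality frame."""
    # The HR target is the frame itself, resized to a fixed shape.
    hr = Image.open(path).convert("RGB").resize(size, Image.BICUBIC)
    # Downscale, then upscale back: detail is lost but the resolution
    # matches the target, giving us the blurry LR input.
    small = hr.resize((size[0] // factor, size[1] // factor), Image.BICUBIC)
    lr = small.resize(size, Image.BICUBIC)
    return lr, hr

# Hypothetical directory of frames extracted from the video.
pairs = [make_lr_hr_pair(p) for p in sorted(Path("frames").glob("*.png"))]
```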

Autoencoder Architecture

The Autoencoder network contains two main blocks — the Encoder and the Decoder. The entire Encoder-Decoder network setup is shown in the following figure.

LR -> Encoder -> Encoding -> Decoder -> SR

The Encoder is tasked with taking in an LR image and finding a high-level representation of it known as an encoding. It uses various Convolutional layers for this purpose. The output encoding contains information about the shapes and textures of the various objects in the image and has a much smaller dimension than the input image. The encoding is not really interpretable by a human, though; it is a representation understood only by this particular network.

The task of the Decoder is then to take this encoding and create a higher-resolution image. It uses various upsampling blocks for this purpose and also draws on earlier Convolutional layers of the Encoder, through skip connections, to recover some of the information that may have been lost while translating to the encoding (these are the overhead arrows in the architecture figure). Consider this as the decoder asking the encoder about specific details of the original image while it smoothens and sharpens the information it receives.

In the end, the particular network I am using upscales the image back to the same resolution as the input, but with sharper quality and more detail in the pixels. You may try to go beyond the LR resolution as well if interested.
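A minimal sketch of such an Encoder-Decoder in Keras is shown below. The layer counts, filter sizes, and 480x480 input shape are illustrative assumptions, not the exact network from the article; the `Concatenate` calls are the skip connections described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_autoencoder(size=480):
    inp = layers.Input(shape=(size, size, 3))

    # Encoder: Convolutional layers compress the LR image into an encoding.
    e1 = layers.Conv2D(64, 3, activation="relu", padding="same")(inp)
    p1 = layers.MaxPooling2D(2)(e1)
    e2 = layers.Conv2D(128, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D(2)(e2)
    encoding = layers.Conv2D(256, 3, activation="relu", padding="same")(p2)

    # Decoder: upsampling blocks rebuild the resolution. The Concatenate
    # calls are the skip connections (the overhead arrows), letting the
    # decoder recover detail directly from the encoder's earlier layers.
    d2 = layers.UpSampling2D(2)(encoding)
    d2 = layers.Conv2D(128, 3, activation="relu", padding="same")(
        layers.Concatenate()([d2, e2]))
    d1 = layers.UpSampling2D(2)(d2)
    d1 = layers.Conv2D(64, 3, activation="relu", padding="same")(
        layers.Concatenate()([d1, e1]))

    # Output at the same resolution as the input, pixel values in [0, 1].
    out = layers.Conv2D(3, 3, activation="sigmoid", padding="same")(d1)
    return tf.keras.Model(inp, out)
```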

Results

I trained the network using a basic MSE loss between the generated image and the ground truth. This is certainly not the best loss metric you can use for comparing images, so there’s a lot of room for improvement in that regard.
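In Keras terms, the training step could look like the following sketch, reusing `pairs` and `build_autoencoder` from the earlier snippets; the batch size, epoch count, and validation split are arbitrary choices for illustration.

```python
import numpy as np

# Stack the (LR, HR) pairs from the data-generation sketch into arrays
# and scale pixel values to [0, 1] to match the sigmoid output layer.
x = np.stack([np.asarray(lr) for lr, _ in pairs]).astype("float32") / 255.0
y = np.stack([np.asarray(hr) for _, hr in pairs]).astype("float32") / 255.0

model = build_autoencoder()
model.compile(optimizer="adam", loss="mse")  # pixel-wise MSE, as described
model.fit(x, y, batch_size=8, epochs=50, validation_split=0.1)

sr = model.predict(x[:1])  # one sharper 480x480 frame from a blurry input
```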

Anyway, even with MSE I am able to get fairly decent results, which leads me to believe that CNN-based autoencoders are highly effective for this task. Following are some sample comparisons between the LR input and SR output.

Left: input to autoencoder. Right: output of autoencoder. Both images are 480x480.

For more such results in a video format, please check out the accompanying video on my YouTube channel.

Conclusion

While the results from Autoencoders are quite decent, they are not quite state-of-the-art in practice. Various GAN-based approaches to training the image generator produce much more realistic-looking output. In conclusion, an MSE-based loss, or any other fixed loss we can use for a pure autoencoder, is at a disadvantage compared to a GAN’s generator-discriminator setup.

In any case, it is certainly very impressive what we can achieve with AI these days, so it’s worth keeping an eye on the latest research if you are interested in image super-resolution!

Thank you for reading. If you liked this article, you may follow more of my work on Medium, GitHub, or subscribe to my YouTube channel.

Note: This is a repost of the article originally published with towardsdatascience in 2018.
