Photo by Adrian Rosebrock on pyimagesearch

Explained like you are 5: Image Super-Resolution Part 1

jagan nathan
Developer Community SASTRA

--

Part 1 of this two-part series covers the basics of image super-resolution: the available techniques, models, and architectures, the metrics used to evaluate and score the results, and finally a model design.

Part 2 will demonstrate how to build a simple ISR model. Click here to view part 2.

Image super-resolution is like a magic tool that can transform blurry low-quality images into sharp, high-quality ones. With this, you can enhance the details and improve the overall visual appeal of an image, making it more attractive and eye-catching.

Contents

  1. What is image super-resolution?
  2. Need for ISR models
  3. Techniques used for ISR
  4. Metrics for evaluation
  5. Designing a model

What is Image Super Resolution?

Image super-resolution is the process of increasing the resolution or quality of an image. ISR models upscale a low-resolution image to a higher-resolution one (240p -> 720p, 3x upscaling) while maintaining the sharpness the image needs to have. Whether you want to improve the resolution of your photos or enhance the quality of your professional designs, image super-resolution can help you achieve stunning results!

Need for ISR

We all would have seen this meme back when Facebook was trending. There is a good reason why surveillance cameras record at lower quality. In a facility like a bank, several dozen cameras operate 24/7, and the space required to store high-resolution recordings would be tremendous. Thus, a tradeoff is made between quality and storage. Instead of increasing the storage and quality of the surveillance cameras, we can use ISR to create high-quality recordings from the low-quality ones whenever a high-resolution recording is required.

ISR finds its application wherever low-quality images need to be upscaled to good-quality ones: for example, restoring old photos, fine-tuning movie recordings, medical imaging, and more.

Techniques used for ISR

Simple interpolation techniques

The simplest way to perform ISR is by using interpolation techniques, where every pixel in the output image is the result of a simple predefined set of operations performed on the pixels in the original image. They usually tend to produce blurry images.
Here are Computerphile’s videos on how bilinear, nearest-neighbor, and bicubic interpolation work:
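To make the idea concrete, here is nearest-neighbor interpolation, the simplest of these techniques, written as a small NumPy sketch: every source pixel is simply repeated `scale` times in each spatial direction.

```python
import numpy as np

def nearest_neighbor_upscale(img, scale):
    """Upscale an H x W (or H x W x C) array by repeating each pixel."""
    out = np.repeat(img, scale, axis=0)   # repeat rows
    return np.repeat(out, scale, axis=1)  # repeat columns

lr = np.array([[0, 100],
               [200, 255]], dtype=np.uint8)
hr = nearest_neighbor_upscale(lr, 2)      # 2x2 -> 4x4, larger but blocky
```

Bilinear and bicubic interpolation replace the pixel-repeat with weighted averages of neighboring pixels, which is why their results look smoother, and often blurrier.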

Deep Learning based approaches

Most pre-trained models use a GAN (Generative Adversarial Network) architecture, residual blocks, and the depth-to-space operation. Let’s discuss those concepts.

GAN: Generative Adversarial Network

GANs consist of two networks: a generator network and a discriminator network. The generator network takes an input and generates a sample that is similar enough to the real data to fool the discriminator.

The discriminator takes either a sample produced by the generator or a sample of the original data as input, and outputs a binary classification: real or fake.

How GANs work

The generator and discriminator networks are trained simultaneously: the generator tries to create samples that fool the discriminator, while the discriminator tries to classify the samples correctly. Training continues until the generator produces samples that are indistinguishable from the original data, or until the discriminator’s performance plateaus. The most commonly used pre-trained ISR models, SRGAN and ESRGAN, are based on the GAN architecture. GAN models tend to be huge and take a lot of training time, since two networks must be trained simultaneously.
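The alternating updates can be sketched with a toy one-dimensional GAN in NumPy. This is purely illustrative (real ISR GANs use deep convolutional networks): here the generator is a hypothetical affine map and the discriminator a logistic classifier, with the gradient steps written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(x, w):
    # Logistic score: estimated probability that x is "real".
    return 1.0 / (1.0 + np.exp(-(w[0] * x + w[1])))

def generator(z, theta):
    # Maps noise z to a sample; here just an affine map.
    return theta[0] * z + theta[1]

def train_step(w, theta, real, lr=0.05):
    z = rng.normal(size=real.shape)
    fake = generator(z, theta)
    # Discriminator update: push D(real) toward 1 and D(fake) toward 0.
    for x, label in ((real, 1.0), (fake, 0.0)):
        grad = discriminator(x, w) - label     # dBCE/dlogit
        w[0] -= lr * np.mean(grad * x)
        w[1] -= lr * np.mean(grad)
    # Generator update: push D(fake) toward 1 through the frozen discriminator.
    grad_logit = (discriminator(fake, w) - 1.0) * w[0]
    theta[0] -= lr * np.mean(grad_logit * z)
    theta[1] -= lr * np.mean(grad_logit)

w = np.array([0.1, 0.0])              # discriminator parameters
theta = np.array([1.0, 0.0])          # generator parameters
real = rng.normal(3.0, 1.0, size=64)  # "real" data: samples around 3
for _ in range(50):
    train_step(w, theta, real)
```

Each step updates the discriminator on real and fake batches, then updates the generator against the frozen discriminator, which is exactly the tug-of-war described above.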

Residual Block

A residual block, referred to as a ResBlock, implements a skip connection. It consists of a sequence of layers followed by a shortcut connection that skips over the layers and adds the original input to their output. The idea is that by learning the residual difference between its input and output, the network can more easily learn useful representations, even in very deep architectures.

Example of ResBlock
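The skip connection can be sketched in a few lines of NumPy. The element-wise `layer` function below is a hypothetical stand-in for a convolutional layer; only the add-the-input-back structure is the point.

```python
import numpy as np

def layer(x, w):
    # Stand-in for a conv layer: element-wise linear map + ReLU.
    # (In a real model this would be a Conv2D with learned kernels.)
    return np.maximum(w * x, 0.0)

def res_block(x, w1, w2):
    out = layer(x, w1)    # first layer
    out = w2 * out        # second layer (no activation before the add)
    return x + out        # skip connection: the input is added back

x = np.array([[1.0, -2.0], [3.0, 0.5]])
y = res_block(x, w1=1.0, w2=0.5)  # same shape as x
```

Because the block computes x + f(x), an untrained block starts out close to the identity function, which is what makes very deep stacks of them trainable.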

Depth-to-space operation

A 2D image can be considered a matrix. Space refers to the number of rows and columns of the matrix; depth is the number of such matrices. A shape of H x W x D means D matrices of shape H x W.

For example, in an RGB image, there are three channels (H, W, 3), one matrix representing the shade of Red, one representing Green, and one for Blue.

The depth-to-space layer takes the information in a large number of small matrices and reorganizes it into a smaller number of larger matrices. It thereby converts the depth dimension into spatial dimensions, increasing the resolution of the feature maps output by a CNN or ResBlock.

Illustration of the depth-to-space operation
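Here is a minimal NumPy version of the operation, written to match the channel layout that TensorFlow’s tf.nn.depth_to_space uses for HWC tensors (batch dimension dropped for simplicity):

```python
import numpy as np

def depth_to_space(x, r):
    """Rearrange an (H, W, C*r*r) array into an (H*r, W*r, C) array."""
    h, w, d = x.shape
    c = d // (r * r)
    x = x.reshape(h, w, r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)   # interleave block rows and columns
    return x.reshape(h * r, w * r, c)

x = np.arange(4).reshape(1, 1, 4)    # one pixel with 4 channels
y = depth_to_space(x, 2)             # -> a 2x2 single-channel patch
```

A single pixel with r*r*C channels becomes an r x r patch with C channels, so an upscaling network only needs to produce enough channels and let this layer spread them out spatially.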

After reading all of this, a good ISR architecture will start to take form in your mind. Let’s discuss the final few theoretical concepts and jump straight into creating our ISR model!

Metrics

The pixel values are continuous, so we can simply use the Mean Squared Error (MSE) loss. But MSE has drawbacks. The main one is that it only measures the difference in pixel values between the reconstructed image and the ground-truth image, without taking the perceptual quality of the image into account. As a result, very blurry images may still have a low MSE score, even if they look visually unappealing to human observers.

Thus, other metrics like the Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR) are used in conjunction with MSE. These metrics take into account both the difference in pixel values and the perceptual quality of the image, and can provide a more accurate and informative assessment of the quality of the reconstructed image.

These metrics can be quite technical and difficult to understand for non-experts. I will try to explain the key concepts without diving too much into the technical details.

  • SSIM and PSNR take into account the perceptual quality as well as the difference in pixel values to provide an assessment.
  • The higher the PSNR value, the better the reconstructed image. PSNR usually ranges from 20 to 50 dB, but theoretically it can go up to infinity.
  • SSIM ranges from -1 to +1, with -1 indicating the reconstructed image and the original image are perfectly anti-similar, +1 indicating they are perfectly similar, and 0 indicating no similarity.
  • An extension of the SSIM, the Multi-Scale SSIM (MS SSIM), is an improved version and one of the most commonly used metrics.
  • MS SSIM ranges from 0 to 1, with 1 being perfectly similar. To serve as an error function, MS SSIM is usually subtracted from 1, so that during training the model learns to minimize the error.
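As a concrete example, MSE and PSNR take only a few lines of NumPy (SSIM is more involved; libraries such as scikit-image provide it as skimage.metrics.structural_similarity):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=255.0):
    """Peak Signal-to-Noise Ratio in decibels; higher is better."""
    m = mse(a, b)
    if m == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / m)

truth = np.full((4, 4), 100.0)
noisy = truth + 10.0           # constant error of 10 per pixel
score = psnr(truth, noisy)     # about 28.1 dB
```

Note that `max_val` is the largest possible pixel value (255 for 8-bit images, 1.0 for normalized ones), which is where the "peak" in PSNR comes from.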

Designing a model

Let us design a simple model based on the information we have gathered.

  1. Add an input layer with the required shape, H x W x D
  2. Add a few consecutive ResBlocks
  3. Each ResBlock contains one or two Conv layers. The inputs are concatenated with the output of the Conv layers to finish up the skip connection
  4. Add depth-to-space operations which convert the depth-heavy output of the ResBlocks into a more space-oriented output
  5. Add convolutional layers to get the output in the required shape; this completes our generator
  6. Design a simple Image classification model to use as the discriminator
  7. Compile the GAN and train with different loss functions for the generator such as MSE, PSNR, and MS SSIM
  8. Pick the loss function which allows the generator to learn the most
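The steps above can be sketched at the shape level with NumPy stand-ins for the learned layers. This is a hypothetical toy, not a trained generator: the point is only how the shapes flow from low resolution to high resolution.

```python
import numpy as np

def depth_to_space(x, r):
    # Rearrange an (H, W, C*r*r) array into an (H*r, W*r, C) array.
    h, w, d = x.shape
    c = d // (r * r)
    return (x.reshape(h, w, r, r, c)
             .transpose(0, 2, 1, 3, 4)
             .reshape(h * r, w * r, c))

def toy_generator(x, r=2):
    x = x + np.maximum(x, 0.0)   # steps 2-3: ResBlock stand-in (shape-preserving)
    x = depth_to_space(x, r)     # step 4: channels -> spatial resolution
    return np.clip(x, 0.0, 1.0)  # step 5: final layer stand-in to the output range

lr_img = np.random.default_rng(0).random((8, 8, 12))  # H x W x (3 * r * r)
hr_img = toy_generator(lr_img)                        # -> (16, 16, 3)
```

In a real model each stand-in would be a learned convolutional layer, and steps 6 through 8 would wrap this generator in a GAN training loop with a discriminator.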

To see an implementation of the same, look at part 2 of my “Explained like you are 5: Image Super-Resolution” series. But before that, I would recommend you code this out on your own!
