Super Resolution and its Recent Advances in Deep Learning — Part 2
Hi, and welcome to part 2 of the super-resolution series (you can find links to the remaining parts at the end). Now that we have some background on what super-resolution is and a brief introduction to the concept of deep learning, let's dig into more interesting concepts without further ado!
As discussed in the earlier article, our deep learning model needs a dataset containing low-resolution (LR) and high-resolution (HR) images to adjust its parameters and adapt itself to this particular image environment. There are typically two broad approaches to this problem: supervised and unsupervised super-resolution. In this article, let's focus on supervised super-resolution in detail.
Supervised Super-resolution
In this approach, the training data consists of matched pairs of LR and HR images, i.e. we teach the network that the HR image is the expected super-resolved output (network output: f(x)) for its particular LR counterpart (input: x).
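Concretely, "teaching" the network means minimizing a pixel-wise loss between the network output f(x) and the HR target. A minimal NumPy sketch of the commonly used mean-squared-error loss (the patch sizes and pixel values here are purely illustrative):

```python
import numpy as np

def mse_loss(sr: np.ndarray, hr: np.ndarray) -> float:
    """Pixel-wise mean squared error between the network output f(x) and the HR target."""
    return float(np.mean((sr - hr) ** 2))

hr = np.ones((4, 4), dtype=np.float32)       # ground-truth HR patch
sr = np.full((4, 4), 0.5, dtype=np.float32)  # hypothetical network output f(x)
loss = mse_loss(sr, hr)
print(loss)  # 0.25
```

Training then consists of nudging the weights so that this loss shrinks across the whole dataset.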
Deep Learning — A background
From the previous post, we understood that the purpose of deep learning is to learn a complex function. Now, let's see how that's actually done. The neuron is the smallest component of a neural network, the heart and soul of this domain. A single neuron applies two functions to the inputs (X) and weights (W) given to it and returns the resulting output. The first is a dot product between the weights and inputs. Then, to introduce non-linearity into the model, a suitable activation function is applied to this dot product, and the result becomes the final output of the neuron.
Stacking layers of such neurons one after the other forms a neural network. Each neuron's output in one layer is multiplied by the corresponding weight and sent to every neuron in the next layer (in the animation below, visualize the output of a neuron passing through each weight/edge in the network). To put it simply, the whole set of weights of a neural network plays the same role as the coefficients a, b in a simple linear equation f(x) = ax + b, which we are generally concerned with estimating accurately.
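The neuron and layer mechanics described above can be sketched in a few lines of NumPy. The layer sizes, weight values, and the ReLU activation here are illustrative choices, not anything fixed by the text:

```python
import numpy as np

def relu(z):
    """A common activation function that introduces non-linearity."""
    return np.maximum(0.0, z)

def neuron(x, w):
    """A single neuron: dot product of inputs and weights, then activation."""
    return relu(np.dot(x, w))

def layer(x, W):
    """A layer: every neuron receives the full output of the previous layer."""
    return relu(x @ W)

x = np.array([1.0, 2.0])                   # inputs X
W1 = np.array([[0.5, -1.0], [0.25, 0.5]])  # weights of layer 1 (two neurons)
W2 = np.array([[1.0], [2.0]])              # weights of layer 2 (one neuron)
out = layer(layer(x, W1), W2)              # a tiny two-layer network
print(out)  # [1.]
```

Training such a network means adjusting the entries of W1 and W2, just as we would estimate a and b in f(x) = ax + b.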
A Convolutional Neural Network (CNN) is a type of neural network typically used in the context of images. Though images are just arrays of pixel intensity values, treating these values independently means ignoring the spatial dependencies among them. CNNs were designed to address exactly this issue.
Researchers have come up with various supervised models based on CNNs which, while unique in their own ways, essentially share a few backbone components: model frameworks, upsampling methods, network designs, and learning strategies.
Why don't we dive into how each of these works?
Model Frameworks
Remember that our first idea for super-resolving an image was to increase its dimensions and then fill the empty spaces with meaningful pixel intensities. Similarly, at some point in a CNN architecture (a stack of neuron layers), a few layers are assigned to upsample the input image to match the dimensions of the expected output image. This upsampling can be placed in different ways, listed below:
(i) Pre-upsampling Super-resolution
In our earlier post, we saw that simple interpolation methods generate blurry images with coarse textures. So, why don't we build a network on top of this interpolated image and train the network parameters (weights) to refine it toward the original HR image? This approach was initially one of the most popular frameworks, since the difficult upsampling operation is taken care of at the very beginning. But it turned out that this routine often introduces side effects like blurring and noise amplification. Also, all the convolutional layers operate on the huge upsampled image matrix, which makes the computation inefficient.
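A minimal sketch of the pre-upsampling idea, using nearest-neighbour interpolation in NumPy as a stand-in for the interpolation step (classic pre-upsampling models such as SRCNN use bicubic instead, and the convolutional layers that would follow are omitted here):

```python
import numpy as np

def pre_upsample(lr: np.ndarray, scale: int) -> np.ndarray:
    """Nearest-neighbour upsampling: every LR pixel becomes a scale x scale block."""
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

lr = np.arange(9, dtype=np.float32).reshape(3, 3)
hr_input = pre_upsample(lr, 2)  # the conv layers would now run on this 6x6 matrix
print(hr_input.shape)  # (6, 6)
```

The key point is that every subsequent convolution now processes the large 6x6 grid rather than the original 3x3 one.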
(ii) Post-upsampling Super-resolution
Since pre-upsampling is computationally expensive, let's try moving the upsampling to the end, which means all the features of the image are extracted by the convolutional layers before the final upsampling. As the computational and space complexities are now much lower, this approach has become extremely popular.
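The efficiency gain is easy to quantify: a 3x3 convolution costs roughly H * W * C_in * C_out * 9 multiplications per layer, so running it on the LR grid instead of the upsampled grid saves a factor of scale squared. The sizes and channel counts below are illustrative:

```python
def conv_multiplies(h: int, w: int, c_in: int, c_out: int, k: int = 3) -> int:
    """Rough multiply count of one k x k convolution on an h x w feature map."""
    return h * w * c_in * c_out * k * k

h, w, scale, c = 32, 32, 4, 64
pre = conv_multiplies(h * scale, w * scale, c, c)   # pre-upsampling: conv on the big map
post = conv_multiplies(h, w, c, c)                  # post-upsampling: conv on the LR map
print(pre // post)  # 16, i.e. scale ** 2
```

For a 4x model, every convolutional layer in a post-upsampling network is therefore roughly 16 times cheaper than its pre-upsampling counterpart.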
(iii) Progressive Upsampling Super-Resolution
One of the simplest upsampling factors is 2x (e.g. 30x30 pixels to 60x60 pixels). But say we're interested in achieving higher factors like 4x or 8x. In such cases, the network finds it difficult to adjust its parameters to upsample the whole image in one step. We can make the task easier by adding smaller upsampling layers at different points in the network. If we want to generate an 8x super-resolved image, progressively adding 2x upsampling layers is preferable to adding a single 8x layer at the end. However, this approach requires advanced training strategies and complicated multi-stage model design to ensure overall training stability.
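The progressive schedule can be sketched as a chain of 2x stages; starting from the 30x30 example above, an 8x target is reached in three doublings:

```python
def progressive_stages(start: int, target_scale: int) -> list:
    """Image side length after each 2x upsampling stage."""
    sizes = [start]
    while sizes[-1] < start * target_scale:
        sizes.append(sizes[-1] * 2)   # one 2x upsampling layer per stage
    return sizes

print(progressive_stages(30, 8))  # [30, 60, 120, 240]
```

Each intermediate size corresponds to one stage of the network, which is what allows per-stage supervision during training.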
(iv) Iterative up-and-down Sampling Super-Resolution
This is one of the more recent frameworks introduced in this domain and is believed to have great scope for exploration and potential. As the figure shows, this approach alternates upsampling and downsampling layers in various parts of the network. Researchers observed that such a framework captures the mutual dependency between the LR and HR images more accurately and thus yields higher-quality reconstructions. Networks like DBPN, SRFBN, and RBPN have experimented with this framework.
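A rough NumPy sketch of one up-and-down (back-projection) unit in the spirit of DBPN. Real networks use learned (transposed) convolutions for the up- and down-projections; the fixed nearest-neighbour and average-pooling operators here are simplifying assumptions, so the computed residual happens to be zero in this toy case:

```python
import numpy as np

def up(x: np.ndarray, s: int = 2) -> np.ndarray:
    """Stand-in up-projection: nearest-neighbour upsampling."""
    return np.repeat(np.repeat(x, s, axis=0), s, axis=1)

def down(x: np.ndarray, s: int = 2) -> np.ndarray:
    """Stand-in down-projection: s x s average pooling."""
    h, w = x.shape
    return x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))

def back_projection_unit(lr: np.ndarray) -> np.ndarray:
    """Upsample, project back down, and feed the LR-space error back up."""
    h0 = up(lr)               # initial HR estimate
    l0 = down(h0)             # project it back to LR space
    residual = lr - l0        # mismatch with the actual LR input
    return h0 + up(residual)  # correct the HR estimate with that error

lr = np.random.rand(4, 4).astype(np.float32)
hr = back_projection_unit(lr)
print(hr.shape)  # (8, 8)
```

With learned projections, the residual carries real information about what the upsampling got wrong, and stacking many such units is what lets the network model the LR-HR dependency iteratively.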
Upsampling Methods
We have seen different approaches to introduce an upsampling layer in our CNN. But how exactly is this layer designed?
(i) Interpolation-based Upsampling
These are the traditional interpolation methods, i.e. nearest-neighbor upsampling, bilinear interpolation, and bicubic interpolation, which I introduced in the first part; all are easy to implement. Among them, nearest-neighbor interpolation is the fastest but creates unintended block-like artifacts, while bicubic generates a relatively smooth image at a higher computational cost.
However, these traditional algorithms rely only on an image's own pixel intensities and don't capture other useful image information. They typically tend to introduce noise amplification and blurring. So, researchers have moved from interpolation-based methods to learnable upsampling layers.
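For intuition, bilinear interpolation can be written as two passes of 1-D linear interpolation. A small NumPy sketch (grid-alignment conventions vary between libraries; this version interpolates between the corner pixels):

```python
import numpy as np

def interp_axis0(img: np.ndarray, scale: int) -> np.ndarray:
    """Linear interpolation along the first axis."""
    n = img.shape[0]
    old = np.arange(n)
    new = np.linspace(0, n - 1, n * scale)
    return np.stack([np.interp(new, old, img[:, j]) for j in range(img.shape[1])], axis=1)

def bilinear_upsample(img: np.ndarray, scale: int = 2) -> np.ndarray:
    """Bilinear = linear interpolation along rows, then along columns."""
    return interp_axis0(interp_axis0(img, scale).T, scale).T

img = np.array([[0.0, 2.0], [4.0, 6.0]])
out = bilinear_upsample(img)
print(out.shape)  # (4, 4)
```

Note that every new pixel is a weighted average of its neighbours, which is exactly why the result is smooth but cannot add detail beyond what the LR pixels already contain.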
(ii) Learning-based Upsampling
- Transposed Convolution Layer:
Consider a case where we want to upsample an image from 3x3 to 6x6 pixels. First, we insert zero-valued pixels between the original ones in an alternating fashion (represented by the white boxes). Then, we pad the boundaries of the image with zeros so that a convolution can be applied. The green 3x3 matrix (a.k.a. kernel) in the convolution stage is a filter applied on top of the image matrix: the dot product of a 3x3 patch of the image with the 3x3 kernel gives one 1x1 output value (a green box). The kernel slides over the whole image to produce the final output (check this animation for a better visualization). The resulting image matrix after the convolution is 6x6. However, this layer tends to create a checkerboard-like pattern of artifacts in some cases.
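The zero-insertion, padding, and sliding dot product described above can be sketched directly in NumPy. The all-ones kernel is just a placeholder; in a real network these kernel weights are learned:

```python
import numpy as np

def transposed_conv_2x(x: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """2x upsampling: interleave zeros, zero-pad the border, then slide a 3x3 kernel."""
    h, w = x.shape
    z = np.zeros((2 * h, 2 * w), dtype=x.dtype)
    z[::2, ::2] = x   # original pixels placed in alternate positions
    p = np.pad(z, 1)  # zero-pad the boundary
    out = np.zeros_like(z)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * kernel)  # dot product per position
    return out

x = np.arange(1, 10, dtype=np.float32).reshape(3, 3)
k = np.ones((3, 3), dtype=np.float32)  # placeholder for learned weights
y = transposed_conv_2x(x, k)
print(y.shape)  # (6, 6)
```

The checkerboard artifacts mentioned above arise because alternating output positions see different numbers of non-zero inputs under the kernel.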
- Sub-pixel Layer:
In this case, instead of inserting zeros and using a single filter/kernel, we use 4 kernels for 2x upsampling after zero-padding the boundaries. In the reshaping phase, the 1x1 outputs of the 4 corresponding dot products are combined into a 2x2 block, which becomes part of the final upsampled output. The filters slide over the whole matrix as in the previous case to obtain the remaining 2x2 blocks. Compared with the transposed convolution layer, the sub-pixel layer provides more contextual information, which helps generate more realistic details.
References
- Deep Learning for Image Super-resolution: A Survey
- An Introduction to different types of Convolutions in Deep Learning
Links to all the parts:
- Super Resolution and its Recent Advances in Deep Learning — Part 1
- Super Resolution and its Recent Advances in Deep Learning — Part 2
Stay tuned, more parts coming soon :)