An Overview of ESPCN: An Efficient Sub-pixel Convolutional Neural Network

Zhuo Cen
Apr 17, 2020


Figure 1: A super-resolution example of ESPCN; Left: low-resolution image, Middle: original Image, Right: super-resolution result, Upscaling factor: 3

Introduction

Single image super-resolution (SISR) is an important problem in image restoration that aims to recover a high-resolution (HR) image from a corresponding low-resolution (LR) image. For instance, in a camera surveillance system, it is sometimes hard to recognize a person because the face is captured at low resolution. Beyond face recognition, super-resolution (SR) applications are common in areas such as HDTV, medical imaging, and satellite imaging.

In recent years, several SR models based on deep neural networks have achieved great success in both computational performance and the quality of the reconstructed super-resolution images. Examples include convolutional neural network approaches such as the Super-Resolution Convolutional Neural Network (SRCNN), the Fast Super-Resolution Convolutional Neural Network (FSRCNN), and Very Deep Super Resolution (VDSR); generative adversarial network approaches such as the Super-Resolution GAN (SRGAN); and the efficient sub-pixel convolutional neural network, ESPCN. The purpose of this post is to review the article “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network” and to share my understanding of ESPCN.

1. Difference between CNN Approaches and ESPCN

Approaches based on the convolutional neural network, such as SRCNN, FSRCNN, and VDSR, share some drawbacks. First, they rely on an interpolation method such as bicubic interpolation to upsample the LR image. Second, they increase the resolution before or at the first layer of the network: the convolutional network is applied directly to the upsampled LR image, which increases the computational complexity and memory cost.

In order to solve these problems, the new approach ESPCN adds an efficient sub-pixel convolutional layer to the CNN network. ESPCN increases the resolution at the very end of the network: the upscaling step is handled by the last layer, so the smaller LR image is fed directly to the network and no interpolation step is needed. The network can learn a better LR-to-HR mapping than a fixed interpolation filter applied before the network. Because the input image is smaller, smaller filters can be used to extract features. The computational complexity and memory cost are reduced, so efficiency is greatly enhanced. This is why ESPCN is an ideal choice for real-time super-resolution of HD videos.
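A back-of-the-envelope count of multiply-accumulate operations shows why convolving the raw LR image is cheaper than convolving a pre-upsampled one. The image size, channel counts, and kernel size below are illustrative assumptions, not figures from the paper.

```python
# Rough multiply-accumulate (MAC) count for one convolutional layer,
# comparing a pre-upsampled input (SRCNN-style) against a raw LR input
# (ESPCN-style). All sizes here are illustrative assumptions.

def conv_macs(h, w, c_in, c_out, k):
    """MACs for a stride-1 'same' convolution: one k*k*c_in dot product
    per output pixel, per output channel."""
    return h * w * c_in * c_out * k * k

r = 3            # upscaling factor
h = w = 100      # LR image size
c_in, c_out, k = 1, 64, 5

macs_upsampled = conv_macs(h * r, w * r, c_in, c_out, k)  # conv on the r-times-larger image
macs_lr = conv_macs(h, w, c_in, c_out, k)                 # conv directly on the LR image

print(macs_upsampled // macs_lr)  # prints 9: the pre-upsampled layer costs r*r times more
```

Every convolutional layer that runs before the final upscaling enjoys the same r × r saving, which compounds across the network.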

2. Network Structure

Figure 2: ESPCN network structure

Basically, the SR model assumes the input is a blurred, noisy LR image. LR images can be created by downsampling HR images from the datasets, and the output is the reconstructed SR image at the specified upscale factor.
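Generating LR training inputs from HR images can be sketched as below. The paper blurs and subsamples; simple r × r average pooling in NumPy is used here as an illustrative stand-in, and the toy 6 × 6 image is my own example.

```python
# A minimal sketch of creating LR inputs by downsampling HR images.
# Average pooling is an assumption standing in for the paper's
# blur-and-subsample pipeline.
import numpy as np

def downsample(hr, r):
    """Average each non-overlapping r x r block of a [H, W] image."""
    h, w = hr.shape
    assert h % r == 0 and w % r == 0, "HR size must be divisible by r"
    return hr.reshape(h // r, r, w // r, r).mean(axis=(1, 3))

hr = np.arange(36, dtype=float).reshape(6, 6)  # toy 6x6 "HR" image
lr = downsample(hr, 3)
print(lr.shape)  # (2, 2): each LR pixel summarizes one 3x3 HR block
```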

The network structure of ESPCN is shown in Figure 2. Suppose the network has L layers: the first L−1 layers are convolutional layers that obtain feature maps from the input LR image, and the last layer is the efficient sub-pixel convolutional layer that recovers the output image size at the specified upscale factor.

Usually, the network has 3 layers, as shown in Figure 3:

  1. The input image with shape [B, C, N, N]
  2. First layer: convolutional layer with 64 filters and a kernel size of 5 × 5, followed by a tanh activation layer.
  3. Second layer: convolutional layer with 32 filters and a kernel size of 3 × 3, followed by a tanh activation layer.
  4. Third layer: convolutional layer with C × r × r output channels and a kernel size of 3 × 3.
  5. Apply the sub-pixel shuffle function so that the output SR image has the shape [B, C, r × N, r × N], followed by a sigmoid activation layer.
Figure 3: ESPCN model
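The three-layer network above can be sketched directly in PyTorch. The filter counts and kernel sizes follow the post; the padding values are my assumption, chosen to keep the spatial size unchanged until the pixel shuffle.

```python
# A minimal PyTorch sketch of the three-layer ESPCN described above.
# Padding values are assumptions to preserve "same" spatial size.
import torch
import torch.nn as nn

class ESPCN(nn.Module):
    def __init__(self, c=3, r=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, 64, kernel_size=5, padding=2),          # first layer
            nn.Tanh(),
            nn.Conv2d(64, 32, kernel_size=3, padding=1),         # second layer
            nn.Tanh(),
            nn.Conv2d(32, c * r * r, kernel_size=3, padding=1),  # third layer
            nn.PixelShuffle(r),   # sub-pixel shuffle: [B, C*r*r, N, N] -> [B, C, rN, rN]
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.body(x)

x = torch.rand(1, 3, 8, 8)    # [B, C, N, N] LR input
y = ESPCN(c=3, r=3)(x)
print(tuple(y.shape))         # (1, 3, 24, 24), i.e. [B, C, rN, rN]
```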

3. Sub-pixel Convolution

One of the most important concepts proposed by the paper’s authors is sub-pixel convolution, also known as pixel shuffle. Before understanding sub-pixel convolution, it is necessary to be familiar with the concept of a sub-pixel. In a camera imaging system, the image data captured by the camera is discretized. Due to the limitations of the light sensor, images are limited to the original pixel resolution; in other words, each pixel on the image represents a small area of color in the real world. In the digital image we see, pixels sit directly next to one another, while in the microscopic world there are many tiny pixels between two physical pixels. Those tiny pixels are called sub-pixels.

Figure 4: Visualization of sub-pixels

As shown in Figure 4, each square area surrounded by four little red squares is a pixel on the imaging plane of the camera, and the black dots are sub-pixels. The accuracy of sub-pixels can be adjusted through interpolation between adjacent pixels. In this way, the mapping from small square areas to big square areas can be implemented through sub-pixel interpolation.

Based on this theory, the sub-pixel convolution method can be used in an SR model to obtain high-resolution images. A general deconvolution operation pads the image with zeros before convolving, which can hurt the result. Performing pixel shuffle at the last layer of the network to recover the HR image needs no padding: as shown in Figure 5, the pixels at each position across the multiple-channel feature maps are combined into one r × r square area in the output image. Thus, each pixel on the feature maps is equivalent to a sub-pixel in the generated output image.

Figure 5: Operation of pixel shuffle

Sub-pixel convolution involves two fundamental steps: a general convolution followed by a rearrangement of pixels. The last layer must have C × r × r output channels so that the total number of pixels is consistent with the HR image to be obtained. In the ESPCN network, the interpolation method is implicitly contained in the convolutional layers and can be learned automatically by the network. Since the convolutions are performed on the smaller LR image, the efficiency is much higher.
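The rearrangement step can be written as a pure reshape-and-transpose in NumPy; the indexing convention below follows PyTorch's `PixelShuffle`, and the tiny 2 × 2 example is my own illustration.

```python
# A NumPy sketch of the pixel-shuffle rearrangement: the pixel at (h, w)
# of feature-map channel c*r*r + i*r + j lands at output position
# (h*r + i, w*r + j) of channel c, i.e. it becomes one sub-pixel inside
# an r x r block of the output.
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange [B, C*r*r, H, W] -> [B, C, H*r, W*r]."""
    b, crr, h, w = x.shape
    c = crr // (r * r)
    x = x.reshape(b, c, r, r, h, w)       # split channels into (c, i, j)
    x = x.transpose(0, 1, 4, 2, 5, 3)     # interleave: [B, C, H, r, W, r]
    return x.reshape(b, c, h * r, w * r)

x = np.arange(16).reshape(1, 4, 2, 2)     # C=1, r=2: four 2x2 feature maps
print(pixel_shuffle(x, 2)[0, 0])
# [[ 0  4  1  5]
#  [ 8 12  9 13]
#  [ 2  6  3  7]
#  [10 14 11 15]]
```

Reading the output, each 2 × 2 block (e.g. 0, 4, 8, 12 in the top-left) gathers one pixel from each of the four feature maps, exactly the sub-pixel layout of Figure 5.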

4. Loss Function

According to the paper, input LR images are generated by downsampling the HR images in the dataset. During training, the original HR images serve as the ground truth. The mean squared error (MSE) between the generated SR images and the ground-truth HR images is used as the loss. The pixel-wise MSE loss function of the network is:

Figure 6: Loss function of ESPCN

Here I(HR) represents each original image in the dataset; I(LR) represents each downsampled LR image; r represents the upscaling factor; H and W represent the image’s height and width; W(1:L) represents all the learnable network weights, and b(1:L) represents all the learnable biases.
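In code, the pixel-wise MSE is just the average squared difference between the network output and the ground-truth HR image over all r·H × r·W pixels. The tiny arrays below are toy stand-ins for real images.

```python
# A minimal sketch of the pixel-wise MSE loss: mean of the squared
# per-pixel difference between the SR output and the HR ground truth.
import numpy as np

def mse_loss(sr, hr):
    """Mean squared error over every pixel of the two images."""
    return np.mean((sr - hr) ** 2)

hr = np.array([[1.0, 0.0], [0.0, 1.0]])   # toy ground-truth HR image
sr = np.array([[0.5, 0.0], [0.0, 0.5]])   # toy network output
print(mse_loss(sr, hr))  # (0.25 + 0 + 0 + 0.25) / 4 = 0.125
```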

5. Results

5.1 Image Super-resolution Results

The authors use the peak signal-to-noise ratio (PSNR) as the performance metric to evaluate the different SR models. As shown in Figure 7, with ReLU as the activation function and a small training set of 91 images, SRCNN’s standard 9–1–5 model and the ESPCN model perform almost the same, both reaching an average PSNR of 27.76 dB, while ESPCN with ReLU trained on ImageNet images performs better overall than the SRCNN model. The authors also evaluated the effect of the tanh activation function: trained on the 91 images, tanh (average PSNR 27.82 dB) performs better than ReLU (average PSNR 27.76 dB).

Figure 7: Results of SRCNN and ESPCN by using ReLU as the activation function

The authors also compare the ESPCN model with the SRCNN 9–5–5 ImageNet model and the Trainable Nonlinear Reaction Diffusion model (TNRD). As shown in Figure 8, the ESPCN model is significantly better than the SRCNN 9–5–5 ImageNet model, and performs close to, and in most cases better than, TNRD across different datasets. Figure 9 shows some visual comparisons between these SR models.

Figure 8: Results of different SR models with an upscaling factor of 3 and 4
Figure 9: Some SR visual results for “14092”, “335094” and “384022” from BSD500 with an upscaling factor of 3.

As mentioned before, one of the advantages of the ESPCN model is its reduced computational complexity. The authors evaluated different SR models’ run times on Set14 with an upscaling factor of 3. As presented in Figure 10, the ESPCN model runs fastest, achieving an average speed of 4.7 ms for recovering a single image from Set14 on a K2 GPU.

Figure 10: Comparison of the speed of different SR models; “Ours” is the ESPCN model

5.2 Video Super-resolution Results

For video SR, the authors compare ESPCN against single-frame bicubic interpolation and SRCNN. The results are shown in Figure 11: the ESPCN model performs better and faster than the SRCNN 9–5–5 ImageNet model. To compare speed, the authors evaluated the run time on 1080p HD videos from the Xiph and Ultra Video Group databases. With an upscaling factor of 3, the SRCNN model takes 0.435 s per frame while the ESPCN model takes only 0.038 s per frame.

Figure 11: Left: Results of HD videos from Xiph database. Right: Results of HD videos from Ultra Video Group database

6. Conclusion

ESPCN can be seen as an upgraded version of the convolutional neural network SR model. In ESPCN, the network is a combination of several convolutional layers and a sub-pixel convolutional layer, and the LR image is upscaled only at the final pixel-shuffle stage. Thus, a big advantage of ESPCN is its higher computation speed, while it still performs well at recovering HR images and videos compared to other CNN SR models.

References

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, Zehan Wang. (2016) Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

Andrew Aitken, Christian Ledig, Lucas Theis, Jose Caballero, Zehan Wang, Wenzhe Shi. (2017) Checkerboard artifact-free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize
