Review: ESPCN — Real Time SR (Super Resolution)

In this story, ESPCN, by Imperial College London, is briefly reviewed. Super Resolution (SR) is a class of techniques that increase the resolution of images. In ESPCN, the low-resolution (LR) image is upscaled to the high-resolution (HR) image only at the very last stage. Thus, the amount of computation in the network is reduced because all earlier layers operate on small feature maps. Consequently, real-time performance can be achieved. It was published in 2016 CVPR and has more than 400 citations. (Sik-Ho Tsang @ Medium)

Sometimes, we only have a poor-quality image and want to enlarge it digitally (zoom in), but the image gets blurred when zoomed in. This is because conventional interpolation of a small image into a large one yields poor image quality. With ESPCN, we can obtain a high-quality high-resolution (HR) image from a low-resolution (LR) image.
ESPCN (Ours) Is At The Left Top Corner Which Is Much Faster and Better Than SRCNN

What Is Covered

  1. Problem of Some Conventional SR Approaches
  2. ESPCN (Efficient Sub-Pixel Convolutional Neural Network)
  3. Results

1. Problem of Some Conventional SR Approaches

Convolutional neural network (CNN) approaches such as SRCNN, FSRCNN and VDSR

  • Firstly upscale/upsample the LR image
  • Then perform convolution to get the HR images

Since the LR image is upsampled at the very beginning, all the convolutions operate on the upsampled image. Thereby, the amount of computation is increased.
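To make the cost argument concrete, here is a minimal sketch that counts multiply-accumulate operations for one convolution layer on the LR grid versus the pre-upsampled grid. The helper names, the 180×320 input size and the 5×5, 64-filter layer are illustrative assumptions, not figures from the paper.

```python
def conv_macs(height, width, in_ch, out_ch, kernel):
    """Multiply-accumulates for one stride-1, same-padded convolution layer."""
    return height * width * in_ch * out_ch * kernel * kernel


def macs_ratio(lr_h, lr_w, r, in_ch=1, out_ch=64, kernel=5):
    """Cost of a layer on the upsampled grid relative to the same layer on the LR grid."""
    upsampled = conv_macs(lr_h * r, lr_w * r, in_ch, out_ch, kernel)
    low_res = conv_macs(lr_h, lr_w, in_ch, out_ch, kernel)
    return upsampled / low_res


# Hypothetical 180x320 LR frame, upscaling factor 3.
print(macs_ratio(lr_h=180, lr_w=320, r=3))
```

Under these assumptions the ratio is simply r², so every layer that runs after an early ×3 upsampling costs about nine times as much as the same layer run on the LR image.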


2. ESPCN (Efficient Sub-Pixel Convolutional Neural Network)

ESPCN Network Architecture

Suppose there are L layers for the network,

  1. For the first L-1 layers, the input LR image goes through convolutions: layer l applies f_l×f_l filters and outputs n_l feature maps.
  2. At the last layer, an efficient sub-pixel convolution is performed to get back the HR image at the output.
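The rearrangement step of the sub-pixel convolution can be sketched in plain Python as below. This is only an illustration for a single output channel: the function name and the channel ordering (feature map k = (y mod r)·r + (x mod r) supplies HR pixel (y, x)) are assumptions chosen for readability, and the paper's own indexing convention may differ.

```python
def pixel_shuffle(features, r):
    """Rearrange r*r LR feature maps of size H x W into one (H*r) x (W*r) image.

    features[k][y][x] is the value of feature map k at LR position (y, x).
    """
    if len(features) != r * r:
        raise ValueError("expected r*r feature maps")
    h, w = len(features[0]), len(features[0][0])
    out = [[0] * (w * r) for _ in range(h * r)]
    for y in range(h * r):
        for x in range(w * r):
            k = (y % r) * r + (x % r)  # feature map that supplies this sub-pixel
            out[y][x] = features[k][y // r][x // r]
    return out


# Four 1x2 feature maps (r = 2) become one 2x4 image.
maps = [[[0, 4]], [[1, 5]], [[2, 6]], [[3, 7]]]
hr = pixel_shuffle(maps, 2)
print(hr)
```

No arithmetic happens in this step; it only moves values, which is why the upscaling itself adds almost no cost.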

Specifically, L=3 which means it is a shallow network.

And the parameters for each layer are: (f1,n1)=(5,64), (f2,n2)=(3,32) and f3=3.

  • 1st layer: There are 64 filters with the filter size of 5×5.
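  • 2nd layer: There are 32 filters with the filter size of 3×3.
  • 3rd layer: The filter size is 3×3 and the output has only 1 channel; for an upscaling factor r, the sub-pixel convolution produces r² feature maps that are rearranged into this single HR channel. Only 1 channel is needed because for a YUV image, only Y is considered, as human eyes are more sensitive to luminance than chrominance.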
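As a rough sanity check on how shallow this network is, the sketch below tallies the weights of the three layers. It assumes a single-channel (Y) input, one bias per filter, and that the last layer emits r² feature maps before they are rearranged into the HR channel, as sub-pixel convolution requires; the function name is hypothetical.

```python
def espcn_param_count(r, in_ch=1):
    """Count weights and biases for the (5,64), (3,32), 3x3 configuration."""
    # (kernel size, input channels, output channels) for each layer
    layers = [
        (5, in_ch, 64),
        (3, 64, 32),
        (3, 32, in_ch * r * r),  # assumption: r*r maps per output channel
    ]
    total = 0
    for kernel, c_in, c_out in layers:
        total += kernel * kernel * c_in * c_out + c_out  # weights + biases
    return total


for factor in (3, 4):
    print(factor, espcn_param_count(factor))
```

Under these assumptions the whole model has only a few tens of thousands of parameters.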

3. Results

3.1 ReLU as Activation Function

Results with ReLU as Activation Function
  • With only 91 images for training, ESPCN has nearly the same performance (27.76 dB) as SRCNN.
  • With ImageNet images for training, ESPCN has better performance (28.09 dB) than SRCNN (27.83 dB).

3.2 Tanh as Activation Function

Results with Tanh as Activation Function
  • With tanh as activation function, a higher average PSNR of 28.11 dB is obtained for upscaling factor of 3.
  • PSNR of 26.53 dB is obtained for upscaling factor of 4.
  • The authors argue that ESPCN provides more feature maps for upsampling, while SRCNN only upsamples the input image with a single bicubic interpolation.
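For readers unfamiliar with the metric, PSNR figures like the ones above follow from the mean squared error between the reference HR image and the reconstruction. Here is a minimal sketch, assuming 8-bit pixel values (peak of 255) stored as nested lists; the function name and toy values are illustrative only, and the paper's exact evaluation protocol (for example border cropping) is not reproduced.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D images."""
    total, count = 0.0, 0
    for ref_row, est_row in zip(reference, estimate):
        for a, b in zip(ref_row, est_row):
            total += (a - b) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy 2x2 example: every pixel is off by 10 grey levels.
print(psnr([[100, 100], [100, 100]], [[110, 110], [110, 110]]))
```

Because the scale is logarithmic, a gain of a few tenths of a dB corresponds to a noticeable reduction in mean squared error.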

3.3 Video

Since ESPCN is a shallow network and therefore a very fast super-resolution approach, it is also tested on video.

Video Dataset
  • Consistently, for both the Xiph and Ultra Video Group datasets, ESPCN obtains a slightly higher PSNR than SRCNN.
  • Though the quality is very similar, the speed differs greatly.
  • With an upscale factor of 3, SRCNN takes 0.435s per frame whilst the ESPCN model takes only 0.038s per frame.
  • With an upscale factor of 4, SRCNN takes 0.434s per frame whilst the ESPCN model takes only 0.029s per frame.
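The per-frame timings above can be turned into frame rates and speed-ups with a couple of divisions. A small sketch, using only the four timings quoted in the list (the helper names are hypothetical):

```python
def frames_per_second(seconds_per_frame):
    """Frame rate implied by a per-frame processing time."""
    return 1.0 / seconds_per_frame


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new method is than the baseline."""
    return baseline_seconds / new_seconds


# upscale factor -> (SRCNN, ESPCN) seconds per frame, as quoted above
timings = {3: (0.435, 0.038), 4: (0.434, 0.029)}
for factor, (srcnn, espcn) in timings.items():
    print(f"x{factor}: {frames_per_second(espcn):.1f} fps, "
          f"{speedup(srcnn, espcn):.1f}x faster than SRCNN")
```

This works out to roughly 26 fps and an 11× speed-up at factor 3, and roughly 34 fps and a 15× speed-up at factor 4.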

3.4 Visual Quality

Some Visual Results for Set14
Some Visual Results for BSD500

With such high speed, about 0.029–0.038 seconds per frame, or roughly 26–34 frames per second (fps), it is useful for live video recording, which is a time-critical task.