Paper Summary: Enhanced Deep Residual Networks for Single Image Super-Resolution

Mike Plotz Sage
4 min read · Nov 17, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/02.

Enhanced Deep Residual Networks for Single Image Super-Resolution (2017), Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee. https://arxiv.org/abs/1707.02921

Single image super-resolution, a.k.a. Let’s Enhance, is one of those magic holy grail killer apps that we all expect from our AI. Who wouldn’t want to extract the heretofore undiscovered pristine images that lie between the pixels of your grainy c. 2005 flip-phone snaps? Or solve crimes with your corner store’s poorly lit security cam feed? Sounds like a job for deep convnets, and indeed there’s been a lot of work in the last few years on upsampling and interpolation.

There are two general approaches to convnet-based super-resolution (SR): one is to pass in a bicubic-upsampled image and the other is to upsample inside the network. This paper takes the latter approach. Indeed, it seems that the main innovation here is in knowing what to remove and which existing tricks to take advantage of. Lim et al. start with SRResNet from Ledig 2016 and apply a number of tweaks:

  • They removed the batch norm layers from the residual blocks, which improves performance without impacting training speed (their comment: “Since batch normalization layers normalize the features, they get rid of range flexibility from networks by normalizing the features, it is better to remove them.” — I’m not sure how much sense this makes, but it’s hard to argue against empiricism)
  • For the full single-scale model, use residual scaling of 0.1 to avoid numerical instability (see below)
  • Pre-train the x3 and x4 networks with the x2 weights, leading to faster convergence
  • Preprocess inputs by subtracting out the mean RGB from the DIV2K dataset
  • Train on 48x48 low-res image patches (and their corresponding hi-res versions), augmented with flips and rotations
  • L1 loss led to better empirical convergence
  • Self-ensemble at test time: perform the 8 possible flip/rotation combinations and average the resulting outputs (applying the inverse flip/rotation to each output before averaging; sketched below)
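
Here’s a rough sketch of what that self-ensemble could look like in PyTorch (not the authors’ code; `model` stands in for any network that maps a low-res N×C×H×W tensor to its super-resolved version):

```python
import torch

def self_ensemble(model, lr):
    """Test-time self-ensemble: run the model on all 8 flip/rotation
    variants of the input, undo each transform on the output, average."""
    outputs = []
    for k in range(4):                 # rotations by 0/90/180/270 degrees
        for flip in (False, True):     # optionally add a horizontal flip
            x = torch.rot90(lr, k, dims=(2, 3))
            if flip:
                x = torch.flip(x, dims=(3,))
            with torch.no_grad():
                y = model(x)
            if flip:                   # invert the transforms on the output
                y = torch.flip(y, dims=(3,))
            y = torch.rot90(y, -k, dims=(2, 3))
            outputs.append(y)
    return torch.stack(outputs).mean(dim=0)
```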

(For background on ResNet architectures see e.g. https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035)

(I’m still not quite sure why residual scaling works. They point to Szegedy 2016, which explains “Also we found that if the number of filters exceeded 1000, the residual variants started to exhibit instabilities and the network has just ‘died’ early in the training, meaning that the last layer before the average pooling started to produce only zeros after a few tens of thousands of iterations. This could not be prevented, neither by lowering the learning rate, nor by adding an extra batch-normalization to this layer…. We found that scaling down the residuals before adding them to the previous layer activation seemed to stabilize the training.”)
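
For concreteness, here’s a minimal sketch of such a residual block with the batch norm removed and the constant scaling applied (assuming PyTorch; the channel count is illustrative, not the paper’s exact configuration):

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """EDSR-style residual block: conv-ReLU-conv with no batch norm;
    the residual branch is scaled by a constant before the skip add."""
    def __init__(self, n_feats=64, res_scale=0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, kernel_size=3, padding=1),
        )
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)
```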

Single-scale vs Multi-scale. The fast convergence from pre-training at x2 for the x3 and x4 networks is a hint that a single multi-scale model could take advantage of redundancies. Indeed, the authors propose such a multi-scale architecture (that they call MDSR — making up acronyms seems to be de rigueur for ML papers) that performs well despite having far fewer parameters than the full single-scale x2, x3, x4 networks. Some details:

  • When training MDSR, randomly select the x2, x3, or x4 scale for each minibatch and turn on/off the scale-specific parts of the network (why these scales, by the way? presumably because that’s what the NTIRE 2017 challenge tested)
  • The MDSR model includes a couple of 5x5 residual blocks right after the input, with different versions for x2, x3, x4 (it’s unclear to me exactly why they chose this architecture; see the sketch below for the overall structure)
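
To make the shared-body / scale-specific split concrete, here’s a toy sketch (assuming PyTorch; the layer counts and widths are placeholders, and plain 5x5 convs stand in for the paper’s scale-specific residual blocks):

```python
import torch.nn as nn

class MultiScaleSR(nn.Module):
    """Toy MDSR-like model: one shared body, plus scale-specific
    pre-processing and upsampling modules chosen per forward pass."""
    def __init__(self, n_feats=64, scales=(2, 3, 4)):
        super().__init__()
        self.head = nn.Conv2d(3, n_feats, 3, padding=1)
        # scale-specific pre-processing (the paper uses 5x5 residual blocks here)
        self.pre = nn.ModuleDict({
            str(s): nn.Conv2d(n_feats, n_feats, 5, padding=2) for s in scales
        })
        # shared body (plain convs standing in for the stack of residual blocks)
        self.body = nn.Sequential(
            nn.Conv2d(n_feats, n_feats, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(n_feats, n_feats, 3, padding=1), nn.ReLU(inplace=True),
        )
        # scale-specific sub-pixel upsamplers (see the pixel shuffle note below)
        self.up = nn.ModuleDict({
            str(s): nn.Sequential(
                nn.Conv2d(n_feats, n_feats * s * s, 3, padding=1),
                nn.PixelShuffle(s),
            ) for s in scales
        })
        self.tail = nn.Conv2d(n_feats, 3, 3, padding=1)

    def forward(self, x, scale):
        x = self.pre[str(scale)](self.head(x))
        x = self.body(x)
        return self.tail(self.up[str(scale)](x))

# training idea: pick scale = 2, 3, or 4 at random once per minibatch, so only
# that scale's pre/up modules are exercised while the shared body sees everything
```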

Shuffle upsampling. Oddly enough the authors don’t particularly emphasize their upsampling method. They seem to use pixel shuffling (Shi 2016): a convolution produces scale² output channels for each low-res pixel, and a pixel shuffle layer then rearranges those channels spatially into the larger grid in a fixed order. Also, to remove checkerboard artifacts there’s an initialization fix from Aitken 2017.
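
A sub-pixel upsampling stage along these lines might look like the following (assuming PyTorch’s built-in nn.PixelShuffle; the channel count is illustrative):

```python
import torch
import torch.nn as nn

def subpixel_upsampler(n_feats=64, scale=2):
    """Sub-pixel convolution: expand channels by scale^2, then let
    PixelShuffle rearrange them into a (scale*H, scale*W) feature map."""
    return nn.Sequential(
        nn.Conv2d(n_feats, n_feats * scale ** 2, kernel_size=3, padding=1),
        nn.PixelShuffle(scale),
    )

x = torch.randn(1, 64, 24, 24)           # a batch of 64-channel feature maps
print(subpixel_upsampler()(x).shape)     # torch.Size([1, 64, 48, 48])
```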

One thing I’ve been noticing is that there tends not to be dropout in convnet architectures, especially ResNets. The usual explanation is that dropout isn’t needed because batch norm is enough of a regularizer, but that doesn’t explain why dropout wouldn’t help in this paper (since the batch norm has been removed). Idk. Another curious thing is that the main work this paper builds on (Ledig 2016) is really all about a GAN method for super-resolution, but there’s nothing GAN-like in this paper. Maybe it’s a question of performance?

In any case, here are some of the results (excerpted from figures 6 and 7 in the paper). EDSR is the single-scale model and MDSR is the multi-scale model. Metrics are PSNR (dB) and SSIM. Note in particular the reconstruction of clear text from the blurred downsampled image.
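
(For reference, PSNR is just log-scaled mean squared error. A quick sketch, assuming pixel values in [0, 1] and ignoring evaluation details like color space and border cropping:)

```python
import numpy as np

def psnr(sr, hr, max_val=1.0):
    """Peak signal-to-noise ratio in dB between a super-resolved image
    and its ground truth, for pixel values in [0, max_val]."""
    mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)
```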

Ledig et al 2016 “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network” https://arxiv.org/abs/1609.04802

Shi et al 2016 “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network” https://arxiv.org/abs/1609.05158

Szegedy et al 2016 “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning” https://arxiv.org/abs/1602.07261

Aitken et al 2017 “Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize” https://arxiv.org/abs/1707.02937
