Super-Resolution trends at ECCV’18

Tetianka Martyniuk
Let’s Enhance stories
9 min read · Sep 24, 2018

You can’t deliver value with an AI-driven product nowadays if you don’t keep up with the recent research in your field. And we all know where to look for breakthrough ideas in Computer Vision: publications at the top-tier conferences, one of which is the European Conference on Computer Vision, which I was lucky to attend in Munich this year. Sure, there were lots of decent papers, but here I would like to briefly outline those of particular relevance to what we do at Let’s Enhance. All of the papers mentioned below were presented at the poster sessions at ECCV’18.

So, as I promised in the previous post (check it out if you haven’t yet), here comes an article about the trends in Single Image Super-Resolution.

I am also planning to do a recap of the Perceptual Image Restoration and Manipulation (PIRM) Workshop and Challenge soon.

Important: these are notes, highlights, and ideas that I found interesting, not exhaustive paper summaries, so some details are missing; use the hyperlinks to read more in the papers.

Among the general approaches dealing with non-class-specific image data, there were a few papers explicitly focused on face super-resolution. I believe everyone who has ever worked on image SR knows that faces are always hard to deal with. Your network may beat SOTA methods in PSNR and preserve textures, light balance, etc., but the faces will remain unrealistic no matter what you do. So, separate papers on faces: one could see it coming.

To learn image super-resolution, use a GAN to learn how to do image degradation first

Why is SR sometimes referred to as a rather simple problem? I was once told that it was boring in comparison to other image restoration tasks because of how easily one can get training data: downscale your images to obtain low-res counterparts, and you’re done! But what if you give up the prior assumption that the low-res images you’ve generated actually exist in the real world? Because apparently, they don’t. Real-world low-resolution images clearly belong to a different distribution than those generated via bicubic interpolation; thus, a GAN trained on artificially downscaled data won’t generate realistic images.
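
As a refresher, here is what that “easy” data generation looks like in practice: a minimal Pillow sketch that produces an (LR, HR) pair by bicubic downscaling. The file name and scale factor are placeholders.

```python
# The standard (and, per the paper, unrealistic) way of making SR training
# data: bicubic downscaling of an HR image. Requires Pillow.
from PIL import Image

def make_lr_hr_pair(path, scale=4):
    hr = Image.open(path).convert("RGB")
    # Crop so that both dimensions are divisible by the scale factor.
    w, h = (hr.width // scale) * scale, (hr.height // scale) * scale
    hr = hr.crop((0, 0, w, h))
    # The "easy" LR image. Real-world LR images follow a different
    # distribution, which is exactly the paper's point.
    lr = hr.resize((w // scale, h // scale), Image.BICUBIC)
    return lr, hr

lr, hr = make_lr_hr_pair("face.png")  # "face.png" is a placeholder path
```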

To deal with this issue, the authors suggest a two-stage approach: first they train a High-to-Low GAN on unpaired images, so it learns how to degrade and downscale an HR image. Learning the degradation process rather than modeling it helps to obtain realistic results when one faces multiple or unknown degradations (such as motion blur). It also solves the unpleasant data problem that always comes up in image restoration tasks: paired data is hard to get. So at this first stage they use different, unaligned datasets: the HR set is a combination of Celeb-A, AFLW, LS3D-W, and VGGFace2, while the LR set is Widerface. The second stage is training a Low-to-High GAN in a paired setting, using the output of the previous step.
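
To make the two stages concrete, here is a compressed PyTorch sketch of the idea, not the authors’ code: the generator and discriminator modules, the pixel term, and the equal loss weighting are all placeholder assumptions.

```python
import torch
import torch.nn.functional as F

def high_to_low_loss(G_h2l, D_lr, hr_faces):
    """Stage 1, unpaired: G_h2l learns to degrade HR faces so that its
    outputs look like real-world LR images to D_lr (trained on Widerface)."""
    fake_lr = G_h2l(hr_faces)             # learned degradation
    adv = -D_lr(fake_lr).mean()           # fool the LR discriminator
    # A pixel term keeps the degraded output tied to its HR source;
    # plain 4x average pooling stands in for the paper's exact formulation.
    pix = F.mse_loss(fake_lr, F.avg_pool2d(hr_faces, kernel_size=4))
    return adv + pix

def low_to_high_loss(G_l2h, D_hr, G_h2l, hr_faces):
    """Stage 2, paired: stage 1's outputs give (LR, HR) pairs for
    supervised + adversarial SR training."""
    with torch.no_grad():
        lr = G_h2l(hr_faces)              # realistic LR inputs
    sr = G_l2h(lr)
    return -D_hr(sr).mean() + F.mse_loss(sr, hr_faces)
```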

The authors claim that the only other method reporting face super-resolution results on real-world LR facial images is a CVPR’18 paper from the same lab, which, however, is face-specific, as it uses facial landmarks, and thus cannot be generalized to other object categories.

Here are some impressive results:

Comparison with SRGAN and CycleGAN.

Face Super-resolution Guided by Facial Component Heatmaps

Here the authors claim that their face SR method outperforms SOTA results because it accounts for the facial structure and thus can capture pose variations. They also largely reduce the number of training examples: 30K LR/HR images, compared to the 230K images needed to train a landmark localization network (the SOTA face alignment method by the group behind the paper discussed above).

So, the main idea is that they use a multi-task CNN for upsampling together with a discriminative network. The upsampling network consists of two branches that collaborate with each other: an upsampling branch and a facial component heatmap estimation branch.

The network architecture

Detecting facial components in 16x16 images is challenging, so the authors first super-resolve the features of the LR images, then employ a spatial transformer network to align the feature maps, and afterwards use the upsampled feature maps to estimate the heatmaps of facial components. The alignment of the feature maps is the key reason why the training data can be reduced. The estimated facial component heatmaps also provide visibility information, which cannot be deduced from pixel-level information alone.
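
As I understood the pipeline from the poster, the forward pass could be wired roughly as follows; all module classes here are placeholders standing in for the branches described above, not the authors’ implementation.

```python
import torch
import torch.nn as nn

class HeatmapGuidedFSR(nn.Module):
    """Two collaborating branches: upsampling and heatmap estimation."""
    def __init__(self, feat_sr, stn, heatmap_head, upsampler):
        super().__init__()
        self.feat_sr = feat_sr            # super-resolves the LR *features* first
        self.stn = stn                    # spatial transformer: aligns feature maps
        self.heatmap_head = heatmap_head  # estimates facial component heatmaps
        self.upsampler = upsampler        # upsampling branch producing the HR face

    def forward(self, lr_face):
        feats = self.feat_sr(lr_face)     # 16x16 is too small for detection,
        feats = self.stn(feats)           # so detect on upsampled, aligned features
        heatmaps = self.heatmap_head(feats)
        # The heatmaps act as a structural prior (including visibility),
        # concatenated with the appearance features.
        return self.upsampler(torch.cat([feats, heatmaps], dim=1))
```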

Visualization of estimated facial component heatmaps

Here are some results:

(a) Unaligned LR inputs. (b) Original HR images. (c) Nearest neighbors of aligned LR faces. (d) Cascaded Bi-Networks. (e) Transformative Discriminative Autoencoders. (f) TDAE retrained with the authors’ training dataset. (g) Authors’ results.

To conclude, the main difference from other FSR methods is that the authors use not only intensity similarity mappings (low-level information) but also collect structural information and use it as an additional prior.

Blindly increasing the depth of the network cannot improve it effectively.

Whether deeper networks can further contribute to image SR and how to construct very deep trainable networks remains to be explored.

Image Super-Resolution Using Very Deep Residual Channel Attention Networks

The authors suggest a way to make a CNN even deeper: 10 residual groups with 20 residual channel attention blocks each. But first things first.

They claim that depth is important, and indeed it is; we all saw EDSR and MDSR make a splash. However, simply stacking residual blocks to construct deeper networks can hardly bring further improvements; one has to come up with something new in terms of architecture.

So, they introduce RIR, a residual-in-residual structure, where each residual group stacks several residual blocks. That is how one obtains both long and short skip connections. The identity mappings and the shortcuts in the residual blocks allow the network to bypass the abundant low-frequency information.
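
A bare-bones PyTorch sketch of that layout, assuming the paper’s 10-groups-by-20-blocks configuration; the block type is injected so that the RCAB sketched further below can be plugged in, and the channel width is illustrative.

```python
import torch.nn as nn

class ResidualGroup(nn.Module):
    def __init__(self, make_block, channels, n_blocks=20):
        super().__init__()
        self.blocks = nn.Sequential(*[make_block(channels) for _ in range(n_blocks)])
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        return x + self.conv(self.blocks(x))  # short (group-level) skip

class RIR(nn.Module):
    def __init__(self, make_block, channels=64, n_groups=10):
        super().__init__()
        self.groups = nn.Sequential(
            *[ResidualGroup(make_block, channels) for _ in range(n_groups)]
        )
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # Long skip: low-frequency content bypasses the whole trunk.
        return x + self.conv(self.groups(x))
```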

Network architecture

Another contribution is the channel attention mechanism. The problem with leading CNN-based methods is that they treat every channel-wise feature equally, lacking discriminative learning ability across feature channels. So, the introduced channel attention adaptively rescales each channel-wise feature, concentrating on the more useful channels.

Channel attention.

The basic module:

RCAB, the building block that allows training a very deep network of over 400 layers.
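
Since the channel attention follows the well-known squeeze-and-excitation recipe, a sketch of RCAB is straightforward; the reduction ratio of 16 matches the paper, everything else is illustrative.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: one descriptor per channel
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel rescaling weights
        )

    def forward(self, x):
        return x * self.gate(x)  # adaptively rescale each channel-wise feature

class RCAB(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            ChannelAttention(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # block-level short skip
```

Plugging this into the RIR skeleton above as RIR(RCAB) reproduces the 10-by-20 structure.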

And here come some results:

The paper clearly aims at PSNR/SSIM values; you won’t find results in it that would let you assess perceptual quality.

Multi-scale Residual Network for Image Super-Resolution

The authors start by reconstructing classic SR models such as SRCNN, EDSR, and SRResNet. Based on these reconstruction experiments, they claim that the said models are:

  • hard to reproduce (training tricks matter a lot)
  • inadequate in feature utilization (they do not make full use of the LR image features, which gradually vanish as the depth grows)
  • poorly scalable (difficult to adapt to arbitrary upscaling factors with only minor adjustments)

The suggested novel network architecture is called MSRN — Multi-scale residual network.

Network architecture

The building blocks of the feature extraction part, multi-scale residual blocks (MSRBs), produce local multi-scale features. An MSRB consists of two parts: multi-scale feature fusion and local residual learning. Different bypasses use convolution kernels of different sizes to adaptively detect image features at different scales, and residual learning makes the network even more efficient.

An MSRB also has a convolution layer with a 1×1 kernel as a bottleneck for global feature fusion.
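
Putting the description together, one MSRB might look like the following sketch; the channel widths follow naturally from the concatenations, but treat it as an illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSRB(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.s1 = nn.Conv2d(c, c, 3, padding=1)           # 3x3 bypass, stage 1
        self.p1 = nn.Conv2d(c, c, 5, padding=2)           # 5x5 bypass, stage 1
        self.s2 = nn.Conv2d(2 * c, 2 * c, 3, padding=1)   # stage 2 sees both scales
        self.p2 = nn.Conv2d(2 * c, 2 * c, 5, padding=2)
        self.bottleneck = nn.Conv2d(4 * c, c, 1)          # 1x1 fusion bottleneck

    def forward(self, x):
        s, p = F.relu(self.s1(x)), F.relu(self.p1(x))
        cat = torch.cat([s, p], dim=1)                    # multi-scale feature fusion
        s2, p2 = F.relu(self.s2(cat)), F.relu(self.p2(cat))
        out = self.bottleneck(torch.cat([s2, p2], dim=1))
        return out + x                                    # local residual learning
```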

The outputs of all MSRBs are used as hierarchical features for global feature fusion. Finally, all these features are sent to the reconstruction module to recover the HR image.

The authors train the network on the DIV2K dataset with no special initialization or any other training tricks, showing that they overcome the first disadvantage listed above, “hard to reproduce”.

The results:

Again, the comparison focuses on regions with sharp edges; there is nothing on perceptual quality.

Results on other low-level computer vision tasks. To me, this looks like the biggest contribution, as it opens up a way to obtain a single multi-task model for image restoration.

Application examples for image denoising and image dehazing, respectively.

Fast, Accurate, and Lightweight Super-Resolution with Cascading Residual Network

As a motivation, here is a comparison of various benchmark algorithms in terms of Mult-Adds and the number of parameters.

On the Set14 dataset, ×4 upscaling.

So, the main contribution here is straightforward: a lightweight network. The authors call it CARN (Cascading Residual Network).

  • Global and local cascading connections.
  • Intermediate features are cascaded and combined in 1×1 conv blocks.
  • This enables multi-level representation and multiple shortcut connections, which makes information propagation effective.

However, the advantage of multi-level representation is limited to the inside of each local cascading block, and multiplicative manipulations such as 1×1 convolutions on the shortcut connections can hamper information propagation, so it is natural to expect some performance degradation.

But CARN uses both local and global cascading, and the global cascading mechanism eases the information propagation issues that arise when one uses local cascading only.
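
Here is a rough sketch of the global cascading mechanism: every block’s output is concatenated with everything produced so far and fused by a 1×1 convolution. The local cascading inside each block mirrors the same pattern one level down; the block type and channel width are placeholders.

```python
import torch
import torch.nn as nn

class CascadingNet(nn.Module):
    def __init__(self, block_cls, c=64, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList([block_cls(c) for _ in range(n_blocks)])
        # After the i-th block we fuse (i + 2) feature maps of width c.
        self.fuse = nn.ModuleList(
            [nn.Conv2d((i + 2) * c, c, 1) for i in range(n_blocks)]
        )

    def forward(self, x):
        features, out = [x], x
        for block, fuse in zip(self.blocks, self.fuse):
            features.append(block(out))
            out = fuse(torch.cat(features, dim=1))  # global cascading connection
        return out
```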

Efficient CARN

To improve the efficiency of CARN, the authors introduce a residual-E block.

The approach is similar to MobileNet, but the depthwise convolution is replaced with a group convolution. The user can choose the group size appropriately, since it trades off efficiency against performance.

To further reduce the number of parameters, they apply a technique similar to recursive networks: the parameters of the cascading blocks are shared, effectively making the blocks recursive.

By changing the plain residual block to the efficient residual block, one can reduce the number of operations.
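
A sketch of what a residual-E block could look like under these assumptions; groups=4 is an arbitrary point on the efficiency/performance trade-off, and the layer layout is my reading of the description rather than the paper’s code.

```python
import torch.nn as nn

class ResidualE(nn.Module):
    def __init__(self, c=64, groups=4):
        super().__init__()
        self.body = nn.Sequential(
            # Group convolutions cut the ops roughly by the number of groups.
            nn.Conv2d(c, c, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1, groups=groups),
            nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 1),  # 1x1 conv mixes information across groups
        )

    def forward(self, x):
        return x + self.body(x)
```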

The results are as follows:

And … back to GANs.

SRFeat: Single Image Super Resolution with Feature Discrimination

The main idea: to employ an additional discriminator that works in the feature domain.

There have already been some great ideas on how to modify the loss function to make the most of what features can represent. SRGAN was a breakthrough, introducing the perceptual loss (MSE on feature maps). Then EnhanceNet brought the texture matching loss (based on the Gram matrix of feature maps). What’s next?

The authors claim that, similarly to MSE on pixels, MSE on VGG features is not enough to fully represent the actual characteristics of feature maps. So they add an adversarial loss on feature maps and call their method SRFeat.
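
The idea is easiest to see next to the SRGAN-style perceptual term: both operate on the same VGG feature maps, but one compares them with MSE while the other feeds them to a discriminator. In the sketch below the VGG cut-off layer and the loss form are my assumptions, and the weights argument may differ across torchvision versions.

```python
import torch.nn.functional as F
import torchvision

# Frozen pretrained VGG-19 feature extractor (the cut-off is my choice).
vgg = torchvision.models.vgg19(
    weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1
).features[:36].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def feature_losses(sr, hr, d_feature):
    f_sr, f_hr = vgg(sr), vgg(hr)
    perceptual = F.mse_loss(f_sr, f_hr)   # SRGAN-style perceptual loss
    # The new ingredient: a discriminator judging realism of *feature maps*.
    feature_adv = -d_feature(f_sr).mean()
    return perceptual, feature_adv
```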

Generator

They train the generator in two steps: pre-training and adversarial training. In the pre-training step, they train the network by minimizing the MSE loss. The resulting network is said to already achieve high PSNRs; however, it cannot produce perceptually pleasing results with the desired high-frequency details.

Discriminator

The subsequent adversarial training then minimizes a loss function that consists of a perceptual similarity loss, an image GAN loss, and a feature GAN loss. The last one encourages the generator to produce structural high-frequency features and to better match the real distribution of features.
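
Combining the three terms in the adversarial phase might look like this, reusing feature_losses from the sketch above; the weights lambda_i and lambda_f are placeholders, not the paper’s values.

```python
def generator_loss(sr, hr, d_image, d_feature, lambda_i=1e-3, lambda_f=1e-3):
    perceptual, feature_adv = feature_losses(sr, hr, d_feature)
    image_adv = -d_image(sr).mean()  # classic image-domain GAN term
    return perceptual + lambda_i * image_adv + lambda_f * feature_adv
```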

They use ImageNet for pre-training the generator and DIV2K for further training. The qualitative comparison of the results is below:

SRFeat_IF means using both the image and feature GAN losses.

The results indeed seem to be more accurate and realistic.

So, this was a short recap of the six most memorable posters on Single Image Super-Resolution at ECCV’18. I hope you enjoyed it, and stay tuned for more!
