Possible Issues of the Loss Function in Deep Learning-based Super-Resolution

Sieun Park
Published in Analytics Vidhya
5 min read · Mar 30, 2021

Super Resolution

Super-resolution (SR) is the task of recovering high-resolution (HR) images from their low-resolution (LR) counterparts. Recent approaches to SR have shown impressive reconstruction performance, both in qualitative perceptual comparisons and on quantitative benchmarks (PSNR, SSIM). Although further research has resolved many problems of earlier approaches, we believe current DL-based SR methods still inherit fundamental problems, especially from their loss functions. We will focus specifically on the problems of current approaches to single image super-resolution (SISR), where we receive one LR image and aim to output an HR image.

https://arxiv.org/abs/1904.07523

We will first give a very quick overview of current deep learning-based SR methods. For an in-depth review of DL-based SR methods, refer to this paper and this blog post.

Model Architectures

There are multiple approaches to the task of SISR in terms of loss functions and model architectures; some are depicted above. After SRCNN first introduced convolutional neural networks to the task of SR, hundreds of variants that alter the model architecture were proposed.

https://paperswithcode.com/sota/image-super-resolution-on-set14-4x-upscaling

These methods include FSRCNN, VDSR, ESPCN, and models based on residual blocks such as EDSR, MDSR, and CARN. Recurrent network-based and DenseNet block-based approaches were also introduced. Finally, attention-based architectures, which mostly utilize channel-wise attention, and progressive upsampling models were proposed.

A variety of upsampling techniques, such as sub-pixel convolution (ESPCN), pre-upsampling, and deconvolution, were also proposed and applied across various model architectures.
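To make the sub-pixel convolution idea concrete, here is a minimal sketch in PyTorch (layer sizes are illustrative): a convolution produces r² × C channels, and PixelShuffle rearranges them into an r-times larger spatial grid.

```python
import torch
import torch.nn as nn

# Minimal sketch of ESPCN-style sub-pixel upsampling (sizes are assumptions):
# the conv outputs scale^2 * channels feature maps, and PixelShuffle
# rearranges them into a spatially upscaled image.
class SubPixelUpsample(nn.Module):
    def __init__(self, channels=3, scale=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.shuffle(self.conv(x))

lr = torch.randn(1, 3, 24, 24)       # a 24x24 LR patch
sr = SubPixelUpsample(scale=4)(lr)   # upscaled to 96x96
```

Because the rearrangement happens only at the end, the expensive convolutions all run at LR resolution, which is why this upsampling style became popular.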

Content-based Loss

The first approaches, such as SRCNN, proposed training based on a content loss, mostly an MSE loss between the reconstructed image f(LR) and the HR image. This loss also directly favors the PSNR metric. It seems very reasonable and straightforward, but it inherits a fundamental problem raised by the SRGAN paper.
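The link between the MSE loss and PSNR can be made concrete: PSNR is a monotone function of MSE, so minimizing one maximizes the other. A minimal sketch, assuming pixel values in [0, 1]:

```python
import numpy as np

# PSNR = 10 * log10(MAX^2 / MSE): a monotone function of MSE, so
# minimizing the MSE content loss directly maximizes PSNR.
# Pixel values are assumed to lie in [0, 1].
def psnr(hr, sr, max_val=1.0):
    mse = np.mean((hr - sr) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

hr = np.zeros((4, 4))
sr = np.full((4, 4), 0.1)  # every pixel off by 0.1 -> MSE = 0.01
print(psnr(hr, sr))        # 10 * log10(1 / 0.01) = 20.0 dB
```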

These PSNR-oriented approaches tend to output smoothed results without sufficient high-frequency details, since the MSE loss and PSNR metric fundamentally disagree with the subjective evaluation of human observers. This is because there are multiple possible outputs for one given LR patch, and the MSE-based solution typically finds the pixel-wise average of these solutions, which might not lie on the true HR manifold and is smoothed. This is depicted in the figure below.

For example, there might be two HR images, HR1 and HR2, that produce very similar LR images. So when only the LR image is given, the reconstruction f(LR) could be either HR1 or HR2, but the MSE loss favors outputting the average of the two possible HR patches.
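This averaging behavior can be verified with a toy experiment (the one-pixel "patches" below are made up for illustration): when HR1 and HR2 are equally likely targets for the same LR input, the output that minimizes the expected MSE is exactly their average.

```python
import numpy as np

# Toy demonstration: if HR1 and HR2 are equally likely ground truths for
# one LR input, the candidate output x minimizing the expected MSE
# 0.5*(x - HR1)^2 + 0.5*(x - HR2)^2 is their pixel-wise average,
# an image that may lie off the true HR manifold.
hr1, hr2 = 0.0, 1.0
candidates = np.linspace(0.0, 1.0, 101)
expected_mse = 0.5 * (candidates - hr1) ** 2 + 0.5 * (candidates - hr2) ** 2
best = candidates[np.argmin(expected_mse)]
print(best)  # 0.5 -- the average of the two plausible HR values
```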

Although some cases might be learned by very complex neural networks, the fact that the correct HR patch for a given LR image is not unique is a fundamental limitation of MSE-based solutions and leads them to output overly smoothed results.

Solutions to this Behavior

As a solution to this phenomenon of SR models outputting overly smoothed images, one branch of work proposes Generative Adversarial Networks (GANs). These works include SRGAN, EnhanceNet, ESRGAN, and the recent RSRGAN. GAN-based works usually use the sum of a content loss and an adversarial loss. They also often utilize a perceptual loss, typically computed on intermediate activations of a pre-trained VGG19 network. The equation below is the loss formulation of ESRGAN.
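As given in the ESRGAN paper, the total generator loss combines three terms:

```latex
L_G = L_{percep} + \lambda\, L_G^{Ra} + \eta\, L_1
```

where $L_{percep}$ is the VGG feature (perceptual) loss, $L_G^{Ra}$ is the relativistic adversarial loss, $L_1 = \mathbb{E}\,\lVert G(x_{LR}) - y_{HR} \rVert_1$ is the pixel-wise content loss, and $\lambda$, $\eta$ balance the terms.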

According to empirical results in the SRGAN paper, solely using the GAN loss wasn't sufficient for generating high-resolution texture details; it had to be combined with the perceptual loss.

GAN-based solutions do not perform comparably to MSE-based models in terms of quantitative pixel-wise metrics such as PSNR and SSIM, but they show better perceptual quality and score very high Mean Opinion Scores (MOS), which is the key interest in many cases.

Although GAN-based solutions are the best choice when perceptual quality is the priority, and they are the most successful at generating photo-realistic HR images, they inherit problems from GAN training. From observations on GAN-based SR solutions and my personal opinions, I find four practical and conceptual problems with GAN-based SR.

  1. Unwanted artifacts: The generator often generates images that are far from the LR image, with unwanted artifacts present in the image. The instabilities of GAN training and all the issues below combined account for these artifacts.
  2. Overfitting of D: Identified by the BigGAN paper as a fundamental problem in GAN training, adversarial training never truly converges and the discriminator is prone to overfitting. I believe the overfitting of D could be more severe in SR because the number of images is small and G does not receive any kind of noise, which limits G's capability to output a distribution of images.
  3. No capability to output distributions: One fundamental difference between the generator in regular GANs and in GANs for SR is the absence of a noise or latent vector input. Therefore, the model can only output one image for a given LR patch. According to the STF-SR paper, the model must be able to output different SR results for different surrounding and texture information. I believe this is the key reason for easy mode collapse when training with a higher adversarial loss ratio.
  4. Mixing of losses: Mixing multiple losses can be beneficial by introducing a tradeoff between two extreme states. But it also means that neither loss is an optimal measure of the overall performance of SR. I believe there is a key element lost in GAN-based SR that must be resolved through a modified loss function.
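The loss mixing discussed in point 4 can be sketched as follows (the weights and feature shapes are illustrative assumptions, not the values of any specific paper): content, perceptual, and adversarial terms are summed into one generator objective.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a mixed GAN-SR generator objective. The weights below
# are illustrative; papers tune them, and no single term is an optimal
# measure of SR quality on its own.
def generator_loss(sr, hr, sr_feat, hr_feat, d_fake_logits,
                   w_content=1.0, w_percep=1.0, w_adv=5e-3):
    content = F.l1_loss(sr, hr)            # pixel-wise content loss
    percep = F.mse_loss(sr_feat, hr_feat)  # e.g. VGG19 feature distance
    adv = F.binary_cross_entropy_with_logits(  # push D to call SR "real"
        d_fake_logits, torch.ones_like(d_fake_logits))
    return w_content * content + w_percep * percep + w_adv * adv
```

For example, with `sr == hr` and identical features, only the adversarial term contributes; raising `w_adv` sharpens textures but, as argued above, risks artifacts and mode collapse.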

To conclude, the generator can only output one SR counterpart, while there might be multiple valid solutions conditioned on the context of the complete image and the texture of the specific part. This might also result in the overfitting of D, because the generated images have less variation. Mixing losses also isn't a fundamental solution to the shortcomings of either loss. I am very confident that there is room to conceptually improve the loss function for SR.

Downsampling operation

If the LR counterparts are generated through bicubic downsampling or another specific method, LR = bic(HR), then super-resolution aims to learn the inverse function of bic: f(LR) = HR. But in real-world problems, it is not guaranteed that the downsampling operation matches the specific operation used for training. We will discuss this issue in the next post.
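The assumed degradation model is easy to sketch: training pairs are usually built by bicubic-downsampling HR images, so the network implicitly learns to invert that one operator, while real-world LR images may come from a different (unknown) downsampling, blur, and noise pipeline.

```python
import torch
import torch.nn.functional as F

# Sketch of the standard synthetic degradation LR = bic(HR): a 4x bicubic
# downsample of an HR patch. An SR model trained on such pairs learns to
# invert this specific operator, not arbitrary real-world degradations.
hr = torch.rand(1, 3, 96, 96)
lr = F.interpolate(hr, scale_factor=0.25, mode='bicubic',
                   align_corners=False)
print(lr.shape)  # 4x smaller in each spatial dimension
```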
