Discussing a Novel Approach to the Super-Resolution Loss

Sieun Park · Published in Analytics Vidhya · 5 min read · Apr 1, 2021

Problems of Current SR

As I discussed in a previous post, the loss functions currently used for SR (the MSE loss between HR and SR, adversarial training, and the perceptual loss) have problems and are not the fundamental objectives a model must achieve for the best perceptual quality. In this post, I will briefly summarize the problems of the losses currently used to measure SR quality and propose my ideas on how to refine these methods.

I would like to note that these are my personal observations and opinions, not a consensus of researchers. Some arguments, especially the third, may be disputable. The problems are as follows.

  1. Because the LR->SR mapping is not a one-to-one function, the model must be able to output multiple images with stochastic variation. In other words, the SR problem has multiple valid answers.
  2. Related to the first problem, MSE-based content losses do not tolerate stochastic variation in the textures of the image and penalize high-quality images simply because they differ pixel-wise (a small numeric sketch after this list illustrates this). MSE-based solutions therefore often produce overly smoothed images.
  3. Perceptual metrics proposed to address the second issue, such as the VGG loss and the adversarial loss, are not fundamental metrics for SR. The VGG loss feels like an ad-hoc technique, and adversarial training inherits many problems: it is highly unstable and often fails to converge. Neither method works by itself; they only work when multiple losses are mixed.
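To make the second point concrete, here is a minimal NumPy sketch, with a random patch standing in for a real high-frequency texture: two perceptually identical textures that differ only by a one-pixel shift still produce a large pixel-wise MSE.

```python
# Minimal sketch: pixel-wise MSE punishes plausible texture variation.
# A random patch stands in for a fine texture; shifting it one pixel
# leaves its appearance unchanged but yields a large MSE.
import numpy as np

rng = np.random.default_rng(0)
texture = rng.random((64, 64))          # stand-in for a high-frequency texture patch
shifted = np.roll(texture, 1, axis=1)   # same texture, shifted one pixel right

mse = np.mean((texture - shifted) ** 2)
print(f"MSE between perceptually identical textures: {mse:.4f}")  # far from 0
```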

Although recent SR models perform impressively and some of these points may not seem important in practice, I see the current loss formulations for SR as incomplete "tricks".

What is Super-Resolution?

I asked myself the most fundamental question: what is SR? As stated in multiple articles, SR aims to learn the inverse function of downsampling. I then drew the following diagram and contemplated how we can reward the model for predicting any of the sky-blue predictions that exist on the manifold of HR images. The problem with the MSE metric is that it doesn't care whether the SR image looks like an HR image; it only considers the pixel-wise distance (see the second figure).

Adversarial learning-based methods aim to learn the green HR manifold by having the discriminator learn that space and penalizing the generator based on the discriminator's judgments. If the discriminator successfully learns the HR space, the generator should be able to generate plausible images, but I believe this is disturbed by the mixing of losses, and the discriminator fails to learn the HR manifold completely. It is hard to draw a definitive conclusion, but in another post I will diagnose whether the discriminator of ESRGAN learned the true HR space rather than overfitting to the training data.

The SR problem and the problem of MSE based solutions

Comparing the Images in the Downsampled Space

I considered comparing the HR and SR images in the downsampled space rather than measuring their distance directly. So, instead of MSE(HR, G(LR)), we use MSE(LR, bicubic(G(LR))) as the loss. This is illustrated in the figures below.

Illustration of the method. The top diagram is the regular SR pipeline; the bottom is my idea.

This way, the loss penalizes the model only when the image is not on the true manifold of HR counterparts of the given LR image, so we can learn the HR space without adversarial training, at least in theory. With additional modifications, I believe we could directly learn perceptually pleasing images.

Illustration of my proposed loss
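Below is a minimal PyTorch sketch of this loss, assuming a 4x generator G and bicubic downsampling via F.interpolate; the function name downsampled_space_loss is mine, not from any library.

```python
# A minimal sketch of the proposed loss: instead of MSE(HR, G(LR)),
# downsample the prediction back to the LR grid (bicubic here) and
# compare it with the LR input itself.
import torch
import torch.nn.functional as F

def downsampled_space_loss(lr: torch.Tensor, sr: torch.Tensor, scale: int = 4) -> torch.Tensor:
    """MSE(LR, bicubic(SR)): penalizes the model only when the
    downsampled prediction no longer matches the LR input."""
    sr_down = F.interpolate(sr, scale_factor=1.0 / scale,
                            mode="bicubic", align_corners=False)
    return F.mse_loss(sr_down, lr)

# Usage in a training step (G is the SR generator):
#   sr = G(lr)
#   loss = downsampled_space_loss(lr, sr, scale=4)
#   loss.backward()
```

In a training step, this term simply replaces the usual MSE(HR, G(LR)) content loss.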

Implementation

I trained a 4x ESRGAN model with this proposed loss. Sadly, the results were very artificial and weird, as shown below. The columns represent the HR, SR, and bicubic-interpolated images, respectively.

Image patches from the reconstructed “baby” image.

At first, I thought training had collapsed due to a high learning rate or some other problem. But the model had in fact learned to exploit this loss: it minimized the loss without outputting the expected SR images. This might have been obvious, and I did anticipate the issue, but the results were very extreme. It seems that the latent space of possible HR images for a single LR image is very large, and some extra mechanism that enforces a "realistic" HR patch from the possible HR manifold must be implemented.

The following figures show the noisy SR outputs downsampled by 4x. They match the LR input very accurately, which confirms that the model minimizes the loss while outputting unpleasing SR images.
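To see why this exploit is possible, here is a toy demonstration. It swaps bicubic downsampling for 4x average pooling so the null space is exact, and uses a random tensor in place of a real SR output: noise that is zero-mean within each 4x4 block vanishes under downsampling, so the downsampled-space loss stays near zero while the image itself is ruined.

```python
# Toy demonstration (assumptions: average-pool downsampling instead of
# bicubic, random tensors instead of real images) of how the loss can be
# exploited: corrupting the image with noise that is zero-mean within
# each 4x4 block leaves the downsampled image untouched.
import torch
import torch.nn.functional as F

hr = torch.rand(1, 3, 64, 64)                 # stand-in for a clean SR output
noise = torch.randn(1, 3, 64, 64)
block_mean = F.avg_pool2d(noise, 4)           # per-block means of the noise
noise = noise - F.interpolate(block_mean, scale_factor=4, mode="nearest")
corrupted = hr + noise                        # visually ruined image

print(F.mse_loss(F.avg_pool2d(corrupted, 4), F.avg_pool2d(hr, 4)).item())  # ~0
print(F.mse_loss(corrupted, hr).item())                                    # large
```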

Ways to Improve

One way to improve this method is to add adversarial training. In the usual mixture of content loss + adversarial loss, the content loss opposes the adversarial loss, forcing a trade-off between the two. Mixing our loss with the adversarial loss should let both be minimized together, because our loss doesn't penalize the model unless the reconstructed image lies outside the HR manifold; this is bearable for the adversarial loss and can restrict GAN artifacts.
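Here is a hedged sketch of this mixture, assuming a discriminator D that outputs raw logits; the weight lambda_adv is a hypothetical hyperparameter, not a tuned value.

```python
# Sketch of mixing the downsampled-space loss with a standard
# (non-saturating) adversarial loss. `D` is assumed to return logits;
# `lambda_adv` is a hypothetical weighting, not a tuned value.
import torch
import torch.nn.functional as F

def generator_loss(lr, sr, D, scale=4, lambda_adv=5e-3):
    # consistency in the downsampled space (same term as above)
    sr_down = F.interpolate(sr, scale_factor=1.0 / scale,
                            mode="bicubic", align_corners=False)
    content = F.mse_loss(sr_down, lr)

    # adversarial term: push SR onto the HR manifold by fooling D
    logits = D(sr)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    return content + lambda_adv * adv
```

Because the content term only constrains the downsampled image, the adversarial term is free to pick a realistic point within the remaining null space rather than fighting the content loss.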

We could also use other upsampling methods instead of ESPCN, such as pre-upsampling or deconvolution (transposed convolution). ESPCN tiles separate channels into spatial positions, which is not a problem for the regular MSE loss but might be a serious problem in our case: because our loss is measured only at the downsampled scale, it is much easier for the model to exploit the loss by decoupling the outputs of the individual channels.
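For reference, here is what the two heads look like in PyTorch, assuming a 64-channel feature map, 3 output channels, and a 4x scale; ESPCN's sub-pixel head is nn.PixelShuffle, and the deconvolution head is nn.ConvTranspose2d.

```python
# Two alternative 4x upsampling heads (assumed 64-channel features, 3-channel output).
import torch.nn as nn

# ESPCN-style sub-pixel head: the conv produces r^2 * C channels,
# and PixelShuffle tiles them into spatial positions.
subpixel_head = nn.Sequential(
    nn.Conv2d(64, 3 * 4 ** 2, kernel_size=3, padding=1),
    nn.PixelShuffle(4),                  # (B, 48, H, W) -> (B, 3, 4H, 4W)
)

# Transposed-convolution head: learns the upsampling in one spatial
# operation, avoiding the per-channel tiling discussed above.
deconv_head = nn.ConvTranspose2d(64, 3, kernel_size=4, stride=4)
#                                        (B, 64, H, W) -> (B, 3, 4H, 4W)
```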

I welcome criticism of my method and my arguments; I would like to hear from you and improve this idea through your feedback.

