In my previous post we had a look at how, here at idealo.de, we trained a convolutional neural network, featuring residual and dense connections, for Image Super-Resolution (ISR) to upscale images with a higher quality than most other interpolation approaches.
In this post we will continue where we left off: noise in the input image can be a severe issue with the previous approach. To fix this we will tinker with the training set and the loss function. We will make use of deep features from a common classification network, the VGG19, and introduce an adversarial network to enhance how and what our ISR network learns.
As always, this project is available on GitHub where you can find a variety of pre-trained weights, as well as a few Colab-notebook tutorials, so you can use the free GPU provided by Google to play around with the project, experiment, or just improve the quality of some old pictures of yours.
In the following we will have a look at:
- how to train the network to add compression artefact cancellation to the ISR process;
- how model averaging can be used in ISR to optimize the results;
- how to enhance detail quality by introducing a perceptual component to the loss function, using convolutional features from the VGG network;
- how to achieve more realistic results by adding an adversarial component to the loss function, using a Generative Adversarial Network (GAN) training strategy, and how this affects learning.
The compression artefacts problem
In the previous post we saw how the model we obtained delivered good results on noise-free and detailed images, but we ended on a fairly bitter note with this image:
The noise present in the input image is ‘magnified’ too and the result does not look aesthetically pleasing; this ultimately hinders the quality of the image super-scaling process.
As it turns out, compression artefacts are not an issue limited to a few images, like the sandal, but a widespread illness across the images in our product catalog.
Ensuring that these artefacts are not worsened, and possibly reduced, during the super-scaling process is therefore a crucial point for our work.
In the figure below we zoom in on some details of the kids' bedroom, as well as a detail from a purse. As in the sandal example, the noise is sharpened and the super-scaling process worsens the perceptual quality of the image.
How image restoration fits in the ISR picture
The network we use here for image super-resolution is a Residual Dense Network (RDN) like the one described in the 2018 paper from Zhang et al.
What allows an RDN, and most SR networks, to create a super-scaled version of an input image is the construction of a very large number of full-sized (the same size as the input image) intermediate representations of the input image, which are then recombined to form a larger one. In this case the intermediate representations of the LR input image are the outputs of the Residual Dense Blocks (RDB), which are then reordered through an operation called Pixel Shuffle, a more efficient way of doing Deconvolutions for image super-resolution (read all about it here), and recombined through a couple of convolutions. For a better understanding of Residual Connections, read the very influential paper Deep Residual Learning for Image Recognition (K. He et al. 2015).
It therefore seems reasonable to include some level of image restoration, which would fit either in the construction of these full-size representations or in their recombination.
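To make the Pixel Shuffle reordering mentioned above more concrete, here is a minimal NumPy sketch. It assumes a channels-last layout and a channel ordering of (row offset, column offset, output channel); the function name and that ordering are illustrative assumptions, not necessarily what the repository uses.

```python
import numpy as np


def pixel_shuffle(features: np.ndarray, scale: int) -> np.ndarray:
    """Rearrange an (H, W, C * scale**2) array into (H * scale, W * scale, C).

    Assumed channel ordering: (row offset, column offset, output channel).
    """
    h, w, channels = features.shape
    if channels % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale**2")
    c = channels // (scale * scale)
    # Split the channel axis into (row offset, column offset, output channel).
    x = features.reshape(h, w, scale, scale, c)
    # Interleave the offsets with the spatial axes:
    # (h, row offset, w, column offset, output channel).
    x = x.transpose(0, 2, 1, 3, 4)
    # Merge each spatial axis with its offset to get the enlarged image.
    return x.reshape(h * scale, w * scale, c)
```

Each group of `scale**2` low-resolution channels becomes one `scale × scale` patch of the output, so the upscaling itself involves no arithmetic, only a rearrangement of values the preceding convolutions have already computed.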
Naive compression noise removal
As a first remedy, we introduce compression as a preprocessing step for the low-res images of the DIV2K dataset in order to reproduce the artefacts we observed. We re-train the model obtained from the uncompressed DIV2K on this preprocessed dataset, using the same settings and pixel-wise Mean Squared Error (MSE) loss. Our hope is that training the network to reconstruct artefact-free super-scaled images from their smaller, compressed counterparts will force the network to learn to recognize the noise patterns and possibly remove them.
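The compression preprocessing step could look something like the sketch below. It assumes the artefacts are JPEG artefacts and round-trips each low-res image through an in-memory JPEG encoding with Pillow; the helper name `jpeg_compress` is hypothetical, and how a "compression level" maps onto Pillow's `quality` setting is an assumption rather than the exact setup used for training.

```python
import io

import numpy as np
from PIL import Image


def jpeg_compress(image: np.ndarray, quality: int) -> np.ndarray:
    """Round-trip an RGB uint8 array through in-memory JPEG encoding.

    Lower `quality` values mean heavier compression and stronger artefacts.
    """
    buffer = io.BytesIO()
    Image.fromarray(image).save(buffer, format="JPEG", quality=quality)
    buffer.seek(0)
    # Decode again so the artefacts are baked into the pixel values.
    return np.array(Image.open(buffer).convert("RGB"))
```

Only the low-res inputs would be passed through this function; the high-res targets stay untouched, so the network is asked to map degraded inputs to clean outputs.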
Here are some examples of the network trained on datasets preprocessed with different levels of compression:
Depending on how heavily the input images are compressed at training time, the network learns to “smoothen” the image, clearing compression noise at different levels.
This smoothing effect is not precisely what we hoped for, because much of the detail sharpness is lost. This can be a desirable effect for very noisy images, but it definitely is not suitable for less corrupted ones and finding the right balance can be tricky.
Notice that we obtained different levels of smoothing by preprocessing the input images with different levels of compression: with this strategy, we would need to train and use a different model for each desired level of noise reduction. This is definitely not a scalable solution.
Another approach can be averaging the outputs of two or more models, for instance taking the pixel-wise average of the outputs of a non-noise cancelling network and a noise cancelling one. The main issue here is having to perform a double inference step.
Directly averaging the model weights addresses both these issues. For instance, given two trained sets of weights w1 and w2, we define the new set of weights as
w = alpha * w1 + (1 - alpha) * w2
with 0 ≤ alpha ≤ 1.
The animation below shows the outputs obtained by interpolating the weights of a 90% noise-reduction network with those of a no-noise-reduction network, for different alpha coefficients.
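The weight averaging described in this section can be sketched in a few lines. This assumes both models share exactly the same architecture, that their weights are available as lists of NumPy arrays in the same order, and that the blend is the convex combination alpha * w1 + (1 - alpha) * w2; the function name is illustrative.

```python
import numpy as np


def interpolate_weights(w1, w2, alpha):
    """Blend two weight lists layer by layer: alpha * w1 + (1 - alpha) * w2."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(w1) != len(w2):
        raise ValueError("both models must have the same number of weight arrays")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(w1, w2)]
```

With Keras models, the weight lists would typically come from `model.get_weights()` and the blended list would be loaded back with `model.set_weights(...)`. A single forward pass is then enough at inference time, and alpha can be tuned without any retraining.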
Perceptual loss to enhance the super-scaled image
The previous approach to noise cancelling did not work as we hoped: in order to minimize the pixel-wise MSE loss, the network indiscriminately smooths the output image to avoid sharp artefacts, which would otherwise increase the average error. This has the drawback of diminishing, or even completely erasing, the sharpness of the textures and patterns that we want to enhance.
What we really want is for the network to learn the patterns idiosyncratic to the noise and replace the corresponding portions with a realistic approximation of the original image, hence preserving the precious details and delivering an image that is both free of noise and of good perceptual quality.
A common effect of using a pixel-wise MSE loss function for image super-resolution is a smooth output: minimizing the MSE for each pixel leads to results that are good on average, but these do not correlate with perceptually pleasing results, the crisp details that we are looking for.
The idea would then be to use a loss function that penalizes more heavily reconstruction errors committed on perceptually relevant elements of the image such as edges, lines and shapes, textures and so on.
As you can read in the 2018 paper from Zhang et al. “The Unreasonable Effectiveness of Deep Features as a Perceptual Metric”, these perceptually relevant parts of the image correlate very well with the deep features, or intermediate convolutional feature maps, of common classification networks such as the well-known VGG network. The intuition here is that what helps a network to classify an image is also what the human eye pays attention to.
Thanks to this more “concentrated” loss, the network can abstract away from the pixel level and focus its efforts on reconstructing the visually important parts.
As the features and structures of the compression artefacts are quite certainly not the most relevant features for image classification, the optimization landscape should be less dependent on this noise, hence allowing the network to successfully ignore it during reconstruction.
For all these reasons, we decided to give deep features a try at preserving some extra level of detail while trying to reduce the amount of noise present in the image.
In this example, we take a combination of the pixel-wise MSE and a loss computed on two intermediate output layers of the VGG network, weighting them with W1, W2 and W3 respectively, so the training setup can now be represented as follows:
We again retrain the network obtained from the first training session on non-compressed DIV2K images, this time on the dataset preprocessed with a 50% compression rate.
On the one hand, there seems to be a noticeably higher level of detail, while a good part of the noise is removed. On the other hand, when inspecting these images more closely or going back to the noisier sandal image, a new form of “square-like” artefacts can be identified.
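The weighted combination of the pixel-wise term and the deep-feature terms used in this section can be sketched as follows. To keep the example self-contained, the two feature extractors are simple image-gradient functions standing in for intermediate VGG19 layers; they are hypothetical placeholders, and comparing features with an MSE is likewise an assumption made for illustration.

```python
import numpy as np


def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def combined_loss(hr, sr, feature_extractors, weights):
    """weights[0] scales the pixel-wise MSE; weights[1:] scale the feature MSEs."""
    if len(weights) != len(feature_extractors) + 1:
        raise ValueError("need one weight per feature extractor plus one for pixels")
    loss = weights[0] * mse(hr, sr)
    for weight, extract in zip(weights[1:], feature_extractors):
        loss += weight * mse(extract(hr), extract(sr))
    return loss


# Hypothetical stand-ins for two intermediate VGG19 feature maps.
def horizontal_edges(image: np.ndarray) -> np.ndarray:
    return np.diff(image, axis=1)


def vertical_edges(image: np.ndarray) -> np.ndarray:
    return np.diff(image, axis=0)
```

In a real training setup the extractors would be the frozen VGG19 network truncated at the chosen layers, and the three weights play the role of W1, W2 and W3.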
Adversarial Network Loss
The introduction of these new artefacts is a known issue. Looking at the literature, VGG perceptual loss is often introduced in a Generative Adversarial Network (GAN) context: seemingly, while the deep feature loss enforces the reconstruction of higher-level structure, a discriminator network trained to distinguish between real HR and fake SR images is able to “shape” these artefacts into realistic-looking images.
On top of this, GANs are known for their ability to generate realistic-looking images and audio by “making sense” of more or less random input vectors, or noise. In much the same way, training an ISR network in a GAN fashion might help it to process the compression noise and “create” realistic-looking detail out of it.
Enough speculation: we construct our discriminator network taking inspiration from the excellent paper Photo-Realistic Single-Image Super-Resolution Using a Generative Adversarial Network (SRGAN, Ledig et al. 2017), to which we apply some very light modifications to improve training (see the GitHub page for more details), and re-train the model we obtained using the deep features on the 50% compressed DIV2K dataset.
The whole training setup can now be represented as follows:
And this is what we obtain when applying it to our reference images: at the top and on the left are the outputs of the network trained with pixel-wise MSE; at the bottom and on the right are the outputs of the same network trained using a perceptual loss with Deep Features and an Adversarial component.
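For readers less familiar with GAN training, the adversarial component introduced in this section can be sketched with the standard binary cross-entropy formulation. This is a generic illustration under that assumption, not the exact losses used in the repository; `d_real` and `d_fake` are the discriminator's predicted probabilities that real HR and generated SR images, respectively, are real.

```python
import numpy as np


def binary_cross_entropy(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy, with predictions clipped for numerical safety."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))))


def discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """The discriminator should output 1 for real HR images and 0 for SR images."""
    return (binary_cross_entropy(d_real, np.ones_like(d_real))
            + binary_cross_entropy(d_fake, np.zeros_like(d_fake)))


def generator_adversarial_loss(d_fake: np.ndarray) -> float:
    """The generator is rewarded when the discriminator labels its output as real."""
    return binary_cross_entropy(d_fake, np.ones_like(d_fake))
```

During training the generator's adversarial term would be added, with its own weight, to the pixel-wise and deep-feature terms, while the discriminator is updated separately with its own loss.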
Through the looking glass
Interestingly, despite having been trained on a dataset whose input images were preprocessed with a single level of compression, 50% in this case, the model is able to perform different levels of noise reduction based on the actual amount of noise present in the input image.
Arguably, it seems to have learned the distinct features and patterns of compression noise, which is precisely what we aimed for!
We can now have a look at the activation maps of the two networks we showcased here, the one trained with pixel-wise MSE and the one with Deep Features and Adversarial components, to get a glimpse of where the difference in learning might reside.
For brevity, I will only show 3 sample activation maps taken from the output of a few “significant” layers, because what we observe here is a general behavior across all activation maps. Also, which layers are “significant” for this particular observation comes from my personal experience, so I highly encourage you to open the Colab notebook provided in our GitHub repository and play around with it.
The following image shows, in each column pair, the first 3 activation maps of the chosen layers obtained by feeding the network the noisy flower picture shown previously: each pair shows on the left the activation maps of the pixel-wise MSE model and on the right those of the perceptual loss model.
As we can see, the outputs of the first RDBs (here showing the 1st and the 5th) are very similar across the two models. The deeper blocks, on the other hand, have activation maps that are less similar to the original image, and they are then recombined in the Global Feature Fusion (GFF) and Conv Layers in the upsampling block (UPN) into less noisy reconstructions of the original image. Possibly, the later activation maps distance themselves from the input image as a result of the higher abstraction encouraged by the new loss function. This “lower fidelity” might be the key component in the enhancement of detail as well as in noise reduction.
Using a more advanced loss function and training technique, we were able to get rid of the compression noise and retrieve fairly good-looking super-scaled images starting from a poor ground truth. Here I did not discuss what the Discriminator network looks like, what effect different VGG network layers have on the output, or how to train the Discriminator and run the overall GAN training. If you wish to understand more about this interesting technology, check out our GitHub repository, where we have prepared a few Colab notebooks and pre-trained weights to play around with, and where we describe in more detail how the training was carried out. At the end of this post, I’ve linked all the relevant literature, so you can dig even deeper into the topic. For any further questions, do not hesitate to drop me a line here, on LinkedIn or on GitHub.
Please let me know if you found this article useful (👏🏻) so others can find it too, and share it with your friends. You can follow me here on Medium (Francesco Cardinale) or on LinkedIn to stay up-to-date with my work. Thanks a lot for reading!
Links and resources
- GitHub repository: Image Super-Resolution
- Pip package: ISR
- Documentation: https://idealo.github.io/image-super-resolution/
- SRGAN: Photo-Realistic Single-Image Super-Resolution Using a Generative Adversarial Network (Ledig et al. 2017)
- On perceptual metrics: The Unreasonable Effectiveness of Deep Features as a Perceptual Metric (Zhang et al. 2018)
- RDN: Residual Dense Network for Image Super-Resolution (Zhang et al. 2018)
- ResNet: Deep Residual Learning for Image Recognition (He et al. 2015)