Zero Shot Super Resolution: Part 1

MissingLink Team
MissingLink Deep Learning Platform
9 min read · Jan 29, 2019

Super-resolution (SR) is a class of techniques that enhance the resolution of an image. These methods aim to obtain a high-resolution (HR) output from a low-resolution (LR) version. The objective in performing single image super-resolution (SISR) is to increase an image's size with a minimal drop in quality. The applications are numerous, ranging from medical imaging, compression, agricultural analysis and autonomous driving to satellite imagery, reconnaissance and more.

The field of super-resolution is going through a renaissance. Recent progress in deep learning models such as convolutional neural networks and generative adversarial networks has sparked a variety of new approaches and brought state-of-the-art results that the more classical, feature-engineering-based methods could not deliver.

In this three-part blog series, we will discuss the main methods in the field and take a deep dive into a special method called Zero Shot Super Resolution, which we will inspect and implement using Keras (with a TensorFlow backend) on the MissingLink deep learning platform.

Fig 1. Comparison of ZSSR performance vs. EDSR (Checkpoint Charlie)

Currently, there are two main approaches to SR in the world of deep learning.

The first approach: using a variety of convolutional neural networks, preferably with skip connections, and trying to minimize the L1 or L2 loss on the reconstruction of a high-resolution image from its low-resolution pair.

The second approach: Generative Adversarial Network, also known as GAN.

A GAN architecture can be trained to generate a distribution similar to that of a particular dataset. A GAN possesses two main parts: a generator and a discriminator.
The generator learns how to create the dataset distribution from random noise, and the discriminator learns how to distinguish real samples from the dataset from synthetic samples created by the generator. By training them together in an adversarial manner, each part improves iteratively, and the end result is a strong sample generator and a strong classifier for real vs. synthetic samples.

For example, by feeding a GAN the MNIST dataset, its generator learns to create handwritten digits and its discriminator learns to tell real and synthetic digits apart.

Fig 2. Illustration of a GAN for handwritten digits (MNIST)

Image Source: https://sthalles.github.io/intro-to-gans/

We take this architecture and minimize the GAN / adversarial loss.

The generator part of a GAN model for SR usually uses the L2 loss (also known as MSE, the Mean Squared Error) or the more modern perceptual loss as its reconstruction loss. The perceptual loss is also an MSE, but computed on the activations of a deep layer of a model pre-trained on ImageNet (usually VGG-16 or VGG-19). The pre-trained model gives us high-quality features without any time-consuming feature engineering: we simply compare the feature maps of the network output and of the HR image.
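As an illustration, here is a minimal Keras sketch of such a perceptual loss. The choice of VGG-19, the layer name block5_conv4 and the [0, 1] input range are our assumptions, not a prescription from any particular paper.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input

# Frozen VGG-19, truncated at a deep convolutional layer (illustrative choice).
vgg = VGG19(include_top=False, weights='imagenet')
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer('block5_conv4').output)
feature_extractor.trainable = False

def perceptual_loss(hr, sr):
    """MSE between deep VGG feature maps of the HR target and the SR output.

    hr and sr are assumed to be float tensors scaled to [0, 1].
    """
    hr_features = feature_extractor(preprocess_input(hr * 255.0))
    sr_features = feature_extractor(preprocess_input(sr * 255.0))
    return tf.reduce_mean(tf.square(hr_features - sr_features))
```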

All this boils down to the fact that most super-resolution systems have an L1 or L2 metric at their core. This makes sense, because the standard metric for image reconstruction, PSNR (Peak Signal-to-Noise Ratio), has the MSE (L2 loss) built in:

PSNR = 10 * log10((data_range ** 2) / mse), as defined in scikit-image.

data_range is the pixel range of our data type, usually 255 or 1.

A lower MSE means less error, and given the inverse relation between MSE and PSNR, this translates to a higher PSNR. Logically, a higher PSNR is good because it means the ratio of signal to noise is higher.
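A small NumPy sketch of the formula above makes the relation concrete; the function name and defaults are ours.

```python
import numpy as np

def psnr(hr, sr, data_range=1.0):
    """PSNR as quoted above; data_range is 1.0 for [0, 1] images, 255 for 8-bit."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical images: no noise at all
    return 10 * np.log10((data_range ** 2) / mse)
```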

Fig 3. Creation of a high-resolution image

Image Source: Fast and Accurate Image Super-Resolution with Deep Laplacian Pyramid Networks

In a recent paper by Technion researcher Yochai Blau, empirical evidence was presented showing that the two approaches are bound by an intrinsic tradeoff. Its effect is visible in the fact that the myriad SR methods in existence struggle to improve on distortion (noise) and perceptual (quality) metrics simultaneously. In simpler terms, the more real a super-resolved image appears, the more noise and artifacts of synthetic origin it tends to contain; and the less distortion and noise the super-resolved image has, the blurrier it will look.

Fig 4. The Perception-Distortion Tradeoff

The issue with current super-resolution (SR) methods is that most of them are supervised. This means the algorithm's parameters are learned by optimization over a paired dataset: minimizing a loss function over the reconstruction of low-resolution images into their high-resolution counterparts.

And while some attempts have been made to produce datasets captured by digital cameras with intrinsically different resolution characteristics, most datasets are produced by down-sampling images using interpolation (usually bicubic). This may be efficient and cost-effective, but it also means most datasets are not comprised of in-the-wild LR images. The implications run deep: synthetically created datasets do not contain real distracting artifacts and common low-grade imagery phenomena. As a result, SR models do not learn to combat defects stemming from optical or digital issues, but mostly the artifacts introduced by synthetically creating the lower-resolution (LR) images. Most troubling, the models may simply learn to reverse the down-sampling mechanism itself.
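For reference, here is a hedged sketch of that standard pair-creation procedure, bicubic down-sampling of a clean HR image with Pillow; the file name and scale factor are placeholders.

```python
from PIL import Image

def make_lr_hr_pair(path, scale=4):
    """Create an (LR, HR) pair by bicubic down-sampling, the common synthetic recipe."""
    hr = Image.open(path).convert('RGB')
    # Crop so both dimensions divide evenly by the scale factor.
    hr = hr.crop((0, 0, hr.width - hr.width % scale, hr.height - hr.height % scale))
    lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    return lr, hr

lr, hr = make_lr_hr_pair('example.png', scale=4)
```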

One paper from ECCV '18 that tries to tackle this faulty way of creating paired data deserves our attention; hopefully, we can discuss it in depth in a future article.

Fig 5. Low-quality image up-scaling.

In a paper by Assaf Shocher from the Weizmann Institute, we find a novel, surprisingly simple and elegant method for single image super-resolution. Instead of training large networks (EDSR has ~43M parameters) on large datasets for days or even weeks, they use a small (~100K trainable parameters) fully convolutional neural network (FCN) and train on the target image itself and its many augmentations to produce the final higher-resolution output. Data augmentations are produced on the fly: an LR-HR pair is created by first making a down-scaled copy of the original image, which is also cropped, flipped and rotated. Then, a specific predefined super-resolution ratio is emulated by blurring that augmentation, down-sampling it and up-sampling it back to its former size (equal to the size of its HR father). The result is a blurry and a non-blurry version of the same augmentation, both down-scaled from the original image by the same random factor, with the LR son blurred in a way that mimics the predetermined scaling.
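The following NumPy/Pillow sketch captures our reading of that pair-generation step (it is not the authors' code): a randomly down-scaled "HR father" is flipped and rotated, then its "LR son" is made by down-sampling and up-sampling it back.

```python
import random
import numpy as np
from PIL import Image

def make_father_son(img, sr_factor=2):
    """Return an (LR son, HR father) pair of augmentations from a PIL image."""
    # HR father: a copy of the original image, down-scaled by a random factor.
    scale = random.uniform(0.5, 1.0)
    father = img.resize((max(1, int(img.width * scale)),
                         max(1, int(img.height * scale))), Image.BICUBIC)
    # Random flip and 90-degree rotation augmentations.
    if random.random() < 0.5:
        father = father.transpose(Image.FLIP_LEFT_RIGHT)
    father = father.rotate(90 * random.randint(0, 3), expand=True)
    # LR son: blur the father by down-sampling by the SR factor and up-sampling back.
    son = father.resize((max(1, father.width // sr_factor),
                         max(1, father.height // sr_factor)), Image.BICUBIC)
    son = son.resize(father.size, Image.BICUBIC)
    return np.asarray(son) / 255.0, np.asarray(father) / 255.0
```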

Fig 6. Our version of ZSSR. An LR son image goes through the FCN and is finally added to its unchanged copy (Output = LR + FCN(LR)). We then calculate the L1 loss of the output with respect to the HR father.

These pairs are then fed into the FCN for training, with the objective of minimizing the L1 loss between the reconstruction of an LR image and its HR match. A simple architecture is used: a network 8 layers deep and 64 filters wide, with a ReLU activation on every layer but the last, which is linear. A skip connection from the input to the output is added, so we only need to learn the residual image between the blurry LR and its HR origin.
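A minimal Keras sketch of that architecture, reflecting our reading of the description above rather than the authors' exact implementation:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_zssr_model(channels=3, depth=8, filters=64):
    """8-layer, 64-filter residual FCN: Output = LR + FCN(LR), trained with L1 loss."""
    inp = layers.Input(shape=(None, None, channels))  # fully convolutional: any input size
    x = inp
    for _ in range(depth - 1):
        x = layers.Conv2D(filters, 3, strides=1, padding='same', activation='relu')(x)
    # Last layer is linear and predicts only the residual image.
    residual = layers.Conv2D(channels, 3, strides=1, padding='same', activation=None)(x)
    out = layers.Add()([inp, residual])  # skip connection from input to output
    model = Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='mae')  # 'mae' = L1 loss
    return model
```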

Fig 7. An illustration of the ZSSR algorithm.

The original image I is down-sampled by many different scaling factors. Each down-scaled copy is blurred by down-scaling and then up-scaling it. This gives us an LR-HR pair we can train on by comparing the net's output f(LR) to its matching HR.

Finally, we test (model.predict) on the original image to produce a super-resolution output.

In the cases where a ground-truth high-quality image exists, it can be compared to the net's output using the PSNR and SSIM metrics.
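With scikit-image, such a comparison might look like the snippet below; the random arrays are placeholders standing in for the ground truth and the net's output, and argument names vary slightly between library versions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholders for the ground-truth HR image and the net's output, floats in [0, 1].
hr_true = np.random.rand(256, 256, 3)
sr_output = np.clip(hr_true + np.random.normal(0, 0.05, hr_true.shape), 0, 1)

psnr_val = peak_signal_noise_ratio(hr_true, sr_output, data_range=1.0)
ssim_val = structural_similarity(hr_true, sr_output, data_range=1.0, channel_axis=-1)
print(f'PSNR: {psnr_val:.2f} dB  SSIM: {ssim_val:.4f}')
```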

A very unusual fact about this architecture is that we train without a validation set and test only on the original sample itself. While that may be a bit counterintuitive at first, it fits our goals and obviously reduces runtime.

The neural network learns the rescaling (or mapping) function

F: LR → HR, with Size(LR) = Size(HR)

Another important property of fully convolutional neural networks (FCNs) is that they accept inputs of varying sizes; hence, each set of samples can have a different size. By carefully picking the "right" hyperparameters, a 3×3 kernel, stride 1 and 'same' padding, we get an output size that is exactly the same as the input size, enabling us to compute their relative error. With this architecture we essentially only improve on the bicubic interpolation by refining its trivial output, while the resizing factor itself is a predefined parameter.
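Using the hypothetical build_zssr_model sketch from above, a quick check confirms that this hyperparameter choice preserves the spatial dimensions for any input size:

```python
import numpy as np

model = build_zssr_model()
for shape in [(1, 100, 150, 3), (1, 37, 53, 3)]:
    x = np.random.rand(*shape).astype('float32')
    print(model.predict(x, verbose=0).shape)  # output shape matches the input shape
```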

The scientific explanation for the success of this architecture is the existence of internal features that repeat across scales within an image. It was shown in previous papers from the same group at the Weizmann Institute that small image patches of size 5×5 and 7×7 recur within an image in different locations, both at their original size and across scales. In a separate work by the same group, it was shown that single images have a lower internal entropy than large image datasets. This agrees with the repetition of internal image patches and the principle of self-similarity. A neural network specific to the image in question is trained and then tested on that very image.

Fig 8. CSI Enhance SOTA.

The neural network is able to capture non-local recurrence of objects in the image despite its small receptive field; this stems from the image-specific training process. Learning is accelerated further by making it independent of the image size, which is achieved by incorporating a cropping mechanism: only a fixed-size crop is taken from each image pair (once the image passes a size threshold). Keep in mind that this hyperparameter can have a strong impact on runtime and performance.
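One way such a crop step might look (our assumption of the mechanics, with a placeholder crop size):

```python
import random

def random_crop_pair(son, father, crop_size=128):
    """Take the same random fixed-size crop from both NumPy arrays of an (LR son, HR father) pair."""
    h, w = father.shape[:2]
    if h <= crop_size or w <= crop_size:
        return son, father  # below the threshold: train on the full pair
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    return (son[top:top + crop_size, left:left + crop_size],
            father[top:top + crop_size, left:left + crop_size])
```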

Image-specific neural networks might sound like a cumbersome solution, but the reality is that while supervised SISR methods can achieve impressive results on images degraded by the same set of parameters they were trained on, their performance diminishes greatly on real-life low-quality images. Learning how to combat real noise and artifacts would require creating ensembles of very deep neural networks, each trained for days or even weeks and specialized for a specific kind of fault.

Zero Shot Super Resolution excels exactly where large supervised SISR models fail: on in-the-wild low-resolution images that suffer from noise, compression artifacts and so on. Parameters such as the downscaling kernel, the scaling factor, and the level and type of additive noise applied to the training augmentations are open for selection, enabling a better fit to the characteristics of the image (adding noise helps improve results on low-quality LR images). Selecting these parameters requires re-training, which makes such tuning unsuitable for larger architectures. The ZSSR paper shows impressive results on naturally degraded images.

We have reviewed the field of single image super-resolution, discussing the main branches of architectures and the intrinsic tradeoff between them. We then took a deep dive into an unsupervised method called Zero Shot Super Resolution and discussed its strengths.

In the next chapters, we will cover the project framework, implementation, and development, as well as the Python packages we will use in our application.

In future posts, we will demonstrate an implementation of zero shot super resolution with the MissingLink deep learning platform.

Sign up now for a free account.

Originally published at missinglink.ai.
