A Deep Learning based magnifying glass

At idealo.de we trained a state-of-the-art Deep Convolutional Neural Network to 🔎super-scale🔎 small images.

Super scaling a low-resolution butterfly.

When shopping online, we would like to have an experience that is as close to a real, hands-on experience as possible. Large and crisp images enhance a customer’s e-shopping journey by both allowing a more detailed inspection and improving the overall feeling of a purchase. At idealo.de (the leading price comparison website in Europe and one of the largest portals in the German e-commerce market) we know this very well and we want to provide a platform that is user-friendly and appealing. In addition to our other image quality project, where we harness the power of Deep Learning to assess the aesthetic and technical quality of millions of hotel images, we decided that small, uninformative, often pretty ugly product images are a no-go for our product catalog.

Not all shops can provide us with good, high-quality images for their products: they are often too small, too low-resolution (LR), or simply of too low quality to properly fill a product catalog page. To address this problem we implemented and trained a state-of-the-art convolutional neural network (CNN) based on the 2018 research paper Residual Dense Network for Image Super-Resolution (Zhang et al.). The goal is conceptually very simple: take a small image and, just like with a real magnifying glass, get a bigger version of it.

In this article, we will show you the steps that we took to achieve our goal, the results that we obtained so far, as well as a couple of interesting technical details and things we learned along the way.

Check out our implementation on GitHub for more geeky details.


Overview

As in most deep learning projects, there are four main steps:

  • review what great researchers have done on the topic;
  • implement one or more of the solutions and possibly compare pre-trained versions;
  • get some data, train and test your model;
  • improve the model and the data to get results tailored to your use case.

Here specifically we will:

  1. introduce our training setup and how we evaluated the results;
  2. have a look at the early results and understand what went well and what could be improved;
  3. point at some of the directions we intend to explore for the continuation of this project.

Training

Unlike “standard” supervised deep learning tasks, for this problem the output of the CNN is not just a class label or a score, but an entire image. This means that both the training process and the evaluation of the results are slightly different: the “labels” are the original high-resolution images, and for evaluation we need a way of measuring the “quality” of the scaled output image. I will provide more details on the advantages and complications that this sort of task implies in a later section.

ISR training flow.

Loss

The loss function is how we tell the network how good of a job it did. There are many ways of evaluating it, and the nature of the problem leaves room for creativity (smart people for instance have used high level features and an adversarial network). For the first iteration, we stuck with the standard: a pixel-wise mean squared error (MSE) between the Super Resolution (SR) output of the network and the high-resolution (HR) original.
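For illustration, here is a minimal numpy sketch of this pixel-wise MSE, assuming both images are arrays of the same shape with values scaled to [0, 1] (in Keras this simply corresponds to compiling with loss='mse'):

```python
import numpy as np

def pixel_mse(hr, sr):
    """Pixel-wise mean squared error between the original HR image
    and the super-resolved (SR) network output.

    Both images are assumed to be numpy arrays of identical shape,
    with pixel values scaled to [0, 1].
    """
    return np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
```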

Evaluation

A metric for evaluating the quality of the resulting image is the Peak Signal-to-Noise Ratio (PSNR), which is derived from the MSE between two images. We chose this metric so that we could compare our results with the values reported in other research papers, as it is the most commonly used.
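A minimal numpy sketch of PSNR computed from the MSE, assuming pixel values in [0, max_val]:

```python
import numpy as np

def psnr(hr, sr, max_val=1.0):
    """Peak Signal-to-Noise Ratio (in dB) between the HR original and the SR output.

    Assumes pixel values in [0, max_val]. Higher is better;
    identical images give an infinite PSNR.
    """
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')
    return 10.0 * np.log10(max_val ** 2 / mse)
```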

Setup

We trained on a p2.xlarge AWS EC2 instance until the validation loss converged, which took around 90 epochs (about 24 hours). We kept track of the loss and PSNR values for both the training and the validation set using Tensorboard.

Tensorboard graphs of the 90 training epochs. Top-left: the per-epoch training loss used to update the network's weights. Top-right: the loss on the out-of-training set, used to keep track of generalization performance. Bottom left and right: the PSNR values for the training and validation sets, respectively.

Results

Let's now have a look at some of the resulting outputs to see how we did and to gain some insight into how the results could be improved.

Below, on the left, are the full images from the validation set with a few patches highlighted. In the center are the corresponding patches taken from the output of our CNN, and on the right the same patches taken from images scaled up with the standard process at idealo (which uses GIMP’s image scaling feature).

LR image (left), reconstructed SR (center), GIMP baseline scaling (right). Source: DIV2K dataset.

The results are definitely not perfect: for instance, it is easy to spot unwanted artifact noise around the butterfly's antennae. However, details such as the hair around the neck and on the back of the butterfly, and the contours of the spots on the wings, look noticeably crisper in the network output than in the baseline.

Understanding the results

To understand where our model generalizes well and where it does not, we extracted the patches from the validation set with the highest and the lowest PSNR values.

Unsurprisingly, the best-performing patches are the ones with large flat areas, while more complex patterns are harder to reproduce accurately. We might then want to focus on these more complex areas for both training and evaluation of the results.

The same point is highlighted by the heat map below, which represents the error between the original HR image and the SR output of the network: darker colors correspond to higher pixel-wise mean squared error, while lighter colors correspond to lower error, i.e. better results.

Heatmap for pixel-wise HR-SR error. Darker colors mean higher error and lighter lower error, or better results.
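A map like this can be produced with a few lines of numpy and matplotlib. The sketch below is a simplified illustration (not our exact plotting code), assuming hr and sr are same-shaped float arrays in [0, 1]:

```python
import numpy as np
import matplotlib.pyplot as plt

def error_heatmap(hr, sr):
    """Plot the per-pixel squared error between the HR original and the SR output.

    hr and sr are assumed to be float arrays of shape (H, W, 3) with values
    in [0, 1]. The error is averaged over the colour channels so it can be
    shown as a single-channel map.
    """
    err = np.mean((hr - sr) ** 2, axis=-1)
    # 'gray_r' maps higher error to darker pixels, matching the figure's convention
    plt.imshow(err, cmap='gray_r')
    plt.colorbar(label='pixel-wise squared error')
    plt.show()
```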

We can see that areas with more intricate patterns correspond to higher errors, but also that intuitively “simpler” transition areas (cloud to sky, for instance) are fairly dark. This is something we might improve upon, as it is relevant for idealo’s catalog use case.

A few words on non-standard ground truth in deep learning tasks

Unlike more common supervised deep learning tasks where the labels are either categorical or numerical, the ground truth that we use to evaluate the output of the network is the original HR image.

This is both good and bad news. Bad news first: popular deep learning frameworks like Keras do not have pre-made solutions for training (such as generators) that can be applied in this setting. They typically rely on fetching training/evaluation labels from a 1-dimensional array or file, or derive them from the folder structure, so there will be some extra coding involved (is this really bad news?). The (very) good news is that there is no need to go to great lengths to get the labels: given a decent pool of HR images, we can simply downscale them to obtain our LR training data and use the original HR images to evaluate the loss.
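As a simplified illustration (not our exact pipeline), assuming Pillow and a 2x scale factor, the LR/HR pairs can be created along these lines:

```python
from PIL import Image

def make_lr_hr_pair(hr_path, scale=2):
    """Create a low-resolution copy of a high-resolution image.

    The HR image acts as the ground truth; the bicubically downscaled
    copy is what the network sees as input.
    """
    hr = Image.open(hr_path)
    lr_size = (hr.width // scale, hr.height // scale)
    lr = hr.resize(lr_size, resample=Image.BICUBIC)
    return lr, hr
```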

Normally, when training a neural network on image data, the training batches are created by randomly selecting a number of images from the training set. These are then rescaled to a smaller size, typically around 100x100 pixels, augmented on the fly with random transformations, and fed to the network. In this context, feeding the network with whole images is neither necessary nor desirable. This is mainly because we cannot rescale the images down to, say, a little 100x100 training point: we want to scale them up, after all. At the same time, we cannot afford to train on large images (such as 500x600), as they would take a very long time to process. Instead, random patches of very small size (down to 16x16) can be extracted from the whole picture, giving us many more data points to play with, as each image can be the source of hundreds of different patches.

The reason why we can afford to take very small portions of the images is that we are not classifying a bunch of patterns into categories (legs + tails + whiskers + dead mouse =? cat), hence the absence of the usual dense layers at the end of the architecture. We only need the network to build an abstract representation of those patterns and learn how to scale them up (and recombine them so that the image makes sense). This abstract representation is built by the convolutional layers which, together with the upscaling layer, are the only types of layers in this network.

On a related note, the fully convolutional architecture makes this network independent of the input size. This means that, unlike many CNNs used for classification, you can feed the network images of any size: whatever the initial size is, the network will output an image at twice that size.
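This is not the RDN itself, but a minimal Keras sketch of the property described above: with Input(shape=(None, None, 3)) and only convolutional and upsampling layers, the model accepts images of any size and returns one twice as large.

```python
from tensorflow import keras

# A toy fully convolutional 2x upscaler, NOT the actual RDN architecture.
inputs = keras.layers.Input(shape=(None, None, 3))            # any height/width
x = keras.layers.Conv2D(64, 3, padding='same', activation='relu')(inputs)
x = keras.layers.UpSampling2D(size=2)(x)                      # double the spatial size
outputs = keras.layers.Conv2D(3, 3, padding='same')(x)        # back to RGB
model = keras.Model(inputs, outputs)

model.compile(optimizer='adam', loss='mse')                   # pixel-wise MSE loss
```

The actual RDN performs the upscaling with a sub-pixel convolution rather than simple upsampling, but the size-independence argument is the same.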

For more details about the RDN check the paper linked in the introduction and at the bottom of this article.

On the flip side, one extra decision is needed: how to extract these patches from the images. We boiled it down to: extract n random images from the dataset, then extract (and augment) p random patches from each of them. We tried a few ways of doing this, summarized in the picture below.

Pictorial description of the different feeding methods we tried.

At first we created an entire dataset of patches extracted along a uniform grid. At training time we would randomly select batch_size of them, augment them on the fly and feed them to the network. This approach had two downsides: a VERY large dataset that needed to be stored statically, which is not ideal if you want to use a cloud service for training (moving and extracting the dataset is a fairly time-consuming operation), and a deterministically defined dataset, which might not be optimal. An alternative approach we tried was to randomly select batch_size whole images and extract a single patch from each. The bottleneck here turned out to be reading from disk, which drastically slowed down training (from 15 minutes to an entire hour per epoch with our setup).

We finally converged on randomly selecting a single whole image from the original dataset and extracting batch_size patches from it on the fly. This let us store only the original dataset while keeping training fast.
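Here is a simplified sketch of that final feeding strategy, assuming a 2x scale factor and a list of pre-loaded (LR, HR) image pairs (the function and parameter names are placeholders, and augmentation is omitted for brevity):

```python
import random
import numpy as np

def random_patch_batch(image_pairs, batch_size=16, patch_size=16, scale=2):
    """Pick one random image and cut batch_size aligned LR/HR patch pairs.

    image_pairs: list of (lr, hr) numpy arrays with values in [0, 1],
    where hr is exactly `scale` times larger than lr on each side
    (see the down-scaling sketch earlier).
    """
    lr_img, hr_img = random.choice(image_pairs)
    lr_batch, hr_batch = [], []
    h, w = lr_img.shape[:2]
    for _ in range(batch_size):
        # random top-left corner in LR coordinates
        top = random.randint(0, h - patch_size)
        left = random.randint(0, w - patch_size)
        lr_batch.append(lr_img[top:top + patch_size, left:left + patch_size])
        # the matching HR patch sits at scale-times the LR coordinates
        hr_batch.append(
            hr_img[top * scale:(top + patch_size) * scale,
                   left * scale:(left + patch_size) * scale]
        )
    return np.stack(lr_batch), np.stack(hr_batch)
```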

Going further

This was the first step towards magnifying idealo’s product catalog.

Below is the network output for a low quality, low resolution image from our product catalog.

Low resolution image of a sandal.
Super-scaled image of a sandal.

There is very noticeable noise where the image transitions from the foreground object to the flat background, and the text is also slightly distorted. These are the low-hanging fruit that we plan to improve upon.

The next step will be training the network on our own product image dataset. Hopefully this will help with text and background/object contrast, which are not heavily present in the natural images of the DIV2K dataset. Another step on the wish list is incorporating noise reduction by adding random noise at down-scaling time, but this is further down the line.

Please let me know if you found this article useful (👏🏻) so others can find it too, and share it with your friends. You can follow me here on Medium (Francesco Cardinale) to stay up-to-date with my work. Thanks a lot for reading!

Links

Github: Image Super Resolution

Paper: Residual Dense Network for Image Super-Resolution (Zhang et al. 2018)

Dataset: DIVerse 2K resolution high quality images