EXPEDIA GROUP TECHNOLOGY — DATA

Image Super-Resolution Using Attentional GAN at Expedia Group

An efficient machine learning solution for upscaling images while preserving image quality

HARSH PATHAK
Expedia Group Technology

--

Authors: Harsh Pathak, Shervin Minaee, Xinxin Li, Thomas Crook

Seattle Skyline by Bryce Edwards on Flickr

Introduction

Travel websites thrive on high-quality, engaging imagery. Images of hotels are a powerful tool for creating an exceptional customer experience: they give the viewer important information about the property and inspire the imagination. Expedia receives millions of images from our users and hotel partners, and we want to make sure the images displayed to our customers are of high quality and legally compliant. Naturally, not all of the images we receive meet these minimum requirements.

When we start with a small-size image, we can either add black borders to compensate for the empty space in a gallery display or stretch the image to fit using standard upsizing. The latter often results in noticeable pixelation and artifacts, and a pixelated image does not create the exceptional experience we want our customers to have. In this work, we investigate solutions to the problem of generating high-quality images from small-size images in a commercial setting. We developed a machine learning model that upscales images while preserving as much of their quality as possible, allowing for a much larger display.

Challenges

We did an extensive literature survey and found numerous interesting approaches, but most prior work focuses on small image sizes, meaning it upscales a tiny image (say ~100px) by a factor of four. Our business requires generating images at higher resolutions than that: we were tasked with producing images of at least 2000px, four times larger than the sizes typically reported in the literature. Such large images are popularly known as high-definition images; in this blog we will refer to them as high-resolution (HR) images. A large portion of the older images in our inventory do not meet this minimum resolution criterion, so we need to upscale them. This poses a challenge: state-of-the-art models that were validated on tiny images (~100px) have not been shown to handle larger images. In this resolution space (2000px), avoiding even minor pixelation artifacts is extremely difficult, because when images are scaled up, so are the pixelation artifacts. In addition, large images are more susceptible to the object consistency problem due to the larger number of pixels and the long-range dependencies across different regions of the image. These challenges encouraged us to do a deep dive into the super-resolution literature and propose an efficient solution.

Data Source

Single Image Super-Resolution (SISR) is a well-researched problem with broad commercial relevance. Classical super-resolution techniques (bilinear, bicubic, etc.) cannot produce perceptually appealing images in many cases, so developing a deep learning model for SISR appeared to be a more generic and robust solution. We studied a few deep learning approaches that could potentially solve this problem. In order to evaluate these ideas, we needed thousands of high- and low-resolution image pairs for training the model.

We had a large repository of images with diverse scenes, objects, and localities from all over the world. From this repository, we needed to build pairs of high-resolution and low-resolution (LR) images in order to train our model.

For our first dataset (training), we used 20k HR (~2000px) images and synthetically created the corresponding LR pairs by down-sampling them; a minimal sketch of this pairing step is shown below. For our second dataset (testing), we had more than 3,000 images of varying resolutions, and we needed to achieve a minimum score on this test set in order to get sign-off. Note that our test set consisted of original small-size images from hotel partners, not synthetically generated ones.
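To make this concrete, below is a minimal sketch of how such (LR, HR) pairs could be built with a tf.data pipeline, assuming a 4x scale factor, random 512px HR crops, and bicubic down-sampling. The paths and parameter values are illustrative, not our exact production pipeline.

```python
# Illustrative sketch: building (LR, HR) training pairs by bicubic
# down-sampling, assuming a 4x super-resolution factor. Paths, crop size,
# and the bicubic filter are placeholders, not the exact pipeline we used.
import tensorflow as tf

SCALE = 4          # upscaling factor the model will learn to invert
HR_CROP = 512      # random HR crop size used for training patches

def make_pair(image_path):
    """Load one HR image and derive its synthetic LR counterpart."""
    raw = tf.io.read_file(image_path)
    hr = tf.image.decode_jpeg(raw, channels=3)
    hr = tf.image.random_crop(hr, [HR_CROP, HR_CROP, 3])
    hr = tf.image.convert_image_dtype(hr, tf.float32)   # scale to [0, 1]
    lr = tf.image.resize(hr, [HR_CROP // SCALE, HR_CROP // SCALE],
                         method="bicubic")
    return lr, hr

dataset = (tf.data.Dataset.list_files("hr_images/*.jpg")
           .map(make_pair, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(16)
           .prefetch(tf.data.AUTOTUNE))
```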

Model Architecture

Earlier super-resolution (SR) models were mainly based on sparse coding. More recently, deep learning approaches have produced favorable results in many computer vision tasks. There are many interesting deep learning frameworks for SR, including Convolutional Neural Network (CNN) based models as well as Generative Adversarial Network (GAN) based models, but here we focus on the two of particular relevance to our work: Super-Resolution GAN (SRGAN²) and Self-Attention GAN (SAGAN³).

Fine-tuning on a pre-trained SRGAN:

As a baseline, we applied a pre-trained SRGAN model⁵ to our test set of original small images. This model was trained on the RAISE dataset, which comprises 8,156 HR images ranging from roughly 2500 to 4000px, so we had reason to believe it would be more compatible with our target resolution space than models trained on smaller images from other publicly available datasets such as ImageNet. However, we observed plenty of pixelation and blue patches in some regions of the output images. Hence we decided to fine-tune this model with Expedia's high-quality images. After only 10 epochs of training on 11.5k Expedia images, we saw significant improvements and almost all of the blue patches were gone. Given this result, we believe the earlier artifacts of the pre-trained model were due to domain differences between the RAISE dataset and Expedia's images. A minimal sketch of such a fine-tuning loop is shown below.
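The following is an illustrative sketch only: restore a pre-trained SRGAN-style generator and discriminator from a checkpoint and continue adversarial training on domain images. The tiny stand-in networks, checkpoint path, learning rates, and loss weights are assumptions for the example, and the loss shown is a simplified pixel-wise content loss plus an adversarial term rather than the full SRGAN perceptual loss.

```python
# Minimal, illustrative fine-tuning loop for a pre-trained GAN.
import tensorflow as tf

def tiny_generator():
    # Stand-in for the SRGAN generator (the real one uses residual
    # blocks and pixel-shuffle upsampling).
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
        tf.keras.layers.UpSampling2D(4),
        tf.keras.layers.Conv2D(3, 3, padding="same"),
    ])

def tiny_discriminator():
    # Stand-in for the SRGAN discriminator.
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1),
    ])

generator, discriminator = tiny_generator(), tiny_discriminator()

# Restore pre-trained weights before fine-tuning (path is illustrative).
ckpt = tf.train.Checkpoint(generator=generator, discriminator=discriminator)
ckpt.restore("pretrained_srgan/ckpt-100").expect_partial()

g_opt, d_opt = tf.keras.optimizers.Adam(1e-5), tf.keras.optimizers.Adam(1e-5)
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(lr, hr):
    with tf.GradientTape() as gt, tf.GradientTape() as dt:
        sr = generator(lr, training=True)
        real = discriminator(hr, training=True)
        fake = discriminator(sr, training=True)
        # Pixel-wise content loss plus a small adversarial term.
        g_loss = tf.reduce_mean(tf.square(hr - sr)) + 1e-3 * bce(tf.ones_like(fake), fake)
        d_loss = bce(tf.ones_like(real), real) + bce(tf.zeros_like(fake), fake)
    g_opt.apply_gradients(zip(gt.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    d_opt.apply_gradients(zip(dt.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    return g_loss, d_loss
```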

We further analyzed a few thousand images manually and observed ringing artifacts along structures with long-range dependencies, such as walls, desks, and pool edges. We needed to extend the model to address the object inconsistency³ we saw with the fine-tuning approach, so we added attention over the salient parts of the image.

Self-Attention is expensive to fit in GPU memory for large-size images:

The idea of including an attentional component in the SR task is to capture long-range, multi-level dependencies across image regions that are far apart and outside each other's convolutional receptive fields. However, the amount of memory required to store the correlation matrix (i.e., attention map) of SAGAN's self-attention layer is prohibitively large for large-scale images¹. For instance, a 500×500px input flattens to 250,000 spatial positions, so the correlation matrix has 250,000 × 250,000 entries (roughly 62.5 billion values, or about 250 GB in single precision), which is far too costly to hold in memory.

To address this memory issue, we came up with the idea of flexible self-attention (FSA), which essentially uses pooling and un-pooling to obtain a smaller attention map. Our FSA layer adds attention to the model without exploding memory for large-scale images. We wrap the SAGAN self-attention layer with max-pooling and then resize the output to match the shape of the input, as shown in Figure 1 (a minimal sketch follows the figure). Since the input and output feature maps have the same size, the FSA layer can be inserted between any two convolutional layers. This wrapping reduces the size of the attention map, enabling us to apply attention to large images on GPUs such as the NVIDIA Tesla K80.

Figure 1: Flexible Self Attention Layer [1]
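Below is a sketch of the FSA idea in the figure: down-sample the feature map with max-pooling, apply a SAGAN-style self-attention block on the smaller map, and resize the result back to the input shape so the layer can be dropped between any two convolutional layers. The pooling factor and the reduced query/key dimension are illustrative choices, not the exact values from the paper.

```python
import tensorflow as tf

class FlexibleSelfAttention(tf.keras.layers.Layer):
    """Sketch of the FSA layer: pool -> self-attention -> resize back."""

    def __init__(self, channels, pool_size=4):
        super().__init__()
        self.pool = tf.keras.layers.MaxPool2D(pool_size)
        # 1x1 convolutions producing query, key, and value projections,
        # as in SAGAN's self-attention block.
        self.f = tf.keras.layers.Conv2D(channels // 8, 1)
        self.g = tf.keras.layers.Conv2D(channels // 8, 1)
        self.h = tf.keras.layers.Conv2D(channels, 1)
        # Learnable scale on the attention branch (the "learnable sum").
        self.gamma = tf.Variable(0.0, trainable=True)

    def call(self, x):
        p = self.pool(x)                      # attend on a smaller map
        shape = tf.shape(p)
        n, hw = shape[0], shape[1] * shape[2]
        q = tf.reshape(self.f(p), [n, hw, -1])
        k = tf.reshape(self.g(p), [n, hw, -1])
        v = tf.reshape(self.h(p), [n, hw, -1])
        attn = tf.nn.softmax(tf.matmul(q, k, transpose_b=True), axis=-1)
        o = tf.reshape(tf.matmul(attn, v), [n, shape[1], shape[2], -1])
        o = tf.image.resize(o, tf.shape(x)[1:3])   # "un-pool" to input size
        # Residual connection keeps input and output shapes identical.
        return self.gamma * o + x
```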

We then trained our model with 20k images and observed a significant improvement in the SSIM (structural similarity) score on our validation set. This provided a strong signal that adding attention to the model improves the structural consistency of the output images, which can also be seen in Figures 3 and 4.

Proposed Attention Model:

The A-SRGAN architecture extends SRGAN² with a Flexible Self-Attention (FSA) layer inspired by SAGAN³. Figure 2 shows our model architecture; a compact sketch of how these pieces fit together follows the figure. Note that the Learnable Sum operation refers to the weighted skip connection from SAGAN. In each layer, the weights are normalized using spectral normalization⁴. The generator and discriminator networks of A-SRGAN are shown with their corresponding kernel size (k), number of feature maps (n), and stride (s).

Figure 2: Proposed A-SRGAN Model [1]
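As a rough illustration of how these pieces could fit together, the sketch below assembles spectrally normalized convolutions (here via the TensorFlow Addons wrapper), SRGAN-style residual blocks, the FSA layer from the earlier sketch, and pixel-shuffle upsampling into a 4x generator. The layer counts, kernel sizes, and filter widths are illustrative rather than the exact A-SRGAN configuration.

```python
import tensorflow as tf
import tensorflow_addons as tfa

def sn_conv(filters, kernel_size, strides=1):
    # Spectral normalization keeps the weights well-conditioned, as in SAGAN.
    return tfa.layers.SpectralNormalization(
        tf.keras.layers.Conv2D(filters, kernel_size, strides=strides, padding="same"))

def residual_block(x, filters=64):
    y = sn_conv(filters, 3)(x)
    y = tf.keras.layers.PReLU(shared_axes=[1, 2])(y)
    y = sn_conv(filters, 3)(y)
    return tf.keras.layers.Add()([x, y])

def build_a_srgan_generator(num_blocks=8, filters=64):
    lr = tf.keras.Input(shape=(None, None, 3))
    x = sn_conv(filters, 9)(lr)
    x = tf.keras.layers.PReLU(shared_axes=[1, 2])(x)
    skip = x
    for _ in range(num_blocks):
        x = residual_block(x, filters)
    x = FlexibleSelfAttention(filters)(x)       # FSA layer from the sketch above
    x = tf.keras.layers.Add()([x, skip])
    for _ in range(2):                           # two 2x pixel-shuffle stages -> 4x
        x = sn_conv(filters * 4, 3)(x)
        x = tf.keras.layers.Lambda(lambda t: tf.nn.depth_to_space(t, 2))(x)
        x = tf.keras.layers.PReLU(shared_axes=[1, 2])(x)
    sr = sn_conv(3, 9)(x)
    return tf.keras.Model(lr, sr)
```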

Model Evaluation

Qualitative Results:

The high-resolution images generated by our model for two examples are provided below. We zoomed in on a small crop of each image, since the full images are at least HD; you need to zoom in 4–6x on a computer screen to see how the model is performing.

Figure 3: Zoomed-in model outputs. Image ref [1]

Below we also show a comparison between our model's output and the output of the fine-tuned SRGAN model on a sample image. The fine-tuned SRGAN output still has some artifacts around text and roof areas, while our model's output has significantly fewer artifacts and better structural consistency.

Figure 4: From left to right, (a) the super-resolution result of the fine-tuned SRGAN model on a sample hotel image, (b) the result of the proposed attentional GAN model, and (c) the original HR image. As can be seen, our proposed model does a better job around text and regions with many edges. Image ref [1]

Quantitative Results:

There are various ways to quantitatively evaluate the performance of a SISR model. One popular metric is the Peak Signal-to-Noise Ratio, PSNR = 10·log₁₀(MAX²/MSE), where MAX is the maximum possible pixel value and MSE is the mean squared error between the output and target HR images; a lower MSE therefore gives a higher PSNR. Another metric is SSIM (structural similarity), which is considered to correlate better with human visual experience. But neither fully matches human judgment. In many studies, the Mean Opinion Score (MOS) is used as the primary evaluation metric for a SISR model; MOS is the average score given by human raters to model-generated images.
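For reference, both metrics are available out of the box in TensorFlow. The snippet below assumes images scaled to [0, 1] and is only meant to show how the scores discussed in this section can be computed for a single (SR, HR) pair.

```python
# Quick sketch of computing PSNR and SSIM with TensorFlow's built-in
# metrics; max_val=1.0 assumes float images in [0, 1].
import tensorflow as tf

def evaluate_pair(sr, hr):
    """Both inputs are float32 tensors of shape (height, width, 3)."""
    psnr = tf.image.psnr(sr, hr, max_val=1.0)
    ssim = tf.image.ssim(sr, hr, max_val=1.0)
    return float(psnr), float(ssim)
```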

For MOS we sought help from professional media experts on the Expedia content team. They did a thorough independent evaluation, zooming in 4x on multiple portions of each image, and provided a binary accept/reject score for every image. This team is well trained to catch minor flaws and artifacts. In Table 1, we report the MOS for images in different resolution ranges. As we can see, most of the generated images were accepted for publication on Expedia's website. The MOS is slightly lower for images with smaller input resolutions; our error analysis showed this is due to heavy pixelation and blurriness in those input images.

Table 1. The success rate of the proposed SISR model on our test sets.

Figure 5 shows the distribution of accepted (blue) and rejected (orange) images as a function of their PSNR and SSIM scores for the first test set above (images with resolution 350–650px). We can clearly see that neither PSNR (left) nor SSIM (right) provides a clear signal of the visual quality of an image. It may be surprising that for PSNR above 33, MOS is not well correlated with PSNR; after some investigation we found this is because both the input and output images were blurry in that PSNR/SSIM range.

Figure 5: The distribution of accepted images (blue) and rejected images (orange). We can clearly see that neither PSNR (left) nor SSIM (right) provides a clear signal of the visual quality of an image.

Conclusion

In this article, we discussed an end-to-end SISR model, and we have a few takeaways. First, pre-trained models alone did not yield satisfactory results; a SISR model fine-tuned on domain-specific images generated the highest-quality results. Second, we found an efficient method for training a super-resolution model on images with input resolutions of 500px or more, and we evaluated this method on our test datasets; the generated output images (2000px or 4K) were accepted by expert human evaluators with high confidence. Lastly, PSNR and SSIM are useful indicators for evaluating SISR models, but we think MOS is the most appropriate measure for building confidence and making a strong business case.

For more details on this project, please see the post-print of our paper, "Efficient Super Resolution For Large-Scale Images Using Attentional GAN", published at IEEE Big Data 2018 (https://arxiv.org/abs/1812.04821). Recently, this work was empirically evaluated alongside other state-of-the-art super-resolution methods in this paper⁶.

In this blog we have shared some details about the project and tried to provide a short, intuitive tutorial; please see the appendices below for more details.

Acknowledgment

This work was done in collaboration with several teams at Expedia including data science, content, UGC, and destination. We would like to thank Glenn Crowe for constructive feedback on this blog. Also, we would like to thank Xinxin Li, Shervin Minaee, Brooke Cowan, Thomas Crook, Thomas Mulc, Peter Barszczewski, Jesse Farmer, Gayatri Diwan, Toufik Bdiri, Etienne B-Dury, and many others for lots of valuable comments/inputs during this project. Finally, we thank Cliff DesPeaux, Zach Kuntz, and Maj Askew for their constant support of this project.

References

  1. Pathak, Harsh, et al. "Efficient Super Resolution For Large-Scale Images Using Attentional GAN", IEEE Big Data, 2018. (https://arxiv.org/abs/1812.04821)
  2. Ledig, Christian, et al. "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network", CVPR, 2017.
  3. Zhang, Han, et al. "Self-Attention Generative Adversarial Networks." arXiv preprint arXiv:1805.08318, 2018.
  4. Miyato, Takeru, et al. "Spectral Normalization for Generative Adversarial Networks." arXiv preprint arXiv:1802.05957, 2018.
  5. https://github.com/brade31919/SRGAN-tensorflow
  6. Fang, Chaowei, and Guanbin Li. "Self-Enhanced Convolutional Network for Facial Video Hallucination." IEEE Transactions on Image Processing.

Initial photo by Bryce Edwards on Flickr.

Appendix A: Intuitive details on how SR models are constructed

In this section, we attempt to provide an intuitive overview of the past 5–6 years of research on super-resolution using deep learning. The recent evolution of deep-learning-based super-resolution approaches is shown below.

Figure 6: Evolution diagram of various Super-resolution approaches using deep learning. Image ref

In the following figures, we show the similarities between different models and how they have improved in recent years.

Figure 7: Block diagram of various Super-resolution approaches using deep learning. Image ref

Appendix B: Distributed Training of our model

With recent advances in the Estimator API, distributing training across multiple GPUs has become very easy. Here, however, we show how we distributed the training without the Estimator API; a modern sketch of the same pattern follows Figure 8.

Figure 8: Multi-GPU GAN flow. The green blocks on the left represent GPU:0–7, and GPU:0 is shown on the right. G refers to the generator network and D to the discriminator network of the GAN. The variables of both G and D are stored on GPU:0 and shared across all GPUs. When a batch of images is processed, the gradients are calculated independently on all GPUs and then applied synchronously on GPU:0. Image ref [1]
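The synchronous tower pattern in Figure 8 (variables shared from GPU:0, per-GPU gradients, synchronous update) is essentially what tf.distribute.MirroredStrategy provides today. The sketch below is a modern, illustrative equivalent using a simple pixel-wise loss and the generator from the earlier sketch; it is not the original TF1 tower implementation.

```python
# Illustrative synchronous data-parallel training with MirroredStrategy.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()    # one replica per visible GPU

with strategy.scope():
    generator = build_a_srgan_generator()      # from the sketch above
    optimizer = tf.keras.optimizers.Adam(1e-4)

@tf.function
def distributed_step(dist_inputs):
    def step_fn(inputs):
        lr, hr = inputs
        with tf.GradientTape() as tape:
            sr = generator(lr, training=True)
            loss = tf.reduce_mean(tf.square(hr - sr))   # simple pixel-wise loss
        grads = tape.gradient(loss, generator.trainable_variables)
        # Gradients are aggregated across replicas before the update.
        optimizer.apply_gradients(zip(grads, generator.trainable_variables))
        return loss
    per_replica_loss = strategy.run(step_fn, args=(dist_inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_loss, axis=None)

# dist_dataset = strategy.experimental_distribute_dataset(dataset)
# for batch in dist_dataset:
#     distributed_step(batch)
```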
