Review: High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis

Published in Analytics Vidhya, Sep 29, 2020

Helloooooo guys! In the previous post, we went through an introduction to image inpainting and the first GAN-based inpainting algorithm, Context Encoders. If you have not read the previous post, I highly recommend having a quick look at it first! This time, we will dive into another inpainting method, which can be regarded as an improved version of Context Encoders. Let’s start!

Recall

Here, I briefly recall what we have learnt in the previous post.

Deep semantic understanding of an image, or the context of an image, is important to the task of inpainting, and a (channel-wise) fully-connected layer is one way to capture that context.
For image inpainting, the visual quality of the filled images is much more important than the pixel-wise reconstruction accuracy. More specifically, as there is no model answer for the generated pixels (we do not have the ground truth in real-world situations), we just want realistic-looking filled images.

Motivation

Figure 1. Qualitative comparison of the inpainting task [1].

Existing inpainting algorithms can only handle low-resolution images because of memory limitations and the difficulty of training on high-resolution images.
Although the state-of-the-art inpainting method, Context Encoders, can successfully regress (predict) the missing parts with a certain degree of semantic correctness, there is still room for improvement in the textures and details of the predicted pixels, as shown in Figure 1.

Introduction

Context Encoder is not perfect: i) the texture details of the generated pixels can be further improved, and ii) it is not able to handle high-resolution images.
At the same time, Neural Style Transfer is a hot topic, in which we would like to transfer the style of one image (the style image) to another image (the content image) while keeping the content of the latter, as shown in Figure 2 below.

Figure 2. Example to illustrate the task of style transfer [2]

Note that textures and colors can be regarded as a kind of style. The authors of this paper employ a style transfer algorithm to enhance the texture details of the generated pixels.

Solution

The authors employ Context Encoder to predict the missing parts and get the predicted pixels.
Then, they apply a style transfer algorithm to the predicted pixels and the valid pixels. The main idea is to transfer the style of the most similar valid pixels to the predicted pixels to enhance the texture details.
In their formulation, they assume the size of the test images is always 512x512 with a 256x256 center missing hole. They use a three-level pyramid to handle this high-resolution inpainting problem. The input is first resized to 128x128 with a 64x64 center hole for a low-resolution reconstruction. After that, the filled image is up-sampled to 256x256 with a 128x128 coarsely filled hole for the second reconstruction. Finally, the filled image is again up-sampled to 512x512 with a 256x256 filled hole for the last reconstruction (or one may call it refinement).
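To make the three-level pyramid concrete, here is a minimal sketch that just computes the image and hole sizes at each scale. The function names `pyramid_scales` and `centered_hole` are my own (they do not come from the paper), and I assume each coarser level simply halves the resolution, as the sizes above suggest.

```python
def pyramid_scales(full_size=512, hole_size=256, levels=3):
    """Return (image_size, hole_size) pairs from the coarsest to the finest scale."""
    scales = []
    for level in range(levels - 1, -1, -1):
        factor = 2 ** level  # assumption: each coarser level halves the resolution
        scales.append((full_size // factor, hole_size // factor))
    return scales


def centered_hole(image_size, hole_size):
    """Return (top, left, bottom, right) of a centered square hole (bottom/right exclusive)."""
    start = (image_size - hole_size) // 2
    return (start, start, start + hole_size, start + hole_size)


for image_size, hole_size in pyramid_scales():
    print(image_size, hole_size, centered_hole(image_size, hole_size))
```

The loop visits 128x128, then 256x256, then 512x512, which is the order in which the filled result is passed from one reconstruction to the next.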

Contributions

Propose a framework which combines the techniques from Context Encoders and Neural Style Transfer.
Suggest a multi-scale approach to handle high-resolution images.
Experimentally show that style transfer techniques can be used to enhance the texture details of the generated pixels.

Approach

Figure 3 shows the proposed framework, and it is actually not difficult to understand. The Content Network is a slightly modified Context Encoder, while the Texture Network is a VGG-19 network pre-trained on ImageNet. For me, this is an early version of a coarse-to-fine network that operates at multiple scales. The main insight of this paper is how they optimize the model (i.e. the design of the loss function).

Content Network

As mentioned, the content network is the Context Encoder. They first train the content network independently. Then, the output of the trained content network will be used to optimize the entire proposed framework.
Referring to the structure of the content network in Figure 3, there are two differences from the original Context Encoder. i) The channel-wise fully-connected layer in the middle is replaced by a standard fully-connected layer. ii) All the ReLU or Leaky ReLU activation layers are replaced by ELU layers. The authors claim that ELU can better handle large negative neural responses than ReLU and Leaky ReLU. Note that ReLU only allows positive responses to pass through.
They train the Content Network in the same way as the Context Encoder, with a combination of L2 loss and adversarial loss. You may refer to my previous post for details.
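To see what "handling large negative responses" means for the three activations mentioned above, here is a small sketch using their standard textbook definitions. The Leaky ReLU slope of 0.01 and the ELU alpha of 1.0 are common defaults that I picked for illustration; I am not claiming they are the values used in the paper.

```python
import math


def relu(x):
    # Negative responses are cut to zero.
    return max(0.0, x)


def leaky_relu(x, slope=0.01):
    # Negative responses pass through, scaled down linearly (slope is an assumed default).
    return x if x > 0 else slope * x


def elu(x, alpha=1.0):
    # Negative responses saturate smoothly towards -alpha (alpha is an assumed default).
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


for x in (-5.0, -1.0, 0.0, 2.0):
    print(x, relu(x), leaky_relu(x), round(elu(x), 4))
```

For a large negative input, ReLU outputs zero, Leaky ReLU keeps growing linearly in the negative direction, and ELU levels off near -alpha.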

Texture Network

I will try to explain more about the texture network here as it is related to the topic of neural style transfer. Interested readers may google it for further details.

The objective of the texture network is to ensure that the fine details of the generated pixels are similar to the details of the valid pixels (i.e. we want the image to have a consistent style/texture).
Simply speaking, the authors make use of the findings in [2]. To some extent, the feature maps at different layers inside a network represent the image style. In other words, given a trained network, if two images have similar feature maps inside the network, we may claim that the two images have similar styles. To be honest, this is an over-simplified claim. In [2], the authors employ a VGG network pre-trained on ImageNet for classification as a feature extractor. They suggest computing a Gram matrix (also called autocorrelation matrix) of the feature maps at each layer in VGG. If two images have similar Gram matrices, they have similar image styles such as textures and colours. Back to the inpainting paper, the authors also use the pre-trained VGG network as their Texture Network, as shown in Figure 3. They try to enforce that the responses of the feature maps inside the center hole region are similar to those outside the center hole region at several layers of the VGG. They state that they use the relu3_1 and relu4_1 layers for this calculation.
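The Gram matrix idea from [2] can be illustrated with a tiny sketch in plain Python. This is only meant to show what the matrix measures, not how the inpainting paper computes its texture term. The helper names and the normalisation by the number of spatial positions are my own choices, and in practice one would use a tensor library on real VGG feature maps.

```python
def gram_matrix(feature_maps):
    """feature_maps: list of C channels, each a flat list of H*W activations.

    Returns the C x C matrix of inner products between channels,
    divided by the number of spatial positions (an assumed normalisation).
    """
    n_positions = len(feature_maps[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n_positions
         for ch_j in feature_maps]
        for ch_i in feature_maps
    ]


def gram_distance(g1, g2):
    """Sum of squared differences between two Gram matrices."""
    return sum((a - b) ** 2 for r1, r2 in zip(g1, g2) for a, b in zip(r1, r2))


# Toy feature maps: 2 channels, 4 spatial positions each.
features_a = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
# The same activations shifted by one spatial position.
features_b = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]

print(gram_matrix(features_a))
print(gram_distance(gram_matrix(features_a), gram_matrix(features_b)))
```

The two toy inputs differ in where the activations occur but not in how the channels co-occur, so their Gram matrices match. This is why the Gram matrix is treated as a description of style rather than layout.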

Loss Function

The total loss function consists of three terms, namely, content loss (L2 loss), texture loss, and TV loss (total variation loss).

Their joint loss function (Eq. 1) is what they want to minimize. Note that i indexes the scale and, as mentioned, they employ 3 scales in this work. x is the ground truth image (i.e. image in good condition without missing parts). h(x_i, R) returns the colour content of x_i within the hole region R. phi_t(x) returns the feature maps computed by network t given an input x. R^phi denotes the corresponding hole region in the feature maps. The last term is the total variation loss term, which is commonly used in image processing to ensure the smoothness of an image. alpha and beta are the weights that balance the loss terms.

The content loss term is very easy to understand: it is just the L2 loss, which ensures the pixel-wise reconstruction accuracy.
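As a rough illustration of an L2 loss restricted to the hole region, here is a minimal sketch on single-channel images stored as nested lists. The function name `hole_l2_loss` and the box convention `(top, left, bottom, right)` with exclusive bottom/right are assumptions of mine for this example.

```python
def hole_l2_loss(candidate, reference, hole):
    """Sum of squared differences between two images, inside the hole only.

    candidate, reference: 2D lists of pixel values (grayscale for simplicity).
    hole: (top, left, bottom, right), with bottom/right exclusive.
    """
    top, left, bottom, right = hole
    return sum(
        (candidate[r][c] - reference[r][c]) ** 2
        for r in range(top, bottom)
        for c in range(left, right)
    )


# 4x4 toy images with a 2x2 center hole.
candidate = [[0.0] * 4 for _ in range(4)]
reference = [[1.0] * 4 for _ in range(4)]
for r in (1, 2):
    for c in (1, 2):
        reference[r][c] = 0.5

print(hole_l2_loss(candidate, reference, (1, 1, 3, 3)))
```

Pixels outside the hole differ between the two toy images, but they do not contribute, since only the region R is compared.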

The texture loss term seems a bit complicated, but it is also easy to understand.
First, they feed the images to the pre-trained VGG-19 network to obtain the feature maps at the relu3_1 and relu4_1 layers (middle layers). Then, they separate the feature maps into two groups, one for the hole region (R^phi) and another for the outside (i.e. the valid region). Each local feature patch P inside the hole region has size s x s x c (s is the spatial size and c is the number of feature maps). For each of these patches, they find the most similar patch outside the hole region, and then compute the average of the L2 distances between each local patch and its nearest neighbour.
In Eq. 3, |R^phi| is the total number of patches sampled in the region R^phi, P_i is the local patch centered at location i, and nn(i) is calculated as,

Eq. 4 is used to search for the nearest neighbour of each local patch P_i.
Finally, the TV loss is computed as,

Again, this is commonly used in image processing to ensure the smoothness of an image.
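The texture and TV terms described above can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions, not the authors' implementation: patches are given as flat lists of feature values, the nearest neighbour is found by brute force over all outside patches, and squared distances are used throughout. The exact form in the paper may differ.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_neighbour(patch, outside_patches):
    """Index of the outside patch closest to `patch` (the role of Eq. 4)."""
    return min(range(len(outside_patches)),
               key=lambda j: squared_distance(patch, outside_patches[j]))


def texture_loss(hole_patches, outside_patches):
    """Average squared distance from each hole patch to its nearest
    neighbour outside the hole (the role of Eq. 3)."""
    total = 0.0
    for patch in hole_patches:
        j = nearest_neighbour(patch, outside_patches)
        total += squared_distance(patch, outside_patches[j])
    return total / len(hole_patches)


def tv_loss(image):
    """Sum of squared differences between horizontally and vertically
    adjacent pixels of a 2D list (a common form of total variation)."""
    rows, cols = len(image), len(image[0])
    horizontal = sum((image[r][c + 1] - image[r][c]) ** 2
                     for r in range(rows) for c in range(cols - 1))
    vertical = sum((image[r + 1][c] - image[r][c]) ** 2
                   for r in range(rows - 1) for c in range(cols))
    return horizontal + vertical


hole_patches = [[1.0, 1.0], [4.0, 4.0]]
outside_patches = [[0.0, 0.0], [1.0, 2.0], [4.0, 5.0]]
print(texture_loss(hole_patches, outside_patches))
print(tv_loss([[0.0, 1.0], [1.0, 0.0]]))
```

A flat image gives a TV loss of zero, while a checkerboard gives a large one, which is why this term pushes the result towards smoothness.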

Experimental Results

As with the Context Encoder, two datasets are used for evaluation: Paris StreetView [3] and ImageNet [4]. Paris StreetView consists of 14,900 training images and 100 test images. ImageNet contains 1.26M training images, and 200 test images are randomly selected from its validation set.

Table 1. Quantitative comparison on Paris StreetView dataset. Higher PSNR is better. [1]

Figure 4. Visual comparison with different approaches [1]

Table 1 shows the quantitative results of different methods. Higher PSNR means better performance. The proposed method in this paper offers the highest PSNR.
The authors also claim that quantitative evaluation (e.g. PSNR, L1 error, etc.) may not be the most effective way to assess the image inpainting task, as the objective is to generate realistic-looking filled images.
Figure 4 is the visual comparison with several methods. From the zoomed-in versions of (d) and (e), we can see that the proposed method can generate sharper texture details than the state-of-the-art method, Context Encoder.
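For readers who have not met PSNR before, here is a minimal sketch of the standard definition, computed from the mean squared error between two images given as flat lists of pixel values. The function name `psnr` and the peak value of 255 (8-bit images) are assumptions for this example. I do not know the exact evaluation script used in the paper.

```python
import math


def psnr(reference, result, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, result)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


print(round(psnr([100, 100], [101, 99]), 2))
```

Because PSNR depends only on the pixel-wise error, a blurry but "safe" prediction can score well, which is in line with the authors' remark about quantitative metrics.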

The Effects of Content and Texture Networks

Figure 5. (a) input image (b) result without the content loss term (c) result from the proposed method [1]

The authors provide an ablation study of the loss terms. Figure 5 shows the result without using the content loss term. It is clear that without the content loss term, the structure of the inpainting results is completely incorrect.

Figure 6. The effect of using different texture weight alpha. [1]

Apart from showing that the content loss term is necessary, the authors also show the importance of the texture loss term. Figure 6 shows the effect of different texture weights alpha in Eq. 1. A larger texture weight gives sharper results, but it may affect the overall image structure, as shown in Figure 6(d).

The Effect of Adversarial Loss

As mentioned, the authors train the Content Network in the same way as the Context Encoder. They compare the effect of using just the L2 loss against using both the L2 and adversarial losses.

Figure 7. (a) output of the content network trained with just L2 loss. (b) final result of the proposed method using (a). (c) output of the content network trained with L2 + adversarial loss. (d) final result of the proposed method using (c) [1]

From Figure 7, we can clearly see that the quality of the output of the content network is important to the final result. The results show that it is better to train the content network using both L2 and adversarial losses.

High-Resolution Image Inpainting

As mentioned before, the authors suggest a multi-scale approach to handle high-resolution images. The results are shown below.

Figure 8. Visual comparisons of ImageNet results [1]. From top to bottom: Input, Content-Aware Fill, Context Encoder, the proposed method.

Figure 8 shows the high-resolution image inpainting results. Context Encoder only works for 128x128 input images, so its results are up-sampled to 512x512 using bilinear interpolation. For the proposed method, the input goes through the network three times at three scales to complete the reconstruction. The proposed method offers the best visual quality among the compared methods. However, because of the multi-scale approach to high-resolution image inpainting, the proposed method takes roughly 1 min to fill in a 256x256 hole of a 512x512 image with a Titan X GPU, which is a major drawback of the proposed method (i.e. low efficiency).

Real-world Scenario (Object Removal)

The authors further extend the proposed method to handle irregularly shaped holes. Simply speaking, they first replace the irregular hole with its bounding rectangle. Then, they perform cropping and padding to position the hole at the center. By doing so, they can handle images with irregular holes. Some examples are shown below.

Figure 9. Arbitrary object removal [1]. From left to right: input, object mask, Content-Aware Fill result, the proposed method
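The first two steps of the object removal pipeline (bounding rectangle, then centering) can be sketched as follows. This is only my reading of the description above: the helper names are hypothetical, the mask is a nested list of 0/1 values, and the paper's actual cropping and padding procedure may differ in its details.

```python
def bounding_box(mask):
    """mask: 2D list of 0/1, where 1 marks the pixels to remove.

    Returns (top, left, bottom, right) with bottom/right exclusive,
    or None if the mask is empty.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None
    return (rows[0], cols[0], rows[-1] + 1, cols[-1] + 1)


def centering_offset(box, image_size):
    """Shift (d_row, d_col) that moves the centre of `box` to the image centre."""
    top, left, bottom, right = box
    height, width = image_size
    return ((height - (top + bottom)) // 2, (width - (left + right)) // 2)


# 6x8 toy mask with an irregular blob near the top-left corner.
mask = [
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
box = bounding_box(mask)
print(box, centering_offset(box, (6, 8)))
```

The offset tells us how far the image content has to be shifted (by cropping on one side and padding on the other) so that the rectangular hole ends up at the center, which is the setting the network was trained for.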

Conclusion

This is clearly an improved version of the Context Encoder. The authors adopt techniques from Neural Style Transfer to further enhance the texture details of the pixels generated by the Context Encoder. As a result, we are one step closer to realistic-looking filled images.
However, the authors also point out some future directions for improvement. i) It is still difficult to fill the missing parts when the scene is complicated as shown in Figure 10. ii) The speed is a problem as it cannot achieve real-time performance.

Figure 10. Failure cases of the proposed method [1]

Takeaways

Again, I would like to highlight some points here, as they will be useful for future posts.

This work is an early version of a coarse-to-fine network (also called a two-stage network). We first reconstruct the missing parts, and the reconstructed parts should have a certain pixel-wise reconstruction accuracy (i.e. ensure the structure is correct). Then, we refine the texture details of the reconstructed parts so that the filled images have good visual quality.
The concept of texture loss plays an important role in later image inpainting papers. By employing this loss, we can have sharper generated images. Later, we usually achieve sharp generated images by using Perceptual Loss and/or Style Loss. We will cover them very soon!

What’s Next?

Next time, we will dive into another milestone in deep learning-based image inpainting algorithms. I must say that so many inpainting papers are based on their network architecture! Hope you enjoy this post :)

References

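[1] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis. https://arxiv.org/pdf/1611.09969.pdf
[2] L. A. Gatys, A. S. Ecker, and M. Bethge. A Neural Algorithm of Artistic Style. https://arxiv.org/pdf/1508.06576.pdf
[3] C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros. What makes Paris look like Paris? ACM Transactions on Graphics, 2012.
[4] O. Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.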

Many thanks again. Thanks for spending time on this post. If you have any questions, please feel free to leave comments :) See You Next Time!