Review of DeepGIN: Deep Generative Inpainting Network for Extreme Image Inpainting

Chu-Tak Li
Published in Analytics Vidhya
7 min read · Sep 7, 2020

Hello everyone, the ECCV’20 AIM Workshop held an Extreme Image Inpainting Challenge this year (the 1st Image Inpainting Challenge). In this post, I would like to share one of the challenge papers, called DeepGIN. Source code and related materials are available at the authors' GitHub project page: https://github.com/rlct1/DeepGIN

Objective

  • Filling in the missing parts of an image, as shown below.
Figure 1. The first row shows different masked images. The second and third rows display the completed images using the state-of-the-art and the proposed method. The last row is the ground truth images. The degree of difficulty in image inpainting depends highly on the scale and shape of the masked areas

Motivation

  • Existing image inpainting approaches usually encounter difficulties in completing missing parts in the wild because they are trained either to deal with one specific type of missing pattern (mask) or under unilateral assumptions about the shapes and/or sizes of the masked areas (see Figure 1 for examples; the larger the missing parts, the more difficult the task)

Solution

  • The proposed model is a two-stage network, namely a coarse reconstruction stage and a refinement stage. The coarse reconstruction stage is responsible for a rough estimation of the missing parts, while the refinement stage is responsible for refining the coarse completed image

Contributions

  1. Propose a Spatial Pyramid Dilation (SPD) residual block to deal with different types of masks with various shapes and sizes
  2. Stress the importance of self-similarity to image inpainting and significantly improve the inpainting results by employing Multi-Scale Self-Attention (MSSA)
  3. Design a Back Projection (BP) strategy so that the generated patterns in the completed images align better with the reference ground truth images

Approach

Figure 2. Overview of the proposed model. There are two generators and two discriminators
  • The proposed model consists of two stages as shown in Figure 2 (a rough sketch of the data flow is given after this list).
  • Coarse Reconstruction Stage: The coarse generator G_1 takes the mask M and the masked input image I_in as input and produces a coarse completed image I_coarse
  • Refinement Stage: The refinement generator G_2 is trained to decorate the coarse completed image with details and textures
  • Conditional Multi-Scale Discriminators: Two discriminators take input at two different scales to encourage better details and textures of the local reconstructed patterns at the two scales
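
To make the data flow concrete, here is a minimal PyTorch-style sketch of the two-stage pipeline described above. The function and variable names are illustrative and are not taken from the official repository; the only assumption is that G_1 and G_2 accept the channel-wise concatenation of an image and a mask.

```python
import torch

# Hypothetical sketch of the two-stage data flow (names are illustrative,
# not taken from the DeepGIN repository).
def two_stage_inpaint(G1, G2, I_in, M):
    """I_in: masked image, M: binary mask (1 = missing, 0 = valid)."""
    # Stage 1: coarse reconstruction from the masked image and the mask
    x = torch.cat([I_in, M], dim=1)          # concatenate along channels
    I_coarse = G1(x)
    # Keep the valid pixels from the input, take the coarse prediction in the holes
    I_coarse_comp = I_in * (1 - M) + I_coarse * M
    # Stage 2: refinement of the coarse completed image
    I_refined = G2(torch.cat([I_coarse_comp, M], dim=1))
    I_out = I_in * (1 - M) + I_refined * M   # final composite
    return I_coarse_comp, I_out
```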

Spatial Pyramid Dilation (SPD) residual block

Figure 3. Variations of residual block. (a) is a standard residual block; (b) is a simple dilated residual block; (c) and (d) are the proposed residual block with multiple dilation rates
  • As the scale and shape of the masked areas are randomly determined, the authors propose using multiple dilation rates to enlarge the receptive field at each layer so that information from distant spatial locations can be used for the reconstruction. Figure 3 graphically illustrates the design of the SPD residual block
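
The following is a rough PyTorch sketch of what an SPD residual block could look like: several parallel 3x3 convolutions with different dilation rates, fused and added back to the input. The exact dilation rates, channel splits, normalization, and activations used in DeepGIN may differ; see the official code for the precise design.

```python
import torch
import torch.nn as nn

class SPDResBlock(nn.Module):
    """Sketch of a Spatial Pyramid Dilation residual block: parallel 3x3
    convolutions with different dilation rates, fused and added to the input.
    (Rates and channel split are illustrative, not the official design.)"""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        branch_ch = channels // len(dilations)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, branch_ch, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(branch_ch * len(dilations), channels, 3, padding=1)

    def forward(self, x):
        out = torch.cat([b(x) for b in self.branches], dim=1)  # spatial pyramid
        return x + self.fuse(out)                              # residual connection
```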

Multi-Scale Self Attention (MSSA)

  • The Self-Attention block used in this paper is exactly the same as the Non-local block
  • The main idea of Self-Attention is to compute the self-similarity within the image, which is useful for amending the generated patterns according to the remaining valid pixels in a masked image
  • The authors apply MSSA instead of a single SA block to enhance the coherency of the completed image by attending to self-similarity at three different scales as shown in Figure 2. To avoid additional parameters, they simply use standard convolutional layers to reduce the channel size before connecting to the SA blocks
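
Since the SA block is the standard non-local block, a compact sketch of it (in the embedded-Gaussian form) is given below. For MSSA, this block would be applied to feature maps at three different scales, after the channel-reducing convolutions mentioned above. The reduction factor here is an assumption, not a value from the paper.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Non-local (self-attention) block in the embedded-Gaussian form.
    A sketch: the exact channel reductions used in DeepGIN may differ."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))       # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)    # (B, HW, C')
        k = self.key(x).flatten(2)                      # (B, C', HW)
        attn = torch.softmax(q @ k, dim=-1)             # (B, HW, HW) self-similarity
        v = self.value(x).flatten(2).transpose(1, 2)    # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return x + self.gamma * out
```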

Back Projection (BP)

  • The authors also re-design the back projection strategy, as shown in the shaded Back Projection region in Figure 2. They learn to weight the BP residual and add it back to update the final completed image, so that the generated patterns align better with the reference ground truth images
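
Below is one plausible, heavily simplified interpretation of the learned BP update: compute a residual between the refined prediction and the coarse estimate, weight it with a small learned convolution, and add it back. The exact definition of the BP residual in DeepGIN may well differ; this is only a sketch of the general idea.

```python
import torch
import torch.nn as nn

class BackProjection(nn.Module):
    """Rough interpretation of a learned back-projection update (an assumption,
    not the official design): weight a residual with a small conv and add it back."""
    def __init__(self, channels=3):
        super().__init__()
        self.weight = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),                    # per-pixel weights in [0, 1]
        )

    def forward(self, I_refined, I_coarse):
        residual = I_refined - I_coarse      # BP residual (assumed definition)
        return I_refined + self.weight(residual) * residual
```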

Conditional Multi-Scale Discriminators

  • Two discriminators at two input scales are trained together with the generators to encourage details in the filled regions. The discriminators output a set of feature maps, and each value on these maps corresponds to a local region of the input image at one of the two scales. This encourages both appearance and semantic similarity
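
A minimal sketch of a PatchGAN-style conditional multi-scale discriminator is shown below, assuming the conditioning is the mask concatenated with the image and that the second discriminator operates on a 2x downsampled input. Layer widths and the exact conditioning are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

def patch_discriminator(in_ch=4):
    """PatchGAN-style discriminator: outputs a map of local real/fake scores.
    Layer widths are illustrative."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True),
        nn.Conv2d(256, 1, 4, stride=1, padding=1),    # one score per local patch
    )

class MultiScaleDiscriminator(nn.Module):
    """Two discriminators at two input scales (full and half resolution)."""
    def __init__(self, in_ch=4):
        super().__init__()
        self.d_full = patch_discriminator(in_ch)
        self.d_half = patch_discriminator(in_ch)
        self.down = nn.AvgPool2d(2)

    def forward(self, img, mask):
        x = torch.cat([img, mask], dim=1)             # condition on the mask (assumed)
        return self.d_full(x), self.d_half(self.down(x))
```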

Loss Function

There are five major terms in their loss function (a rough composite sketch is given after the list):

  1. L1 loss to ensure the pixel-wise reconstruction accuracy
  2. Adversarial loss to urge the distribution of the completed images to be close to the distribution of the real images
  3. Feature Perceptual loss to encourage each completed image and its reference ground truth image to have similar feature representations as computed by a well-trained network with good generalization, like VGG-19
  4. Style loss to emphasize the style similarity such as textures and colors between the completed images and real images
  5. Total variation loss used as a regularizer to guarantee smoothness in the completed images by penalizing visual artifacts and discontinuities
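
As a rough illustration of how these five terms could be combined, here is a hedged PyTorch sketch. The loss weights are placeholders, the adversarial term uses a generic hinge-style generator loss, and the perceptual/style terms assume pre-extracted VGG-19 feature lists; the actual weights and formulations in the paper may differ.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Gram matrix of a feature map, used for the style loss."""
    b, c, h, w = feat.shape
    f = feat.flatten(2)                              # (B, C, HW)
    return (f @ f.transpose(1, 2)) / (c * h * w)

def total_variation(img):
    """Total variation regularizer: penalizes discontinuities."""
    return (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean() + \
           (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()

def generator_loss(I_out, I_gt, d_fake, vgg_feats_out, vgg_feats_gt,
                   w_l1=1.0, w_adv=0.1, w_perc=0.1, w_style=250.0, w_tv=0.1):
    """Composite of the five terms; weights are placeholders, not paper values."""
    l1 = F.l1_loss(I_out, I_gt)                      # pixel-wise reconstruction
    adv = -d_fake.mean()                             # hinge-style generator loss (assumed)
    perc = sum(F.l1_loss(fo, fg) for fo, fg in zip(vgg_feats_out, vgg_feats_gt))
    style = sum(F.l1_loss(gram_matrix(fo), gram_matrix(fg))
                for fo, fg in zip(vgg_feats_out, vgg_feats_gt))
    tv = total_variation(I_out)
    return w_l1 * l1 + w_adv * adv + w_perc * perc + w_style * style + w_tv * tv
```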

Experiments

  • Random Mask Generation: Three different types of masks were used in training (as shown in Figure 1): rectangular masks, free-form masks, and cellular automata masks. The authors applied all three types of masks to each training image to achieve more stable training
  • Two-stage Training: The training process is divided into two stages, namely a warm-up stage and the main stage. They first trained the two generators using L1 loss for 10 epochs. Then, they alternately trained the generators with the discriminators for 100 epochs
  • Training Data: They trained the proposed model on two datasets, namely the CelebA-HQ dataset (face images only) and the ADE20K dataset (a more general dataset containing buildings, people, natural scenes, etc.)
  • Ablation Study: The authors first provide evidence to show the effectiveness of their suggested strategies and building blocks, namely SPD residual blocks, MSSA, and BP
Table 1. Ablation study on CelebA-HQ dataset. The best results are in bold typeface
Figure 4. Results from variations of the proposed model on CelebA-HQ dataset
  • The baselines are denoted as StdResBlk (Coarse only), i.e. only the first stage, and StdResBlk, a typical ResNet for inpainting; for both, all SA blocks and the BP branch are removed and all SPD residual blocks are replaced by the standard residual blocks shown in Figure 3
  • For quantitative results, from Table 1, the employment of MSSA brings a 1.06 dB increase in PSNR compared to StdResBlk-SA (a single SA block). This reflects the importance of MSSA to the image inpainting task
  • For qualitative results, Figure 4 compares the variations of the proposed model. Without the second refinement stage, the completed images lack facial details, as you can see in the 1st example of the 2nd column
  • Comparison with Previous Works: To test the generalization of the proposed model, the authors compared their model with some state-of-the-art methods on two publicly available datasets, FFHQ and Oxford Buildings
  • For quantitative results, Table 2 shows that their proposed model outperforms the other two methods in all the experiments in terms of the pixel-wise reconstruction accuracy (i.e. PSNR, SSIM, L1 err.). They also achieve better estimated perceptual quality (i.e. FID and LPIPS) in most of the scenarios
Figure 5. Qualitative results on FFHQ and Oxford Buildings datasets
  • For qualitative results, in Figure 5, it can be seen that DeepFill v1 and v2 fail to achieve satisfactory visual quality on the large rectangular masks shown in the first and fourth columns. Note that the authors try to strike a balance between pixel-wise accuracy and visual quality. To show this, they also provide the predicted semantic segmentation results
Figure 6. Visualizations of the predicted semantic segmentation test results
  • It is obvious that their test results are semantically closer to the ground truth than those of the other two methods; see, for example, the intersection of the newspaper and the lawn in the first two columns
  • Some extra test results are also available at their GitHub project page
Figure 7. Test results on the AIM Extreme Image Inpainting Challenge 2020
Figure 8. More test results on the AIM Challenge test set

Conclusion

  • Recall that the authors propose three main strategies in this paper for image inpainting, namely the Spatial Pyramid Dilation (SPD) residual block, Multi-Scale Self-Attention (MSSA), and Back Projection (BP). They also point out that the right balance between pixel-wise reconstruction accuracy and visual quality is required to avoid strange generated patterns

Interested readers are strongly encouraged to read the paper and visit the GitHub project page for more details.

Personal Thoughts

  1. From the ablation study, it seems that a one-stage network for image inpainting is also feasible, and further study in this direction should be done
  2. High PSNR usually comes with blurry images, and resolving this trade-off is crucial for winning the challenge

