Introduction to Generative Models for Image Inpainting and Review: Context Encoders

Chu-Tak Li · Published in Analytics Vidhya · Sep 29, 2020 · 9 min read

Hello everyone. I am going to review a series of papers related to image inpainting. In this first post, I would like to give an introduction to image inpainting: you will learn the objective of image inpainting, its applications, and more. Then, we will dive into the first generative model for image inpainting in the literature (i.e. the first GAN-based inpainting algorithm, Context Encoders). Let’s start!

Objective

  • Very straightforward: we want to fill in the missing parts of an image, just like what you can see in Figure 1.
Figure 1. An image with a missing center hole (left) and the filled image (right) [1]

Applications

  • Remove unwanted parts in an image (i.e. object removal)
  • Recover corrupted images (can be extended to repairing of movies)
  • Many others!

Terminology

Given an image with some missing areas, we define

  • missing pixels/generated pixels/hole pixels: the pixels located in the areas to be filled.
  • valid pixels/ground truth pixels: the opposite of the missing pixels. These pixels are kept and serve as hints for filling in the missing areas.

Conventional Methods

  • Given an image with some missing areas, the most typical traditional approach to filling in the missing regions is copy-and-paste.
  • The main idea is to search for the most similar image patches, either from the image itself or from a large dataset with millions of images, and then paste them into the missing regions.
  • However, the search can be time-consuming and it relies on a hand-crafted distance metric. There is still room for improvement in terms of generalization and efficiency (a toy sketch of the patch-search idea follows this list).
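
The sketch below is a toy illustration of the copy-and-paste idea, assuming a simple sum-of-squared-differences (SSD) distance over the valid pixels of a square patch. The function name `best_matching_patch`, the patch size, and the exhaustive search are my own illustrative choices, not taken from any particular paper.

```python
import numpy as np

def best_matching_patch(image, mask, top_left, patch=16, stride=4):
    """Toy copy-and-paste inpainting: exhaustively search `image` for the
    patch most similar to the valid pixels of the query patch at `top_left`.
    `image` is HxWx3, `mask` is HxW with 1 = missing pixel, 0 = valid pixel."""
    y0, x0 = top_left
    query = image[y0:y0 + patch, x0:x0 + patch].astype(np.float64)
    valid = (mask[y0:y0 + patch, x0:x0 + patch] == 0)[..., None]  # compare valid pixels only

    best_score, best_pos = np.inf, None
    H, W = mask.shape
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            if mask[y:y + patch, x:x + patch].any():   # skip candidates overlapping the hole
                continue
            cand = image[y:y + patch, x:x + patch].astype(np.float64)
            score = np.sum(valid * (cand - query) ** 2)  # hand-crafted SSD metric
            if score < best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score
```

Real systems (e.g. PatchMatch-style methods) replace this exhaustive search with much faster randomized search, but the reliance on a hand-crafted distance metric remains.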

Data-driven Deep Learning-based Approaches

  • Because of the success of Convolutional Neural Networks (CNNs) in image processing, many people have started applying CNNs to their own tasks. The power of data-driven deep learning-based approaches is that we can tackle our problems as long as we have enough training data.
  • As mentioned above, image inpainting fills in the missing parts of an image. This means that we would like to generate content that does not exist in the image and has no single correct answer. So, nearly all deep learning-based inpainting algorithms employ Generative Adversarial Networks (GANs) to produce visually appealing results. Why visually appealing? Since there is no model answer for the generated content, people prefer results with good visual quality, and this is quite subjective!
  • For readers who may not know about GANs, I recommend you look them up first. Here, using image inpainting as an example, a typical GAN consists of one generator and one discriminator. The generator is responsible for filling in the missing parts of an image, and the discriminator is responsible for distinguishing filled images from real images. Note that real images are images in good condition (i.e. without missing parts). We feed both filled images and real images to the discriminator, and the generator tries to fool it. Eventually, if the discriminator cannot judge whether an image was filled by the generator or is a real image, the generator is able to fill in the missing parts with good visual quality! A minimal training-loop sketch follows this list.
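
Below is a minimal PyTorch sketch of this alternating training, assuming a `generator` that fills masked images and a `discriminator` that ends with a sigmoid so its output lies in (0, 1). Both modules, the optimizers, and the masking convention are placeholders for illustration, not the exact networks used in any specific paper.

```python
import torch
import torch.nn.functional as F

def gan_inpainting_step(generator, discriminator, g_opt, d_opt, images, mask):
    """One alternating GAN update. `images`: real images (N,3,H,W);
    `mask`: (N,1,H,W) with 1 = missing pixel, 0 = valid pixel."""
    masked = images * (1 - mask)               # remove the hole region
    filled = generator(masked)                 # generator fills in the hole

    # Discriminator update: real images -> 1, filled images -> 0
    d_opt.zero_grad()
    d_real = discriminator(images)
    d_fake = discriminator(filled.detach())    # no gradient into the generator here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    d_opt.step()

    # Generator update: try to make the discriminator label filled images as real
    g_opt.zero_grad()
    g_fake = discriminator(filled)
    g_loss = F.binary_cross_entropy(g_fake, torch.ones_like(g_fake))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```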

The First GAN-based Inpainting Method, Context Encoders: Feature Learning by Inpainting

After this brief introduction to image inpainting, I hope that you at least know what image inpainting is and that GANs (one kind of generative model) are commonly used in the field of inpainting. Now, we are going to dive into the first paper in this series. Are you ready? Let’s learn and have fun together!

Intention

  • The authors want to train a CNN to predict the missing pixels in an image. Typical CNNs (e.g. LeNet for handwritten digit recognition and AlexNet for image classification) consist of a number of convolutional layers for extracting features, from simple structural features to high-level semantic features (i.e. earlier layers capture simple features such as edges and corners, while later layers capture more complex feature patterns; readers may refer to my previous post). The authors would like to make use of these learned high-level semantic features (also called latent features) to help fill in the missing regions.
  • Also, learning features for inpainting requires a deeper semantic understanding of the image, so the learned features are also useful for other tasks such as classification, detection, and semantic segmentation.

Background

Here, I would like to provide some background information for readers:

  • Autoencoders: a kind of CNN structure commonly used for reconstruction tasks. Some also call it an hourglass structure because of its shape. The output size is the same as the input size, and the network has two parts, an encoder and a decoder, as shown in Figure 2 below. The encoder is for feature encoding, aiming at a compact latent feature representation of the input, while the decoder decodes that latent representation back into image space. We usually call the middle layer the low-dimensional “bottleneck” layer, or simply the “bottleneck”; hence the entire structure looks like an hourglass. Imagine that we feed an image in good condition into this autoencoder. In this case, we expect the output to be exactly the same as the input, i.e. a perfect reconstruction. If that is possible, the “bottleneck” is a perfect compact latent feature representation of the input: we can use far fewer numbers to represent the input (this is related to dimensionality reduction techniques). So, this “bottleneck” contains almost all the information of the input (possibly including high-level semantic features) and we can use it to reconstruct the input. A minimal code sketch follows Figure 2.
Figure 2. A simple graphical illustration of the autoencoder (encoder-decoder) structure [2]
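
Here is a minimal PyTorch sketch of such an encoder-decoder (hourglass) structure. The input size (3x64x64) and channel counts are arbitrary choices for illustration, not the ones from Figure 2 or the paper.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Encoder squeezes a 3x64x64 image into a small bottleneck;
    the decoder reconstructs an output of the same size as the input."""
    def __init__(self, bottleneck_channels=64):
        super().__init__()
        self.encoder = nn.Sequential(                                              # input: 3x64x64
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),                   # 32x32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),                  # 64x16x16
            nn.Conv2d(64, bottleneck_channels, 4, stride=2, padding=1), nn.ReLU()  # bottleneck: 64x8x8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(bottleneck_channels, 64, 4, stride=2, padding=1), nn.ReLU(),  # 64x16x16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),                   # 32x32x32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid()                  # 3x64x64
        )

    def forward(self, x):
        bottleneck = self.encoder(x)     # compact latent representation
        return self.decoder(bottleneck)  # reconstruction at the input size

x = torch.rand(1, 3, 64, 64)
print(TinyAutoencoder()(x).shape)  # torch.Size([1, 3, 64, 64]) -- same size as the input
```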

Context Encoders for Image Generation

Figure 3. Overview of the proposed Context Encoder [1]

Figure 3 shows an overview of the proposed Context Encoder. First, the input is the masked image (i.e. an image with a missing center hole). The input is fed into the encoder to obtain the encoded features. Then, the main contribution of this paper, a channel-wise fully connected layer, is placed between the encoded features and the decoded features to obtain better semantic features (i.e. the “bottleneck”). Finally, a decoder reconstructs the missing parts from the “bottleneck” features. Let’s have a look inside their network.

Figure 4. Detailed architecture of the proposed network [1]

Encoder

  • The proposed encoder follows the AlexNet [3] architecture. The authors trained their network from scratch with randomly initialized weights.
  • Compared to the original AlexNet architecture and the autoencoder shown in Figure 2, the main difference is the middle channel-wise fully connected layer. If there are only convolutional layers in the network, there is no way to make use of features at distant spatial locations in the feature maps. To solve this, we could use a fully connected layer, so that the value of each neuron in the current layer depends on all the values of the neurons in the previous layer. However, a fully connected layer introduces a huge number of parameters: the encoder output has 4x4x512 = 8192 values, so a fully connected layer over it would need 8192x8192 ≈ 67.1M parameters. This is difficult to train even on GPUs, so the authors proposed the channel-wise fully connected layer to tackle this issue.

Channel-wise Fully Connected Layer

  • Actually, the channel-wise fully connected layer is very simple: we fully connect each channel independently instead of connecting all channels together. Say we have m feature maps of size n x n. If a standard fully connected layer were used, we would have m²n⁴ parameters (excluding bias terms). With a channel-wise fully connected layer, we only have mn⁴ parameters. Therefore, we can capture features from distant spatial locations without adding so many extra parameters. A minimal sketch is given below.
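
Below is a minimal sketch of how such a channel-wise fully connected layer could be implemented in PyTorch, together with the parameter counts discussed above. This is my own reading of the idea, not the authors' code; the class name and initialization are illustrative.

```python
import torch
import torch.nn as nn

class ChannelWiseFullyConnected(nn.Module):
    """Each of the m channels gets its own (n*n -> n*n) fully connected map,
    so distant spatial locations can interact while channels stay independent."""
    def __init__(self, channels, size):
        super().__init__()
        # one (n*n x n*n) weight matrix per channel: m * n^4 parameters (no bias here)
        self.weight = nn.Parameter(torch.randn(channels, size * size, size * size) * 0.01)

    def forward(self, x):                       # x: (N, m, n, n)
        N, m, n, _ = x.shape
        flat = x.view(N, m, n * n)              # flatten each channel
        out = torch.einsum('bcj,cij->bci', flat, self.weight)  # per-channel matrix multiply
        return out.view(N, m, n, n)

m, n = 512, 4
layer = ChannelWiseFullyConnected(m, n)
print(sum(p.numel() for p in layer.parameters()))  # m * n^4 = 131,072 parameters
print((m * n * n) ** 2)                            # a full FC layer would need 8192^2 ≈ 67.1M
```

If I remember the paper correctly, this layer is followed by a stride-1 convolution so that information can also propagate across channels.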

Decoder

  • The decoder is simply the reverse of the encoding process: a series of transposed convolutions recovers the reconstructed image at the desired size.
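
As a quick sanity check on how a transposed convolution upsamples, here is a tiny PyTorch snippet. The kernel size, stride, and padding are common choices that exactly double the spatial size; they are not necessarily the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(512, 256, kernel_size=4, stride=2, padding=1)
bottleneck = torch.rand(1, 512, 4, 4)   # encoded "bottleneck" features
print(up(bottleneck).shape)             # torch.Size([1, 256, 8, 8]) -- spatial size doubled
```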

Loss Function

  • The loss function used in this paper consists of two terms. The first term is a reconstruction loss (L2 loss), which focuses on pixel-wise reconstruction accuracy (i.e. a PSNR-oriented loss) and tends to produce blurry images. The second term is an adversarial loss, which is commonly used in GANs. It encourages the data distribution of the filled images to move closer to that of the real images.
  • For readers who are interested in the loss function, I highly recommend reading the paper for the exact equations. Here, I just describe each loss term verbally; a hedged code sketch of the joint loss is given after this list.
Reconstruction Loss (L2 Loss) [1], M(hat) indicates the missing regions (1 for missing parts, 0 for valid pixels), F is the generator
  • L2 loss: they compute the L2 (Euclidean) distance between the generated pixels and the corresponding ground truth pixels of the real image. They only consider the missing region, as shown in Figure 4.
Adversarial Loss [1], D is the discriminator. We want to train a discriminator that can distinguish filled images from real images
  • Adversarial loss: the structure of the adversarial discriminator is shown in Figure 4. The discriminator outputs a single value: 1 if the input is a real image and 0 if the input is a filled image.
Joint Loss. Lambda_rec is set to 0.999 while Lambda_adv is set to 0.001 in their paper
  • Both the generator and the discriminator are trained alternately by stochastic gradient descent, using the Adam optimizer.
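
The following is a hedged sketch of the joint generator loss described above, assuming the masking convention from the reconstruction-loss caption (1 for missing pixels, 0 for valid pixels) and a sigmoid-output discriminator, with lambda_rec = 0.999 and lambda_adv = 0.001 as in the paper. The function name and normalization details are my own simplifications.

```python
import torch
import torch.nn.functional as F

def joint_generator_loss(filled, real, mask, d_on_filled,
                         lambda_rec=0.999, lambda_adv=0.001):
    """filled, real: (N,3,H,W); mask: (N,1,H,W) with 1 = missing pixel;
    d_on_filled: discriminator output for the filled images, in (0, 1)."""
    # L2 reconstruction loss, computed on the missing region only
    rec_loss = ((mask * (filled - real)) ** 2).mean()
    # Adversarial loss: push the discriminator to call the filled images "real"
    adv_loss = F.binary_cross_entropy(d_on_filled, torch.ones_like(d_on_filled))
    return lambda_rec * rec_loss + lambda_adv * adv_loss
```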

Experimental Results

  • Two datasets are used in their evaluation, namely Paris Street View [4] and ImageNet [5].
  • The authors first show the inpainting results, then they also show that the learned features can be transferred to other tasks as a pre-training step.

Semantic Inpainting

Figure 5. Inpainting results [1]. The first 3 rows are results from the ImageNet dataset; the bottom 2 rows are from the Paris StreetView dataset
  • Figure 5 shows the inpainting results using the proposed context encoders.
Table 1. Pixel-wise reconstruction accuracy for the Paris StreetView dataset [1]
  • The authors also compared with the conventional nearest neighbor (NN) inpainting algorithm. Obviously, the proposed method outperforms the NN inpainting method.
Figure 6. Inpainting results using different methods [1].
  • Figure 6 displays the inpainting results using various approaches. We can see that L2 loss tends to give blurry images (2nd column). L2 + Adversarial loss gives sharper filled images. For NN-Inpainting, they just copy and paste the nearest image patches into the missing region for comparison.

Feature Learning

Figure 7. Context Nearest Neighbours [1].
  • To show the usefulness of their learned features, the authors encode different image patches and report the nearest neighbours based on the encoded features. In Figure 7, the authors compare with conventional HOG features and a typical AlexNet. They achieve performance similar to AlexNet, even though AlexNet was pre-trained on a labelled dataset of a million images.
Table 2. Quantitative comparison for classification, detection and semantic segmentation [1].
  • As you can see in Table 2, models pre-trained on ImageNet have the best performance, but expensive labels are required. For the proposed method, context is the supervision used to train the model. This is what they call feature learning by inpainting. It is clear that their learned feature representations are comparable to, or even better than, those of other models trained with auxiliary supervision.

Conclusion

  • The proposed context encoders are trained to generate images conditioned on context. They achieved state-of-the-art performance in semantic inpainting.
  • The learned feature representations are also useful to other tasks such as classification, detection and semantic segmentation.

Takeaways

I would like to highlight some points here. These points will be useful for the upcoming posts.

  • For image inpainting, we must use the “hints” from the valid pixels to help fill in the missing pixels. The term “context” refers to the understanding of the entire image itself.
  • The main contribution of this paper is the channel-wise fully connected layer. Actually, it is not difficult to understand this layer. To me, it is an early, simplified version of Non-Local Neural Networks or Self-Attention. The main point is that every feature location in the previous layer contributes to each feature location in the current layer. From this point of view, we obtain a much deeper semantic understanding of the entire image. This concept has been adopted extensively in later papers!
  • To the best of my knowledge, all later inpainting papers follow this GAN-based, encoder-decoder structure. People aim for filled images with good visual quality.

What’s Next?

  • Next time, we will look into another paper that is an improved version of the Context Encoders! I hope that I can show you the progress in the field of image inpainting. Click here for the next post!

References

  1. D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros, “Context Encoders: Feature Learning by Inpainting,” CVPR, 2016. https://arxiv.org/pdf/1604.07379.pdf
  2. Hourglass picture from https://www.123rf.com/photo_65737761_stock-illustration-illustration-of-an-hourglass-with-sand.html
  3. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks” (AlexNet), NIPS, 2012. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
  4. C. Doersch, S. Singh, A. Gupta, J. Sivic, and A. Efros, “What Makes Paris Look like Paris?” ACM Transactions on Graphics, 2012.
  5. O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” IJCV, 2015.

Thanks for reading. If you have any questions, please feel free to leave comments :) Thanks again! See you next time.
