Context Encoder — Image inpainting using GAN

Tomáš Halama
knowledge-engineering-seminar
8 min read · May 2, 2020

Context Encoder [1] is an architecture for image inpainting that has proved highly influential and has served as a foundation for many more advanced and robust approaches to the task. In this blog post, I aim to explain what image inpainting is, how it can be useful, and which crucial ideas the Context Encoder introduced.

Image inpainting, what and how?

One of the problems that has kept emerging ever since the invention of the camera is that images become damaged in various ways and often need to be repaired. The task of image inpainting requires us to fill in a specified region of an image based on the rest of the picture. Historically, this would be done by a professional artist, who might spend hours or even days restoring a single photograph or painting. Fortunately, thanks to recent advances in computer science, there are other methods that can aid us in restoring image data. Beyond repairing damage, we can also use inpainting techniques to fix various other image anomalies.

These types of data anomalies include but are not limited to blurred areas, watermarks, unwanted objects or even widespread noise. To perform successful image inpainting, we need to provide a seamless and plausible replacement for a specific region of pixels in the image.

If you have ever used a tool like Adobe Photoshop to retouch photos, you might just get the right idea. Removing imperfections from a person’s face is the same thing as inpainting the area where the imperfections are!

Retouching in Adobe Photoshop

Unfortunately, most of the standard methods for computer-aided inpainting (such as the tools in Adobe Photoshop) rely on local features such as colours and textures and fail to consider the global semantics of the image. These methods work well when the corruption is minor or straightforward to fill in, but not when the damage is more significant, where they fail to produce reasonable or plausible results [1]. A method that respects the semantics of the image can, for example, generate an entire face based only on an outline of the head. This is not easily done with the standard algorithms used daily by graphic designers in tools such as Adobe Photoshop, which motivates the search for an approach capable of such a feat.

Comparison of a neural network approach and mainstream approach. Source: [1]

GAN

A significant number of state-of-the-art methods use generative deep neural networks, and their results look very promising. One way to generate globally well-organized and coherent images is to introduce a second neural network, an adversary that tries to decide whether the generated results look artificial or genuine. Using feedback from this adversary, the original generating network learns to produce results that are much less likely to be dismissed as artificial. These two networks are called the generator and the discriminator. This architecture, called Generative Adversarial Nets (GAN), was proposed in 2014 in [3].

GAN architecture. Source: https://www.slideshare.net/xavigiro/deep-learning-for-computer-vision-generative-models-and-adversarial-training-upc-2016
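To make the adversarial idea concrete, here is a minimal sketch of one GAN training step in PyTorch. This is my own illustration, not code from [3]; the `generator`, `discriminator`, their optimizers, and `real_batch` are assumed to be defined elsewhere, and the generator is assumed to take a flat latent vector.

```python
import torch
import torch.nn.functional as F

# One adversarial training step (sketch). `generator`, `discriminator`,
# `g_opt`, `d_opt` and `real_batch` are assumed to exist.
def gan_step(generator, discriminator, g_opt, d_opt, real_batch, z_dim=100):
    batch_size = real_batch.size(0)
    z = torch.randn(batch_size, z_dim)           # random latent vectors
    fake_batch = generator(z)                    # generated ("artificial") samples

    # Discriminator: real samples should score 1, generated samples should score 0.
    d_opt.zero_grad()
    d_real = discriminator(real_batch)
    d_fake = discriminator(fake_batch.detach())  # do not backprop into the generator here
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator believe the fakes are real.
    g_opt.zero_grad()
    g_out = discriminator(fake_batch)
    g_loss = F.binary_cross_entropy(g_out, torch.ones_like(g_out))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```

The two networks are trained in alternation: the discriminator improves at spotting fakes, and the generator improves at fooling it.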

Denoising Autoencoder

Besides GAN, the other ingredient needed to understand the Context Encoder is the autoencoder. In the next few paragraphs, we will quickly introduce the concept of an autoencoder, with emphasis on a variant called the denoising autoencoder. This should make the transition to the core ideas behind the Context Encoder easier.

An autoencoder is a type of neural network that attempts to replicate its input x at its output. It consists of two components, an encoder E and a decoder D. The encoder's output is a lower-dimensional vector z = E(x), also called a latent vector. (There are uses for a higher-dimensional latent space, but we shall ignore that case for our purposes.) We do not want the network to learn the identity function D(E(x)) = x, but rather to produce an approximate copy that still holds the same properties as the original input [2]. As a consequence, the network is encouraged to learn the most important and potentially useful feature representations in the latent space.

The training process minimizes a loss function L, which measures how different two data points are. Depending on the application, different loss functions can be used for L; among the most popular are mean squared error and cross-entropy. Using the notation of the previous paragraph, training aims to minimize the expression L(x, D(E(x))).
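As a rough illustration (not taken from any of the cited works), a plain autoencoder and its training objective can be written in a few lines; the layer sizes here are arbitrary and the encoder and decoder are simple fully connected networks.

```python
import torch
import torch.nn as nn

# Minimal autoencoder: the latent vector z = E(x) has a lower dimension than x,
# and training minimizes L(x, D(E(x))) -- here L is the mean-squared error.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))
criterion = nn.MSELoss()

x = torch.rand(16, 784)               # a batch of flattened 28x28 images
z = encoder(x)                        # latent space vector
reconstruction = decoder(z)           # approximate copy of the input
loss = criterion(reconstruction, x)   # L(x, D(E(x)))
```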

Denoising autoencoder. Source: https://www.pyimagesearch.com/2020/02/24/denoising-autoencoders-with-keras-tensorflow-and-deep-learning/

Building on the principles behind autoencoders, a small modification makes it possible to recover partially damaged data by changing the reconstruction criterion. Instead of learning to replicate the input directly, the input is damaged beforehand and the network is trained to approximate the undamaged original. To create a deep architecture, the authors stack multiple encoder/decoder pairs, chaining the autoencoders so that each encoder's output is the input to the next; the decoders mirror this order. The motivation is to have each latent layer represent progressively more abstract features, similar to most other deep neural networks. This architecture is called Stacked Denoising Autoencoders (SDA) [4]. Denoising training together with the deep layers enables the model to build a well-structured and robust hidden representation, from which the original image can be reconstructed [2].
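The denoising modification changes only the training target: the network sees a corrupted input but is penalized against the clean original. A hedged sketch, reusing the `encoder`, `decoder`, and `criterion` from the snippet above (the noise level is arbitrary):

```python
# Denoising variant: corrupt the input first, but compare the output
# against the *undamaged* original.
noisy_x = x + 0.3 * torch.randn_like(x)   # damage the input beforehand
reconstruction = decoder(encoder(noisy_x))
loss = criterion(reconstruction, x)       # target is the clean image, not noisy_x
```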

Context Encoder

Finally, the Context Encoder framework [1] is based on an autoencoder architecture with an adversarial discriminator, built exclusively from convolutional layers with varying kernel sizes and channel counts. Similar to denoising autoencoders, we do not try to reconstruct the input image, but rather generate a patch to fill in the missing or damaged area of the input.

The architecture of Context Encoder. Source: [1]

The encoder consists of 2D convolutional operations with a progressively increasing number of channels. Analogously, the decoder is made up of 2D transposed convolutions (sometimes also called deconvolutions). Convolutional operations decrease the width and height of the input while extracting features into deeper channel maps; transposed convolutions work in the opposite manner, upscaling the image and typically decreasing the number of channels. The intermediate bridging layer between these two sections is called the bottleneck layer and is meant to represent the encoded context of the image, hence the name Context Encoder. The size of this layer is dataset dependent and affects the network's ability to encode the semantics of the input image. The bottleneck does not need to be as size-restricted as when training a regular autoencoder, because we do not directly reconstruct the input and therefore do not have to prevent the network from learning the identity function.
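To make the shape of the architecture concrete, here is a sketch of such an encoder–bottleneck–decoder generator in PyTorch. The layer counts, kernel sizes, and channel numbers are illustrative rather than the exact configuration from [1]; the point is the overall pattern of strided convolutions down to a bottleneck and transposed convolutions back up. The sketch assumes a 128×128 input with a 64×64 region to inpaint.

```python
import torch.nn as nn

# Illustrative Context Encoder-style generator (not the exact layout from [1]).
class ContextEncoderGenerator(nn.Module):
    def __init__(self, bottleneck=4000):
        super().__init__()
        # Encoder: strided convolutions shrink spatial size, grow channel count.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 128 -> 64
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 64 -> 32
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2), # 32 -> 16
            nn.Conv2d(256, 512, 4, stride=2, padding=1), nn.LeakyReLU(0.2), # 16 -> 8
        )
        # Bottleneck: the encoded "context" of the image.
        self.bottleneck = nn.Conv2d(512, bottleneck, 8)                     # 8x8 -> 1x1
        # Decoder: transposed convolutions upscale and shrink the channel count.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(bottleneck, 512, 8), nn.ReLU(),              # 1 -> 8
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),# 8 -> 16
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),# 16 -> 32
            nn.ConvTranspose2d(128, 3, 4, stride=2, padding=1), nn.Tanh(),  # 32 -> 64
        )

    def forward(self, damaged_image):
        context = self.bottleneck(self.encoder(damaged_image))
        return self.decoder(context)   # predicted patch for the missing region
```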

The discriminator takes an image as input and is tasked with classifying it either as genuine data or as an inpainting result; it outputs its verdict as a single scalar representing a probability. For arbitrary (or random) region damage, the discriminator receives an image of the same size as the original undamaged sample. For square region damage, the authors decided to feed the discriminator only the inpainted patch instead of the entire composite image. One reason for this design choice was that a discriminator seeing the whole composite could fail to learn useful features and instead merely learn to recognize the boundary of the area where the inpainted data were inserted. Patch-only evaluation also results in lower computational requirements and, in turn, shorter training time.
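A matching discriminator is simply a small convolutional classifier that takes the 64×64 patch (or the full image, in the arbitrary-damage case) and outputs a single probability. Again, this is only a sketch under the same illustrative assumptions as the generator above:

```python
# Patch discriminator: classifies a 64x64 patch as genuine (~1) or inpainted (~0).
discriminator = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 64 -> 32
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 32 -> 16
    nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2), # 16 -> 8
    nn.Conv2d(256, 1, 8),                                           # 8x8 -> single scalar
    nn.Flatten(),
    nn.Sigmoid(),                                                   # probability of "real"
)
```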

Similar to vanilla autoencoders, all training is performed in an unsupervised manner, meaning no class label information is supplied to the model and there is no conditioning during evaluation. As a consequence, for significantly damaged images the model might produce a plausible inpainting whose content nevertheless differs from the class of the original.

The reconstruction loss, roughly L_rec = ||P − CE(X′)||², where P is the original region before damaging, CE is the model, and X′ is the entire image that needs to be inpainted.

The entire model is trained using two loss functions. The first is the reconstruction L2 loss between the original data and the inpainted result. If we used only the reconstruction loss, the results would be blurry and easily dismissed as fake; this is likely because the network approximates an average of all plausible inpainting results instead of committing to a concrete one. The issue is alleviated by the aforementioned adversarial discriminator, which learns to identify features specific to generated patches. Guided by the discriminator's feedback, the generator is pushed to produce sharper results instead of uncertain averages. The adversarial loss is combined with the reconstruction loss, and the two are weighted by their respective coefficients.

The overall loss, L = λ_rec · L_rec + λ_adv · L_adv, where the lambdas are coefficients for tuning the influence of each loss. The adversarial loss is analogous to the loss presented in [3].
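Putting the two losses together, one generator update might look roughly as follows. This is my paraphrase of the objective rather than the exact training code from [1], and the coefficient values are only placeholders; the `generator` and `discriminator` are the sketches from above.

```python
import torch
import torch.nn.functional as F

lambda_rec, lambda_adv = 0.999, 0.001   # illustrative weights, not tuned values

def generator_loss(generator, discriminator, damaged_image, original_patch):
    predicted_patch = generator(damaged_image)
    # Reconstruction loss: L2 distance between the original region and the inpainted result.
    rec_loss = F.mse_loss(predicted_patch, original_patch)
    # Adversarial loss: push the discriminator to rate the generated patch as genuine.
    d_out = discriminator(predicted_patch)
    adv_loss = F.binary_cross_entropy(d_out, torch.ones_like(d_out))
    return lambda_rec * rec_loss + lambda_adv * adv_loss
```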

Verification of the results, follow-up work

The results of a simplified implementation of Context Encoder (CE), done for the purposes of my bachelor thesis [5].

Once the model is trained, the results are very promising. I implemented the model as part of my bachelor thesis [5], during which I gained a lot of valuable insights, as it was a significant shift from my previous programming experience. The model I implemented was partly simplified, which may have affected the sharpness of the results.

Nevertheless, the Context Encoder clearly provided a basic architectural design that inspired many other works, and it is often used as a baseline when comparing the quality of newer models on the image inpainting task. These more recent works further improve consistency, image quality, and the degree of user guidance over the results. As a demonstration, I picked one follow-up work [6] to show the quality and features that can be achieved with a similar framework.

Results achieved in a follow-up work [6], loosely based on Context Encoder architecture. The authors managed to let the user guide the inpainting process.

Most of the text in this article comes from the survey sections of my as-yet-unpublished bachelor thesis [5], in which I reimplemented the Context Encoder model and measured its performance against other methods.

[1]: PATHAK, Deepak, et al. Context encoders: Feature learning by inpainting. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. p. 2536–2544.

[2]: GOODFELLOW, Ian; BENGIO, Yoshua; COURVILLE, Aaron. Deep learning. MIT press, 2016.

[3]: GOODFELLOW, Ian, et al. Generative adversarial nets. In: Advances in neural information processing systems. 2014. p. 2672–2680.

[4]: VINCENT, Pascal, et al. Extracting and composing robust features with denoising autoencoders. In: Proceedings of the 25th international conference on Machine learning. 2008. p. 1096–1103.

[5]: HALAMA, Tomáš. Image Inpainting Using Generative Adversarial Networks. Unpublished bachelor thesis. Czech Technical University in Prague, Faculty of Information Technology, 2020.

[6]: YU, Jiahui, et al. Free-form image inpainting with gated convolution. In: Proceedings of the IEEE International Conference on Computer Vision. 2019. p. 4471–4480.
