- Introduction
- Image Inpainting
- Sample Code for Image Inpainting
- Applications and Theory behind Image Inpainting
- Generative Adversarial Network
- Different Image Inpainting Algorithms
- Limitations of GAN Inpainting
- Conclusion
1. Introduction:
Everything we see with our own eyes over the course of a lifetime is registered in the world's biggest hard disk: the human brain. There are many things we try hard to remember and a few we would rather forget. As the years pass and we age, that ultimate hard disk slowly deteriorates, and we fail to recall many important things in life. Many decades ago, technology stepped in to capture our important moments, in the form of a device we call the camera.
People grew very fond of this technology and started capturing many events of their lives: a wedding, a vacation trip with family or friends, a reunion with old friends, the first smile of a newborn child, and many more, all in the form of images. Nowadays we save hundreds or thousands of images every year, and we love to look back at them and relive those precious moments, even if they were taken long ago.
No one would know what Einstein looked like unless his image had been captured or he had been drawn by a professional artist. History tells us many things in words, but the art drawn at different periods of time gives life to our imagination; by looking at those works, we can picture the living styles and habitats of those periods. So, are we preserving the valuable images that tell us about our ancestors?
Now imagine what happens when the images most beloved to us are damaged in some way: scratches across the picture, a torn corner, faded regions, and so on. How many times have we wished there were something that could repair a damaged image and make our day? But wait, why are we talking only about old images? Is that all we care about?
Let us consider another scenario. Imagine a breezy day: you have planned an early-morning trip to the beach with the love of your life, and you try to capture the romantic moment by taking a selfie. After you reach home, you notice strangers in the background of that selfie, or unwanted objects in the frame that you do not need. Relax. No worries!!
2. Image Inpainting:
For all the scenarios described above, there is a good solution, and it is called Image Inpainting. Let us give it a formal definition: image inpainting is the art, or process, of filling in lost information in an image, or of removing unwanted or damaged portions from an image, in a realistic manner.
This article is organized as follows: we get our hands dirty by implementing a simple inpainting algorithm from scratch in Section 3, cover the applications and theory behind inpainting in Section 4, look at the basic architecture of a GAN in Section 5, analyze some popular GAN-based inpainting algorithms in Section 6, discuss limitations in Section 7, and conclude in Section 8.
Before going deeply into the theory and applications of image inpainting, let us get our hands dirty by implementing a solution using a modern deep learning architecture.
3. Sample Code for Image Inpainting:
Let us walk through some sample code for inpainting images using a deep-learning-based GAN methodology. Here we go!!! The code for both training and testing the inpainting algorithm from scratch is already committed to a GitHub repository; you can find it here: RenithInPainting.
We are going to implement a simple inpainting algorithm by training the model from scratch and applying a regular (square) mask to the image. Finally, we will see how to feed a damaged image through the trained model to inpaint it.
We implement our inpainting algorithm using the PyTorch framework, which has recently become popular because of its ease of use. From Line 1 to Line 17, we import all the necessary Python packages, along with the PyTorch machine learning library, for implementing our image inpainting model.
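Since the full source lives in the repository, here is a rough sketch of what such an import block might look like; the exact list in the repository may differ slightly:

```python
import os

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchvision.utils import save_image
from PIL import Image
```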
From Line 19 to Line 25, we initialize the essential hyperparameters needed for our implementation: the total number of epochs, the batch size, the name of the dataset, the input image and mask sizes, and the sample interval for retrieving reconstructed sample output images. You can play around with these hyperparameter values and retrain the model to see how the output varies. Place your training images under the DATASET_NAME folder.
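As an illustration, the hyperparameter block could look roughly like the following; the names and values below are assumptions, so adjust them to match your own setup:

```python
NUM_EPOCHS = 200        # total number of training epochs
BATCH_SIZE = 8          # images per training batch
DATASET_NAME = "paris"  # folder containing the training images
IMG_SIZE = 128          # input image resolution (IMG_SIZE x IMG_SIZE)
MASK_SIZE = 64          # side length of the square mask cut out of each image
SAMPLE_INTERVAL = 500   # save sample outputs every this many batches
```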
From Line 27 to Line 29, we create an output folder where the inpainted sample outputs will be stored. This lets us verify how the generated inpainted samples improve with every iteration or epoch; you will see that the results stored in this folder get better at higher epochs.
From Line 31 to Line 35, we check whether the system has CUDA enabled. If CUDA is available on the training machine, the code prints "CUDA", otherwise "CPU". It is advisable to run this code in a CUDA-enabled environment because of its computational cost: even a single neural network is slow to train on a CPU when the dataset is large, and in our case two networks are trained in parallel, which makes training even more demanding. No worries if you do not have a GPU; there are free GPU environments available, Google Colab being one of them.
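A minimal sketch of such a device check, with illustrative variable names:

```python
cuda_available = torch.cuda.is_available()
device = torch.device("cuda" if cuda_available else "cpu")
print("CUDA" if cuda_available else "CPU")
```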
From Line 38 to Line 44, we create a function "weights_init" to initialize the weight and bias values of our model. The chosen initial values are the ones with which most researchers have reported good, improved results.
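The snippet below shows a typical DCGAN-style initializer of this kind; the exact constants in the repository may differ:

```python
def weights_init(m):
    # draw convolution weights from N(0, 0.02) and batch-norm weights from N(1, 0.02)
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm") != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0.0)
```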
From Line 46 to Line 50, we initialize the loss functions for our model. We use the L1 loss and the mean squared error loss, which have given good results in reconstructing the inpainted output for our dataset, but you are welcome to experiment with other loss functions; inpainting is still an open research area. Finally, the loss function variables are converted to CUDA-supported variables.
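A sketch of that loss setup, assuming the MSE loss acts as the adversarial criterion and L1 as the pixel-wise reconstruction criterion:

```python
adversarial_loss = nn.MSELoss()  # criterion for the discriminator's real/fake decision
pixelwise_loss = nn.L1Loss()     # criterion for how close the output is to the original

if cuda_available:
    adversarial_loss = adversarial_loss.cuda()
    pixelwise_loss = pixelwise_loss.cuda()
```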
Now let us construct the Generator and Discriminator modules, which are the heart of our inpainting algorithm.
From Line 52 to Line 90, we build the Generator model for our inpainting solution. Our Generator has both down-sampling and up-sampling layers. Lines 58 to 71 construct the down-sampling path, where convolution operations extract the important features from the input sample. Line 72 is the latent space vector, which holds the high-level feature information about the input sample. Lines 73 to 86 construct the up-sampling path, where deconvolution (transposed convolution) operations generate samples from the high-level information in the latent vector. We use the LeakyReLU activation function for the down-sampling layers and the ReLU activation function for the up-sampling layers.
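The repository's Generator has more layers than will fit here, but a compressed sketch of the same encoder-decoder idea, with illustrative channel widths, looks like this:

```python
class Generator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.model = nn.Sequential(
            # down-sampling: convolutions extract features from the masked input
            nn.Conv2d(channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            # bottleneck: a compact latent representation of the image context
            nn.Conv2d(128, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # up-sampling: transposed convolutions rebuild the full-resolution image
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, x):
        return self.model(x)
```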
From Line 93 to Line 113, we construct the Discriminator model for our inpainting solution, using the LeakyReLU activation function. This Discriminator setup is a proven choice for inpainting, but you are welcome to play with the layers and other parameters in your own research. This is the model that decides whether a generated sample is real or fake.
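A correspondingly compressed Discriminator sketch, again with illustrative layer sizes, could look like this:

```python
class Discriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(channels, 64, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            # final convolution produces a map of real/fake scores
            nn.Conv2d(256, 1, 4, stride=1, padding=1),
        )

    def forward(self, img):
        return self.model(img)
```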
From Line 115 to Line 121, we instantiate the Generator and Discriminator models. On Lines 124 to 125, we call the weight-initialization function created earlier and apply it to both models to set their initial weight and bias values.
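In code, that step is roughly:

```python
generator = Generator().to(device)
discriminator = Discriminator().to(device)

# apply the custom weight initialization to every layer of both networks
generator.apply(weights_init)
discriminator.apply(weights_init)
```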
From Line 128 to Line 179, we create a class called "InputDatasetLoader", which loads and processes the input image dataset. Lines 128 to 140 define the "__init__" method, which initializes all the required variables when an object of the class is created; it acts as the constructor, so there is no need to call it separately. Lines 142 to 151 define "random_regular_mask", which creates a random square mask for the input image based on the mask size given in the hyperparameters. Lines 154 to 163 define a similar function, "center_mask", which places a mask of the same hyperparameter-defined size at the center of the image. Lines 165 to 176 define "__getitem__", which reads an input image, performs the masking operation by calling the corresponding mask function, and returns the masked image.
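An abbreviated sketch of such a dataset class follows. The class and method names match the text, but the implementation details are assumptions; only the random square mask is shown for brevity (the center mask works the same way with a fixed position):

```python
class InputDatasetLoader(Dataset):
    def __init__(self, folder, img_size=IMG_SIZE, mask_size=MASK_SIZE):
        self.files = sorted(os.path.join(folder, f) for f in os.listdir(folder))
        self.mask_size = mask_size
        self.transform = transforms.Compose([
            transforms.Resize((img_size, img_size)),
            transforms.ToTensor(),
            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
        ])

    def random_regular_mask(self, img):
        # choose a random top-left corner and paint a square of white pixels
        _, h, w = img.shape
        y = np.random.randint(0, h - self.mask_size)
        x = np.random.randint(0, w - self.mask_size)
        masked = img.clone()
        masked[:, y:y + self.mask_size, x:x + self.mask_size] = 1.0
        return masked

    def __getitem__(self, index):
        img = self.transform(Image.open(self.files[index]).convert("RGB"))
        return self.random_regular_mask(img), img

    def __len__(self):
        return len(self.files)
```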
From Line 181 to Line 204, we initialize DataLoaders for both the training and testing image data by specifying the input dataset paths. The length of each DataLoader can also be verified.
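A minimal DataLoader setup along these lines:

```python
train_loader = DataLoader(
    InputDatasetLoader(DATASET_NAME),
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=4,
)
print(f"Number of training batches: {len(train_loader)}")
```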
From Line 206 to Line 210, we initialize the optimizers: we use the Adam optimizer for both the generator and the discriminator. From Line 212 to Line 225, we create the function "store_sample_image" to save the generated sample outputs to disk at regular sample intervals.
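A sketch of the optimizers and the sample-saving helper; the learning rate and beta values below are the usual GAN defaults and not necessarily the ones used in the repository:

```python
optimizer_G = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
optimizer_D = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))

os.makedirs("output", exist_ok=True)

def store_sample_image(masked, inpainted, original, batches_done):
    # stack the masked input, the generated output and the ground truth for comparison
    grid = torch.cat((masked.data, inpainted.data, original.data), dim=-2)
    save_image(grid, f"output/{batches_done}.png", nrow=4, normalize=True)
```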
From Line 228 to Line 290, we train both the generator and the discriminator for the specified number of epochs. The loss is calculated for both models in each epoch, and each model adjusts its weights based on the backpropagated error. We can visualize the generated inpainting samples at every "SAMPLE_INTERVAL" by saving them to disk and verify that the model produces better inpainting results at higher iterations. Finally, we save the trained model as a checkpoint, which we can later use to inpaint new images.
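A condensed version of that training loop, under the same assumptions as the sketches above; the 100x weighting of the reconstruction term and the checkpoint file name are illustrative choices, not values taken from the repository:

```python
for epoch in range(NUM_EPOCHS):
    for i, (masked_imgs, real_imgs) in enumerate(train_loader):
        masked_imgs = masked_imgs.to(device)
        real_imgs = real_imgs.to(device)

        # label tensors shaped like the discriminator output
        out_shape = discriminator(real_imgs).shape
        valid = torch.ones(out_shape, device=device)
        fake = torch.zeros(out_shape, device=device)

        # train the generator: fool the discriminator while staying close to the original
        optimizer_G.zero_grad()
        gen_imgs = generator(masked_imgs)
        g_loss = adversarial_loss(discriminator(gen_imgs), valid) \
                 + 100 * pixelwise_loss(gen_imgs, real_imgs)
        g_loss.backward()
        optimizer_G.step()

        # train the discriminator: tell real images apart from generated ones
        optimizer_D.zero_grad()
        d_loss = 0.5 * (adversarial_loss(discriminator(real_imgs), valid)
                        + adversarial_loss(discriminator(gen_imgs.detach()), fake))
        d_loss.backward()
        optimizer_D.step()

        batches_done = epoch * len(train_loader) + i
        if batches_done % SAMPLE_INTERVAL == 0:
            store_sample_image(masked_imgs, gen_imgs, real_imgs, batches_done)

# save a checkpoint so the trained generator can be reused for inference later
torch.save({"generator": generator.state_dict()}, "inpainting_checkpoint.pth")
```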
Now we have seen how to perform model training for an image inpainting algorithm from scratch. Let us now see how to test and apply our trained model to reconstruct a damaged image.
From Line 292 to Line 307, we initialize the model and the optimizer, load the trained checkpoint, and put the model into evaluation mode. A transform function is also initialized for the input test image.
From Line 309 to Line 313, we load our test image, named 'Damaged_Image.JPG', and apply the transform to it. From Line 315 to Line 323, we apply the regular mask with white pixels to the test image.
From Line 325 to Line 336, we run the test image through the trained model, which outputs the resulting inpainted image. Finally, we store the result as "Inpainted_Output.png" on disk.
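Put together, the test-time steps can be sketched as follows. The checkpoint name matches the training sketch above, the image file names follow the text, and the mask is placed at the image centre purely for illustration:

```python
# rebuild the generator and load the trained weights
checkpoint = torch.load("inpainting_checkpoint.pth", map_location=device)
generator = Generator().to(device)
generator.load_state_dict(checkpoint["generator"])
generator.eval()

transform = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),
])

img = transform(Image.open("Damaged_Image.JPG").convert("RGB")).unsqueeze(0).to(device)

# paint a white square over the damaged region
y = x = (IMG_SIZE - MASK_SIZE) // 2
img[:, :, y:y + MASK_SIZE, x:x + MASK_SIZE] = 1.0

with torch.no_grad():
    inpainted = generator(img)
save_image(inpainted, "Inpainted_Output.png", normalize=True)
```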
Here are some of the results from our trained model.
These output results are from the 200th epoch; with a higher number of epochs you should be able to obtain even better results. For the Paris Street View dataset, which consists of 14,900 training images, training for 200 epochs took around 3 hours on an 8GB NVIDIA GTX 1080 GPU with 16GB RAM. Inference on a 128x128 test image takes approximately 3 seconds.
The Paris Street View dataset was created by the authors of "Context Encoders" and is not publicly available. If you need the dataset, please email the author, Deepak Pathak.
Congrats!!! You have successfully built an image inpainting algorithm. Now enjoy the inpainting results obtained from the model you trained above.
4. Applications and Theory behind Image Inpainting:
The applications of image inpainting can be broadly categorized into three main groups:
1. Removing any unwanted text overlay or minor scratches in the image
2. Removing any unwanted objects from the captured image
3. Restoring a heavily damaged photograph with a realistic completion
When a human sees a damaged picture, the retinal sensory organs pass the information from the visible region of the image to the brain, and the brain tries to recreate what could plausibly be present in the damaged portion. How can the brain do this? Because it has been recording visual information about our surroundings throughout our lives, learning about the environment from day one. So when it sees a partially damaged photo, it can recreate the missing data from previously learned information. Does our image inpainting algorithm share this characteristic of the human brain, recreating the image realistically? Let us dive into the details.
Traditional image inpainting methods are mainly patch-based or diffusion-based. They normally reconstruct the damaged part of the image from the pixels around its boundary; some methods copy and paste patches from regions near the damaged area. These approaches work well when the damaged region is very small, since they do not distinguish between foreground and background and simply use data from neighboring pixels. In other words, they do well when there is a small scratch in the image and the background is uniform. But if the damage is large, say nearly half of the image information is lost, these traditional methods are not a good choice.
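If you want to try one of these classical approaches yourself, OpenCV ships with both a diffusion-based (Navier-Stokes) and a fast-marching implementation; the file names below are placeholders:

```python
import cv2

img = cv2.imread("scratched_photo.png")
# non-zero pixels in the mask mark the damaged region to be filled
mask = cv2.imread("scratch_mask.png", cv2.IMREAD_GRAYSCALE)

# cv2.INPAINT_NS is diffusion-based; cv2.INPAINT_TELEA uses the fast marching method
restored = cv2.inpaint(img, mask, inpaintRadius=3, flags=cv2.INPAINT_NS)
cv2.imwrite("restored_photo.png", restored)
```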
Image inpainting has improved a lot recently, and researchers have shown that modern algorithms can behave much like the human brain, reconstructing the damaged portion of an image in a realistic manner even when the missing region is large. This is largely due to the advancement of the Generative Adversarial Network, popularly called GAN, a deep learning approach that tries to imitate these characteristics of the human brain. Before diving into the workflow of the GAN algorithm, let us look at some real example outputs of a GAN that recreates a damaged image much like our brain would.
The left image is the original image with scratches on it; the damage is somewhat heavier near the face than elsewhere. The center image is the mask, in which white pixels tell the algorithm which region to reconstruct. The right image is the final reconstructed output produced by our deep-learning-based GAN algorithm. Nice, isn't it? The reconstructed image looks as if a professional artist had redrawn it. But hold on, is reconstructing a small scratch all that a GAN is capable of? Not really. A GAN can remove text overlaid on an image, remove unwanted objects in a realistic way, recreate heavily damaged regions plausibly, and much more. Let us see some examples below.
The image above shows how text overlaid on the original image is removed and the affected region is reconstructed realistically in the output.
The example above shows how the result is reconstructed realistically even though a major portion of the image was damaged.
The example above shows how an unwanted object can be removed from an image in a realistic way without affecting the foreground or background. Now let us dive into a high-level theory of GANs: how they are able to perform these operations and how they differ from other neural network methods.
5. Generative Adversarial Network:
The Generative Adversarial Network, or GAN, is a type of network that can create images that never existed in the real world, whereas most other neural networks only classify or predict a target. A GAN can generate not only realistic images but also video, cartoon characters, emojis, image-to-image translations, new human poses, denoised or super-resolved images, 3D models, and more. The images generated by a GAN can also be used as additional augmented data for training deep learning models. So the question is: how does a GAN work?
A GAN generally has two major components: the Generator and the Discriminator. The Generator plays the toughest role: it tries to generate images. The Discriminator inspects each generated image and decides whether it resembles the images it has seen before; in other words, it judges whether the Generator's output is real or fake. Based on the Discriminator's verdict, the Generator adjusts its behavior. We train both networks from scratch: it is not the case that the Discriminator already knows everything about the real world before the Generator starts. Both are trained in parallel and learn about real and fake objects from zero knowledge. If the Discriminator were already well trained on real-world images, it would give the Generator no chance in the early stages; the Generator could never fool a Discriminator that already knows exactly what real objects look like. For this reason, we train both the Generator and the Discriminator from scratch.
Say the Generator is trying to generate the face of a person and hands the result to the Discriminator. The Discriminator studies real face images and learns what they look like: a face has components such as eyes, ears, a nose, and lips. It then compares the Generator's output with real faces and estimates the probability that it is real. It feeds back to the Generator that, for instance, the generated image is missing eyes and a nose. With this feedback, the Generator adjusts itself and tries to do better next time. In this way the Generator learns what real images look like and becomes better at fooling the Discriminator with newly generated images of things that do not actually exist. At the same time, the Discriminator learns what fake images look like and how they differ from real ones, becoming more robust and adjusting its weights for separating real from fake. It may look as if the Generator and Discriminator are working against each other, but they are actually teaching each other at the same time. Essentially, a GAN learns about our world much like the human brain does and creates objects that look real but never actually existed. With this thought in mind, let us contrast what we expected with how it actually behaves.
Let us imagine this with the famous Tom and Jerry cartoon, which hopefully everyone knows well. Tom, the cat, constantly monitors and fights with Jerry, the mouse. Consider Tom to be the Discriminator and Jerry to be the Generator. Jerry always tries to fool Tom by stealing the cheese, and Tom always figures out where the cheese is and chases Jerry out of the house. In the same way, we imagine that the Generator and Discriminator fight each other, and that was our expectation too.
In reality, however, the Generator and Discriminator help each other: they learn about the real world together by sharing information at the same time. Let us visualize how this looks.
The Tom and Jerry illustration is used only to explain the concept behind GANs. We do not hold any rights to the Tom and Jerry characters; if any copyright owner objects, the image will be removed.
Having described the operation of a GAN in a subjective way, let us now put it in slightly more technical terms. Initially the Generator takes a random noise signal as input, since it does not yet know what real images look like, and produces a sample that it passes to the Discriminator. The Discriminator receives both the generated samples and the original images as input and outputs a probability for each being real. From these probabilities it calculates an error, backpropagates it through its network, and updates its weights. The Generator takes the Discriminator's probability for the generated samples, calculates how far they are from being judged real, backpropagates this error through its own network, and updates its weights so it can fool the Discriminator again. This process repeats iteratively until the Generator produces good images and the Discriminator, having learned more about fake images, becomes better and better at judging real versus fake.
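In compact form, this alternating procedure optimizes the classic GAN minimax objective, where the Discriminator D tries to maximize the expression and the Generator G tries to minimize it:

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$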
So now we have a clear picture of how a GAN operates and how it can generate new, realistic images. What next? We are going to see how GANs are used in image inpainting and how they reconstruct a damaged image efficiently. We will look at a few recently developed GAN-based inpainting algorithms and how their architectures vary to achieve the desired result.
Besides the partially damaged input image, the additional input we give the GAN architecture for image inpainting is the mask image. The mask tells the algorithm in which region of the image the reconstruction should be performed, so the algorithm concentrates only on the masked region instead of the entire image. Using the knowledge gained from real-world images, the algorithm then reconstructs the missing content based on the visible region of the damaged input. Remember, though, that the algorithm can not only reconstruct damaged regions but also remove unwanted objects from the input image in a realistic manner.
Let us analyze some recent research papers that deliver good reconstruction results and mark a remarkable turning point in image inpainting by providing realistic outputs.
6. Different Image Inpainting Algorithms:
In this section we discuss, at a high level, some popular image inpainting algorithms that yield realistic results. Let us dive in.
6.1. Context Encoder:
Context Encoder is a well-known inpainting technique that follows the GAN framework, with an autoencoder architecture in the Generator and an adversarial Discriminator. The general architecture of the Context Encoder is shown below.
In general, an autoencoder encodes the input data into high-level feature information about the input image and then decodes that information, attempting to generate images that resemble the input. The encoded data is stored in the latent space vector; although it does not contain every detail of the input image, it captures a good idea of what the image looks like. From this latent space vector, the decoder reconstructs an output image similar to the input. The loss function plays an important role in measuring the error between the original and the generated samples; mean squared error and cross-entropy are two popular choices. For our image inpainting use case, the network tries to reconstruct the damaged portion of the input image.
Here the encoder performs two-dimensional convolution operations, while the decoder performs two-dimensional deconvolution operations. Convolution downscales the image while extracting features, and deconvolution upscales the data back to image resolution. The middle layer between the encoder and the decoder is called the bottleneck layer, and it represents the encoded context of the image; that is why the method is called a Context Encoder.
The Discriminator receives the generated samples from the decoder output as well as the real patch region, and decides whether each is real or fake. Since only the damaged patch region, rather than the full image, is given to the Discriminator, training is less complex and runs faster.
6.2. Pluralistic Image Completion:
Pluralistic Image Completion is one of the better recent approaches among image inpainting techniques. Most inpainting solutions produce only one result for a given masked image, but there may be several different, equally realistic ways to fill the damaged portion. The Pluralistic Image Completion algorithm demonstrates such diverse plausible solutions for a single mask. Its general architecture is shown below.
The authors use a novel probabilistic framework with two parallel pipelines: a reconstructive path and a generative path. The reconstructive path uses the given ground-truth image to obtain a prior distribution over the missing region and rebuilds the original image from this distribution. For the generative path, the conditional prior is coupled to the distribution obtained in the reconstructive path. Both paths are supported by a GAN.
The authors also introduce a novel short + long term attention layer that exploits distant relations among decoder and encoder features, improving appearance consistency. Where other inpainting algorithms tend to produce somewhat blurry reconstructions, this method gives visually consistent output.
The reconstructive pipeline combines information from the masked region and its complement, and is used only during training. The generative pipeline infers the conditional distribution of the hidden regions, which can be sampled at test time; this inference is significantly less accurate than the one in the reconstructive path. Both networks share identical weights.
The authors named their model PICNet. PICNet can produce several diverse, plausible results for a single masked image, reconstructing the output with different realistic solutions. One example of reconstructing a damaged region with several different results is shown below.
There are even more image inpainting algorithms that can produce realistic images, and you should definitely explore them to better understand how researchers keep finding different ways to produce good solutions.
7. Limitations of GAN Inpainting:
GANs have become very popular recently, and researchers are working hard to harness their power in different applications. Even though GANs have given powerful results in some cases, they are still an open research area and people keep looking for ways to improve them. Even the evaluation of GANs is a hot research topic, with different metrics being proposed; unlike object detection and classification problems, there is no single standard metric for comparing generated samples against the originals. Also, since two neural networks are trained at the same time, model training takes longer than for other neural network architectures.
GAN-based image inpainting is likewise an active research topic, with the goal of reconstructing images the way the human brain would. Even though it performs well in many scenarios, it also fails in others. One example of a possible failure case is shown below.
In the figure above we can see that even though the person is completely masked out of the original image, the algorithm still produces a face in the final result. But no worries!!! There is still a long way to go!!!
8. Conclusion:
Image inpainting is a powerful technique that can recreate damaged pixels in an image in a realistic manner. It can also remove unwanted objects from an image in a visually pleasing way without disturbing the surroundings. A good amount of research is still going on, with researchers from different parts of the world working toward better solutions to the image inpainting problem.
I hope this article gives you a good overview of the image inpainting technique, shows you how to quickly implement an inpainting solution in Python to get your hands dirty, and introduces some popular image inpainting algorithms.
9. References:
[1]: Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron. Deep Learning. MIT Press, 2016.
[2]: Pathak, Deepak, et al. "Context Encoders: Feature Learning by Inpainting." In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2536–2544.
[3]: Zheng, Chuanxia, et al. "Pluralistic Image Completion." In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.