Facial Reconstruction using Conditional Generative Adversarial Network
I recently completed a deep learning project that reconstructs occluded facial images using a Conditional Generative Adversarial Network (CGAN), so I thought I would write about it.
This project is based on the Pix2Pix paper, which explores conditional adversarial networks for image-to-image translation tasks.
Dataset
The primary dataset used in this project is the CelebA dataset. The CelebA dataset is a large-scale face attributes dataset with more than 200K celebrity images, each with 40 attribute annotations. The images in this dataset cover large pose variations and background clutter.
I generated different types of occlusions for 20,000 images from the CelebA dataset: nose masks of different colors and Gaussian blur, covering both partial and full occlusion of the face. The occluded images were then used as input to the Conditional Generative Adversarial Network (CGAN), which is trained to generate the reconstructed images.
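As a rough illustration of this step, below is a minimal sketch of how such occlusions could be generated with OpenCV. The directory paths, mask region, blur kernel size, and the way the occlusion type is picked are all assumptions for illustration, not the exact code used in the project.

```python
# Sketch: generate occluded copies of CelebA faces (assumed paths and sizes).
import os
import random
import cv2

def occlude(image, mode):
    """Return an occluded copy of a 256x256 BGR face image."""
    out = image.copy()
    h, w = out.shape[:2]
    if mode == "nose_mask":
        # Cover the central (nose) region with a random solid color.
        color = tuple(random.randint(0, 255) for _ in range(3))
        cv2.rectangle(out, (w // 3, h // 3), (2 * w // 3, 3 * h // 4), color, -1)
    elif mode == "partial_blur":
        # Gaussian-blur only the lower half of the face (partial occlusion).
        out[h // 2:, :] = cv2.GaussianBlur(out[h // 2:, :], (31, 31), 0)
    elif mode == "full_blur":
        # Blur the whole face (full occlusion).
        out = cv2.GaussianBlur(out, (31, 31), 0)
    return out

if __name__ == "__main__":
    src_dir, dst_dir = "celeba/img_align_celeba", "celeba_occluded"  # assumed paths
    os.makedirs(dst_dir, exist_ok=True)
    for name in sorted(os.listdir(src_dir))[:20000]:
        img = cv2.imread(os.path.join(src_dir, name))
        if img is None:
            continue
        img = cv2.resize(img, (256, 256))
        mode = random.choice(["nose_mask", "partial_blur", "full_blur"])
        cv2.imwrite(os.path.join(dst_dir, name), occlude(img, mode))
```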
Training
The generator in the CGAN is a U-Net, while the discriminator is a PatchGAN. During training, the two models are trained in an adversarial manner: the generator aims to produce reconstructed images that are indistinguishable from the original images, while the discriminator aims to correctly classify real versus generated images. This adversarial training pushes the generator to produce higher-quality reconstructions.
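To make the adversarial setup concrete, here is a minimal sketch of a Pix2Pix-style training step in TensorFlow/Keras. It assumes a `generator` (U-Net) and a `discriminator` (PatchGAN that takes the (occluded, image) pair) have already been built, and the optimizer settings and L1 weight follow the Pix2Pix paper's defaults; this is an illustration under those assumptions, not the exact code from the project.

```python
# Sketch: one Pix2Pix-style training step (assumed models and hyper-parameters).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
gen_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
disc_opt = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
LAMBDA = 100  # weight of the L1 term, as in the Pix2Pix paper

def train_step(occluded, original, generator, discriminator):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        generated = generator(occluded, training=True)

        # The PatchGAN discriminator scores (condition, image) pairs.
        real_out = discriminator([occluded, original], training=True)
        fake_out = discriminator([occluded, generated], training=True)

        # Generator: fool the discriminator and stay close to the ground truth.
        g_gan = bce(tf.ones_like(fake_out), fake_out)
        g_l1 = tf.reduce_mean(tf.abs(original - generated))
        g_loss = g_gan + LAMBDA * g_l1

        # Discriminator: real pairs -> 1, generated pairs -> 0.
        d_loss = bce(tf.ones_like(real_out), real_out) + \
                 bce(tf.zeros_like(fake_out), fake_out)

    gen_opt.apply_gradients(zip(
        g_tape.gradient(g_loss, generator.trainable_variables),
        generator.trainable_variables))
    disc_opt.apply_gradients(zip(
        d_tape.gradient(d_loss, discriminator.trainable_variables),
        discriminator.trainable_variables))
    return g_loss, d_loss
```

The L1 term keeps the reconstruction close to the ground-truth face, while the adversarial term encourages sharp, realistic detail; this combination is what the Pix2Pix paper proposes for image-to-image translation.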
Evaluation
The trained model was evaluated using the structural similarity index (SSIM) and the peak signal-to-noise ratio (PSNR). SSIM measures the structural similarity between two images; its value ranges from -1 to 1, where 1 means the two images are identical. PSNR measures reconstruction quality in decibels; higher values indicate better quality, and identical images give an infinite PSNR. For the reconstructed images, the model achieved an SSIM of 0.853 and a PSNR of 22.33 dB. The SSIM is close to 1, which means the reconstructed images are very similar to the originals, and a PSNR of around 22 dB indicates reasonable reconstruction quality.
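For reference, both metrics can be computed with scikit-image as in the sketch below. The function names come from `skimage.metrics`; the image arrays and how they are loaded or batched are assumptions for illustration.

```python
# Sketch: computing SSIM and PSNR for a pair of images with scikit-image.
# (channel_axis requires scikit-image >= 0.19; older versions use multichannel=True.)
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def evaluate_pair(original, reconstructed):
    """Both inputs: uint8 RGB arrays of the same shape, e.g. (256, 256, 3)."""
    ssim = structural_similarity(original, reconstructed, channel_axis=-1)
    psnr = peak_signal_noise_ratio(original, reconstructed)
    return ssim, psnr

# Example usage on lists of test images (assumed to already be numpy arrays):
# scores = [evaluate_pair(o, r) for o, r in zip(originals, reconstructions)]
# ssim_vals, psnr_vals = zip(*scores)
# print(np.mean(ssim_vals), np.mean(psnr_vals))
```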
Result
The result of the facial reconstruction using the Conditional Generative Adversarial Network (CGAN) is shown below. The first image is the occluded input, the second is the original, and the third is the reconstruction generated by the CGAN model. Although there are some variations in the results, such as facial expression (a person might be smiling in the original image but not in the reconstruction), the model was able to reconstruct the occluded images to a large extent. Training the CGAN on a Kaggle GPU took a long time even for a small number of epochs (around five), and better results would likely come from training for more epochs.
Credit
Pix2Pix paper
Image-to-Image Translation with Conditional Adversarial Networks by Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
Abstract
We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. Indeed, since the release of the pix2pix software associated with this paper, a large number of internet users (many of them artists) have posted their own experiments with our system, further demonstrating its wide applicability and ease of adoption without the need for parameter tweaking. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.
@misc{isola2018imagetoimage,
title={Image-to-Image Translation with Conditional Adversarial Networks},
author={Phillip Isola and Jun-Yan Zhu and Tinghui Zhou and Alexei A. Efros},
year={2018},
eprint={1611.07004},
archivePrefix={arXiv},
primaryClass={cs.CV}
}