Face Swapping Using Face and Hair Representation

NeuroHive
Apr 22, 2018

The human face has been a key means of recognizing individuals from ancient times to the present. People sharing photos on social media usually include human faces, and face-related research has long been in demand: a large number of studies address face recognition, face manipulation, and related tasks. Face swapping is one such technique, with applications including photomontage, virtual hairstyle fitting, privacy protection, and data augmentation for machine learning. Before neural networks came into the limelight, hand-crafted methods were used to detect and swap faces in images.

One well-known technique for face swapping is based on the 3D morphable model (3DMM). The face geometry and the corresponding texture map are obtained by fitting a 3DMM to each image. The texture maps of the source and target images are then swapped using the estimated UV coordinates, and the replaced texture map is re-rendered under the estimated lighting conditions. This technique can swap faces with different orientations or under different lighting. Its weakness is that it fails when the lighting is hard to model realistically and an illumination-corrected texture is difficult to estimate.
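As a rough illustration of this classical pipeline, here is a Python sketch. The helpers fit_3dmm, sample_texture, and render are hypothetical stand-ins for a real morphable-model library, not an actual API.

```python
import numpy as np

def fit_3dmm(img: np.ndarray):
    """Hypothetical: fit a 3D morphable model to an image, returning
    (geometry, uv_coords, estimated_lighting)."""
    raise NotImplementedError

def sample_texture(img: np.ndarray, uv_coords):
    """Hypothetical: sample the image in UV space to build a texture map."""
    raise NotImplementedError

def render(geometry, texture, lighting, background: np.ndarray):
    """Hypothetical: render the textured geometry under the given lighting
    and composite it onto the background image."""
    raise NotImplementedError

def swap_faces_3dmm(source_img: np.ndarray, target_img: np.ndarray):
    # Fit the morphable model to both images to recover face geometry,
    # UV texture coordinates, and an estimate of the scene lighting.
    _, src_uv, _ = fit_3dmm(source_img)
    tgt_geom, _, tgt_light = fit_3dmm(target_img)

    # Swap textures via the shared UV parameterization: the source face
    # texture is extracted so it can be applied to the target geometry.
    src_texture = sample_texture(source_img, src_uv)

    # Re-render under the target's estimated lighting so the pasted face
    # matches the target image's illumination.
    return render(tgt_geom, src_texture, tgt_light, background=target_img)
```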

Researchers now use deep learning for face swapping, training on large-scale image datasets. FakeApp, a desktop application that uses deep learning for face swapping, requires hundreds of training images per person. Collecting pictures of celebrities is fairly easy, but it is very hard for an individual to collect hundreds of images of their own face. It is therefore impractical to gather many photos and fine-tune a network just to generate a single face-swapped image.

How It Works

This problem has been addressed by combining a generative adversarial network (GAN) with two variational auto-encoders. The network is designed to learn a separate latent space (a space into which observed data are mapped so that similar data points lie close together) for the face region and for the hair region, which are handled separately in image space. The network is called Region-Separative GAN (RSGAN). The two auto-encoders are referred to in the paper as separator networks, and the GAN as the composer network. The architecture is shown in Figure 1.

Figure 1: The RSGAN architecture, with one composer network and two separator networks

The face and hair regions are first encoded into different latent-space representations by the separator networks. The composer network then generates a face image from the obtained latent-space representations so that the original appearance of the input image is reconstructed. However, training only on latent-space representations of real image samples leads to over-fitting.

Let x be a training image and c its corresponding visual attribute vector. Latent-space representations z_xf and z_xh of the face and hair appearances of x are obtained by a face encoder FE_xf and a hair encoder FE_xh. Similarly, the visual attribute vector c is embedded into attribute latent spaces: representations z_cf and z_ch of the face and hair attributes are obtained by the encoders FE_cf and FE_ch. The composer network G generates the reconstructed appearance x' from the latent-space representations produced by these encoders.
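To make the data flow concrete, below is a minimal PyTorch sketch of the reconstruction and swapping paths. The Encoder and Composer classes, layer sizes, and latent dimensions are illustrative assumptions: the actual RSGAN networks are convolutional, the separators are variational auto-encoders with their own decoders, and the image encoders operate on masked face and hair regions rather than the whole image.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an image (or an attribute vector) to a latent code."""
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, z_dim),
        )

    def forward(self, x):
        return self.net(x.flatten(1))

class Composer(nn.Module):
    """Generates a full image from the four concatenated latent codes."""
    def __init__(self, z_dim, out_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4 * z_dim, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.Tanh(),
        )

    def forward(self, z_xf, z_cf, z_xh, z_ch):
        z = torch.cat([z_xf, z_cf, z_xh, z_ch], dim=1)
        return self.net(z)

z_dim, attr_dim = 128, 40
img_dim = 3 * 64 * 64

# Face/hair image encoders and face/hair attribute encoders. For
# simplicity both image encoders here see the whole image; the paper
# encodes the masked face and hair regions separately.
FE_xf, FE_xh = Encoder(img_dim, z_dim), Encoder(img_dim, z_dim)
FE_cf, FE_ch = Encoder(attr_dim, z_dim), Encoder(attr_dim, z_dim)
G = Composer(z_dim, img_dim)

x = torch.randn(8, 3, 64, 64)  # batch of input images
c = torch.randn(8, attr_dim)   # corresponding visual attribute vectors

# Reconstruction: x' = G(z_xf, z_cf, z_xh, z_ch)
x_recon = G(FE_xf(x), FE_cf(c), FE_xh(x), FE_ch(c))

# Face swap: combine the face latent of another image y with the hair
# latent of x.
y = torch.randn(8, 3, 64, 64)
x_swapped = G(FE_xf(y), FE_cf(c), FE_xh(x), FE_ch(c))
```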

These reconstruction processes are formulated as follows.
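In the notation above (reconstructed here from the definitions given; the paper's own equation may differ in detail), the composer combines all four latent codes:

x' = G(z_xf, z_cf, z_xh, z_ch)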

Three reconstruction losses are required per input image: one for the face (x'_f), one for the hair (x'_h), and one for the composed image x'. The auto-encoder losses take the following form.
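A plausible form, consistent with the role of the background mask defined next, is an ℓ1 penalty that down-weights the background error (λ_BG < 1 is an assumed weighting factor; the paper's exact formulation may differ):

L_rec(x, x') = ||(1 − M_BG) ⊙ (x − x')||_1 + λ_BG ||M_BG ⊙ (x − x')||_1

applied in turn to x'_f, x'_h, and x',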

where M_BG is a background mask that takes the value 0 for foreground pixels and 1 for background pixels, and the operator ⊙ denotes per-pixel multiplication. The background mask M_BG is used to train the network to synthesize more detailed appearances in the foreground regions.

The set of separator and composer networks and the two discriminator networks are trained adversarially, as in a standard GAN. The adversarial losses are defined as follows.
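For a discriminator D, a real sample x, and a generated sample x', these follow the standard GAN objective (the paper's exact formulation, e.g. conditioning on the attribute vector c, may differ):

L_adv = E_x[log D(x)] + E_x'[log(1 − D(x'))]

where the discriminators maximize L_adv while the separator and composer networks minimize it; RSGAN uses one such loss per discriminator.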

Results

The proposed system achieves high-quality face swapping, which is the main scope of this study, even for faces with different orientations and in different lighting conditions.

Figure 2: In the top group, random hair appearances are combined with the face region of the input image on the left. In the bottom group, random face appearances are combined with the hair region of an input image.
Figure 3: Face swapping results for different face and hair appearances. The two top rows show the original inputs and their reconstructed appearances obtained by RSGAN. The last three columns show the hair sources used for each person.
Comparisons to state-of-the-art face swapping methods

RSGAN achieves high-quality face swapping even for faces with different orientations and under different lighting conditions. Because it encodes the appearances of face and hair into underlying latent-space representations, the image appearance can also be modified by manipulating those representations directly in the latent spaces. More broadly, the success of the RSGAN architecture and training method suggests that deep generative models can produce classes of images that were never prepared in a training dataset.
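Continuing the hypothetical sketch from earlier, such latent-space editing can be as simple as arithmetic on the codes before composing, e.g. blending two hair appearances:

```python
# Hair editing by latent interpolation, reusing the sketch modules above.
alpha = 0.5  # blend factor between the two hair appearances
z_h_mix = alpha * FE_xh(x) + (1 - alpha) * FE_xh(y)
x_edited = G(FE_xf(x), FE_cf(c), z_h_mix, FE_ch(c))
```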

Muneeb ul Hassan

