Synopsis of MaskGAN: Towards Diverse and Interactive Facial Image Manipulation
MaskGAN lets the end user edit a segmentation mask, which then serves as the condition for interactive face manipulation. The Generator is trained to learn the mapping from segmentation masks to the target manipulated face images. Unlike CycleGAN or Pix2Pix, however, the manipulations are learned as traversals on the mask manifold rather than in pixel space, which leads to more diverse results. The objective function is modified to enforce dual-editing consistency. MaskGAN is built on two components: the Dense Mapping Network (DMN) and Editing Behavior Simulated Training (EBST). The authors also contribute CelebAMask-HQ, a dataset of over 30,000 high-resolution face images with annotated masks. In the following discussion, I break the system down into small components, explain the why and how of each one, and then stitch them together to construct MaskGAN.
Mask Variational AutoEncoder
The Variational AutoEncoder (VAE) is a generative model that predates Generative Adversarial Networks and has a rigorous mathematical foundation. The idea is for the encoder to output the mean and the variance of a distribution over the latent space, rather than a single latent vector. Regularizing this distribution toward a Gaussian yields a smooth, continuous latent space, which can be used to obtain a smooth transition on the manifold between the source mask and the target mask. Once the latent space of the masks has been learned, a smooth style transfer transition can be applied using the Adaptive Instance Normalization (AdaIN) equation. Given the latent representations of the source and target masks, the decoder can reconstruct new source and target masks. These are then used to generate two new face images, one for the source and one for the target, using a Pix2PixHD GAN as the backbone of the Dense Mapping Network (DMN).
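To make the "smooth transition" idea concrete, here is a minimal NumPy sketch of the two VAE ingredients involved: sampling a latent vector from a mean and a log-variance (the reparameterization trick), and walking between a source latent and a target latent. The encoder outputs are replaced by hand-picked stand-in arrays, and `latent_dim`, `reparameterize`, and `interpolate_latents` are hypothetical names chosen for illustration. Linear interpolation is an assumption made for this sketch, not a claim about how MaskGAN itself moves across the mask manifold.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def interpolate_latents(z_src, z_tgt, steps):
    """Linearly interpolate from z_src to z_tgt, endpoints included."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * z_src + t * z_tgt for t in ts])

rng = np.random.default_rng(0)
latent_dim = 8  # hypothetical latent size, chosen only for the example

# Stand-ins for what a trained mask encoder would output for two masks.
mu_src, log_var_src = np.zeros(latent_dim), np.full(latent_dim, -2.0)
mu_tgt, log_var_tgt = np.ones(latent_dim), np.full(latent_dim, -2.0)

z_src = reparameterize(mu_src, log_var_src, rng)
z_tgt = reparameterize(mu_tgt, log_var_tgt, rng)

# Each row is a latent code that a decoder could turn into an intermediate mask.
path = interpolate_latents(z_src, z_tgt, steps=5)
```

In a real system, each row of `path` would be passed through the trained decoder to obtain a mask lying between the source and the target.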
Spatial Feature Transform
SFT-GAN introduced the notion of spatial-aware style information. The idea is to fuse the feature maps of an input face image and its corresponding mask. The result is a pixel-wise and feature-wise transformation that can then be used for style transfer. This is achieved with a Spatial Feature Transform (SFT) Layer, which learns transformation parameters from the feature map of the mask and then applies them to the feature map of the face image through an element-wise product.
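The modulation step can be sketched in a few lines of NumPy. This follows the affine form used in the SFT-GAN formulation, where the mask features predict a per-pixel scale `gamma` and also a per-pixel shift `beta` (the shift is not mentioned in the description above). The 1x1 convolutions, the random weights, and the names `conv1x1`, `sft_layer`, and `params` are assumptions made for illustration, not the layer configuration used in MaskGAN.

```python
import numpy as np

def conv1x1(feat, weight, bias):
    """Per-pixel linear map over channels (a 1x1 convolution). feat: (C, H, W)."""
    return np.einsum("oc,chw->ohw", weight, feat) + bias[:, None, None]

def sft_layer(face_feat, mask_feat, params):
    """Modulate face features with scale and shift predicted from mask features."""
    gamma = conv1x1(mask_feat, params["w_gamma"], params["b_gamma"])
    beta = conv1x1(mask_feat, params["w_beta"], params["b_beta"])
    # Element-wise product with the scale, then element-wise shift.
    return gamma * face_feat + beta

rng = np.random.default_rng(0)
C, H, W = 4, 3, 3  # toy sizes for the example
face_feat = rng.standard_normal((C, H, W))
mask_feat = rng.standard_normal((C, H, W))

# Randomly initialised stand-ins for learned weights.
params = {
    "w_gamma": rng.standard_normal((C, C)) * 0.1,
    "b_gamma": np.ones(C),
    "w_beta": rng.standard_normal((C, C)) * 0.1,
    "b_beta": np.zeros(C),
}

modulated = sft_layer(face_feat, mask_feat, params)
```

Because `gamma` and `beta` have the same shape as the face feature map, every spatial position and every channel gets its own transformation, which is what makes the style information spatial-aware.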
Dense Mapping Network
The Dense Mapping Network is a main component of MaskGAN. Given a segmentation mask, the network generates a face image. Its backbone is a Pix2PixHD Generator. Both the face image and the mask are encoded by a feature map extractor, and the resulting feature maps are fed to the SFT Layer. The spatial-aware style information is then used to generate style parameters with the Adaptive Instance Normalization (AdaIN) equation: AdaIN is applied with the affine parameters output by the SFT Layer in the Residual Blocks of the Generator. The Generator uses this spatial-aware style information to transfer styles such as hairstyle, skin, beards, and eyebrows to the corresponding positions in the target face image.
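For reference, the AdaIN equation normalizes each channel of a feature map to zero mean and unit standard deviation, then rescales and shifts it with style-derived parameters. The sketch below shows that computation on a single feature map in NumPy. The name `adain`, the `eps` constant, and the hand-written `style_mean` and `style_std` vectors are assumptions for illustration; in the DMN these affine parameters would come from the SFT Layer rather than being fixed.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization on one feature map.

    content: (C, H, W) features; style_mean, style_std: (C,) affine parameters.
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)  # per-channel zero mean, unit std
    return style_std[:, None, None] * normalized + style_mean[:, None, None]

rng = np.random.default_rng(0)
content = rng.standard_normal((3, 16, 16)) * 2.0 + 5.0  # toy residual-block features

# Hypothetical style parameters, one pair per channel.
style_mean = np.array([0.0, 1.0, -1.0])
style_std = np.array([1.0, 0.5, 2.0])

stylized = adain(content, style_mean, style_std)
```

After the call, each channel of `stylized` carries the statistics given by the style parameters while keeping the spatial layout of `content`.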
Editing Behavior Simulated Training
Training proceeds in two stages. The first stage trains the DMN; the second stage, Editing Behavior Simulated Training (EBST), uses the DMN as a component. EBST is a sequence of modules that differs from the DMN in that it processes two paths of mask and face image pairs, so this is where the style transfer actually occurs. First, the Mask VAE is applied to the input mask and produces two masks, one for the source and one for the target. During inference, the end user can modify the target mask interactively to control the transfer. The masks are then fed to two DMNs working in parallel, which reconstruct two face images. Finally, a blending network merges the two images. The merge is not a simple concatenation; it is learned by the Alpha Blender Network. The result is the manipulated face image.
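The final merge can be illustrated with a plain alpha blend, in which a per-pixel weight decides how much of each reconstructed face ends up in the output. In the sketch below the weight map `alpha` is a random array standing in for whatever the Alpha Blender Network would predict, and `face_a`, `face_b`, and `alpha_blend` are hypothetical names. Treating the blend as a convex combination of the two images is an assumption of this sketch.

```python
import numpy as np

def alpha_blend(img_a, img_b, alpha):
    """Per-pixel convex combination of two images.

    img_a, img_b: (H, W, 3) arrays in [0, 1]; alpha: (H, W, 1) weights in [0, 1].
    """
    return alpha * img_a + (1.0 - alpha) * img_b

rng = np.random.default_rng(0)

# Stand-ins for the two faces reconstructed by the parallel DMNs.
face_a = rng.random((4, 4, 3))
face_b = rng.random((4, 4, 3))

# Stand-in for the weight map a learned blender would output.
alpha = rng.random((4, 4, 1))

blended = alpha_blend(face_a, face_b, alpha)
```

With `alpha` equal to one everywhere the result is `face_a`, with zero it is `face_b`, and a learned map can favor either image region by region.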