Hairstyle Transfer — Semantic Editing of the GAN Latent Code

Azmarie Wang · Published in The Startup · May 24, 2020

Introduction

Recent advances in Generative Adversarial Networks (GANs) have shown impressive results in the quality and resolution of synthesized images, especially in the field of style transfer. Motivated by the recent success of StyleGAN [2], where stochastic variation is incorporated into realistic-looking synthesized images, we propose to focus on one of the most practical variations — hairstyle.

A hairstyle, as an important part of personal appearance, can be expressive of one’s personality and overall style. More often than not, the right hairstyle is only discovered through trial and error. Being able to virtually “try on” a new hairstyle through a computer vision system therefore holds real practical value.

Problem Statement

Input: A human face image

Output: Images with the same face but different hairstyles

Principal Method: Exploring GAN latent space

Exploring GAN Latent Space

The core idea of a GAN is to learn a non-linear mapping from a latent distribution to real data through adversarial training.

Usually, the relationship between the latent space and semantic attributes is unknown. For example, how does the latent code determine the generated hairstyle — bangs, color, and so on? It is also hard to judge whether these attributes are entangled with each other.

Our approach in this project is to explore how one or more hairstyle semantics are encoded in the latent space of trained GAN models, such as PG GAN [1] and StyleGAN [2].

We take advantage of the idea introduced in the InterFaceGAN [3] paper: for any binary semantic, there exists a hyperplane in the latent space serving as the separation boundary. Based on this idea, we managed to disentangle attribute representations through linear transformations.

Background

PG GAN

The first model this work is based on is the Progressive Growing (PG) GAN [1], which comprises a hierarchy of generators and discriminators with increasing resolutions. They capture large-scale structures on coarse levels and details on fine levels.

All generators and discriminators remain trainable during the training phase, and layers are added incrementally as training advances. Skip connections between a layer’s input and its output are used to retain the results of the previous layer.

PG GAN manages to solve some traditional GAN problems.

It is hard for traditional GANs to generate high-resolution images because generated images are easier to tell apart from real ones at higher resolutions. This can produce large gradients that destabilize training and slow convergence. Besides, larger images require smaller batch sizes, which makes training unstable.

The hierarchical structure addresses these problems by training from coarse to fine:

  • Coarser layers are more stable to train since there is less class information and there are fewer modes.
  • To improve the training speed, most iterations are done at lower resolutions. This produces comparable results in a much shorter time.

StyleGAN

Another underlying model that we use is StyleGAN [2].

First of all, StyleGAN converts the input latent code z into an intermediate latent code w in a non-linear manner.

Z refers to the original latent space and reflects the probability density of the training data, which often leads to unavoidable entanglement. W is the intermediate latent space, induced by a learned piecewise continuous mapping from Z.
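For intuition, here is a minimal PyTorch sketch of such a mapping network. StyleGAN’s mapping network is an 8-layer MLP; the module names and shapes below are illustrative, not the official implementation:

```python
import torch
import torch.nn as nn

# Sketch of StyleGAN's mapping network: 8 fully-connected layers that warp
# the entangled input space Z into the intermediate latent space W
# (both 512-dimensional in StyleGAN).
layers = []
for _ in range(8):
    layers += [nn.Linear(512, 512), nn.LeakyReLU(0.2)]
mapping = nn.Sequential(*layers)

z = torch.randn(1, 512)  # latent code sampled from Z
w = mapping(z)           # intermediate latent code in W
```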

In the synthesis network:

  • A denotes learned affine transformations that map the intermediate latent code w to (spatially invariant) styles y.
  • B denotes learned per-channel scaling factors applied to the noise input.
  • AdaIN refers to “adaptive instance normalization”, which first normalizes the features x and then scales and biases them with the styles y.

The synthesis network also follows a hierarchical structure where each convolution layer adjusts the style of the image at a different resolution. This controls the strength of image features at different scales.
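To make the AdaIN step concrete, here is a minimal PyTorch sketch; the tensor shapes and the adain name are our own, not StyleGAN’s code:

```python
import torch

def adain(x, y_scale, y_bias, eps=1e-8):
    # x: feature maps of shape (N, C, H, W)
    # y_scale, y_bias: per-channel styles of shape (N, C), produced by A
    mu = x.mean(dim=(2, 3), keepdim=True)       # per-channel mean
    sigma = x.std(dim=(2, 3), keepdim=True)     # per-channel std
    x_norm = (x - mu) / (sigma + eps)           # instance-normalize x
    return y_scale[:, :, None, None] * x_norm + y_bias[:, :, None, None]
```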

Now, let’s dive into how you could semantically edit your favourite image.

Workflow

The following workflow allows you to take a human image, generate its latent code estimation, and semantically edit it with the hair attributes that you care about. Let the fun begin!

A simple workflow

Step 1 — Latent Code Estimation

To do semantic editing, we first need to find the query image inside the StyleGAN latent space. Now the question is: given an input image, how do we find a latent vector z such that sending z through the generator reproduces the input image?

One way to do this is to optimize the latent code by matching feature vectors that carry the high-level semantic meaning of what’s in the image.

First, we send the input image into a pre-trained Residual Network for an initial latent code estimation in StyleGAN. We then take this estimation and send it to the generator. This will give us an initial guess of the original input image. To this image, we can apply a pre-trained image classifier for feature extraction purposes. Meanwhile, we will do the same feature extraction for the input image.

In the feature space, we then perform gradient descent — minimizing the L2 loss between the feature vectors and updating the latent code estimate (red arrow). Doing gradient descent on semantic feature vectors has an edge over gradient descent on pixel loss, because L2 optimization directly in pixel space easily gets stuck in bad local optima.
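A minimal sketch of this optimization loop in PyTorch, assuming a hypothetical generator wrapper around a pre-trained StyleGAN and VGG16 features as the semantic feature extractor:

```python
import torch
import torchvision.models as models

# Feature extractor: early VGG16 layers capture semantic structure.
vgg = models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def estimate_latent(generator, target_img, z_init, steps=500, lr=0.01):
    z = z_init.clone().requires_grad_(True)     # initial guess, e.g. from a ResNet
    opt = torch.optim.Adam([z], lr=lr)
    target_feat = vgg(target_img)               # features of the input image
    for _ in range(steps):
        opt.zero_grad()
        gen_feat = vgg(generator(z))            # features of the current guess
        loss = torch.nn.functional.mse_loss(gen_feat, target_feat)
        loss.backward()                         # gradient flows back to z
        opt.step()
    return z.detach()
```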

Latent Code Estimate Generation

We can now use this approach to find ANY image inside the StyleGAN latent space. Below are some examples of input images and their latent code representations. Pretty close, right?

Left: Input image; Right: Latent code representation

Step 2 — Semantic Editing with Boundary

Following InterFaceGAN [3], we define “semantic editing” as editing an image with respect to a target attribute only, while preserving all other information as much as possible.

Before we dive into editing, we need to look for specific boundaries that can separate binary attributes in the latent space. Each boundary corresponds to one particular hair attribute.

With respect to this project, here are the hair attributes we are interested in studying:

  • Style: wavy/straight, bangs
  • Color: black/blond/brown/gray
  • Hairline: receding hairline
  • Facial hair: mustache, sideburns

So how do we find the boundaries? We first need to separate the latent space, and InterFaceGAN [3] introduced a robust approach for this purpose.

Assuming that for any binary attribute there exists a hyperplane in the latent space such that all samples on the same side share the same attribute, we can train an independent linear SVM for each attribute. Our job is to find such a hyperplane in the 512-dimensional latent space of StyleGAN.

512-dimensional latent space from StyleGAN

To look for the hyperplane, we need paired data of latent codes and scores for the attribute. One obvious solution is to find face images where the attribute is prominent and manually label them with 0/1 scores. We toyed with this idea and manually labelled 50 images to confirm the feasibility of finding the boundary. Later we decided to use pre-trained classifiers for the hair attributes, trained on a large dataset (CelebA) and provided with StyleGAN.

Hyperplane for “bangs”, separating faces with bangs from the ones without

We used 10 classifiers matching the 10 attributes to generate ~20k latent code/score pairs. With these pairs, we trained independent linear SVMs on the hair attributes mentioned earlier and then evaluated them on the validation set, reaching accuracies of around 80%.
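A sketch of this boundary search with scikit-learn; latents and scores are hypothetical arrays holding the sampled codes and one attribute’s classifier scores, and the confident-sample selection is our own simplification:

```python
import numpy as np
from sklearn.svm import LinearSVC

def find_boundary(latents, scores, top_k=2000):
    # latents: (N, 512) latent codes; scores: (N,) classifier scores
    order = np.argsort(scores)
    neg = latents[order[:top_k]]               # most confident negatives
    pos = latents[order[-top_k:]]              # most confident positives
    X = np.concatenate([neg, pos])
    y = np.concatenate([np.zeros(top_k), np.ones(top_k)])
    clf = LinearSVC(max_iter=10000).fit(X, y)
    normal = clf.coef_[0]
    return normal / np.linalg.norm(normal)     # unit normal of the hyperplane
```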

Pipeline to Generate Boundary for Semantic Editing a Specific Attribute

Putting things together, for each input image, we will first find its specific location in the StyleGAN latent space, and then move it along a specific direction for semantic editing.

With the linear hyperplane for each attribute, we take its normal vector as the direction along which the output faces change continuously with regard to the target attribute. For example, in the figure above, we found the latent code of an image of young Leonardo DiCaprio inside the StyleGAN space, drew a direction orthogonal to the bangs hyperplane, and moved the latent code along that direction. This creates a morphing sequence of DiCaprio with no bangs, fewer bangs, bangs, and more bangs!

Morphing Sequences of Continuous Changes from Fewer to More Bangs
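In code, the edit itself is just a walk along the boundary’s unit normal; generator, z_dicaprio, and bangs_normal below are hypothetical names:

```python
import numpy as np

def edit(z, normal, alphas=(-3.0, -1.5, 0.0, 1.5, 3.0)):
    # Slide the latent code along the unit normal; larger |alpha| means a
    # stronger edit, and alpha = 0 reproduces the original image.
    return [z + a * normal for a in alphas]

# frames = [generator(code) for code in edit(z_dicaprio, bangs_normal)]
```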

Lastly, we want to talk about the conditional boundary, also introduced in InterFaceGAN [3]. More often than not, attributes are coupled with each other. For example, a receding hairline is associated with age, long wavy hair is more frequent on female faces, and facial hair such as a mustache or sideburns is often spotted on male faces only. Thus, it’s crucial to disentangle the target attribute from the attributes that correlate with it.

As shown in the figure above, given two hyperplanes with normal vectors n1 and n2, moving along the projected direction n1 − (n1ᵀn2) n2 changes attribute 1 without affecting attribute 2. This operation is called conditional manipulation, as pointed out in InterFaceGAN [3].
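A one-function sketch of this projection, assuming n1 and n2 are unit normal vectors:

```python
import numpy as np

def condition(n1, n2):
    # Remove from the primal direction n1 its component along the conditioned
    # direction n2, so moving along the result changes attribute 1 while
    # holding attribute 2 fixed. Both inputs are unit vectors.
    projected = n1 - np.dot(n1, n2) * n2
    return projected / np.linalg.norm(projected)
```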

From our experiments, we discovered that receding hairline correlates with the smile attribute: faces edited toward a receding hairline tend to slightly open their mouths and smile. We imagine this is because, in the dataset, people with receding hairlines appear more friendly and smiley. Thus, to produce a receding hairline without the smile, we can subtract from the primal direction (receding hairline) its projection onto the conditioned direction (smile). Below is a result on the face of George Clooney with and without a conditional boundary.

Receding Hairline with/without Conditional Boundary

Findings

Based on our experiments, we realize that many attributes are correlated in the latent space. For example, when editing hair volume, the person becomes older or younger depending on how much hair is on their head.

Not every result we get is perfect. When the workflow fails to generate reasonable output, we can often fix the boundary by adding conditions. If a result has a large distortion, we can find which attribute the distortion most resembles and then apply that attribute as a condition on our primary attribute. For example, when editing facial hair, we can condition on the smile attribute so that the output face won’t have its mouth open.

However, some failure cases are beyond saving, such as heavy distortion, a “vampirized” face, or no result at all.

Failed Cases

In addition, we found that the generative model is possibly biased. Since the dataset used to train the generator consists of real human faces, gender-specific attributes seem to appear only with a specific gender. For example, adding a mustache to a female face makes the person look more masculine, yet produces little to no mustache on the face.

Conclusion

In conclusion, we can edit one attribute of a human face by finding the corresponding hyperplane boundary in the latent space, which generates impressive yet imperfect results. So far, we can set one attribute as a conditional attribute alongside the primary attribute, as discussed in InterFaceGAN [3]. Additionally, when using one attribute to edit a face, other correlated attributes may also change. We believe a better classifier could control more than two conditions at the same time and make the boundaries more explicit. Last but not least, this model can’t generate a female face with male attributes, and vice versa; we think this could be solved by training the generator on a dedicated dataset.

References

[1] Progressive GAN: Progressive Growing of GANs for Improved Quality, Stability, and Variation

[2] StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks

[3] InterFaceGAN: Interpreting the Latent Space of GANs for Semantic Face Editing

[4] StyleGAN Encoder: Converts real images to latent space
