An overview of InterFaceGAN: Editing facial attributes of people using GANs

Sieun Park · Published in CodeX · Aug 24, 2021

How can we edit the semantic attributes of images using GANs? For example, can we change the age or gender of a person while preserving the overall face shape and other attributes? Naively interpolating between the latent vectors of two images yields a smooth transition in facial features, but multiple features are entangled together, and precise control over a single attribute is almost impossible.

Despite the amazing quality of the images GANs can generate, relatively little has been done to understand and manipulate their latent space. Previous work on semantic image editing with GANs involves retraining with carefully designed loss functions, additional attribute labels, or special architectures. Can’t we use existing high-quality image generators to edit given images? As the paper suggests, we must first understand how individual facial features are encoded in the latent space, both theoretically and empirically.

Previously, we reviewed Image2StyleGAN, a method for mapping given images into the StyleGAN latent space, a task often referred to as GAN inversion. Here, we review a method that disentangles and isolates changes in multiple semantics. The paper proposes a pipeline of techniques to disentangle semantic-level face attributes in the latent space and enable precise control over each attribute.

This paper…

  • Analyzes and measures how different semantic attributes are encoded in the latent space.
  • Disentangles these semantic attributes using subspace projection.
  • Proposes a face-editing pipeline that can alter one attribute without affecting others.

Official paper: InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs

Properties

The paper builds on the following properties of the latent space 😨. Don’t be afraid though, because the intuitions are straightforward 😆.

A hyperplane of an n-dimensional space is an (n−1)-dimensional subspace that can separate the original space; e.g., a 2D plane can separate a 3D space, and a 1D line can separate a 2D plane.

The first property states that, given the hyperplane defined by a normal vector n via n^T z = 0, all points z with n^T z > 0 lie on the same side of the hyperplane. Imagine a 2D plane divided by a line.
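As a quick illustration, here is a minimal numpy sketch of this side test; the 512-dimensional latent size matches PGGAN/StyleGAN, but n and z are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.standard_normal(512)
n /= np.linalg.norm(n)        # unit normal vector of the hyperplane n^T z = 0
z = rng.standard_normal(512)  # a latent code sampled from N(0, I)

d = n @ z                     # signed "distance" n^T z to the hyperplane
print("positive side" if d > 0 else "negative side")
```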

Understanding the GAN Latent space

The generator of a GAN can be viewed as a function g: Z → X, where Z is the latent space (typically sampled from a Gaussian distribution) and X is the image space. Consider a semantic space S ⊆ R^m with m semantics and a semantic scoring function f_S: X → S. Intuitively, the semantic score of a latent code z is measured as f_S(g(z)).

The paper suggests that we observe linear changes in the semantics of the images when we linearly interpolate between two latent codes. Suppose there is only one semantic (m = 1), and consider a hyperplane with normal vector n that separates that semantic. We define the “distance” from the hyperplane to a sample z as d(n, z) = n^T z, and expect the semantic score to be proportional to this distance: f(g(z)) = λ d(n, z).

According to property 2, any latent z ~ N(0, I_d) is likely to lie close to a given hyperplane. Therefore, we can model a semantic with the linear subspace spanned by n.

Let’s consider the general case with m > 1 semantics. Let s = [s_1, …, s_m]^T denote the true semantic scores of a generated image, with s ≈ f_S(g(z)) = ΛN^T z, where Λ is a diagonal matrix of the proportionality coefficients λ_1, …, λ_m and N is a matrix whose columns are the m separation boundaries. Since z ~ N(0, I_d), basic statistical rules give the mean and covariance of s as

μ_s = E[ΛN^T z] = 0
Σ_s = E[ΛN^T z z^T N Λ^T] = ΛN^T N Λ^T

We can then conclude that s is itself sampled from a normal distribution, s ~ N(0, Σ_s). Intuitively, for the entries of s to be completely disentangled, Σ_s must be a diagonal matrix. The inner product n_i^T n_j can likewise be used to measure the entanglement between the i-th and j-th semantics.

*n^T denotes the transpose of the vector n.
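To make the statistics concrete, here is a minimal numpy sketch; the dimensions and scaling factors are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 512, 3                  # latent dimension and number of semantics

# Hypothetical unit boundary normals n_1..n_m as columns of N, and a
# diagonal Lambda of the per-semantic coefficients lambda_i.
N = rng.standard_normal((d, m))
N /= np.linalg.norm(N, axis=0)
Lam = np.diag([1.0, 0.5, 2.0])

# For z ~ N(0, I_d), s = Lam N^T z has mean 0 and covariance
# Lam N^T N Lam^T (Lam is diagonal, so Lam^T = Lam).
Sigma_s = Lam @ N.T @ N @ Lam

# Off-diagonal entries, driven by n_i^T n_j, measure pairwise entanglement;
# N^T N close to the identity would mean fully disentangled semantics.
print(np.round(N.T @ N, 3))
```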

Conditional manipulation

Suppose we have found the decision boundary n of a certain semantic. We edit the original latent code z with z_edit = z + αn. When multiple semantics are entangled, editing one semantic can affect others: moving a point along the direction n1 will not only change attribute 1 but also the distance to attribute 2’s boundary. To counteract this, the paper applies subspace projection so that N^T N becomes a diagonal matrix, making the semantics independent of each other.

Consider two hyperplanes with normal vectors n1 and n2. The projected direction n1 − (n1^T n2) n2 changes attribute 1 without affecting attribute 2. With more than two attributes, we subtract from the primal direction n1 its projection onto the subspace spanned by all conditioned directions.
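Here is a minimal numpy sketch of conditional manipulation with this projection; the vectors and step size are random stand-ins, not values from the paper:

```python
import numpy as np

def conditional_direction(n1: np.ndarray, n2: np.ndarray) -> np.ndarray:
    """Project the primal direction n1 onto the complement of n2.

    Moving along the returned direction changes attribute 1 while keeping
    the distance to attribute 2's hyperplane (and thus its score) fixed.
    Assumes n1 and n2 are unit vectors.
    """
    return n1 - (n1 @ n2) * n2

rng = np.random.default_rng(0)
n1 = rng.standard_normal(512); n1 /= np.linalg.norm(n1)
n2 = rng.standard_normal(512); n2 /= np.linalg.norm(n2)
z = rng.standard_normal(512)

direction = conditional_direction(n1, n2)
alpha = 3.0                        # illustrative step size along the semantic
z_edit = z + alpha * direction

# ~0: the edit leaves attribute 2's signed distance unchanged.
print(np.dot(n2, z_edit) - np.dot(n2, z))
```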

Finding semantic boundaries

How can we find the semantic boundaries of facial attributes in the latent space? We train a linear SVM to predict a binary semantic (e.g. male vs. female faces) from latent codes. The linear model itself defines a hyperplane in the latent space, and the normal vector n can be read off from the model’s weights.

Labels for facial attributes of the latent codes are assigned by an auxiliary classifier trained on CelebA attributes. From 500K synthetic images, the 10K images with the highest confidence for each label (e.g. 10K male and 10K female images) are sampled as the training and validation sets. This process is elaborated in Section 3.3, “Implementation Details,” of the paper.
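A minimal sketch of the boundary-fitting step with scikit-learn, assuming we already have sampled latent codes and attribute labels; the random stand-ins below replace the real 500K-image labeling pipeline:

```python
import numpy as np
from sklearn import svm

rng = np.random.default_rng(0)
Z = rng.standard_normal((1000, 512))                # stand-in latent codes
y = (Z @ rng.standard_normal(512) > 0).astype(int)  # stand-in attribute labels

clf = svm.LinearSVC()
clf.fit(Z, y)

# The learned hyperplane's normal vector is the semantic direction;
# normalize it to unit length before using it for editing.
n = clf.coef_.ravel()
n /= np.linalg.norm(n)
```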

Manipulating real images

How do we apply the learned semantics to a given image for editing? This is implemented via two approaches: GAN inversion, and further training on synthetic data.

GAN inversion maps the target face back to a latent code. This can be challenging because GANs don’t capture the complete image distribution, so there is often a loss of information. We discussed GAN inversion in general, and a powerful method for it, in a previous post.

Of the two families of GAN inversion methods, the paper uses LIA as the baseline for encoder-based inversion, and searches the W+ space for optimization-based inversion, as suggested by Image2StyleGAN and other papers. Optimization-based methods perform better but are much slower than encoder-based approaches.
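For intuition, here is a heavily simplified sketch of optimization-based inversion; the toy generator and MSE-only loss are placeholders (real pipelines use a pretrained StyleGAN, perceptual losses, and the W+ space):

```python
import torch

G = torch.nn.Sequential(torch.nn.Linear(512, 3 * 64 * 64))  # toy stand-in generator
target = torch.rand(3 * 64 * 64)                            # the image to invert

w = torch.zeros(512, requires_grad=True)                    # latent code to optimize
opt = torch.optim.Adam([w], lr=0.01)

for step in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(G(w), target)       # reconstruction loss
    loss.backward()
    opt.step()
```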

GAN-inversion based approaches

Another approach is to train an additional model on a synthetically generated paired dataset created with a learned InterFaceGAN, which can generate unlimited high-quality paired data. The idea is to train an image-to-image translation model such as pix2pixHD on the generated pairs. To implement continuous manipulation, the translation model first learns an identity mapping network, and a copy of the network is fine-tuned for the attribute translation. At inference, we interpolate between the weights of the identity model and the fine-tuned model, as sketched below.
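The weight interpolation itself is simple. Below is a minimal sketch, assuming two translation models with identical architectures; the model names and toy modules are illustrative:

```python
import copy
import torch

def interpolate_weights(identity_model, edit_model, alpha: float):
    """Blend parameters: alpha=0 reproduces the input image, alpha=1 applies the full edit."""
    blended = copy.deepcopy(identity_model)
    edit_state = edit_model.state_dict()
    state = {k: (1 - alpha) * v + alpha * edit_state[k]
             for k, v in identity_model.state_dict().items()}
    blended.load_state_dict(state)
    return blended

# Toy usage with stand-in modules; any two models with matching
# architectures (e.g. pix2pixHD generators) work the same way.
identity_model = torch.nn.Linear(8, 8)
edit_model = torch.nn.Linear(8, 8)
half_edit = interpolate_weights(identity_model, edit_model, alpha=0.5)
```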

This approach offers much faster inference and fully preserves details, since it removes the need for reconstruction. However, due to the inherent limitations of pix2pixHD, the model can’t learn large movements such as changes in pose or smile, so its applications are limited.

Results & Experiments

Many assumptions are made in the formulation above. Are the semantics really represented linearly in the latent space? Can a linear model properly learn the semantic boundary? Is each semantic subspace really independent of the others? Can subspace projection practically disentangle complicated semantics? Can the learned semantic boundaries generalize to real-world images? These questions are assessed in this section.

First, is the latent space separable by a linear boundary, i.e. a hyperplane? The figure below shows the classification performance of the linear SVM. The linear boundaries achieve ~95% accuracy on the validation set for both the PGGAN (Progressive GAN) latent space and the StyleGAN W space. This suggests that, for binary semantics, an approximately separating hyperplane does exist.

Across the experiments, the paper observes that interpolation in the W space works better than in the Z space. The accuracy on the entire sampled set is lower because it includes images whose semantics are less pronounced.

The figure below interpolates latent codes directly along the normal vector, without disentanglement. We can clearly see that each attribute is correctly applied and removed. This further demonstrates that the latent space is linearly separable and that InterFaceGAN finds the separating hyperplane successfully.

Next, do these semantic boundaries generalize to editing real images? Looking at the figure below, the results are mind-blowing. I was especially surprised by how the model learned to draw “strong” and “weak” eyeglasses. The GAN really does seem to learn interpretable semantics in its latent space.

They observe that when the latent code is moved too far from the boundary, beyond the natural latent distribution, changes in other attributes creep in, as in the example in the figure below.

Are the semantics really disentangled by the proposed method? To assess the correlation between semantics, the paper proposes several disentanglement metrics.

First, a predictor trained to predict attributes is used to measure the correlation coefficient between two attributes in real data, quantifying their entanglement. The same correlation coefficient is then computed for synthesized data and compared against the real data. Finally, the cosine similarity between the normal vectors of the latent boundaries can be used. Intuitively, each of these measures captures (dis)entanglement.
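The boundary-similarity metric in particular is a one-liner. A minimal sketch, with random stand-in boundary vectors:

```python
import numpy as np

def boundary_cosine(n_i: np.ndarray, n_j: np.ndarray) -> float:
    """Cosine similarity between two boundary normals; values near 0
    indicate the two attributes are disentangled in the latent space."""
    return float(n_i @ n_j / (np.linalg.norm(n_i) * np.linalg.norm(n_j)))

rng = np.random.default_rng(0)
n_age, n_glasses = rng.standard_normal(512), rng.standard_normal(512)
print(boundary_cosine(n_age, n_glasses))
```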

In practice, these metrics all show similar results. In the tables below, we observe high entanglement between certain attributes, such as age, gender, and eyeglasses. We also observe differences in entanglement between PGGAN trained on CelebA-HQ and StyleGAN trained on FFHQ; for example, the entanglement between smile and gender reflects a bias specific to FFHQ. Another interesting observation is that the W space is significantly more disentangled than the Z space. Even so, entanglement in the Z space can be mitigated through the proposed conditional manipulation, as illustrated in the last row of the figure below.

Disentanglement analysis on PGGAN, CelebA-HQ
Disentanglement analysis on StyleGAN, FFHQ

Conclusion

The paper analyzes the GAN latent space under the assumption that semantics change linearly. Under this assumption, each semantic score follows a normal distribution determined by the normal vector of a separating hyperplane. By finding a hyperplane that linearly separates the latent space according to a semantic attribute, modeled with a linear SVM from latent codes to semantic labels, we obtain the “direction” of that attribute in the latent space. These directions are then disentangled using subspace projection.

The insights in this paper were very interesting for understanding how neural networks represent human faces. I was especially stunned by how well simple projection disentangles the various semantics.

