How to edit images with GANs? Controlling the Latent Space of GANs

Sieun Park · Published in CodeX · Aug 17, 2021

Recent work on image generation such as StyleGAN and BigGAN can generate images of truly magnificent quality. While these GANs are very exciting, we must understand more about their latent space to develop interesting applications: we want to be able to control what kinds of images we generate.

In this post, we will review a paper on how to embed a given image into the GAN latent space. To follow along, you will first need to understand GANs and StyleGAN, which we will not explain in this post.

This paper …

  • Studies the GAN latent space and provides insights about it.
  • Proposes embedding into an extended ‘W+’ latent space for StyleGAN.
  • Suggests an embedding algorithm that maps a given image to a latent code.

Original paper: Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?

Embedding new images into the Latent space

The first step toward image manipulation with GANs is being able to map a given image into the latent space. A popular approach is to train an encoder for this mapping: after training the generator with the GAN loss, we freeze the generator (decoder) weights and train an encoder that maps images to latent codes. The generator then synthesizes an image from the predicted latent, and the encoder is trained like an autoencoder by comparing the original image with the generator’s output.

The pipeline of Zi2Zi Chinese character generation, an example of the encoder approach described above.
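Below is a minimal PyTorch-style sketch of this encoder approach. It is not the actual Zi2Zi or StyleGAN code: `G` and `E` are toy stand-ins for a pre-trained generator and a trainable encoder, and only `E` receives gradient updates.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins; in practice G is a pre-trained GAN decoder and E a conv encoder.
latent_dim = 512
G = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())     # latent -> flattened image
E = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, latent_dim))  # image -> latent

for p in G.parameters():
    p.requires_grad = False                      # freeze the generator (decoder) weights

optimizer = torch.optim.Adam(E.parameters(), lr=1e-4)

images = torch.rand(8, 3, 64, 64)                # a dummy batch of training images
latents = E(images)                              # encoder predicts a latent code per image
reconstructions = G(latents).view(-1, 3, 64, 64) # frozen generator decodes the latents
loss = F.mse_loss(reconstructions, images)       # autoencoder-style reconstruction loss
optimizer.zero_grad()
loss.backward()                                  # gradients flow through frozen G into E
optimizer.step()
```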

Another approach is to start from a random point in the latent space and learn the input (latent code) through gradient descent so that the output of the generator matches the given image. The first approach is typically used because it is faster at test time.

The GAN latent space

This paper proposes a more general and stable extension of the second approach: learning the input latent through gradient descent.

Top row: input images. Bottom row: results of embedding the images into the StyleGAN latent space.

Is it even possible to embed arbitrary images into the GAN latent space? The paper experiments with embedding 25 images from 5 categories (faces, cats, dogs, cars, and paintings) into the latent space. While classes such as cars intuitively share no structural similarity with human faces, the embedding algorithm and generator were capable of going beyond human faces. Although there are more defects in the other classes than in the Obama face image, this demonstrates the surprising embedding capability of the generator. Examples are shown in the figure above.

Top row: input images with affine transformations applied. Bottom row: results of embedding the transformed images into the StyleGAN latent space.

How robust is the embedding of face images? Another experiment tested the robustness of the embedding by attempting to embed images after applying affine transformations, as illustrated in the figure above. The resulting embeddings are typically blurrier and lose more detail than embeddings of untransformed images. However, the embedding was surprisingly robust to local defects in the image, like the figure below. These results imply that the embedding is sensitive to spatial transformations, while the embeddings of individual facial features are largely independent of each other.

Top row: input images with defects. Bottom row: results of embedding the defective images into the StyleGAN latent space.

W+ Latent Space

Which space should we embed into?

StyleGAN has multiple latent spaces into which an image could be embedded, and the authors test and compare several of them. The Z space is the initial latent space; the W space is the intermediate latent space produced by the fully connected mapping network. Embedding images directly into these spaces gives poor results, as shown in Figures (c) and (d): an embedding at these early stages doesn’t convey enough information about the given image to the generator. To solve this, the authors propose to exploit the StyleGAN architecture and embed images into an extended latent space.

Left: original StyleGAN, Right: Embedding to W+ applied to StyleGAN

The W+ latent space is a concatenation of 18 different 512-dimensional w vectors, one for the style input of each AdaIN layer in StyleGAN. The paper proposes to manipulate the w input of each AdaIN layer separately. Won’t this interfere with the network too much? With this much freedom, do the network weights even matter? Figures (b) and (e) show images embedded into randomly initialized models: the W+ latent space has enough control over the output to generate a given image even on random models, but the trained convolutions are needed for a high-quality reconstruction, as illustrated in Figure (e).
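To make the difference concrete, here is a small illustrative snippet. The 18 × 512 shape corresponds to the 1024×1024 StyleGAN configuration; the per-layer injection is shown only as commented pseudocode, since it depends on the specific StyleGAN implementation.

```python
import torch

num_layers, dim = 18, 512   # 18 AdaIN-modulated layers, each taking a 512-dim style vector

# Original StyleGAN: a single w vector is broadcast to every layer.
w = torch.randn(dim)
w_broadcast = w.unsqueeze(0).repeat(num_layers, 1)   # shape (18, 512), all rows identical

# Extended W+ space: every layer gets its own independent w vector.
w_plus = torch.randn(num_layers, dim)                # shape (18, 512), 18 * 512 free values

# A generator adapted for W+ would consume one row per AdaIN layer, e.g.:
# for i, layer in enumerate(synthesis_layers):
#     x = layer(x, style=w_plus[i])
```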

How Meaningful is the Embedding?

This all comes down to the question of whether the embedding in the W+ space is meaningful. The primary goal of these works is image editing, so the paper performs manipulations of the latent code that correspond to common image editing operations to test how meaningful the latent is.

Morphing

First, the authors perform morphing between two images using their embedded latents. This is implemented by generating from a latent computed as a linear interpolation between the latents of the two images. In the figure above, the algorithm was able to generate high-quality morphs between face images (in-domain) but didn’t work as well for non-face images: we can find face-like artifacts during the interpolation. This suggests that the inner latent space is dedicated to human faces.
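A minimal sketch of this interpolation, assuming `w1` and `w2` are the (18, 512) W+ embeddings of the two images and `G_synthesis` is a placeholder handle to the StyleGAN synthesis network:

```python
import torch

def morph(w1, w2, G_synthesis, steps=8):
    """Generate intermediate frames by linearly interpolating two W+ codes."""
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        w = (1.0 - t) * w1 + t * w2      # linear interpolation in W+
        frames.append(G_synthesis(w))    # decode each intermediate latent
    return frames
```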

This enables morphing between two given faces! The work is truly amazing. However, generating out-of-domain data seems to be a stretch, because embeddings in the W+ space can have unnatural effects on the final output.

Style Transfer with StyleGAN

The fundamental design of StyleGAN lends itself to style transfer. The latent codes of the early layers affect spatial and high-level features such as face shape and pose, while the deeper layers control texture and color scheme. For example, if we use Obama’s latent code for the early layers and a cartoon’s latent code for the deeper layers, we get the style transfer effect illustrated above.
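As an illustration, here is a hedged sketch of this layer-wise mixing. The names are placeholders ((18, 512) W+ codes `w_content` and `w_style`, a `G_synthesis` network), and the crossover layer is a tunable choice, not a value prescribed by the paper.

```python
import torch

def style_transfer(w_content, w_style, G_synthesis, crossover_layer=9):
    """Mix two W+ codes: structure from one image, texture/color from the other."""
    w_mixed = w_content.clone()
    # Early layers keep the content image's spatial structure (shape, pose);
    # deeper layers take the style image's texture and color scheme.
    w_mixed[crossover_layer:] = w_style[crossover_layer:]
    return G_synthesis(w_mixed)
```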

The embeddings were able to transfer low-level features onto face images (figure above) but fail to preserve the high-level content of non-face images (below). This further backs the claim that the inner latent space is dedicated to human faces.

Style transfer on non-face images

The paper also provides experiments on expression transfer, which also proved successful. Overall, the experiments on human faces were very successful, showing that embedding into the W+ latent space is a reasonable choice. The resulting embeddings were semantically meaningful across many image editing applications.

Although the generator can generate out-of-domain images from their embeddings in the W+ space, the experiments suggest that embeddings of non-face images carry less semantic meaning. The fragments of human faces that appear during manipulations suggest that the inner latent space is dedicated to human faces.

Embedding Algorithm

Finally, we will discuss the optimization algorithm that embeds a given image onto the manifold of the pre-trained generator. The method optimizes an initial embedding w with gradient descent on a dedicated loss.

One way to choose the initial embedding is to simply sample it from a uniform distribution. However, in the StyleGAN setting, where the inputs of the AdaIN layers follow a complex distribution, we can expect the optimized vector w* to lie close to the mean latent vector, so we can instead initialize the embedding with the mean latent vector. This is effective for embedding face images, but not for non-face images.
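A small sketch of how such a mean latent can be computed. The `mapping` network here is a toy stand-in for StyleGAN’s pre-trained fully connected mapping network (Z → W); in practice you would use the real pre-trained one.

```python
import torch
import torch.nn as nn

# Toy stand-in for StyleGAN's mapping network (Z -> W).
mapping = nn.Sequential(nn.Linear(512, 512), nn.LeakyReLU(0.2), nn.Linear(512, 512))

with torch.no_grad():
    z = torch.randn(10_000, 512)         # sample many z vectors from the prior
    w_mean = mapping(z).mean(dim=0)      # map them to W and average

# Initialize every row of the (18, 512) W+ code with the mean latent.
w_plus_init = w_mean.unsqueeze(0).repeat(18, 1)
```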

A perceptual loss measures high-level similarity between images. Intermediate features of the VGG image classification network are commonly used to compute it: specifically, we take the difference between the features (outputs of convolutional layers) of the two images. The loss function is defined as a combination of this perceptual loss and a pixel-wise MSE loss between the given image and the output of the generator.
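Putting the pieces together, here is a hedged sketch of the optimization loop. `G_synthesis` and `target` are placeholders (a pre-trained synthesis network and the image to embed), a single VGG feature layer stands in for the several layers and loss weights used in the paper, and the learning rate and step count are illustrative only.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen VGG features for the perceptual loss (one feature layer for brevity;
# the paper combines several VGG layers with specific weights).
vgg_features = vgg16(pretrained=True).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

w_plus = w_plus_init.clone().requires_grad_(True)   # start from the mean latent (see above)
optimizer = torch.optim.Adam([w_plus], lr=0.01)

for step in range(1000):
    generated = G_synthesis(w_plus)                         # decode the current W+ code
    pixel_loss = F.mse_loss(generated, target)              # pixel-wise MSE term
    perceptual_loss = F.mse_loss(vgg_features(generated),   # feature-space (perceptual) term
                                 vgg_features(target))
    loss = perceptual_loss + pixel_loss                     # combined loss (weights omitted)
    optimizer.zero_grad()
    loss.backward()      # gradients flow to the latent only; generator and VGG stay frozen
    optimizer.step()
```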

The paper performs ablation studies on various losses, training configurations, and hyperparameters which are provided in the original paper.

Summary & Opinions

  • We can embed any given image into the StyleGAN latent space, even images outside the training domain.
  • Face embeddings are semantically meaningful across various manipulation experiments.
  • Non-face embeddings fail in such experiments, suggesting that while StyleGAN can reproduce such images, its latent space is mostly dedicated to human faces.
  • Embedding into the W and Z spaces doesn’t provide sufficient information to the generator; the W+ space works well.
  • We can initialize the latent vector with the mean latent vector.

This paper provides interesting insights into the GAN latent space (StyleGAN in particular) and shows that by selecting a sufficiently expressive latent space, we can map a given image into the StyleGAN latent space with minor information loss. The insights and improvements made in this paper are impressive.

Many experiments in the paper involve non-face images. For the most part, the results and hyperparameter suggestions for non-face images contradict those for face images, and the authors themselves note that the latent space is dedicated to human faces. Intuitively, a generator trained on faces isn’t supposed to be able to generate images of cars, so I question whether these experiments are meaningful.
