StyleGAN: Explained

NVIDIA’s Style-Based Generator Architecture for GANs (Generative Adversarial Networks)

ArijZouaoui
Apr 27, 2022

When exploring state-of-the-art GAN architectures, you will certainly come across StyleGAN. This model was introduced by NVIDIA in the research paper “A Style-Based Generator Architecture for Generative Adversarial Networks”.

Bringing a novel GAN architecture and a disentangled latent space, StyleGAN opened the doors for high-level image manipulation.

In this first article, we are going to explain StyleGAN’s building blocks and discuss the key points of its success as well as its limitations.

Before digging into this architecture, we first need to understand the latent space and the reason why it represents the core of GANs.

The latent space

We can think of it as a space where each image is represented by a vector of N dimensions. Ideally, each dimension encodes a unique piece of information about the image. In that case, the latent space is disentangled, and the generator is able to perform any desired edit on the image.

To better understand the relation between image editing and the latent space disentanglement, imagine that you want to visualize what your cat would look like if it had ‘long hair’. So you want to change only the dimension containing hair length information. In the case of an entangled latent space, the change of this dimension might turn your cat into a fluffy dog if the animal’s type and its hair length are encoded in the same dimension.

It would still look cute but it's not what you wanted to do!

That is the problem with entanglement: changing one attribute can easily cause unwanted changes in other attributes.

Thus, a main objective of GAN architectures is to obtain a disentangled latent space that enables realistic image generation, semantic manipulation, local editing, etc.

We can have a lot of fun with the latent vectors!

StyleGAN: key contribution

The key contribution of this paper is the generator’s architecture, which suggests several improvements over the traditional one. As you can see in the following figure, StyleGAN’s generator is mainly composed of two networks (mapping and synthesis). It also introduces a new intermediate latent space (the W space) alongside learned affine transforms.

Figure 1: StyleGAN - generator architecture

Mapping network:

Traditionally, a vector from the Z space is fed to the generator. This simply means that the given vector has values sampled from a normal distribution. The authors of StyleGAN introduce an intermediate space (the W space), obtained by mapping z vectors through an 8-layer MLP (Multilayer Perceptron): that is the mapping network.
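To make this more concrete, here is a minimal sketch of such a mapping network in PyTorch. The dimensionality (512) and the 8 fully connected layers follow the paper; the LeakyReLU slope and the normalization of z are assumptions of this sketch and may differ from NVIDIA’s official code.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a latent code z ~ N(0, I) to an intermediate latent w (simplified sketch)."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers, in_dim = [], z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalize z to roughly unit length before mapping (an assumption of this sketch)
        z = z / torch.sqrt(torch.mean(z ** 2, dim=1, keepdim=True) + 1e-8)
        return self.net(z)

# Usage: sample z from a normal distribution and map it into the W space
mapping = MappingNetwork()
z = torch.randn(4, 512)   # a batch of 4 latent codes
w = mapping(z)            # shape (4, 512)
```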

But why would they add an intermediate space?

It is the better disentanglement of the W space that makes it a key feature of this architecture: ideally, each of the 512 dimensions of a given w vector holds a unique piece of information about the image. Because w is produced by a learned mapping rather than sampled directly from a fixed distribution, the network is free to “unwarp” W so that the factors of variation become more linear, which is why W tends to be less entangled than Z.

You might ask yourself how we know whether the W space is really less entangled than the Z space.

To answer this question, the authors propose two new metrics to quantify the degree of disentanglement:

  • Perceptual path length: measures how much the image changes when interpolating between latent points. Smooth, gradual changes indicate better disentanglement, which is why we want to minimize this metric (a rough sketch of this computation is shown after this list).
  • Linear separability: measures how well we can separate the latent points corresponding to two image classes with a hyperplane. This is done by computing the conditional entropy, which tells us how much additional information is required to accurately classify these latent points.
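As a rough illustration, the perceptual path length can be estimated by interpolating between pairs of latent codes, generating the two nearby images, and measuring their perceptual distance. The sketch below works in the W space and assumes a generator `G` that takes a w code and a perceptual distance function `lpips_dist` (for example from the `lpips` package); the paper uses a VGG16-based distance and a specific image crop, which are omitted here.

```python
import torch

def perceptual_path_length(G, mapping, lpips_dist, num_samples=100, eps=1e-4):
    """Estimate the perceptual path length in W space (simplified sketch)."""
    total = 0.0
    for _ in range(num_samples):
        z1, z2 = torch.randn(1, 512), torch.randn(1, 512)
        w1, w2 = mapping(z1), mapping(z2)
        t = torch.rand(())                    # random interpolation position
        w_a = torch.lerp(w1, w2, t)           # linear interpolation in W
        w_b = torch.lerp(w1, w2, t + eps)     # a small step further along the path
        img_a, img_b = G(w_a), G(w_b)
        # Perceptual distance between the two nearby images, scaled by 1/eps^2
        total += lpips_dist(img_a, img_b).item() / (eps ** 2)
    return total / num_samples
```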

To learn more about the mathematics behind these two metrics, I invite you to read the original paper.

The authors present the following table to show how the W space combined with a style-based generator architecture gives the best FID (Fréchet Inception Distance) score, perceptual path length, and separability. They also discuss the loss of separability, combined with a better FID, when a mapping network is added to a traditional generator (highlighted cells), which demonstrates the W space’s strengths.

Affine transform — Style vectors:

This is a learned affine transform that turns w vectors into styles, which are then fed to the synthesis network. This block is denoted “A” in the original paper. It is important to note that one style vector is injected into each layer of the synthesis network. Interestingly, this allows the style to be controlled at each layer independently.
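As a rough sketch (the layer dimensions here are assumptions), the learned affine transform “A” can be implemented as a single fully connected layer that turns w into a per-channel scale and bias for the corresponding synthesis layer:

```python
import torch
import torch.nn as nn

class StyleAffine(nn.Module):
    """Learned affine transform 'A': maps w to a per-channel style (scale, bias)."""
    def __init__(self, w_dim=512, num_channels=512):
        super().__init__()
        self.fc = nn.Linear(w_dim, num_channels * 2)

    def forward(self, w):
        style = self.fc(w)                    # shape (batch, 2 * num_channels)
        scale, bias = style.chunk(2, dim=1)   # split into y_s and y_b
        return scale + 1.0, bias              # center the scale around 1
```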

Synthesis Network:

Figure 3: StyleGAN - synthesis network architecture [1]
  1. AdaIN (Adaptive Instance Normalization): consists of a normalization followed by a modulation. It was first introduced in a style transfer model [2] (see the sketch after this list).
  2. Noise — B: the main goal of feeding noise to the synthesis network is stochastic variation. For example, if you are sky-gazing and we somehow change the exact placement of the stars, you cannot perceptually tell that something has changed, as long as it follows the same distribution. Thus, these little stars are stochastic to you. The same applies to different aspects of human faces, as the authors mention: hair placement, freckles, skin pores, etc. This randomization is achieved by adding per-pixel noise after each convolution.

It is important to note that the authors use 2 layers per resolution, giving 18 layers in the synthesis network (going from 4x4 up to 1024x1024).
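A minimal PyTorch sketch of these two blocks is shown below; the channel counts and the learned per-channel noise weight follow the paper’s description, but the details are simplified compared with NVIDIA’s implementation.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: per-channel normalization followed by style modulation."""
    def forward(self, x, scale, bias):
        # Instance normalization: zero mean, unit variance per sample and per channel
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-8
        x = (x - mean) / std
        # Modulation with the style (y_s, y_b) produced by the affine transform "A"
        return scale[:, :, None, None] * x + bias[:, :, None, None]

class NoiseInjection(nn.Module):
    """Adds per-pixel Gaussian noise 'B', scaled by a learned per-channel weight."""
    def __init__(self, num_channels=512):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        # One noise value per pixel, broadcast across channels
        noise = torch.randn(x.shape[0], 1, x.shape[2], x.shape[3], device=x.device)
        return x + self.weight * noise
```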

To better visualize the role of each block in this quite complex generator, the authors explain:

We can view the mapping network and affine transformations as a way to draw samples for each style from a learned distribution, and the synthesis network as a way to generate a novel image based on a collection of styles. [1]

Style mixing — Mixing regularization:

StyleGAN comes with an interesting regularization method called mixing regularization. The idea is to take two different codes, w1 and w2, and feed them to the synthesis network at different levels: w1 is applied from the first layer up to a certain layer, called the crossover point, and w2 is applied from that point onward (a minimal sketch is shown after the quote below).

This regularization technique prevents the network from assuming that adjacent styles are correlated.[1]
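A minimal sketch of mixing regularization, assuming a synthesis network that accepts one style per layer (the `synthesis` call at the end is hypothetical):

```python
import torch

def mixed_styles(mapping, num_layers=18, crossover=None):
    """Build a per-layer list of w codes from two latent codes (simplified sketch)."""
    z1, z2 = torch.randn(1, 512), torch.randn(1, 512)
    w1, w2 = mapping(z1), mapping(z2)
    if crossover is None:
        # Pick a random crossover point, as done during training
        crossover = int(torch.randint(1, num_layers, (1,)))
    # w1 drives the layers before the crossover point, w2 drives the rest
    return [w1 if i < crossover else w2 for i in range(num_layers)]

# Usage (hypothetical synthesis network taking one style per layer):
# image = synthesis(mixed_styles(mapping, crossover=8))
```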

Besides its impact on the FID score, which improves when mixing regularization is applied during training, style mixing is also an interesting image manipulation method. The figure below shows the results of style mixing with different crossover points:

Figure 4: style mixing results with different crossover points applied. [1]

Here we can see the impact of the crossover point (different resolutions) on the resulting image:

  • Coarse resolutions [4x4–8x8]: eyes/hair/skin color are copied from A whereas the pose, hairstyle, and face shape (which the authors call high-level aspects) are copied from B.
  • Middle resolutions [16x16–32x32]: high-level aspects are copied from A and small aspects are copied from B.
  • Fine resolutions [64x64–1024x1024]: almost all the style is copied from A and only some color details are copied from B.

Truncation Trick in the W space:

Images that are poorly represented in the dataset are generally very hard for GANs to generate. Since the generator does not see many of these images during training, it cannot properly learn to generate them, which affects the quality of the generated images.

To counter this problem, a technique called “the truncation trick” avoids the low-probability-density regions of the latent space in order to improve the quality of the generated images. But since we are ignoring part of the distribution, we get less style variation.

This technique is known to be a good way to improve GAN performance, and it has traditionally been applied in the Z space.

StyleGAN offers the possibility to perform this trick in the W space as well. This is done by first computing the center of mass of W:

w̄ = 𝔼_{z∼P(z)}[ f(z) ]

That gives us the average image of our dataset. Then, we scale the deviation of a given w from this center:

w′ = w̄ + ψ (w − w̄)

where ψ < 1 is the style scale (a sketch in code follows the figure below).

Interestingly, the truncation trick in the W space allows us to control styles. As shown in the following figure, when ψ tends to zero we obtain the average image. On the other hand, when comparing the results obtained with ψ = 1 and ψ = −1, we can see that they are corresponding opposites (in pose, hair, age, gender, ...). This highlights, again, the strengths of the W space.

Figure 5: The effect of truncation trick as a function of style scale. [1]
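In code, the truncation trick reduces to a couple of lines. The sketch below assumes a mapping network like the one above and estimates w̄ by averaging the mapped w of many random samples:

```python
import torch

def truncate(mapping, w, psi=0.7, num_samples=10000):
    """Move w toward the center of mass of W by a factor psi (simplified sketch)."""
    with torch.no_grad():
        z = torch.randn(num_samples, 512)
        w_avg = mapping(z).mean(dim=0, keepdim=True)  # estimate of the center of mass of W
    return w_avg + psi * (w - w_avg)                  # psi = 0 returns the average image's code
```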

Conclusion:

StyleGAN is a state-of-the-art architecture that not only resolved many image generation problems caused by the entanglement of the latent space, but also introduced a new approach to manipulating images through style vectors. No model is perfect, though: an important limitation of this architecture is that it tends to generate blob-like artifacts in some cases. StyleGAN2 was later introduced to fix this problem and suggest other improvements, which we will explain and discuss in the next article.

References:

  1. Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, 2019.
  2. X. Huang and S. J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. CoRR, abs/1703.06868, 2017.
