What is StyleGAN? An overview of the key concepts of StyleGAN
StyleGAN [1], proposed in 2019, showed impressive performance in generating realistic images with a style-based generator architecture that separates high-level attributes such as pose and facial expression from stochastic variation such as hair placement and freckles. It also introduces an intermediate latent space W, which the generator consumes instead of the classic latent vector, in order to disentangle the latent space.
Key Concepts
- A learned “style” instead of a raw latent vector disentangles the complex image space and enables smooth transitions in the latent space.
- The styles are applied at multiple scales of the generation process through AdaIN (adaptive instance normalization), a technique related to conditional batch normalization.
- Random noise is added at every scale to provide stochastic variation in the generated images.
- Each technique introduced in the paper is compared and evaluated with metrics such as FID, perceptual path length, and linear separability.
StyleGAN Architecture
StyleGAN builds on the Progressive GAN [2] generator and discriminator. The differences are that the initial 4x4x512 input is a learned constant tensor, and that the latent vector is first passed through a mapping network whose output styles are fed into the generator through AdaIN layers. The discriminator architecture and the loss function (WGAN-GP) are unchanged from Progressive GAN.
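To make the learned constant input concrete, here is a minimal PyTorch sketch (the class and argument names are my own, not from the official implementation):

```python
import torch
import torch.nn as nn

class ConstantInput(nn.Module):
    """Learned 4x4x512 starting tensor, replacing the usual latent input."""
    def __init__(self, channels=512, size=4):
        super().__init__()
        # One learned tensor shared by all images; the styles and noise
        # injected later are what make each generated image different.
        self.const = nn.Parameter(torch.randn(1, channels, size, size))

    def forward(self, batch_size):
        # Repeat the same constant for every sample in the batch.
        return self.const.expand(batch_size, -1, -1, -1)
```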
Style Mapping and AdaIN
Concretely, the latent code z is mapped into the intermediate latent space W through a non-linear mapping network f: z -> W, implemented as an 8-layer MLP of fully-connected layers with leaky ReLU activations. The resulting vector w is then transformed by learned affine transformations “A” into the per-layer style inputs for the AdaIN layers.
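A minimal PyTorch sketch of the mapping network, assuming the commonly used 512-dimensional latents and a leaky ReLU slope of 0.2 (names are illustrative):

```python
import torch.nn as nn

def make_mapping_network(latent_dim=512, num_layers=8):
    """8-layer MLP f: z -> w with leaky ReLU activations."""
    layers = []
    for _ in range(num_layers):
        layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
    return nn.Sequential(*layers)

# Each AdaIN layer then gets its own learned affine "A", e.g.:
#   style = nn.Linear(latent_dim, 2 * num_channels)(w)
# yielding a per-channel scale and bias for that layer.
```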
The AdaIN operation is defined as

AdaIN(x_i, y) = y_{s,i} * (x_i - mu(x_i)) / sigma(x_i) + y_{b,i}

i.e., a channel-wise instance normalization whose scale y_s and bias y_b are computed from the “A” transform of w for the corresponding layer. In this way, the style vector w can rescale individual features of the network at different stages of generation.
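A sketch of AdaIN in PyTorch following the definition above (real implementations usually bias the scale toward 1 at initialization; that detail is omitted here):

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance norm: normalize each channel, then apply a
    per-channel scale y_s and bias y_b computed from the style w."""
    def __init__(self, latent_dim, num_channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels)
        self.affine = nn.Linear(latent_dim, 2 * num_channels)  # the "A" box

    def forward(self, x, w):
        # (batch, 2C) -> per-channel scale and bias, broadcast over H x W
        y = self.affine(w).view(-1, 2, x.size(1), 1, 1)
        y_scale, y_bias = y[:, 0], y[:, 1]
        return y_scale * self.norm(x) + y_bias
```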
Random Noise for Stochastic Variation
Per-pixel maps of random noise “B” are added in multiple layers of the network to provide stochastic variation in the image: the model can generate different hair textures and backgrounds while keeping the important facial details fixed, as shown in the paper's figures. Without such an input, the generator would have to invent its own pseudo-random variation from earlier activations to satisfy the discriminator; feeding noise in directly frees that capacity and lets the model focus on generating realistic images.
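A minimal sketch of the noise input “B” in PyTorch, assuming a single-channel noise map scaled by a learned per-channel weight, as in common StyleGAN implementations:

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Add per-pixel Gaussian noise, scaled by a learned per-channel
    weight (the "B" box), to supply stochastic detail such as hair."""
    def __init__(self, num_channels):
        super().__init__()
        self.weight = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        noise = torch.randn(x.size(0), 1, x.size(2), x.size(3), device=x.device)
        return x + self.weight * noise
```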
Style Mixing / Mixing Regularization
Style mixing, as shown in the paper's figures, is achieved by using different style vectors at different scales of the generation process. This demonstrates how well the W space captures the “style” of the image: coarse styles (4x4, 8x8) affect the pose and general hairstyle, middle styles (16x16, 32x32) affect smaller-scale facial features, and fine styles (64x64 and up) alter the color scheme and fine details.
With mixing regularization, a percentage of training images is generated using two random latent codes instead of one, switching from one code to the other at a random crossover point. This prevents the network from assuming that adjacent styles are correlated, and it is what makes the striking image mixing shown in the paper possible; a sketch follows below.
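The sketch below illustrates mixing regularization under the assumptions above: with some probability, the per-layer styles switch from one latent code to another at a random crossover point. The 90% probability matches the paper's strongest configuration; the helper names are illustrative.

```python
import torch

def mixed_styles(mapping, num_layers, z1, z2, mixing_prob=0.9):
    """With probability mixing_prob, use w1 for the layers before a
    random crossover point and w2 after it; otherwise use w1 everywhere."""
    w1, w2 = mapping(z1), mapping(z2)
    ws = [w1] * num_layers
    if torch.rand(()) < mixing_prob:
        crossover = int(torch.randint(1, num_layers, ()))
        ws[crossover:] = [w2] * (num_layers - crossover)
    return ws  # one style vector per AdaIN layer
```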
The Style Space
You might wonder why mapping z to w before feeding it into the generator improves performance. The reason is that it disentangles the latent space from the space of images. When the generator samples directly from z, the latent space has a fixed distribution (e.g., a Gaussian), and the generator must produce a plausible image for every latent vector, which forces it to warp the latent space to match the training data and entangles the factors of variation. The non-linear mapping from z to W is under no such constraint: W does not have to follow a fixed distribution, so the generator is under less pressure when arranging images in the latent space.
Experiments
The paper's ablation table shows the effectiveness of the methods discussed above. Configuration B replaces nearest-neighbor sampling with bilinear up/downsampling during progressive growing. The FFHQ dataset is a new dataset introduced together with the paper; according to the authors it has larger variety than CelebA-HQ, which was originally built from faces of celebrities.
The second table compares ways of constructing the W space using perceptual path length: the perceptual difference, measured with VGG16 embeddings, between images generated from two nearby latent vectors along an interpolation path. A shorter path length means the transition between two images is perceptually smoother.
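A simplified sketch of the path length measurement, assuming a generator g and a perceptual distance function vgg_distance (both hypothetical helpers); the paper additionally uses spherical interpolation in Z and averages over many samples:

```python
import torch

def perceptual_path_length(g, vgg_distance, z1, z2, eps=1e-4):
    """One sample of PPL: the perceptual (VGG16-based) distance between
    images generated at two nearby points on the interpolation path,
    scaled by 1/eps^2; the reported metric is an expectation over many
    such samples."""
    t = torch.rand(()).item()
    img_a = g(torch.lerp(z1, z2, t))
    img_b = g(torch.lerp(z1, z2, t + eps))
    return vgg_distance(img_a, img_b) / eps**2
```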
“If a latent space is sufficiently disentangled, it should be possible to find direction vectors that consistently correspond to individual factors of variation.” The separability score measures how well attributes are separated in the latent distribution by training a linear SVM to predict a binary attribute label from the latent vector alone. This way we can measure the strength of the linear relationship between the latent space and human-interpretable classes.
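A toy sketch of the separability idea using scikit-learn. Note that the paper's actual metric fits SVMs for 40 attributes and reports an exponentiated conditional entropy; here a simple accuracy probe stands in for it:

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def separability_probe(latents, labels):
    """Fit a linear SVM to predict a binary attribute (e.g. smiling)
    from latent vectors; higher accuracy means the attribute is more
    linearly separable in that latent space."""
    svm = LinearSVC().fit(latents, labels)
    return accuracy_score(labels, svm.predict(latents))
```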
According to the results, the FID does not change dramatically as the mapping network gets deeper, but separability improves and interpolations in the latent space become noticeably smoother. This indicates that StyleGAN successfully learned and separated the various styles in the intermediate space W.
References
[1] A Style-Based Generator Architecture for Generative Adversarial Networks (https://arxiv.org/abs/1812.04948)
[2] Progressive Growing of GANs for Improved Quality, Stability, and Variation (https://arxiv.org/abs/1710.10196)
The paper concludes with further examples generated with this method on the LSUN dataset, including bedrooms, cars, and cats.