Anime Image Generation by Style GAN

Ben · Nov 9, 2022


Cover image from pixiv (link)

What is a GAN? 😝 😝

Images generated by the original GAN (source)

The generative adversarial network (GAN) first appeared in the paper “Generative Adversarial Networks”, in which Goodfellow et al. (2014) introduced a brand-new structure: the “adversarial net”.

An adversarial net is composed of two components: a generator, which generates images, and a discriminator, which judges whether an image is real or fake.

Interaction between the generator and the discriminator (source)
Structure of the generator (source)

z: represents random noise (in this case, a vector of 100 random numbers); z is also called the ‘latent vector’.

G(z): represents the output 64 × 64 pixel image, with shape (64, 64, 3)
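To make this concrete, here is a minimal Keras sketch of such a generator, upsampling a 100-number z into a (64, 64, 3) image. The layer sizes and counts are illustrative assumptions, not the exact architecture in the figure.

```python
from tensorflow.keras import Sequential, layers

# A DCGAN-style sketch: z (100 numbers) is the only source of the image.
classic_generator = Sequential([
    layers.Input(shape=(100,)),   # z: the latent vector
    layers.Dense(4 * 4 * 256),
    layers.Reshape((4, 4, 256)),  # start from a 4x4 feature map
    layers.Conv2DTranspose(128, 4, strides=2, padding="same", activation="relu"),  # 8x8
    layers.Conv2DTranspose(64, 4, strides=2, padding="same", activation="relu"),   # 16x16
    layers.Conv2DTranspose(32, 4, strides=2, padding="same", activation="relu"),   # 32x32
    layers.Conv2DTranspose(3, 4, strides=2, padding="same", activation="tanh"),    # G(z): (64, 64, 3)
])
```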

The objectives of a GAN:

Generator: tries to fool the discriminator by generating real-looking images

Discriminator: tries to distinguish between real and fake images

The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency.

Competition in this game drives both teams to improve their methods until the counterfeits are indistinguishable from the genuine articles.

(Goodfellow et al., 2014).

The training process:

Classic GAN structure (source)
  1. The generator generates images (fake images) from random noise.
  2. Real or fake images are fed into the discriminator, which outputs a scalar (a value between 0 and 1; fake: 0, real: 1) representing the probability that the image is real.
  3. The loss computed from the discriminator’s output is used to improve both the generator and the discriminator.
  4. Keep looping through the previous steps until the images generated by the generator get an output value of 1 from the discriminator.

If the generator can get an output value of 1 from the discriminator, the discriminator believes the fake images from the generator are real, which means it can no longer distinguish between fake and real images. 😢

Finally, we can use the generator to generate images that follow the distribution of the training data. 😄
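To make the loop concrete, here is a minimal sketch of one training step in TensorFlow/Keras, assuming `generator` and `discriminator` are ordinary Keras models (for example the classic_generator above plus a matching discriminator ending in a sigmoid). The function and argument names are illustrative, not taken from a specific repo.

```python
import tensorflow as tf

# the discriminator is assumed to end with a sigmoid, so it outputs
# the 0-to-1 "probability of being real" described above
bce = tf.keras.losses.BinaryCrossentropy()

def train_step(real_images, generator, discriminator, g_opt, d_opt, latent_dim=100):
    batch_size = tf.shape(real_images)[0]
    z = tf.random.normal((batch_size, latent_dim))              # step 1: random noise
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(z, training=True)               # step 1: fake images
        real_pred = discriminator(real_images, training=True)   # step 2: score real
        fake_pred = discriminator(fake_images, training=True)   # step 2: score fake
        # step 3: the discriminator wants real -> 1 and fake -> 0 ...
        d_loss = bce(tf.ones_like(real_pred), real_pred) + \
                 bce(tf.zeros_like(fake_pred), fake_pred)
        # ... while the generator wants its fakes scored as real (-> 1)
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss                                       # step 4: repeat
```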

For more detail, here is the link to the original paper.

Problem with the classic GAN structure

Thanks to the remarkable performance of the adversarial net, many computer vision tasks now use the GAN as their basic structure, such as high-quality image generation and the reconstruction of missing parts of an image. However, some problems remain with the classic GAN structure.

Here are the main issues of the classic GAN:

  1. The generator operates as a black box: like most deep learning models, explainability remains a big challenge.
  2. It is hard to control the style of the image: the input of the generator is just a bunch of random numbers, for instance a vector with 100 components. We might expect that by controlling these 100 numbers we could control the style of the final image. But in fact, it doesn’t work! 😢

Style GAN

Images generated by StyleGAN (source)

The Style Generative Adversarial Network, or StyleGAN for short, is an extension of the GAN architecture, first proposed by Karras et al. (2018). It introduces two new structures: the mapping network and the synthesis network.

👉 👉 The structure of the discriminator and the training process are the same as in the classic GAN.

StyleGAN architecture (source)

The main modifications in StyleGAN

  1. Mapping network: instead of feeding the latent vector z directly into the network, multiple fully connected (FC) layers map it to an intermediate space that fits the network better.
  2. Synthesis network: after we get the vector w from the mapping network, a learned affine transformation (A) feeds it into every layer through AdaIN, while some noise (B) is added along the way.

Details of StyleGAN

Mapping network

To understand how StyleGAN solves the classic GAN’s lack of style control, we first have to talk about the concept of entanglement.

Explanation of the mapping network (Karras et al., 2018)

The above is the explanation from the original paper; let’s try to restate it in plain English.

Suppose we want to control two factors (eye size and face size), but a single element of the latent vector controls both eye size and face size at the same time. (entanglement)

By controlling (increasing or decreasing) the value of that element, we can get four different combinations: [small eyes, small face], [small eyes, big face], [big eyes, small face] and [big eyes, big face]. However, the combinations [small eyes, big face] and [big eyes, small face] are rare or nonexistent in the real world. As a result, the generator will end up generating unrealistic images.

With the mapping network applied to the latent vector z, we can expect the FC layers to transform the original vector z into a vector w that lives in a disentangled space (simply put, each element of w controls only a single style).

Synthesis network

The synthesis network is the other important change in StyleGAN. Unlike the classic GAN, which uses the latent vector z as the source from which images are generated, StyleGAN uses the mapped vector w only to change the style of the image.

Classic GAN:

The image is generated from the latent vector z. By upsampling z, we obtain an image of shape (64, 64, 3). In this case, z can be seen as the source of the generator.

Structure of the generator (source)

Style GAN:

Unlike the classic GAN, the latent vector w of StyleGAN is applied to all the layers, so w can be seen as performing a style transfer at each layer. Moreover, the starting point of the network is a learned constant value. (The starting point of the classic GAN is the latent vector z.)
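At the heart of this per-layer style transfer is the AdaIN (adaptive instance normalization) operation from the paper: each feature map is normalized, then given a new per-channel scale and bias computed from w. A minimal sketch, assuming channels-last TensorFlow tensors:

```python
import tensorflow as tf

def adain(x, style_scale, style_bias, eps=1e-8):
    """Adaptive instance normalization: normalize each feature map of
    each sample, then let the style choose its new scale and bias.

    x:                       feature maps, shape (batch, H, W, channels)
    style_scale, style_bias: per-channel styles computed from w,
                             shape (batch, 1, 1, channels)
    """
    # per-sample, per-channel statistics (instance normalization)
    mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
    x_norm = (x - mean) / tf.sqrt(var + eps)
    # the style vector w decides the new statistics
    return style_scale * x_norm + style_bias
```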

Using StyleGAN to Generate an Anime Picture

OK, finally! 💛 Since most of the structures in StyleGAN are the same as in the classic GAN, here I will only implement the key blocks of the StyleGAN generator.

For the full version of the code, please refer to my GitHub: click on me

This is the structure of the synthesis block: x is the main stream of the network (representing the image), and w is the output of the latent mapping. This block essentially applies a linear transformation to the feature maps x.
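Here is a minimal Keras sketch of such a block: upsample x, convolve, add noise (B), then re-scale and re-bias the normalized features using w (A). The filter count, activation, and noise handling are illustrative assumptions; see the repo for the exact code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class StyleBlock(layers.Layer):
    """One synthesis block: modulate the feature maps x with the style vector w."""

    def __init__(self, filters, **kwargs):
        super().__init__(**kwargs)
        self.upsample = layers.UpSampling2D(2)
        self.conv = layers.Conv2D(filters, 3, padding="same")
        self.act = layers.LeakyReLU(0.2)
        self.to_scale = layers.Dense(filters)  # A: affine map from w
        self.to_bias = layers.Dense(filters)   # A: affine map from w
        # B: learned strength of the injected noise
        self.noise_scale = self.add_weight(shape=(), initializer="zeros",
                                           name="noise_scale")

    def call(self, x, w):
        x = self.act(self.conv(self.upsample(x)))
        x = x + tf.random.normal(tf.shape(x)) * self.noise_scale  # B: per-pixel noise
        # AdaIN: instance-normalize, then apply the style's scale and bias
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        x = (x - mean) / tf.sqrt(var + 1e-8)
        scale = self.to_scale(w)[:, None, None, :]  # broadcast over H, W
        bias = self.to_bias(w)[:, None, None, :]
        return (1.0 + scale) * x + bias             # linear transformation of x
```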

This is the implementation of Mapping and StyleBlock. The mapping block is simply a stack of Dense layers (a fully connected net); in my code, there are 5 dense layers.
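The style block was sketched above; here is a matching sketch of the mapping network under the same assumptions (5 Dense layers, latent dim 128; the LeakyReLU activation is my guess, not necessarily the repo’s choice).

```python
from tensorflow.keras import Sequential, layers

def build_mapping(latent_dim=128, n_layers=5):
    """Mapping network f: z -> w as a stack of Dense layers.
    Dense acts on the last axis, so a (batch, n_style_block, latent_dim)
    input comes out with the same shape, as described below."""
    model = Sequential(name="mapping")
    model.add(layers.Input(shape=(None, latent_dim)))
    for _ in range(n_layers):
        model.add(layers.Dense(latent_dim))
        model.add(layers.LeakyReLU(0.2))
    return model
```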

This is the structure of the generator. The input of the mapping layer is reshaped to (batch_size, n_style_block, latent_dim), so the output w has the same shape. We then slice w along the second dimension ( w[:, i] ) and feed each slice to a style block to control the style at each resolution.

  • number of style blocks: 5
  • latent dim: 128

After the last Conv2D layer maps the number of filters to 3, we get our result: a 64 × 64 pixel image.
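Putting it together, here is a sketch of the full generator reusing the build_mapping and StyleBlock sketches from above. Starting the network from a 2 × 2 learned constant so that five 2× upsampling blocks reach 64 × 64 is my assumption to make the sizes work out; the filter counts are also illustrative.

```python
import tensorflow as tf
from tensorflow.keras import layers

class Generator(tf.keras.Model):
    """Full generator sketch, reusing build_mapping and StyleBlock from above."""

    def __init__(self, n_style_blocks=5, latent_dim=128):
        super().__init__()
        self.mapping = build_mapping(latent_dim, n_layers=5)
        # learned constant starting point (2x2, so five 2x upsamplings give 64x64)
        self.const = self.add_weight(shape=(1, 2, 2, 128),
                                     initializer="random_normal", name="const")
        self.blocks = [StyleBlock(128) for _ in range(n_style_blocks)]
        self.to_rgb = layers.Conv2D(3, 1, activation="tanh")  # map filters to 3

    def call(self, z):
        # z: (batch, n_style_blocks, latent_dim) -> w of the same shape
        w = self.mapping(z)
        x = tf.tile(self.const, [tf.shape(z)[0], 1, 1, 1])
        for i, block in enumerate(self.blocks):
            x = block(x, w[:, i])  # one style vector per resolution
        return self.to_rgb(x)      # (batch, 64, 64, 3)

# usage: a batch of 8 random style inputs -> 8 generated images
# gen = Generator()
# images = gen(tf.random.normal((8, 5, 128)))  # shape (8, 64, 64, 3)
```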

Result

I tried to modify the input of the last style block and found that this block controlled the hair colour. The columns in the image show the result of different inputs to the last style block.

I haven’t tried to modify the input of other style blocks, but we could expect each style block to control a certain style! 😄 😄

Reference

  1. Goodfellow, I., et al. (2014). Generative Adversarial Networks.
  2. Karras, T., Laine, S., & Aila, T. (2018). A Style-Based Generator Architecture for Generative Adversarial Networks.
  3. Face image generation with StyleGAN (Keras code example).


Ben

Deep Learning Enthusiast, interested in GAN, NLP and RL. Currently studying at The University of Queensland