These football clubs don’t exist — Sharing my experience with StyleGAN

Training StyleGAN with ~1K images of English football club logos

Jae Won Choi, MD
Analytics Vidhya
8 min read · Jun 14, 2021


Image by the author. Football club logos created by StyleGAN.

This post covers

  • Preparing a custom dataset through web scraping
  • A simple but practical how-to of training StyleGAN with the custom dataset
  • How to get latent vectors and manipulate them in StyleGAN

Introduction

Generative adversarial networks (GANs), a type of generative model for unsupervised learning, have progressed rapidly since their introduction by Ian Goodfellow in 2014. The main goal of a GAN is to learn the distribution of real-world data and synthesize realistic “fake” data that are indistinguishable from the real thing.

A Style-Based Generator Architecture for GANs (StyleGAN), developed by NVIDIA in 2018, is now one of the most well-known GANs. StyleGAN not only produces realistic high-resolution images but is also strong in terms of disentanglement, which enables fine style control. For a detailed explanation of the original paper, please refer to this post:

Because StyleGAN was originally trained on human-face data, fine-tuning it on non-facial images used to be difficult: building a large enough custom dataset is hard. There have been several updates to StyleGAN since, and the most recent version at the time of writing is StyleGAN2 with adaptive discriminator augmentation (ADA). The significance of this latest version is that you can train the network without overfitting even on relatively limited data.

Luckily, NVIDIA Research provides GitHub repositories for all versions of StyleGAN, including the TensorFlow implementation of StyleGAN2-ADA (the most recent version is a PyTorch implementation of StyleGAN2-ADA):

Therefore, I wanted to try out StyleGAN2-ADA on football club logo images and share my experience with those who want to train StyleGAN with a relatively small custom dataset.

Let’s start by cloning the StyleGAN2-ADA repository:

git clone https://github.com/NVlabs/stylegan2-ada.git

These are the packages that I used:

Web scraping from Wikipedia

Since I could not find any public dataset, I decided to gather images from the internet. Although there are football clubs all around the world, I chose those in England to make web scraping easier. At the time of writing (2020–2021 season), there are 1056 clubs competing within the English football league system according to this Wikipedia page. I noticed that each club’s Wikipedia page contained a logo image under the HTML class “infobox-image”, so web scraping was not very hard.

Captured from Wikipedia

I used requests + bs4 (BeautifulSoup) + urllib to extract the source URLs of the club logo images from the Wikipedia pages. I am not going into details on the code because it is very specific to my dataset.
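A minimal sketch of this requests + BeautifulSoup + urllib approach looks like the following; the club page URL and output folder are placeholders, and a full run would loop over all the club pages listed on the league-system article.

import os
import urllib.request

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0"}  # some Wikimedia endpoints reject requests without a user agent


def download_logo(club_url, out_dir="logos"):
    """Find the infobox logo on a club's Wikipedia page and download it."""
    soup = BeautifulSoup(requests.get(club_url, headers=HEADERS).text, "html.parser")
    infobox = soup.find(class_="infobox-image")  # the logo sits inside this infobox cell
    if infobox is None or infobox.find("img") is None:
        return
    src = "https:" + infobox.find("img")["src"]  # Wikipedia uses protocol-relative image URLs
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, src.split("/")[-1])
    request = urllib.request.Request(src, headers=HEADERS)
    with urllib.request.urlopen(request) as response, open(path, "wb") as f:
        f.write(response.read())


download_logo("https://en.wikipedia.org/wiki/Arsenal_F.C.")  # placeholder example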

Preparing dataset for StyleGAN

After removing clubs from the same region that share the same logo, and those without a logo image on their Wikipedia page, I ended up with a total of 1030 club logos. Most of the downloaded images were PNG files with 4 channels, and their width and height varied from as small as ~100 pixels to more than 400 pixels.

Now, there are some requirements for input data to be used by StyleGAN.

  • Must have 3 channels (RGB)
  • Must have same width and height
  • Width and height must be a power of 2 (256, 512, etc.)
  • Width and height should be at least 128 (to use pretrained model)

First, I had to add white pixels at the boundaries to make the images square, and turn transparent pixels (the 4th channel of the PNGs) into white.
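A minimal Pillow sketch of this step (pad each logo to a square white canvas and flatten the alpha channel) could look like this:

from PIL import Image


def squarify_on_white(path):
    """Pad a logo to a square white canvas and drop the alpha channel."""
    img = Image.open(path).convert("RGBA")
    side = max(img.size)
    canvas = Image.new("RGBA", (side, side), (255, 255, 255, 255))  # white square canvas
    offset = ((side - img.width) // 2, (side - img.height) // 2)
    canvas.paste(img, offset, mask=img)  # pasting with the alpha mask turns transparent pixels white
    return canvas.convert("RGB")  # 3 channels, equal width and height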

Next, I chose to resize images to 128×128. However, since there were images smaller than 128 pixels, resizing those images would yield very poor image quality. Therefore, I used the Residual Dense Network model (thankfully, a pretrained model is available!) to create super-resolution images out of the small images before resizing.
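As a sketch, the pretrained RDN from the idealo ISR package (pip install ISR) can do this step; assuming that package (the exact pretrained model used here may differ), the upscale-then-resize logic could look like this:

import numpy as np
from PIL import Image
from ISR.models import RDN

rdn = RDN(weights="psnr-small")  # downloads pretrained RDN weights on first use


def upscale_then_resize(img, target=128):
    """Super-resolve logos smaller than the target size, then resize to target x target."""
    if min(img.size) < target:
        img = Image.fromarray(rdn.predict(np.array(img)))  # RDN expects/returns uint8 RGB arrays
    return img.resize((target, target), Image.LANCZOS)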

Dataset is now ready!

Examples from the dataset.

Training

I used NVIDIA’s official TensorFlow implementation of StyleGAN2-ADA, and my hyperparameter configuration is also based on the examples in that repository.

Before training, convert the dataset to TFRecords.

python dataset_tool.py create_from_images ./datasets/logo ../YOUR_DIRECTORY

My training was a two-stage process:

  1. Transfer learning from the FFHQ-256 pretrained model + FreezeD
  2. Resume from stage 1 with no FreezeD

#1
python train.py --outdir=./training-runs --gpus=1 --data=./datasets/logo --res=128 --kimg=1000 --mirror=True --gamma=16 --augpipe=bgcfnc --freezed=2 --resume=ffhq256
#2
python train.py --outdir=./training-runs --gpus=1 --data=./datasets/logo --res=128 --kimg=2000 --mirror=True --gamma=16 --augpipe=bgcfnc --resume=PATH_LAST_NETWORK_FROM_1

FreezeD is a technique in which fine-tuning is performed with the lower layers of the discriminator frozen; it is known to be effective, especially when data is limited.

The training was performed on a GCP VM instance with one Tesla V100 GPU. The first stage took 5.5 hours and the second took 11 hours. To monitor the training process, you can either check the FID metric or the sample images created by StyleGAN at regular intervals (controlled by --snap).

Image by the author. FID metrics. Left: FFHQ-256 + FreezeD; right: resumed from the left with all layers trainable.
Image by the author. From the last example images of my training process.

Training GANs can be very tricky. If you are using your own dataset, you should try different methods with different hyperparameters; this is just how I did it. Please check the suggestions for better fine-tuning in the official StyleGAN repository or this post:

Exploring latent space

Now we have a well-trained StyleGAN that generates high-quality images. To control the styles of the generated images, you need to know which input (called the latent vector or latent code in GANs) is mapped to which output. The basic concept of manipulating GAN output is that if you “walk in the latent space” from point A to point B, you also get a smooth transition of the output from G(A) to G(B).

Unlike most previous GANs, StyleGAN makes use of an “intermediate vector” between the traditional latent vector and the output, which is known to have contributed to StyleGAN’s excellent feature disentanglement. You actually only need this intermediate latent vector to control styles.

Source. Note the intermediate latent space W.

There are two ways you can obtain the intermediate latent vectors:

  1. Generate images with random inputs, check the resultant images, and pick ones you want to manipulate
  2. Pick the images you want to manipulate first, and find inputs that map to outputs as close as possible to those images

How to generate random images

Like any other GAN, StyleGAN can generate outputs from random vectors in latent space. However, after mapping a random latent vector to an intermediate vector, StyleGAN does not directly use the intermediate vector to generate a new image but applies a “truncation trick”: to avoid generating overly random images outside the distribution of the training data, StyleGAN samples only intermediate vectors that fall within a certain range around the average intermediate vector of the training data. That range is determined by the variable truncation_psi (ψ in the original paper and code). Given that, you can generate random images and just make sure you keep the intermediate vectors.
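A sketch of this using the TensorFlow StyleGAN2-ADA API; the network path and the psi value of 0.7 are placeholders.

import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib

tflib.init_tf()
with dnnlib.util.open_url("PATH_TO_YOUR_NETWORK.pkl") as f:
    _G, _D, Gs = pickle.load(f)

z = np.random.randn(1, Gs.input_shape[1])      # random latent vector z
dlatents = Gs.components.mapping.run(z, None)  # intermediate vectors w, shape (1, 12, 512) at 128x128

# Truncation trick: pull the intermediate vectors toward the average w of the training data.
dlatent_avg = Gs.get_var("dlatent_avg")
dlatents = dlatent_avg + (dlatents - dlatent_avg) * 0.7  # truncation_psi = 0.7

images = Gs.components.synthesis.run(
    dlatents, randomize_noise=False,
    output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True))
PIL.Image.fromarray(images[0], "RGB").save("random_logo.png")
np.save("random_logo_dlatents.npy", dlatents)  # keep the intermediate vectors for later manipulation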

How to project images to latent space

Instead of sampling random inputs first and checking the output, you can do the opposite by having the images first and finding the corresponding inputs. The official StyleGAN repository provides a projector code that does the job through an iterative process.

python projector.py --outdir=out --target=targetimg.png --save_video=False --network=YOUR_NETWORK_PATH
Image by the author. Example of projecting an image to latent space in 1000 iterations.

I chose the second option, projecting images to latent space, because it is burdensome to pick from many random outputs. However, the drawbacks of the projector are that it is slow and sometimes inaccurate. You may use an encoder model as an alternative, but I will not go into details in this post. If you are interested:

Interpolation

If you use projector.py from the official StyleGAN repository, the output directory will contain a dlatents.npz file along with images of the target and projection. dlatents.npz is the file that has the intermediate latent vector of a projection, so you can now manipulate latent space with it.

One thing you can do is interpolate between two points in latent space, which gives you a smooth interpolation between the generated images as well. The coding is relatively straightforward: perform a linear interpolation between two latent vectors and run the generator.
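A minimal sketch, assuming two dlatents.npz files saved by projector.py (the paths are placeholders) and the same network loading as above:

import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib

tflib.init_tf()
with dnnlib.util.open_url("PATH_TO_YOUR_NETWORK.pkl") as f:
    _G, _D, Gs = pickle.load(f)

# Intermediate latent vectors saved by projector.py, each of shape (1, 12, 512).
w_a = np.load("out_a/dlatents.npz")["dlatents"]
w_b = np.load("out_b/dlatents.npz")["dlatents"]

synth_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True),
                    randomize_noise=False)

for i, alpha in enumerate(np.linspace(0.0, 1.0, 8)):
    w = (1.0 - alpha) * w_a + alpha * w_b  # linear interpolation in W space
    image = Gs.components.synthesis.run(w, **synth_kwargs)[0]
    PIL.Image.fromarray(image, "RGB").save(f"interp_{i:02d}.png")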

Image by the author. Example of interpolation.

Style mixing

Interpolation is cool, but the midpoint of two latent vectors might not correspond to the mixed version of the two images in the way you want. That is because features come at different levels, from coarse (e.g. pose, shape) to fine (e.g. color scheme, microstructure). As mentioned in the introduction, StyleGAN is well known for its strength in feature disentanglement.

An element of the intermediate vector variable dlatents has a shape of (1, 12, 512) and is actually a stack of 12 identical 512-dimensional intermediate vectors. You may check this with all(dlatents[0][i] == dlatents[0][j]) for any (i, j) in 0~11. This structure is related to the scale-specific control ability of StyleGAN: you can control high- to low-level attributes according to which layers you manipulate. The style-mixing code is sketched below.
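A minimal sketch of style mixing, assuming dlatents.npz files projected from several logos (the paths and the indices s and t are placeholders) and the same network loading as before:

import pickle
import numpy as np
import PIL.Image
import dnnlib
import dnnlib.tflib as tflib

tflib.init_tf()
with dnnlib.util.open_url("PATH_TO_YOUR_NETWORK.pkl") as f:
    _G, _D, Gs = pickle.load(f)

# Projected intermediate vectors of several logos, each of shape (1, 12, 512).
dlatents = [np.load(f"out_{i}/dlatents.npz")["dlatents"] for i in range(6)]

synth_kwargs = dict(output_transform=dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True),
                    randomize_noise=False)

s, t = 0, 1                        # indices of the two logos to mix
mix = dlatents[s].copy()           # start from one logo's intermediate vectors...
mix[0][:6] = dlatents[t][0][:6]    # ...and copy the first 6 (coarse) layers from the other

image = Gs.components.synthesis.run(mix, **synth_kwargs)[0]
PIL.Image.fromarray(image, "RGB").save("style_mix.png")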

The key line of the code above is:

mix[0][:6] = dlatents[t][0][:6]

which mixes the styles of the two images. You can change which layers to swap depending on the level of style you want to control.

I created a cross table of style mixing with six images as both sources and targets so that I could see the effect of controlling coarse versus fine styles. Note that the mixed images in the same row have the same color combination as the target, and those in the same column have the same shape as the source. I was quite surprised by how good the scale-specific control of StyleGAN is, even with only ~1K training images.

Image by the author. Note the style mixing: the shape of source + color of the target.

Conclusion

You can train StyleGAN2-ADA, currently the latest version of NVIDIA’s StyleGAN and the one that allows fine-tuning with limited data, on a custom dataset of ~1K images. Manipulating the latent space is also possible, and the results with StyleGAN trained on the custom dataset are satisfying.
