Do all GANs perform the same?

Sieun Park · Published in CodeX · Aug 27, 2021

Generative adversarial networks (GANs), as you have probably heard by now, are a framework for learning a data distribution through a competition between a generator and a discriminator. The generator learns to produce samples that are hopefully indistinguishable from real data, while the discriminator learns to classify whether a given image is real or fake. Since their invention, GANs have undergone various improvements and are now considered a powerful tool used actively in many problems, especially generative and reconstruction tasks.
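To make the adversarial setup concrete, here is a minimal PyTorch training-loop sketch. This is my own toy example, not taken from the paper or any particular implementation: a small generator maps noise to samples, a discriminator outputs a real/fake logit, and the two are updated in alternation with the common non-saturating loss.

```python
import torch
import torch.nn as nn

# Toy sketch of adversarial training (not the paper's setup).
latent_dim, data_dim = 16, 2
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, data_dim) * 0.5 + 2.0   # stand-in "real" data
    fake = G(torch.randn(64, latent_dim))

    # Discriminator step: classify real samples as 1, generated samples as 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator (non-saturating loss).
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```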

A great deal of work has focused on the fundamental objective of GANs, that is, the training loss. However, this study presents convincing evidence that the choice of loss matters far less than commonly assumed: with a sufficient hyperparameter search, the algorithms rank essentially at random, and even the most recent variants perform similarly to the original GAN proposed by Ian Goodfellow.

This paper …

  • Compares the performance of various GAN losses fairly in large-scale experiments.
  • Proposes precision and recall, computed on a toy dataset, as complementary metrics of performance.
  • If you want to get insights about GAN training, jump straight to the conclusion.

Original Paper: Are GANs Created Equal? A Large-Scale Study

Classic Methods for Evaluating GANs

One challenge in the study of GANs is finding quantitative metrics to evaluate the quality of generated images. Two commonly used metrics are the Inception Score (IS) and the Fréchet Inception Distance (FID). Both rely on a classifier pretrained for image recognition. We will briefly discuss the characteristics of these metrics.

IS combines two ideas: the label distribution of a sample should have low entropy when it contains a meaningful object, and the variability across samples should be high. It is computed from the classifier's predictions on generated images only. However, IS is not considered a proper distance, arguably because it does not incorporate the distribution of real images in any way (this point is not clearly elaborated in the paper).
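As a rough illustration, here is how IS is typically computed from the softmax outputs of a pretrained classifier. This is a minimal sketch: real implementations run an Inception network pretrained on ImageNet and average the score over several splits, which is omitted here.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception Score from classifier softmax outputs.

    probs: array of shape (N, num_classes), p(y|x) for each generated image.
    IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ), where p(y) is the marginal label
    distribution over all generated samples.
    """
    p_y = probs.mean(axis=0, keepdims=True)                 # marginal p(y)
    kl = probs * (np.log(probs + eps) - np.log(p_y + eps))  # per-sample KL terms
    return float(np.exp(kl.sum(axis=1).mean()))

# Toy usage: sharp, diverse predictions give a higher score.
probs = np.eye(10)[np.random.randint(0, 10, size=1000)] * 0.9 + 0.01
print(inception_score(probs))
```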

FID measures the distance between the statistics of real and generated images in the feature space of the pretrained classifier. We fit a Gaussian to the features of each set, based on their means and covariances, and measure the Fréchet distance between the two Gaussians. FID fixes a weakness of IS called intra-class mode dropping: for example, a model that generates only one image per class can achieve a good IS but will have a bad FID. FID has also proven more reliable for measuring image quality in previous experiments.
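Below is a minimal sketch of the FID computation, assuming the feature vectors of real and generated images have already been extracted from a pretrained classifier (the function name and shapes are my own):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two sets of features.

    Each input has shape (N, D); in practice these are activations of a
    pretrained Inception network.
    FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))
    """
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    c_r = np.cov(feats_real, rowvar=False)
    c_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c_r @ c_f)
    if np.iscomplexobj(covmean):   # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(c_r + c_f - 2 * covmean))
```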

(Figure: FID can be reliable?)

Precision and Recall

Neither FID nor IS can detect overfitting: a network that perfectly memorizes the training samples would still score well on both. We design a method to complement this weakness of FID in evaluating GAN performance.

Precision, recall, and the F1 score are widely used metrics for evaluating the quality of predictions. We construct a toy dataset with a data manifold such that the distance from a sample to the manifold can be computed efficiently, which lets us intuitively judge sample quality by that distance. If the samples from the model distribution lie close to the manifold, precision is high; if the generator can recover any sample from the manifold, recall is high.

The toy dataset, described in the figure above, is a distribution of gray-scale triangles. The distance from a sample to the manifold is defined as the squared Euclidean distance to the closest sample on the manifold. Precision is defined as the ratio of generated samples with a distance below δ = 0.75. For recall, we invert n samples x from the test set into latent vectors z* and compute the distance between x and G(z*). Inverting means finding the latent that most closely recovers the given image, or more precisely, solving the equation below. Recall is then defined as the ratio of test samples with a distance below δ.

Inverting G: z*(x) = argmin_z ‖G(z) − x‖²
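The sketch below illustrates how precision and recall could be computed in this setup. The function and argument names are my own illustration; in the paper the distance to the triangle manifold is computed exactly, and z* is obtained by minimizing the reconstruction error above with gradient descent in latent space.

```python
import numpy as np

def precision_recall(gen_samples, manifold_samples, recon_dists, delta=0.75):
    """Toy precision/recall in the spirit of the paper's manifold setup.

    gen_samples:      (N, D) samples drawn from the generator.
    manifold_samples: (M, D) dense set of points on the true data manifold
                      (a stand-in for the exact distance computation).
    recon_dists:      (K,) squared distances ||x - G(z*)||^2 for K test images,
                      where z* was found by inverting the generator.
    """
    # Precision: fraction of generated samples within delta of the manifold.
    d2 = ((gen_samples[:, None, :] - manifold_samples[None, :, :]) ** 2).sum(-1)
    precision = (d2.min(axis=1) <= delta).mean()

    # Recall: fraction of test samples the generator can reconstruct within delta.
    recall = (np.asarray(recon_dists) <= delta).mean()

    f1 = 2 * precision * recall / (precision + recall + 1e-12)
    return precision, recall, f1
```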

Various GANs

The designs and losses of GANs differ depending on the problem, but the experiments here focus explicitly on unconditional image generation. The GAN variants compared in the paper might seem similar because, abstractly, the generator and discriminator always optimize opposing objectives. However, they not only differ in how the generator and discriminator losses are computed, but also aim to optimize fundamentally different distances.

The original GAN (MM GAN) framework approximately minimizes the Jensen-Shannon (JS) divergence between the generated and real distributions. WGAN and WGAN-GP minimize the Wasserstein distance under a Lipschitz constraint on the critic. LSGAN minimizes the Pearson χ² divergence. Each GAN thus has a very different theoretical background, both in the type of distance between distributions and in how that distance is approximated, since it is not directly computable in most cases.
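For intuition, here are simplified PyTorch-style definitions of the discriminator and generator losses for three of these variants, written in terms of raw discriminator outputs. This is a sketch only; regularizers such as WGAN-GP's gradient penalty are omitted, and the exact formulations are in the respective papers.

```python
import torch
import torch.nn.functional as F

# d_real, d_fake: raw discriminator outputs (logits/scores) on real and generated batches.

def mm_gan_losses(d_real, d_fake):
    # Minimax GAN: D maximizes log D(x) + log(1 - D(G(z))); G minimizes log(1 - D(G(z))).
    # (The popular non-saturating variant instead uses BCE(d_fake, ones) for G.)
    d_loss = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    g_loss = -F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return d_loss, g_loss

def wgan_losses(d_real, d_fake):
    # WGAN: critic scores approximate the Wasserstein distance; the critic must be
    # (approximately) 1-Lipschitz, enforced via weight clipping or a gradient penalty.
    d_loss = d_fake.mean() - d_real.mean()
    g_loss = -d_fake.mean()
    return d_loss, g_loss

def lsgan_losses(d_real, d_fake):
    # LSGAN: least-squares objectives, linked to the Pearson chi-squared divergence.
    d_loss = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()
    g_loss = ((d_fake - 1) ** 2).mean()
    return d_loss, g_loss
```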

For more detail on how each loss is formulated, refer to the corresponding papers. I personally found them insightful for understanding the background of the current methods for training GANs.

Experiment Design

The evaluation metric must be effective, fair, and must not add too much computation. Therefore, FID together with precision, recall, and F1 is used as the metric. The performance of a model often varies with the hyperparameters, randomness (initialization), and the dataset.

To counteract the effects of components of the algorithm other than the loss, we

  • Use the same InfoGAN architecture for all models (except BEGAN and VAE, where an autoencoder is used).
  • Perform hyperparameter optimization separately for each dataset.
  • Run multiple random seeds.
  • Experiment on four small-to-medium-sized datasets (MNIST, Fashion-MNIST, CIFAR-10, CelebA).
  • Train under multiple computational budgets.

The authors find hyperparameter search necessary, as shown in the figure above: the searched hyperparameters had a significant impact on the final performance. Details about the hyperparameter search process are provided in the original paper.
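To make the protocol concrete, here is a sketch of the kind of random hyperparameter search with multiple seeds that the study relies on. The search space and the train_and_eval callable are placeholders of my own; the actual ranges and budgets are listed in the paper.

```python
import random

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 2e-4, 5e-4, 1e-3],
    "beta1":         [0.0, 0.5, 0.9],
    "batch_size":    [32, 64],
}

def random_search(train_and_eval, n_trials=100, n_seeds=5):
    """Return (hyperparams, list of FIDs over seeds) for each trial.

    train_and_eval is a placeholder callable (hyperparams, seed) -> FID.
    Comparing the distribution of scores over seeds, rather than the single
    best run, is one of the paper's main methodological points.
    """
    results = []
    for _ in range(n_trials):
        hp = {name: random.choice(choices) for name, choices in SEARCH_SPACE.items()}
        fids = [train_and_eval(hp, seed) for seed in range(n_seeds)]
        results.append((hp, fids))
    return results
```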

Conclusion

  • Shows evidence that algorithmic differences between state-of-the-art GANs are not that relevant, while the hyperparameter search has a much larger impact.
  • Optimal hyperparameters depend heavily on the dataset. As the figure above suggests, the black-star hyperparameters transfer terribly to other datasets, except for LSGAN.
  • Because the final performance always varies with the random seed, we must compare the distribution of runs for a fair comparison.
  • Many models have bad F1 scores, but seem to show improvement when their hyperparameters are optimized for it.
