Training GANs With Limited Data

StyleGAN2 with Adaptive Discriminator Augmentation (ADA)

Mayank Agarwal
The Startup
6 min read · Oct 28, 2020


One of the long-standing challenges with Generative Adversarial Networks (GANs) has been training them with little data. The key problem with small datasets is that the discriminator quickly overfits to the training examples. The discriminator’s job is to classify its inputs as either real or fake, but once it overfits, it rejects everything outside the training set as fake. As a result, the generator receives very little useful feedback to improve its generations and the training collapses. In this article we discuss recent work by Karras et al. [1] that tackles this problem via Adaptive Discriminator Augmentation.

Fig 1. GAN Training Objective — match generated image distribution x and real image distribution y. Left: x != y, Right: x = y

In almost all areas of deep learning, data augmentation is the standard defense against overfitting. For example, training an image classifier under rotation, noise, blur, etc. makes it increasingly invariant to these semantics-preserving distortions, a highly desirable quality in a classifier. However, this does not carry over directly to GAN training, because the generator would simply learn to produce the augmented distribution. This “leaking” of augmentations into the generated samples is highly undesirable.

The authors propose an augmentation technique called stochastic discriminator augmentation to overcome this “leaking” issue. The discriminator is evaluated only on augmented images, and this is done also when training the generator. Discriminator augmentation corresponds to putting distorting goggles on the discriminator and asking the generator to produce samples that cannot be distinguished from the training set when viewed through the goggles.

Fig 2. Left: Naive Discriminator Augmentation, Right: Stochastic Discriminator Augmentation
Fig 3. Augmented GAN Training Objective — match the augmented generated and real image distributions Tx and Ty respectively. Left: Tx != Ty, Right: Tx = Ty
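To make this concrete, below is a minimal PyTorch-style sketch of how discriminator augmentation could be wired into a GAN training step. The `augment` function, the networks `G` and `D`, and the optimizers are placeholders of my own, not the authors’ implementation; the point is simply that every image the discriminator sees, real or generated, passes through the same random augmentation, including inside the generator’s loss.

```python
import torch
import torch.nn.functional as F

def d_step(G, D, augment, reals, z, opt_d):
    """One discriminator update with stochastic augmentation.

    `augment` applies a random, probability-p transformation to a batch;
    the discriminator never sees un-augmented images.
    """
    fakes = G(z).detach()
    logits_real = D(augment(reals))
    logits_fake = D(augment(fakes))
    # Non-saturating GAN loss in softplus form (as used by StyleGAN2)
    loss = F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean()
    opt_d.zero_grad(); loss.backward(); opt_d.step()
    return logits_real.detach()  # useful later for the overfitting heuristic

def g_step(G, D, augment, z, opt_g):
    """One generator update: the generated images are augmented too,
    so the generator is judged 'through the goggles' as well."""
    logits_fake = D(augment(G(z)))
    loss = F.softplus(-logits_fake).mean()
    opt_g.zero_grad(); loss.backward(); opt_g.step()
```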

However, this approach only works if the distortions are represented by an “invertible” transformation of the probability distributions over the data space. It is crucial to understand that this does not mean that the augmentations performed on individual images would need to be undoable. For instance, an augmentation as extreme as setting the input image to zero 90% of the time is invertible in the probability distribution sense: it would be easy, even for a human, to reason about the original distribution by ignoring black images until only 10% of the images remain.
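Written out, the zero-image example is just an invertible mixture. A small worked version of this argument (my notation, with δ₀ denoting the point mass on the all-black image):

```latex
T x \;=\; 0.9\,\delta_{0} \;+\; 0.1\,x
\qquad\Longrightarrow\qquad
x \;=\; \frac{T x \,-\, 0.9\,\delta_{0}}{0.1}
```

Since Tx uniquely determines x, T is invertible at the level of distributions even though an individual blacked-out image can never be restored.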

Let us borrow another example from the paper to build intuition about invertible transformations. Random rotations chosen uniformly from {0, 90, 180, 270} degrees are not invertible (Fig 4): after augmentation, it is impossible to tell which orientation an image originally had. The generator is therefore free to produce upside-down images, since the relative occurrence of the four orientations in the augmented distribution stays the same. As a result, even though the generator learns to match the augmented distributions Tx and Ty, the underlying goal of matching the fake and real data distributions x and y remains unsolved.

Fig 4. Non-invertible example: 90 degree rotations

The situation changes if this rotation is executed only with probability p < 1 (Fig 2, Right): this increases the relative occurrence of 0 degrees, and now the augmented distributions can match only if the generated images have the correct orientation (Fig 5). When the generator tries to produce upside-down images, the augmented distributions no longer agree. Hence, the generator is forced to match the fake distribution x to the real distribution y in order to match the transformed distributions Tx and Ty.

Fig 5. Invertible example: 90 degree rotations with uneven probability (p=0.8)
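As an illustration (my own toy sketch, not the paper’s augmentation pipeline), such a probabilistic rotation can be implemented by flipping a coin per image and, only on success, applying a uniformly random multiple of 90 degrees:

```python
import torch

def random_rotate_90(images, p=0.8):
    """Rotate each image by a random multiple of 90 degrees with probability p,
    otherwise leave it untouched.

    images: tensor of shape (N, C, H, W)
    """
    n = images.shape[0]
    apply = torch.rand(n) < p          # which images get augmented at all
    k = torch.randint(0, 4, (n,))      # number of 90-degree turns (0 to 3)
    out = images.clone()
    for i in range(n):
        if apply[i]:
            out[i] = torch.rot90(images[i], int(k[i]), dims=(1, 2))
    return out
```

Because p < 1, the un-rotated orientation becomes the most frequent one in the augmented data, which is exactly what forces the generator to produce correctly oriented images.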

More generally, if we apply an invertible transformation T to the generated and real distributions x and y, then it is sufficient to match augmented distributions Tx and Ty in order to match the original distributions x and y (Fig 6).

Fig 6. Left: Invertible Transformation, Right: Non-invertible Transformation

Theoretically, if the augmentation operator T is “invertible”, there exists exactly one x for a given augmented distribution Tx, and there should be no “leaks” in x. In practice, however, due to finite sampling, the finite representational power of the networks, inductive bias, and training dynamics, very high values of p lead to augmentations leaking into the generated images. The authors also show that the optimal value of p is highly sensitive to dataset size. They avoid a costly grid search over p by making the process adaptive. Before diving into the details, let us first understand how overfitting can be measured in GANs.

For different dataset sizes, training starts the same way in each case, but eventually the progress stops and FID starts to rise (Fig 8). Also, the discriminator output distributions for real and generated images overlap initially, but keep drifting apart as the discriminator becomes more and more confident (Fig 7), and the point where FID starts to deteriorate is consistent with the loss of sufficient overlap between the distributions (Fig 8).

Fig 7. As the training progresses, the overlap between discriminator output distributions for real and generated images decreases

The standard way of quantifying overfitting is to use a separate validation set and observe its behavior relative to the training set. When overfitting kicks in, the validation set starts to behave increasingly like the generated images, and the discriminator outputs for real and generated samples begin to diverge. As evident from Fig 8, this happens earlier for smaller datasets.

Fig 8. Evolution of FID and discriminator raw logits during training for different dataset sizes.

The authors propose two plausible heuristics for measuring overfitting (Fig 9). The first heuristic, rv, expresses the discriminator output on a validation set relative to the training set and the generated images. The numerator is 0 when the training and validation sets behave exactly the same, so rv = 0 means no overfitting; the numerator and denominator are equal when the validation set behaves exactly like the generated images, so rv = 1 indicates complete overfitting.

Fig 9. Plausible overfitting heuristics
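Spelling out the heuristics shown in Fig 9 (as defined in the paper, with D denoting the raw discriminator logits and the expectations taken over the training set, a held-out validation set, and generated images respectively):

```latex
r_v \;=\; \frac{\mathbb{E}[D_{\mathrm{train}}] - \mathbb{E}[D_{\mathrm{validation}}]}
               {\mathbb{E}[D_{\mathrm{train}}] - \mathbb{E}[D_{\mathrm{generated}}]}
\qquad\qquad
r_t \;=\; \mathbb{E}\big[\operatorname{sign}(D_{\mathrm{train}})\big]
```

Both are constructed so that 0 means no overfitting and 1 means complete overfitting.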

Since rv requires carving a separate validation set out of an already small dataset, it is not practical to compute. Hence, the authors turn to rt, which estimates the portion of the training set that gets positive discriminator outputs, to detect overfitting and dynamically adapt the augmentation probability p as the training progresses (a rough sketch of this feedback loop follows the list below):

  • rt too high → augment more (increase p)
  • rt too low → augment less (decrease p)
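Here is a minimal sketch of such a controller, assuming a target value for rt and a fixed adjustment step. The target (the paper suggests a value around 0.6), the step size, and the update frequency are hyperparameters; the class below is my own simplification, not the authors’ exact implementation.

```python
class AdaController:
    """Adapts the augmentation probability p based on the r_t heuristic."""

    def __init__(self, target=0.6, step=0.005, p_init=0.0):
        self.target = target   # desired value of r_t
        self.step = step       # how much p moves per adjustment
        self.p = p_init

    def update(self, logits_real):
        # r_t = E[sign(D(x_real))], estimated over the recent real batches;
        # close to 1 when D is confidently positive on nearly all training
        # images (overfitting), close to 0 (or negative) otherwise.
        r_t = logits_real.sign().mean().item()
        if r_t > self.target:      # too much overfitting -> augment more
            self.p = min(1.0, self.p + self.step)
        else:                      # little overfitting -> augment less
            self.p = max(0.0, self.p - self.step)
        return self.p
```

In a training loop, the controller would be called every few minibatches with the discriminator’s logits on real images (for example, the value returned by the d_step sketch above), and the updated p fed back into the augmentation pipeline.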

In summary, the authors propose a plug-and-play technique that turns otherwise non-invertible image augmentations into invertible transformations of the data distribution by applying each augmentation only with some probability p. They also provide a mechanism to tune p adaptively during training, keeping the discriminator from overfitting without letting the augmentations leak into the generated images. For more details, I highly recommend reading the original paper.

References

  1. Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, Timo Aila. Training Generative Adversarial Networks with Limited Data. In NeurIPS 2020.
  2. Authors’ presentation at the NVIDIA GTC conference.
