GANs for Data Augmentation

Sam Nolen
Abacus.AI Blog (Formerly RealityEngines.AI)
Jul 2, 2019 · 5 min read
Even imperfect synthetic data can improve your classifier’s performance.

Generative adversarial networks, or GANs, were introduced by Ian Goodfellow in 2014 and have been a very active topic of machine learning research in recent years. GANs are unsupervised generative models which implicitly learn an underlying data distribution. In the GAN framework, the learning process is a minimax game between two networks: a generator, which maps a random noise vector to synthetic data, and a discriminator, which tries to distinguish real data from the generator’s synthetic data.

[Figure: The generative adversarial network framework]
[Figure: Photorealistic generated images (BigGAN, 2018)]
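To make the minimax game concrete, here is a minimal sketch of one training step, with the discriminator and generator updated in alternation. The MLP architectures, layer sizes, and optimizer settings are illustrative assumptions, not anything prescribed by the original paper.

```python
# A minimal sketch of one GAN training step in PyTorch (illustrative sizes).
import torch
import torch.nn as nn

noise_dim, data_dim = 64, 784
G = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(), nn.Linear(256, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU(), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):  # real: (batch, data_dim)
    batch = real.size(0)
    z = torch.randn(batch, noise_dim)
    fake = G(z)

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    opt_d.zero_grad()
    loss_d = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make D score the fakes as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(batch, 1))
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```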

GANs have been applied to many domains with striking results, particularly in computer vision. In this post, we’ll explore a less flashy but impactful use case for GANs: data augmentation to improve the performance of classifiers in supervised learning.

Data Augmentation

Getting a larger dataset is one of the most reliable ways to improve the performance of a machine learning algorithm — to use a phrase of Andrew Ng, “scale drives machine learning progress”. In some cases, adding generated or synthetic data, a process known as data augmentation, can also improve performance.

The most familiar way of doing this is to apply some transformations to existing data. In the case of image classification, we know, for example, that after shifting or reflecting an image of a cat, it’s still an image of a cat. So image classification datasets are often augmented with shifts, reflections, rotations, or color alterations to achieve the best results possible.

[Figure: Traditional data augmentation for images]
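For instance, a typical pipeline of this kind can be assembled with torchvision’s transforms module; the specific parameter values here are illustrative choices rather than canonical settings.

```python
# A typical traditional augmentation pipeline (illustrative parameters).
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),              # reflections
    transforms.RandomAffine(degrees=15,             # small rotations
                            translate=(0.1, 0.1)),  # shifts
    transforms.ColorJitter(brightness=0.2,          # color alterations
                           contrast=0.2),
    transforms.ToTensor(),
])
# Applied to a PIL image of a cat, `augment` yields a shifted, rotated,
# recolored tensor that still depicts a cat.
```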

Here’s a question: can we use a GAN to generate synthetic data to improve a classifier? In an April 2019 paper, Data Augmentation Using GANs, the authors generated an entirely synthetic dataset for a binary classification problem (cancer detection). Strikingly, they showed that a decision tree classifier performed better when trained on this synthetic dataset than when trained on the original small dataset.

However, this seems to be an exceptional case: this straightforward approach to data augmentation has a better chance of working on very small datasets. In a 2017 paper, The Effectiveness of Data Augmentation in Image Classification Using Deep Learning, the authors found that straightforward data augmentation using GANs was less effective than other augmentation strategies.

Data augmentation in the few-shot context

So let’s modify our question: what if we have a very small class within a larger dataset, such as a rare dog breed in an image dataset? Or what if we are training a fraud classifier and have only a few known examples of fraud among many instances of non-fraud? This situation is known as few-shot learning, and it turns out to be a more promising use case for data augmentation using GANs. But to tackle it, we need to include class information in our GAN model.

We can do this using a conditional GAN, in which class information is fed to the generator (and typically to the discriminator as well). We’ll now discuss three variants of conditional GANs from the past two years.
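As a sketch of how class information can be fed to the generator, one common construction embeds the label and concatenates it with the noise vector. The layer sizes and embedding scheme below are illustrative assumptions, not the architecture of any particular paper discussed here.

```python
# Sketch of a conditional generator: condition on the class by feeding
# the concatenation [noise, label embedding] through the network.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=64, n_classes=10, data_dim=784):
        super().__init__()
        self.embed = nn.Embedding(n_classes, n_classes)
        self.net = nn.Sequential(
            nn.Linear(noise_dim + n_classes, 256), nn.ReLU(),
            nn.Linear(256, data_dim), nn.Tanh(),
        )

    def forward(self, z, labels):
        return self.net(torch.cat([z, self.embed(labels)], dim=1))

# g = ConditionalGenerator()
# fake = g(torch.randn(8, 64), torch.randint(0, 10, (8,)))
```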

ACGAN: cooperate on classification

One variant of a conditional GAN, called ACGAN (Auxiliary Classifier GAN), has the discriminator perform classification in addition to discriminating between real and synthetic data, and the loss function includes a cross-entropy term for classification. This incentivizes the generator to produce representative samples of each class, in addition to samples which are realistic overall. This is essentially multi-task learning: although the generator and discriminator are “competing” on whether a generated image is real or fake, they are “cooperating” on classifying it correctly.
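A rough sketch of this two-headed objective, assuming the discriminator outputs a real/fake logit (src_*) and class logits (cls_*); the names and the unweighted sum of the two terms are illustrative simplifications.

```python
# Sketch of ACGAN-style losses: an adversarial real/fake term plus a
# shared classification term that both networks try to minimize.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()  # real/fake (adversarial) term
ce = nn.CrossEntropyLoss()    # class (auxiliary) term

def d_loss(src_real, cls_real, src_fake, cls_fake, y_real, y_fake):
    adv = bce(src_real, torch.ones_like(src_real)) + \
          bce(src_fake, torch.zeros_like(src_fake))
    aux = ce(cls_real, y_real) + ce(cls_fake, y_fake)
    return adv + aux  # D "competes" on adv, "cooperates" on aux

def g_loss(src_fake, cls_fake, y_fake):
    adv = bce(src_fake, torch.ones_like(src_fake))
    aux = ce(cls_fake, y_fake)
    return adv + aux  # G also wants its samples classified correctly
```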

DAGAN: learn a shared family of transformations for data augmentation

Another variant, called DAGAN (Data Augmentation GAN), learns to generate a synthetic image from a lower-dimensional representation of a real image. Rather than taking a class label and a noise vector as input, the DAGAN generator is essentially an autoencoder: it takes an existing image, encodes it, adds noise, and decodes the result. The decoder thus learns a large family of transformations for data augmentation.

[Figure: The DAGAN generator]
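A minimal sketch of that encode, add noise, decode structure; the real DAGAN uses much larger convolutional networks, and the linear layers and dimensions below are placeholder assumptions.

```python
# Sketch of the DAGAN generator: encode a real image, inject noise in
# the latent space, and decode to a transformed version of the image.
import torch
import torch.nn as nn

class DAGANGenerator(nn.Module):
    def __init__(self, data_dim=784, latent_dim=64, noise_dim=16):
        super().__init__()
        self.noise_dim = noise_dim
        self.encoder = nn.Sequential(nn.Linear(data_dim, latent_dim), nn.ReLU())
        self.decoder = nn.Linear(latent_dim + noise_dim, data_dim)

    def forward(self, x):
        z = self.encoder(x)  # lower-dimensional representation of x
        noise = torch.randn(x.size(0), self.noise_dim)
        return self.decoder(torch.cat([z, noise], dim=1))  # transformed x
```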

The DAGAN discriminator distinguishes between two kinds of pairs: a real image together with its generated transformation, and two real images from the same class. The discriminator thereby incentivizes the decoder to learn transformations which do not change the class, but which are non-trivial in the sense that the transformed image is not too similar to the original. However, a key assumption of the DAGAN is that the same family of transformations applies to all classes; this is reasonable in the computer vision context, but less so in fraud or anomaly detection.

[Figure: The DAGAN discriminator]
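Schematically, the discriminator therefore scores pairs of images rather than single images. A minimal sketch, with illustrative names and architecture:

```python
# Sketch of a pair discriminator: "real" pairs are two images from the
# same class; "fake" pairs are an image and its generated transformation.
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    def __init__(self, data_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * data_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x_a, x_b):
        # Score the two images jointly.
        return self.net(torch.cat([x_a, x_b], dim=1))

# real_score = disc(x, same_class_x)  # should be scored "real"
# fake_score = disc(x, generator(x))  # should be scored "fake"
```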

BAGAN: learning to balance imbalanced data

In yet another conditional GAN variant, known as BAGAN (BAlancing GAN), an autoencoder is also used for the generator. The autoencoder is pre-trained to learn the distribution of the overall dataset. Then a multivariate normal distribution is fit to the encoded images of each class. You can now sample from these class-conditional normals and pass the resulting latent vector to the generator. Unlike DAGAN, BAGAN gives you a full-fledged conditional generator, rather than transformations of existing data. It may also be better than ACGAN in a few-shot context, because the autoencoder learns the overall distribution before normals are fit to each individual class.

[Figure: The BAGAN framework]
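A sketch of that sampling procedure, assuming a pre-trained encoder and decoder that operate on NumPy arrays; the function names and interfaces here are hypothetical, not from the BAGAN paper.

```python
# Sketch of BAGAN-style sampling: fit a multivariate normal to each
# class's latent codes, then sample latents and decode them.
import numpy as np

def fit_class_gaussians(encoder, images_by_class):
    gaussians = {}
    for c, images in images_by_class.items():
        z = encoder(images)            # latent codes, shape (n, d)
        gaussians[c] = (z.mean(axis=0), np.cov(z, rowvar=False))
    return gaussians

def sample_class(decoder, gaussians, c, n):
    mu, cov = gaussians[c]
    z = np.random.multivariate_normal(mu, cov, size=n)
    return decoder(z)                  # synthetic images for class c
```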

Final Thoughts

While even naive data augmentation with GANs can sometimes boost classifier performance, especially in the case of very small or limited datasets, it seems that the most promising situations for augmentation with GANs involve transfer learning or few-shot learning. As research continues to improve the stability and reliability of GAN training, it will not be surprising to see rapid advances in the use of GANs for data augmentation.
