Generative Adversarial Networks

… or how to create realistic-looking counterfeit images

https://arxiv.org/pdf/1511.06434.pdf

None of these bedrooms are actual photographs. They were generated by a system of models, built on ideas from game theory, called a Generative Adversarial Network.

What else can GANs do?

Generative Adversarial Networks can create new images and new text snippets, given some original data to imitate. They can turn an outline or a rough sketch into a full-color image, and a written description into a realistic picture. Here are some examples:

1) Generating new labeled training data

Suppose you’d like to train a model to classify flowers vs. trees but you don’t have enough labeled training images of flowers. A GAN can use the flower images you have now to generate (from scratch) additional realistic-looking flowers.

2) Generating images from outlines

https://arxiv.org/pdf/1611.07004v1.pdf

Given the outline of some facades and the actual facade images, the GAN can learn to turn outlines into pictures. This will be incredibly useful in the visualization and design realms.

3) Generating images from text

https://arxiv.org/pdf/1605.05396.pdf

Given a text description of a bird, a GAN can learn to generate a new bird image to match the text. None of the bird images here were part of the training data fed to the GAN. The GAN generated these images on its own.

How do GANs work?

The graphic below explains in more detail how GANs work, but for a high-level overview it is often helpful to think of a counterfeiting machine and a detective. The goal is to get the machine to create data that looks so convincingly real that the detective (and the humans using the GAN’s output) can’t tell the data was made by a machine instead of by a true process.

A GAN needs two initial inputs: some real data (images, text, voice recordings, or even samples from a simple normal distribution) and some random noise. The generator (counterfeit machine) tries to turn the random noise into something that closely resembles the real data. The discriminator (detective) tries to tell the difference between real data and generated data. The discriminator’s feedback helps the generator produce more and more realistic-looking fake data.
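To make the two roles concrete, here is a minimal sketch of the two networks and their losses in TensorFlow 1.x, in the spirit of the 1-D experiment described later in this post. The layer sizes, the single hidden layer, and the log-loss formulation are my own illustrative choices, not the exact code the experiment uses.

import tensorflow as tf

# x holds samples of real data, z holds the random noise fed to the generator.
x = tf.placeholder(tf.float32, shape=(None, 1))
z = tf.placeholder(tf.float32, shape=(None, 1))

def generator(z):
    # Counterfeit machine: turns noise into something shaped like the real data.
    with tf.variable_scope('generator'):
        h = tf.layers.dense(z, 8, activation=tf.nn.relu, name='h0')
        return tf.layers.dense(h, 1, name='out')

def discriminator(inputs, reuse=False):
    # Detective: outputs the probability that its input came from the real data.
    with tf.variable_scope('discriminator', reuse=reuse):
        h = tf.layers.dense(inputs, 8, activation=tf.nn.relu, name='h0')
        return tf.sigmoid(tf.layers.dense(h, 1, name='out'))

G = generator(z)
D_real = discriminator(x)
D_fake = discriminator(G, reuse=True)

eps = 1e-8
# The detective wants to call real data "real" (1) and generated data "fake" (0).
loss_d = tf.reduce_mean(-tf.log(D_real + eps) - tf.log(1.0 - D_fake + eps))
# The counterfeiter wants the detective to call its output "real".
loss_g = tf.reduce_mean(-tf.log(D_fake + eps))

Each network is then trained on its own loss with respect to its own variables, and the two improve in tandem.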

An illustration of the “fight” between generator and discriminator:

GAN History

GANs were introduced in 2014 by Ian Goodfellow. Previously, deep learning had mainly been used for discriminative modeling: classification tasks that assign a label to discriminate between classes. GANs are deep learning’s first major foray into generative problems.

Yann LeCun, an AI researcher at Facebook, has named GANs one of the most interesting recent developments in deep learning. He says this (1) because a trained GAN discriminator knows a lot about the true data and so it can be used to extract the most meaningful features in image classification tasks, and (2) because GANs don’t use traditional loss functions. This matters especially in the image prediction realm, where the standard mean squared error loss often works poorly: averaging the pixel values of two plausible pictures generally produces one blurry new image.

A GAN experiment

For my experiment, I modified a TensorFlow implementation of a GAN that learns to generate a 1-D normal distribution. I explored the effects of changing the optimizer, the random seed, the use of the mini-batch technique, the number of hidden layers in the MLPs, the shape of the target distribution, and its spread.
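Roughly, a setup like this needs two samplers: one for the real data the GAN should imitate and one for the random noise fed to the generator. Here is a sketch of what those can look like; the class names, the uniform noise choice, and the noise range are my own illustrative assumptions rather than the exact code.

import numpy as np

class DataDistribution(object):
    # The "real" data the GAN tries to imitate: a 1-D Gaussian.
    def __init__(self, mu=4.0, sigma=0.5):
        self.mu = mu
        self.sigma = sigma

    def sample(self, N):
        return np.random.normal(self.mu, self.sigma, N)

class NoiseDistribution(object):
    # The random input fed to the generator.
    def __init__(self, value_range=8.0):
        self.value_range = value_range

    def sample(self, N):
        return np.random.uniform(-self.value_range, self.value_range, N)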

Optimizer Modifications

One modification I made to the code in every case was to switch from the Gradient Descent Optimizer to the Adam optimizer, since Adam has been suggested to perform better. I found this to be the case as well.

# Swap in Adam; learning_rate, batch (the global step), and var_list come from the surrounding training code.
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=batch, var_list=var_list)
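For comparison, the line being replaced is the same call with the optimizer class swapped (a sketch; I’m assuming the rest of the call is unchanged):

optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss, global_step=batch, var_list=var_list)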

I found the Gradient Descent Optimizer to run particularly poorly with certain random seed choices. It seems that Adam is a more robust solver. As a demonstration, I changed nothing else except the optimizer between these two implementations:

Distribution = Normal, mu = 4, sigma = 0.5, generator hidden layers = 4

Effect of the Mini-batch Technique

A generator that has learned to output the true mean value over and over would fool the discriminator quite well. But fake data that’s just the same number repeated isn’t very useful; we want a distribution of values. Mini-batch discrimination can fix this: it lets the discriminator look at several samples at once and measure the distances between them. We can then add features to the discriminator’s network that make it more likely to declare “fake” when the samples are very close together, as they would be if the generator had simply learned to output the true mean value over and over.
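Here is a rough sketch of one common way to implement a mini-batch discrimination layer; the kernel counts and dimensions are illustrative, and the resulting features are simply concatenated onto the discriminator’s input.

import tensorflow as tf

def minibatch_layer(features, num_kernels=5, kernel_dim=3):
    # Project each sample in the batch onto a set of small kernel vectors.
    x = tf.layers.dense(features, num_kernels * kernel_dim, name='minibatch_proj')
    activation = tf.reshape(x, (-1, num_kernels, kernel_dim))
    # Pairwise L1 distances between every pair of samples in the batch.
    diffs = tf.expand_dims(activation, 3) - tf.expand_dims(tf.transpose(activation, [1, 2, 0]), 0)
    abs_diffs = tf.reduce_sum(tf.abs(diffs), 2)
    # A sample that sits close to many other samples gets large feature values,
    # which signals to the discriminator that the batch may have collapsed onto one value.
    minibatch_features = tf.reduce_sum(tf.exp(-abs_diffs), 2)
    return tf.concat([features, minibatch_features], 1)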

In the top left plot below, we can see that using the mini-batch technique keeps the generator’s points as spread out as the true data. In the GAN without mini-batch (top right), the points collapse in toward the true mean. I was surprised to see how low the discriminator’s error is in the mini-batch case (bottom left), even though it makes sense that the discriminator would perform better with mini-batch capabilities than without. Looking at multiple samples at once gives the discriminator more information to use in assigning the samples a real vs. fake probability. The generator’s loss, consequently, is much higher in the mini-batch case because it doesn’t fool the discriminator as well. Yet somehow, the generated data appears to match the real data nearly perfectly in that top left image.

Distribution = Normal, mu = 4, sigma = 0.5, generator hidden layers = 4

Random Seed Robustness

There are still some random seed choices that produce poor results with Adam. The results are not as erratic as with the Gradient Descent Optimizer, though. The chart below explores several random seed choices. It appears that using the mini-batch technique can mitigate some of these poor results (though, not in the case of random seed 67), but it would be important to try several seeds when using a GAN in an applied situation.

Distribution = Normal, mu = 4, sigma = 0.5, generator hidden layers = 4
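To explore the seeds, each run fixes the random state before building the graph. A minimal sketch of how that can be done for both NumPy and TensorFlow (my own sketch, not necessarily exactly what the implementation does; seed 67 is one of the values discussed above):

import numpy as np
import tensorflow as tf

seed = 67
np.random.seed(seed)       # controls the NumPy data and noise samplers
tf.set_random_seed(seed)   # controls TensorFlow's weight initialization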

Varying Hidden Layers

We want the generator’s error to be lower than the discriminator’s error so that the discriminator is lenient enough to let the generator come up with some data that is a decent approximation of the truth. Because of this, the discriminator has twice as many hidden layers as the generator (at least in the code I modified).
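A sketch of how that depth can be parameterized; the helper name, widths, and activation are illustrative choices of mine rather than the actual code.

import tensorflow as tf

def mlp(inputs, n_hidden_layers, hidden_dim=8, scope='mlp', reuse=False):
    # Stack n_hidden_layers fully connected hidden layers, then a 1-D linear output.
    with tf.variable_scope(scope, reuse=reuse):
        h = inputs
        for i in range(n_hidden_layers):
            h = tf.layers.dense(h, hidden_dim, activation=tf.nn.relu, name='h%d' % i)
        return tf.layers.dense(h, 1, name='out')

generator_hidden_layers = 4                                 # varied in these experiments
discriminator_hidden_layers = 2 * generator_hidden_layers   # discriminator is twice as deep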

Four hidden layers appear to produce the best generator performance, with or without mini-batch, though changing the number of layers often doesn’t make a huge difference in the GAN’s performance or the generator’s loss. Here are some examples of the change in performance from adding one hidden layer:

Top: random seed = 55, bottom: random seed = 42

Testing a Bimodal Distribution

I wanted to see if a GAN could learn to imitate a bi-modal distribution. The bi-modal distribution I created samples points from two normal distributions with means and sigmas chosen so that there would be some overlap between the two individual distributions.

samples = np.concatenate((np.random.normal(self.mu1, self.sigma1, int(N/2)), np.random.normal(self.mu2, self.sigma2, int(N/2))), axis=0)
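For context, here is a minimal sketch of the sampler class around that line; the class name is mine, and the default means and sigmas are the ones from the run described below.

import numpy as np

class BimodalDataDistribution(object):
    # Real data drawn from a mixture of two overlapping Gaussians.
    def __init__(self, mu1=-2.0, sigma1=0.5, mu2=2.0, sigma2=0.7):
        self.mu1, self.sigma1 = mu1, sigma1
        self.mu2, self.sigma2 = mu2, sigma2

    def sample(self, N):
        # Half the samples come from each mode.
        samples = np.concatenate((np.random.normal(self.mu1, self.sigma1, int(N/2)), np.random.normal(self.mu2, self.sigma2, int(N/2))), axis=0)
        return samples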

The GAN did not generally have success learning bi-modal distributions:

However, I was able to cherry-pick a few combinations of parameters that led to somewhat successful results. The real data distribution imitated by this GAN had means of -2 and 2 and sigmas of 0.5 and 0.7. It’s interesting to note how the discriminator’s error continues to oscillate, perhaps between the two modes.

Using the mini-batch technique did not improve the results here or in other cases. Additionally, I had to switch back to the Gradient Descent Optimizer to get these results. The Adam Optimizer produces this sub-par result given the same means and standard deviations as the attempt above:

Varying Sigma

For a uni-modal distribution, I wanted to find the largest value of sigma for which a GAN still produces decent results. With sigma = 0.55, the results look okay and the loss converges quickly:

With sigma = 1, the generated data fits the true distribution less closely, and both loss functions are more variable:

With sigma = 2, the generated data is at least within the bounds of the real data samples.

But with sigma = 3, neither the generator nor the discriminator is powerful enough to pick up on the nuances of this large variance:

Next Steps

With more time, I’d like to see if these findings apply to the case of 2-dimensional distributions and I’d like to experiment with GAN applications to image generation.