Generative Adversarial Network
This article does a very good job of explaining how a generative adversarial network works.
Here is another good, more recent one.
How DCGANs work
To build a DCGAN, we create two deep neural networks. Then we make them fight against each other, endlessly attempting to out-do one another. In the process, they both become stronger.
Let’s pretend that the first deep neural network is a brand new police officer who is being trained to spot counterfeit money. Its job is to look at a picture and tell us if the picture contains real money.
Since we are looking for objects in pictures, we can use a standard Convolutional Neural Network for this job. If you aren’t familiar with ConvNets, you can read my earlier post. But the basic idea is that the neural network takes in an image, processes it through several layers that recognize increasingly complex features in the image, and then outputs a single value: in this case, whether or not the image contains a picture of real money.
This first neural network is called the Discriminator:
Now let’s pretend the second neural network is a brand new counterfeiter who is just learning how to create fake money. For this second neural network, we’ll reverse the layers in a normal ConvNet so that everything runs backwards. So instead of taking in a picture and outputting a value, it takes in a list of values and outputs a picture.
This second neural network is called the Generator:
So now we have a police officer (the Discriminator) looking for fake money and a counterfeiter (the Generator) that’s printing fake money. Let’s make them battle!
In the first round, the Generator will create pathetic forgeries that barely resemble money at all because it knows absolutely nothing about what money is supposed to look like:
But right now the Discriminator is equally terrible at its job of recognizing money, so it won’t know the difference:
At this point, we step in and tell the Discriminator that this dollar bill is actually fake. Then we show it a real dollar bill and ask it how it looks different from the fake one. The Discriminator looks for a new detail to help it separate the real one from the fake one.
For example, the Discriminator might notice that real money has a picture of a person on it and the fake money doesn’t. Using this knowledge, the Discriminator learns how to tell the fake from the real one. It gets a tiny bit better at its job:
Now we start Round 2. We tell the Generator that its money images are suddenly getting rejected as fake, so it needs to step up its game. We also tell it that the Discriminator is now looking for faces, so the best way to confuse the Discriminator is to put a face on the bill:
And the fake bills are being accepted as valid again! So now the Discriminator has to look again at the real dollar and find a new way to tell it apart from the fake one.
This back-and-forth game between the Generator and the Discriminator continues thousands of times until both networks are experts. Eventually the Generator is producing near-perfect counterfeits and the Discriminator has turned into a Master Detective looking for the slightest mistakes.
At the point when both networks are sufficiently trained so that humans are impressed by the fake images, we can use the fake images for whatever purpose we want.
The Math & Code
Our counterfeiter metaphor only goes so far in helping us actually build a GAN. To actually implement one, we need to get a little more formal. The generator (G) and discriminator (D) are both feedforward neural networks which play a min-max game against one another. The generator takes as input a vector of random numbers (z) and transforms it into the form of the data we are interested in imitating (G(z)). The discriminator takes as input a set of data, either real (x) or generated (G(z)), and produces a probability of that data being real (P(x)). The objective function is min_G max_D L_GAN(G, D).
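Written out in full, this is the value function from the original GAN paper:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]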
Optimize discriminator:
The discriminator is optimized to increase the likelihood of assigning a high probability to real data and a low probability to generated data, i.e. to maximize log(p(x_i)) + log(1 - p(x'_i)). Because p(x_i) is in (0, 1], log(p(x_i)) is negative, so in the code we define loss_d = reduce_mean(-log(p(x_i)) - log(1 - p(x'_i))) and simply let the optimizer find the lowest possible loss.
This formulation can also be written as loss_d = -(y log(p) + (1 - y) log(1 - p)), where y is the label (1 for real, 0 for generated) and p is the discriminator's output.
This is just the standard cross-entropy cost that is minimized when training a standard binary classifier with a sigmoid output. The only difference is that the classifier is trained on two minibatches of data; one coming from the dataset, where the label is 1 for all examples, and one coming from the generator, where the label is 0 for all examples.
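As a sketch of that formulation: if the discriminator is built to output raw logits (an assumption here; the snippet further below applies the sigmoid inside the discriminator instead), and d_logits_real / d_logits_fake are hypothetical names for its outputs on a real and a generated minibatch, the same loss can be written with TensorFlow's cross-entropy helper:

import tensorflow as tf

# Real minibatch: label 1.  Generated minibatch: label 0.
loss_d_real = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(d_logits_real), logits=d_logits_real)
loss_d_fake = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.zeros_like(d_logits_fake), logits=d_logits_fake)
loss_d = tf.reduce_mean(loss_d_real) + tf.reduce_mean(loss_d_fake)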
Optimize generator:
The generator is then optimized to increase the probability of the generated data being rated as real, i.e. to minimize log(1 - p(x'_i)). In code, this is implemented with the non-saturating variant loss_g = reduce_mean(-log(p(x'_i))): it has the same goal (make the discriminator rate generated data as real) but gives stronger gradients early in training, when the discriminator easily rejects the fakes.
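Continuing the logits-based sketch from above, this non-saturating generator loss is the same cross-entropy helper with the labels for the generated minibatch flipped to 1:

# The generator is rewarded when the discriminator labels its samples as real.
loss_g = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
    labels=tf.ones_like(d_logits_fake), logits=d_logits_fake))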
By alternating gradient optimization between the two networks using these expressions on new batches of real and generated data each time, the GAN will slowly converge to producing data that is as realistic as the network is capable of modeling. If you are interested, you can read the original paper introducing GANs here for more information.
To update the weights and compute the gradient for the generator, we treat the generator and the discriminator as one combined network, freeze the weights of the discriminator, and only update the generator's weights.
See It in TensorFlow Code
import tensorflow as tf

with tf.variable_scope('G'):
    z = tf.placeholder(tf.float32, shape=(None, 1))
    G = generator(z, hidden_size)

with tf.variable_scope('D') as scope:
    x = tf.placeholder(tf.float32, shape=(None, 1))
    D1 = discriminator(x, hidden_size)   # D's output on real data
    scope.reuse_variables()              # share D's weights between both copies
    D2 = discriminator(G, hidden_size)   # D's output on generated data

loss_d = tf.reduce_mean(-tf.log(D1) - tf.log(1 - D2))
loss_g = tf.reduce_mean(-tf.log(D2))

# config holds hyperparameters; d_vars / g_vars collect each scope's variables (see below).
d_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1) \
    .minimize(loss_d, var_list=d_vars)
g_optim = tf.train.AdamOptimizer(config.learning_rate, beta1=config.beta1) \
    .minimize(loss_g, var_list=g_vars)
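The var_list arguments are what restrict each optimizer to its own network's weights, as described above. The original snippet does not show where d_vars and g_vars come from; one common way (assumed here) is to collect the trainable variables by scope name:

d_vars = [v for v in tf.trainable_variables() if v.name.startswith('D/')]
g_vars = [v for v in tf.trainable_variables() if v.name.startswith('G/')]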
A more complete explanation is available here.
How to Set Up the Generator and the Discriminator?
Check out the DCGAN paper for how to convert deep convolutional neural networks into adversarial models.
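As a rough sketch of what that looks like in TensorFlow 1.x (layer sizes are illustrative choices for 28x28 grayscale images such as MNIST, not values from the paper): the generator upsamples a noise vector with fractionally-strided (transposed) convolutions and batch normalization, while the discriminator replaces pooling with strided convolutions and uses LeakyReLU activations.

import tensorflow as tf

def dcgan_generator(z, training=True):
    # Project the noise vector, reshape to a small feature map, then upsample.
    h = tf.layers.dense(z, 7 * 7 * 128)
    h = tf.reshape(h, [-1, 7, 7, 128])
    h = tf.nn.relu(tf.layers.batch_normalization(h, training=training))
    h = tf.layers.conv2d_transpose(h, 64, 5, strides=2, padding='same')      # 14x14
    h = tf.nn.relu(tf.layers.batch_normalization(h, training=training))
    return tf.layers.conv2d_transpose(h, 1, 5, strides=2, padding='same',
                                      activation=tf.tanh)                    # 28x28

def dcgan_discriminator(image, training=True):
    # Strided convolutions instead of pooling; LeakyReLU activations.
    h = tf.nn.leaky_relu(tf.layers.conv2d(image, 64, 5, strides=2, padding='same'))
    h = tf.layers.conv2d(h, 128, 5, strides=2, padding='same')
    h = tf.nn.leaky_relu(tf.layers.batch_normalization(h, training=training))
    h = tf.layers.flatten(h)
    return tf.sigmoid(tf.layers.dense(h, 1))    # probability the image is real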
What Does the Training Process Look Like?
Although we now have a rough idea of how a GAN works and how to assemble a DCGAN, the training process is still very difficult. Few papers tell you what you should specifically expect from the generator's and the discriminator's loss curves.
A plausible training-loss curve for the generator and the discriminator looks like this: at the beginning, both networks' losses decrease. As training goes on, before reaching equilibrium, the two losses tend to move in opposite directions. Finally, both of them can reach a (stable) equilibrium, with p_real and p_fake both hovering around 0.5.
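Below is a minimal training-loop sketch that logs p_real and p_fake, reusing x, z, D1, D2, d_optim and g_optim from the earlier snippet; sample_data, sample_noise, batch_size and num_steps are assumed helpers and constants, not part of the original code.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(num_steps):
        feed = {x: sample_data(batch_size), z: sample_noise(batch_size)}
        # Alternate updates: one discriminator step, then one generator step.
        _, p_real, p_fake = sess.run([d_optim, D1, D2], feed)
        sess.run(g_optim, {z: sample_noise(batch_size)})
        if step % 100 == 0:
            # Near equilibrium both averages should hover around 0.5.
            print(step, p_real.mean(), p_fake.mean())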
Potential Problems
Sometimes the generated images look almost identical no matter what input you give. This means the generator has collapsed to producing something close to the mean of the real data distribution; this happened in the face-synthesis case. From my understanding, this is because the faces are quite similar, since they all belong to the same category. One thing to try is lowering the learning rate. Another is early stopping, and a more appealing solution is to address the problem directly by giving the discriminator the ability to examine multiple examples at once.
For example, when I was using a Conditional Adversarial Autoencoder, the latent vector Z learned from the original images had a very narrow distribution.
(This was with a learning rate of 0.0002, i.e. 2*10^-4.)
The problem of the generator collapsing to a parameter setting where it outputs a very narrow distribution of points is “one of the main failure modes” of GANs according to a recent paper by Tim Salimans and collaborators at OpenAI. Thankfully they also propose a solution: allow the discriminator to look at multiple samples at once, a technique that they call minibatch discrimination.
In the paper, minibatch discrimination is defined to be any method where the discriminator is able to look at an entire batch of samples in order to decide whether they come from the generator or the real data. They also present a more specific algorithm which works by modelling the distance between a given sample and all other samples in the same batch. These distances are then combined with the original sample and passed through the discriminator, so it has the option to use the distance measures as well as the sample values during classification.
The method can be loosely summarized as follows:
- Take the output of some intermediate layer of the discriminator.
- Multiply it by a 3D tensor to produce a matrix (of size num_kernels x kernel_dim in the code below).
- Compute the L1-distance between rows in this matrix across all samples in a batch, and then apply a negative exponential.
- The minibatch features for a sample are then the sum of these exponentiated distances.
- Concatenate the original input to the minibatch layer (the output of the previous discriminator layer) with the newly created minibatch features, and pass this as input to the next layer of the discriminator.
In TensorFlow that translates to something like:
def minibatch(input, num_kernels=5, kernel_dim=3):
    # `linear` is a fully-connected layer helper defined in the post linked above.
    x = linear(input, num_kernels * kernel_dim)
    activation = tf.reshape(x, (-1, num_kernels, kernel_dim))
    # Pairwise L1 distances between all samples in the batch, per kernel.
    diffs = tf.expand_dims(activation, 3) - \
        tf.expand_dims(tf.transpose(activation, [1, 2, 0]), 0)
    abs_diffs = tf.reduce_sum(tf.abs(diffs), 2)
    minibatch_features = tf.reduce_sum(tf.exp(-abs_diffs), 2)
    # Append the minibatch features to the original layer output (TF 1.x argument order).
    return tf.concat([input, minibatch_features], 1)
We implemented the proposed minibatch discrimination technique to see if it would help with the collapse of the generator output distribution in our toy example. The new behaviour of the generator network during training is shown below.
When you get garbage images, what could potentially be happening is that either the Discriminator or the Generator has become overpowered and the Nash equilibrium has collapsed.
The other network is then unable to improve as there is no varying gradient to guide it.
The most intuitive way to enforce the Nash equilibrium (i.e. a constantly fair game for both networks) is to only update whichever of the two networks is currently doing worse at each iteration; in exchange for throwing away half of your potential weight updates, you ensure that neither of the two networks can become too powerful (a sketch of this follows below). Why can't we reach the Nash equilibrium by fine-tuning? This could be useful for tasks where we can come up with an explicit loss function.
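Here is a sketch of that "update only the currently losing network" heuristic, reusing loss_d, loss_g, d_optim, g_optim and the assumed helpers from the training loop above; whether this actually helps is task-dependent.

# Assumes an open tf.Session `sess`, as in the training loop above.
for step in range(num_steps):
    feed = {x: sample_data(batch_size), z: sample_noise(batch_size)}
    ld, lg = sess.run([loss_d, loss_g], feed)
    # Only update whichever network currently has the higher (worse) loss.
    if ld > lg:
        sess.run(d_optim, feed)
    else:
        sess.run(g_optim, feed)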
Another potential way of enforcing the Nash equilibrium that has not been explored is having multiple D and G networks, where each network's score is a weighted sum of its performance against all of its adversaries. Aside from being bigger and more interesting than a standard GAN, this makes it harder for the Nash equilibrium to collapse, as all of a network's adversaries have to become overpowered before the gradient variation vanishes.