GAN — A comprehensive review into the gangsters of GANs (Part 2)

Jonathan Hui
Jun 28, 2018 · 14 min read

This article examines the motivation and direction of GAN research aimed at improving GANs. By reviewing the approaches in a single context, we can follow the thought process behind them and form our own judgment on each solution. We provide information on the different approaches with links for follow-up. Unless you are stuck, read this article first before diverting your attention to the links. If you missed part 1 of this article, covering the applications and issues of GANs, here is the link.

The following figure summarizes the GAN design.

We group the efforts to improve GAN training into three categories:

  • Network design
  • Cost function
  • Optimization

Network design


One of the most popular network designs for GANs (and the origin of many others) is DCGAN.


It gets rid of max-pooling, which destroys spatial information and hurts image quality. Its main design choices are:

  • Replace all max pooling with strided convolutions,
  • Use transposed convolutions for upsampling,
  • Eliminate fully connected layers, and
  • Use batch normalization (BN).
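As a small sanity check of the transposed-convolution upsampling above, the usual output-size formula shows how a kernel of 4, stride 2 and padding 1 (values commonly used in DCGAN implementations) double the spatial resolution at every layer. This is a sketch; `conv_transpose_out` is an illustrative helper, not part of any library:

```python
def conv_transpose_out(size, kernel=4, stride=2, pad=1):
    """Output size of a square 2-D transposed convolution,
    using the standard formula: (in - 1) * stride - 2 * pad + kernel."""
    return (size - 1) * stride - 2 * pad + kernel

# A DCGAN-style generator doubles the spatial size at every layer:
size = 4
sizes = [size]
for _ in range(4):
    size = conv_transpose_out(size)  # 4 -> 8 -> 16 -> 32 -> 64
    sizes.append(size)
print(sizes)  # [4, 8, 16, 32, 64]
```

With these settings each transposed convolution exactly doubles the resolution, which is why a short stack of them can grow a small latent map into a full image.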


If we can take advantage of any extra information, we should. Conditional GAN (CGAN) feeds the label information of the samples to the networks, giving the generator a head start in creating images. In the second figure, both the generator and the discriminator take the labels as input. CGAN creates better images, but it requires labeling your training samples.

Stacked or progressive GAN

If you add the word “stack”, “progressive” or “hierarchy” in front of any deep network model, you all but guarantee yourself a new research paper. However, SGAN (Stacked GAN) and progressive GAN deserve the attention: they produce some of the best images so far.

We can train a large model all at once, or divide the network into sub-layers and train them one at a time. GAN training is hard, and the divide-and-conquer approach works. As shown below, the network is broken into three major sub-layers, and we train from sub-layer 1 to sub-layer 3 in three separate phases.

Modified from source

After we train each major sub-layer separately, we train the whole network jointly to improve performance (in the middle figure below).

Modified from source

In another GAN design, the progressive GAN upsamples or downsamples images by a factor of 2 in each sub-layer. In phase 1, we train a generator to produce 4 × 4 images. After that training completes, we add an upsampling layer to double the output resolution to 8 × 8.

After eight such doublings (nine resolution phases in total), we eventually produce 1024 × 1024 images. The following shows how we build up the resolution in the layers of the generator and discriminator.
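Under this doubling scheme, the resolution schedule can be computed directly. A small sketch; `progressive_schedule` is an illustrative helper, not from the paper:

```python
def progressive_schedule(start=4, final=1024):
    """Resolutions trained phase by phase in a progressive GAN,
    doubling the output size each time a new layer is added."""
    res = [start]
    while res[-1] < final:
        res.append(res[-1] * 2)
    return res

print(progressive_schedule())  # [4, 8, 16, 32, 64, 128, 256, 512, 1024]
```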


For those interested in developing GAN applications, it is worthwhile to take a look at the SGAN and progressive GAN articles in our series. These models produce some of the best images so far.

In general, attention-based deep learning has improved accuracy significantly. The Self-Attention Generative Adversarial Network (SAGAN) locates an attention area to improve the rendering of a specific region. For example, to refine the eye (the red dot), SAGAN collects information from the attention area (the highlighted area in the middle). GANs often have problems creating structure; for example, the legs of a dog may look odd. Attention-based GANs build structure better by relating the right context together.


Cost function

  • We add new penalties to the cost functions to enforce new constraints,
  • We borrow tips from deep learning to avoid overfitting, and
  • Most actively, we search for new cost functions with smooth, non-vanishing and non-exploding gradients everywhere.

New penalties

In deep learning, we add penalty terms to an objective function to enforce constraints. To mitigate mode collapse, we want the diversity of the generated images to be similar to that of real images. In minibatch discrimination, we separate generated images and real images into different batches. For each sample in a batch, we compute its similarity with the other samples in the same batch and feed this information to the discriminator. If modes drop in the generator, the similarity among generated samples increases. The discriminator can spot generated images from this additional signal and punish the generator accordingly.
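A minimal numpy sketch of the similarity computation, using made-up feature vectors (the actual method first projects features through a learned tensor, which is omitted here):

```python
import numpy as np

def minibatch_similarity(features):
    """For each sample, sum exp(-L1 distance) to every other sample in
    the batch. Collapsed (near-duplicate) samples get a high score,
    which the discriminator can use to flag mode collapse."""
    diff = np.abs(features[:, None, :] - features[None, :, :]).sum(-1)  # pairwise L1
    return np.exp(-diff).sum(axis=1) - 1.0  # exclude self-similarity (exp(0) = 1)

diverse   = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # spread-out fakes
collapsed = np.array([[1.0, 1.0], [1.0, 1.1], [1.1, 1.0]])  # near-duplicates
print(minibatch_similarity(diverse).mean())    # low
print(minibatch_similarity(collapsed).mean())  # high
```

A batch of near-duplicates scores far higher than a diverse one, which is exactly the signature of mode collapse the discriminator learns to punish.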

Avoid overconfidence

Overconfidence hurts, particularly in deep learning training. One-sided label smoothing sets the discriminator's target for real images to 0.9 instead of 1, so the discriminator is penalized when its prediction for any real image goes beyond 0.9.
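A small numeric illustration with binary cross-entropy and made-up discriminator outputs:

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy for a single prediction."""
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

# One-sided label smoothing: real images get target 0.9 instead of 1.0,
# so pushing D(x) toward extreme confidence now carries a cost.
# Fake targets stay at 0, hence "one-sided".
d_real = 0.99              # an overconfident discriminator output
print(bce(d_real, 1.0))    # near zero: full confidence is rewarded
print(bce(d_real, 0.9))    # clearly positive: overconfidence now costs something
```

With the smoothed target, the loss is minimized at D(x) = 0.9 rather than at 1, discouraging the extreme predictions that destabilize training.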

New cost functions and new goals

The discriminator does not want to be cheated by the generator. However, this goal can be too narrow. In theory, the discriminator can be 100% accurate by detecting a small set of features that the generated images are missing. This can turn into a greedy optimization in which modes drop and the models destabilize. While we want the data distribution of the generated images to converge to that of the real images, how we get there matters. If the gradient vanishes or destabilizes, the training fails.


Next, we will discuss some heavily studied new cost functions. But first, let's recap the cost functions in the original GAN paper.

In GAN, two objective functions are proposed. The original cost function has a vanishing gradient problem, so an alternative cost function is proposed. This shows that even when researchers start with a sound mathematical model, they may fall back on empirical study and intuition to refine it.

Feature matching

Feature matching proposes a new cost function for the generator with the objective that the generated images match the statistics of the real images. For example, we compute the means of the features f(x) separately for the minibatch of real images and the minibatch of generated images. Then we use their L2 distance to train the generator. This encourages the generated images to have features resembling the real ones.
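A minimal numpy sketch with made-up feature arrays, here using the squared L2 distance between the two minibatch means:

```python
import numpy as np

def feature_matching_loss(f_real, f_fake):
    """Squared L2 distance between the mean feature vectors of the real
    and generated minibatches. f_real, f_fake: (batch, features) arrays
    of discriminator features f(x) (placeholder values here)."""
    return np.sum((f_real.mean(axis=0) - f_fake.mean(axis=0)) ** 2)

f_real = np.array([[1.0, 2.0], [3.0, 4.0]])   # mean = [2, 3]
f_fake = np.array([[2.0, 2.0], [2.0, 2.0]])   # mean = [2, 2]
print(feature_matching_loss(f_real, f_fake))  # (2-2)^2 + (3-2)^2 = 1.0
```

Training the generator on this loss pulls the generated feature statistics toward the real ones instead of directly trying to fool the discriminator.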


Many researchers have studied different divergences, in particular those in the f-divergence family, to see whether they can train GANs better.


Intuitively, LSGAN wants the discriminator's target label to be 1 for real images and 0 for generated images. For the generator, it wants the target label of generated images to be 1. This is very close to GAN's alternative cost functions, except that LSGAN squares the difference (instead of using cross-entropy) when calculating the cost.

The reason behind the use of the square is partially based on this mathematical equation:

With specific values of a, b and c, optimizing the equations above is the same as optimizing a Pearson χ2 divergence (an f-divergence). With intuition and empirical study, LSGAN settles on the cost functions mentioned before. For those interested, more details can be found in the LSGAN article of our GAN series.
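The least-squares costs can be sketched as follows, with targets a = 0, b = 1 for the discriminator and c = 1 for the generator, and made-up discriminator outputs:

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push D(x) toward 1 for real
    images (target b = 1) and toward 0 for generated ones (target a = 0)."""
    return 0.5 * np.mean((d_real - 1) ** 2) + 0.5 * np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator loss: push D(G(z)) toward the real label (target c = 1)."""
    return 0.5 * np.mean((d_fake - 1) ** 2)

d_real = np.array([0.8, 0.9])  # made-up discriminator outputs on real images
d_fake = np.array([0.3, 0.1])  # made-up outputs on generated images
print(lsgan_d_loss(d_real, d_fake))
print(lsgan_g_loss(d_fake))
```

Unlike cross-entropy, the squared error keeps penalizing samples that are classified correctly but lie far from the target value, which gives the generator useful gradients even for confidently rejected fakes.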


Wasserstein distance is proposed to measure the difference between the data distributions of real and generated images. Intuitively, it measures the effort needed to transform one data distribution into another. Imagine the boxes on the left-hand side below as one data distribution that we want to move to the dotted spots. Each spot can hold one box (until it is taken), and we charge every box by its moving distance. The Wasserstein distance is the minimum total cost of moving the boxes to the new spots (the target distribution).
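The box-moving intuition can be made concrete in one dimension, where matching sorted positions is provably the optimal transport plan. This is a toy numpy sketch, not how WGAN computes the distance:

```python
import numpy as np

def wasserstein_1d(a, b):
    """1-D earth-mover distance between two equal-sized samples:
    match the sorted points and average the moving distances
    (for 1-D this greedy matching is optimal)."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

boxes = np.array([1.0, 2.0, 3.0])    # current box positions
spots = np.array([2.0, 3.0, 4.0])    # target spots
print(wasserstein_1d(boxes, spots))  # each box moves 1 step -> 1.0
```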

We want the gradients of the cost functions to be smooth and non-vanishing everywhere. In the plot below, the blue line is the ground truth and the green line is the data distribution of the generated data. We plot the cost as the green distribution moves away from the blue one. For the original cost functions in the GAN paper (the red line), most regions saturate with zero or near-zero gradients. The light blue line is the Wasserstein distance, whose gradients neither vanish nor explode. Mathematically, the Wasserstein distance looks more desirable as a cost function.


Here is the mathematical definition of the Wasserstein distance.
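If the equation image does not render, the standard Earth-Mover form is:

```latex
% Wasserstein-1 (Earth-Mover) distance between the real distribution p_r
% and the generator distribution p_g: the cheapest transport plan gamma
% among all joint distributions whose marginals are p_r and p_g.
W(p_r, p_g) = \inf_{\gamma \in \Pi(p_r, p_g)}
              \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big]
```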

As there are many ways to move the boxes, finding the cheapest plan (the Wasserstein distance) is not easy. Instead, we build a deep network f to estimate it and learn the model parameters through training. Our new cost functions take the form:

Comparing it side-by-side with GAN, the formulation looks almost the same, so training WGAN is almost the same as training the original GAN.

From another perspective, f behaves like a critic estimating the value of a state: f measures how good x is. However, one important detail remains. To estimate the Wasserstein distance, f must be a 1-Lipschitz function satisfying this constraint:

WGAN enforces this by weight clipping.

WGAN requires the weights of the network f to stay within a range controlled by the hyperparameter c (set to 0.01 in their experiments).

However, c is not easy to tune: lower it slightly and the gradient vanishes; raise it a little and the gradient explodes. To address this, WGAN-GP enforces the 1-Lipschitz constraint by adding a gradient penalty instead.
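Both tricks can be illustrated on a toy linear critic f(x) = w · x, whose gradient with respect to x is exactly w. This is a sketch only; real WGAN-GP computes the gradient with autodiff at points interpolated between real and generated samples:

```python
import numpy as np

def weight_clip(w, c=0.01):
    """WGAN: clamp every weight into [-c, c] after each update."""
    return np.clip(w, -c, c)

def gradient_penalty(w, lam=10.0):
    """WGAN-GP: penalize the critic when its gradient norm drifts from 1
    (for this linear critic, the gradient norm is just ||w||)."""
    return lam * (np.linalg.norm(w) - 1.0) ** 2

w = np.array([0.5, -2.0])
print(weight_clip(w))                  # [ 0.01 -0.01]
print(gradient_penalty(w))             # large: ||w|| is about 2.06, far from 1
w_unit = w / np.linalg.norm(w)
print(gradient_penalty(w_unit))        # ~0: unit gradient norm, no penalty
```

Note how crude clipping is: it flattens all weight structure to the box boundary, whereas the penalty only nudges the gradient norm toward 1.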

To understand WGAN and WGAN-GP further, we refer you to another article in our series. In short, WGAN and WGAN-GP introduce a new way to measure the cost: the Wasserstein distance between the real and generated images.

Energy based GAN (EBGAN) & Boundary Equilibrium GAN (BEGAN)

Many GANs are composed of an encoder and a decoder, and add a reconstruction cost to encourage the encoder to capture all important features.

EBGAN (energy-based GAN) replaces the discriminator below with an autoencoder: an encoder followed by a decoder.

The new discriminator uses the reconstruction cost (MSE) to criticize the real and generated images (D(X)) instead. Intuitively, rather than distinguishing real from generated images directly, we train the discriminator to reconstruct (encode and decode) real images well. If the input can be reconstructed well, we consider it real.

We change the cost functions such that the discriminator needs to be

  • a good critic: so it distinguishes real and generated images well, and
  • a good reconstructor: so it captures all features and nicely reconstructs real images.

Here are the cost functions:
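If the equations do not render, they can be sketched in code, with m the margin hyperparameter and D(·) the reconstruction error (the numeric values are made up):

```python
def ebgan_d_loss(recon_real, recon_fake, margin=1.0):
    """Discriminator: reconstruct real images well (low recon_real) and
    push the reconstruction error of generated images above a margin m."""
    return recon_real + max(0.0, margin - recon_fake)

def ebgan_g_loss(recon_fake):
    """Generator: make generated images easy to reconstruct."""
    return recon_fake

print(ebgan_d_loss(recon_real=0.1, recon_fake=0.3))  # 0.1 + max(0, 1.0 - 0.3)
print(ebgan_d_loss(recon_real=0.1, recon_fake=2.0))  # 0.1: fake already above the margin
```

The margin stops the discriminator from chasing ever-larger reconstruction errors on fakes, keeping it focused on reconstructing real images.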

The added constraint gives EBGAN broader goals and avoids greedy optimization. It ensures EBGAN generates images with features found in natural images.

BEGAN builds on the same EBGAN autoencoder concept for the discriminator but uses different cost functions. It acts on the proposition that

  • when the difference between the discriminator outputs D(X) on real and generated images decreases,
  • the data distributions of real and generated images also converge.

BEGAN measures the difference of the D(X) outputs for real and generated images

BEGAN uses a simple approximation to the Wasserstein distance between the distributions of D(X) for real and generated images: simply |m2 − m1|, the difference of their means. We can further simplify it to:

where L is the output of the autoencoder (the discriminator).
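The BEGAN losses, together with the proportional control of the balance term k, can be sketched as follows (the numeric values and the helper name `began_step` are illustrative):

```python
def began_step(loss_real, loss_fake, k, gamma=0.5, lam=0.001):
    """One BEGAN update: discriminator and generator losses, plus the
    proportional-control update of k that keeps the ratio
    L(G(z)) / L(x) near the diversity hyperparameter gamma."""
    d_loss = loss_real - k * loss_fake
    g_loss = loss_fake
    k_next = k + lam * (gamma * loss_real - loss_fake)
    return d_loss, g_loss, max(0.0, min(1.0, k_next))  # k clamped to [0, 1]

d, g, k = began_step(loss_real=0.4, loss_fake=0.1, k=0.0)
print(d, g, k)  # k grows: fakes are reconstructed "too easily" relative to gamma * L(x)
```

This k is the mechanism behind the γ trade-off described above: lowering γ tells the controller to favor reconstruction quality over diversity.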

BEGAN is one of the first GANs to demonstrate pleasant-looking portraits.

Does image quality improve when modes drop? At least this is observed in many experiments, even though the two have not been fully linked. The BEGAN discriminator has two goals: being a good critic and a good reconstructor. The first helps mode diversity while the second helps image quality. Sometimes these goals act against each other, so BEGAN provides a hyperparameter γ to balance them. In experiments, as γ drops, the image quality improves but modes drop.


Cost vs. image quality

The original cost function in GAN measures how well we are doing compared with our opponent. It is not a good indicator of image quality. Often, the generator cost increases even while the image quality is improving.

Many new cost functions directly or indirectly estimate the difference between the ground truth and the model. This helps model selection and hyperparameter optimization. It also removes the need for an external network to measure image quality and mode diversity, such as the Inception score.


What is the best cost function?

Trends in new cost functions

Second, the D(x) output of the discriminator is often computed as

rather than computing D(x) and D(G(z)) separately. This trend is important because when the discriminator is optimal (performing well), it seems to stop learning from real images and learns mostly from generated images. Based on this idea, RGAN and RaGAN have opened a new way of designing cost functions based on the difference between the discriminator's outputs, C(xr) − C(xf), where xr and xf are the real and generated images.
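The relativistic average idea can be sketched with numpy; the arrays are made-up critic logits C(·), and this follows the RaGAN discriminator loss with sigmoid outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ragan_d_loss(c_real, c_fake):
    """Relativistic average discriminator loss: a real image should look
    more realistic than the *average* fake, and vice versa, using the
    logit differences C(xr) - mean(C(xf)) and C(xf) - mean(C(xr))."""
    d_rf = sigmoid(c_real - c_fake.mean())  # real relative to the average fake
    d_fr = sigmoid(c_fake - c_real.mean())  # fake relative to the average real
    return -np.mean(np.log(d_rf)) - np.mean(np.log(1.0 - d_fr))

c_real = np.array([2.0, 1.5])   # raw critic logits, not probabilities
c_fake = np.array([-1.0, -0.5])
print(ragan_d_loss(c_real, c_fake))  # small: reals already beat the average fake
```

Because both real and fake logits appear in every term, the discriminator keeps learning from real images even when it is nearly optimal.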

By design or not, some new cost functions add regularization to the discriminator model (through parameter clipping, a Lipschitz constraint or a gradient penalty), which makes it harder for the discriminator to become optimal or overfitted. This better balances the training between the discriminator and the generator.

In addition, targeting D(X) = 1 for real images and 0 for generated images in both the generator and discriminator cost functions may not be desirable. When a GAN is at equilibrium, the optimal D(X) is 0.5. In many cost functions, the target labels are changed accordingly (mainly based on intuition verified by empirical analysis).


Table modified from here.

Maybe you can think about what to try next.


Experience replay replays recently generated images alongside the current ones in each iteration of the optimization. Hence, the discriminator is not optimized for just a single snapshot of the generator.
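One common implementation keeps a fixed-size pool of past fakes, as popularized by several image-translation GANs. The class below is a sketch; the name `ReplayBuffer`, the pool size, and the 50% replay probability are illustrative choices, and "images" here are arbitrary objects:

```python
import random

class ReplayBuffer:
    """Keep a pool of recently generated images; each step the
    discriminator sees a mix of fresh and replayed fakes, so it is not
    tuned only to the generator's current output."""
    def __init__(self, capacity=50):
        self.capacity = capacity
        self.pool = []

    def sample(self, fresh):
        if len(self.pool) < self.capacity:
            self.pool.append(fresh)
            return fresh                      # pool still filling: pass through
        if random.random() < 0.5:             # replay an old fake half the time
            i = random.randrange(self.capacity)
            old, self.pool[i] = self.pool[i], fresh
            return old
        return fresh

buf = ReplayBuffer(capacity=2)
print(buf.sample("img0"))  # pool filling, fresh image passes through
```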

Historical averaging and model averaging keep track of previous model parameters and penalize or dampen changes that deviate from their historical averages. This may help GAN models that do not converge well with gradient descent.

We anticipate our opponent's moves when playing games. In Unrolled GAN, we simulate how the discriminator may optimize itself over the next k iterations, and we optimize the generator based on this k-step look-ahead.

Unrolling the discriminator optimization 3 times to optimize the generator.

Both historical averaging and Unrolled GAN borrow a page from meta-learning: we learn from what we learn. We collect information over multiple training iterations and use it to make a single round of model changes. This smooths out noisy data and avoids greedy optimization that targets a single point in time.

Other tips

For those interested in cost functions, we provide four articles detailing WGAN/WGAN-GP, EBGAN/BEGAN, LSGAN, and RGAN/RaGAN. They open up some interesting research approaches.

Final thoughts

For those looking for all the articles in our GAN series, here is the link.