The Story of DCGANs, WGAN and CGANs

Busayor · Published in Analytics Vidhya · Aug 13, 2021

An introduction to selected variants of GANs and their defining concepts.

Photo by Suzy Hazelwood

Whoosh! I finally bring myself to write and pen my thoughts. This article will take you on a journey through my findings on certain variations of GANs and what distinguishes each from the others. If you’re a data science enthusiast like I am, I’m quite sure you have found the concept of GANs and its possibilities monumental.

Highlights of what we will discuss in this article:

  • Brief overview of GANs
  • Deep Convolutional Generative Adversarial Networks (DCGANs)
  • Wasserstein Generative Adversarial Network (WGAN)
  • Conditional Generative Adversarial Networks (CGANs)
Shall we begin?

Brief Overview of GANs

Generative Adversarial Networks (GANs) were proposed by Ian Goodfellow et al. In the original setting, a GAN is composed of a generator and a discriminator that are trained with competing goals. The generator is trained to produce samples close to the true data distribution in order to fool the discriminator, while the discriminator is optimized to distinguish real samples from the true data distribution from fake samples produced by the generator.

To put it in graphic terms, the generator is an art forger and the discriminator is the art expert or critic that institutions bring in to tell real art from fake. It’s more complicated than this, but it captures their basic roles. It is an interesting application of neural networks, and as we go further we shall see how it is made possible in different variations. Recently, GANs have shown great potential in simulating complex data distributions, such as those of text, images and video.
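In symbols, this competition is the minimax game from the original paper, where the generator G maps noise z to samples and the discriminator D outputs the probability that its input is real:

\[
\min_{G} \max_{D} \; V(D, G) \;=\; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] \;+\; \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
\]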

DCGANs

Deep Convolutional GANs are one of the less complex and easier-to-implement variations of GANs, proposed by Alec Radford, Luke Metz and Soumith Chintala in a paper called Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.

DCGAN generator architecture

This is a pictorial representation of the DCGAN generator from the LSUN scene modeling paper. It takes a random noise vector of shape 100x1x1, denoted z, and passes it through the generator network, which maps it to the output G(z) of shape 64x64x3. The network contains transposed convolutional layers that upsample the input tensor; batch normalization is applied, and every layer uses a ReLU activation except the last, which uses a Tanh activation that scales the images into the range [-1, 1].

The first layer, as shown, expands the input from 100x1x1 to 1024x4x4; this layer is called “project and reshape”. After it, fractionally strided (transposed) convolutional layers are applied; these invert the familiar (N + 2P − F)/S + 1 output-size rule of ordinary convolutions and instead upsample their input. We can see the spatial size N (height, width) go from 4 to 8 to 16 to 32, the kernel filter F is 5x5 and the stride S is 2, with the padding chosen so that each layer exactly doubles the spatial size.

The network goes from 100x1x1 → 1024x4x4 → 512x8x8 → 256x16x16 → 128x32x32 → 64x64x3.

The above image is an impressive output from the generator after 5 epochs of training.
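As a rough illustration of that stack, here is a minimal PyTorch sketch. It is not the authors’ code: the kernel/padding/output_padding combination and the class name DCGANGenerator are my own choices, picked so that each transposed convolution exactly doubles the spatial size.

import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    # Sketch of the generator described above: 100x1x1 noise -> 64x64x3 image.
    def __init__(self, z_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            # "project and reshape": 100x1x1 -> 1024x4x4
            nn.ConvTranspose2d(z_dim, 1024, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(1024),
            nn.ReLU(inplace=True),
            # 1024x4x4 -> 512x8x8 (5x5 kernel, stride 2)
            nn.ConvTranspose2d(1024, 512, 5, stride=2, padding=2, output_padding=1),
            nn.BatchNorm2d(512),
            nn.ReLU(inplace=True),
            # 512x8x8 -> 256x16x16
            nn.ConvTranspose2d(512, 256, 5, stride=2, padding=2, output_padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            # 256x16x16 -> 128x32x32
            nn.ConvTranspose2d(256, 128, 5, stride=2, padding=2, output_padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            # 128x32x32 -> 3x64x64; Tanh scales pixel values into [-1, 1]
            nn.ConvTranspose2d(128, 3, 5, stride=2, padding=2, output_padding=1),
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z)

# Usage: z = torch.randn(16, 100, 1, 1); images = DCGANGenerator()(z)  # shape (16, 3, 64, 64)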

DCGANs Key Concepts

  • It consists of two neural networks, like every other GAN: a Generator and a Discriminator.
  • The Discriminator takes generated images and real images as input and outputs a value between 0 and 1, i.e. its confidence that an image is real rather than fake.
  • The Generator never sees real images; it learns via feedback from the Discriminator.
  • Strided convolutions and fractionally strided convolutions are used in place of pooling layers in the Discriminator and Generator respectively.
  • Batch normalization is used in both the Generator and the Discriminator.
  • A deeper convolutional architecture is used instead of fully connected hidden layers.
  • ReLU activation is used in the Generator for all layers except the output layer, which uses Tanh.
  • LeakyReLU activation is used for all layers in the Discriminator.
  • Models are trained with mini-batch stochastic gradient descent (SGD) with a mini-batch size of 128.
  • A binary cross-entropy (BCE) loss function is used for the Generator (a minimal sketch follows this list).
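To make the last few points concrete, below is a hedged sketch of a matching discriminator and the BCE losses. The layer widths, kernel sizes and the 0.2 LeakyReLU slope are common DCGAN-style choices rather than values taken from this article.

import torch
import torch.nn as nn

class DCGANDiscriminator(nn.Module):
    # Strided convolutions instead of pooling, LeakyReLU on every layer,
    # and a single sigmoid output: the confidence that the image is real.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2),     # 3x64x64   -> 64x32x32
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 5, stride=2, padding=2),   # 64x32x32  -> 128x16x16
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 5, stride=2, padding=2),  # 128x16x16 -> 256x8x8
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 1, 8),                         # 256x8x8   -> 1x1x1 score
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x).view(-1)

# BCE losses, with a mini-batch of 128 as in the paper:
# bce = nn.BCELoss()
# d_loss = bce(D(real), torch.ones(128)) + bce(D(fake.detach()), torch.zeros(128))
# g_loss = bce(D(fake), torch.ones(128))  # the Generator learns only through D's feedback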

Constraints of DCGANs

Some of the DCGAN’s limitations are a result of its BCE loss function, namely:

  • Mode Collapse: This describes the inability of a GAN to generate the different classes of a distribution. On MNIST, for example, the GAN may only be able to generate one class out of the 10 possible classes; in a dog-image setting, the model might only generate one breed of dog amongst the many breeds available.
  • Vanishing Gradients: The Discriminator’s confidence is a single value between 0 and 1, and the goal is to push it as close to 1 as possible. When the Discriminator becomes too strong, the calculated gradient approaches 0; the Generator then receives very little information and cannot learn, leaving us with a strong Discriminator and a weak Generator.

WGAN

Wasserstein GAN was proposed by Martin Arjovsky, Soumith Chintala and Léon Bottou in a paper named Wasserstein GAN. The algorithm was introduced as an alternative to traditional GAN training; it improves the stability of learning and takes care of problems like mode collapse.

Unlike DCGANs, this paper introduces the EM distance, also known as the Wasserstein distance, as the metric. The EM distance is the amount of effort needed to move one distribution onto another, i.e. how much work you should spend to transport one distribution to the other. The value is non-negative and the metric is symmetric. The EM distance has two useful properties:

  1. The function is continuous everywhere.
  2. The function is differentiable almost everywhere, i.e. its gradient exists at almost every point.

With these two properties, we are able to avoid the problem of vanishing gradients and can keep training and updating our models until they converge.
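For reference, the EM distance the paper builds on is defined as an infimum over all joint distributions γ whose marginals are the real distribution P_r and the generated distribution P_g:

\[
W(\mathbb{P}_r, \mathbb{P}_g) \;=\; \inf_{\gamma \,\in\, \Pi(\mathbb{P}_r, \mathbb{P}_g)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big]
\]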

The duality relation of EM-distance

However, when finding the infimum, it’s hard to exhaust every possible sample in the joint distribution. Using the Kantorovich-Rubinstein duality, the problem can be converted into its dual form, and we instead find the supremum. The relation between the two forms is shown above. The only constraint is that the function must be Lipschitz-1 continuous, i.e. the norm of its gradient must be at most 1 at every point.
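For completeness, the Kantorovich-Rubinstein duality referred to above states that

\[
W(\mathbb{P}_r, \mathbb{P}_g) \;=\; \sup_{\lVert f \rVert_L \le 1} \; \mathbb{E}_{x \sim \mathbb{P}_r}\big[f(x)\big] \;-\; \mathbb{E}_{x \sim \mathbb{P}_g}\big[f(x)\big]
\]

where the supremum runs over all 1-Lipschitz functions f.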

The objective of WGAN.

In DCGANs, we always aim to maximize the classification score: if the image is fake the Discriminator gives it a score of 0, and if the image is real it gives a score of 1. In WGAN, the task of the discriminator becomes more of a regression problem, and Arjovsky et al. renamed it the critic. The critic should measure the EM distance, i.e. how much work should be spent, and find the maximum case as shown in the figure above.
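In practice the critic is a network f_w whose weights w are restricted to a Lipschitz-constrained family W, so the WGAN training objective in the figure takes the form

\[
\min_{G} \; \max_{w \in \mathcal{W}} \; \mathbb{E}_{x \sim \mathbb{P}_r}\big[f_w(x)\big] \;-\; \mathbb{E}_{z \sim p(z)}\big[f_w(G(z))\big]
\]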

Algorithm of WGAN

The figure above shows the training process of WGAN, which is similar to that of usual GANs; the major differences are:

  • The critic is updated multiple times per generator update.
  • We don’t use cross-entropy when computing the loss, i.e. we don’t need to take the logarithm.
  • Weight clipping is done to satisfy the Lipschitz-continuity constraint.
  • Momentum-based optimizers such as Adam are not used; the paper uses RMSProp instead (see the sketch after this list).
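Here is a minimal sketch of one critic step under these rules. The values n_critic = 5, clip_value = 0.01 and the RMSProp learning rate of 5e-5 follow the paper’s defaults; critic, generator, real_loader and z_dim are assumed to be defined elsewhere.

import torch

# Assumed to exist elsewhere: critic, generator, real_loader, z_dim
critic_opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)  # no momentum-based optimizer

n_critic, clip_value = 5, 0.01

for _ in range(n_critic):                      # the critic is updated several times per generator step
    real = next(iter(real_loader))
    z = torch.randn(real.size(0), z_dim, 1, 1)
    fake = generator(z).detach()

    # No cross-entropy / logarithms: the critic maximizes E[f(real)] - E[f(fake)],
    # which we do by minimizing the negated difference.
    loss_critic = -(critic(real).mean() - critic(fake).mean())

    critic_opt.zero_grad()
    loss_critic.backward()
    critic_opt.step()

    # Weight clipping to (crudely) enforce the Lipschitz constraint
    for p in critic.parameters():
        p.data.clamp_(-clip_value, clip_value)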

To enforce Lipschitz continuity, weight clipping is done. Weight clipping comes with its own drawbacks, which are discussed at length in this article. Another method was later proposed to enforce Lipschitz continuity: merge the constraint into the loss function as an extra term. The idea is similar to adding a constraint term in the mechanism of an SVM. The only difference is the Lagrange multiplier: in an SVM it is an optimal parameter found by quadratic programming, whereas in WGAN we simply set it to a constant. This term is called the gradient penalty.

Revised objective in WGAN-GP

The above figure shows the revised loss function. However, by the definition of Lipschitz continuity, we would have to exhaust every possible sample in the joint distribution. The authors claim that we don’t need to consider the whole space at all; in fact, they propose that we just generate interpolations between the two distributions and apply the penalty only to these intermediate samples.
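A sketch of that interpolation-and-penalty step in PyTorch, assuming image-shaped inputs and a penalty weight of 10 as in the WGAN-GP paper:

import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # Penalize the critic's gradient norm at random points between real and fake samples.
    batch_size = real.size(0)
    eps = torch.rand(batch_size, 1, 1, 1, device=real.device)
    interpolated = (eps * real + (1 - eps) * fake).requires_grad_(True)  # the "middle" samples

    scores = critic(interpolated)
    grads = torch.autograd.grad(
        outputs=scores, inputs=interpolated,
        grad_outputs=torch.ones_like(scores),
        create_graph=True, retain_graph=True,
    )[0]

    grad_norm = grads.view(batch_size, -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1) ** 2).mean()  # push the gradient norm towards 1

# In the critic update:
# loss = -(critic(real).mean() - critic(fake).mean()) + gradient_penalty(critic, real, fake)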

The algorithm of WGAN-GP.

The figure above is the revised training process, which was given another name: WGAN-GP. In this version, a momentum-based optimizer can be used to update the model without causing a loss explosion. When updating the critic, the gradient penalty term is added to the loss function.

There is one thing you should remember: batch normalization should not be used in the critic! The penalty constrains the norm of the critic’s gradient with respect to each input individually, but batch normalization makes the critic’s output for one sample depend on the other samples in the batch, which disturbs this per-input mapping. As a result, layer normalization (or another per-sample alternative) is recommended in the critic’s structure. This article was really helpful in detailing the concepts of WGAN and WGAN-GP.

CGANs

Conditional GANs (CGANs) are a much more approachable variation of GANs. As the name implies, CGANs can generate images that satisfy certain conditions or characteristics.

Like both variations of GANs discussed earlier, a CGAN has two components, a Generator and a Discriminator. Where CGANs stand apart from the earlier variants is that both the Generator and the Discriminator receive additional conditioning input, which could be the class of the current image or some other characteristic. This is what makes them much cooler than DCGANs, where we have no control over which class gets generated. That particular issue of DCGANs is addressed in CGANs: we add an additional input layer carrying one-hot-encoded image labels.

Peculiarities of CGANs

  • Adding a vector of features guides and controls the output of the Generator; it helps the Generator figure out what to do.
  • The feature vector should encode the class of the image we intend to generate, or a set of specific characteristics we expect the image to possess.
  • We incorporate this information into the images that will be learnt and also into the z-input, which is no longer purely random.
  • The Discriminator’s evaluation is done not only on the similarity between fake data and original data, but also on the correspondence of the fake image to its input label (or features).
  • We can proceed as we would for a DCGAN, but a condition in the form of a one-hot vector must be imposed on both the Generator and the Discriminator.

Note: CGANs are not strictly unsupervised; they require some kind of label to work.

Schematics for a CGANs

The Discriminator and Generator models for a CGAN are similar to a DCGAN’s; the difference is the one-hot vector used to condition both the Discriminator and the Generator.

A CGANs Architecture
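Here is a minimal sketch of how that conditioning is usually wired in: the one-hot label is simply concatenated to the noise vector on the Generator side and to the (flattened) image on the Discriminator side. The fully connected layers and the 28x28, 10-class sizes are illustrative assumptions, not a prescribed architecture.

import torch
import torch.nn as nn

Z_DIM, N_CLASSES, IMG_DIM = 100, 10, 28 * 28  # e.g. MNIST-sized, 10-class images (assumed)

class CondGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(Z_DIM + N_CLASSES, 256), nn.ReLU(),
            nn.Linear(256, IMG_DIM), nn.Tanh(),
        )
    def forward(self, z, one_hot_label):
        # Condition the noise by concatenating the one-hot label to it
        return self.net(torch.cat([z, one_hot_label], dim=1))

class CondDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + N_CLASSES, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),
        )
    def forward(self, flat_img, one_hot_label):
        # The Discriminator sees both the image and the label it is supposed to match
        return self.net(torch.cat([flat_img, one_hot_label], dim=1))

# label = torch.nn.functional.one_hot(torch.tensor([3]), N_CLASSES).float()
# fake = CondGenerator()(torch.randn(1, Z_DIM), label)  # explicitly ask for class 3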

The Discriminator has two tasks:

  1. Correctly label real images, which are sampled from the training data set, as “real”.
  2. Correctly label images from the Generator as “fake”.

We calculate two losses for the Discriminator; the sum of the fake-image and real-image losses is the overall Discriminator loss. The loss function aims to minimize the error of predicting real images from the training set and fake images from the Generator, given their one-hot labels.

Discriminator’s loss function
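Spelled out, under the standard conditional-GAN formulation (which is what I take the pictured loss to express, with y the one-hot label), the Discriminator loss is

\[
\mathcal{L}_D \;=\; -\,\mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x \mid y)\big] \;-\; \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z \mid y) \mid y)\big)\big]
\]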

The Generator is saddled with only one responsibility: to generate an image that looks real enough to fool the Discriminator. Its loss function aims to minimize the correct predictions of the Discriminator.

Generator’s loss function.
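Correspondingly, in one common (non-saturating) form of the same conditional formulation, the Generator minimizes

\[
\mathcal{L}_G \;=\; -\,\mathbb{E}_{z \sim p_z}\big[\log D\big(G(z \mid y) \mid y\big)\big]
\]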

Discriminator’s Training Flow

Discriminator’s training process

Generator’s Training Flow

Generator’s training process

I found this article very useful while putting together my explanation of CGANs.

Conclusion

I hope this article was in any way useful and not a bore. DCGANs, WGAN and CGANs are all interesting variations of GANs; they all have strengths and weaknesses. I have tried to document all I know about each variant, and I look forward to writing a more code-centric article on each of them, showing how to go about their implementation in Python. All this in the nearest future.


This was my first Medium publication; I would very much appreciate feedback and suggestions on both the subject discussed and anything else that will help me become a better Data Scientist and writer. Feel free to connect with me on LinkedIn. I’d be happy to take any of your questions or Data Science freelancing gigs :)
