DCGAN Under 100 Lines of Code

Himanshu Dutta · Published in The Startup · Jun 25, 2020
Generated vs Real Images

Although Deep Convolutional GANs have been revisited time and again, reading the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks [2] without much context can still be a real challenge. I have always found that reading a research paper and implementing it gives you a better understanding of the topic than anything else. And the idea presented here is a basic but quite interesting implementation of a GAN.

The network is implemented using the PyTorch framework and trained on the MNIST dataset.

Take a look at the image above. Half of the digits are real and the other half were generated by a DCGAN; the two kinds are randomly interspersed across the image.

INTRODUCTION:

Generative Adversarial Networks have been used in all kinds of applications since their introduction in 2014 by Goodfellow et al. [1], from generating photorealistic images and image-to-image translation to high-quality conditional image synthesis and 3D object generation. At the core of many of these implementations lies the idea of the Deep Convolutional GAN, which, in the most basic terms, uses a convolutional architecture to extract the various features present in the data as the Generator and Discriminator networks train. Further, in the original paper, the authors propose using the discriminator for image classification tasks.

REVISITING THE PAPER:

Before getting into the actual implementation, let us first build a little context from the research papers; the code will make much more sense afterward than if we jumped straight into it.

Generative Adversarial Nets

“The generative model can be thought of as analogous to a team of counterfeiters, trying to produce fake currency and use it without detection, while the discriminative model is analogous to the police, trying to detect the counterfeit currency.”

Source: Goodfellow et al. [1]

The paper proposes a two-model network. One model, called the Generator, takes randomly generated noise and tries to turn it into a semantically meaningful representation of the data, such as an image; this output is called fake data. The other network, called the Discriminator, takes data as input and tells whether it is real or fake. The two networks are trained jointly, making the Generator better at producing realistic data and the Discriminator better at telling real data from the fake data produced by the Generator.


In mathematical terms, the aim of the generator is to maximize the loss of the discriminator, i.e., to reduce the probability that its fake images get caught, while the discriminator strives to minimize that loss by learning to tell fake images from real ones.
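Formally, this is the minimax game from the original paper, where D(x) is the discriminator's estimate that x is real and G(z) is the generator's output for noise z:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]$$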

While the GAN was a very novel idea, it had its shortcomings too. A few of them:

  1. Due to its implementation with dense layers, it couldn't capture the underlying patterns in unstructured data, such as images.

“GANs have been known to be unstable to train, often resulting in generators that produce nonsensical outputs.”

Source: Radford et al. [2]

  2. This instability stems from the fact that training GANs suffers from problems such as diminished gradients, non-convergence of parameters, and limited variation among samples.

Because of all these problems, GANs built from simple linear layers forming a dense network weren't of much use. A lot of these problems were addressed with the introduction of Deep Convolutional GANs.

DC-GAN

“We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN)”

Source: Radford et al. [2]

As the name suggests, the idea is to use convolutional and transposed-convolution layers instead of dense layers, along with certain constraints: batch normalization, appropriate activation functions for different layers, a specific choice of optimizer, and upscaling and downscaling layers designed in a prescribed manner. All of these prove invaluable to the overall learning process and help the different convolutional layers capture various important features.

“We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects.”

Source: Radford et al. [2]

Beyond this, the paper shows how the generator's vector arithmetic properties can be used to interpolate between different images. Our main focus here, however, is on building the network itself rather than putting it to use for further tasks.

Some of the key points to keep in mind while implementing the network are as follows:

  1. Images are scaled to the range of the tanh activation function, i.e., -1 to 1.
  2. All weights are initialized from a normal distribution with mean 0 and standard deviation 0.02 (see the sketch after this list).
  3. Wherever the activation function used is LeakyReLU, the negative slope is set to 0.2.
  4. The Adam optimizer is used for training both networks.
  5. The learning rate is set to 2e-4.
  6. The momentum term β1 is set to 0.5 to dampen oscillations.
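To make point 2 concrete, here is a minimal initialization sketch. The paper only prescribes the zero-centered normal with standard deviation 0.02; the separate treatment of batch-norm layers below follows common DCGAN practice and should be read as an assumption:

```python
import torch.nn as nn

def weights_init(m):
    # Zero-centered normal with std 0.02, as prescribed by the DCGAN paper
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, nn.BatchNorm2d):
        nn.init.normal_(m.weight, mean=1.0, std=0.02)  # common convention, not from the paper
        nn.init.zeros_(m.bias)
```

Once the networks are built, this is applied with gen.apply(weights_init) and disc.apply(weights_init).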

Implementation:

The model is trained the same way you would train a regular dense GAN, but with the modifications stated above. The generator now comprises transposed-convolution layers that capture the details needed to sketch images from randomly generated noise, and the discriminator comprises convolutional layers that capture the features discriminating fake data from real.

Libraries & Hyperparameters


Before going any further, I must mention that I made some modifications to the hyperparameters, such as the learning rate and β1. Why? By training the network and watching its performance, these settings were found to work better on MNIST than the ones used in the original paper. Besides, the images in this dataset are of a different dimension than in the paper, and so are the layers used.

We will use PyTorch to train the network, along with torchvision to download the data, apply certain transforms, and load it.

Batch size depends on your system configuration. I always suggest playing with the hyperparameters and checking their effect on performance; this helps you understand their influence and makes the task easier with time. Again, the learning rate was set by experimenting with different values while keeping an eye on performance. ZDIM is the number of dimensions of the random noise to be generated. Since the images in the MNIST dataset are of shape 28x28, IMG_SIZE = (28, 28).
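Putting this together, a minimal sketch of the setup might look like the following; the exact ZDIM, the tuned learning rate, and the epoch count are not stated in the article, so those values are assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

BATCH_SIZE = 128     # mini-batch size used for training
ZDIM = 100           # noise dimensionality (assumed; a common choice)
LR = 2e-4            # paper's value; the article tunes this experimentally
BETA_1 = 0.5         # Adam momentum term, as in the paper
EPOCHS = 20          # assumed; not specified in the article
IMG_SIZE = (28, 28)  # MNIST image dimensions
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```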

Data

Loading the Data

The data is downloaded directly with torchvision datasets and, via transforms, converted to tensors and scaled to the range (-1, 1). The processed data is then passed to a DataLoader with a batch size of 128 so that we can train on mini-batches.
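A minimal sketch of this step, reusing the hyperparameters above:

```python
# Convert images to tensors and scale pixel values from [0, 1] to (-1, 1)
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,)),
])

train_data = datasets.MNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
```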

Generator Model

The generator model comprises a linear layer that maps the ZDIM-dimensional noise to a tensor of size 256*7*7, which is reshaped to (256, 7, 7); different ConvTranspose layers then gradually upscale the noise into semantically sound data. After every convolutional step, batch normalization is applied to make training more robust in terms of learning the parameters.

Generator Network

Except for the last layer, all layers are activated with LeakyReLU, with the negative slope set to 0.01 (again, found by repeated training).

And finally, for the last layer, a tanh activation is used.

Passing in randomly generated noise of the specified dimension, we get a tensor of shape (BATCH_SIZE, 1, 28, 28): BATCH_SIZE images, each with a single channel (the dataset is grayscale) and dimensions 28x28.
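A sketch of such a generator follows. The article does not spell out the intermediate layer configuration, so the channel counts, kernel sizes, and strides here are assumptions chosen to reproduce the stated shapes:

```python
class Generator(nn.Module):
    def __init__(self, zdim=ZDIM):
        super().__init__()
        # Project the noise vector to 256*7*7 features, reshaped in forward()
        self.project = nn.Linear(zdim, 256 * 7 * 7)
        self.net = nn.Sequential(
            nn.BatchNorm2d(256),
            nn.LeakyReLU(0.01),                       # slope 0.01, per the article
            nn.ConvTranspose2d(256, 128, kernel_size=4, stride=2, padding=1),  # 7x7 -> 14x14
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(128, 64, kernel_size=3, stride=1, padding=1),   # 14x14 -> 14x14
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.01),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),     # 14x14 -> 28x28
            nn.Tanh(),                                # last layer: tanh, outputs in (-1, 1)
        )

    def forward(self, z):
        x = self.project(z).view(-1, 256, 7, 7)
        return self.net(x)
```

A quick check: Generator()(torch.randn(4, ZDIM)).shape returns torch.Size([4, 1, 28, 28]), matching the description above.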

Discriminator Model

In contrast, the discriminator model takes in a tensor with the same dimensions as the Generator's output (and IMG_SIZE) and downscales it through several convolutional layers until the input has shape (BATCH_SIZE, 128, 3, 3), i.e., 128 feature maps of size 3x3 for each image. A Flatten layer then turns this into a vector of size 128x3x3.

Discriminator Network

As before, batch normalization follows each convolutional layer, and the activation function is LeakyReLU everywhere except the final dense layer. That layer's output is passed through a sigmoid activation, producing a single value indicating whether the image is real or fake.
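A matching sketch of the discriminator; as with the generator, the kernel sizes and strides are assumptions chosen so that the final convolution yields the stated (BATCH_SIZE, 128, 3, 3) shape:

```python
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
            nn.BatchNorm2d(32),
            nn.LeakyReLU(0.01),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.01),
            nn.Conv2d(64, 128, kernel_size=3, stride=2),            # 7x7 -> 3x3
            nn.BatchNorm2d(128),
            nn.LeakyReLU(0.01),
            nn.Flatten(),                      # (N, 128, 3, 3) -> (N, 128*3*3)
            nn.Linear(128 * 3 * 3, 1),
            nn.Sigmoid(),                      # single probability: real vs fake
        )

    def forward(self, x):
        return self.net(x)
```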

Training

As in the paper, we use the Adam optimizer, with β1 changed from its default value to the paper's 0.5. The optimizers for the two networks are initialized separately.

Training Loop

Now, at each training step, we sample a batch of real data and label it 1, indicating that the images are real. Then, using the generator, we generate fake data and label it 0. The generator's input for this fake data is a random sample drawn from a normal distribution.

The loss function used is binary cross-entropy, for the simple reason that there are just two classes and a single output node whose binary value represents them. For the generator, the loss comes from the discriminator catching its fake images when it wanted them to pass as real. The discriminator, on the contrary, has to minimize the probability of classifying fake images as real and vice versa, so both of its outputs are passed to the loss function and the two losses are added.
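A condensed sketch of that training loop, under the same assumptions as the snippets above:

```python
gen = Generator().to(DEVICE)
disc = Discriminator().to(DEVICE)
gen.apply(weights_init)
disc.apply(weights_init)

criterion = nn.BCELoss()  # binary cross-entropy over real/fake labels
opt_g = torch.optim.Adam(gen.parameters(), lr=LR, betas=(BETA_1, 0.999))
opt_d = torch.optim.Adam(disc.parameters(), lr=LR, betas=(BETA_1, 0.999))

for epoch in range(EPOCHS):
    for real, _ in train_loader:
        real = real.to(DEVICE)
        n = real.size(0)
        ones = torch.ones(n, 1, device=DEVICE)    # labels for real images
        zeros = torch.zeros(n, 1, device=DEVICE)  # labels for fake images

        # Discriminator step: real images should score 1, generated ones 0
        noise = torch.randn(n, ZDIM, device=DEVICE)
        fake = gen(noise)
        d_loss = criterion(disc(real), ones) + criterion(disc(fake.detach()), zeros)
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator step: try to get the fake batch classified as real
        g_loss = criterion(disc(fake), ones)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
```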

Notes

A few things about the implementation part:

  1. The code was run on a GPU rather than a CPU, and it still takes a significant amount of time, so running it on a GPU is strongly suggested.
  2. Some utility functions aren't shown in the snippets above; they can be found in the repository mentioned below.
  3. Some hyperparameters differ from the original implementation in the research paper.

Conclusion:

DC-GAN has a wide variety of applications, and it acts as a building block for understanding many other GAN architectures, StackGAN being one of them.

And if you have read this far, let me tell you: at the very beginning of this article, I lied.

All the digits in the first figure were fake and were generated by the trained model.

The same architecture, with small changes, can be used to train on other datasets, such as CelebA.

All the code, additional utility functions and the entire notebook can be found at: GitHub

References & Further Reading:

[1]: Goodfellow et al., Generative Adversarial Networks (2014), https://arxiv.org/abs/1406.2661

[2]: Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks (2015), https://arxiv.org/abs/1511.06434

[3]: GANs in Action, Chapter 4 (DCGAN notebook), https://github.com/GANs-in-Action/gans-in-action/blob/master/chapter-4/Chapter_4_DCGAN.ipynb
