DCGAN, cGAN, and SAGAN on the CIFAR-10 dataset

Shruti Bendale
Published in Analytics Vidhya
Mar 31, 2020
Images from the CIFAR-10 dataset. Image Source: http://cs231n.github.io/classification/

In my last blog, I talked about Generative Adversarial Networks. Today, I’ll be talking about Deep Convolutional GANs, Conditional GANs and Self-Attention GANs and how I implemented these models on the CIFAR-10 dataset.

Deep Convolutional Generative Adversarial Network:

DCGAN introduced a series of architectural guidelines with the goal of stabilizing the GAN training. It advocates for the use of strided convolutions instead of pooling layers. Moreover, it uses batch normalization (BN) for both generator and discriminator nets. Finally, it uses ReLU and Tanh activations in the generator and leaky ReLUs in the discriminator.

DCGAN Architecture. Source

In DCGANs, the generator is composed of a series of transposed convolution operations. These operations take in a random noise vector, z, and transform it by progressively increasing its spatial dimensions while decreasing its feature depth. The discriminator is essentially a convolutional neural network whose task is to classify an image as real or fake. We feed the images produced by the generator to the discriminator along with the real images, and the discriminator classifies each image as real or fake. We then calculate the discriminator and generator losses and backpropagate them to improve both networks.

Model:

The generator uses Conv2DTranspose (upsampling) layers to produce an image from a seed (random noise). We start with a Dense layer that takes this seed as input, then upsample several times until we reach the desired image size of 32x32x3. We use the LeakyReLU activation for each layer and the tanh activation for the last layer.

The discriminator is a CNN-based image classifier. We use the discriminator to classify the generated images as real or fake. The model will be trained to output positive values for real images, and negative values for fake images.

Our generator and discriminator architecture for implementing DCGAN is as follows:

Generator network (left) & Discriminator network (right)
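A minimal tf.keras sketch of such a pair of networks is shown below. The filter counts and kernel sizes are illustrative assumptions on my part rather than a transcription of the diagram above.

import tensorflow as tf
from tensorflow.keras import layers

def make_generator(noise_dim=100):
    # Project the noise vector to a small spatial volume, then upsample to 32x32x3 with
    # strided transposed convolutions; tanh keeps the outputs in [-1, 1].
    return tf.keras.Sequential([
        layers.Dense(4 * 4 * 256, use_bias=False, input_shape=(noise_dim,)),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Reshape((4, 4, 256)),
        layers.Conv2DTranspose(128, 4, strides=2, padding='same', use_bias=False),  # 8x8
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(64, 4, strides=2, padding='same', use_bias=False),   # 16x16
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(3, 4, strides=2, padding='same', activation='tanh'), # 32x32x3
    ])

def make_discriminator():
    # Strided convolutions instead of pooling, per the DCGAN guidelines; the final Dense
    # layer outputs a single raw logit (positive for real, negative for fake).
    return tf.keras.Sequential([
        layers.Conv2D(64, 4, strides=2, padding='same', input_shape=(32, 32, 3)),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Conv2D(128, 4, strides=2, padding='same'),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(1),
    ])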

We use the Binary Crossentropy loss function to calculate the losses for the generator and the discriminator.
The Adam optimizer is used for both the generator and the discriminator. We use the two-timescale update rule (TTUR) from the SAGAN paper, setting the learning rate of the generator to 0.0001 and that of the discriminator to 0.0004.
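In tf.keras this setup amounts to a few lines; the beta_1 value below is my own assumption, following the common DCGAN setting.

import tensorflow as tf

# From-logits binary cross-entropy, since the discriminator outputs a raw score.
cross_entropy = tf.keras.losses.BinaryCrossentropy(from_logits=True)

# TTUR: the discriminator gets a larger learning rate than the generator.
generator_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001, beta_1=0.5)
discriminator_optimizer = tf.keras.optimizers.Adam(learning_rate=0.0004, beta_1=0.5)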

Training:

In the training loop, we iterate over the CIFAR-10 training images in batches of 64. The generator produces images from random noise, and these generated images are passed to the discriminator along with the real images.

The discriminator and generator losses are calculated with a short snippet of code.
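A minimal sketch of that snippet, assuming the from-logits BinaryCrossentropy object defined above:

import tensorflow as tf

def discriminator_loss(real_output, fake_output):
    # Real images should be scored as 1, generated images as 0.
    real_loss = cross_entropy(tf.ones_like(real_output), real_output)
    fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
    return real_loss + fake_loss

def generator_loss(fake_output):
    # The generator wants the discriminator to score its images as real (1).
    return cross_entropy(tf.ones_like(fake_output), fake_output)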

I trained the model for around 3200 epochs and logged the generator and discriminator losses at every epoch. We also calculated the FID score every 10 epochs. The plots of the losses and of the FID scores over the 3200 epochs can be seen in the following images.

Plot of the losses for the model over 3200 epochs (left); plot of the FID scores for the model over 3200 epochs (right)
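The FID score compares the Inception feature statistics of real and generated images. A small sketch of how such a score can be computed from precomputed InceptionV3 pool features (the helper name is mine; feature extraction is assumed to happen beforehand):

import numpy as np
from scipy import linalg

def fid_score(real_feats, fake_feats):
    # real_feats / fake_feats: InceptionV3 pool features of real and generated images, shape (N, 2048).
    mu_r, sigma_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu_g, sigma_g = fake_feats.mean(axis=0), np.cov(fake_feats, rowvar=False)
    # Matrix square root of the product of covariances; drop tiny imaginary numerical noise.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(sigma_r + sigma_g - 2.0 * covmean))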

Results:

The images generated by the DCGAN at epoch 3190, where we obtained the best FID score of 69.09, are as follows:

Images generated by the DCGAN

Mode collapse:

The DCGAN was observed to generate many similar images, as can be clearly seen in the image below: the highlighted images look very similar because of mode collapse.

Mode collapse in DCGAN

Self-Attention Generative Adversarial Network:

The Self-Attention GAN (SAGAN) is another modification of the basic GAN. Its architecture allows the generator to model long-range dependencies across the image; the key idea is to enable the generator to produce samples whose details are globally consistent.

Architecture:

Self Attention Generative Adversarial Network. Source

We add the attention layer after a convolutional layer in both the generator and the discriminator. That convolutional layer outputs feature maps of dimension (height x width x channels).

Given the output features L of a convolutional layer, the first step is to transform L into three different representations. We convolve L with 1x1 convolutions to get three feature spaces: f, g, and h. The feature maps f and g have a smaller depth than h: they use 8 times fewer convolutional filters. We use f and g to compute the attention: we multiply them together (a matrix multiplication over the flattened spatial locations) and feed the result into a softmax layer. The tensor obtained from this operation is the ‘attention map’.

Why use attention?

Traditional deep convolutional GANs are unable to capture long-range dependencies in images. These conventional GANs work well for images that do not contain a lot of structural and geometric information. They fail to represent global relationships faithfully. These non-local dependencies consistently appear in certain classes of images. For example, GANs can draw animal images with realistic fur, but often fail to draw separate feet.

In SAGAN, the self-attention module works in conjunction with the convolution network and uses the key-value-query model (Vaswani et al., 2017). This module takes the feature map created by the convolutional network and transforms it into three feature spaces. These feature spaces, called key f(x), value h(x), and query g(x), are created by passing the original feature map through three different 1x1 convolutions. The key f(x) and query g(x) matrices are then multiplied, and the softmax operation is applied to each row of the result. The attention map generated by the softmax identifies which areas of the image the network should attend to.
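As a sketch, such a self-attention block can be written as a custom tf.keras layer along these lines; the filter ratio follows the description above, while the remaining details (initialisation, naming) are illustrative assumptions rather than the exact layer used here.

import tensorflow as tf
from tensorflow.keras import layers

class SelfAttention(layers.Layer):
    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        # f (key) and g (query) use channels // 8 filters; h (value) keeps the full depth.
        self.f = layers.Conv2D(channels // 8, 1)
        self.g = layers.Conv2D(channels // 8, 1)
        self.h = layers.Conv2D(channels, 1)
        # Learnable scale, initialised to 0 so the block starts out as an identity mapping.
        self.gamma = self.add_weight(name='gamma', shape=(), initializer='zeros')

    def call(self, x):
        batch = tf.shape(x)[0]
        height, width = tf.shape(x)[1], tf.shape(x)[2]
        n = height * width  # number of spatial locations

        f = tf.reshape(self.f(x), (batch, n, self.channels // 8))  # keys
        g = tf.reshape(self.g(x), (batch, n, self.channels // 8))  # queries
        h = tf.reshape(self.h(x), (batch, n, self.channels))       # values

        # Attention map: how much each location attends to every other location.
        attention = tf.nn.softmax(tf.matmul(g, f, transpose_b=True), axis=-1)  # (batch, n, n)
        out = tf.matmul(attention, h)                                          # (batch, n, channels)
        out = tf.reshape(out, (batch, height, width, self.channels))

        # Residual connection scaled by gamma, as in the SAGAN paper.
        return self.gamma * out + x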

Model:

The SAGAN architecture is similar to the DCGAN architecture but with a custom ‘Attention layer’ added after the ‘conv2d_transpose_4’ layer in the generator and after the ‘conv2d_11’ layer in the discriminator.

I experimented with both the hinge loss (as specified in the SAGAN paper) and the Binary Crossentropy loss function to calculate the losses for the generator and the discriminator.
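For reference, the hinge losses from the SAGAN paper can be written like this (a minimal sketch, assuming the discriminator outputs raw, unbounded scores):

import tensorflow as tf

def d_hinge_loss(real_output, fake_output):
    # Hinge loss for the discriminator: push real scores above +1 and fake scores below -1.
    real_loss = tf.reduce_mean(tf.nn.relu(1.0 - real_output))
    fake_loss = tf.reduce_mean(tf.nn.relu(1.0 + fake_output))
    return real_loss + fake_loss

def g_hinge_loss(fake_output):
    # The generator simply tries to maximise the discriminator's score for its samples.
    return -tf.reduce_mean(fake_output)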

I also experimented with using the RMSprop optimizer and Adam optimizer for the generator and the discriminator.

I found that my model gives the best outputs and the best FID scores when the BCE loss is combined with the Adam optimizer. We again use the two-timescale update rule (TTUR) from the SAGAN paper, setting the learning rate of the generator to 0.0001 and that of the discriminator to 0.0004.

Training:

After training the SAGAN model for 200 epochs, we obtained the following graphs for the losses and FID scores.

Plot of the losses for the model over 200 epochs (left); plot of the FID scores for the model over 200 epochs (right)

Results:

The images generated by the SAGAN at epoch 198, where we obtained the best FID score of 84.83, are as follows:

Images generated by the SAGAN

Conditional Generative Adversarial Network:

In a GAN, creation starts from white noise. However, in the real world, what is required may be a form of transformation, not creation: take, for example, the colorization of black-and-white images, or the conversion of aerial photographs to maps. For applications like these, we condition on an additional input, hence the name: conditional adversarial networks.

This means that the generator is passed not (or not only) white noise, but data of a certain input structure, such as edges or shapes. It then has to generate realistic-looking pictures of real objects having those shapes. The discriminator, too, may receive the shapes or edges as input, in addition to the fake and real objects it is tasked to tell apart.

As a cGAN conditions the output data distribution on an additional input y, in the objective function of the GAN, log(1 − D(G(z))) and D(x) are replaced by log(1 − D(G(z|y))) and D(x|y). The rest is taken care of by the respective networks, i.e., creating latent representations and managing weights. The main objective remains the same here, with small modifications:
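Written out (following Mirza & Osindero's conditional GAN formulation), the objective becomes:

\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[\log D(x \mid y)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z \mid y))\right)\right]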

Conditional Self Attention Generative Adversarial Network:

We modify our SAGAN model to include an additional input, y, on which the model can be conditioned. The CIFAR-10 dataset contains the label associated with each image. We convert this label to a one-hot encoding and use it to condition the generated images on the class.
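A small sketch of how that conditioning input can be built and attached to the noise vector (the helper name and dimensions are illustrative):

import tensorflow as tf

NUM_CLASSES = 10
NOISE_DIM = 100

def make_generator_input(labels):
    # labels: integer class ids of shape (batch_size,).
    # One-hot encode the CIFAR-10 class labels and concatenate them with the noise vector,
    # so the generator sees both the random seed and the desired class.
    batch_size = tf.shape(labels)[0]
    noise = tf.random.normal([batch_size, NOISE_DIM])
    one_hot = tf.one_hot(labels, depth=NUM_CLASSES)
    return tf.concat([noise, one_hot], axis=-1)  # shape: (batch_size, NOISE_DIM + NUM_CLASSES)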

Model:

The generator and discriminator architectures for implementing the conditional SAGAN are as follows:

Training:

We pass random labels to the generator along with our noise vector. The generator concatenates these two inputs and generates images. While training the discriminator, we pass it the generated images together with the random labels that were given to the generator, producing the ‘fake outputs’. We also train the discriminator on real images with their true labels, producing the ‘real outputs’. We then calculate the losses and backpropagate them to the generator and the discriminator.
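Putting that together, one training step can look roughly like this. The sketch reuses the loss functions and optimizers from the DCGAN section and the make_generator_input helper above, and it assumes a discriminator built to take the image and the one-hot label as two inputs.

import tensorflow as tf

@tf.function
def train_step(real_images, real_labels):
    # real_labels: integer class ids of shape (batch_size,).
    batch_size = tf.shape(real_images)[0]
    # Random class labels for the generator's fake samples.
    fake_labels = tf.random.uniform([batch_size], 0, NUM_CLASSES, dtype=tf.int32)

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        fake_images = generator(make_generator_input(fake_labels), training=True)

        real_output = discriminator([real_images, tf.one_hot(real_labels, NUM_CLASSES)], training=True)
        fake_output = discriminator([fake_images, tf.one_hot(fake_labels, NUM_CLASSES)], training=True)

        gen_loss = generator_loss(fake_output)
        disc_loss = discriminator_loss(real_output, fake_output)

    # Backpropagate each loss into its own network.
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)
    generator_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))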

Results:

The images generated by the cSAGAN at epoch 96, where we obtained the best FID score of 92.28, are as follows:

Images generated by the cSAGAN

In Conclusion…

The following table summarizes the results of all the models in terms of the FID metric (lower is better):

Model     Best FID score     Epoch of best FID
DCGAN     69.09              3190
SAGAN     84.83              198
cSAGAN    92.28              96

Other Experiments:

I tried implementing the spectral normalization mechanism with a WGAN but the images generated by the generator did not improve over time and the FID score wasn’t decreasing.

I also tried implementing the Wasserstein loss with the SAGAN, but the generator loss was very low (deep in the negatives) and the model didn’t seem to learn over time. Each training iteration also took a long time to complete.
