Conditional GANs (CGANs) with codes explained

Training CGANs on a multi-class image dataset

Mehul Gupta
Data Science in your pocket
10 min read · Feb 10, 2023


After covering

Generative AI overview (pre-requisite for this post)

Variational AutoEncoders

Vanilla GANs (pre-requisite for this post)

CycleGANS

PixelCNN

In my 105th post, I explore Conditional GANs (CGANs) alongside their implementation in Python on the Shoe vs Sandal vs Boot dataset. So, let's first of all understand:

Why do CGANs exist?

For all the GANs/generative models we have explored so far (except CycleGANs, which are used for image-to-image translation), one thing in common is that we don't have any control over the output image.

For example: if you train a Vanilla GAN, PixelCNN, or VAE on the MNIST dataset, once the Generator is trained you don't have the option to ask it to generate images of digit '6' or '7'; the model randomly generates images from any one of the classes. So, if you wish to generate an image of a specific digit, you would have to iterate through multiple outputs until you finally reach the desired class. The problem becomes even worse with datasets like face datasets, where you may wish to specify a pointed nose, long hair, and other attributes to generate images with specific properties.

Iterating over generated images till you find your desired one isn't a feasible solution. Wouldn't it be great if you could simply tell the GAN what you want: "generate a '9' or a '2' for me"? Then Generative AI could be of some serious use.

Conditional GANs do exactly this!!

They take input from the user about which class to generate an image for, and they generate an image belonging to that class. Voila!!

Now that we have enough motivation, let's deep dive into how CGANs work.

In Vanilla GANs, we have a Generator-Discriminator pair where the Generator tries to generate fake images from random noise, trying to fool the Discriminator into believing they are real, while the Discriminator's role is to identify whether an image is fake or real.

Very similar to this, CGANs also have a Generator-Discriminator pair, but with a difference:

  • The generator takes random noise + a class-label as input to generate an image belonging to that particular class.

So, if the label passed is '2', the generator tries to generate an image for class '2' and assigns it the label '2' before passing it to the Discriminator.

  • The Discriminator needs to train on two aspects: 1) detect whether the image is real or fake, and 2) which class it belongs to. If either criterion goes wrong (e.g. a real image but with the wrong label, or vice versa), the Discriminator should classify the sample as fake (0), and otherwise as real (1). Do remember, this still remains a binary classification problem for the discriminator, as we will be embedding the label information in the image itself; and though the discriminator learns about the different classes, it's not because of the loss function. How? We'll talk about it shortly.

Hence, to become a pro liar, the Generator should be able to make the Discriminator believe that the image it has generated is 1) real and 2) belongs to the same class as the label the generator passed to the discriminator.

The rest of the process remains the same as with Vanilla GANs: the Generator and Discriminator train together, where, on one hand, the Discriminator tries to reduce the final binary cross-entropy loss while the Generator tries to increase it, making it an adversarial system.
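
For reference, this is just the standard GAN min-max objective with the class-label y added as a conditioning input to both networks (as in the original CGAN formulation):

min_G max_D V(D, G) = E_{x ~ p_data(x)} [log D(x | y)] + E_{z ~ p_z(z)} [log(1 − D(G(z | y)))]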

Wait a minute, I have missed an important point:

How are we able to attach the label information (usually an integer/category) to image samples (a 2D/3D array)?

It's a classic case of multimodal modeling, where we need to handle different data types in the same problem statement (as we did in my last post explaining how DALL-E works). In this case, we will be:

  • Generating an embedding for the label category.

  • Expanding the embedding into an N*N 1D array, where N is the image dimension, using a Dense layer. So, assuming an embedding size of 50 for the labels (as in the code below), a Dense layer expands it to 28x28 = 784, where N = 28.

  • Reshaping this expanded array into a 2D image.

  • Adding this 2D image as a new channel to the corresponding image.

So, we eventually convert the label information into a 2D image and integrate it as a new channel of the corresponding image, embedding the label information within the image itself. Simple!
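
Here is a minimal Keras sketch of that idea (the sizes are illustrative; the actual generator below does the same thing at 7x7 before upsampling, and the discriminator at 28x28):

from tensorflow.keras.layers import Input, Embedding, Dense, Reshape, Concatenate

in_label = Input(shape=(1,))                 # integer class-label, e.g. 0, 1 or 2
lbl = Embedding(3, 50)(in_label)             # label -> 50-dim embedding
lbl = Dense(28 * 28)(lbl)                    # expand to N*N values
lbl = Reshape((28, 28, 1))(lbl)              # reshape into a 1-channel 28x28 "image"

in_image = Input(shape=(28, 28, 1))          # the actual image
merged = Concatenate()([in_image, lbl])      # 28x28x2: the label is now an extra channel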

Training flow

  1. Take a labeled multi-class image dataset (a binary classification dataset should also work). These will be our ‘real’ images.
  2. Generate a ‘fake’ dataset using the generator, with random noise + random class-labels as input. The random labels are integrated with the random noise, helping the generator produce images for a particular class.
  3. Mix samples from the ‘real’ and ‘fake’ datasets. Label ‘real’ samples as 1 and fake ones as 0.
  4. This mixed dataset is fed to the Discriminator, which tries to detect fake vs real samples.
  5. Once training is done, the discriminator is discarded and only the generator is used for generating images for specific classes.

Don't get confused between class-labels and final labels. Class-labels have no role in the final loss function; only the final labels (0 & 1, i.e. fake vs real) are used in the binary cross-entropy loss. Class-labels are forwarded to the discriminator so that they can be integrated with the image, helping the Discriminator distinguish between different classes.
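
To make that distinction concrete, here is a tiny, made-up illustration of the two kinds of labels:

import numpy as np

# class-labels: which category a sample (supposedly) belongs to -> fed INTO the networks as input
class_labels = np.array([0, 2, 1, 0])   # e.g. Boot, Shoe, Sandal, Boot (illustrative mapping)

# final labels: real vs fake -> the actual target of the binary cross-entropy loss
y_real = np.ones((4, 1))                # a batch of real images
y_fake = np.zeros((4, 1))               # a batch of generated (fake) images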

In short, CGANs differ from Vanilla GANs only in that we pass label information to both the generator and the discriminator, and the generation process is guided by this meta information.

Time for some action !!

We will be trying CGANs hands-on over the Shoe vs Sandal vs Boot dataset, which has 5k images for each category.

Sample images for the categories

If you wish to follow the whole demo, do download the dataset and unzip it.

  1. Importing required libraries
import numpy as np
import numpy.random as R
from tensorflow.keras.datasets.fashion_mnist import load_data
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Input, Dense, Reshape, Flatten, Conv2D, Conv2DTranspose
from tensorflow.keras.layers import LeakyReLU, Dropout, MaxPooling2D, Embedding, Concatenate
import matplotlib.pyplot as plt
from tqdm import tqdm
import glob
from PIL import Image
import tensorflow as tf

2. Next, we create a dictionary assigning every label an integer value from 0 to N-1, where N is the total number of classes. Just to be safe, I got the paths of all the files and extracted the label names from the paths themselves (I know it's overkill).

files = glob.glob('archive/Shoe vs Sandal vs Boot Dataset/*')
#windows path
labels = list(set([x.split('\\')[1] for x in files]))
label_dict = {y:x for x,y in enumerate(labels)}
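
For reference, label_dict will end up looking something like {'Boot': 0, 'Sandal': 1, 'Shoe': 2}; the exact integer assignment can vary, since set() doesn't guarantee an order.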

3. Preparing a ‘real’ dataset

def load_real_samples():
    x_train, y_train = [], []
    for x, y in label_dict.items():
        files = glob.glob('train/{}/*'.format(x))
        for file in files:
            # resize to 28x28 and convert to single-channel grayscale
            x_train.append(np.asarray(Image.open(file).resize((28,28)).convert('L')))
        y_train.extend([y for _ in range(len(files))])
    x_train = np.array(x_train).astype('float32').reshape(-1, 28, 28, 1)
    # scale pixels to [-1, 1] so real images match the generator's tanh output range
    x_train = (x_train - 127.5) / 127.5
    return [x_train, np.array(y_train)]

dataset = load_real_samples()

What's happening there?

Looping over all the labels, we read image files from the corresponding folder (the dataset unzips into multiple folders, with each folder corresponding to one label).

As the original images are quite big, we standardize them to 28x28 single-channel images and scale the pixel values to [-1, 1] so they match the generator's tanh output.

We append these images to the training dataset, along with the class-label integer mapping in y_train, and return both.
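
A quick sanity check (assuming all 15k images were read correctly) should print something like this:

print(dataset[0].shape, dataset[1].shape)
# expected: (15000, 28, 28, 1) (15000,)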

Next, let's define a function to get a batch of ‘real’ images with label=1 for the whole batch.

def generate_real_samples(dataset, n_samples):
    images, labels = dataset

    # pick n random samples
    ix = R.randint(0, images.shape[0], n_samples)
    X, labels = images[ix], labels[ix]

    # observe how class-labels are returned alongside the binary label (1 = real)
    y = np.ones((n_samples, 1))
    return [X, labels], y
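
For example, pulling a half-batch of 64 real samples would look like this (the shapes in the comment are what we expect given the dataset above):

[X_real, class_labels], y_real = generate_real_samples(dataset, 64)
# X_real: (64, 28, 28, 1), class_labels: (64,), y_real: (64, 1) filled with 1s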

Time to define our generator network

latent_dim = 100

def build_generator(latent_dim, n_classes=3):
    in_label = Input(shape=(1,))

    # Class-label embedding
    label = Embedding(n_classes, 50)(in_label)
    n_nodes = 7 * 7

    # Expanding class-label embedding
    label = Dense(n_nodes)(label)

    # Converting the flat array into a 2d image
    label = Reshape((7, 7, 1))(label)

    in_latent = Input(shape=(latent_dim,))
    n_nodes = 128 * 7 * 7

    # Expanding the random noise vector and reshaping it into a 7x7x128 image
    gen = Dense(n_nodes)(in_latent)
    gen = LeakyReLU(alpha=0.2)(gen)
    gen = Reshape((7, 7, 128))(gen)

    # Adding the class-label 2d image to this random noise image
    merge = Concatenate()([gen, label])

    # Creating features in the image by upsampling
    gen = Conv2DTranspose(128, (4,4), strides=(2,2), padding='same')(merge)
    gen = LeakyReLU(alpha=0.2)(gen)
    gen = Conv2DTranspose(128, (4,4), strides=(2,2), padding='same')(gen)
    gen = LeakyReLU(alpha=0.2)(gen)

    # Converting the final feature map into a 1-channel image
    out_layer = Conv2D(1, (7,7), activation='tanh', padding='same')(gen)
    model = Model([in_latent, in_label], out_layer)
    return model

generator = build_generator(latent_dim)
generator.summary()

The generator, in short, is doing the following:

Converting the class-label to a 2D image

Converting random noise to a multi-channel image

Attaching the class-label 2D image to this random-noise image

Upsampling using Conv2DTranspose

We need to define 2 more functions, one for a ‘fake’ dataset and the other for generating a latent vector to be fed to the generator to generate fake images.

def generate_latent_vector(latent_dim, n_samples, n_classes=3):
    # 'n_samples' random noise vectors of size 'latent_dim'
    x_input = R.randn(latent_dim * n_samples)
    z_input = x_input.reshape(n_samples, latent_dim)
    # random class-labels, one per noise vector
    labels = R.randint(0, n_classes, n_samples)
    return [z_input, labels]

def generate_fake_samples(generator, latent_dim, n_samples):
    z_input, labels_input = generate_latent_vector(latent_dim, n_samples)
    images = generator.predict([z_input, labels_input])
    # binary label 0 = fake
    y = np.zeros((n_samples, 1))
    return [images, labels_input], y

The generate_latent_vector function does nothing but:

Generate ‘n’ random noise vectors

Assign random class-labels to these random vectors

The generate_fake_samples function has the following role:

Generate latent vectors with class-labels using the generate_latent_vector function

Generate images using the generator and the above-generated latent vectors

Assign label=0 to all these samples (as this is the fake dataset, as already discussed).
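
As a quick (untrained) smoke test, you can run the fake pipeline end to end and check the shapes:

[X_fake, fake_class_labels], y_fake = generate_fake_samples(generator, latent_dim, 8)
print(X_fake.shape, fake_class_labels.shape, y_fake.shape)
# expected: (8, 28, 28, 1) (8,) (8, 1); pixel values lie in [-1, 1] because of the tanh output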

Time to define Discriminator (a binary classifier)

def build_discriminator(in_shape=(28,28,1), n_classes=3):
    # Class-label -> embedding -> Dense -> 2d "image" channel
    in_label = Input(shape=(1,))
    label = Embedding(n_classes, 50)(in_label)
    n_nodes = in_shape[0] * in_shape[1]
    label = Dense(n_nodes)(label)
    label = Reshape((in_shape[0], in_shape[1], 1))(label)

    # Concatenate the label channel with the actual image
    in_image = Input(shape=in_shape)
    merge = Concatenate()([in_image, label])

    # Shallow CNN for real vs fake classification
    disc = Conv2D(128, (3,3), strides=(2,2), padding='same')(merge)
    disc = LeakyReLU(alpha=0.2)(disc)
    disc = Conv2D(128, (3,3), strides=(2,2), padding='same')(disc)
    disc = LeakyReLU(alpha=0.2)(disc)
    disc = Flatten()(disc)
    disc = Dropout(0.4)(disc)
    out_layer = Dense(1, activation='sigmoid')(disc)

    model = Model([in_image, in_label], out_layer)
    opt = Adam(learning_rate=0.0002, beta_1=0.5)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    return model

discriminator = build_discriminator()
discriminator.summary()

The Discriminator is doing the following tasks:

Converting the class-label to a 2D image and integrating it with the actual image passed in; this is what helps the discriminator identify which class the passed image should belong to

Running a shallow baseline CNN for binary classification of real vs fake over the merged images

Time to combine everything and define our CGAN

def build_cgan(g_model, d_model):
    # Freeze the discriminator's weights for the combined (generator) update
    d_model.trainable = False
    gen_noise, gen_label = g_model.input
    gen_output = g_model.output
    # Feed the generator's output image + class-label into the discriminator
    gan_output = d_model([gen_output, gen_label])
    model = Model([gen_noise, gen_label], gan_output)
    opt = Adam(learning_rate=0.0002, beta_1=0.5)
    model.compile(loss='binary_crossentropy', optimizer=opt)
    return model

cgan = build_cgan(generator, discriminator)
cgan.summary()

This CGAN architecture just combines everything: it takes a random vector along with the class-label as input and produces the final label (0/1) as output. The discriminator is set to trainable=False because we train it separately in the training loop below, alongside the CGAN.

Training begins

batch_size = 128
epochs = 10
batch_per_epo = int(dataset[0].shape[0] / batch_size)
half_batch = int(batch_size / 2)

for i in range(epochs):
    for j in tqdm(range(batch_per_epo)):
        # generate a real half-batch and train the discriminator on it
        [X_real, labels_real], y_real = generate_real_samples(dataset, half_batch)
        d_loss1, _ = discriminator.train_on_batch([X_real, labels_real], y_real)

        # generate a fake half-batch and train the discriminator on it
        [X_fake, labels], y_fake = generate_fake_samples(generator, latent_dim, half_batch)
        d_loss2, _ = discriminator.train_on_batch([X_fake, labels], y_fake)

        # train the generator through the combined CGAN (discriminator frozen), target label = 1
        [z_input, labels_input] = generate_latent_vector(latent_dim, batch_size)
        y_gan = np.ones((batch_size, 1))
        g_loss = cgan.train_on_batch([z_input, labels_input], y_gan)
    print('>Loss Discriminator: {}, {} , Generator: {}'.format(d_loss1, d_loss2, g_loss))

The above training loop is doing the following things

Generating real and fake batches and training the discriminator on them

Generating latent vectors and class-labels to train the CGAN (i.e. the generator)

How the CGAN training step works requires special attention:

While training CGANs, we are passing the latent vector and final labels=1. Why? Because

  • Discriminator is (assumed) already trained as its training is not related to CGAN in the code snippet. Hence, actual real images aren’t required while training CGANs as
  • Hence, the ideal case for CGANs is whatever input it gives, the Discriminator gives output=Real (label=1). Hence, the true labels passed are all 1s. Hence, we wish to train the Generator (as discriminator training is stopped while training the whole CGAN architecture and acting as an ideal image classifier) in such a way that the discriminator’s output is always 1 (hence it classifies every image by Generator as real) hence the generator gets trained to produce real looking images !!

Take a minute to digest the above statement.
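
One optional step that isn't in the loop above: since the discriminator gets discarded and only the generator is needed afterwards, you may want to persist the trained generator, for example:

# optional: save the trained generator so it can be reloaded later without retraining
generator.save('cgan_generator.h5')
# later: generator = tf.keras.models.load_model('cgan_generator.h5')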

Last part: visualizing the images generated by the trained CGAN. As I ran this whole demo locally, I couldn't train the CGAN extensively, so the results may look a bit average. But if trained for a longer duration on a GPU, the results will surely improve. See the results after training for the 10 epochs above.

def generated_plot(examples, n):
    fig, ax = plt.subplots(1, n, figsize=(12,12))
    for index, x in enumerate(examples):
        # squeeze the (28,28,1) array to (28,28) so imshow accepts it
        ax[index].imshow(np.squeeze(x), cmap='gray')
        ax[index].axis('off')
    plt.show()

# latent vector generator with a specific class-label (instead of a random class)
def generate_latent_vector_class(latent_dim, n_samples, class_label):
    x_input = R.randn(latent_dim * n_samples)
    z_input = x_input.reshape(n_samples, latent_dim)
    labels = np.array([class_label for _ in range(n_samples)])
    return [z_input, labels]

# code to generate images for class=0; repeat with class_label 1 and 2 for the other categories
latent_vectors, labels = generate_latent_vector_class(100, 10, 0)
generated = generator.predict([latent_vectors, labels])
# rescale from [-1, 1] (tanh output) to [0, 1] for plotting
generated = (generated + 1) / 2.0
generated_plot(generated, 10)

The generated images aren't perfect, but they're still encouraging:

Generated samples: sandals, shoes, boots

Do observe the variety the model is able to generate for each category.

That's a wrap; I hope to cover something exciting next week. A big thanks to this book for a lovely tutorial on CGANs.
