Creating a DC-GAN for TPU using a custom train function in TensorFlow

Why GAN?

Dr. Siju G.Chacko, MBBS, PhD
4 min read · Aug 12, 2022

Generative Adversarial Networks (GANs) are among the more advanced machine learning algorithms in use today; they are conceptually simple yet computationally intensive, which makes them the right candidate for hardware acceleration. In a GAN, two models are trained simultaneously by an adversarial process. A generator (“the artist”) learns to create images that look real, while a discriminator (“the art critic”) learns to tell real images apart from fakes.

(Please follow this TensorFlow page to understand GANs further and to run a DC-GAN in your Colab.)

The development of any GAN involves training at least two models simultaneously for many iterations (because of the adversarial process). Because of this increased computational demand, it is imperative that we identify strategies to reach the optimum generative model in the least amount of computational time. One can always improve training time through optimum hyper-parameter values, the right model architecture, data augmentation, and increased computational resources. Increasing computational resources is not feasible for everyone, but if you are using Google Colab to train your models, you can enable hardware accelerators from the notebook settings to improve your computational power.

There are mainly two choices for hardware acceleration in Google Colab: GPU or TPU. Using GPUs is straightforward in Colab, as any Keras model with the latest TensorFlow should automatically identify and make optimum use of your GPU. When it comes to TPUs, you will need to modify your code to get optimum performance. You will face a similar challenge when trying to use multiple GPUs connected to the same machine.

Why use TPU?

TPUs are custom-designed, application-specific processing units for accelerating machine learning workloads. TPUs can be 8-30 times faster than GPUs in Google Colab. With the example code in this article, we achieved more than a 3x speed improvement compared to GPU. If you opt for Google Cloud TPUs instead of a GPU-based compute engine, you will also find a better cost-to-performance ratio.

Strategy

In Google Colab, you will be allocated a cloud TPU cluster, which needs to be connected and initialized before use. The cluster_resolver helps us connect to these cloud TPUs from Google Colab, while TPUStrategy helps us run the model in parallel on the different TPU devices (these individual units/nodes are called replicas). You usually get 8 replicas for TPUs in Google Colab. Use tf.distribute.MirroredStrategy() instead if you want to run the model across multiple GPUs connected to the same machine. You can also follow this link to read further on connecting to TPUs and using them in your Google Colab. Following is the code for connecting the TPUs to your Google Colab:
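(This sketch follows the standard TPU setup from the TensorFlow guide; the empty tpu='' argument lets Colab locate its own TPU runtime.)

```python
import tensorflow as tf

# Locate and connect to the cloud TPU cluster attached to this Colab runtime
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# The strategy object distributes computation across the TPU replicas
strategy = tf.distribute.TPUStrategy(resolver)
print('Number of replicas:', strategy.num_replicas_in_sync)
```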

The scope of strategy

As multiple computational units are involved in the algorithm and its gradient calculations, the processes need to be distributed correctly across the cluster. These are handled well by the TPUStrategy, which is saved in the strategy variable. You just need to define the relevant code within strategy.scope(). This includes defining the models, the loss functions, the train_step (more specifically, the distributed train_step), and the optimizers, as sketched below.
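As a rough skeleton (the builder names and Adam learning rates here are placeholders, filled in over the next sections):

```python
with strategy.scope():
    # Anything that creates tf.Variables (models, optimizers, metrics)
    # must be constructed inside the strategy scope so the variables
    # are replicated across all TPU replicas.
    generator = make_generator_model()          # defined in the next section
    discriminator = make_discriminator_model()  # defined in the next section
    generator_optimizer = tf.keras.optimizers.Adam(1e-4)
    discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)
```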

Define and generate the models

Now let us create the two models for the GAN: one that generates the images (the generator) and one that critiques the generator's output (the discriminator). Defining the models does not differ from defining a model for CPU or GPU, but the model initialization has to happen within strategy.scope(), as shown below.
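The sketch below follows the TensorFlow DCGAN tutorial architecture for 28×28 grayscale images; your exact layers may differ. The builders are then called inside strategy.scope(), as in the skeleton in the previous section.

```python
from tensorflow.keras import layers

def make_generator_model():
    # Upsample a 100-dim noise vector to a 28x28x1 image
    return tf.keras.Sequential([
        layers.Dense(7 * 7 * 256, use_bias=False, input_shape=(100,)),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Reshape((7, 7, 256)),
        layers.Conv2DTranspose(128, (5, 5), strides=1, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(64, (5, 5), strides=2, padding='same', use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU(),
        layers.Conv2DTranspose(1, (5, 5), strides=2, padding='same',
                               use_bias=False, activation='tanh'),
    ])

def make_discriminator_model():
    # Downsample an image to a single real/fake logit
    return tf.keras.Sequential([
        layers.Conv2D(64, (5, 5), strides=2, padding='same', input_shape=(28, 28, 1)),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Conv2D(128, (5, 5), strides=2, padding='same'),
        layers.LeakyReLU(),
        layers.Dropout(0.3),
        layers.Flatten(),
        layers.Dense(1),
    ])
```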

Defining the loss functions

As each batch of data is split across the TPU replicas, we need to ensure that the losses from the different replicas are aggregated correctly. Use tf.nn.compute_average_loss to make sure the per-example losses are averaged over the global batch size rather than the per-replica batch. The final aggregation across replicas is then done in the distributed_train_step.
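A sketch of the two loss functions, assuming the binary cross-entropy setup from the DCGAN tutorial. Note that the built-in reduction is disabled so that tf.nn.compute_average_loss can do the averaging over the global batch size:

```python
with strategy.scope():
    # Reduction must be NONE: we average per-example losses ourselves
    cross_entropy = tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

    def discriminator_loss(real_output, fake_output, global_batch_size):
        # Real images should be classified as 1, fakes as 0
        real_loss = cross_entropy(tf.ones_like(real_output), real_output)
        fake_loss = cross_entropy(tf.zeros_like(fake_output), fake_output)
        return tf.nn.compute_average_loss(
            real_loss + fake_loss, global_batch_size=global_batch_size)

    def generator_loss(fake_output, global_batch_size):
        # The generator wants its fakes to be classified as 1 (real)
        per_example_loss = cross_entropy(tf.ones_like(fake_output), fake_output)
        return tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=global_batch_size)
```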

Defining the custom train_step and distributed train_step

In a GAN, two models learn simultaneously. The discriminator has to learn to distinguish fake (generated) images from real (dataset) images, so real images labeled as ones and fake images labeled as zeros guide the discriminator's gradients (please refer to the discriminator loss function above). The generator, on the other hand, is guided by how well its generated images fooled the discriminator into classifying them as real, which is why an inverted loss function is defined for the generator. GradientTape handles the gradient calculation of each model on every replica.
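A sketch of such a train_step (NOISE_DIM is an assumed latent size; GLOBAL_BATCH_SIZE is defined with the dataset below):

```python
NOISE_DIM = 100  # assumed latent vector size

def train_step(inputs):
    # Each replica receives its own shard of the global batch
    images, _labels = inputs  # labels unused here; kept for an AC-GAN variant
    noise = tf.random.normal([tf.shape(images)[0], NOISE_DIM])

    with tf.GradientTape() as gen_tape, tf.GradientTape() as disc_tape:
        generated_images = generator(noise, training=True)

        real_output = discriminator(images, training=True)
        fake_output = discriminator(generated_images, training=True)

        gen_loss = generator_loss(fake_output, GLOBAL_BATCH_SIZE)
        disc_loss = discriminator_loss(real_output, fake_output, GLOBAL_BATCH_SIZE)

    # Each tape computes gradients for its own model
    gen_grads = gen_tape.gradient(gen_loss, generator.trainable_variables)
    disc_grads = disc_tape.gradient(disc_loss, discriminator.trainable_variables)

    generator_optimizer.apply_gradients(zip(gen_grads, generator.trainable_variables))
    discriminator_optimizer.apply_gradients(zip(disc_grads, discriminator.trainable_variables))
    return gen_loss, disc_loss
```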

The function strategy.run() is called to ensure the training step runs on the different replicas in sync. This requires a dataset that has been prepared for distribution across the replicas. The function strategy.reduce() can then be used to reduce the multiple values coming from the different replicas into a single value.
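The distributed step might look like this. Because the per-replica losses were already scaled by the global batch size in tf.nn.compute_average_loss, a SUM reduction yields the true global average:

```python
@tf.function
def distributed_train_step(dist_inputs):
    # Run one training step on every replica in sync
    per_replica_gen, per_replica_disc = strategy.run(train_step, args=(dist_inputs,))
    gen_loss = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_gen, axis=None)
    disc_loss = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_disc, axis=None)
    return gen_loss, disc_loss
```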

Defining the dataset for learning

Creating a dataset for TPU is similar to creating a dataset for GPU or CPU. After you have created your dataset, you have to distribute it to the replicas. Once again, TPUStrategy takes care of this via strategy.experimental_distribute_dataset().
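A sketch using MNIST as a stand-in dataset (drop_remainder=True keeps batch shapes static, which TPUs require):

```python
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

(train_images, train_labels), _ = tf.keras.datasets.mnist.load_data()
train_images = train_images.reshape(-1, 28, 28, 1).astype('float32')
train_images = (train_images - 127.5) / 127.5  # normalize to [-1, 1] for tanh output

train_dataset = (tf.data.Dataset.from_tensor_slices((train_images, train_labels))
                 .shuffle(60000)
                 .batch(GLOBAL_BATCH_SIZE, drop_remainder=True))

# Split every global batch into per-replica shards
dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
```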

In the dataset code above, the image labels are also kept. You can use them if you are building an AC-GAN.

Putting it all together:
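A minimal training loop over the distributed dataset might look like this:

```python
import time

EPOCHS = 50

def train(dist_dataset, epochs):
    for epoch in range(epochs):
        start = time.time()
        total_gen_loss = total_disc_loss = 0.0
        num_batches = 0
        for batch in dist_dataset:
            gen_loss, disc_loss = distributed_train_step(batch)
            total_gen_loss += float(gen_loss)
            total_disc_loss += float(disc_loss)
            num_batches += 1
        print(f'Epoch {epoch + 1}: '
              f'gen_loss={total_gen_loss / num_batches:.4f}, '
              f'disc_loss={total_disc_loss / num_batches:.4f}, '
              f'time={time.time() - start:.1f}s')

train(dist_dataset, EPOCHS)
```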

I hope this information helps you kick-start your GAN project on TPU. Please see the GitHub gist for the full code for training your GAN on TPU.

[Figure: a few generated images from various epochs while training the above GAN model]

You can download the code from the GitHub repository for running DC-GAN on TPUs.

References:

  1. https://arxiv.org/abs/1406.2661
  2. https://www.tensorflow.org/guide/tpu

