How to Work with Tensor Processing Unit (TPU) in Google Colaboratory on Tensorflow 2

Maxim Vakurin
Deelvin Machine Learning
3 min read · Jun 16, 2020

When I decided to try the Tensor Processing Unit (TPU) in Google Colaboratory, I ran into a number of unforeseen problems, so I decided to catalogue them for future users. If you are a regular user, or a new user planning to start a project with the TPU in Google Colab, please keep reading: I have something useful to share with you. The TPU in Google Colab is also a good place to start for general-interest readers, since in the future you may well rent an instance with a TPU to train large models.

🤖TPU

Google introduced the TPU in 2016. The third version, the TPU v3 (also offered in Pod configurations), has recently been released. Compared to a GPU, a TPU is designed for a higher volume of computation at lower numerical precision. To better understand the difference between GPU and TPU, see the useful demo site made by the Google Cloud team, which animates how each of them performs its calculations.

Tensor Processing Units (TPU)
Image source: https://ko.com.ua/files/u5101/cloud-tpu-1.jpg

🛠Preparation

We will train the SRGAN model with a custom training loop. All of the source code is on GitHub. We will work with the DIV2K dataset, which you can access here. You can also import this dataset using TensorFlow; the downside of that approach is that you would skip a couple of important points in how we prepare the data.

To work with the TPU, you need to complete the following steps:

● Create a dataset in TFRecord format (code); a minimal sketch is given after this list.

● Important! Upload the dataset to the Google Cloud Platform, specifically to a Cloud Storage bucket (the Bucket section). It won’t work otherwise: by design, the TPU reads training data from GCS rather than from local Colab storage.
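A minimal sketch of serializing images to TFRecord, assuming the simplest layout where each record stores one encoded image; the helper names are illustrative, and the full version is in the linked code.

import tensorflow as tf

def _bytes_feature(value):
    # Wrap raw bytes in a tf.train.Feature.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def write_tfrecord(image_paths, output_path):
    # Write each encoded image into one Example of the TFRecord file.
    with tf.io.TFRecordWriter(output_path) as writer:
        for path in image_paths:
            image_bytes = tf.io.read_file(path).numpy()
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': _bytes_feature(image_bytes),
            }))
            writer.write(example.SerializeToString())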

🚀Highlights

1. Check whether a TPU is enabled for your Colab notebook. If not, you can change it via Edit -> Notebook settings -> Hardware accelerator -> TPU.

2. Initialize the TPU and choose a distribution strategy.
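A minimal initialization sketch (for TensorFlow 2.2, where the strategy still lives under tf.distribute.experimental):

import tensorflow as tf

# Auto-detect the Colab TPU and connect to it.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# TPUStrategy replicates the computation across the 8 TPU cores.
strategy = tf.distribute.experimental.TPUStrategy(resolver)
print('Number of replicas:', strategy.num_replicas_in_sync)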

3. To work with the TPU, create a bucket on the Google Cloud Platform and upload the dataset to it. To access GCP from Colab, authenticate your user account.
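In Colab the authentication is a single call:

from google.colab import auth

# Authenticate this Colab session so it can read from / write to your GCS bucket.
auth.authenticate_user()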

4. Perform the general configuration. Set the global batch size: with a per-replica batch size of 4 and 8 TPU cores, the global batch size is 32. Specify the location of the dataset on GCP.
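For example (the bucket path below is a placeholder):

# 4 images per replica on 8 TPU cores -> a global batch of 32.
BATCH_SIZE_PER_REPLICA = 4
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

# Location of the TFRecord files in the GCS bucket (placeholder name).
GCS_PATH = 'gs://your-bucket-name/div2k_tfrecords'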

5. Next, parallelize the computation. Note that inside the strategy scope you usually want to create the models, the loss and the optimizers. In this example there are two models: the generator and the discriminator.
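A sketch of what goes under the scope; build_generator and build_discriminator stand in for the SRGAN constructors from the repository:

with strategy.scope():
    # Both SRGAN networks are created inside the scope so their variables
    # are mirrored across the TPU replicas.
    generator = build_generator()
    discriminator = build_discriminator()

    generator_optimizer = tf.keras.optimizers.Adam(1e-4)
    discriminator_optimizer = tf.keras.optimizers.Adam(1e-4)

    # Per-example losses; reduction over the global batch is done manually.
    bce = tf.keras.losses.BinaryCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)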

6. Create the dataset so that it is distributed according to the strategy.

train_dist_ds_img = strategy.experimental_distribute_dataset(get_training_dataset_images())

7. Here we use a custom training loop. With strategy.run you execute the training step on every replica, and inside it the losses are computed. (Prior to TensorFlow 2.2, this was strategy.experimental_run_v2(train_step, args=(images, labels)).)

🛑Note: For models without labels, pass dummy labels as a separate argument so that strategy.run works. (Example: labels = tf.convert_to_tensor(np.zeros((1, 16))).)
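A minimal sketch of the distributed loop; train_step, EPOCHS and the assumption that each dataset element is an (images, labels) pair are simplified placeholders for the SRGAN step in the repository:

@tf.function
def distributed_train_step(images, labels):
    # Runs train_step on every replica and sums the per-replica losses.
    per_replica_losses = strategy.run(train_step, args=(images, labels))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for epoch in range(EPOCHS):
    for images, labels in train_dist_ds_img:
        loss = distributed_train_step(images, labels)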

🌄Results

After 400 epochs (~30 min) we can get results like this:

Results
