How to use Colab TPUs with large datasets (for almost free)

Nima Wickramasinghe
6 min read · Dec 1, 2023

Google has been offering free TPUs to train your machine-learning models for a long time. But unless your whole training dataset fits in RAM, you have to go through quite a process to get your data to the TPU.

For those who don’t know about TPUs: like GPUs, they speed up your ML training by parallelizing work, but a TPU is roughly 10x faster than your average GPU (no, I won’t cite this).

Google provides example code for connecting to TPUs from Google Colab here. However, the problem arises when your data is too large to fit into RAM. This becomes a huge pain!

Usually, what we do is read the data from local storage at run time and process it. PyTorch provides its Dataset class for this, and TensorFlow has its tf.data API (a minimal sketch of the PyTorch version is shown below). However, you may have tried this and found that storing the data on the Colab machine does not work and leads to more errors (or very low speeds). To understand why, let’s look at how you are connected to the Colab machine and the TPU.
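
For context, here is roughly what that usual local-storage approach looks like in PyTorch. This is my own sketch, not code from the original post; the file paths, labels, and the one-.npy-file-per-sample layout are assumptions you would replace with your own loading logic.

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class LocalDiskDataset(Dataset):
    """Reads one sample from local storage every time it is requested."""
    def __init__(self, file_paths, labels):
        self.file_paths = file_paths  # hypothetical: a list of .npy files on the Colab disk
        self.labels = labels

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        x = np.load(self.file_paths[idx])  # hits the local disk at run time
        return torch.from_numpy(x), torch.tensor(self.labels[idx])

# Typical usage, once you have real file_paths and labels:
# loader = DataLoader(LocalDiskDataset(file_paths, labels), batch_size=128, shuffle=True)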

My understanding of how everything is connected

When training with large datasets, the data has to be loaded from the hard disk. Since a compute unit such as a TPU can do billions of calculations very quickly, the speed bottleneck usually lies in transferring data to the compute device. In short, the TPU stays idle until the data arrives.

The best possible option is to load the data directly onto the TPU machine, but Colab doesn’t give you that kind of access (you can do it using TPU VMs in GCP, though). So the next best thing is to load the data into Google Cloud Storage buckets, the cloud storage mechanism offered by GCP. Now, buckets are not free, but you get 300 USD in free credits if you are a first-time user, and even if you pay, it’ll be very cheap (I’m from Sri Lanka, and I know what actually cheap means).

Create your bucket in GCP by following this guideline. I used the unique name “medium_test_bucket” here and removed the tick from “Enforce public access prevention on this bucket” to make it easily accessible later. I kept everything else at the defaults. (If you prefer the command line, a rough equivalent is sketched below.)
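
If you would rather create the bucket from a Colab cell than click through the console, something like the following should work. This is my own sketch, not from the original post; the project ID and region are placeholders you need to replace.

# Authenticate the Colab session with your Google account (needed for gsutil)
from google.colab import auth
auth.authenticate_user()

# Create the bucket; replace the project ID and pick a region close to your TPU
!gsutil mb -p your-gcp-project-id -l us-central1 gs://medium_test_bucket/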

Next, do the following to give all users public access. (This is probably not the best thing to do: anyone who gets hold of your bucket name will have access to it. However, granting permissions any other way becomes a pain, so make your own call here. There may well be a way to keep the bucket accessible only to you via some authorization method.)

Click the three dots next to your bucket and go to “Edit access”. Next, click “Add principal”, as shown here.
Type ‘allUsers’ under new principals, assign the Storage Admin role under Cloud Storage, and save. (The same change can be made from the command line; see the sketch below.)
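
The command-line equivalent, in case the console UI changes on you, would be something like this (my own sketch; I believe gsutil accepts the fully qualified role name, but double-check against the current gsutil docs):

# Grant the Storage Admin role on the bucket to everyone, i.e. make it public
!gsutil iam ch allUsers:roles/storage.admin gs://medium_test_bucket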

So now that we have a place to store the data, let’s look at the next hurdle: how do we actually load it? The best pipeline for feeding data to TPUs is the tf.data API. However, to use it, you need to convert your data into the .tfrecord format. Let’s look at that now.

Conversion of your dataset into a set of tfrecord files

Why convert a large number of files (say, 1 million images) into a small number of files (say, 100 tfrecord files)? A large number of small files takes a long time to transfer. At the same time, one huge file wouldn’t work either, as it can’t be loaded directly into memory. Therefore, a small number of large files, where any single file fits in memory, is ideal.

Let’s look at how we can do this in code in Colab.

# Imports
import numpy as np
import tensorflow as tf

# Create a random dataset with x as inputs (1000 samples of 16x16 images) and y as labels.
# Of course, you won't be able to load your whole dataset into a NumPy array like this;
# that's the reason we are doing all of this in the first place.
# You will have to load each of your data samples on demand inside the writing loop at the end.
x = np.random.rand(1000, 16, 16)
y = np.random.randint(0, 2, 1000)
print('x shape:', x.shape)
print('x datatype:', type(x[0][0][0]))
print()
print('y shape:', y.shape)
print('y datatype:', type(y[0]))
print()

# Function to convert a value to a tf.train.Feature. Nothing to change here.
# (TF also provides int64_list and float_list features if you prefer those.)
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a value from an EagerTensor, so convert first
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

# Function to serialize one sample to a tf.train.Example. Make sure you add everything you
# need to the dictionary here: IDs, labels, inputs, anything else that's important.
def serialize_example(x, y):
    feature = {
        'x': _bytes_feature(x),
        'y': _bytes_feature(y)
    }
    example_proto = tf.train.Example(features=tf.train.Features(feature=feature))
    return example_proto.SerializeToString()

# Creating .tfrecord files
num_shards = 10
size_of_shard = len(x) // num_shards

for j in range(num_shards):
    print('Writing TFRecord {} of {}...'.format(j, num_shards))
    actual_size_of_shard = min(size_of_shard, len(x) - j * size_of_shard)
    with tf.io.TFRecordWriter('{}_{}.tfrec'.format(j, actual_size_of_shard)) as writer:
        for k in range(actual_size_of_shard):
            ## Edit from here
            x_ = x[size_of_shard * j + k].tobytes()
            y_ = y[size_of_shard * j + k].tobytes()
            ## to here
            example = serialize_example(x_, y_)
            writer.write(example)

# Copy the .tfrec files from Colab to your bucket. Edit your bucket location here.
!gsutil -m cp -r *.tfrec gs://medium_test_bucket/tfrec_files/

Hopefully you can use the template above to convert your own dataset to the .tfrecord format and transfer the files to your bucket in GCP. (An optional sanity check on the upload is sketched below.)
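
As a quick check (my own addition, not part of the original workflow), you can list the uploaded shards and count the records straight from the bucket. This assumes the public-access setup from earlier, so the files can be read without extra credentials.

import tensorflow as tf

# List the uploaded shards and count the serialized examples across them
files = tf.io.gfile.glob('gs://medium_test_bucket/tfrec_files/*.tfrec')
num_records = sum(1 for _ in tf.data.TFRecordDataset(files))
print(len(files), 'shards,', num_records, 'records')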

Next, let’s see how to create the tf.data pipeline for training.

## Make sure to change the runtime to a TPU

## Necessary imports
import tensorflow as tf
import keras
from keras import layers

## Initialize TPU. Nothing to change here
print("Tensorflow version " + tf.__version__)
try:
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
    print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])
except ValueError:
    raise BaseException('ERROR: Not connected to a TPU runtime; please see the previous cell in this notebook for instructions!')
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
tpu_strategy = tf.distribute.TPUStrategy(tpu)
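
# (Optional sanity check, not in the original snippet) A Colab TPU v2-8 should report 8 replicas
print('Number of replicas:', tpu_strategy.num_replicas_in_sync)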

## Create Model. Change your model architecture here.
def create_model():
    model = keras.Sequential(
        [
            layers.Input((16,16,1)),
            layers.Flatten(),
            layers.Dense(32),
            layers.Dense(1, activation='sigmoid')
        ]
    )
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

with tpu_strategy.scope():
    model = create_model()
model.summary()

## Parse function. This is where the tfrecord examples are converted back to arrays.
## You can do any preprocessing you need inside the parse function as well.
def parse(serialized, x_shape=(16,16,1), y_shape=(1,)):

    features = {'x': tf.io.FixedLenFeature([], tf.string), 'y': tf.io.FixedLenFeature([], tf.string)}
    parsed_example = tf.io.parse_single_example(serialized=serialized, features=features)

    x = parsed_example['x']
    y = parsed_example['y']

    # The dtypes here must match the NumPy dtypes used when writing the records:
    # np.random.rand gives float64 and np.random.randint gives int64 on Colab.
    x = tf.io.decode_raw(x, tf.float64)
    x = tf.reshape(x, shape=x_shape)

    y = tf.io.decode_raw(y, tf.int64)
    y = tf.reshape(y, shape=y_shape)

    return x, y

# Input pipeline. It's worth looking into why each of these steps is done.
AUTOTUNE=tf.data.AUTOTUNE
shards = tf.io.matching_files('gs://medium_test_bucket/tfrec_files/*.tfrec')
shards = tf.random.shuffle(shards)
shards = tf.data.Dataset.from_tensor_slices(shards)
dataset = shards.interleave(lambda x: tf.data.TFRecordDataset(x), num_parallel_calls=AUTOTUNE)
dataset = dataset.shuffle(buffer_size=500)
dataset = dataset.map(parse, num_parallel_calls=AUTOTUNE)
dataset = dataset.batch(128)
dataset = dataset.prefetch(AUTOTUNE)

# model training
model.fit(dataset, epochs=2)
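
One extra knob worth knowing about (my suggestion, not part of the original pipeline): TPUs compile for static shapes, so if you ever hit shape-related errors caused by a smaller final batch, dropping that last partial batch usually helps. You would replace the batch(128) line above with:

# Keep every batch the same static shape on the TPU by dropping the final partial batch
dataset = dataset.batch(128, drop_remainder=True)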

There you go. That’s how you train a model on a large dataset that doesn’t fit in RAM, using TPUs with the help of Google Cloud Storage buckets.
