TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

GPUs Are Fast! Datasets Are Your Bottleneck

3 min read · May 19, 2021


Bad data practices WILL slow down your training (Photo credit: pixabay)

If you’re using machine learning or deep learning, you’ve likely obsessed over making sure all your code can run on GPUs or, for the brave souls, even TPUs.

I hate to be the bearer of bad news, but your models are likely already pretty well optimized for GPUs (especially if you’re using a framework like PyTorch Lightning, which lets you switch between GPUs and CPUs with no code changes).

The real culprit is data throughput. Data can become a bottleneck in subtle ways you may not be aware of.

Transforms on CPUs

When dealing with data for deep learning, you are very likely transforming your inputs in some way. For instance, if you work with images, you probably apply a transformation pipeline to them.

For example, here’s a typical computer vision pipeline:

Image transforms (Author’s own)
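(The figure showed this pipeline as code; here is a representative per-image version in torchvision. The specific transforms are illustrative, not the exact ones from the figure.)

from torchvision import transforms

# Representative per-image pipeline (illustrative choices).
# Each transform runs on the CPU, one image at a time.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])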

The dirty secret is that today these transforms are applied one input at a time on the CPU… This makes them super slow. If your model allows it, you can apply the same transforms to a whole batch of data at once on the GPU.

Kornia is a library that helps you run these transforms in GPU memory.
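Here is a minimal sketch of the batched-on-GPU approach with Kornia (the transform choices and the CUDA device are assumptions for illustration):

import torch
import torch.nn as nn
import kornia.augmentation as K

# Illustrative transforms; swap in whatever your pipeline needs.
augment = nn.Sequential(
    K.RandomHorizontalFlip(p=0.5),
    K.ColorJitter(0.2, 0.2, 0.2, 0.2),
).to("cuda")

images = torch.rand(64, 3, 224, 224, device="cuda")  # a whole batch, already on GPU
augmented = augment(images)  # transforms run on the full batch at once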

Disk throughput

The next major bottleneck is how fast you can read samples from disk. It doesn’t matter if you have the world’s fastest model and GPUs… your data still has to travel from disk -> CPU -> GPU.

Inside a GPU machine (Author’s own)

So, that disk on your machine? It had better be a really good SSD.

In the cloud this becomes even more relevant… If you start running dozens of models on cloud machines that all need the same data under the hood, you’re going to have to build a fairly optimized system for it. Plain S3 is simply not going to cut it when training at scale!

When running cloud jobs on AWS via Grid AI, you can use Grid datastores, which are optimized for running at scale. Simply create a datastore:
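(The original post showed this step as a screenshot. The sketch below is an assumption about the CLI syntax of the time, not verified; check the Grid docs for the exact subcommand and flags.)

# Sketch only: subcommand and flag names are assumptions.
grid datastore create --source ./imagenet_folder --name imagenet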

Grid will then make sure it stays optimized as models run at scale (even across hundreds of GPUs and many simultaneous models).

Then, to run your code on the cloud, simply use grid run:
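(As above, this is a sketch; the datastore flags are assumptions rather than verified syntax.)

# Sketch only: flag names are assumptions.
grid run --datastore_name imagenet --datastore_version 1 train.py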

Simultaneous data loading

Preloading data using num_workers (Author’s own)

Another place prone to mistakes is the class that loads your dataset into your machine learning code. PyTorch has a DataLoader abstraction that takes care of batching and preloading data. However, to take full advantage of it, make sure to set the num_workers argument, which will preload batches for you in the background.

The question is always: what should num_workers be? Usually it’s some multiplier of the number of CPUs on your machine.
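Wiring this up looks something like the sketch below (train_dataset is assumed to be any torch Dataset you have defined; the num_workers choice is a starting heuristic, not a rule):

import os
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_dataset,               # assumed: any torch.utils.data.Dataset
    batch_size=64,
    shuffle=True,
    num_workers=os.cpu_count(),  # starting point; tune per machine
    pin_memory=True,             # speeds up CPU -> GPU transfers
)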

If you use a framework like PyTorch Lightning, it will even recommend a num_workers value for you!



Written by William Falcon

⚡️PyTorch Lightning Creator • PhD Student, AI (NYU, Facebook AI research).
