How to preprocess data using NVTabular on multiple GPUs?

Radek Osmulski
Apr 13, 2023


Multi-GPU machines are becoming much more common, and training deep learning models across multiple GPUs is something that is often discussed.

But what about preprocessing data by leveraging the resources of more than a single GPU?

NVTabular makes this super easy! Here is how to hit the ground running.

The first step to harnessing the power of multiple GPUs

The first thing you need to do is start a LocalCUDACluster, which is provided by the RAPIDS Dask-CUDA library. (This blog post discusses other types of clusters that can be initiated; however, for a single machine with multiple GPUs, LocalCUDACluster is the way to go.)

A LocalCUDACluster starts one worker per GPU and pools together all the resources we have available. It equips our workers with mechanisms for device-to-host memory spilling and (optionally) enables NVLink- and InfiniBand-based inter-process communication.

The following code will instantiate the cluster for us:
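A minimal sketch of that code, assuming the dask_cuda and dask.distributed packages are installed on a GPU-equipped machine:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One worker process is started per visible GPU; the Client connects
# our session to the cluster's scheduler.
cluster = LocalCUDACluster()
client = Client(cluster)
client  # in a notebook, the repr shows the worker count and dashboard URL
```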

Not only does this start our cluster, but we can now also access the diagnostic dashboard at the URL printed in the output!

Device-to-Host Memory Spilling

What are other benefits that a LocalCUDACluster offers?

It enables the workers to move data between device memory and host memory, and between host memory and disk, to avoid out-of-memory (OOM) errors. To set the threshold for device-to-host spilling, a specific byte size can be specified with device_memory_limit. Do note, though, that since the worker can only consider the size of input data and previously finished task outputs, this limit must be set lower than the actual GPU memory capacity.
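For example, on a 32 GB GPU we might set the limit to roughly 80% of capacity to leave headroom for task outputs the worker cannot account for (the exact value is an assumption; tune it for your workload):

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Start spilling device memory to host once a worker tracks ~26 GB,
# well below the 32 GB physical capacity of each GPU.
cluster = LocalCUDACluster(device_memory_limit="26GB")
client = Client(cluster)
```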

Initializing Memory Pools

Last but not least, let’s initialize the memory pools.

Since allocating memory is often a performance bottleneck, it is usually a good idea to initialize a memory pool on each of our workers. When using a distributed cluster, we can use Client.run to make sure a function is executed on all available workers.

This is the code we need to run:
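A sketch of what that looks like, using the RAPIDS Memory Manager (RMM) and assuming the client created earlier:

```python
import rmm
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

def _rmm_pool():
    # Executed once on each worker: carve out an RMM memory pool so that
    # subsequent allocations come from the pool rather than cudaMalloc.
    rmm.reinitialize(pool_allocator=True, initial_pool_size=None)

# Client.run executes the function on every worker in the cluster.
client.run(_rmm_pool)
```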

Running data preprocessing

And that is really all that we have to do!

From there on, we can use the familiar NVTabular API, and the library will automatically leverage the cluster for data storage and processing!
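As an illustrative sketch of such a workflow (the column names and file paths here are hypothetical, not from the original benchmark):

```python
import nvtabular as nvt
from nvtabular import ops

# Define a simple preprocessing graph; with a LocalCUDACluster running,
# NVTabular distributes the work across all of its workers.
cat_features = ["user_id", "item_id"] >> ops.Categorify()
cont_features = ["price"] >> ops.Normalize()
workflow = nvt.Workflow(cat_features + cont_features)

dataset = nvt.Dataset("data/*.parquet", engine="parquet")
workflow.fit(dataset)
workflow.transform(dataset).to_parquet("processed/")
```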

Here is the performance we got when running preprocessing of a modestly sized 20GB dataset across up to 8 GPUs of a DGX-1 system.

We see the biggest performance jump when going from 1 to 2 GPUs; larger gains can likely be expected when running on bigger datasets.


In order to run NVTabular on multiple GPUs, the only change we need to make to our workflow is to instantiate a LocalCUDACluster. Thereafter, all the steps are the same whether we are using a single GPU or many!

NVTabular picks up on the existence of a LocalCUDACluster and is able to use it without any additional input.

Thank you for reading, hope you found this article useful! You can find more details here, but the code is no longer maintained. For a further discussion of NVTabular and its capabilities, please see the examples here.
