Reading Larger-than-Memory CSVs with RAPIDS and Dask

Nick Becker
Published in RAPIDS AI · Oct 22, 2020 · 2 min read

Sometimes, it’s necessary to read in files that are larger than a single GPU’s memory. Within RAPIDS, Dask cuDF makes this easy, whether we plan to analyze these files on one GPU or spread our analysis across multiple GPUs for even faster results.

On a workstation with an Intel Xeon Gold 6128 CPU (12 logical cores), reading a 100 million row, 58 GB CSV file with a single NVIDIA Quadro RTX 8000 GPU was 2–3x faster than reading it with the equivalent Dask CPU cluster. Reading the same file with two Quadro RTX 8000 GPUs was 3–4x faster.
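For reference, the CPU baseline looks something like the sketch below. This is a minimal illustration, not the exact benchmark code: the file path and blocksize are placeholder choices, and LocalCluster simply spans the machine’s cores by default.

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster, wait

# CPU baseline: a local Dask cluster using the workstation's cores
client = Client(LocalCluster())

# "data.csv" and the 256 MB blocksize are placeholder choices
df = dd.read_csv("data.csv", blocksize="256MB")

# Force the full read into (CPU) memory so the timing is comparable
wait(df.persist())
```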

In the following Jupyter notebooks, we demonstrate how you can run larger-than-memory workflows on a single GPU and how you can distribute your work across multiple GPUs. All you need to do is spin up a LocalCUDACluster and you’re off to the races.
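Here is roughly what that looks like. This is a minimal sketch, assuming dask_cuda and dask_cudf are installed; the file path is a placeholder, and note that older RAPIDS releases named read_csv’s partition-size argument chunksize rather than blocksize.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# Spin up a local cluster: one Dask worker per visible GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Lazily define the read; nothing is loaded until a computation runs.
# "data.csv" is a placeholder path.
ddf = dask_cudf.read_csv("data.csv", blocksize="2GB")

print(ddf.head())  # reads just the first partition to preview the data
```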

Single GPU, Larger-than-Memory CSV

In this notebook, we:

  • Spin up a LocalCUDACluster with a single GPU
  • Read our 58 GB file into memory in 2 GB chunks
  • Do one set of complex operations, illustrating how Dask automatically spills from GPU to CPU memory when we would otherwise run out of memory (sketched after this list)
  • Compare reading data with Dask on GPUs to Dask on CPUs
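A condensed sketch of the single-GPU flow might look like the following. The device_memory_limit threshold and the column names are illustrative assumptions, not values from the notebook; crossing that threshold is what prompts Dask to spill partitions from GPU to host memory.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One GPU, with spilling enabled: once device memory use crosses the
# limit, Dask moves partitions from GPU memory out to host (CPU) memory.
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0",
    device_memory_limit="30GB",  # assumed threshold; tune to your GPU
)
client = Client(cluster)

# Read the 58 GB file in roughly 2 GB partitions ("data.csv" is a placeholder)
ddf = dask_cudf.read_csv("data.csv", blocksize="2GB")

# A groupby aggregation materializes intermediate results, which would
# exceed a single GPU's memory without spilling.
# "key" and "value" are hypothetical column names.
result = ddf.groupby("key")["value"].mean().compute()
```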

Multiple GPUs, Large CSV

In this notebook, we:

  • Spin up a LocalCUDACluster with two GPUs
  • Read our 58 GB file into memory in 3 GB chunks, highlighting how Dask takes care of distributing our data across both GPUs (sketched after this list)
  • Compare reading data with Dask on GPUs to Dask on CPUs
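And a sketch of the two-GPU version, again with a placeholder path. Listing two devices in CUDA_VISIBLE_DEVICES gives the cluster one worker per GPU, and Dask schedules the read tasks, and hence the partitions, across both workers.

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# Two workers, one per GPU; devices "0,1" are an assumption
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)

# Roughly 3 GB partitions, spread across both GPUs as tasks execute
ddf = dask_cudf.read_csv("data.csv", blocksize="3GB")
ddf = ddf.persist()  # kick off the distributed read

print(ddf.npartitions)    # number of ~3 GB partitions
print(client.has_what())  # which worker (GPU) holds which pieces
```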

Conclusion

Dask and RAPIDS are powerful tools, and even better together. Want to get started analyzing larger-than-memory datasets? Check out the RAPIDS Getting Started webpage, which links to pre-built Docker containers and instructions for installing directly via Conda.
