Bringing GPU support to Datashader: A RAPIDS case study

Jon Mease · Published in RAPIDS AI · Feb 10, 2020

Datashader is a Python library for rapidly creating principled visual representations of large datasets. It is written in pure Python and relies heavily on numpy, pandas, Dask, xarray, and the Numba Just-in-Time compiler. See the Datashader documentation for examples of the kinds of visualizations it is designed to create.

Visualization of 300 million data points (one per person in the USA) from the 2010 census

The RAPIDS project was announced in late 2018, and it has quickly grown to encompass a rich ecosystem of technologies for taking advantage of the full computational power of modern NVIDIA GPUs from Python.

This post describes our experience incorporating several RAPIDS technologies to bring GPU-acceleration support to Datashader.

Prior challenges to GPU acceleration

It has long been an ambition of the Datashader project to provide optional support for GPU acceleration, but prior to the RAPIDS ecosystem this had been impractical for two primary reasons:

  1. Every call to Datashader would have required transferring large data arrays to and from GPU memory, substantially reducing the potential performance gains.
  2. Adding GPU support would have required writing and maintaining separate CPU and GPU implementations of many of Datashader’s core algorithms, significantly increasing the complexity of the codebase.

Avoiding data transfer with cuDF and CuPy

The two primary inputs to Datashader’s rasterization methods are pandas DataFrames and xarray Datasets. DataFrames are used for point, line, and area rasterization, while Datasets are used for raster and quadmesh rasterization.

The cuDF library provides a pandas-like DataFrame that is backed by GPU memory, with DataFrame operations implemented as CUDA kernels. Similarly, CuPy provides a numpy-like ndarray that is backed by GPU memory and can be used as a drop-in replacement for the numpy ndarrays wrapped by xarray Datasets.
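As a minimal illustration of this drop-in compatibility (a sketch assuming cuDF and CuPy are installed on a machine with a compatible NVIDIA GPU):

```python
import cudf
import cupy as cp

# cuDF follows the pandas DataFrame API, but the data lives in GPU memory
gdf = cudf.DataFrame({"x": [0.0, 1.0, 2.0], "y": [3.0, 4.0, 5.0]})
print(gdf["x"].mean())   # the reduction executes as a CUDA kernel

# CuPy follows the numpy ndarray API, also backed by GPU memory
arr = cp.arange(6, dtype=cp.float64).reshape(2, 3)
print(arr.sum(axis=0))   # computed on the GPU, with no host transfer
```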

As of Datashader 0.9.0, cuDF DataFrames can be used in place of pandas DataFrames for point, line, and area rasterization, and as of Datashader 0.10.0, CuPy-backed xarray Datasets can be used in place of numpy-backed Datasets for quadmesh rasterization. In both cases, the rasterization method returns a CuPy-backed xarray DataArray.

The RAPIDS ecosystem (cuDF and CuPy in this case) solves the data transfer problem because it makes it possible to execute entire data-processing and analysis pipelines on the GPU. With these updates, Datashader can now act as a data visualization operation within a RAPIDS pipeline which, as we’ll discuss below, can provide significant performance benefits.
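For example, an end-to-end GPU-resident pipeline might look like the following sketch; the file name and column names are placeholders, and Datashader 0.9.0 or later plus a CUDA-capable GPU are assumed:

```python
import cudf
import datashader as ds
import datashader.transfer_functions as tf

# Load the data directly into GPU memory (placeholder file and columns)
gdf = cudf.read_csv("points.csv")

# Aggregation runs as CUDA kernels; no host/device copies are needed
cvs = ds.Canvas(plot_width=800, plot_height=600)
agg = cvs.points(gdf, "x", "y", agg=ds.count())

# Shade the aggregate into an image
img = tf.shade(agg)
```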

Code reuse

Many libraries in the current RAPIDS ecosystem are designed as largely compatible drop-in replacements for libraries in the traditional PyData stack (cuDF for pandas, CuPy for numpy, cuML for scikit-learn, cuGraph for NetworkX, etc.), and they require a compatible GPU to operate.

In contrast, the goal for Datashader has always been to provide optional GPU support from within the existing code base, reusing the existing APIs. Accomplishing this without introducing significant complexity to the code base requires effective strategies for maximizing the amount of logic that can be shared between CPU and GPU code paths.

Code reuse: Data structures

Some of Datashader’s internal algorithms are implemented using the pandas DataFrame API and the numpy ndarray API. Thanks to the degree to which cuDF and CuPy are faithful to the pandas and numpy APIs, it was fairly straightforward to update these functions to support cuDF and CuPy.
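To illustrate the idea with a toy sketch (not Datashader source code): a helper written purely against the shared DataFrame API works unchanged on either backend.

```python
import pandas as pd

def normalize_column(df, name):
    # Uses only API shared by pandas and cuDF, so it works on either
    col = df[name]
    return (col - col.min()) / (col.max() - col.min())

# CPU path with pandas
print(normalize_column(pd.DataFrame({"x": [1.0, 2.0, 3.0]}), "x"))

# GPU path with cuDF (commented out; requires a compatible GPU):
# import cudf
# print(normalize_column(cudf.DataFrame({"x": [1.0, 2.0, 3.0]}), "x"))
```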

Code reuse: Custom algorithms

As mentioned above, many of Datashader’s internal algorithms are implemented as custom Python functions that are compiled at runtime using Numba.

Numba has long supported CUDA as a compilation target, and with the growing interoperability of the RAPIDS ecosystem, this has become even more powerful. It is now possible to use Numba to write CUDA kernels that can operate on the contents of cuDF DataFrames and CuPy ndarrays without copying and without transferring data to and from GPU memory.

Top-level CUDA kernel functions (those wrapped with the numba.cuda.jit decorator) require a small amount of CUDA-specific execution logic and can only be invoked on a machine with a compatible GPU. However, these kernel functions can call standard CPU-compatible Numba functions (those wrapped with the numba.jit decorator). This capability has made it possible to reuse significant portions of Datashader’s custom algorithms between the CPU and GPU code paths.
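A condensed sketch of this pattern is shown below; it is an illustration of the technique, not Datashader’s actual kernels. The binning logic lives in a single numba.jit helper that both the CPU loop and the CUDA kernel call.

```python
import numpy as np
from numba import jit, cuda

@jit(nopython=True)
def bin_index(x, xmin, xmax, nbins):
    # Shared logic: map a value onto a histogram bin
    return int((x - xmin) / (xmax - xmin) * nbins)

@jit(nopython=True)
def histogram_cpu(data, xmin, xmax, out):
    # CPU path: plain sequential loop over the input
    for i in range(data.shape[0]):
        b = bin_index(data[i], xmin, xmax, out.shape[0])
        if 0 <= b < out.shape[0]:
            out[b] += 1

@cuda.jit
def histogram_gpu(data, xmin, xmax, out):
    # GPU path: CUDA-specific indexing, then the same shared helper
    i = cuda.grid(1)
    if i < data.shape[0]:
        b = bin_index(data[i], xmin, xmax, out.shape[0])
        if 0 <= b < out.shape[0]:
            cuda.atomic.add(out, b, 1)

# CPU usage:
data = np.random.random(1_000_000)
counts = np.zeros(64, dtype=np.int64)
histogram_cpu(data, 0.0, 1.0, counts)

# GPU usage (commented out; requires a compatible GPU). Numba kernels accept
# CuPy arrays directly, so no host/device copies are made:
# import cupy as cp
# gdata = cp.random.random(1_000_000)
# gcounts = cp.zeros(64, dtype=cp.int64)
# histogram_gpu[(gdata.size + 255) // 256, 256](gdata, 0.0, 1.0, gcounts)
```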

For a complete minimal example of this approach to CPU/GPU code reuse, see the 1D histogram example at https://anaconda.org/jonmmease/datashader-rapids-histogram-example.

Performance results

With the adoption of RAPIDS technologies, many Datashader use cases can now be accomplished 1 to 2 orders of magnitude faster than the corresponding single-threaded CPU implementations*. The example notebooks behind these results were executed on a consumer-grade GeForce RTX 2080 Ti with 11 GB of GPU memory.

* For many operations, Datashader supports multi-core, parallel, and out-of-core execution using Dask and Numba. This can be used to parallelize workloads across multiple CPUs or GPUs, but is out of scope for this post. For more information, see the Performance section of the documentation.

Learn more

If you’d like to learn more about Datashader, please check out the documentation and the examples at PyViz Topics. And, if you’d like to learn more about the RAPIDS ecosystem, check out the rapids.ai homepage.
