RAPIDS 23.08 Release

New features to GPU accelerate large-scale text processing, vector search, spatial analytics and much more!

Alex Ravenel
RAPIDS AI
10 min read · Aug 23, 2023


The RAPIDS 23.08 release was a big one: new features supporting large-scale text processing and LLMs in both cuDF and RAFT, major new functionality in cuxfilter and cuSpatial (plus the new cuProj library for coordinate transformations), impressive performance numbers from cuGraph, and major changes to infrastructure and packaging (CUDA 12.0 conda packages and Docker container changes). cuSignal is also being ported into CuPy and will no longer be released as a standalone package; we’ve seen how often users combine cuSignal and CuPy in their workflows, and we’re excited to unify these tools.

Get started with the quick start installation instructions for conda, pip, or Docker, or read on for more details on this month’s upgrades!

Docker Container Updates

We’ve completely revamped our Docker containers to simplify and improve the experience. A quick summary is below, but please note that there are some breaking changes; for a full rundown, see this GitHub issue describing the full impact.

  • All containers are now multi-arch (supporting both ARM and x86-64), so there’s no need to specify an architecture
  • To standardize installations, all containers (except the CUDA 11.2 containers) are based on Ubuntu 22.04. CUDA 11.2 is not supported on Ubuntu 22.04, so those containers remain based on Ubuntu 20.04
  • There are now only Base and Notebooks containers
  • Images now inherit from nvidia/cuda’s base images, which contain only cuda-cudart and cuda-compat (no other libraries, headers, etc.)
  • Everything else CUDA-related comes from Conda packages
  • All images start with a non-root user, which should improve general usability
  • Conda packages are installed in Conda’s base environment, which should simplify usage when activation is not an option
  • The RAPIDS base image starts in an IPython shell and includes everything necessary to run RAPIDS
  • The RAPIDS notebooks image starts a JupyterLab server and includes our standard suite of Jupyter notebooks to give a taste of what RAPIDS has to offer

CUDA 12.0 and Conda packaging

RAPIDS 23.08 adds support for CUDA 12.0 Conda packages! The team worked hard to create and merge new CUDA 12.0 packages on conda-forge that allow for a much more flexible CUDA future. When building Conda packages using the new model, no libraries beyond cuda-cudart are pulled in unless they are explicitly specified in requirements/host for CUDA 12. Also note that cuda-cudart is statically linked by default; users who want different behavior need to add cuda-cudart-dev to requirements/host and possibly tweak compiler flags accordingly. conda-forge is currently migrating its packages to CUDA 12 following this guidance, which is useful reading for any package maintainer.

To enable this improvement, the cuda-version package supersedes cudatoolkit as the way to specify the CUDA version. Note that cuda-version works for both prior CUDA versions (like 11.x) and newer ones (like CUDA 12.0). Given the restructuring of the CUDA Conda packages, cudatoolkit no longer exists in the new Conda packaging model (starting with CUDA 12.0). For example, RAPIDS can be installed against CUDA 12.0 with: conda create -n rapids-23.08 -c rapidsai -c conda-forge -c nvidia rapids=23.08 python=3.10 cuda-version=12.0

RAPIDS XGBoost Package

The RAPIDS 23.08 release also includes a revamped XGBoost package. It is now more closely aligned with the conda-forge package (with many changes already upstreamed to conda-forge), and it is maintained in this repo.

cuDF — ETL and data processing

cuDF 23.08 brings new tools for accelerating LLM pipelines and processing document data at scale. The jaccard_index function enables 6x faster runtimes, and hash_character_ngrams enables up to 5x larger batch sizes. These functions build on the >40x speedups from minhash (released in 23.06).
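
Below is a minimal sketch of how these functions can fit into a near-duplicate detection step. The exact signatures (jaccard_index, hash_character_ngrams, and minhash on Series.str) are assumptions based on this release’s notes, so check the cuDF documentation for your version:

```python
import cudf

# Two columns of documents to compare, as in an LLM data-cleaning pipeline.
left = cudf.Series(["the quick brown fox", "jumps over the lazy dog"])
right = cudf.Series(["the quick brown fox!", "jumps over a sleepy dog"])

# Jaccard similarity over character 5-grams (assumed signature:
# jaccard_index(other, width) -> one similarity score per row).
similarity = left.str.jaccard_index(right, 5)

# Hashed character n-grams (assumed signature: hash_character_ngrams(n)
# -> a list of hash values per document).
hashes = left.str.hash_character_ngrams(5)

# MinHash signatures, available since 23.06; seeds must be uint32.
seeds = cudf.Series([1, 2, 3], dtype="uint32")
signatures = left.str.minhash(seeds, width=5)

print(similarity)
```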

We also released expression-based IO filtering for the Parquet reader in libcudf. Now you can efficiently target your file reads to just the subset of interest, which is critical for network-attached and cloud storage.
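
As a quick illustration, the sketch below reads only matching rows from a hypothetical transactions.parquet file, assuming the libcudf filtering is surfaced through read_parquet’s pyarrow-style filters argument:

```python
import cudf

# Push the predicate into the Parquet reader so non-matching row groups
# can be skipped instead of downloaded and discarded; this matters most
# for network-attached and cloud storage.
df = cudf.read_parquet(
    "transactions.parquet",            # hypothetical file
    columns=["user_id", "amount", "ts"],
    filters=[("amount", ">", 1000)],   # pyarrow-style predicate
)
```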

In this release we also implemented a number of order-of-magnitude optimizations for the existing APIs. The ORC writer now writes files with string columns up to 10x faster. The ORC reader also got a dramatic performance boost — more than 20x faster runtimes when reading list columns with high row counts. Unbounded window functions are now 10–15x faster.

Finally, we are introducing the pylibcudf subpackage and its developer guide. pylibcudf is a lightweight Cython wrapper for direct and efficient libcudf usage, designed to become the new core of cuDF-python as well as a tool for performance-critical feature engineering.

RAFT and Vector Search

Figure: CAGRA maps subgraphs of its index to separate thread blocks, using them to visit more graph nodes in parallel and maximize GPU utilization even for single-query searches.

RAFT has added a new index to its vector search toolbelt: CAGRA, a novel state-of-the-art graph-based approximate nearest neighbors algorithm, has been promoted from experimental status and is now ready to accelerate your vector similarity search applications. CAGRA demonstrates superior performance to other GPU-accelerated indexes when given large numbers of queries at a time, but it has also been purpose-built to maximize GPU utilization even when searching for a single vector at a time. CAGRA’s graph-based index is a low-latency, high-throughput alternative to both ScaNN and HNSW, the current state of the art on the CPU.

Figure: Vector search throughput (queries per second) at 95% recall on the DEEP-100M dataset with a batch size of 10, comparing RAFT’s GPU algorithms against HNSW on the CPU.

In addition to faster search, CAGRA also reduces the amount of time spent building indexes. RAFT 23.08 cuts index build time nearly in half compared to HNSW, and this will improve even further in upcoming releases.

CAGRA comes with both a C++ API and a lightweight Python API, and it will soon be available in Milvus and Redis.
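
Here is a minimal single-GPU sketch using the Python API, assuming the pylibraft.neighbors.cagra module shape (build/search plus params objects) described in the RAFT docs:

```python
import cupy as cp
from pylibraft.neighbors import cagra

# 100k random 128-dimensional vectors, resident on the GPU.
dataset = cp.random.random_sample((100_000, 128), dtype=cp.float32)
queries = cp.random.random_sample((10, 128), dtype=cp.float32)

# Build the CAGRA graph index, then search it for the 10 nearest
# neighbors of each query.
index = cagra.build(cagra.IndexParams(graph_degree=64), dataset)
distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, k=10)

# Results are device arrays; convert as needed.
print(cp.asarray(neighbors)[:2])
```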

cuProj: Blazing Fast Geographic Coordinate Transformations

cuSpatial 23.08 introduces a new subproject called cuProj: a library and Python package for accelerated geographic and geodetic coordinate transformations. cuProj can transform billions of geospatial coordinates per second from one coordinate reference system (CRS) to another on GPUs.

cuProj provides a Python API that closely matches the popular PyProj API, and a header-only C++ API designed to implement many of the same transformations provided by the widely used PROJ library. Currently cuProj supports a subset of PROJ’s transformations and projections, specifically conversion of WGS84 coordinates (latitude/longitude, as used by GPS) to and from the Universal Transverse Mercator (UTM) standard. Conda and pip packages for cuProj are now available.
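
The sketch below shows the PyProj-style workflow on the GPU; the Transformer class and EPSG codes follow the cuProj examples (EPSG:32756 is UTM zone 56S):

```python
import cupy as cp
from cuproj import Transformer

# One million WGS84 lat/lon points near Sydney, generated on the GPU.
lat = cp.random.uniform(-35.0, -33.0, 1_000_000)
lon = cp.random.uniform(150.0, 152.0, 1_000_000)

# Mirrors PyProj's Transformer API; device-array inputs stay on the GPU.
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32756")
x, y = transformer.transform(lat, lon)
```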

Figure 1: Throughput of cuProj vs. PyProj for the WGS84-to-UTM projection (note the log vertical scale). The benchmarks were run on an NVIDIA DGX H100: PyProj used one Xeon Platinum 8480C CPU core (PyProj is single-threaded), while cuProj used one NVIDIA H100 GPU (80GB).

The chart in Figure 1 shows the performance of cuProj compared to PyProj on a WGS84-to-UTM projection benchmark. cuProj running on an NVIDIA H100 GPU achieves a peak speedup of over 4100x vs. PyProj running on an Intel Xeon Platinum 8480C CPU. For perspective, cuProj can transform 1 billion points in about 30 ms (interactive rates), while PyProj takes nearly 2 minutes to do the same.

cuSpatial: Fast cartesian distance computation and binary predicate improvements

Aside from cuProj, multiple improvements went into cuSpatial 23.08. The GeoSeries.distance API has been added to compute Cartesian distances between two GeoSeries on the GPU. It supports any combination of point, multipoint, multilinestring, and multipolygon GeoSeries, and it provides a GeoPandas-like interface (including the align argument). Given 10 million pairs of 20-segment linestrings and 38-sided polygons, cuSpatial computes linestring-to-polygon distances 66x faster than GeoPandas.
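
A small sketch of the workflow, moving GeoPandas data to the GPU and computing pairwise distances (align=False pairs rows by position rather than by index):

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon
import cuspatial

# Build small GeoSeries on the host, then copy them to the GPU.
points = gpd.GeoSeries([Point(0, 0), Point(3, 4)])
polygons = gpd.GeoSeries([
    Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    Polygon([(5, 5), (6, 5), (6, 6), (5, 6)]),
])
gpu_points = cuspatial.from_geopandas(points)
gpu_polygons = cuspatial.from_geopandas(polygons)

# Pairwise Cartesian distance with a GeoPandas-like interface.
print(gpu_points.distance(gpu_polygons, align=False))
```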

Multiple bug fixes in 23.08 increase the stability and correctness of the .contains API. A linestring that shares a boundary point with a polygon, but has none of its endpoints inside the polygon, is now reported as “not contained”, matching GeoPandas. In addition, the multi*_range classes, the abstraction over geometry objects shared across many vectorized ops, now have 100% test coverage of their public APIs; increased test coverage here improves API stability. For more information about the .distance benchmark, the binary predicate improvements, and test coverage, look for an upcoming cuSpatial 23.08 blog post.

cuxfilter — accelerated cross-filtering viz

cuxfilter development didn’t take a break this summer, adding several major features and UX updates. As of this release, cuxfilter supports CUDA 12 and is available through pip! Pip availability makes installation faster and simpler, especially on cloud services like Colab and Studio Lab. Keen observers will also note a more polished and integrated UI, thanks in part to better HoloViews and Bokeh 3.0 integration. This allows for better widget tools, easier theme switching, and more robust drag-and-drop layouts.

Speed and string-support improvements in cuDF allowed us to remove our long-running precomputed datatile feature. We also removed the Bokeh line chart, since the Datashader version is better suited to our needs. As a result, we are now able to support string columns without mapping, along with our favorite new feature: ghosted linked brushing.

Calling d.app() now shows much better laid-out dashboards inline within a notebook, and dashboard and individual chart heights can now be set directly. As a result, we decided to remove the somewhat confusing d.preview() static preview functionality.
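
A minimal dashboard sketch (the column names here are made up; the chart types and the from_dataframe/dashboard/app flow follow the cuxfilter docs):

```python
import cudf
import cuxfilter
from cuxfilter import charts

# Any cuDF DataFrame works; string columns no longer need manual mapping.
gdf = cudf.DataFrame({
    "x": range(1000),
    "y": range(1000),
    "cat": ["a", "b"] * 500,
})

cux_df = cuxfilter.DataFrame.from_dataframe(gdf)
d = cux_df.dashboard([charts.scatter(x="x", y="y"), charts.bar("cat")])
d.app()  # render the cross-filtered dashboard inline in a notebook
```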

Finally, we cleaned up our documentation and are working on an advanced cuxfilter features guide notebook. And in case you missed it, we just released an extensive visualization user guide notebook spanning hvPlot, Datashader, Plotly Dash, cuDF, cuGraph, cuML, cuSpatial, and cuxfilter. It’s a good one.

You can try all the new cuxfilter features out in Colab and Studio Lab right now.

cuGraph — graph analytics & GNNs

One of cuGraph’s goals is to ensure that all of our algorithms scale. We are happy to announce that the Edge Betweenness Centrality algorithm now scales from a single GPU to multiple GPUs, and even to multiple GPUs across multiple nodes (what we call MNMG).
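
On a single GPU the API is a one-liner; the MNMG path runs the same algorithm through cugraph.dask on a Dask-CUDA cluster:

```python
import cudf
import cugraph

# A toy edge list; MNMG runs use dask_cudf DataFrames instead.
edges = cudf.DataFrame({"src": [0, 1, 2, 2, 3], "dst": [1, 2, 0, 3, 0]})
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# One centrality score per edge.
scores = cugraph.edge_betweenness_centrality(G)
print(scores.head())
```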

Supporting graph neural networks continues to be a big push for cuGraph. This release brings major improvements to the bulk sampling function and adds the ability to do MFG (message flow graph) creation as part of sampling (a sketch of the API follows the numbers below). We tested on multiple graphs that are multiples of the OGB paper100M dataset:

3 billion edges (ogb_paper100M x2) with a batch size of 512:

  • Old bulk sampler + post MFG creation: 77.283 seconds
  • New sampler with MFG: 32.39 seconds

6 billion edges (ogb_paper100M x4) with a batch size of 512:

  • Old bulk sampler + post MFG creation: 151.89 seconds
  • New sampler with MFG: 77.33 seconds

Increasing the batch size further improved performance. On 6 billion edges with a batch size of 16,384 the sampling with MFG time drops to only 11.45 seconds.
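
For orientation, here is a rough sketch of the bulk sampling workflow. Treat the constructor and parameter names (output_path, fanout_vals, and the add_batches/flush methods on cugraph.gnn.BulkSampler) as assumptions to verify against the cuGraph docs:

```python
import cudf
import cugraph
from cugraph.gnn import BulkSampler

# Toy graph; the benchmarks above used multi-billion-edge graphs.
edges = cudf.DataFrame({"src": [0, 1, 2, 3], "dst": [1, 2, 3, 0]})
G = cugraph.Graph(directed=True)
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Samples are written to disk in batches; MFG creation now happens
# as part of sampling rather than as a separate post-processing step.
sampler = BulkSampler(
    batch_size=512,
    output_path="/tmp/samples",  # hypothetical output directory
    graph=G,
    fanout_vals=[10, 25],
    with_replacement=False,
)
seeds = cudf.DataFrame({"start": [0, 1, 2, 3], "batch": [0, 0, 1, 1]})
sampler.add_batches(seeds, start_col_name="start", batch_col_name="batch")
sampler.flush()
```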

Release 23.08 also brings CUDA 12 support to cuGraph, with the exception of the cugraph-dgl and cugraph-pyg packages, which will move to CUDA 12 in release 23.10 once PyTorch ships CUDA 12 builds in September. The WholeGraph package has also been refactored and is now available via conda and pip.

KvikIO — Improved Zarr support & high-speed IO

The focus of this release is Zarr compatibility, including support for on-the-fly compression and decompression when reading or writing Zarr files. It is now possible to use nvCOMP’s batch API when accessing Zarr files, which allows multiple buffers to be compressed or decompressed together. A demo can be found in this notebook.
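
A rough sketch of GPU-compressed Zarr IO is below. GDSStore is KvikIO’s Zarr store; the NvCompBatchCodec name and the meta_array trick for CuPy-backed chunks are assumptions drawn from the KvikIO demo, so verify them against the notebook linked above:

```python
import cupy as cp
import zarr
import kvikio.zarr
from kvikio.nvcomp_codec import NvCompBatchCodec

# Write a GPU array through GDSStore, compressing chunks on the GPU
# with nvCOMP's batch API (LZ4 here).
z = zarr.array(
    cp.arange(10_000_000, dtype=cp.float32),
    chunks=(1_000_000,),
    store=kvikio.zarr.GDSStore("data.zarr"),  # hypothetical path
    compressor=NvCompBatchCodec("lz4"),
    meta_array=cp.empty(()),  # ask Zarr for CuPy-backed chunks
)
print(z[:10])  # reads decompress on the GPU and return CuPy data
```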

Additionally, Dask-CUDA has switched from using cuCIM to KvikIO for handling GDS-based spilling, and has been updated to leverage vectored IO, which hands off all reads/writes to system calls in the CPU case or KvikIO in the GPU case.

cuSignal — porting to CuPy

The cuSignal project is in the process of porting to CuPy and will be deprecated and archived in RAPIDS. The 23.08 release is the last formal release of cuSignal. More details can be found in RSN 32. We’ve seen how users frequently combine cuSignal and CuPy in workflows, and we’re excited to unify these tools to improve the user experience.

The cuSignal team thanks RAPIDS users for their years of support and looks forward to cuSignal’s future as a core CuPy module.

Conclusion

We’re proud of what we’ve been able to achieve in RAPIDS 23.08, and we can’t wait to share what is coming in RAPIDS 23.10 and beyond! We’d like to extend our thanks to our contributors and users, and we hope that you will give RAPIDS 23.08 a try.
