RAPIDS 24.02 release

Alex Ravenel
Published in RAPIDS AI
Feb 22, 2024

Zero code-change NetworkX, RAPIDS on Databricks with Dask, and more!

RAPIDS just released its 24.02 version, which brings with it a major expansion of the new zero-code-change experience for NetworkX users, support for XGBoost 2.0, and improvements to cuDF’s parquet and JSON readers. Additionally, RAFT now supports more flexible vector search deployments on CPU, RMM now supports pinned host memory, and RAPIDS now works on Databricks with Dask.

24.02 also brought with it a number of changes to platform and dependency support (including the end of Pascal architecture support), so be sure to check whether any of these impact you!

Finally, we’d be remiss not to mention the return of the in-person NVIDIA GTC conference this year! Data science is well represented with many talks and events; take a look at the data science talks here, and we hope to see you there!

Zero code change NetworkX

nx-cugraph is a backend for NetworkX that uses cuGraph to GPU-accelerate NetworkX workflows. For release 24.02, nx-cugraph expanded its coverage to 37 algorithms, plus an additional 41 functions for creating test graphs. NetworkX users can install nx-cugraph to get GPU acceleration for algorithms such as ancestors, descendants, BFS, reciprocity, triangle counting, and weakly connected components, all without changing their code.

Runtime of the bfs_tree algorithm for NetworkX (blue) vs. accelerated NetworkX (purple). Total speedup varies based on the use case and whether the graph is converted for use on GPUs prior to the call.
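
As a minimal sketch (assuming nx-cugraph is installed and a NetworkX version with backend dispatching, 3.2 or later; the graph and environment variable usage here are illustrative), an existing call can be routed to the GPU backend either per call or globally:

import networkx as nx

# Any ordinary NetworkX graph works; no cuGraph-specific code is required
G = nx.karate_club_graph()

# Option 1: request the cugraph backend explicitly for a single call
T = nx.bfs_tree(G, source=0, backend="cugraph")

# Option 2: set NETWORKX_AUTOMATIC_BACKENDS=cugraph in the environment and
# leave the call unchanged; NetworkX dispatches to nx-cugraph automatically
T = nx.bfs_tree(G, source=0)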

cuML + XGBoost 2.0

XGBoost 2.0 is now available in our conda channel! This brings all the improvements of XGBoost 2.0, including a simpler, more consistent API for CUDA execution; multi-target trees with vector-leaf outputs; GPU acceleration of the approximate tree method; a rewritten learning-to-rank implementation; and quantile regression, among many other features. This release also includes RMM integration, which enables using XGBoost 2.0 with memory pools that are shared with other RMM-enabled libraries (including RAPIDS, PyTorch, CuPy, and others), speeding up applications and allowing better use of GPU memory across complex workflows. Alongside FIL and Triton, we continue working to improve and accelerate both training and inference of gradient-boosted models on GPUs.
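
As a rough sketch of the RMM integration (the pool size, data, and model parameters below are placeholders, not recommendations), you can point XGBoost at an RMM pool shared with the rest of your GPU workflow:

import rmm
import xgboost as xgb
from sklearn.datasets import make_regression

# Create a shared RMM memory pool (2 GiB here, purely illustrative)
rmm.reinitialize(pool_allocator=True, initial_pool_size=2 << 30)

# Route XGBoost's GPU allocations through RMM
xgb.set_config(use_rmm=True)

# XGBoost 2.0's simplified CUDA API: select the device via a single parameter
X, y = make_regression(n_samples=100_000, n_features=50, random_state=0)
model = xgb.XGBRegressor(device="cuda", tree_method="hist")
model.fit(X, y)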

cuML also now fully supports multi-node, multi-GPU logistic regression with Dask as well as Spark. Using cuml.dask on a single node with four H100 GPUs, this allows processing a 60 GB dataset 3.5x faster than on a single H100, and it also allows processing of much larger datasets without needing to spill to system memory. For example, a single T4 (with 16 GB of GPU memory) can process datasets of up to 12 GB, but four T4s raise that to 46 GB. Not only does this allow scaling up to bigger datasets, it gives users the flexibility to train logistic regression models even on GPUs with smaller memory footprints or in very diverse configurations. Sparse data processing will be available within the next few releases.
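
Here is a hedged sketch of what multi-GPU logistic regression can look like with Dask (the cluster setup and synthetic data are placeholders, and input types beyond dask-cuDF may also be accepted):

import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.linear_model import LogisticRegression

# One Dask worker per local GPU
cluster = LocalCUDACluster()
client = Client(cluster)

# Small synthetic dataset, partitioned across the GPU workers
n_rows, n_cols = 100_000, 20
X = cudf.DataFrame({f"f{i}": np.random.random(n_rows).astype("float32") for i in range(n_cols)})
y = cudf.Series((np.random.random(n_rows) > 0.5).astype("float32"))
X_dask = dask_cudf.from_cudf(X, npartitions=4)
y_dask = dask_cudf.from_cudf(y, npartitions=4)

model = LogisticRegression()
model.fit(X_dask, y_dask)
predictions = model.predict(X_dask).compute()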

cuDF improvements

RAPIDS 24.02 includes improvements to libcudf’s chunked parquet reader, making it possible to reduce the memory footprint of parquet reads. For a 536 MB table, the chunked reader with a 200 MB chunk size drops peak memory footprint from ~2.2x to ~1.4x the table size. Based on results from the libcudf microbenchmarks, the chunked parquet reader uses 40% less peak memory while maintaining >85% of the data throughput. Chunked parquet reading is available in the C++ API, and we’re planning to add support to cuDF-python in the future (see more in this issue).

libcudf’s JSON reader also added new options for reading mixed data types in the same key and for supporting single-quoted JSON variants. Previously, cuDF-python would raise an exception if a JSON key had values of varying data types:

>>> from io import BytesIO
>>> import pandas as pd
>>> import cudf
>>> j = '{"product":{"id":12}}\n{"product":[6,9]}'
>>> b = BytesIO(j.encode('utf-8'))
>>> pd.read_json(b, lines=True)
      product
0  {'id': 12}
1      [6, 9]
# pandas returns python objects
>>> b.seek(0)  # rewind the buffer before re-reading
0
>>> cudf.read_json(b, lines=True)
RuntimeError: CUDF failure at: /nfs/repo/cudf24.02pqchunk/cpp/src/io/json/json_column.cu:629: A mix of lists and structs within the same column is not supported
# cudf raises an exception
>>> b.seek(0)
0
>>> cudf.read_json(b, lines=True, mixed_types_as_string=True)
     product
0  {"id":12}
1      [6,9]
# cudf returns unparsed strings, instead of crashing or losing data

RAFT + Vector Search

The graph-based approximate nearest neighbors (ANN) index CAGRA can now be trained on the GPU and moved over to host (RAM) memory to be searched on the CPU with the popular HNSW algorithm. While searching the CAGRA index on the GPU tends to have lower latency overall, this interoperability between GPU and CPU offers a significant cost reduction by offloading the most compute-intensive part, the index build, to the GPU.

Below is a code example using RAFT’s Python API to build a CAGRA index on the GPU and load it into an HNSW index for searching.

import cupy as cp
from pylibraft.common import DeviceResources
from pylibraft.neighbors import cagra, hnsw

n_samples = 50000
n_features = 50
dataset = cp.random.random_sample((n_samples, n_features), dtype=cp.float32)

handle = DeviceResources()
# Build the CAGRA index on the GPU
index = cagra.build(cagra.IndexParams(), dataset, handle=handle)
# Convert it to an HNSW index that can be searched on the CPU
hnsw_index = hnsw.from_cagra(index, handle=handle)
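
The resulting index can then be searched on the CPU. Below is a hedged continuation of the snippet above (the search parameters and k are illustrative, and exact signatures may differ between releases):

import numpy as np

# HNSW search runs on the CPU, so queries live in host (NumPy) memory
queries = np.random.random_sample((100, n_features)).astype(np.float32)
search_params = hnsw.SearchParams()
distances, neighbors = hnsw.search(search_params, hnsw_index, queries, k=10)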

Benchmarking index build times for RAFT’s graph-based CAGRA algorithm against the industry-standard HNSW algorithm on the CPU. Benchmarking was done using all available cores of an Intel Sapphire Rapids CPU and an NVIDIA H100 GPU.

HNSW has become an industry standard for vector search because it is simple to use, works out of the box with minimal parameter tuning, and its low-latency search is often fast enough for many use cases like retrieval-augmented generation (RAG) and recommender systems. Unfortunately, while building the index can often be done in parallel, the resulting build performance on the CPU leaves something to be desired.

The chart below shows the average build times for indexes that performed at various levels of recall. It takes significantly less time to build a CAGRA index on the GPU than to build an HNSWlib index on the CPU.

RMM Memory Manager Library

When copying data between GPU (device) and CPU (host), using pinned host memory results in much higher bandwidth and reduced synchronization. While RMM supported a host_memory_resource for allocating pinned memory in the past, this could not be used as the upstream memory resource (MR) for an RMM memory pool. RMM 24.02 includes a new host_pinned_memory_resource that allocates pinned host memory and can be used as the upstream MR for a pool_memory_resource. Pinned host pools are available in the RMM 24.02 C++ API and will be used in the near future to improve performance of Parquet reading and dataframe spilling in cuDF.

In this release, RMM also moved further along in a significant refactoring to use the new cuda::memory_resource concepts now included in libcu++. cuda::memory_resource and related features provide a C++ interface for heterogeneous, stream-ordered memory allocation tailored to the needs of CUDA C++ developers. The design builds on the success of RMM and evolves it based on lessons learned. <cuda/memory_resource> is not intended to replace RMM, but instead moves the definition of the memory allocation interface to a more centralized home in the CUDA Core C++ Libraries (CCCL). RMM will remain a collection of implementations of the cuda::mr interfaces. This refactoring will continue over the next few releases and will include changes to other RAPIDS libraries.

With the new memory_resource concepts, instead of implementing a memory resource as a class that derives from rmm::mr::device_memory_resource and implements the pure virtual methods of the base class, memory resource implementations now only need to satisfy the cuda::mr::memory_resource and/or cuda::mr::async_memory_resource concepts. This simply means that the class must implement the appropriate allocate/deallocate and/or allocate_async/deallocate_async functions and comparison operators, as in the following (trivial) example. Please see the docs for more examples.

struct example_memory_resource {
  void* allocate(std::size_t, std::size_t) { return nullptr; }
  void deallocate(void*, std::size_t, std::size_t) {}
  void* allocate_async(std::size_t, std::size_t, cuda::stream_ref) { return nullptr; }
  void deallocate_async(void*, std::size_t, std::size_t, cuda::stream_ref) {}
  bool operator==(const example_memory_resource&) const { return true; }
  bool operator!=(const example_memory_resource&) const { return false; }
};
static_assert(cuda::mr::async_resource<example_memory_resource>, "");

A library can easily decide whether the memory resource supports the async (stream-ordered) interface:

template <class MemoryResource>
  requires cuda::mr::resource<MemoryResource>
void* maybe_allocate_async(MemoryResource& resource, std::size_t size, std::size_t align, cuda::stream_ref stream) {
  if constexpr (cuda::mr::async_resource<MemoryResource>) {
    return resource.allocate_async(size, align, stream);
  } else {
    return resource.allocate(size, align);
  }
}

Note the function above uses the requires constraint from C++20 concepts, but it can be converted to work with earlier versions of C++ by using SFINAE instead of concepts.

RAPIDS on Databricks with Dask

RAPIDS has a long history of accelerating workloads on Databricks, and with 23.12 that integration expanded to include leveraging multi-node GPU clusters with Dask.

If you’re running Spark workloads on Databricks, you can use the RAPIDS Accelerator for Spark to speed up your existing workloads. But now, with the new dask-databricks package, you can also launch a Dask cluster alongside Spark on the same Databricks infrastructure. This is extremely useful if you want to try out packages like xgboost.dask on GPUs while using infrastructure you are already familiar with.

Adding Dask to your cluster is as quick as adding an init script that starts the Dask components with the RAPIDS dask-cuda package.

# Start the Dask cluster with dask-cuda workers
dask databricks run --cuda

Then you can get a Dask client from your Databricks notebook:

import dask_databricks

client = dask_databricks.get_client()
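
From there, the client can drive GPU-backed Dask libraries. For example, here is a rough sketch of training with xgboost.dask (the synthetic data and parameters are placeholders; in practice you would load data from DBFS or cloud storage):

import dask.array as da
import xgboost as xgb

# Placeholder data partitioned across the Dask workers
X = da.random.random((1_000_000, 20), chunks=(100_000, 20))
y = da.random.random(1_000_000, chunks=100_000)

dtrain = xgb.dask.DaskDMatrix(client, X, y)
output = xgb.dask.train(
    client,
    {"tree_method": "hist", "device": "cuda", "objective": "reg:squarederror"},
    dtrain,
    num_boost_round=100,
)
booster = output["booster"]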

Check out the RAPIDS Deployment documentation to learn more.

Infrastructure and support updates

RAPIDS 24.02 brought with it a number of changes to platforms and dependencies that RAPIDS supports. Specifically:

  • RAPIDS 24.02 brought an end to support for Pascal GPUs (see more here). Effective this release, use of a Pascal GPU will either fail or return invalid results.
  • RAPIDS also no longer publishes containers supporting CUDA 11.2 (see more here). We continue to publish containers for CUDA 11.8 and 12.0. Users who wish to continue using CUDA 11.2 can do so via conda installation.
  • Starting with the RAPIDS 24.04 release, we are deprecating our containers for CentOS 7; these containers will no longer be published starting with RAPIDS 24.06. See more here.
  • Finally, RAPIDS will stop supporting pandas 1.x effective with the 24.04 release (see more here); from 24.04 onward, only the pandas 2.x series will be supported.

NVIDIA GTC

The in-person GTC experience is back this year! Come connect with a dream team of industry luminaries, developers, researchers, and business strategists helping shape what’s next in AI and accelerated computing.

Data science topics, including RAPIDS, will be heavily represented. Data science talks will include:

  • RAPIDS in 2024: Accelerated Data Science Everywhere
  • XGBoost is All You Need
  • Accelerating Pandas with Zero Code Change using RAPIDS cuDF
  • Large-Scale Graph GNN Training Accelerated With cuGraph
  • And more!

See the full list of data science topics here. We hope to see you there!

Conclusion

We’re excited to get RAPIDS 24.02 out into the wild and to see what people build with it. We hope to see you at GTC!

To get started using RAPIDS, visit the Quick Start Guide.
