RAPIDS 23.10 Release

RAPIDS Goes CPU/GPU (zero code change!), XGBoost 2.0, Improved Vector Search, and more

Nick Becker
RAPIDS AI
7 min read · Nov 16, 2023

You can now GPU-accelerate pandas with zero code change using RAPIDS cuDF

The RAPIDS 23.10 release takes a huge step forward in breaking down barriers to bringing accelerated computing to the data science community. RAPIDS now enables a zero code change CPU/GPU user experience for dataframes, graph analytics, and machine learning. This release also introduces XGBoost 2.0 and makes substantial improvements to accelerated vector search and text processing for LLMs.

Table of Contents

RAPIDS Goes CPU/GPU

On November 8th, NVIDIA hosted the AI and Data Science Virtual Summit, bringing together experts from across the community to discuss how accelerated computing is advancing data science.

At the Summit, we announced major enhancements to RAPIDS cuDF, cuGraph, and cuML that bring a unified, zero code change CPU/GPU experience to dataframe, graph analytics, and machine learning workflows. Below, we summarize these key enhancements. For those interested in learning more, we encourage you to register for free for the Summit to watch the session replays and stay tuned for in-depth blogs in the future.

cudf.pandas

Pandas is the quintessential tool for data scientists today. With 9.5 million users, it’s the most popular library in Python for working with tabular data. But it becomes slow as dataset sizes grow into the gigabytes.

At the Summit, we announced that cuDF’s new pandas accelerator mode (cudf.pandas) solves this problem, bringing the speed of cuDF to every pandas workflow with zero code change required. It works with most third-party libraries that operate on pandas objects and will accelerate pandas operations within these libraries, too. Just load cudf.pandas to accelerate your workflow on the GPU, with automatic CPU fallback if needed.

This new mode is available in the standard cuDF package. To accelerate IPython or Jupyter Notebooks, use the magic:

%load_ext cudf.pandas
import pandas as pd

To accelerate a Python script, use the Python module flag on the command line:

python -m cudf.pandas script.py
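
Nothing else about the workflow needs to change. For example, an ordinary pandas script like the sketch below (the CSV path and column names are hypothetical) will run its operations on the GPU when launched with the module flag above:

# script.py: ordinary pandas code, with no cuDF imports required
import pandas as pd

df = pd.read_csv("transactions.csv")  # hypothetical input file
summary = (
    df.groupby("customer_id")["amount"]
    .agg(["count", "sum", "mean"])
    .sort_values("sum", ascending=False)
)
print(summary.head())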

cudf.pandas is designed to accelerate workflows where pandas struggles with performance, so the 5 GB scale DuckDB/H2O.ai Database-like Ops Benchmark perfectly illustrates the impact of GPU-accelerating your pandas code. The benchmark requires executing a variety of common merge and groupby operations.

By just “flipping the switch”, we can turn minutes of processing into just one or two seconds.

Figure 1. Performance comparison between traditional pandas v1.5 on an Intel Xeon Platinum 8480CL CPU and pandas v1.5 with RAPIDS cuDF on NVIDIA Grace Hopper

You can learn more about this benchmark in the cuDF documentation. And you can learn more about how cudf.pandas works at rapids.ai/cudf-pandas and test-drive the introductory notebook in a free GPU-enabled notebook environment on Google Colab.

nx-cuGraph

With more than 27 million downloads per month, NetworkX is the go-to library for graph analytics in Python thanks to its ease of use, wide selection of algorithms, and fantastic community.

But as dataset and graph sizes grow, the performance of NetworkX’s pure-Python implementation becomes a significant hurdle, forcing users to either stop using their favorite library or wait potentially hours for results. Over the past year, we’ve been collaborating with the NetworkX community to develop backend dispatching capabilities that can address these challenges.

We’re excited to share that cuGraph can now be used as a backend for NetworkX through our nx-cugraph package, enabling NetworkX users to GPU-accelerate their workflows with zero code change.

Just set an environment variable and your workflow will use cuGraph if it’s available and the algorithm is supported, falling back to standard CPU-based NetworkX otherwise.

export NETWORKX_AUTOMATIC_BACKENDS=cugraph
python my_nx_app.py

That’s it! Your NetworkX code will now use the GPU for all supported algorithms, enabling you to access up to 600x speedups when processing graphs like the US patent citation network dataset with 3.7 million nodes and 16.5 million edges.

Figure 2. nx-cugraph on NVIDIA H100 vs. NetworkX on Intel Xeon Platinum 8480CL CPU.
Dataset: US Patent citation network 1975–1999 hosted by Stanford SNAP
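
If you prefer not to set an environment variable, NetworkX 3.2 and later also accept a backend argument on a per-call basis. Here is a minimal sketch using one of NetworkX’s built-in example graphs:

import networkx as nx

G = nx.karate_club_graph()
# Explicitly request the cugraph backend for this call
scores = nx.betweenness_centrality(G, backend="cugraph")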

nx-cugraph currently includes support for three algorithms and we aim to support 12 algorithms in the 23.12 release.

You can install nx-cugraph using either pip or conda:

pip install nx-cugraph-cu11 --extra-index-url https://pypi.nvidia.com
conda install -c rapidsai -c conda-forge -c nvidia nx-cugraph

To learn more about nx-cugraph and this benchmark, please visit this in-depth blog.

cuML CPU/GPU

In this release, we’ve significantly expanded cuML’s CPU/GPU capabilities. The majority of cuML estimators now support both GPU-based and CPU-based execution, with zero code change required to switch between them. A subset of estimators even supports exporting models across hardware, enabling you to train on one type of hardware and run inference on another.

Now, you can prototype using cuML in your workflows even on systems without access to a GPU. When you’re ready, take the same code and run it on a GPU-enabled system to tap into the power of accelerated computing:

import cuml  # no change is needed, even for the import!
from cuml.manifold.umap import UMAP

X, y = ...

# Define the cuML UMAP model and use fit_transform to obtain the
# low-dimensional representation of the input dataset
umap = UMAP(
    n_neighbors=10, min_dist=0.01, init="random"
)

embeddings = umap.fit_transform(X)
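
On systems where both execution paths are installed, you can also select the device explicitly with the device-selection context manager described in the cuML on GPU and CPU documentation. A minimal sketch, reusing the X from above:

from cuml.common.device_selection import using_device_type
from cuml.manifold.umap import UMAP

# Force CPU execution for this block; pass "gpu" to run on the GPU
with using_device_type("cpu"):
    umap = UMAP(n_neighbors=10, min_dist=0.01, init="random")
    embeddings = umap.fit_transform(X)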

To get started prototyping with cuML on CPU-only machines, you can install via conda:

conda install -c rapidsai -c nvidia -c conda-forge cuml-cpu=23.10

To learn more about these capabilities, please visit the cuML on GPU and CPU documentation.

XGBoost 2.0

In partnership with the XGBoost community, we released XGBoost 2.0 in September! This is a huge milestone for the project and a testament to the incredible group of contributors and users driving the project forward over the past few years. Today, XGBoost is downloaded more than 2.5 million times per week and is used across nearly every industry.

XGBoost 2.0 is chock full of huge improvements to both performance and user experience, but we’ll spotlight several below.

Unified GPU interface with a single device parameter

Historically, using a GPU for different tasks within XGBoost required setting a variety of parameters, such as tree_method=gpu_hist, predictor=gpu_predictor, and gpu_id. Now, all of these capabilities are configurable with a single, simple device parameter that controls CPU or GPU execution.
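
For example, here is a minimal sketch of GPU-accelerated training under the new interface (X_train and y_train are placeholders):

import xgboost as xgb

# A single parameter now selects the execution device ("cpu" or "cuda")
clf = xgb.XGBClassifier(device="cuda", tree_method="hist")
clf.fit(X_train, y_train)  # training and prediction both run on the GPU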

Quantile Regression

XGBoost now supports quantile regression, a popular technique used for probabilistic forecasting scenarios in which you care about parts of the distribution beyond just the conditional mean. This may be particularly valuable for use cases in which outcomes in the tails of the distribution are particularly impactful or important, such as in supply chain forecasting.

The quantile loss (also known as pinball loss) is supported on both CPUs and GPUs.
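
Here is a minimal sketch of training a multi-quantile model in XGBoost 2.0, targeting the 10th, 50th, and 90th percentiles (X and y are placeholder arrays):

import numpy as np
import xgboost as xgb

Xy = xgb.QuantileDMatrix(X, y)
booster = xgb.train(
    {
        "objective": "reg:quantileerror",  # pinball (quantile) loss
        "quantile_alpha": np.array([0.1, 0.5, 0.9]),  # target quantiles
        "device": "cuda",  # or "cpu"
    },
    Xy,
    num_boost_round=100,
)
predictions = booster.predict(Xy)  # one set of predictions per quantile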

PySpark Interface

The official XGBoost PySpark interface is now much more mature and ready for wider use! With support for GPU-based training and predictions, improved logging, performance optimizations, computing SHAP-based feature contributions, and more, we’re excited to see what the Apache Spark community creates!
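
As a minimal sketch, distributed GPU training looks like this (train_df is a placeholder Spark DataFrame with assembled features and label columns):

from xgboost.spark import SparkXGBClassifier

classifier = SparkXGBClassifier(
    features_col="features",
    label_col="label",
    device="cuda",  # train with GPUs on the Spark cluster
)
model = classifier.fit(train_df)
predictions = model.transform(train_df)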

To learn more, visit the XGBoost Python documentation.

Accelerated Vector Search and Text Processing

Accelerated Vector Search with RAFT

23.10 brings substantial performance and functional enhancements to CAGRA, the GPU-accelerated graph-based approximate nearest neighbors technique with world-class performance for large batch queries, single queries, and graph construction time.
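
As a minimal sketch, building and querying a CAGRA index through RAFT’s Python package, pylibraft, looks roughly like this (the dataset and parameters here are illustrative; see the RAFT documentation for the full API):

import cupy as cp
from pylibraft.neighbors import cagra

dataset = cp.random.random_sample((10000, 128), dtype=cp.float32)
queries = cp.random.random_sample((100, 128), dtype=cp.float32)

# Build the graph-based index, then retrieve the 10 nearest neighbors
index = cagra.build(cagra.IndexParams(), dataset)
distances, neighbors = cagra.search(cagra.SearchParams(), index, queries, 10)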

Pre-Filtering

Pre-filtering allows removing irrelevant records before querying our vector database index to ensure we return high quality results. As the team at Pinecone noted, “Pre-filtering is excellent for returning relevant results, but significantly slows-down our search…”

With support for pre-filtering now available, CAGRA addresses these performance challenges on the GPU while still returning high-quality results.

Nearest Neighbor Descent

Building vector indexes is a computationally challenging problem well suited for GPUs, and nearest neighbor descent (nn-descent) is a state-of-the-art technique for iteratively constructing a k-nearest neighbors graph. Our graph-based CAGRA algorithm can now use nn-descent to improve initial graph-construction performance by more than 10x compared to using IVF-PQ to construct the graph.

Faster JSON parsing with cuDF and Dask

Training large language models can require processing terabytes of text data, often representing essentially the entire internet! cuDF and Dask have emerged as a great combination for efficiently processing the documents in these training pipelines.

In the 23.10 release, we’ve made algorithmic enhancements to the cuDF JSON reader that improve performance for reading the types of files common in large language model training pipelines. In our benchmarks, we now observe end-to-end read throughput up to 70% of the theoretical limit with H100 GPUs reading from local NVMe drives.

If you’re using cuDF as part of your LLM training pipeline, you can expect to immediately see performance gains.
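
As a minimal sketch, a preprocessing step might read batches of newline-delimited JSON files with Dask and cuDF like this (the glob path and column name are placeholders):

import dask_cudf

# Read newline-delimited JSON documents across many files on the GPU
df = dask_cudf.read_json("data/part-*.jsonl", lines=True)
df = df[df["text"].str.len() > 0]  # hypothetical cleanup step
print(df.head())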

Summary

At the AI and Data Science Summit, we put forward a commitment to meeting data scientists where they are today. The RAPIDS 23.10 release is a major milestone in bringing accelerated computing into the day-to-day workflows of data scientists and engineers.

Pandas and NetworkX users can now GPU-accelerate their code with zero code change. cuML now provides a large suite of CPU/GPU capabilities. XGBoost 2.0 dramatically simplifies using GPUs. And we’re continuing to improve vector search and text processing to empower emerging technologies like LLMs.

To get started using RAPIDS, visit the Quick Start Guide.
