RAPIDS 24.08: Better scalability, performance, and CPU/GPU interoperability

Manas Singh
Published in RAPIDS AI

RAPIDS 24.08 is now available with significant updates geared towards processing larger workloads and seamless CPU/GPU interoperability.

  • cuDF pandas accelerator mode now allows you to process larger-than-GPU-memory workloads and supports string columns with more than 2.1 billion characters.
  • RAPIDS cuVS achieves better search accuracy by ensuring connected graphs, and maintains the accuracy as you add additional vectors. cuVS CPU/GPU interoperability has also been improved by providing a C++ API within cuVS for graph conversion.
  • cuML now accelerates dimensionality reduction of much larger datasets with UMAP, speeding up the initial knn graph build by up to 300x.

Read on for more details!

Process workloads larger than GPU memory with cuDF pandas accelerator mode

cuDF’s pandas accelerator mode — the zero-code-change accelerator for pandas workflows — can now speed up processing of much larger datasets and workloads that exceed available GPU memory. To achieve this, cudf.pandas now uses CUDA Unified Memory (managed memory) by default. Unified Memory enables CUDA programs to oversubscribe GPU memory by automatically migrating memory pages between CPU and GPU memory. To ensure high performance, cuDF now also automatically prefetches managed memory to the GPU before accessing it.
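
If you haven't tried it yet, enabling the accelerator requires no changes to your pandas code. Below is a minimal sketch; the file path and column names are illustrative, not part of the release.

import cudf.pandas
cudf.pandas.install()  # or: %load_ext cudf.pandas in a notebook,
                       # or: python -m cudf.pandas script.py from the CLI

import pandas as pd  # now accelerated by cuDF

# With managed memory on by default, workloads that oversubscribe GPU
# memory migrate pages between CPU and GPU instead of falling back to CPU.
df = pd.read_parquet("transactions.parquet")  # illustrative path
result = df.groupby("account_id")["amount"].sum()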

Figure 1 shows results from a benchmark of data processing workloads on a 10 GB dataset: cuDF pandas achieved up to 30x speedups over CPU-only pandas for data joins on an NVIDIA T4 GPU with 16 GB of memory. Notably, the queries shown in Figure 1 require allocating more memory than is available on the GPU; before cuDF 24.08, this would have caused a fallback to CPU-only pandas.

Figure 1. DuckDB Data Benchmark (10 GB dataset) performance comparison between cuDF pandas and traditional pandas. HW: NVIDIA T4 GPU, Intel Xeon Gold 6130 CPU; SW: pandas v2.2.2, RAPIDS cuDF 24.08

For more information about these benchmark results and how to reproduce them, see the cuDF benchmarks guide.

The second big improvement in cuDF 24.08 is support for large string columns. Previously, a dataframe column could hold at most 2.1 billion characters because string offsets were stored as 32-bit integers. cuDF now dynamically switches to 64-bit offsets when a string column's data exceeds 2.1 billion characters, keeping the lower memory footprint and higher processing speed of 32-bit offsets while still supporting efficient processing of large string columns. Together, these two new features enable you to process much larger datasets with much more textual data.
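
No code changes are needed to benefit from large string support: the same string APIs are used whether a column is indexed with 32-bit or 64-bit offsets. Here is a minimal sketch using cuDF directly; the data is illustrative and far below the 2.1B-character threshold, beyond which the 64-bit path kicks in transparently.

import cudf

s = cudf.Series(["the quick brown fox jumps over the lazy dog"] * 1_000_000)
upper = s.str.upper()                 # string ops are unchanged
total_chars = int(s.str.len().sum())  # total characters in the column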

Accelerating UMAP in cuML with faster knn graph building

You can now speed up dimensionality reduction and visualization of massive datasets with UMAP, thanks to a faster algorithm for building the initial knn graph. This feature adds a new `build_algo` argument to the UMAP estimator in cuML. It defaults to `auto`, so most users don't need to change anything to take advantage of this new capability. To force the faster graph build, set `build_algo` to the `nn_descent` ANN algorithm; `brute_force` remains available, and still delivers great GPU performance, when the original exact knn graph build is preferred. To start using the new algorithm, set the `build_algo` parameter to `nn_descent` as shown below.

import numpy as np
from cuml.manifold.umap import UMAP

data = np.random.rand(100_000, 64).astype(np.float32)  # any float32 dataset works here
umap_nnd = UMAP(build_algo="nn_descent")  # defaults to "auto"
embedding = umap_nnd.fit_transform(data)

Figure 2: Time to build knn graph for different dataset sizes with `NN Descent` and `Brute Force` graph building algorithms

As shown in Figure 2, the speedup can reach 300x for a 30 GB dataset on an H100 GPU (80 GB of GPU memory), with no material impact on embedding quality as measured by the trustworthiness score (see the sketch below for checking this on your own data). In an upcoming release, UMAP will also be able to process datasets larger than GPU memory through a novel batching process. Stay tuned for that!
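
If you want to verify embedding quality yourself, you can compare both build algorithms with cuML's trustworthiness metric. A minimal sketch, with synthetic data standing in for a real dataset:

import numpy as np
from cuml.manifold.umap import UMAP
from cuml.metrics import trustworthiness

data = np.random.rand(10_000, 64).astype(np.float32)  # illustrative data

emb_nnd = UMAP(build_algo="nn_descent").fit_transform(data)
emb_bf = UMAP(build_algo="brute_force").fit_transform(data)

# Similar scores indicate the faster build preserves embedding quality.
print(trustworthiness(data, emb_nnd, n_neighbors=10))
print(trustworthiness(data, emb_bf, n_neighbors=10))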

Seamless CPU/GPU index interoperability and improved vector search with cuVS

cuVS, the RAPIDS vector search library, further improves seamless execution between CPU and GPU for both vector index building and search. While CAGRA, the GPU-native graph-based approximate nearest neighbors (ANN) index, generally has lower latency than CPU indexes, this CPU/GPU interoperability offers a significant cost reduction by offloading the most compute-intensive work to the GPU. For library development work, cuVS now provides a C++ API for converting CAGRA indexes built on the GPU to HNSW indexes (loaded via the HNSW API wrapper in cuVS) for search on the CPU. C and Python APIs for this conversion will follow in future releases. The snippet below shows sample C++ code for converting an index.

#include <raft/core/device_resources.hpp>
#include <cuvs/neighbors/cagra.hpp>
#include <cuvs/neighbors/hnsw.hpp>

using namespace cuvs::neighbors;

raft::device_resources res;
cagra::index_params index_params;
int64_t n_vecs = 1000000, dim = 128;  // dataset shape
auto index_vectors = raft::make_device_matrix<float, int64_t>(res, n_vecs, dim);
// … populate index vectors

// Build the CAGRA index on the GPU, then convert it to an HNSW index searchable on the CPU
auto cagra_index = cagra::build(res, index_params, raft::make_const_mdspan(index_vectors.view()));
auto hnsw_index = hnsw::from_cagra(res, cagra_index);

You can now also add vectors to an existing CAGRA index, another highly requested feature. Adding vectors to the CAGRA graph while maintaining search accuracy has historically been a challenge; now you can add 20% more vectors without significant degradation in search accuracy. The example below demonstrates how the C++ API can be used to add vectors to an existing CAGRA index.

#include <raft/core/device_resources.hpp>
#include <cuvs/neighbors/cagra.hpp>

using namespace cuvs::neighbors;

raft::device_resources res;
cagra::index_params index_params;
int64_t n_vecs = 1000000, n_new_vecs = 200000, dim = 128;  // dataset shape
auto index_vectors = raft::make_device_matrix<float, int64_t>(res, n_vecs, dim);
// … populate index vectors
auto index = cagra::build(res, index_params, raft::make_const_mdspan(index_vectors.view()));

// Stage the additional vectors in host memory, then extend the index in place
auto new_vectors = raft::make_host_matrix<float, int64_t>(res, n_new_vecs, dim);
// … populate new vectors
cagra::extend_params extend_params;
cagra::extend(res, extend_params, raft::make_const_mdspan(new_vectors.view()), index);

Lastly, you can improve vector search accuracy by ensuring that the CAGRA graph is connected. A new build parameter guarantees graph connectivity, ensuring comprehensive traversal across nodes and improving search quality. The example below sets the `guarantee_connectivity` parameter to `true` in the CAGRA index parameters.

#include <raft/core/device_resources.hpp>
#include <cuvs/neighbors/cagra.hpp>

using namespace cuvs::neighbors;

raft::device_resources res;
cagra::index_params index_params;
index_params.guarantee_connectivity = true;  // ensure the resulting graph is fully connected
int64_t n_vecs = 1000000, dim = 128;  // dataset shape
auto index_vectors = raft::make_device_matrix<float, int64_t>(res, n_vecs, dim);
// … populate index vectors
auto index = cagra::build(res, index_params, raft::make_const_mdspan(index_vectors.view()));

Enhanced CUDA 12.5 support and adjustments to dependency versions

RAPIDS has a few dependency support changes that users should be aware of. For more, please see our RAPIDS Notices page.

  1. RAPIDS added CUDA 12.5 support in the 24.08 release, so RAPIDS 24.08 CUDA 12 packages are installable and usable with CUDA 12.0–12.5. Docker containers for CUDA 12.5 are also included.
  2. RAPIDS deprecated support for CUDA 12.2 in its Docker containers in the 24.08 release; RAPIDS 24.08 will be the last version to release CUDA 12.2 Docker containers. More details can be found in the corresponding RAPIDS Support Notice (RSN).
  3. RAPIDS is dropping support for Python 3.9 in its upcoming 24.10 release and will simultaneously add support for Python 3.12. More details can be found in the corresponding RAPIDS Support Notice (RSN).

Conclusion

The RAPIDS 24.08 release takes another step forward in our mission to make accelerated computing more accessible to data scientists and engineers. We can’t wait to see what people do with these new capabilities.

To get started using RAPIDS, visit the Quick Start Guide.
