RAPIDS 0.9: A Model Built To Scale

Josh Patterson
RAPIDS AI
Aug 28, 2019 · 8 min read

If you know me, you know I love to reference songs, sports, movies, and TV shows as analogies. The RAPIDS 0.9 release brought to mind another classic “9” that is remembered fondly by many of our team members, Star Trek: Deep Space Nine (DS9). The show has many lines that are great analogies for RAPIDS. For instance:

Gowron: “Think of it! Five years ago no one had ever heard of Bajor or Deep Space Nine, and now, all our hopes rest here.”

Five years ago, people did not associate NVIDIA with data science. Now, all our hopes for continuing to scale data science rest with RAPIDS. The RAPIDS team doesn’t take that lightly. Compute platforms aren’t built overnight, and everyone must plan for Life After Hadoop. Like DS9, this will take several seasons, but the best is yet to come.

This release (and this blog) focuses on one line from DS9 in particular:

Chief O’Brien: “It’s not a toy! It’s a model, built to scale.”

RAPIDS 0.9 is an exciting jump forward, with a number of new algorithms built to scale both up to more GPUs in a node, and out to multiple GPU nodes.

Image Source: Ekostories.com

A Model Built to Scale: cuGraph

cuGraph, the RAPIDS graph analytics library, is happy to announce the release of a multi-GPU PageRank algorithm, supported by two new primitives: a multi-GPU COO-to-CSR data converter and a function for computing vertex degree. These primitives convert source and destination edge columns from a Dask DataFrame into a graph format and enable PageRank to scale across multiple GPUs.
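For readers new to cuGraph, here is a minimal sketch of the single-GPU edge-list-to-PageRank flow that the new primitives generalize; the multi-GPU path follows the same shape but starts from a Dask DataFrame. The file name and column names are placeholders.

```python
import cudf
import cugraph

# Hypothetical edge list: one row per edge, integer vertex ids.
edges = cudf.read_csv("edges.csv", names=["src", "dst"],
                      dtype={"src": "int32", "dst": "int32"})

# Build a graph from the COO-style source/destination columns.
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# PageRank returns a cuDF DataFrame of vertex ids and scores.
scores = cugraph.pagerank(G)
print(scores.head())
```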

The following figure shows the performance of the new multi-GPU PageRank algorithm. Unlike previous PageRank benchmarks, which measured only the PageRank solver itself, these runtimes include Dask DataFrame-to-CSR conversion, PageRank execution, and conversion of the results from CSR back to a DataFrame. On average, the new multi-GPU PageRank analytic is over 10x faster than a 100-node Spark cluster.

Figure 1: cuGraph PageRank compute times across a varying number of edges and NVIDIA Tesla V100s

The following figure looks at just the Bigdata dataset (50 million vertices and 1.98 billion edges) running the HiBench end-to-end test. The HiBench benchmark runtimes include reading the data, running PageRank, and then writing out the scores for all vertices. HiBench was tested on Google Cloud Platform (GCP) with 10, 20, 50, and 100 nodes.

Figure 2: cuGraph PageRank vs Spark GraphX for 50M edges across varying n1-standard-8 GCP nodes (lower is better)

cuGraph 0.9 also includes a new single-GPU strongly connected components function. For more on the future of cuGraph, please read the Towards Data Science blog on cuGraph’s journey to version 1.0, or the latest ZDNet article on cuGraph. A blog detailing PageRank and these benchmarks is coming in the near future.

A Model Built to Scale: cuML

Release 0.9 brings two major milestones to cuML, the RAPIDS machine learning library. cuML 0.9 includes the first algorithm based on a new multi-node, multi-GPU framework: cuML-MG. cuML-MG combines the ease of use of the Dask framework with the highly optimized OpenUCX and NCCL libraries for performance-critical data transfers. k-Means clustering is the first algorithm built on this framework. It provides a familiar, scikit-learn-style API but scales nearly linearly across GPUs and nodes using Ethernet or InfiniBand. Across a pair of DGX-1 servers, k-Means-MG can cut the run time for a large clustering problem from 630 seconds on CPU to 7.1 seconds on GPU¹.
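Here is a minimal sketch of the Dask-based workflow, assuming a single node with multiple GPUs; the file pattern is a placeholder, and multi-node runs would point the Client at a distributed scheduler instead.

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.cluster import KMeans

# One Dask worker per local GPU.
cluster = LocalCUDACluster()
client = Client(cluster)

# Hypothetical partitioned feature files, loaded across the workers.
ddf = dask_cudf.read_csv("features-*.csv")

# Familiar scikit-learn-style estimator, distributed under the hood.
km = KMeans(n_clusters=10)
km.fit(ddf)
labels = km.predict(ddf)
```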

cuML 0.9 expands random forest training, adding support for regression models and a new, layer-wise tree building algorithm that scales more efficiently to deep trees. Release 0.9 also brings initial support for multi-node, multi-GPU random forest building. With multi-GPU support, a single NVIDIA DGX-1 can train forests 56x faster than a dual-CPU, 40-core node².
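A short sketch of the new regression estimator on toy data; the shapes and hyperparameters here are illustrative only.

```python
import numpy as np
import cudf
from cuml.ensemble import RandomForestRegressor

# Toy GPU-resident training data (cuML expects float32 features).
n = 10_000
X = cudf.DataFrame({
    "f0": np.random.rand(n).astype(np.float32),
    "f1": np.random.rand(n).astype(np.float32),
})
y = (2.0 * X["f0"] + X["f1"]).astype(np.float32)

rf = RandomForestRegressor(n_estimators=100, max_depth=16)
rf.fit(X, y)
preds = rf.predict(X)
```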

A Model Built to Scale: cuML Training to Inference

Scale doesn’t stop with training. To really scale data science on GPUs, applications need to be accelerated end-to-end. cuML 0.9 brings the next evolution of support for tree-based models on GPUs, including the new Forest Inference Library (FIL). FIL is a lightweight, GPU-accelerated engine that performs inference on tree-based models, including gradient-boosted decision trees and random forests. With a single V100 GPU and two lines of Python code, users can load a saved XGBoost or LightGBM model and perform inference on new data up to 36x faster than on a dual 20-core CPU node³. Building on the open-source Treelite package, the next version of FIL will add support for scikit-learn and cuML random forest models as well.
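Here is roughly what those two lines look like in practice; the model path is a placeholder and the batch is toy data.

```python
import numpy as np
from cuml import ForestInference

# Load a saved XGBoost model ("model.bst" is a hypothetical path);
# LightGBM models load the same way with model_type="lightgbm".
fil = ForestInference.load("model.bst", model_type="xgboost")

# Batched inference on new rows.
X = np.random.rand(100_000, 28).astype(np.float32)
preds = fil.predict(X)
```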

Figure 3: XGBoost CPU vs Forest Inference Library (FIL) GPU inference speedup
Figure 4: XGBoost CPU and FIL inference time scaling as batch size increases (lower is better)

cuML will support inference for additional algorithms on GPUs in the future. If you would like to request accelerated inference support for a particular algorithm, please open an issue on GitHub.

A Model Built to Scale: XGBoost & Interoperability

RAPIDS aims to support a thriving ecosystem of GPU-accelerated data science software that works together to enable end-to-end computing. One example is XGBoost. RAPIDS is proud to continue supporting XGBoost, one of the most popular packages for building decision tree models. The RAPIDS team developed a new GPU-XGBoost bridge, which allows users to seamlessly move from cuDF dataframes to XGBoost training without passing data through CPU memory. The bridge leverages the open __cuda_array_interface__ standard to make high-performance transfer from cuDF possible, and it will be extended to other libraries that support the standard in the near future. This change has been contributed upstream to XGBoost and will be available in its upcoming 1.0 release.
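As a sketch of what the bridge enables (the file and column names are placeholders, and the interface assumes an XGBoost build that includes the upstreamed change):

```python
import cudf
import xgboost as xgb

# Hypothetical training file with a 'label' column.
gdf = cudf.read_csv("train.csv")
y = gdf.pop("label")

# DMatrix reads the cuDF columns via __cuda_array_interface__,
# so the data never round-trips through host (CPU) memory.
dtrain = xgb.DMatrix(gdf, label=y)
bst = xgb.train({"tree_method": "gpu_hist", "max_depth": 8},
                dtrain, num_boost_round=100)
```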

This Is Not a Toy: Dataframe Features Galore

cuDF 0.9, the RAPIDS data frame library built on Apache Arrow, adds key features including print formatting, datetime support, categorical support, and ecosystem interoperability. Some highlights:

  • Pretty printing: cuDF Series and DataFrames printed in the console or in a notebook are nicely formatted, similar to Pandas.
  • Expanded datetime support covers units from nanosecond to second granularity (see the short sketch after this list).
  • Dictionaries for categorical columns are now kept on the GPU, and new APIs increase cuDF Pandas compatibility.
  • cuDF 0.9 adds more bridges for ecosystem interoperability including direct __cuda_array_interface__ support for cuDF objects with the new mask attribute, as well as __array_function__ protocol support for cuDF objects based on NEP18.
  • Significantly, dask-cuDF is now merged with the cuDF repository, providing a single code base for ETL (though this is a transparent change for users downloading RAPIDS conda packages).
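A quick sketch of a few of these features together; the values are arbitrary, and the string-to-datetime cast assumes ISO-formatted input.

```python
import cudf

# Datetime columns at second granularity (units from ns up to s work).
dates = cudf.Series(["2019-08-28", "2019-08-29"]).astype("datetime64[s]")
print(dates)  # pretty-printed, pandas-style output

# The categorical dictionary lives on the GPU alongside the codes.
colors = cudf.Series(["red", "blue", "red"], dtype="category")
print(colors.cat.categories)
```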

Underneath the RAPIDS ETL hood is libcudf, the C++ and CUDA library that provides the high-performance GPU implementation of cuDF. libcudf 0.9 adds a new GPU-accelerated Apache Avro reader, as well as improvements to other readers such as broader input type support, parsing hexadecimal numbers in CSV, and better ORC timestamp support. Version 0.9 also adds new algorithms such as cudf::merge for merging sorted columns and cudf::upper_bound and cudf::lower_bound, which enable searching for an array of keys in parallel on sorted columns and tables (exposed in Python via searchsorted). This release also adds mean, standard deviation, and variance aggregations on columns (cuDF Series). cudf::apply_boolean_mask and cudf::drop_nulls now operate on whole tables (DataFrame) rather than individual columns (Series). cudf::is_sorted checks whether a table/DataFrame is sorted, and cudf::nans_to_nulls enables converting floating point NaN values to a null bitmask.
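For example, the Python-level searchsorted behaves like its pandas namesake, but the lookups for all keys run in parallel on the GPU; a toy sketch:

```python
import cudf

haystack = cudf.Series([1, 3, 5, 7, 9])  # must already be sorted
needles = cudf.Series([2, 6, 10])

# Index at which each key would be inserted to keep the column sorted.
positions = haystack.searchsorted(needles)
print(positions)  # 1, 3, 5
```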

In addition, cudf::rolling_window now supports just-in-time (JIT) compilation of user-defined functions (UDFs). The Python cuDF interface allows you to pass a numba @cuda.jit function as a UDF to rolling windows. Underneath, cuDF gets the PTX assembly code for the function from numba, injects it into the rolling window CUDA kernel, JIT-compiles it, and executes it. The compiled code is cached, so repeated invocations are fast. This is a generalization of the approach added for binary operations and cudf::transform in cuDF 0.8.
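A minimal sketch of the Python-side pattern, using a plain Python window function that cuDF hands to numba for compilation:

```python
import cudf

# UDF evaluated over each window; numba compiles it to PTX and cuDF
# injects it into the rolling-window CUDA kernel.
def window_sum(window):
    total = 0
    for value in window:
        total += value
    return total

s = cudf.Series([1, 2, 3, 4, 5])
print(s.rolling(window=3, min_periods=1).apply(window_sum))
```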

Far from being a toy, cuDF 0.9 is a big step in terms of robustness and functionality. Last, but certainly not least, you will notice significant speedups in cuDF this release, including large performance improvements for join (up to 11x) and for gather and scatter on tables (2–3x faster), and more, as you can see in Figure 5.

Figure 5: cuDF vs Pandas speedups on a single NVIDIA Tesla V100 GPU vs dual Intel Xeon E5-2698 v4 CPUs (20 cores each)

This Is Not a Toy! It’s a Model, Built to Scale: The RAPIDS Ecosystem

Many from RAPIDS were in Alaska a few weeks ago at KDD. BlazingSQL went open source, Iguazio published a blog about RAPIDS integration with Nuclio, and NVIDIA announced GPUDirect Storage (which RAPIDS will leverage in coming releases). All are huge steps that advance the ecosystem. But it’s users like you who advance the ecosystem more than anything else. It was great to meet so many data scientists and engineers at KDD and tell them about all the GPU-accelerated software that integrates to form the RAPIDS ecosystem.

In the coming months, the RAPIDS engineering team will be out in numbers presenting and giving tutorials at local meetups, conferences, and hackathons across the globe. Notably, we’ll be at GTC DC in force. Join in! The RAPIDS team wants to make it easy for you to present RAPIDS content at your meetups or local events. Join the RAPIDS Slack, then join the #RAPIDS-meetup channel and let the team know how they can help. Download the latest slides for RAPIDS content (updated each release). As always, continue filing issues for feature requests or functionality that doesn’t work as expected, and please star your favorite repos.

And here’s a special treat: new video walkthroughs on the RAPIDS YouTube channel! There are several more in the works, so subscribe and enjoy!

Footnotes:

[1]: Single-precision, 3e08 rows, 50 features, 10 clusters, 16 V100–32GB total across 2 DGX-1 nodes.

[2]: 800 trees, max depth=16, 8x V100–32GB DGX-1. 3.5s on GPU vs. 202s with scikit-learn on a 2x20-core Xeon @ 2.2GHz. Scikit-learn 0.21 with n_jobs=-1.

[3]: Tested on a DGX-1, 2x20 core Xeon 2698v4 vs. V100–32GB GPU on the “airline” gbm-bench dataset using cuML 0.9 and XGBoost 0.90.rapidsdev1. Model with 1000 trees and max depth of 10.
