Two Years in a Snap — RAPIDS Release 0.16

Josh Patterson
Published in RAPIDS AI
8 min read · Nov 2, 2020

About two years ago, I was headed home from Munich, exhausted. The team, only about 20 people at the time, was exhausted. Like the post-credits scene in Avengers, our table of superheroes ate in silence and reflected on the battle. We won. We had the first major victory in saving the data science multiverse from the otherworldly forces of single-threading, code complexity, and waiting. We had launched RAPIDS.

It feels like it’s been 16 different movies since then, all different, all connected. But we are nowhere near our endgame. This was on full display at NVIDIA’s Fall GTC 2020, where we had 23 talks about various aspects of RAPIDS. RAPIDS can go big, as demonstrated in talks like the one on how RAPIDS and BlazingSQL are powering COVID-19 drug discovery on the Summit supercomputer at Oak Ridge National Laboratory (new blog from ORNL here). RAPIDS can go small, running on embedded devices at the edge. RAPIDS has flow, as our two talks on streaming show. We’re in the cloud, we’re on the ground, and we’re processing sound.

Every new parent has been warned about the “terrible twos,” but with RAPIDS, I just don’t see it happening. We’re entering the terrific twos. Since our last release, we’ve done some blockbuster work. We have some of the most modern NLP techniques working at scale and faster than ever before. Our team won the 2020 RecSys Challenge with a GPU-accelerated pipeline that runs end-to-end in two minutes and eighteen seconds; even the folks at Twitter were impressed. We’re teaching people how to build out hyperparameter optimization on AWS with a great new tutorial video. For all our graph fans out there, we wrote about how you can get a speedup as big as 328x by changing a single line of code. NetworkX Graph and DiGraph objects are now supported in cuGraph.

Let me briefly recap some of the major updates in this release.

RAPIDS cuDF

As always, the cuDF team was busy. We closed 71 bugs and made 97 improvements to the code. In terms of new features, I’m excited that we have implemented DataFrame.pivot() and DataFrame.unstack(), methods I relied on earlier in my career. To make feature engineering even easier, we now have dayofweek as part of our DateTime features; before, you had to roll your own method, which was not the simplest thing to do. We also added initial support for struct columns for even more nested type support. Additionally, we kicked off a refactoring effort on the IO readers and writers that will let us build new features and fix bugs faster going forward.
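Here is a minimal sketch of the new datetime and reshaping features, using made-up data:

```python
import cudf

# Hypothetical sales data to exercise the new 0.16 features
df = cudf.DataFrame({
    "date": cudf.to_datetime(["2020-10-05", "2020-10-06",
                              "2020-10-05", "2020-10-06"]),
    "store": ["A", "A", "B", "B"],
    "sales": [10, 12, 7, 9],
})

# New in 0.16: dayofweek as a DateTime feature (Monday=0, matching pandas)
df["dow"] = df["date"].dt.dayofweek

# New in 0.16: DataFrame.pivot(), mirroring the pandas API
wide = df.pivot(index="date", columns="store", values="sales")
print(wide)
```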

RAPIDS cuML and XGBoost

cuML made a big push to expand support for new preprocessing methods, with the addition of Dask LabelEncoder, distributed TF-IDF, Porter stemming for NLP, and a new, experimental preprocessing module. The preprocessing module includes support for much of the (large) Scikit-learn preprocessing API. While it is still a work in progress, we hope users can take advantage of the preview to begin moving more complex ML pipelines entirely onto the GPU.
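As a taste of the experimental module, here is a minimal sketch; the import path below assumes the 0.16 experimental namespace and may move as the module stabilizes:

```python
import cupy as cp
from cuml.experimental.preprocessing import StandardScaler  # experimental path in 0.16

# Random feature matrix, already resident on the GPU as a CuPy array
X = cp.random.rand(10_000, 4)

# Familiar Scikit-learn-style fit/transform, executed entirely on the GPU
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```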

RAPIDS 0.16 also includes a snapshot of DMLC XGBoost that incorporates the new GPUTreeSHAP library to accelerate SHAP-based model explanations. SHAP incorporates ideas from game theory to compute well-justified, additive analyses of how each feature contributes to a given prediction, and it can be extended to compute the impact of interacting features as well. In the past, data scientists often avoided SHAP because it could be computationally expensive. Now, with GPUTreeSHAP, a single GPU can compute explanations 20x faster than a 40-core CPU node, and the speedups grow even greater when analyzing interactions. Check out this simple demo to get started or dive in deeper with this draft research paper.
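Here is a minimal sketch of how that looks through the XGBoost API, with random data purely for illustration:

```python
import numpy as np
import xgboost as xgb

# Toy regression problem, just to illustrate the API
X = np.random.rand(1000, 10)
y = np.random.rand(1000)
dtrain = xgb.DMatrix(X, label=y)
model = xgb.train({"tree_method": "gpu_hist"}, dtrain, num_boost_round=50)

# With the GPU predictor, SHAP computations route through GPUTreeSHAP
model.set_param({"predictor": "gpu_predictor"})
shap_values = model.predict(dtrain, pred_contribs=True)            # per-feature contributions (+ bias column)
shap_interactions = model.predict(dtrain, pred_interactions=True)  # pairwise interaction effects
```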

TPOT is now accelerated with cuML and XGBoost. TPOT users can accelerate their AutoML pipelines with the new, natively supported “TPOT cuML” configuration by changing only one parameter in their code: pass “TPOT cuML” to the config_dict argument of your TPOTClassifier or TPOTRegressor instead of leaving it as None. Accelerated TPOT found pipelines with 2% higher accuracy on the Higgs Boson dataset and 1.3% higher accuracy on the Airlines dataset for the same time budget. For both dataset samples, the cuML configuration achieved higher accuracy in one hour than the default achieved in eight hours. Look for a blog that goes into greater detail in the coming days.
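In code, the change really is a single argument; the toy dataset below just stands in for your real training data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Synthetic data standing in for a real training set
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    config_dict="TPOT cuML",  # the one-parameter change: cuML/XGBoost-backed search space
    verbosity=2,
)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```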

RAPIDS cuGraph

In release 0.16, cuGraph kicked off three major long-term themes. The first is to go big. We have shifted to a new 2D data model that removes the 2 billion vertex limitation and offers better performance and scaling into the 100TB+ graph range. The first multi-node multi-GPU (MNMG) algorithms updated to use the new data model are PageRank, SSSP, BFS, and Louvain.

The second theme is to go wide by expanding our supported input data models. In 0.16, we are happy to announce that NetworkX graph objects are now valid inputs to our algorithms. We are still expanding interoperability between cuGraph and NetworkX and moving to support CuPy and other data types. The last theme is to go small and develop a collection of graph primitives. The primitives support both single-GPU and multi-GPU workflows and allow us to maintain a single code base for both.
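For example, here is a minimal sketch of handing an existing NetworkX graph straight to a cuGraph algorithm:

```python
import networkx as nx
import cugraph

# Any existing NetworkX graph works as-is
G = nx.karate_club_graph()

# Pass the NetworkX object directly; no manual conversion to a cuGraph Graph required
pagerank_scores = cugraph.pagerank(G)
```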

RAPIDS Memory Manager (RMM)

RMM 0.16 focused on reducing fragmentation in multithreaded usage and on CMake improvements. This release includes a ton of CMake improvements from contributor Kai Germaschewski that make it easier to use RMM in other CMake-based projects (and more improvements are coming!). It also includes a new arena memory resource that reduces fragmentation when many threads share a single GPU, as well as improvements to the pool memory resource to reduce the impact of fragmentation. Another new memory resource is the `limiting_resource_adaptor`, which allows you to impose a maximum memory usage on any `device_memory_resource`. We have improved diagnostics with debug and trace logging, currently supported in `pool_memory_resource`. A new simulated memory resource allows running the RMM log replayer benchmark with a simulated larger memory, which can help with diagnosing out-of-memory errors and fragmentation problems. And last, but definitely not least, by removing previously deprecated functionality, librmm is now a header-only library.
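If you have not tried the pool allocator yet, enabling it from Python is a one-liner; the pool size below is an arbitrary example:

```python
import rmm

# Pre-allocate a 2 GiB pool; sub-allocating from it reduces fragmentation
# and cuts per-allocation overhead for downstream RAPIDS libraries
rmm.reinitialize(pool_allocator=True, initial_pool_size=2 << 30)
```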

RAPIDS Dask-cuDF

For the 0.16 release, Dask-cuDF added an optimized groupby aggregation path for applying many aggregations at once. Previously, Dask-cuDF ran each aggregation operation serially against the groupby object; now it calls the aggregation operations in parallel on the GPU. This is a big step forward for performance.
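The optimized path kicks in when you request several aggregations in one call, for example:

```python
import cudf
import dask_cudf

# Synthetic data spread across 8 GPU partitions
df = cudf.datasets.randomdata(nrows=1_000_000,
                              dtypes={"key": int, "x": float, "y": float})
ddf = dask_cudf.from_cudf(df, npartitions=8)

# Multiple aggregations are now executed in parallel on the GPU
out = ddf.groupby("key").agg({"x": ["mean", "std"],
                              "y": ["sum", "max"]}).compute()
```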

RAPIDS cuSignal

cuSignal 0.16 focuses on benchmarking, testing, and performance. We now have 100% API coverage within our pytest suite, ensuring that deployed features are numerically comparable to SciPy Signal. Further, our performance studies found multiple functions better suited to custom CuPy elementwise CUDA kernels than to standard CuPy functions, resulting in 2–4x performance gains over cuSignal 0.15.
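Usage stays a near drop-in replacement for SciPy Signal; here is a small sketch with arbitrary sizes:

```python
import cupy as cp
import cusignal

# Ten million samples, generated directly on the GPU
sig = cp.random.randn(10_000_000)

# Same call shape as scipy.signal.resample_poly, computed on the GPU
resampled = cusignal.resample_poly(sig, up=2, down=3, window="hamming")
```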

BlazingSQL

It’s now easier than ever to get started with RAPIDS and BlazingSQL. You can now find BlazingSQL containers on the RAPIDS Getting Started Selector, and we have expanded Blazing Notebooks to include more RAPIDS packages (CLX and cuXfilter) with a multi-GPU private beta slated for public release in early November.
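Getting from a cuDF DataFrame to a SQL query takes only a few lines:

```python
from blazingsql import BlazingContext
import cudf

bc = BlazingContext()

# Register a cuDF DataFrame as a SQL table and query it on the GPU
gdf = cudf.DataFrame({"id": [1, 2, 1], "amount": [10.0, 20.5, 7.25]})
bc.create_table("orders", gdf)
result = bc.sql("SELECT id, SUM(amount) AS total FROM orders GROUP BY id")
```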

For version 0.16, we have been working hard closing out dozens of user-submitted issues. At the same time, we have been working on a major overhaul of the communications layer in BlazingSQL. SQL queries are shuffle-heavy operations; the new communication layer (soon to be merged into the 0.17 nightlies) increases performance across 95% of workloads while setting us up to utilize UCX, unlocking technologies such as NVIDIA NVLink and Mellanox InfiniBand for even greater performance.

NVTabular

NVTabular provides fast on-GPU feature engineering and preprocessing, plus faster GPU-based data loading for PyTorch, TensorFlow, HugeCTR, and Fast.ai, speeding up tabular deep learning workflows by 5–20x when used in conjunction with NVIDIA AMP. Since its inception, it has relied on RAPIDS cuDF to provide core IO and dataframe functionality.

With the recent 0.2 release, NVTabular is now even more integrated with the RAPIDS ecosystem, switching from a custom iteration back end to one built entirely on Dask-cuDF. This means users can now pass Dask-cuDF or cuDF dataframes as input, in addition to the many file formats already supported by cuIO, and can mix and match between NVTabular and Dask-cuDF seamlessly, writing custom ops for NVTabular directly in Dask-cuDF. It also allows for easy scaling across multiple GPUs or even multiple nodes. In a recent benchmark, we were able to preprocess the 1.2TB, 4-billion-row Criteo Ads dataset in under 2 minutes on a DGX A100.
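As a rough sketch of the 0.2-era workflow API (the column names here are hypothetical, and exact op and method names may differ slightly from this outline):

```python
import nvtabular as nvt
from nvtabular import ops

# Declare the schema: categorical columns, continuous columns, and the label
workflow = nvt.Workflow(
    cat_names=["ad_id", "site_id"],   # hypothetical columns
    cont_names=["clicks", "cost"],
    label_name=["label"],
)
workflow.add_cont_preprocess(ops.Normalize())
workflow.add_cat_preprocess(ops.Categorify())

# Datasets are backed by Dask-cuDF; files, cuDF, or Dask-cuDF frames all work as input
dataset = nvt.Dataset("criteo_*.parquet", engine="parquet")
workflow.apply(dataset, output_path="./processed/")
```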

RAPIDS 0.16 introduces list support in cuDF, which enables NVTabular’s most requested feature: multi-hot categorical columns. The team is hard at work on that for NVTabular version 0.3.

CLX

CLX saw multiple performance improvements and tweaks that make the example notebooks faster and easier to use, but the big enhancements come to cyBERT. In addition to the on-demand batched mode previously supported, cyBERT can now run as a streaming pipeline for continuous, inline log parsing. cyBERT has also been extended to support ELECTRA models alongside the previously supported BERT models. While BERT is still preferred for the log types we’ve observed thus far (providing higher parsing accuracy, albeit at a slightly slower speed), ELECTRA support gives cyBERT users greater flexibility in choosing a model that works for their network environments. cyBERT also got a few more tweaks and improvements, including a new data loader that helps keep up with larger streaming pipelines.

RAPIDS Community

We’re starting a podcast called RAPIDS Fire. The first episode releases in early November, so keep an eye out for the announcement. We’d love your feedback on topics, and we invite the community to join the dialogue. The format is going to be unique, just like RAPIDS. There will be a host, Paul Mahler, and rotating co-hosts; I’m up first. The two hosts will interview a guest on anything and everything related to accelerated data science and the RAPIDS community. We’re really pumped about this, so expect a blog announcing the first episode. The podcast will be available anywhere podcasts are found.

Wrap Up

RAPIDS has made so much progress in two years that I almost can’t believe it myself. We have a lot of exciting things on the way in version 0.17: more improvements to SHAP, more MNMG algorithms in cuGraph, and continued improvements to nested type support in cuDF. As always, find us on GitHub, follow us on Twitter, and check out our documentation and getting started resources. We’re excited to have you join us, and we’re looking forward to another great year of RAPIDS.

And Vote!
