RAPIDS Going the Distance in 0.12
RAPIDS in 2020: Scale and Stability
Wow, 12 releases felt like a heavyweight boxing match at times. RAPIDS came out with a plan, but after that first punch, RAPIDS learned to adapt to the ever-changing needs of our users and the open-source ecosystem. Those adaptations made us stronger, faster, more flexible, and better prepared for the road ahead. Where does that road lead? To more scale, and of course, 1.0. There’s much more to say about 1.0 in the 0.13 and 0.14 releases, so today, let’s talk about scale.
Our 0.12 release involved a ton of under-the-hood refactoring, from RAPIDS core to the individual libraries. While somewhat painful (taking punches is never fun), it's been worth it. The RAPIDS libraries are more interoperable and more performant, all while ensuring minimal impact to users; Python users should simply see everything go faster with no changes to their code. The RAPIDS 0.11 release blog detailed the extensive work to refactor the lower-level C++ libcudf library (internally named “The Great libcudf++ Refactor”). That port is now close to complete, which means “The Great cuDF++ Refactor” begins (the “++” here means additional languages, not C++). This tight integration of the lower-level C++ libcudf with the Pythonic cuDF will greatly improve performance and ease of development, and the libcudf++ refactor also improves interoperability with BlazingSQL and Java. Meanwhile, the cuML team is focused on building multi-node, multi-GPU (MNMG) algorithms to scale to larger datasets and on adding pickling and model-object cloning for improved usability. cuGraph is also finishing its refactoring efforts to improve interoperability with cuDF and cuML, as well as to accelerate the development of new, scalable MNMG algorithms.
All of this allows 0.13 to be an amazing release. After slugging it out for 12 rounds, RAPIDS is more battle-hardened and ready to take on bigger data at scale. NVIDIA’s GPU Technology Conference (GTC) coincides with the 0.13 release, and while we will blog about all the new features and performance, let’s just say you’ll want ringside seats to meet the team and hear all the updates first hand. Now let’s go blow-by-blow through the improvements in the 0.12 release.
Core Library Updates
cuDF
cuDF 0.12 focused on continuing “The Great libcudf++ Refactor,” and we’re proud to say it’s nearly done. Key APIs, such as join, sort, sort-based groupby, and the majority of string functions, have been ported to the new libcudf++ APIs and data structures. In addition to porting existing APIs, there are a few new features in libcudf: a new GPU-accelerated parquet writer, left semi-join and left anti-join implementations, and a clamp implementation.
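For Python users, the new Parquet writer surfaces through cuDF's I/O API. Here's a minimal round-trip sketch (the file path and data are illustrative):

```python
import cudf

# Build a small DataFrame on the GPU.
df = cudf.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})

# Write with the GPU-accelerated Parquet writer, then read it back.
df.to_parquet("scores.parquet")  # output path is illustrative
roundtrip = cudf.read_parquet("scores.parquet")
print(roundtrip)
```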
On the Python side, 0.12 kicked off “The Great cuDF++ Refactor”, a refactor of the Python internals to more closely align with the libcudf++ library and improve BlazingSQL and Java integration. Key accomplishments so far include new internal classes that are the basis for all computation in cuDF (Buffer, Column, Table, and Frame) while allowing easier interoperation with the Python CUDA ecosystem. There are also a few new features, including CuPy 7.0+ support, DataFrame.memory_usage() and Series.memory_usage(), as well as is_nan() and is_notnan() APIs.
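Here's a quick sketch of those new APIs on toy data; we're assuming is_nan() and is_notnan() are exposed as Series methods, per the notes above:

```python
import cudf

df = cudf.DataFrame({"a": [1.0, float("nan"), 3.0], "b": [10, 20, 30]})

# Per-column memory footprint in bytes, mirroring pandas.
print(df.memory_usage())
print(df["a"].memory_usage())

# Element-wise NaN checks on a numeric Series.
print(df["a"].is_nan())
print(df["a"].is_notnan())
```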
Version 0.12 also includes improvements to the RAPIDS Memory Manager (RMM) to expose memory usage info to Python, copy device_buffer objects to host memory, and improve memory lifetime semantics.
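For example, a device buffer can now be copied straight back to host memory; a minimal sketch, assuming the rmm.DeviceBuffer Python API:

```python
import rmm

# Allocate 64 bytes of GPU memory through RMM.
buf = rmm.DeviceBuffer(size=64)
print(buf.size)  # number of bytes owned by the buffer

# Copy the device memory back to host as a Python bytes object.
host_copy = buf.tobytes()
print(len(host_copy))
```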
cuML
cuML 0.12 focused on building out the framework and core code for additional MNMG algorithms as well as adding key quality of life features requested by users.
cuML’s K-Nearest Neighbors models now support MNMG scaling. Kaggle Grandmaster Chris Deotte demonstrated in this blog post that even the single-GPU RAPIDS kNN algorithm can deliver speedups of up to 600x for real-world problems, and MNMG support allows users with huge datasets to take advantage of this speed as well. This release also adds Support Vector Regression (SVR) to complement the Support Vector Classification model already in cuML.
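Here's what both estimators look like on a single GPU with synthetic data (parameter values are illustrative; the MNMG variants additionally require a Dask cluster):

```python
import cudf
import numpy as np
from cuml.neighbors import NearestNeighbors
from cuml.svm import SVR

# Synthetic data: 100 rows, 2 features, all float32.
X = cudf.DataFrame({
    "x0": np.random.rand(100).astype(np.float32),
    "x1": np.random.rand(100).astype(np.float32),
})
y = cudf.Series(np.random.rand(100).astype(np.float32))

# k-nearest neighbors: distances and indices of the 3 closest rows.
nn = NearestNeighbors(n_neighbors=3).fit(X)
distances, indices = nn.kneighbors(X)

# Support Vector Regression, new in 0.12.
svr = SVR(kernel="rbf", C=1.0).fit(X, y)
preds = svr.predict(X)
```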
Random Forests, one of the most popular supervised learning models, previously consumed a large amount of memory for deep trees with its layerwise tree building algorithm. Now, the algorithm in cuML switches to an alternative batched algorithm as depth increases, allowing it to support extremely deep trees. This pairs well with the Forest Inference Library’s support for sparse forests, which allows users to infer on extremely deep trees as well.
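Deep trees are just a hyperparameter away; a minimal sketch with synthetic data and an illustrative depth:

```python
import numpy as np
from cuml.ensemble import RandomForestClassifier

# Synthetic binary classification data (cuML RF expects float32/int32).
X = np.random.rand(1000, 10).astype(np.float32)
y = np.random.randint(0, 2, 1000).astype(np.int32)

# A large max_depth is now practical: past a depth threshold, tree
# construction switches to the memory-friendly batched algorithm.
rf = RandomForestClassifier(n_estimators=10, max_depth=30)
rf.fit(X, y)
preds = rf.predict(X)
```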
cuML is working to ensure that every model can be smoothly pickled and cloned to support both persistence and integration with hyperparameter sweep frameworks, which often rely on model object cloning. Most models are already covered in 0.12, and the few remaining stragglers will get full support and testing for the 0.13 release.
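For instance, persisting a trained estimator is now plain pickling; a quick sketch with stand-in data:

```python
import pickle
import numpy as np
from cuml.linear_model import LinearRegression

X = np.random.rand(100, 3).astype(np.float32)
y = np.random.rand(100).astype(np.float32)

model = LinearRegression().fit(X, y)

# Persist the trained model with standard pickle...
blob = pickle.dumps(model)

# ...and restore it later, ready to predict immediately.
restored = pickle.loads(blob)
preds = restored.predict(X)
```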
cuGraph
cuGraph 0.12 focused on wrapping up most of the graph refactoring effort, with only one remaining item rolling into 0.13. This release also included code updates driven by the cuDF refactor, along with improvements to our test cases and coverage. MultiGraph is now included, adding to the list of supported graph types.
The cuGraph team did have time for a few enhancements and additions. Ensemble Clustering for Graphs (ECG), a variant of Louvain that aims to address some of its resolution limits and reproducibility shortcomings, is now available. cuGraph's Louvain algorithm also gained a max_iter parameter to cap the number of iterations, since some algorithms, such as ECG, need to access the intermediate result of Louvain at a specific level.
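Both are one-liners from Python; a sketch on a toy edge list (column names and the max_iter value are illustrative):

```python
import cudf
import cugraph

# A tiny undirected edge list (vertex IDs are illustrative).
edges = cudf.DataFrame({"src": [0, 1, 2, 0], "dst": [1, 2, 0, 3]})
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# Louvain with a cap on the number of iterations/levels.
parts, modularity = cugraph.louvain(G, max_iter=10)

# ECG, the new ensemble variant built on top of Louvain.
ecg_parts = cugraph.ecg(G)
```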
The team also vastly improved the renumbering feature, allowing for both auto-renumbering (and un-renumbering) and multi-column renumbering. When data is added to a graph, the underlying code automatically renumbers the columns so that the starting vertex ID is zero and all subsequent values are contiguous. Renumbered vertex IDs are automatically converted back to the original value when necessary. Renumbering now supports multi-column specification for both the source and destination addresses, and those columns can be of any data type. Some examples include a vertex being a Name (string) + Zip Code (integer) or, for cybersecurity data, an IP address (string or int64) and a port number (int).
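Here's a sketch of the multi-column case using the cybersecurity example above; we're assuming from_cudf_edgelist accepts lists of column names for this, so treat the exact call as illustrative:

```python
import cudf
import cugraph

# Each endpoint is identified by (IP address, port), not a single integer.
edges = cudf.DataFrame({
    "src_ip":   ["10.0.0.1", "10.0.0.2"],
    "src_port": [443, 8080],
    "dst_ip":   ["10.0.0.2", "10.0.0.1"],
    "dst_port": [8080, 443],
})

G = cugraph.Graph()
# Multi-column vertices are renumbered to contiguous integer IDs internally;
# results are mapped back to the original (ip, port) pairs when needed.
G.from_cudf_edgelist(
    edges,
    source=["src_ip", "src_port"],
    destination=["dst_ip", "dst_port"],
    renumber=True,
)
```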
Ecosystem Updates
Dask
Release 0.12 includes many usability and bug fixes for Dask and Dask-cuDF. Many of these issues go unreported, leaving users in the awkward position of working around what is often a bug. Specifically, Dask removed the need for a distributed sort in set_index operations when columns are pre-sorted, improved Parquet reading, and fixed groupby aggregations when using MultiIndex columns. Dask also added the Pandas-equivalent memory_usage method to high-level objects like Dask-cuDF DataFrames and Series; gathering memory usage statistics, often in exploratory sessions, can be invaluable when debugging and optimizing new workflows. For all you viz fans, Dask added a new visualization to the Dask Distributed dashboard.
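A quick sketch of both conveniences with Dask-cuDF on synthetic data:

```python
import cudf
import dask_cudf

gdf = cudf.DataFrame({"key": [1, 2, 3, 4], "val": [10.0, 20.0, 30.0, 40.0]})
ddf = dask_cudf.from_cudf(gdf, npartitions=2)

# 'key' is already sorted, so no distributed shuffle is needed.
ddf = ddf.set_index("key", sorted=True)

# Pandas-style memory accounting, computed lazily across partitions.
print(ddf.memory_usage().compute())
```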
This new cluster map dashboard helps to visualize transfers between workers and overall cluster communication. It’s a welcome addition and is sure to be useful as RAPIDS continues enhancing integration between Dask and UCX.
BlazingSQL
BlazingSQL 0.12 has been all about stability and scale. With this release, BlazingSQL integrates the long-promised capability of data-skipping. This means WHERE clauses in your SQL statements will selectively filter and load Apache Parquet row groups during query execution based on Parquet’s metadata. This substantially reduces the amount of data loaded into memory and enables users to work on even bigger workloads.
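From the user's side this is transparent; a minimal sketch with a hypothetical Parquet table:

```python
from blazingsql import BlazingContext

bc = BlazingContext()
bc.create_table("taxi", "/data/taxi.parquet")  # hypothetical path

# Only row groups whose Parquet metadata can satisfy the predicate
# are actually read from disk into GPU memory.
gdf = bc.sql("SELECT * FROM taxi WHERE fare_amount > 100.0")
```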
Version 0.12 also allows users to configure how BlazingSQL uses the RAPIDS Memory Manager (RMM). BlazingSQL now pre-allocates a memory pool, which improves query performance by avoiding repeated raw allocations mid-query. And when queries are larger than the available GPU device memory, BlazingSQL now spills to system memory.
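A sketch of opting into the pool; the keyword names and pool size here are assumptions, so check the BlazingSQL docs for the exact configuration knobs:

```python
from blazingsql import BlazingContext

# Ask BlazingSQL to have RMM pre-allocate a GPU memory pool up front;
# the 2 GB size is an illustrative assumption.
bc = BlazingContext(pool=True, initial_pool_size=2 * 1024**3)
```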
The BlazingSQL team has been hard at work integrating BlazingSQL with “The Great libcudf++ Refactor”. This refactor simplifies the BlazingSQL codebase, makes the engine faster, and reduces the memory footprint of queries involving strings. It will be released in version 0.13, alongside support for data-skipping on more file types and expanded SQL coverage.
cuxfilter
Cuxfilter has gone through a massive refactor, which you can read more about in our Data to .dashboard() post, and it is now available as part of a RAPIDS installation. Additionally, with Datashader adding GPU support, we've included it, along with deck.gl, among cuxfilter's available charts. Now is a great time to try cuxfilter out.
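A minimal sketch of the refactored flow (column names and chart choices are illustrative, and chart constructors may live under version-specific submodules):

```python
import cudf
import cuxfilter

gdf = cudf.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 10.0, 30.0]})

# Wrap the cuDF DataFrame and declare a couple of cross-filtered charts.
cux_df = cuxfilter.DataFrame.from_dataframe(gdf)
charts = [
    cuxfilter.charts.bar("x"),
    cuxfilter.charts.scatter(x="x", y="y"),
]

d = cux_df.dashboard(charts)
d.show()  # serves the interactive dashboard
```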
CLX
Speaking of cuxfilter, here's a recently posted blog on how to perform alert analysis at scale using GPUs, including GPU-accelerated visualizations with cuxfilter. You can read the blog and then try it out for yourself with a documented Jupyter notebook example.
Community Updates
The RAPIDS team will be out in force this year at NVIDIA's flagship GTC San Jose conference. We invite you to join us for the Accelerated Data Science track, full of tutorials, sessions, and socials covering how you can get started using GPUs for data science. Plus, it's an excellent opportunity to meet with NVIDIA engineers. If you can't make GTC, find us at PyCon US, where we have a workshop tutorial, a booth, and a few sessions.
Summary
Release 0.12 sets RAPIDS up for 0.13, which will be a major release. All the refactoring and hardening in 0.11 and 0.12 is there to let users scale to much larger problems in 0.13 and beyond. After 0.13 and all the excitement of GTC, 0.14 will be an entire release focused on quality-of-life improvements: refining the docs, continuing to work with the community on integrations, driving down the bug count, expanding the C++ examples, and adding more tests to the CI/CD system. In 0.14 and beyond, the focus will be stability at scale, hardening features, and preparing for 1.0.
As always, we want to thank all of you for using RAPIDS and contributing to our ever-growing GPU accelerated ecosystem. Please check out our latest release deck or join in on the conversation on Slack or GitHub.