RAPIDS 0.13: Ready To Launch!

All Systems Go

Josh Patterson
RAPIDS AI
12 min read · Apr 6, 2020


As I look back on release 0.13, I feel a fraction of how the astronauts on Apollo 13 must have felt: relieved to have landed and grateful for rooms of quick-thinking engineers. Early space missions can teach you a lot about what to do when things don’t go according to plan. You have to keep a level head, follow a checklist, and efficiently improvise solutions when plans A, B, and C don’t work.

When we started building open source tools for GPU data science, we had a vision of making the most advanced computing resources available to everyone. To quote a speech JFK made about the race to the moon, “We do these things not because they are easy, but because they are hard.” Let me tell you about some of the hard things the team did for this release to move our vision forward.

The Great cuDF++ Refactor Gets Official Lift-off

I mentioned in the 0.12 release blog that we kicked off “The Great cuDF++ Refactor”. We have made huge progress on this work for release 0.13. Nearly all of the code base has been ported to use the libcudf++ API, which means we are on a firm foundation for multi-language, multi-platform RAPIDS for the foreseeable future. Such a huge refactor can lead to bugs, even when carefully undertaken by the best engineers, so if you notice any problems, please file GitHub issues so we can resolve them quickly.

[Figure: Improvements from the refactor, release 0.10 to 0.13]

RAPIDS Core Library Updates

Everyone has more to think about these days given the current state of the world, so we’re going to split our release blog into two parts. Many of you went from one job to three or more as you spend more time at home (and hopefully practice social distancing), and reading a long release blog may not be top of mind. You can read the summary version here, or continue with the detailed, full-length blog below.

Dataframes: cuDF (by Keith Kraus & Mark Harris)

0.13 is another very big release for cuDF. A primary goal of 0.13 for libcudf was to finish the tasks in the “Great libcudf++ Port” remaining from release 0.12. This includes a new unified groupby implementation that automatically selects between hash-based and sort-based grouping algorithms depending on the aggregations requested, as well as memoization of some results for faster computation of compound aggregations. We also ported a few key strings operations and algorithms from the NVText library.

This release also sees many new features in libcudf, including a preview of a new dictionary column type. While this type is not yet supported across all algorithms in libcudf, support is underway, including gather and scatter implementations. 0.13 also includes the beginnings of a new fixed-point Decimal type that we will continue building out in future releases. Other new algorithms include rank() for columns (cuDF series) and tables (cuDF dataframes); groupby argmin/argmax (called idxmin/idxmax in Python) plus nth and nunique aggregations; column and table shift(); and sequence() for generating a new column filled with a sequence of values. On the I/O side, 0.13 includes new support for writing very large ORC and Parquet files split into chunks, as well as support for writing column-level statistics in ORC files.
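As a quick taste of a few of these through the Pandas-style Python API, here is a minimal sketch (the toy data and column names are ours; the aggregation spelling follows the idxmin/idxmax naming above):

```python
import cudf

df = cudf.DataFrame({"key": ["a", "a", "b", "b"], "val": [3, 1, 4, 1]})

# rank each value within the column (a cuDF Series)
print(df["val"].rank())

# shift values down one row; a null fills the vacated slot
print(df["val"].shift(1))

# groupby argmax is exposed as idxmax in Python
print(df.groupby("key").agg({"val": "idxmax"}))
```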

Finally, there are several optimizations and new APIs aimed at improving performance. Concatenating many tables or columns is up to 2000x faster. There is a new k-way sorted merge that can efficiently merge the rows of multiple sorted tables, and a new multi-column quantiles implementation that computes the quantiles of a table’s columns more efficiently. Partitioning has also been refactored, and the hash partition implementation is faster.
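The k-way sorted merge surfaces in Python as cudf.merge_sorted (it is what dask-cudf uses internally; treat the exact signature as an assumption). A hedged sketch:

```python
import cudf

# three frames, each already sorted by "ts"
parts = [
    cudf.DataFrame({"ts": [1, 4, 7]}),
    cudf.DataFrame({"ts": [2, 5, 8]}),
    cudf.DataFrame({"ts": [3, 6, 9]}),
]

# k-way merge preserves sortedness without a full re-sort
merged = cudf.merge_sorted(parts, keys=["ts"])

# concatenating many small frames benefits from the 0.13 speedups
combined = cudf.concat(parts, ignore_index=True)
```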

Another major goal of release 0.13 has been to reorganize the Python side of cuDF to take advantage of the new APIs provided by the libcudf++ refactor. One of the most exciting aspects of the libcudf++ refactoring is that strings are supported as a native, optimized data type (the former external library NVStrings is now deprecated). This adds many more string functions following the Pandas API. The new groupby implementation described above is now exposed in the Python API, adding more aggregations, including median, nunique, nth and std. Join support is expanded, adding semi-joins and anti-joins. And as always, this release improves Pandas API compatibility and adds some additional optimizations to make cuDF as fast and easy to use as ever.
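A hedged sketch of these Python-level features with toy data of our own (the how= spellings for semi- and anti-joins are an assumption based on the Pandas-style API):

```python
import cudf

df = cudf.DataFrame({
    "name": ["apollo 11", "apollo 13", "gemini 4"],
    "crew": [3, 3, 2],
    "days": [8, 6, 4],
})

# native string columns follow the pandas .str accessor
apollo = df[df["name"].str.contains("apollo")]
df["name"] = df["name"].str.upper()

# expanded groupby aggregations: median, std, nunique, nth
stats = df.groupby("crew").agg({"days": ["median", "std", "nunique"]})

# semi- and anti-joins via merge
left = cudf.DataFrame({"crew": [2, 3, 5]})
semi = left.merge(df, on="crew", how="leftsemi")   # rows of left with a match
anti = left.merge(df, on="crew", how="leftanti")   # rows of left without one
```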

From the dask-cuDF side, this release adds some key features and performance improvements to groupby, partitioning, sorting, and serialization. Key highlights include a new multi-column sort implementation underneath sort_values and a new repartitioning implementation that can repartition based on column values. Additionally, we’ve cleaned up the serialization code which makes communicating GPU-based objects as seamless and efficient as possible.
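A small sketch of the Dask-side additions (we use set_index as one way to trigger value-based repartitioning; exact entry points may vary by version):

```python
import cudf
import dask_cudf

gdf = cudf.DataFrame({"key": [2, 1, 2, 1], "x": [0.4, 0.3, 0.2, 0.1]})
ddf = dask_cudf.from_cudf(gdf, npartitions=2)

# multi-column distributed sort backed by the new implementation
ddf_sorted = ddf.sort_values(["key", "x"])

# repartition by column values so equal keys land in the same partition
ddf_by_key = ddf.set_index("key")
```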

Machine Learning: cuML and XGBoost (by John Zedlewski)

RAPIDS 0.13 is the first release to ship with XGBoost 1.0. For over five years, XGBoost has been a critical part of many data scientists’ toolkits, and the RAPIDS team is thrilled to support this project and its GPU acceleration. XGBoost 1.0 includes the first official release of the new Dask API, first-class support for both cuDF dataframes and CuPy arrays, GPU-accelerated learning-to-rank, and a new, robust serialization framework. RAPIDS will continue to distribute rapids-xgboost conda packages based on snapshots of the upstream XGBoost repository, tested for compatibility with the matching RAPIDS release.
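A minimal single-GPU sketch of the cuDF integration (toy data is ours; the Dask API follows a similar pattern via the xgboost.dask module):

```python
import cudf
import xgboost as xgb

X = cudf.DataFrame({"f0": [1.0, 2.0, 3.0, 4.0], "f1": [0.1, 0.4, 0.2, 0.8]})
y = cudf.Series([0, 1, 0, 1])

# XGBoost 1.0 accepts cuDF objects directly, no host round-trip needed
dtrain = xgb.DMatrix(X, label=y)
params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}
booster = xgb.train(params, dtrain, num_boost_round=10)
```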

cuML, the RAPIDS library for accelerated machine learning, continues to add features with a focus on scaling and user experience. For users with huge datasets, cuML 0.13 now includes new distributed, multi-node and multi-GPU linear models: OLS, Ridge regression, PCA, and tSVD. Try fitting a regression on a dataset with hundreds of millions of rows in under two seconds on a single 8-GPU machine, then scale that even further to billions of entries on a cluster and see how a 20x or greater training speedup can change your workflow. cuML will continue expanding these models with additional solvers, Lasso and ElasticNet models, and memory consumption reductions over the next two releases.
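A hedged multi-GPU sketch using dask-cuda (the toy data is ours; in practice the dask_cudf collections would be loaded directly across the workers):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import cudf
import dask_cudf
from cuml.dask.linear_model import LinearRegression

client = Client(LocalCUDACluster())  # one worker per local GPU

X = cudf.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [0.0, 1.0, 0.0, 1.0]})
y = cudf.Series([1.1, 2.0, 3.1, 4.0])
dX = dask_cudf.from_cudf(X, npartitions=2)
dy = dask_cudf.from_cudf(y, npartitions=2)

# distributed OLS: fit and predict stay on the GPUs end to end
ols = LinearRegression()
ols.fit(dX, dy)
preds = ols.predict(dX).compute()
```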

Instead of scaling up to one large model, hyperparameter optimization workloads require scaling out to run many models in parallel. If you’re running hundreds or thousands of model variants, GPU-acceleration can transform these from overnight jobs to nearly interactive workloads. While cuML has been useful for these workloads before, cuML 0.13 added many missing features and optimizations to make the hyperparameter optimization process seamless, including support for model cloning and serialization and many new scoring and metrics functions to keep your evaluation pipeline all on GPU. These all come together in a simple Dask-ML hyperparameter optimization notebook demo.
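The gist of that demo, sketched here with dask_ml’s GridSearchCV wrapped around a cuML estimator (this relies on the new cloning/serialization support; the dataset and parameter grid are ours):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask_ml.model_selection import GridSearchCV
from cuml.linear_model import Ridge
from sklearn.datasets import make_regression

client = Client(LocalCUDACluster())

X, y = make_regression(n_samples=10_000, n_features=20, random_state=0)
X, y = X.astype("float32"), y.astype("float32")

# every parameter combination trains as an independent GPU task
search = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```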

While benchmarks and scaling are critical for many users, developers shouldn’t have to give up productivity or simplicity to get speed. cuML has long supported three different input formats: cuDF dataframes, GPU arrays (from CuPy or other __cuda_array_interface__ libraries), and NumPy arrays on CPU. While users loved this input flexibility, it was often confusing to understand the output formats of various models. cuML 0.13 brings a new Base class for estimators to make outputs much more flexible. By default, all outputs of Base-derived estimators match the input type of the data passed to fit. So if you fit on a cuDF dataframe, your results will come out as cuDF DataFrame or Series objects. You can also configure this behavior globally; for instance, you can force all estimators to return NumPy arrays if you want to maximize compatibility with legacy CPU-based code. Many estimators use this new base class already, and release 0.14 will bring universal support for the new approach.
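A sketch of the behavior, assuming DBSCAN is among the estimators already moved onto the new base class (exact coverage in 0.13 may vary):

```python
import numpy as np
import cudf
from cuml.cluster import DBSCAN

X_np = np.random.rand(100, 2).astype(np.float32)
X_gdf = cudf.DataFrame({"x": X_np[:, 0], "y": X_np[:, 1]})

# outputs mirror the input type
labels_host = DBSCAN(eps=0.1).fit_predict(X_np)   # NumPy-compatible result
labels_gpu = DBSCAN(eps=0.1).fit_predict(X_gdf)   # cuDF Series
```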

Finally, many smaller features were added based on user feedback and priorities. ARIMA time series models now support seasonality. Many more classifiers now support the predict_proba interface. Multi-node, multi-GPU Random Forest inference now seamlessly uses the forest inference library on GPU. All models now support pickling. DBSCAN and UMAP performance improved significantly. And bugfixes were added throughout the stack. Please keep filing feature requests and issues on cuML! The team loves incorporating user feedback.
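For instance, pickling works across models, and probability outputs are more widely available. A hedged sketch with toy data (we assume RandomForestClassifier is among the classifiers gaining predict_proba):

```python
import pickle
import numpy as np
from cuml.ensemble import RandomForestClassifier

X = np.random.rand(200, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

clf = RandomForestClassifier(n_estimators=10).fit(X, y)
proba = clf.predict_proba(X)            # class probabilities on GPU
clf2 = pickle.loads(pickle.dumps(clf))  # all models round-trip through pickle
```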

Graph Analytics: cuGraph (by Brad Rees)

You never realize just how much code has been created until you need to go back over every function and refactor it. What you think is only a few weeks’ worth of work quickly turns out to be a much bigger problem. Luckily, we are winding up that long cuGraph refactoring journey. The refactoring includes:

  1. Removing several dependencies in the C++ code so that libcugraph is more of a stand-alone library,
  2. Modernizing the C++ API,
  3. Organizing C++ code into a “cuGraph” namespace,
  4. Updating all the error messages to be more meaningful, and
  5. Fixing a lot of bugs.

We have also continued the work of building out a collection of Graph classes and adding the basic functionality to each class. Currently, it is just (undirected) Graph and (directed) DiGraph, but more are coming. There is still some more cleanup to do, and lots of documentation to improve. For a more in-depth discussion on the refactoring, see the latest cuGraph blog.
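Building a graph from a cuDF edge list looks like this (the column names are ours):

```python
import cudf
import cugraph

edges = cudf.DataFrame({"src": [0, 1, 2, 2], "dst": [1, 2, 0, 3]})

G = cugraph.Graph()      # undirected
G.from_cudf_edgelist(edges, source="src", destination="dst")

DG = cugraph.DiGraph()   # directed
DG.from_cudf_edgelist(edges, source="src", destination="dst")
```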

We never like to stand still, so two new features were added:

  1. Betweenness Centrality (BC) is a key metric for measuring the influence (importance) of vertices based on the amount of information that flows over them. The basic definition is that a vertex’s BC score is the number of shortest paths that pass through that vertex.
  2. The other algorithm is K-Truss, and weighted K-Truss, for community/cluster detection. K-Truss is a variation of relaxed clique detection; a clique is a cluster where every vertex is connected to every other vertex in that cluster. The inventor of K-Truss, Jon Cohen, said that the problem with cliques is that they are both too rare and too common. As a community gets larger, the chances that everyone is fully connected go down. K-Truss allows for the identification of large, clique-like communities that tolerate missing edges.
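Both are available from Python; a hedged sketch (treat the exact function names as assumptions against your installed version):

```python
import cudf
import cugraph

edges = cudf.DataFrame({"src": [0, 0, 1, 1, 2, 2, 3],
                        "dst": [1, 2, 2, 3, 3, 4, 4]})
G = cugraph.Graph()
G.from_cudf_edgelist(edges, source="src", destination="dst")

# per-vertex betweenness centrality scores
bc_scores = cugraph.betweenness_centrality(G)

# extract the 3-truss: the maximal subgraph where every edge
# participates in at least k - 2 = 1 triangle
truss = cugraph.ktruss_subgraph(G, k=3)
```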

Data Visualization: cuXfilter (by Allan Enemark)

The previous release was a big one for cuXfilter, so we spent 0.13 doing some polishing. Mostly we focused on updating our documentation and improving package import time (it is about 4x faster now). We added deck.gl charts as the default choropleth map, with both 2D and 3D capability, and removed the library names from our API to simplify chart creation defaults. The viz team is working on some big things coming up, so stay tuned.
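Chart creation after that simplification looks roughly like this (toy data is ours; cuXfilter now picks a sensible backend per chart type):

```python
import cudf
import cuxfilter

gdf = cudf.DataFrame({"metric": [1, 2, 2, 3, 3, 3], "group": [0, 0, 1, 1, 2, 2]})
cux_df = cuxfilter.DataFrame.from_dataframe(gdf)

# no library name in the constructor, just the column to chart
chart = cuxfilter.charts.bar("metric")
dashboard = cux_df.dashboard([chart])
dashboard.show()
```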

Geospatial Analysis: cuSpatial (by Thomson Comer)

This release, cuSpatial joins the rest of the RAPIDS project docs at https://docs.rapids.ai. All cuSpatial docs are now parsed and rendered to HTML for 0.13 and nightly builds.

This release also delivered the culmination of an important customer request: trajectory interpolation. We’ve added batch cubic spline interpolation for trajectories and other curves. cuSpatial uses NVIDIA’s cuSPARSE library for fast, parallelized tridiagonal matrix inversion, typically the most expensive operation when computing cubic splines. Cubic spline interpolation is useful for computing actual positions from incomplete telemetry data in large datasets. Running on a single NVIDIA Tesla V100 GPU, cuSpatial can interpolate 1 million 5-point splines up to 5700x faster than scipy.interpolate.CubicSpline (running on a single core of an Intel Xeon E5-2698).
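A heavily hedged sketch of a single spline; the cuSpatial class is modeled on scipy.interpolate.CubicSpline, but treat the exact signature as an assumption (batched fitting over many trajectories takes additional id/offset arguments):

```python
import cudf
import cuspatial

# five sample points for one trajectory
t = cudf.Series([0.0, 1.0, 2.0, 3.0, 4.0])
y = cudf.Series([0.0, 1.0, 0.0, -1.0, 0.0])

# fit on GPU, then evaluate at new positions
spline = cuspatial.CubicSpline(t, y)
y_interp = spline(cudf.Series([0.5, 1.5, 2.5]))
```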

Signal Processing: cuSignal (by Adam Thompson)

I’d like to welcome the newest RAPIDS library to the party! cuSignal is a GPU-accelerated signal processing library that leverages and expands on the popular SciPy Signal API. With the emergence of AI and ML applications to sensor data (wireless communications for 5G networks, EEGs and EKGs in medicine, large-scale FFTs for oil and gas), there’s an increasing need not only for fast signal processing building blocks but also for exposing this speed in Python, with seamless data handoff to ML/DL frameworks. Unlike other RAPIDS projects, cuSignal currently lacks a C++/CUDA software layer and instead achieves GPU acceleration through the exclusive use of Python. Much of the codebase is written using CuPy as a drop-in replacement for NumPy, and where additional speed is needed, cuSignal leverages both raw CuPy CUDA kernels and custom Numba CUDA kernels.

Even if you’re not interested in signal processing, this is an exciting new way to look at building GPU-accelerated libraries without touching C++. The best part? cuSignal, Numba, and CuPy each support the __cuda_array_interface__, so you can build a signal processing workflow that streams data in, resamples, performs spectrum estimation, and then zero-copies that data to PyTorch — all without leaving the GPU.
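A sketch of such a pipeline (we use a smaller signal for brevity; resample_poly and welch mirror their scipy.signal namesakes, and torch.as_tensor consumes the GPU array without a copy):

```python
import cupy as cp
import cusignal
import torch

sig = cp.random.randn(2**24)  # signal generated directly on the GPU

# polyphase resampling and spectral estimation, as in scipy.signal
resampled = cusignal.resample_poly(sig, 2, 3, window="hamming")
freqs, psd = cusignal.welch(resampled)

# zero-copy handoff to PyTorch via __cuda_array_interface__
tensor = torch.as_tensor(resampled, device="cuda")
```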

cuSignal typically achieves 1–2 orders of magnitude speed improvement over CPU implementations, many of which have been optimized in C. Further, it’s been tested and deployed on NVIDIA Jetson embedded GPUs — from Nano to Xavier. In fact, Deepwave Digital is helping improve cuSignal’s streaming signal processing support on their AIR-T device that includes a Jetson TX2.

[Table: selected cuSignal speedups over SciPy Signal for a 100-million-sample float64 signal on a V100]

For more information about cuSignal, please visit both the initial release and version 0.13 blogs.

RAPIDS Community Updates

Communication: UCX-Py (by Benjamin Zaitlen)

The 0.13 release of UCX-Py closed out many outstanding bugs and usability issues encountered when leveraging NVIDIA NVLink. Additionally, we’ve been iteratively refactoring the codebase to give end-users a clean separation between a low-level interface to UCX and high-level semantics for asynchronous communications. Lastly, we have been focused on getting multi-node, multi-GPU InfiniBand tests passing, and the early results are very promising:

The plot above shows the bandwidth when merging two columns with random keys (30% matching) on an NVIDIA DGX-1. Because of the topology of a DGX-1, where two sets of four GPUs are connected by InfiniBand (IB) and PCIe devices, we see that while NVLink is faster than TCP, IB alone is very beneficial. Additionally, mixing InfiniBand and NVLink results in superior performance.
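These transports are opt-in through dask-cuda; a minimal sketch (the flags follow dask-cuda’s options, and cluster-specific tuning such as network device selection is omitted):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# the UCX protocol lets workers use NVLink and InfiniBand when available
cluster = LocalCUDACluster(
    protocol="ucx",
    enable_nvlink=True,
    enable_infiniband=True,
)
client = Client(cluster)
```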

BlazingSQL (by Rodrigo Aramburu)

The BlazingSQL 0.13 release focused on stability and bug fixes as the team geared up for a major feature release slated for 0.14. With 0.12, BlazingSQL released the first stable version of BlazingSQL running on top of the improved libcudf++. Such a large refactor meant we dedicated a significant amount of time to fixing bugs surfaced by community feedback and internal testing.

SQL coverage also improved, since some of those bugs had knock-on effects on previously unsupported SQL statements. Release 0.13 now supports round() and CASE expressions with strings, and avg() support has been added for distributed queries.
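A small sketch of the new coverage (the table and data are ours):

```python
import cudf
from blazingsql import BlazingContext

bc = BlazingContext()
gdf = cudf.DataFrame({"x": [1.234, 5.678], "s": ["hi", "bye"]})
bc.create_table("t", gdf)

# round() and CASE with strings, returned as a cuDF DataFrame
result = bc.sql("""
    SELECT ROUND(x, 1) AS x_rounded,
           CASE WHEN s = 'hi' THEN 'greeting' ELSE 'farewell' END AS kind
    FROM t
""")
```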

All of this bug fixing happened in conjunction with a new feature initiative called “Bigger than GPU,” which allows SQL queries that don’t fit into the available GPU memory to execute. How it works will be explained in more detail during the 0.14 release, but this project is already far along and will be enabled by default in nightly builds in a few weeks.

Machine Learning in Python

During the last few releases, Sebastian Raschka, Corey Nolet, and I wrote a survey paper, Machine Learning in Python: Main Developments and Technology Trends in Data Science, Machine Learning, and Artificial Intelligence. Here’s the abstract:

Smarter applications are making better use of the insights gleaned from data, having an impact on every industry and research discipline. At the core of this revolution lies the tools and the methods that are driving it, from processing the massive piles of data generated each day to learning from and taking useful action. Deep neural networks, along with advancements in classical machine learning and scalable general-purpose graphics processing unit (GPU) computing, have become critical components of artificial intelligence, enabling many of these astounding breakthroughs and lowering the barrier to adoption. Python continues to be the most preferred language for scientific computing, data science, and machine learning, boosting both performance and productivity by enabling the use of low-level libraries and clean high-level APIs. This survey offers insight into the field of machine learning with Python, taking a tour through important topics to identify some of the core hardware and software paradigms that have enabled it. We cover widely-used libraries and concepts, collected together for holistic comparison, with the goal of educating the reader and driving the field of Python machine learning forward.

The article was recently published in Information. I want to thank my co-authors for all their hard work. NVIDIA GPUs are the gold standard of machine learning and data science. We’re excited RAPIDS is advancing data science, uniting the developer ecosystem, and allowing end-users to efficiently use GPUs to do more than ever.

The Wrap Up

Release 0.13 was major, which means release 0.14 will focus on quality-of-life improvements. In 0.14, RAPIDS will refine its docs, continue to work with the community on integration, push down its bug count, pay down some technical debt, and add more tests to its CI/CD system. In 0.14 and beyond, we will focus on stability at scale, hardening features, and preparing for 1.0. RAPIDS was also aiming for a few major announcements in 0.13 that didn’t make our deadline. Don’t worry, they’ll be bigger and better in the future. RAPIDS will continue to improve, but in the present circumstances, many on the development team have a few additional responsibilities at home (we are sure this is common in the RAPIDS community and beyond), and getting through this global struggle is our top priority.

As always, we want to thank all of you for using RAPIDS and contributing to our ever-growing GPU accelerated ecosystem. Please check out our latest release deck or join in on the conversation on Slack or GitHub.
