RAPIDS Up to 0.11: It’s One Faster

Josh Patterson
Published in RAPIDS AI
9 min read · Dec 20, 2019

More Speed, Scale, and Reliability

As 2019 comes to a close, the RAPIDS team couldn’t end the year without one last release. The 0.11 release not only prepares us for 2020, but it also follows the theme of previous releases: making end-to-end data science on GPUs faster, more reliable, and more scalable. Given this is the last blog of the year, I would normally do a recap, but the 0.10 “Anniversary Blog” and the Thanksgiving “Thank You Blog” already did that. So, let’s just crank up the volume and jump into the music.

RAPIDS Core Updates:

cuDF

At the core of cuDF is the C++ library libcudf. It provides CUDA-accelerated C++ implementations of all of the data frame operations that cuDF supports, and it serves multiple clients, including the cuDF Python API, a Java/JNI API, and direct users of its C++ API such as BlazingSQL. These clients drive a variety of requirements for libcudf, and before 0.11, the libcudf core data structures made it difficult to expand functionality to meet them. The core structure, gdf_column, suffered from several problems:

  • gdf_column was a “plain old data” (POD) structure with no abstraction.
  • Memory was sometimes allocated outside the library (e.g. by Numba) and assigned to gdf_column, and sometimes allocated inside the library.
  • The ownership and lifetime of memory pointers assigned to gdf_column were not clearly defined.
  • libcudf APIs typically took output gdf_columns as parameters and were responsible for allocating memory, but the caller was then responsible for the lifetime of the column and for freeing its memory.
  • gdf_column only supported flat columns of fixed-width data.
  • Supporting string data required relying on a separate library, nvstrings.

Solving these problems has been a major undertaking of the libcudf team in releases 0.10–0.12. Internally, we’ve been calling the effort “The Great libcudf++ Port”, to reflect the C++ modernization of the library. Starting in 0.10, we introduced new core column and table data structures to replace gdf_column, and 0.11 focused on porting all of the cuDF algorithms over to use them (some of this work continues in 0.12). To complete this work, the libcudf team has expanded and borrowed engineers from other RAPIDS teams to help. We estimate that this expanded team completed over two engineering years of work in about two months. cuDF 0.11 includes 3,199 commits which together added, removed or changed nearly 100,000 lines of code.

“Libcudf++” has been a big investment, and a lot of work, but we are already reaping the benefits. The new cudf::column structure is hierarchical, meaning that columns can have child columns. This has enabled adding a new native strings column type which is supported throughout the library. In future releases, it will also enable a variety of important data types, including dictionary columns, fixed-point decimal columns, and columns of lists and structured data.

Just as important, new cudf::column and cudf::table classes provide a clean C++ abstraction, clearly distinguishing memory-owning columns from non-owning column_views of the data, and column_device_views which can be passed directly into CUDA kernel functions. These abstractions make building, testing, and maintaining new algorithms easier and more robust. Libcudf is now ready to grow for the future and to support the demands of multiple clients.
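From the Python side, the most visible payoff of the port is that strings are handled natively throughout the library. A minimal sketch, assuming a machine with a GPU and cuDF installed:

```python
import cudf

# Strings are now a native column type, backed by the hierarchical
# cudf::column (character data and offsets stored as child columns)
s = cudf.Series(["rapids", "goes", "to", "eleven"])

print(s.str.upper())  # GPU-accelerated string operations
print(s.str.len())
```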

Following libcudf’s lead, much of the cuDF Python development effort in the 0.11 release went into a refactoring that will materialize in 0.12. But there are still plenty of exciting new features, including:

  • an ORC writer,
  • inplace support for drop and reset_index,
  • DataFrame level typecasting,
  • row-wise reduction and scan operations,
  • groupby standard deviation,
  • DataFrame covariance and correlation,
  • tighter CuPy integration,
  • more MultiIndex support, and
  • numerous optimizations and bug fixes across the library.
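Several of these can be sketched in a few lines. A hedged example (assumes a GPU with cuDF 0.11; the output path is illustrative):

```python
import cudf

df = cudf.DataFrame({"g": [1, 1, 2, 2],
                     "x": [1.0, 2.0, 3.0, 4.0],
                     "y": [4.0, 3.0, 2.0, 1.0]})

df.reset_index(inplace=True)      # new inplace support
print(df.groupby("g").std())      # groupby standard deviation
print(df[["x", "y"]].corr())      # DataFrame correlation
df.to_orc("example.orc")          # the new ORC writer
```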

cuML

The cuML team is focusing on three major themes:

  1. scale-out algorithm support,
  2. improving tree-based models, and
  3. top user-requested new models.

To allow users to scale to larger data sizes, cuML has been expanding its support for scale-out, Multi-Node / Multi-GPU (MNMG) algorithms since cuML 0.9. You can read more about the earlier algorithms in our K-Means and Random Forests blog posts. Release 0.11 adds two new MNMG dimensionality reduction algorithms — Principal Components Analysis (PCA) and Truncated Singular Value Decomposition (tSVD). Stay tuned for further MNMG algorithms in upcoming releases, with K-nearest neighbors and linear regression models coming soon. All of these algorithms build on the shared cuML MNMG infrastructure, with its backbone of high-performance NCCL and UCX communications to allow direct GPU-to-GPU data transfers.
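The new MNMG estimators follow the familiar single-GPU, scikit-learn-style API, but accept Dask collections partitioned across workers. A sketch, assuming a machine with one or more GPUs and dask-cuda installed (names follow the cuml.dask namespace):

```python
import numpy as np
import cudf
import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from cuml.dask.decomposition import PCA

client = Client(LocalCUDACluster())   # one Dask worker per local GPU

gdf = cudf.DataFrame({c: np.random.rand(10_000) for c in "abcd"})
ddf = dask_cudf.from_cudf(gdf, npartitions=2)

pca = PCA(n_components=2)
reduced = pca.fit_transform(ddf)      # fit runs across all GPU workers
```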

Data scientists spend their days waiting for tree-based models, including random forests and gradient-boosted decision trees, to train. So cuML continues to improve its tree models with each release. The Forest Inference Library (FIL) now supports sparse tree storage, so even very deep trees can fit in GPU RAM. FIL can import trained models from scikit-learn’s RandomForest module as well as those from XGBoost, cuML, and LightGBM. cuML’s own Random Forest models can now be persisted with standard Python pickling, making them much easier to deploy.
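As a sketch (paths and data are illustrative; assumes cuML and XGBoost on a GPU machine), loading a model into FIL and pickling a cuML Random Forest look like this:

```python
import pickle
import numpy as np
from cuml import ForestInference
from cuml.ensemble import RandomForestClassifier

X = np.random.rand(1000, 4).astype(np.float32)
y = (X[:, 0] > 0.5).astype(np.int32)

# Load a trained, saved XGBoost model into FIL for GPU inference
fil = ForestInference.load("xgb_model.bst", model_type="xgboost")
preds = fil.predict(X)

# cuML Random Forests now round-trip through standard pickling
rf = RandomForestClassifier(n_estimators=100).fit(X, y)
with open("rf.pkl", "wb") as f:
    pickle.dump(rf, f)
```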

Inspired by user requests, cuML is expanding its time-series functionality with the first experimental release of ARIMA (Auto-Regressive Integrated Moving Average) models. ARIMA is one of the most popular time series modeling and forecasting techniques, and cuML can now use it to process thousands of time series in parallel. Upcoming releases will add support for seasonality in ARIMA as well as improving performance and accuracy.
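A sketch of the batched design (the API was still experimental in 0.11, so treat the exact signatures as illustrative): each column of the input is an independent series, and all of them are fit in parallel on the GPU.

```python
import numpy as np
from cuml.tsa.arima import ARIMA

# 1,000 independent series, 100 observations each (one series per column)
batch = np.random.rand(100, 1000)

model = ARIMA(batch, order=(1, 1, 1))   # ARIMA(p=1, d=1, q=1) for every series
model.fit()
forecasts = model.forecast(10)          # 10 steps ahead for all 1,000 series
```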

cuGraph

During the 0.11 release, cuGraph kicked off a major redesign and refactoring effort that will finish with the 0.12 release. The redesign includes a new Graph class hierarchy to capture the type of graph: directed (DiGraph) or undirected (Graph). That hierarchy will be expanded in 0.12 to include multigraphs (both directed and undirected) and bipartite graphs. The current set of graph algorithms is being refactored to use the new graph types and to throw helpful errors if an invalid graph type is used (e.g., weakly connected components only accepts a DiGraph). Error handling now throws exceptions rather than returning a status flag, and the text of each error message is being updated to give clearer guidance and help the user diagnose the issue. Finally, cuGraph 0.11 includes low-level cleanup that will make adding future analytics easier.
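A sketch of the new graph types (assumes a GPU with cuGraph installed; the edge-list loader name follows the cuGraph Python API):

```python
import cudf
import cugraph

edges = cudf.DataFrame({"src": [0, 1, 2, 3], "dst": [1, 2, 0, 0]})

G = cugraph.Graph()                       # undirected
G.from_cudf_edgelist(edges, source="src", destination="dst")

DG = cugraph.DiGraph()                    # directed
DG.from_cudf_edgelist(edges, source="src", destination="dst")

# Weakly connected components requires a DiGraph;
# passing an undirected Graph raises a helpful error
components = cugraph.weakly_connected_components(DG)
```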

RAPIDS Community Updates:

UCX

As RAPIDS continues to drive down computation costs, communication becomes a significant bottleneck. To address this, RAPIDS has been working with the UCX library for high-performance networking, using technologies like InfiniBand and NVIDIA NVLink, and connecting it to Python. This is critical for speedups across multiple GPUs, whether in one machine or across many machines in a cluster.

A beta version of UCX-Py is available via conda or RAPIDS Nightlies. Users can easily install UCX/UCX-Py with:

conda:

conda install -c rapidsai ucx-py=0.11 cudatoolkit=<CUDA version> ucx-proc=*=gpu ucx

Or nightly builds:

conda install -c rapidsai-nightly ucx-py cudatoolkit=<CUDA version> ucx-proc=*=gpu ucx

While UCX-Py is functional, there’s still plenty of benchmarking and performance optimization work to do. Below is a current benchmark comparing TCP vs. UCX-Py when computing a distributed join on random data across several GPUs using cuDF and Dask.

UCX-Py is operating between 14x and 42x faster than TCP. More features will be added in 2020 (along with fixes for any bugs that surface), and UCX-Py will eventually be upstreamed to the UCX core group.
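Once installed, pointing a Dask GPU cluster at UCX instead of TCP is a one-line change. A sketch, assuming dask-cuda is installed alongside UCX-Py (flag names may evolve while UCX-Py is in beta):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Workers communicate over UCX (NVLink / InfiniBand where available)
# instead of the default TCP transport
cluster = LocalCUDACluster(protocol="ucx")
client = Client(cluster)
```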

Dask

Release 0.11 includes Dask-cuDF improvements, with ORC file readers/writers and a switch from custom merge/join algorithms to the Dask core implementations for merges and joins. Dask-cuDF has also exposed dropna capabilities for groupby operations and enabled covariance/correlation calls. Parquet reading has improved as well: Dask core now supports intra-file parallelism using row groups, and this is exposed to Dask-cuDF. Dask has improved dashboarding and profiling tools, including new endpoints to measure bandwidth per data type and per worker. These dashboards are easily recorded and shared with colleagues via the new performance_report context manager.
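The new context manager profiles everything computed inside the with block and writes a standalone HTML report you can send to a colleague. A minimal sketch (assumes distributed 2.8+ with bokeh installed):

```python
import dask.array as da
from dask.distributed import Client, LocalCluster, performance_report

client = Client(LocalCluster(n_workers=2, threads_per_worker=1))

# Everything computed inside the block is captured in a shareable report
with performance_report(filename="dask-report.html"):
    x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))
    x.mean().compute()
```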

Dask Cloud Provider

A new addition to the Dask family is Dask Cloud Provider. This library gives users the ability to confidently create Dask clusters on various cloud provider services using familiar Dask APIs. The goal is for users to be able to create a Dask cluster on a cloud service from within their Python session without necessarily having to set up a compute platform beforehand.

The first release includes cluster managers for AWS Elastic Container Service (ECS) and AWS Fargate. You can get up and running with RAPIDS on a GPU enabled ECS cluster with just a few lines of Python.

from dask_cloudprovider import ECSCluster

cluster = ECSCluster(
    cluster_arn="arn:aws:ecs:<region>:<acctid>:cluster/<gpuclustername>",
    worker_gpu=1,
)

The roadmap includes the addition of more cloud provider services, each making use of a native cloud provider API.

For a more in-depth introduction to using the ECS and Fargate cluster managers, and to see an example of how this is being used at Capital One, see this recent blog post.

CLX and cyBERT

If you’re working in the cybersecurity domain, RAPIDS now has a repo that contains cybersecurity specific methods, example notebooks, and use cases. Cyber Log Accelerators (CLX, pronounced “clicks”) contains pre-built use cases that demonstrate RAPIDS functionality specifically built for security analysts and infosec data scientists, integration components with the SIEM, I/O and workflow wrappers, and accelerated primitive operations for IPv4 and DNS data types. Pre-built use cases include network mapping, alert analysis, and DGA detection. One of the most exciting components of CLX is cyBERT, an ongoing experiment to train and optimize transformer networks for the task of flexibly and robustly parsing logs of heterogeneous cybersecurity data. Using a trained BERT model for named entity recognition, cyBERT demonstrates that it is possible to extract high-quality parsed values from raw cybersecurity logs, even with unseen log formats and degraded log quality. For a more complete introduction, refer to the CLX and cyBERT blog posts.

Datashader

The RAPIDS viz team has been working closely with the fantastic Holoviz Datashader team to develop a GPU-accelerated version of their library. Prototyping has been done through our cuDatashader project, and now a native cuDF and GPU accelerated version has just been released with Datashader 0.9.0. For rendering 100 million points, it’s 40x faster than single-core Pandas and 4.5x faster than multi-core Dask.
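A sketch of the GPU path (assumes Datashader 0.9.0 and cuDF on a GPU machine): passing a cuDF DataFrame where a pandas one would normally go is all it takes.

```python
import numpy as np
import cudf
import datashader as ds

n = 1_000_000
df = cudf.DataFrame({"x": np.random.standard_normal(n),
                     "y": np.random.standard_normal(n)})

canvas = ds.Canvas(plot_width=600, plot_height=600)
agg = canvas.points(df, "x", "y")   # aggregation runs on the GPU
```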

NVDashboard

Monitoring GPU utilization, memory consumption, and machine resources is hard enough; it gets even harder when you also want to track whether traffic is flowing over NVLink or PCIe and keep tabs on your throughput. The RAPIDS team has made this much easier with NVDashboard. Built on PyNVML (the Python API for the NVIDIA Management Library, NVML), NVDashboard lets users graphically monitor system metrics, fully integrated into the notebook experience. It isn’t limited to RAPIDS workflows, either. Feel free to use it with any GPU library, including deep learning, in a Jupyter notebook.
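NVDashboard’s metrics come straight from PyNVML, and the same counters are available directly. A sketch (assumes an NVIDIA GPU with pynvml installed):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)   # first GPU

util = pynvml.nvmlDeviceGetUtilizationRates(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU utilization: {util.gpu}%")
print(f"Memory used: {mem.used / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB")

pynvml.nvmlShutdown()
```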

BlazingSQL

With the v0.11 RAPIDS release, the BlazingSQL (BSQL) team pushed a ton of new changes onto their latest stable release, including a brand new logo!

BSQL benefits from all the new features in cuDF and is also starting a refactoring effort towards the new libcudf C++ API which will also materialize in the 0.12 release. To reduce confusion, BlazingSQL is adopting RAPIDS version numbers, so the latest BSQL release is v0.11 as well (skipping v0.5–0.10). This latest version has lots of new features including a new architecture that improves stability and performance, new logging tools, the ability to integrate with Hive metastore, and tons of bug fixes and optimizations.

Fade to Black

As always, we want to thank all of you for using RAPIDS, and contributing to our ever-growing GPU accelerated ecosystem. Please check out our latest release deck or join in on the conversation on Slack or GitHub. The RAPIDS team is taking some needed time off for the holidays, but we’ll respond to your issues, feature requests, or questions promptly in the new year. The future will continue to get better, and 0.12 is already shaping up to be another impressive release. RAPIDS will have more multi-node multi-GPU algorithms, the cuDF/cuGraph refactor will be complete, and much more. See you in 2020, and look forward to another year of speed, scale, and efficiency…and fun.
