RAPIDS 0.10: Things We Love Inspired By Decades of Data Science

Josh Patterson
RAPIDS AI
Oct 28, 2019


During this release, RAPIDS celebrated its first anniversary, and what a year it’s been. The RAPIDS team can’t thank the community enough for all the love and support the project has received. RAPIDS cuDF crossed 2,000 stars, cuML and cuGraph combined now exceed 1,000 stars, and RAPIDS received its first BOSSIE award. A huge thank you to everyone. The RAPIDS team will continue to push accelerated end-to-end data science to new heights.

While the RAPIDS team was discussing the 0.10 release, the team took a moment to reflect on an earlier Wes McKinney blog post, “Apache Arrow and the ‘10 Things I Hate About pandas.’”

A quick history lesson (if you’ve been following data science for the last 15 years and know the history of all the popular libraries, skip ahead) — a decade ago, many elements of what makes up big data as it is known today emerged. In one part of the data universe, the Hadoop ecosystem was born. Hadoop, Hive, Cassandra, Mahout, and more were taking off.

In another, the PyData stack as data scientists know it was emerging. NetworkX (2005), NumPy (2006), Scikit-Learn (2007), and Pandas (2008) ushered in a wave of usability, while Hadoop, Hive, Cassandra, Flume, Pig, and Spark scaled data science to unprecedented levels. Together they flooded the data science ecosystem with new libraries, vendors, and nearly countless permutations of ways to build pipelines to solve data science problems.

With all the excitement from new tools and workflows, few stepped back to think about how all these libraries and frameworks would work together efficiently prior to Apache Arrow. Thus, most data scientists and engineers spent the bulk of their compute time serializing and deserializing data between libraries (lots of copies and converts).

RAPIDS was inspired by what people loved from numerous libraries, communities, and frameworks, as well as the pain people endured when using those tools at scale. Those emotions, both good and bad, led the RAPIDS ecosystem to address nearly all of the 10 (really 11) things Wes hated about Pandas and more.

The “10 Things I Hate About Pandas” List:

  1. Internals too far from “the metal”
  2. No support for memory-mapped datasets
  3. Poor performance in database and file ingest / export
  4. Warty missing data support
  5. Lack of transparency into memory use, RAM management
  6. Weak support for categorical data
  7. Complex groupby operations awkward and slow
  8. Appending data to a DataFrame tedious and very costly
  9. Limited, non-extensible type metadata
  10. Eager evaluation model, no query planning
  11. “Slow”, limited multicore algorithms for large datasets

One day someone may address #9.

RAPIDS doesn’t solely address these issues; there’s a huge emphasis on “ecosystem.” RAPIDS wouldn’t be possible without the accelerated data science ecosystem it helps bridge. First and foremost, RAPIDS is built on Apache Arrow, a cross-language development platform for in-memory data. Without the Apache project and its contributors, RAPIDS would have been much more difficult to build.

Next, let’s not forget Anaconda, Peter Wang, and Travis Oliphant (who brought us numerous PyData libraries that kick-started it all), and all they’ve done to encourage and emphasize performance in the PyData ecosystem. Numba (2012) gives the Python ecosystem a JIT compiler that can also target GPUs, which RAPIDS uses heavily in all of its libraries. Being able to arbitrarily extend functionality and write User Defined Functions (UDFs) in pure Python gives the Python ecosystem advantages that many other languages lack. There’s also Dask (2014), a Python-native scheduler used across the Python ecosystem, with ties into almost every resource manager, including Slurm, Kubernetes, and YARN.

GoAI (2017) brought together many pioneers in the GPU analytics space to prototype the foundations of RAPIDS, as well as to set standards for how GPU libraries communicate and interoperate with each other. Finally, speaking of interoperability, there are the numerous CUDA Python array and deep learning libraries (PyTorch, MXNet, Chainer, CuPy, and soon PaddlePaddle) that have adopted DLPack and the `__cuda_array_interface__` protocol (and hopefully many more will join in). The combined force of all these connected libraries allows new libraries like cuSpatial, pyBlazing, cuXfilter, and GFD (more below) to be created quickly, and this trend will continue.

Personally, this is what I love the most about RAPIDS — democratizing GPUs for the Python Ecosystem to allow others to build performant libraries with diverse functionality faster than ever. In the spirit of Top 10 lists, I also asked each of the RAPIDS library leads to say what they love about RAPIDS (you’ll notice they’ve spent too much time with each other because many of them say the same thing).

The RAPIDS Library Leads Top 10 List:

Keith Kraus — @keithjkraus

  • Speed — Core Implementations are “close to the metal”
  • GPU Ecosystem Interoperability
  • PyData Ecosystem Interoperability
  • Strong memory layout semantics
  • Low-level access and control (I can get raw pointers to my data if I want!)
  • Open Source
  • Deep Learning Framework Integrations
  • Following known PyData APIs
  • SQL via BlazingSQL

John Zedlewski — @zstats

  • I remember spending hours every day waiting for machine learning jobs to complete in batch mode on a large cluster, so it’s still always a pleasure to see much larger jobs finish in seconds on a desktop!

Bartley Richardson — @BartleyR

  • For data scientists specializing in an area (such as cybersecurity and information security), interoperability between other Python tools is essential. It’s fantastic to benefit from accelerated data analysis (often TB+ datasets in cybersecurity) all while maintaining interoperability with domain-specific downstream Python packages and APIs upon which security analysts rely.

Mark Harris — @harrism

  • Our amazing team. The RAPIDS team is a diverse and distributed team of enthusiastic, can-do people. Even though we are spread out all over the world, many of us working from home, our team communicates openly and cooperates to build new features and solve problems with impressive speed. Everyone steps up to help, often pushing themselves outside their areas of experience in order to learn new skills. And we have a great time doing it.

Brad Rees — @BradReesWork

  • Seamless transition between ETL, data engineering, Machine Learning, and Graph analytics. RAPIDS allows the data scientist to think about analysis and not about how to move data between tools

Matt Rocklin — @mrocklin

  • I like that RAPIDS adheres to standard Python APIs, making it easy to integrate with the existing Python ecosystem
  • I like that RAPIDS contributes to many other Python packages, rather than trying to own everything itself
  • I like that RAPIDS makes it easy for users to quickly experiment with different kinds of hardware, without having to learn new systems.
  • I like that RAPIDS enables accelerated performance for new scientific domains, other than just deep learning

RAPIDS Core Library Updates:

cuDF

cuDF has grown rapidly (pun intended!) over the past year, with each release introducing exciting new functionality, optimizations, and bug fixes, and 0.10 is no exception. New functionality in cuDF 0.10 includes `groupby.quantile()`, `Series.isin()`, reading from remote/cloud file systems (e.g., HDFS, GCS, S3), Series and DataFrame `isna()`, grouping by Series of arbitrary length, Series covariance and Pearson correlation, and, last but not least, returning CuPy arrays from the DataFrame/Series `.values` property. Additionally, there are optimizations in the `apply` UDF function APIs, as well as in the gather and scatter methods via `.iloc` accessors.
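Because cuDF deliberately tracks the pandas API, the new 0.10 calls have the same shape as their pandas counterparts. Here is a CPU-runnable sketch using pandas itself (the intent is that the same calls work on a cuDF DataFrame after swapping `import pandas` for `import cudf`):

```python
import pandas as pd  # cuDF mirrors this API; on a GPU you would import cudf instead

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 2.0, None, 10.0],
})

# groupby.quantile(): per-group medians (new in cuDF 0.10)
medians = df.groupby("group")["value"].quantile(0.5)

# Series.isin(): membership test (new in cuDF 0.10)
mask = df["group"].isin(["a"])

# isna(): null detection on a Series (added to cuDF in 0.10)
nulls = df["value"].isna()
```

`medians` is indexed by group (`a` → 2.0, `b` → 6.0 after skipping the null), `mask` flags the two `"a"` rows, and `nulls` flags the single missing value.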

In addition to delivering all of the great features, optimization, and bug fixes, the 0.10 release of cuDF spent significant effort building towards the future. This release merged the cuStrings repo into cuDF and set the stage to merge the two codebases to make string functionality more tightly integrated into cuDF to deliver more acceleration and functionality. Additionally, RAPIDS added a cuStreamz metapackage to streamline GPU-accelerated stream processing using cuDF and the Streamz library. cuDF continued to improve its Pandas API compatibility and Dask DataFrame interoperability to make using cuDF as seamless as possible for our users.

Under the hood, libcudf is undergoing a major internal redesign. 0.10 introduces modern new cudf::column and cudf::table classes that provide much more robust control of memory ownership and lay the groundwork for supporting variable-sized data types in the future, including columns of strings, arrays, and structures. This work will continue in the next release cycle as support for the new classes is built out across the libcudf API. In addition, libcudf 0.10 adds a number of new APIs and algorithms, including sort-based groupby with null data support, groupby quantiles and median, cudf::unique_count, cudf::repeat, cudf::scatter_to_tables, and more. As always, this release includes a number of other improvements and fixes.

RMM, the RAPIDS Memory Manager library, is also undergoing some restructuring. This includes a new architecture based on memory resources, which are mostly compatible with C++17 std::pmr::memory_resource. This makes it easier for the library to add new types of memory allocators following a common interface. 0.10 also replaced the CFFI Python bindings with Cython, which enables propagating C++ exceptions to Python exceptions, allowing more useful errors to be passed up to the application. Exception support in RMM will continue to improve in the next release.

cuML and XGBoost

When the RAPIDS team started contributing to GPU-accelerated XGBoost, one of the most popular libraries for gradient-boosted decision trees, the contributions came with a pledge to move all of the improvements upstream to the master repository rather than creating a long-running fork. The RAPIDS team is happy to announce that release 0.10 ships with an XGBoost conda package based entirely on XGBoost’s master branch. This is a snapshot version that includes many features from the upcoming 1.0.0 XGBoost release. It already incorporates transparent support for loading data into XGBoost from cuDF DataFrames, and it offers the option of a brand-new, cleaner Dask API. (See the XGBoost repo for details.) The older Dask-XGBoost API is now deprecated but will continue to work well with RAPIDS 0.10. To simplify downloads, XGBoost’s conda package (rapids-xgboost) is now included in the main rapidsai conda channel and installed automatically if you install the RAPIDS conda meta-package. (See the getting started page for more details.)
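As a sketch of what that one-step install looked like at the time (the channel list and version pins follow the 0.10-era getting started instructions and may have changed for later releases):

```shell
# Installing the RAPIDS 0.10 meta-package pulls in rapids-xgboost
# automatically from the rapidsai channel. Pins shown are illustrative
# of the 0.10 era; check the getting started page for current values.
conda install -c rapidsai -c nvidia -c conda-forge \
    rapids=0.10 python=3.7 cudatoolkit=10.0
```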

[Benchmark chart: Intel Xeon E5-2698 v4 CPUs (20 cores) vs. an NVIDIA V100]

The RAPIDS machine learning library, cuML, expanded to support several more popular machine learning algorithms. cuML now includes a support vector machine classifier (SVC) model that is up to 300 times faster than the equivalent CPU version. An accelerated TSNE model builds upon the GPU acceleration work from CannyLabs and provides one of the most popular approaches to high-performance dimensionality reduction, running 1000x faster than a CPU-based implementation. Our random forest implementation continues to improve with each release, and it now includes a layer-wise algorithm that trains up to 30x faster than scikit-learn’s random forest. Stay tuned for blog posts in the next few weeks that cover both TSNE and random forests in much more detail!
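cuML follows the scikit-learn estimator API, so the call pattern for the new SVC is the familiar fit/predict shape. A minimal sketch using scikit-learn itself (assumed installed; per the release notes, the GPU version exposes the same interface as `cuml.svm.SVC`):

```python
import numpy as np
from sklearn.svm import SVC  # cuML's GPU SVC follows this same estimator API

# Two trivially separable clusters
X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
y = np.array([0, 0, 1, 1])

# Fit a linear-kernel SVC and classify two new points
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
pred = clf.predict([[0.2, 0.0], [4.8, 5.2]])
```

On GPU-sized data, the release notes report this same workflow running up to 300x faster with `cuml.svm.SVC`.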

Dask

Dask standardized deployment across HPC and Kubernetes systems, including adding support to run the scheduler separately from the client, allowing users to launch computations comfortably on remote clusters from a local notebook. Dask also added AWS ECS native support for institutions that use the cloud but may not be able to adopt Kubernetes.

Development continues on UCX for high-performance communication, both for GPUs within a single node leveraging NVLINK and for multiple nodes within a cluster leveraging InfiniBand. The RAPIDS team has rewritten the ucx-py bindings to be cleaner, and resolved several issues around shared memory management across Python-GPU libraries like Numba, RAPIDS, and UCX.

cuGraph

cuGraph has taken the next step along its path of integrating leading graph frameworks under a single, easy-to-use interface. A copy of Hornet from Georgia Tech was brought into RAPIDS a few months ago and has been refactored and renamed cuHornet. The name change indicates that the source code has diverged from the Georgia Tech baseline and that its API and data structures now match RAPIDS cuGraph. The inclusion of cuHornet provides a frontier-based programming model, dynamic data structures, and a list of existing analytics. The first two cuHornet algorithms being made available are Katz centrality and K-Cores, along with the core number function.
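For context on what the core number function computes, here is a minimal pure-Python sketch of the classic peeling algorithm: a vertex's core number is the largest k such that it belongs to a subgraph where every vertex has degree at least k. (This is an illustration of the concept, not cuHornet's GPU implementation.)

```python
from collections import defaultdict

def core_numbers(edges):
    """Core number of each vertex via repeated peeling of
    minimum-degree vertices (k-core decomposition)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degrees = {v: len(nbrs) for v, nbrs in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Peel every vertex whose remaining degree is <= k; peeling may
        # drop neighbors to <= k, so they get peeled on the next pass.
        peel = [v for v in remaining if degrees[v] <= k]
        if not peel:
            k += 1
            continue
        for v in peel:
            core[v] = k
            remaining.discard(v)
            for nbr in adj[v]:
                if nbr in remaining:
                    degrees[nbr] -= 1
    return core
```

For a triangle 1-2-3 with a pendant vertex 4 attached to 3, the triangle vertices get core number 2 and the pendant gets 1.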

cuSpatial

RAPIDS 0.10 also includes the initial release of cuSpatial. cuSpatial is an efficient C++ library for geospatial analytics, accelerated on GPUs using CUDA and cuDF, and it includes Python bindings for use by data scientists. cuSpatial provides a more than 50x speedup over existing algorithm implementations, with many more in development. The initial release of cuSpatial includes GPU-accelerated algorithms for trajectory clustering, distance and speed computation, Hausdorff and haversine distance, spatial window projection, point in polygon, and window intersection. For future releases, there are plans to add shapefile support and quadtree indexing.
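As a reference for one of those primitives, the haversine formula computes great-circle distance between two latitude/longitude points on a sphere. A minimal scalar Python sketch (cuSpatial's value is evaluating this in parallel over millions of point pairs on the GPU):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points,
    using Earth's mean radius (6371 km)."""
    R = 6371.0
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * R * asin(sqrt(a))

# New York City to Los Angeles: roughly 3,940 km great-circle distance
d = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
```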

cuDataShader

Parallel to the RAPIDS release, RAPIDS is excited to announce cuDataShader, a GPU-accelerated, cuDF-backed port of the amazing Datashader. Its fast, large-scale data visualization capabilities and Python focus make Datashader a perfect fit for GPU-powered viz. Our first pass achieved an approximately 50x speedup, and based on these results, GPU capability will be rolled into Datashader itself around its next release! So stay tuned. If you’re eager to try it out, the easiest way is to use it in our other viz library, cuXfilter.

cuXfilter

Used to power our mortgage visualization demo (now located here), cuXfilter has been completely refactored to make it much simpler to install and to create cross-filtered dashboards, all from a Python notebook! As there are many fantastic visualization libraries available for the web, our general principle is not to create our own chart library, but to enhance others with faster acceleration, larger datasets, and better dev UX. The goal is to take the headache out of interconnecting multiple charts to a GPU backend, so you can get to visually exploring data faster. You can learn more in our installation guide and our 10 minutes to cuXfilter docs. Work is underway to improve cuXfilter, so try it and give feedback, or better yet, help add more charts.

RAPIDS Community

It’s users who advance the ecosystem more than anything else. BlazingSQL just released V0.4.5 with faster shuffles on GPUs. Ritchie Ng of ensemblecap.ai released a GPU implementation of fractional differencing (GFD) leveraging RAPIDS cuDF that achieves more than 100x speed-up over a CPU implementation. Also congrats to BlazingSQL for pyBlazing reaching 1,000 stars!

In the coming months, the RAPIDS engineering team will be presenting and giving tutorials at local meetups, conferences, and hackathons across the globe. Come join us at GTC DC, PyData NYC, and PyData LA. The RAPIDS team wants to make it easy for you to present RAPIDS content at your meetups or local events. Find our latest presentation slides here and any logos or RAPIDS branding materials here.

Join the RAPIDS slack, and then join the #RAPIDS-meetup channel and let the team know how they can help support. As always, continue filing issues for feature requests or functionality that doesn’t work as expected, and please star your favorite repo.

