What is RAPIDS AI?

NVIDIA’s new GPU acceleration of Data Science promises to rock the world — but what is it? (Quick & Easy Overview)

Winston Robson
Jul 12 · 5 min read

Overview

  • Accelerated Data Science — What is RAPIDS?
  • Integration, Accuracy, Time — Why RAPIDS?
  • Open Source Community — Who is RAPIDS?
  • Get Started — How to RAPIDS?

Accelerated Data Science — What is RAPIDS?

RAPIDS is a “suite of open source software libraries and APIs” grouped together for the purpose of providing users the ability to “execute end-to-end data science and analytics pipelines entirely on GPUs.”

RAPIDS utilizes NVIDIA CUDA® primitives for low-level compute optimization, and exposes GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.

The suite also focuses on common data preparation tasks for data science, including a pandas-like dataframe API that integrates with a variety of machine learning algorithms to avoid “typical serialization costs.”

RAPIDS also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger dataset sizes.

RAPIDS flow
  • cuDF — pandas-like dataframe manipulation library
  • cuML — collection of ML libraries that will provide GPU versions of algorithms available in scikit-learn
  • cuGraph — NetworkX-like graph API

Through its Apache Arrow roots, RAPIDS provides native array_interface support, so data can be seamlessly pushed to deep learning frameworks that accept array_interface (e.g. PyTorch, Chainer) or work with DLPack and MXNet.

NVIDIA’s focus on Python allows RAPIDS to “play well with most data science visualization libraries.” The team is “working towards deeper integration with these libraries since a native GPU in-memory data format provides high-performance, high-FPS data visualization capabilities.”

Integration, Accuracy, Time — Why RAPIDS?

Apart from capitalizing on Python’s popularity, RAPIDS aims to capture the market by providing increased model accuracy through faster iteration and more frequent deployment.

According to NVIDIA, RAPIDS achieves speedup factors of 50x or more on typical end-to-end data science workflows.

Take the following image from NVIDIA’s Developer Blog, for instance:

Data science has traditionally demanded careful planning and a skilled understanding of the data at hand: CPU computing takes a while, so mistakes are costly.

RAPIDS provides a much swifter workflow through GPU-accelerated data processing.

For example, take the training of a model to perform home loan risk assessment using all 400GB of loan data for the years 2000 to 2016 in the Fannie Mae loan performance dataset.

geographical visualization of loan risk analysis example

The example loads the data into GPU memory using the RAPIDS CSV reader. The ETL in this example performs a number of operations including extracting months and years from datetime fields, joins of multiple columns between DataFrames, and groupby aggregations for feature engineering. The resulting feature data is then converted and used to train a gradient boosted decision tree model on the GPU using XGBoost.
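
The ETL steps described above can be sketched on a toy scale. This is a minimal illustration, not the code from NVIDIA's example: the column names and the four sample rows are invented, and the CSV loading and XGBoost training steps are left out. It assumes cuDF mirrors the pandas calls used here, so the same lines run on pandas when no GPU environment is present.

```python
# Use cuDF when it is installed; otherwise fall back to pandas,
# whose API cuDF mirrors for the operations used in this sketch.
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf

# Hypothetical loan records (invented for illustration only).
loans = xdf.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "state": ["CA", "CA", "TX", "TX"],
    "origination_date": ["2015-01-15", "2015-03-02", "2016-07-20", "2016-11-05"],
})
performance = xdf.DataFrame({
    "loan_id": [1, 2, 3, 4],
    "delinquent": [0, 1, 1, 1],
})

# Extract year and month from a datetime field.
loans["origination_date"] = xdf.to_datetime(loans["origination_date"])
loans["year"] = loans["origination_date"].dt.year
loans["month"] = loans["origination_date"].dt.month

# Join the two DataFrames on their shared key.
joined = loans.merge(performance, on="loan_id", how="inner")

# Groupby aggregation as a simple engineered feature:
# the share of delinquent loans per state.
rate = joined.groupby("state", sort=True)["delinquent"].mean()
print(rate)
```

In the workflow the article describes, a feature table like this would then be handed to XGBoost for training on the GPU.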

On one NVIDIA DGX-2 server with 16 Tesla V100 GPUs, this workflow runs 10x faster than on 100 AWS r4.2xlarge instances. Compared one-to-one, GPU vs. CPU, this comes out to a speedup of over 50x.
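
One way to see where the "over 50x" figure could come from is to normalize the quoted numbers per device. The source does not show its calculation, so the sketch below is only a plausible reading: it assumes the work divides evenly across devices on both sides, and the variable names are made up for illustration.

```python
# Figures quoted in the article.
gpus = 16               # Tesla V100 GPUs in one DGX-2 server
cpu_instances = 100     # AWS r4.2xlarge instances in the CPU cluster
cluster_speedup = 10    # DGX-2 vs. the whole CPU cluster

# Assumption: throughput scales linearly with device count on both sides.
per_device_speedup = cluster_speedup * cpu_instances / gpus
print(f"One GPU vs. one CPU instance: about {per_device_speedup:.1f}x")
```

Under that assumption, one GPU does the work of roughly 62 CPU instances, which is consistent with the "over 50x" claim.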

visualization of enhanced performance

Open Source Community — Who is RAPIDS?

RAPIDS is open sourced under the Apache 2.0 license and is intended to be improved and extended with help from the community. While significant time and effort have been invested in the platform to date, the project needs active contributors to help build its future.

  • RAPIDS + DASK: Dask is an open source project providing advanced parallelism for analytics that enables performance at scale. RAPIDS is actively contributing to Dask, which integrates with RAPIDS cuDF, XGBoost, and RAPIDS cuML for GPU-accelerated data analytics and machine learning.
  • RAPIDS Extended Notebooks: A collection of notebooks that help users understand what RAPIDS has to offer and learn why, how, and when including RAPIDS in a data science pipeline makes sense. The notebooks also contain community contributions of RAPIDS knowledge.
  • RAPIDS + XGBOOST: XGBoost is a well-known gradient boosted decision trees (GBDT) machine learning package used to tackle regression, classification, and ranking problems. The RAPIDS team works closely with the Distributed Machine Learning Common (DMLC) XGBoost organization to upstream code and ensure that all components of the GPU-accelerated analytics ecosystem work smoothly together.
  • RAPIDS + SPARK: The RAPIDS team is working with the community to build a distributed, open source XGBoost4J-Spark + RAPIDS package.

Get Started — How to RAPIDS?

The RAPIDS team and community have developed numerous ways for new users to try out the suite and grow into RAPIDS-savvy data scientists.

These include:

  • Jump right into a GPU-powered RAPIDS notebook with Google Colab
  • RAPIDS Release Selector (pictured below) to ease the installation and setup process for those wanting to dive deep into the suite
  • The RAPIDS Blog on Medium, featuring in-depth explanations of the latest advancements in RAPIDS (stories are not under Medium’s metered paywall)
  • Notebooks Extended — a community powered collection of Jupyter Notebooks covering various implementations of RAPIDS

Future Vision

A publication centered around high quality storytelling
