RAPIDS Accelerates Data Science End-to-End

Mike Beaumont
Published in RAPIDS AI · 3 min read · Oct 15, 2018

By Shashank Prasanna and Mark Harris | October 15, 2018

Today’s data science problems demand a dramatic increase in the scale of data as well as the computational power required to process it. Unfortunately, the end of Moore’s law means that handling large data sizes in today’s data science ecosystem requires scaling out to many CPU nodes, which brings its own problems of communication bottlenecks, energy, and cost (see figure 1).

A key part of data science is data exploration. Preparing a dataset for training a machine learning algorithm requires understanding the dataset, cleaning and manipulating data types and formats, filling in gaps in the data, and engineering features for the learning algorithm. These tasks are often grouped under the term Extract, Transform, Load (ETL). ETL is often an iterative, exploratory process. As datasets grow, the interactivity of this process suffers when running on CPUs.
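
For example, a typical preparation pass in today's CPU-based tools might look like the following pandas sketch; the file and column names here are made up for illustration:

    import numpy as np
    import pandas as pd

    # Load raw data and clean up types (pandas executes on the CPU)
    df = pd.read_csv("transactions.csv", parse_dates=["purchase_date"])
    df["amount"] = df["amount"].astype("float32")

    # Fill gaps in the data
    df["amount"] = df["amount"].fillna(df["amount"].median())

    # Engineer features for the learning algorithm
    df["purchase_month"] = df["purchase_date"].dt.month
    df["log_amount"] = np.log1p(df["amount"])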

To address the challenges of the modern data science pipeline, today at GTC Europe NVIDIA announced RAPIDS, a suite of open-source software libraries for executing end-to-end data science and analytics pipelines entirely on GPUs. RAPIDS aims to accelerate the entire data science pipeline including data loading, ETL, model training, and inference. This will enable more productive, interactive, and exploratory workflows.

RAPIDS is the result of contributions from the machine learning community and GPU Open Analytics Initiative (GOAI) partners. Established in 2017 with the goal of accelerating end-to-end analytics and data science pipelines on GPUs, GOAI created the GPU DataFrame based on Apache Arrow data structures. The GPU DataFrame enabled the integration of GPU-accelerated data processing and machine learning libraries without incurring typical serialization and deserialization penalties. RAPIDS builds on and extends the earlier GOAI work.
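
As a minimal sketch of that Arrow-based interchange (the column data below are invented), a cuDF GPU DataFrame can be built from and exported to Apache Arrow tables, the format the GPU DataFrame is based on:

    import pyarrow as pa
    import cudf

    # An Arrow table, as might be produced by another library in the pipeline
    table = pa.table({"user_id": [1, 2, 3], "score": [0.1, 0.7, 0.3]})

    # Move it into a GPU DataFrame, transform it, and hand it back as Arrow
    gdf = cudf.DataFrame.from_arrow(table)
    gdf["score"] = gdf["score"] * 100.0
    result = gdf.to_arrow()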

Boosting Data Science Performance with RAPIDS

RAPIDS achieves speedup factors of 50x or more on typical end-to-end data science workflows. RAPIDS uses NVIDIA CUDA for high-performance GPU execution, exposing GPU parallelism and high memory bandwidth through user-friendly Python interfaces. RAPIDS focuses on common data preparation tasks for analytics and data science, offering a powerful and familiar DataFrame API. This API integrates with a variety of machine learning algorithms without paying typical serialization costs, enabling acceleration for end-to-end pipelines. RAPIDS also includes support for multi-node, multi-GPU deployments, enabling scaling up and out to much larger datasets.
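
As a rough sketch of what this looks like in code, the cuDF DataFrame API mirrors pandas while executing on the GPU, and the resulting DataFrame can be handed straight to a GPU machine learning library such as cuML; the file and column names below are invented for illustration:

    import cudf
    from cuml.cluster import KMeans

    # ETL on the GPU with a pandas-like API
    gdf = cudf.read_csv("transactions.csv")
    gdf = gdf.dropna(subset=["amount", "num_items"])
    gdf["avg_item_price"] = gdf["amount"] / gdf["num_items"]
    per_user = gdf.groupby("user_id").agg({"amount": "sum",
                                           "avg_item_price": "mean"})

    # The GPU DataFrame feeds directly into a GPU algorithm;
    # the data stays in device memory along the way
    kmeans = KMeans(n_clusters=8)
    kmeans.fit(per_user)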

The RAPIDS container includes a notebook and code that demonstrates a typical end-to-end ETL and ML workflow. The example trains a model to perform home loan risk assessment using all of the loan data for the years 2000 to 2016 in the Fannie Mae loan performance dataset, consisting of roughly 400GB of data in memory.
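
The notebook is the authoritative reference; the sketch below only illustrates the general shape of such a workflow, with invented file names, a drastically reduced feature set, and a simplified label definition. Recent GPU-enabled builds of XGBoost accept cuDF DataFrames directly; older builds may require an explicit conversion step.

    import cudf
    import xgboost as xgb

    # Simplified stand-ins for the loan performance and acquisition tables
    perf = cudf.read_csv("performance.csv")   # one row per loan per month
    acq = cudf.read_csv("acquisition.csv")    # one row per loan, numeric features

    # ETL: label loans that ever went 90+ days delinquent, then join features
    perf["delinquent_90"] = (perf["delinquency_status"] >= 3).astype("int8")
    labels = perf.groupby("loan_id").agg({"delinquent_90": "max"}).reset_index()
    data = acq.merge(labels, on="loan_id", how="inner")

    # Train a gradient boosted model on the GPU
    y = data["delinquent_90"]
    X = data.drop(columns=["delinquent_90", "loan_id"])
    dtrain = xgb.DMatrix(X, label=y)
    params = {"tree_method": "gpu_hist", "objective": "binary:logistic"}
    model = xgb.train(params, dtrain, num_boost_round=100)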

For a walkthrough of how to download the RAPIDS container, run it, and access the mortgage risk analysis workflow notebook, visit the original post on the NVIDIA Developer Blog.

RAPIDS is now available as a container image on NVIDIA GPU Cloud (NGC) and Docker Hub for use on-premises or on public cloud services such as AWS, Azure, and GCP. The RAPIDS source code is also available on GitHub. Visit the RAPIDS site for more information.

Originally published at devblogs.nvidia.com on October 15, 2018.
