RAPIDS can now be accessed on Databricks Unified Analytics Platform

Published in RAPIDS AI, Mar 19, 2019 (3 min read)

By: Ikroop Dhillon, Karthikeyan Rajendran, and Taurean Dyer

NVIDIA RAPIDS can now be accessed on Databricks Unified Analytics Platform, making it easy for data scientists to leverage RAPIDS for their end-to-end data science workflows. Databricks provides a cloud-based platform designed to make big data and machine learning simple. You can now install RAPIDS packages on Databricks Runtime for Machine Learning in just a few minutes and run on NVIDIA GPUs.

In this technical blog, we:

  1. Show how to install RAPIDS on a Databricks GPU cluster.
  2. Walk through a Python notebook that reads Avro files with Spark, passes the data via Pandas to a RAPIDS cuDF dataframe, and then runs a principal component analysis (PCA) algorithm using both scikit-learn and the RAPIDS cuML library.

At the end of the PCA demo notebook, we compare the results of PCA executed on CPU using scikit-learn and on GPU using the cuML library. The sample dataset we used in our notebook had 800,000 rows and 542 columns. Our results demonstrated a 25X speedup when executing PCA on GPU using the RAPIDS cuML library vs. executing on CPU using scikit-learn.

On a Databricks cluster you can combine the power of Apache Spark for data engineering with RAPIDS for machine learning. An interesting scenario is when you use Spark’s Dataset API to perform feature engineering on large datasets read from distributed filesystems in many data formats. This often leads to a much smaller dataset that can be collected in the memory of the driver machine. At this point you can use cuDF and cuML to prepare features and perform GPU-accelerated training tasks.
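The hand-off described above can be sketched in a few lines. This is a minimal sketch, not the demo notebook's code: the function names are our own, and the `"avro"` format name assumes the built-in Avro reader that ships with Spark 2.4 and later. The cuDF import is deferred so that the Spark half works even where RAPIDS is not installed.

```python
def spark_avro_to_pandas(spark, path):
    """Read Avro with Spark, collect it to the driver as pandas, drop null rows."""
    sdf = spark.read.format("avro").load(path)  # distributed Spark dataframe
    pdf = sdf.toPandas()                        # collected into driver memory
    return pdf.dropna()                         # remove rows that contain nulls


def pandas_to_cudf(pdf):
    """Copy a pandas dataframe into GPU memory as a cuDF dataframe."""
    import cudf  # deferred: only importable on a RAPIDS-enabled GPU cluster

    return cudf.DataFrame.from_pandas(pdf)
```

In a Databricks notebook, where `spark` is already defined, you would chain the two calls, for example `pandas_to_cudf(spark_avro_to_pandas(spark, some_avro_path))`, with `some_avro_path` standing in for wherever your data lives. Keep in mind that `toPandas()` only makes sense once the Spark-side feature engineering has shrunk the data enough to fit on the driver.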

Installing RAPIDS on Databricks Runtime

As a prerequisite, you need a Databricks account on Azure or AWS. You need to make sure that your Azure or AWS account has GPU support enabled.

1. Launch a cluster on Databricks (Azure|AWS) with the following configuration:

  • Databricks Runtime Version: 5.2 ML GPU or above
  • Python Version: 3
  • Worker Type: NCv2 (for Azure) or p3.2xlarge (for AWS)
  • Driver Type: NCv2 (for Azure) or p3.2xlarge (for AWS)

Note: RAPIDS requires GPUs based on the NVIDIA Pascal architecture or newer. That means older K-series (Kepler, e.g., K80) and M-series (Maxwell, e.g., M60) GPUs will not work.

2. Download the RAPIDS_Init_Script html file and import it into your Databricks workspace (Azure|AWS). Attach the imported notebook to the cluster that you just created and click the Run All button to create the initialization script.

3. Configure your existing cluster to use the newly created initialization script (Azure|AWS). The script is located at /databricks/RAPIDS/rapids-init-install.sh.

4. Restart your cluster.
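If you prefer to script the setup instead of clicking through the UI, the steps above map roughly onto a cluster specification like the one below. Treat it as a sketch under several assumptions that are ours, not the original instructions: the field names follow the Databricks Clusters REST API, the node type IDs (`Standard_NC6s_v2` for an NCv2 instance, plus `p2.xlarge` as a K80 counter-example) and the `dbfs:` prefix on the script path are guesses at how your workspace is laid out, and the runtime version string is passed in because its exact key depends on your workspace. The compute capability check encodes the Pascal requirement from the note above.

```python
import json

# Pascal corresponds to CUDA compute capability 6.0.
MIN_COMPUTE_CAPABILITY = (6, 0)

GPU_COMPUTE_CAPABILITY = {
    "K80": (3, 7),   # Kepler
    "M60": (5, 2),   # Maxwell
    "P100": (6, 0),  # Pascal
    "V100": (7, 0),  # Volta
}

# Assumed node type IDs and the GPU model each one carries.
NODE_TYPE_GPU = {
    "Standard_NC6s_v2": "P100",  # Azure NCv2
    "p3.2xlarge": "V100",        # AWS
    "p2.xlarge": "K80",          # AWS, too old for RAPIDS
}


def supports_rapids(gpu_model):
    """Return True if the GPU is Pascal or newer."""
    return GPU_COMPUTE_CAPABILITY[gpu_model] >= MIN_COMPUTE_CAPABILITY


def build_cluster_spec(name, spark_version, node_type_id, num_workers=1):
    """Build a cluster spec dict, rejecting node types with pre-Pascal GPUs."""
    gpu = NODE_TYPE_GPU[node_type_id]
    if not supports_rapids(gpu):
        raise ValueError(f"{node_type_id} uses a {gpu}, which predates Pascal")
    return {
        "cluster_name": name,
        "spark_version": spark_version,       # a 5.2 ML GPU runtime or above
        "node_type_id": node_type_id,         # worker type
        "driver_node_type_id": node_type_id,  # driver type
        "num_workers": num_workers,
        # Selects Python 3 on runtimes where the Python version is configurable.
        "spark_env_vars": {"PYSPARK_PYTHON": "/databricks/python3/bin/python3"},
        # Assumes the init script path from step 3 lives on DBFS.
        "init_scripts": [
            {"dbfs": {"destination": "dbfs:/databricks/RAPIDS/rapids-init-install.sh"}}
        ],
    }


if __name__ == "__main__":
    # The version key here is an assumption; check the keys your workspace lists.
    spec = build_cluster_spec("rapids-demo", "5.2.x-gpu-ml-scala2.11", "p3.2xlarge")
    print(json.dumps(spec, indent=2))
```

The resulting JSON is the kind of payload you would send when creating or editing a cluster through the API. Validating the GPU up front saves you from launching a cluster on which the RAPIDS install cannot work.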

Databricks Cluster Manager

Running PCA Demo Notebook using RAPIDS on Databricks

After installing the initialization script and restarting your cluster, download the spark_rapids_pca_demo html file and import it into your Databricks workspace. Attach the notebook to your GPU cluster configured with RAPIDS. The pca_demo notebook has seven main sections:

  1. Downloads the sample data set (.avro file) and places a copy in local Databricks file storage
  2. Reads the Avro file using Spark (this converts the Avro data into a Spark dataframe)
  3. Converts the Spark dataframe into a pandas dataframe, then checks for and removes any null values present in the dataset
  4. Provides a brief overview of the key parameters when applying PCA and defines the parameters for the demo
  5. Runs PCA on CPU using the pandas dataframe and scikit-learn
  6. Converts the pandas dataframe to a cuDF dataframe and runs PCA using the cuML library on GPU
  7. Compares the results of PCA using scikit-learn on CPU and RAPIDS cuML on GPU
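For readers who want a feel for sections 5 through 7 before opening the notebook, here is a rough sketch of the CPU vs. GPU comparison. It is not the notebook's code: the helper names and the default of 10 components are our own placeholders, and the scikit-learn and RAPIDS imports are deferred so the timing helpers can be used on their own.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(cpu_seconds, gpu_seconds):
    """How many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


def pca_cpu(pdf, n_components):
    """Run PCA on a pandas dataframe with scikit-learn."""
    from sklearn.decomposition import PCA

    return PCA(n_components=n_components).fit_transform(pdf)


def pca_gpu(pdf, n_components):
    """Copy the pandas dataframe to the GPU and run PCA with cuML."""
    import cudf
    from cuml import PCA

    gdf = cudf.DataFrame.from_pandas(pdf)
    return PCA(n_components=n_components).fit_transform(gdf)


def compare(pdf, n_components=10):
    """Time both implementations and return (cpu_s, gpu_s, speedup)."""
    _, cpu_s = timed(pca_cpu, pdf, n_components)
    _, gpu_s = timed(pca_gpu, pdf, n_components)
    return cpu_s, gpu_s, speedup(cpu_s, gpu_s)
```

Note that `pca_gpu` as written includes the pandas-to-cuDF copy in its timing. If you only want to measure the PCA itself, convert the dataframe first and time the `fit_transform` call alone.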

You can also import several additional RAPIDS example notebooks for other machine learning algorithms from the project GitHub repo into your workspace and run them.

Next Steps

It is easy to set up RAPIDS on Databricks Runtime for ML and start experimenting with sample notebooks. If you do not have a Databricks account you can try it for free at databricks.com/try.

Learn more about the open-source RAPIDS project at rapids.ai. Learn more about Databricks Unified Analytics Platform.
