Scale out RAPIDS on Google Cloud Dataproc

Raghav Mani
Jun 12
Spark is a robust, scalable data analytics engine. It came out of the Hadoop ecosystem and supports multiple languages, with Scala & Java being the most commonly used.

Install Google Cloud SDK

Step #1: If you don’t have a GCP account, you can sign up for one here.
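Once you have an account, the SDK install itself boils down to two commands. A minimal sketch, shown here as printed strings so nothing runs by accident (the installer URL is Google's documented one-liner):

```shell
# Google Cloud SDK install, sketched as printed strings for review.
# Run the printed commands in your own shell.
SDK_INSTALL='curl https://sdk.cloud.google.com | bash'  # documented one-line installer
SDK_SETUP='gcloud init'                                 # authenticate & pick a default project
echo "$SDK_INSTALL"
echo "$SDK_SETUP"
```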

Create a Dataproc Cluster

Now that you’re set up with Google Cloud SDK, go ahead and run the following command to create and initialize a cluster:

CLUSTER_NAME=YOUR_CLUSTER_NAME
ZONE=YOUR_COMPUTE_ZONE
STORAGE_BUCKET=YOUR_STORAGE_BUCKET
NUM_GPUS_IN_MASTER=4
NUM_GPUS_IN_WORKER=4
NUM_WORKERS=2
DATAPROC_BUCKET=dataproc-initialization-actions

gcloud beta dataproc clusters create $CLUSTER_NAME \
  --zone $ZONE \
  --master-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS_IN_MASTER \
  --master-machine-type n1-standard-32 \
  --worker-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS_IN_WORKER \
  --worker-machine-type n1-standard-32 \
  --num-workers $NUM_WORKERS \
  --bucket $STORAGE_BUCKET \
  --metadata "JUPYTER_PORT=8888" \
  --initialization-actions gs://$DATAPROC_BUCKET/rapids/ \
  --optional-components=ANACONDA,JUPYTER \
  --enable-component-gateway
You can adjust these parameters to fit your needs: for example, the number of GPUs per node using NUM_GPUS_IN_MASTER and NUM_GPUS_IN_WORKER, and the number of workers using NUM_WORKERS. The RAPIDS initialization action will:
  • Install RAPIDS conda packages
  • Start dask-scheduler and workers
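As a concrete illustration, the placeholders might be filled in like this. Every value here is hypothetical, so substitute your own:

```shell
# Hypothetical example values for the cluster-create command above.
CLUSTER_NAME=rapids-demo           # any cluster name you like
ZONE=us-central1-b                 # pick a zone that offers T4 GPUs
STORAGE_BUCKET=my-rapids-bucket    # an existing GCS bucket in your project
NUM_GPUS_IN_MASTER=4
NUM_GPUS_IN_WORKER=4
NUM_WORKERS=2
echo "cluster: $CLUSTER_NAME, zone: $ZONE, workers: $NUM_WORKERS"
# prints: cluster: rapids-demo, zone: us-central1-b, workers: 2
```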

Connect to Cluster Using Jupyter

Go to the Clusters page on the GCP console to see the cluster that you’ve successfully launched. Since the cluster was created with the component gateway enabled, you should be able to open Jupyter from the Web Interfaces tab on the cluster’s detail page.

Explore Example Notebooks

Check out this notebook to see how you can scale out RAPIDS in a multi-node, multi-GPU environment. It is based on the NYC taxi dataset, and it shows how you can use dask_cudf to distribute dataframe manipulations across GPUs using Pandas-like APIs. It then builds an XGBoost model using dask_xgboost, the distributed version of GPU-accelerated XGBoost.

Monitor Progress using Dask Dashboard

RAPIDS uses Dask for distributing & coordinating work across the different nodes in your cluster. Dask has a suite of powerful monitoring tools that can be accessed from a browser. Follow the instructions below to get access to this Dask dashboard:

Step #1: Get IP address of the cluster’s master node

Run the following command to SSH into your cluster’s master node (Dataproc names it YOUR_CLUSTER_NAME-m). You’ll need to change the bolded parameters based on your GCP account and cluster:

gcloud compute --project YOUR_PROJECT_ID ssh --zone YOUR_COMPUTE_ZONE YOUR_CLUSTER_NAME-m

Then, from the master node, one way to get the external IP is to query the GCE metadata server:

curl -H "Metadata-Flavor: Google" "http://metadata.google.internal/computeMetadata/v1/instance/network-interfaces/0/access-configs/0/external-ip"

Step #2: Set up an SSH tunnel between your local machine and the master node

ssh -i ~/.ssh/google_compute_engine -L 8787:localhost:8787 YOUR_USERNAME@EXTERNAL_IP_OF_MASTER_NODE

Once the tunnel is up, point your browser at http://localhost:8787 to see the Dask dashboard.
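For instance, with hypothetical values plugged in (a placeholder username and a documentation-range IP), the tunnel command comes out as:

```shell
# Compose the tunnel command with placeholder values; substitute your own
# username and the external IP you found in Step #1.
USERNAME=alice
MASTER_IP=203.0.113.10   # RFC 5737 documentation address, not a real node
TUNNEL_CMD="ssh -i ~/.ssh/google_compute_engine -L 8787:localhost:8787 ${USERNAME}@${MASTER_IP}"
echo "$TUNNEL_CMD"
# prints: ssh -i ~/.ssh/google_compute_engine -L 8787:localhost:8787 alice@203.0.113.10
```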

Wrap Up

Once you’ve explored RAPIDS to your heart’s content, you can tear down the cluster using this command (depending on your gcloud configuration and version, you may also need to pass --region):

gcloud dataproc clusters delete YOUR_CLUSTER_NAME


RAPIDS is a suite of software libraries for executing end-to-end data science & analytics pipelines entirely on GPUs.

Written by

Raghav Mani

Developer Relations @ NVIDIA


