(Code and original text by Matthew Jones)
A Guide to Machine Learning at Scale
On this blog, we talk a lot about RAPIDS, a collection of GPU-accelerated data science libraries: what it is, how it works, and the principles driving its rapid development. Today, I won’t be addressing any of these items; instead, I’m going to show you how to deploy RAPIDS at scale using Azure Machine Learning service. If doing so happens to illustrate what RAPIDS is all about (bringing the compute to your problem), then I’d consider that doubly valuable.
Starting at the End — a Distributed ML Workload
I’m not very good at teasing things along; so, I’m just going to show you what all of this looks like after we’ve set it up:
Here, you see a Jupyter notebook alongside the Dask Dashboard. It’s visualizing RAPIDS code processing tens of millions of records from an 80 GB dataset of New York taxi trips to build a predictive model of fare prices. All of this runs on an Azure Machine Learning service cluster with NVIDIA V100 GPUs, which can be scaled to any size you need with a single flag.
The rest of this article will walk through exactly how you can set up similar, GPU-accelerated, multi-node workflows using Azure Machine Learning.
Let’s start with a quick overview. First, you’ll use the Azure portal to set up the necessary resources. If you’re not familiar with the Azure portal, the links below will provide a step-by-step walkthrough.
- Create your Resource Group
- Within your Resource Group, create an Azure Machine Learning service Workspace
- Within your Workspace, download the Configuration json file
- Within your Workspace, check your Usage + Quota to ensure you have enough quota to launch your desired cluster size
Then, from your local machine, you’ll run our helper scripts to get the cluster running:
- Clone the demonstration code and helper scripts
- Run the utility script to initialize the Azure Machine Learning service workspace
- Open your web browser and explore the demonstration notebook
Diving into the Workload
On your local machine, clone the RAPIDS notebooks-contrib GitHub repository containing demonstration code and helper scripts which streamline launching a RAPIDS cluster on Azure Machine Learning service. In your terminal, navigate to the blog_notebooks/azureml subdirectory from that repo. This directory includes:
1. Scripts that you run on your local machine
- start_azureml.py is the main utility. It creates a Compute target and an associated Experiment within a Workspace using the Azure Machine Learning SDK for Python. It can optionally upload the input dataset to your Azure storage for faster processing times in the cloud.
2. Scripts that the cluster automatically runs at startup
- rapids-0.9.yml is a file specifying the versions of all the RAPIDS packages that the cluster should download as it creates a scalable VM
- init_dask.py is a Python script that each node within the cluster will execute to establish the Dask cluster
Typical usage of this repository will look like this:
- Navigate to the root directory of RAPIDS-AzureML
- Issue the following command at the command-line:
● CONFIG_PATH is the path to your Configuration file
● VM_SIZE is the size of the virtual machine you’d like the Compute target to possess (e.g. “Standard_NC12s_v3” will build a cluster of 12-virtual-core nodes, with two V100 GPUs each)
● N is the number of nodes (virtual machines) you’d like to deploy
There are many more options to this utility script. You may issue
to see more.
When this runs, it will build a new container image, install all of the necessary packages, and add it to your registry. This may take half an hour or more, but don’t worry, this build step is only necessary the first time you run the script. Azure will cache the resulting image for quick re-use.
The script uses the Azure Machine Learning SDK to launch a cluster, scale it up to the desired number of nodes, and link them all into a Dask cluster. To follow along with the process, you can use the Azure portal to view the logs and output as documented below. If the launch is taking a particularly long time, this portal is the best place to debug any issues. In some cases, an Azure region may be so busy that there are no available GPU instances, so the cluster will be unable to scale. If you see that issue frequently, you may want to try another region.
The last step is to open your web browser and navigate to http://localhost:8888/, where an SSH tunnel has been created so that you can access Jupyter notebooks running RAPIDS from your local machine.
These scripts currently assume you are running from a UNIX-like environment (e.g. Linux, macOS, or Windows Subsystem for Linux) with an ssh client installed.
The notebook is based on the NYC Taxi demonstration from RAPIDSAI Notebooks. It starts with three years of taxi trip data (2014, 2015, 2016). Halfway through 2016, the structure (schema) of the data changed, so our ETL scripts need to handle that as well.
This poses several questions. How do we (1) create an aggregate DataFrame with all the data from these years, (2) use this data to create new features, and (3) leverage those features to train a simple Gradient Boosted Decision Tree (GBDT) using XGBoost? Moreover, what does this process look like with RAPIDS?
The short answers to these questions are easy: the code looks a lot like Pandas, and the time it takes to execute depends on how many GPUs are available within your cluster, but can be as quick as a few seconds on a multi-node cluster.
The parallelization all builds on Dask and uses a typical Dask approach:
- Initialize the client
- Define a set of functions
- Ask the client to load in the data
- Ask the client to run the functions on the data
The Dask client will construct a computational graph that it will eventually execute. Dask is lazy: it doesn’t like to execute until it knows as much as possible about what you’d like it to do (or until it absolutely has to execute).
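To make the idea of lazy execution concrete, here is a toy sketch in plain Python. It is not Dask and not the notebook's code: the `Lazy` class and the `load`, `double`, and `total` functions are hypothetical names invented for illustration. It only shows the pattern of recording work as a graph first and running it on demand; Dask's real scheduler is far more capable (it distributes tasks across workers, for one).

```python
class Lazy:
    """A deferred computation: stores a function and its inputs, runs only on compute()."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Resolve any Lazy inputs first, then apply this node's function.
        resolved = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records which steps actually ran, and in what order


def load(n):
    calls.append("load")
    return list(range(n))


def double(xs):
    calls.append("double")
    return [2 * x for x in xs]


def total(xs):
    calls.append("total")
    return sum(xs)


# Building the graph only records the work to be done; nothing executes yet.
graph = Lazy(total, Lazy(double, Lazy(load, 4)))
ran_before_compute = list(calls)

# Execution happens only when the result is requested.
result = graph.compute()
```

In Dask itself, the corresponding step is calling `.compute()` on a lazy collection or delayed object, which is when the graph is handed to the scheduler.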
The majority of this notebook will be familiar to Pandas users, since the underlying libraries maintain the same APIs as Pandas wherever possible. The notebook also uses the Pandas-style query API to perform complex filtering operations efficiently. For a more detailed look at the dataframe APIs in Dask, see the Dask documentation.
To get a solid accuracy measurement in the prediction phase of this workload, we needed to calculate the Haversine distance. The RAPIDS cuDF library does not ship with this function; many geospatial algorithms are planned for a new library called cuSpatial. For now, RAPIDS lets you define a new, GPU-accelerated function without using a low-level library like CUDA or a similarly flavored Python variant. Not only are your custom functions ported to the GPU for you, but they are also parallelized across a cluster of arbitrary size. Moreover, the code we ended up writing looks like ordinary Python, with no special decorators or fancy imports.
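For reference, the Haversine formula itself is short. The sketch below is a plain-Python, CPU-only version meant to show the arithmetic, not the GPU-accelerated function from the notebook. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant for this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
```

A distance like this, computed per trip from the endpoint coordinates, is the kind of engineered feature a fare model can learn from.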
Training using XGBoost on GPU is also not complicated. It matches the standard XGBoost API and requires only a single parameter change — the tree_method should be set to gpu_hist:
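As a minimal sketch of that parameter change (not the notebook's actual training code), the snippet below builds a parameter dictionary for the standard single-node XGBoost API. The helper names, the objective, and the default hyperparameter values are assumptions chosen for illustration. Note also that newer XGBoost releases (2.0 and later) prefer `tree_method="hist"` together with `device="cuda"`; `gpu_hist` matches the XGBoost generation this article targets.

```python
def gpu_xgboost_params(max_depth=8, learning_rate=0.1):
    """Return XGBoost parameters for GPU training (illustrative values, not tuned)."""
    return {
        "objective": "reg:squarederror",  # regression on fare price (assumed objective)
        "tree_method": "gpu_hist",        # the single change that moves training to the GPU
        "max_depth": max_depth,
        "learning_rate": learning_rate,
    }


def train_on_gpu(dtrain, num_boost_round=100):
    """Train a booster on an xgboost.DMatrix; requires XGBoost and a CUDA-capable GPU."""
    import xgboost as xgb  # imported lazily so the helper above works without XGBoost installed
    return xgb.train(gpu_xgboost_params(), dtrain, num_boost_round=num_boost_round)
```

Switching back to CPU training for comparison only requires changing `tree_method` to `hist`.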
Evaluating the model and wrapping up
Now that we’ve built our model, let’s look at how accurately it predicts fare prices. Prediction can also be carried out in parallel on multiple nodes:
Overall, our model predicts prices with a root mean square error of only about 2.0 off the true fare prices. Try launching the notebook on your own cluster to see if you can beat that score with a bit of feature engineering and XGBoost optimization. Don’t forget to use the Azure portal to shut down your compute nodes when you’re done with the experiment!
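If you want to score your own experiments the same way, root mean square error is simple to compute. The sketch below is a plain-Python version for small inputs. The `rmse` helper and the sample fares are hypothetical and are not taken from the notebook or the dataset.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical fares and predictions, for illustration only.
fares = [10.0, 12.5, 7.0, 30.0]
predictions = [11.0, 10.5, 8.0, 28.0]
score = rmse(fares, predictions)
```

Because the errors are squared before averaging, a few badly mispredicted trips raise the score more than many small misses do.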
Now that you’ve walked through one example notebook, you can reuse the setup scripts with your own workloads. Just add your own notebook to the rapids folder before running start_azureml.py and give it a shot.
This is just a first step for RAPIDS and Azure Machine Learning Service. We’ll be updating these scripts over time, including adding support for the new RAPIDS 0.10 release. If you find the code useful or have more questions, please reach out to us on Twitter @rapidsai, or via our RAPIDS Slack instance.
Appendix: Azure Machine Learning Service Configuration and Details
Azure hosts a number of services under its Marketplace, including Azure Machine Learning service. These services are grouped into categories; we’re interested in the AI + Machine Learning category and, in particular, the Machine Learning service workspace. Everything we deploy will live in a Machine Learning service workspace.
After you’ve logged into the Microsoft Azure portal online, you will see a homepage. It lists popular Azure services, useful links, and your most recent resources. While all of this is useful, we’re only interested in Resource Groups, and how to create them for the purpose of demonstrating RAPIDS at the scale of Azure Machine Learning service. The procedure for creating a resource group is simple.
1. Navigate to the Resource groups page
a) You can do this by clicking Resource groups on the home page
b)… or you can click Create a resource above and search for Resource group
2. Click Create, and you’ll see this page
a) You’ll need to select a Subscription with GPU resources available
b) You’ll need to select a Region with GPU resources available
c)… If you’re not sure, check out this page on GPU optimized VM sizes
3. Next, we’ll look at how to initialize the Machine Learning service workspace in the Workspaces section.
Creating a Machine Learning service workspace is easy. To do so, we’ll follow these steps:
1. Navigate to your Resource groups page. Be sure to select the correct subscription!
2. Select the Resource group you just created, and click Add … I named mine RAPIDS-AML
a) The blue box indicates where the Add button should be
b) The green box shows you what the resource collection will look like after we’ve added the Machine Learning service workspace
3. Selecting Add takes us to the Marketplace I mentioned at the beginning of Getting Familiar with Azure Machine Learning service. Now that we have a Resource group, we can add the Machine Learning service workspace to it by clicking on the service icon in the Marketplace and creating the service.
a) Be sure to select the correct Subscription!
b) Don’t forget to set the Location to a region where you have adequate GPU quota!
To use the Azure Machine Learning SDK for Python effectively, we need a file that captures the identifying details of a workspace in an easily accessible format. Enter config.json. This file contains a JSON dictionary with the following keys:
● “subscription_id” — the Subscription
● “resource_group” — the Resource group
● “workspace_name” — the Workspace
To download this file, open the Workspace you want to work with in the Azure portal and click the Download config.json button on its overview page.
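Before launching anything, it can help to confirm that the downloaded file has the three keys listed above. The sketch below uses only the Python standard library. The `load_workspace_config` helper is hypothetical, and the values written to the temporary file are placeholders, not real identifiers.

```python
import json
import os
import tempfile

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Load config.json and check that the three expected keys are present."""
    with open(path) as f:
        config = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config file is missing keys: {missing}")
    return config


# Placeholder values standing in for a real downloaded config.json.
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.json")
    with open(config_path, "w") as f:
        json.dump(example, f, indent=2)
    config = load_workspace_config(config_path)
```

The Azure Machine Learning SDK for Python reads the same file when you call `Workspace.from_config`.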
Viewing logs for your experiment
To see real-time output from your experiment, including any error messages, select the experiment name from your Azure Machine Learning workspace. You will see a table at the bottom listing runs from this experiment. Click on the link to the latest run to see its detail page. From there, you can select the Logs tab (shown below) and use the tree of log files to navigate to the one you want to view.
For more info on Azure Machine Learning and RAPIDS
RAPIDS and Azure are each broad ecosystems, so this blog is just scratching the surface. If you want to learn more, here are a few resources we’d recommend:
● Azure Machine Learning service provides extensive documentation.
● RAPIDS documentation is always up-to-date on docs.rapids.ai.
● You can get started learning XGBoost on the RAPIDS XGBoost landing page.