Run RAPIDS experiments at scale using Amazon SageMaker

Shashank Prasanna
Published in RAPIDS AI
Feb 10, 2020 · 16 min read

If you worked with machine learning in the 2000s, chances are that your tools, frameworks and go-to algorithms looked very different from what they do today. Deep neural networks have now become synonymous with machine learning, and non-neural network approaches such as logistic regression, k-nearest neighbors, support vector machines (SVM) and others are now referred to as “classical” or “traditional” machine learning.

RAPIDS is an open source project that aims to accelerate the entire data science workflow on GPUs, including data preparation and “traditional” machine learning training. In this post, I’ll walk through an end-to-end example of how you can take the RAPIDS open source libraries and run large-scale experiments in the cloud using Amazon SageMaker.

The intended reader for this guide is the developer, the researcher, the data scientist, the engineer — the technical professional who wants to run machine learning experiments on a cluster, but may not have the expertise or resources to set up and manage clusters.

RAPIDS, “traditional” machine learning, and scaling

Back when “traditional” machine learning was just called machine learning, you spent a significant amount of time on data preparation steps such as selecting the right features, transforming them, combining features to generate new ones, and cleaning up missing data. After that you opened your algorithm toolbox and chose from algorithms such as logistic regression, k-nearest neighbors, naive Bayes, and support vector machines (SVM), among others.

For complex problems that demanded models with more flexibility, you pulled out the big guns — ensemble approaches — which typically included bagged decision trees (RandomForest being a special case) and boosted decision trees (XGBoost being a popular implementation). They provided better accuracy, usually at the cost of complexity and processing time. And on that rare occasion you felt adventurous, you’d also try the neural network approach, which typically defaulted to a two-layer multi-layer perceptron.

So what’s changed?

Datasets are now larger, and therefore the same data preparation and machine learning steps now take a lot longer. Since a single pass through the end-to-end workflow takes longer, this has a compounding effect on the time it takes to run multiple experiments with variations of algorithms, features, and hyperparameters.

Speeding up data preparation and machine learning

One obvious way to speed up data preparation and machine learning is to take a page from the deep learning book and accelerate them on GPUs. GPU-accelerated machine learning libraries have been around for a while and can significantly speed up training. Popular packages such as xgboost and lightgbm offer GPU-accelerated, gradient-based tree boosting algorithms and are very popular in Kaggle competitions.

To use these libraries, you load the dataset into system memory, preprocess it using popular packages such as pandas, and convert it into a format that the machine learning algorithm can use. The prepared dataset is then copied into GPU memory, and after training is done, results are copied back to system memory for post-processing and visualization. One downside of this approach is that moving the dataset in and out of GPU memory can affect overall processing times.

CPU + GPU workflow

To really improve performance, you’ll need to accelerate the full pipeline on GPUs, including the data preprocessing, and this is where RAPIDS comes in. The RAPIDS suite of libraries for data processing and machine learning relies on the common Apache Arrow data layer. This means your data never has to leave the GPU, and your workflow ends up looking something like this.

GPU based workflow using RAPIDS

Speeding up experimentation

Running experiments sequentially on a single machine vs. running them in parallel on a cluster.

Unfortunately, even after accelerating your application to run as fast as possible on a single machine with a GPU, you still have to go through the process of dataset preprocessing and machine learning several times as you run experiments. Machine learning is an iterative process, and you’ll continue to make changes to algorithms, the number and type of features, and hyperparameter combinations in pursuit of better results.

You could do that on a single machine, but there isn’t a lot of compute and memory headroom left for fitting models in parallel. The GPU is already busy parallelizing a single model, so if you want to fit more models in parallel, you need to add more machines and GPUs to the mix. It therefore makes sense to set up a cluster of multiple GPU instances and submit training jobs so that they can run in parallel.

That is easier said than done. In theory, distributing compute sounds great, but as a machine learning practitioner, you may not have the interest, skills or resources to manage a large fleet of machines to train this many models in parallel.

This is where managed services like Amazon SageMaker come in. With RAPIDS you’ve already scaled up using GPUs, i.e., crammed as much compute as possible into a single system. With Amazon SageMaker you can now scale out, i.e., run variations of your RAPIDS code in parallel on a cluster, without having to set up or manage the cluster.

All you’ll need to do is bring in your RAPIDS training script and libraries as a Docker container image and ask Amazon SageMaker to run copies of it on a specified number of GPU instances in the cloud. Let’s take a closer look at how this works through an example.

Example: Running a large-scale hyperparameter search experiment using RAPIDS and Amazon SageMaker

Solving a problem always makes it easier to internalize new concepts. Therefore, to demonstrate running RAPIDS at scale using Amazon SageMaker, let’s try and improve the classification accuracy of finding Higgs bosons in the HIGGS dataset.

Here’s the approach you’ll take:

  • Implement your training script using a RandomForest classifier. To speed up training you’ll use the RAPIDS GPU-accelerated implementation of RandomForest.
  • Build an Amazon SageMaker compatible RAPIDS Docker container image with your training code in it. Test it locally.
  • Push the container image to a container registry.
  • Upload your dataset to Amazon S3.
  • Tell Amazon SageMaker how many GPU instances you need in a cluster to run parallel hyperparameter optimization jobs, and then start the hyperparameter tuning job.
End-to-end workflow: (1) Start with a RAPIDS training script (2) Build an Amazon SageMaker compatible Docker container (3) Push it to a container registry (4) Initiate hyperparameter search experiment on an Amazon SageMaker cluster

Notice that there are no clusters or instances to start, set up or manage; no datasets or code to copy to all the instances; and no clusters to shut down after training.

How and where do I run this example?

If you’d like to jump right in and run the example, you’ll find everything you need in this GitHub repository. It includes the Jupyter notebook, training scripts and Dockerfiles.

You can perform all these steps from your laptop, desktop or an Amazon SageMaker hosted Jupyter notebook instance. The quickest way to replicate this example is to use an Amazon SageMaker notebook instance, since it’s a fully managed compute instance running the Jupyter notebook server and comes with the SageMaker SDK already installed. Simply start a notebook instance, clone the repository with the example and run the sagemaker-rapids.ipynb Jupyter notebook.

You can also replicate these steps on your own laptop or desktop: download and install the SageMaker SDK on a system with a GPU, and run the example Jupyter notebook from the GitHub repository. You will be able to replicate most of the steps without an AWS account — except the part where you run a hyperparameter optimization job on an Amazon SageMaker cluster, which runs on AWS.

Training RandomForest using RAPIDS

The RAPIDS cuML library supports various GPU-accelerated machine learning algorithms such as nearest neighbors, support vector machines, RandomForest and logistic regression, among others. The RAPIDS cuDF library, which offers a pandas-like API for data preparation, also integrates with GPU-accelerated machine learning libraries such as XGBoost. In this example you’ll be training a RandomForest classifier, but you could just as easily replace it with another RAPIDS-supported algorithm with very few code changes.

RandomForest uses an ensemble of decision trees, trains them independently, and averages their results. Each decision tree is an unstable learner, because it is very sensitive to changes in the training dataset and will fit a very different tree when small changes are made, such as skipping a row or column. RandomForest uses the instability of the learning procedure to its advantage. By training many trees on different subsets of observations (rows) and features (columns), you get a diverse set of trees. And like all good democracies, every tree gets to vote, and the ideas the majority agrees on win. If there is a bad observation (row) or an erroneous feature (column), not every tree gets to see it, since each tree is fit on a randomly sampled subset of observations and features. When the majority vote is taken, trees trained on bad or anomalous parts of the dataset may get voted out, improving the robustness of the model.

Let’s take a closer look at the implementation.

Excerpt from rapids-higgs.py

cudf.read_csv loads the dataset from a CSV file into GPU memory and returns a cuDF DataFrame that lives there. All subsequent data preprocessing and machine learning training can be done in GPU memory.

Next we split the dataset into training and test sets, and again, note that this is done in GPU memory! This script doesn’t do any preprocessing, but you could apply a number of transformations such as PCA, t-SNE, label encoding and others directly to the dataset in GPU memory using cuML.

Finally we define the RandomForest hyperparameters and fit the model. This script assumes that the hyperparameters are passed in as command-line arguments, since that’s how Amazon SageMaker interacts with your training scripts.
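To make this concrete, here is a condensed sketch of what that logic looks like. It is a hypothetical reconstruction based on the description above rather than the verbatim excerpt, and the exact import paths and RandomForest arguments have shifted a little between RAPIDS releases, so check the cuML docs for your version.

import cudf
from cuml.ensemble import RandomForestClassifier as cuRFC
from cuml.preprocessing.model_selection import train_test_split  # moved to cuml.model_selection in later releases

# Load the CSV straight into GPU memory as a cuDF DataFrame
data = cudf.read_csv('dataset/HIGGS.csv', header=None)

# First column of the HIGGS dataset is the label, the remaining columns are features
cols = data.columns
y = data[cols[0]].astype('int32')
X = data[cols[1:]].astype('float32')

# Train/test split, still entirely in GPU memory
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# In the full script these values come from command-line arguments (see below)
model = cuRFC(n_estimators=15, max_depth=5, max_features=0.2, n_bins=8)
model.fit(X_train, y_train)

# Print the test accuracy so it shows up in the training logs
print('test_acc:', model.score(X_test, y_test))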

Updating your training script for Amazon SageMaker

In order to make your training script compatible with Amazon SageMaker, you’ll need to make minor changes to it. First, you’ll need to make sure that your script can accept hyperparameters as command-line arguments. Amazon SageMaker will run your training script on different instances in the cluster and pass different hyperparameter combinations to the script, and your script should be able to accept these hyperparameters and apply them to the training run. Second, your script needs to be able to read the dataset. Amazon SageMaker transfers the dataset to each instance and makes a copy available at /opt/ml/input/data/dataset/, so your script should assume the dataset is available at that location. The rest of your script remains the same.

Excerpt from rapids-higgs.py
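Roughly, those two changes look like this. Again, this is an illustrative sketch rather than the actual excerpt; the argument names and default values are assumptions.

import argparse
import cudf

# 1. Accept hyperparameters as command-line arguments
parser = argparse.ArgumentParser()
parser.add_argument('--n_estimators', type=int, default=15)
parser.add_argument('--max_depth', type=int, default=5)
parser.add_argument('--n_bins', type=int, default=8)
parser.add_argument('--max_features', type=float, default=0.2)
args = parser.parse_args()

# 2. Read the dataset from the path where SageMaker stages the 'dataset' input channel
data_dir = '/opt/ml/input/data/dataset/'
data = cudf.read_csv(data_dir + 'HIGGS.csv', header=None)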

Building your custom RAPIDS Docker container

If you’re new to Docker containers, don’t let them intimidate you. They’re quite harmless, as long as you limit yourself to the features you need for machine learning and ignore the rest for the time being. Think of a container as your personal sandbox — a place where you can set up your program and nothing can mess with it. You can move the sandbox to a different location; the outside environment changes, but everything inside the sandbox remains the same, so you can be confident that your program will run just fine. The key ideas here are process, memory and data isolation. This is an obvious oversimplification, but it is sufficient to understand what’s going on in this example, and to use containers similarly for your own work.

To build your RAPIDS Docker container compatible with Amazon SageMaker, you’ll start with the base RAPIDS container, which the nice people at NVIDIA have already built and pushed to DockerHub (a public repository for containers).

You will need to extend this container by creating a Dockerfile with the following steps:

FROM rapidsai/rapidsai:cuda10.0-runtime-ubuntu16.04
RUN apt-get update && apt-get install -y --no-install-recommends build-essential
RUN source activate rapids && pip install sagemaker-containers
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
ENV SAGEMAKER_PROGRAM rapids-higgs.py

Let’s go through it: the Dockerfile first pulls the RAPIDS base container from Docker Hub, installs build tools, and then installs a package called sagemaker-containers to make the container compatible with Amazon SageMaker. Next, it copies your training script into the container. Finally, it tells Amazon SageMaker that your training script is called rapids-higgs.py by setting the SAGEMAKER_PROGRAM environment variable.

To build it, run:

docker build -t sagemaker-rapids:latest docker

After it’s done building, run docker images to make sure you can see your container image. You should see something like this, with an entry called sagemaker-rapids:

[ec2-user@ip-172-16-80-37 ~]$ docker images
REPOSITORY          TAG                            IMAGE ID       CREATED       SIZE
sagemaker-rapids    latest                         7948e55aa3c4   2 days ago    8.64GB
rapidsai/rapidsai   cuda10.0-runtime-ubuntu16.04   23341e245c4d   7 weeks ago   8.25GB

Testing your Amazon SageMaker compatible RAPIDS container locally

Before you go off and spend time and money running a large experiment on a large cluster, it’s always wise to test things locally and make sure they’re doing what they’re supposed to do.

Start by downloading the dataset to your local machine:

mkdir dataset
wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
gunzip dataset/HIGGS.csv.gz

Next, define some default hyperparameters. Take your best guess; your guess is as good as mine. You can find the full list of RandomForest hyperparameters on the cuML doc page.

Start off with some defaults.

Excerpt from sagemaker-rapids.ipynb
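As a sketch, the defaults could mirror the values visible in the local training log further below; the notebook’s actual dictionary may differ.

hyperparams = {
    'n_estimators': 15,
    'max_depth': 5,
    'n_bins': 8,
    'max_features': 0.2,
    'max_leaves': -1,
    'split_algo': 0,
    'split_criterion': 0,
    'bootstrap': 0,
    'bootstrap_features': 0
}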

If you’re a seasoned machine learning researcher or practitioner, you may have good intuition about what hyperparameters might work well for your problem. You probably guessed that you want more estimators (trees) to improve robustness, deeper trees for better diversity, and a smaller or larger fraction of observations to sample, based on your understanding of the dataset. Even with all of that, it’s still your best guess and nothing more. Which is why you’ll be running a hyperparameter search experiment to search through various combinations of these values.

Now, run a local Amazon SageMaker training job to ensure the container performs as expected. Make sure you have the SageMaker SDK installed on your local machine.

Excerpt from sagemaker-rapids.ipynb
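A sketch of that local-mode estimator, assuming the SageMaker Python SDK v1 API that was current at the time; execution_role and the hyperparams dictionary from earlier are placeholders you would define in your notebook.

from sagemaker.estimator import Estimator

estimator = Estimator(image_name='sagemaker-rapids:latest',  # local image built above
                      role=execution_role,                   # your SageMaker execution role ARN
                      train_instance_count=1,
                      train_instance_type='local_gpu',       # run on the local GPU
                      hyperparameters=hyperparams)

# 'dataset' is the input channel name; SageMaker maps it to /opt/ml/input/data/dataset/
estimator.fit({'dataset': 'file://dataset'})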

Here, you first specify the instance type as local_gpu. This assumes that you have a GPU locally. If you don’t have a local GPU, you can test this on an Amazon SageMaker managed GPU instance — simply replace local_gpu with a p3 or p2 GPU instance by updating the train_instance_type variable. Use ml.p3.2xlarge if you want to run this on an NVIDIA Tesla V100 GPU.

When you run it locally, you’ll see an output that ends with something like this:

......
algo-1-ikchj_1 | Invoking script with the following command:
algo-1-ikchj_1 |
algo-1-ikchj_1 | /opt/conda/envs/rapids/bin/python rapids-higgs.py --bootstrap 0 --bootstrap_features 0 --max_depth 5 --max_features 0.2 --max_leaves -1 --n_bins 8 --n_estimators 15 --split_algo 0 --split_criterion 0
algo-1-ikchj_1 |
algo-1-ikchj_1 |
algo-1-ikchj_1 | test_acc: 0.6282582944671835
algo-1-ikchj_1 | 2020-02-03 23:12:46,178 sagemaker-containers INFO Reporting training SUCCESS
tmpksw0arl4_algo-1-ikchj_1 exited with code 0
Aborting on container exit...

Congrats, you successfully trained your Random Forest model on the HIGGS dataset using an Amazon SageMaker compatible RAPIDS container. Now you are ready to run experiments on a cluster to try out different hyperparameters and options in parallel.

Push your RAPIDS container to the Elastic Container Registry

When running a large-scale training job, either for distributed training or for independent experiments, you will need to make sure that datasets and training scripts are replicated on each instance in your cluster. Thankfully, the more painful of the two — moving datasets — is taken care of by Amazon SageMaker. As for the training code, you already have a Docker container ready; you simply need to push it to a container registry, and Amazon SageMaker will then pull it onto each of the training compute instances in the cluster. Use the Amazon Elastic Container Registry (Amazon ECR) to store your Amazon SageMaker compatible RAPIDS container and make it available to Amazon SageMaker.

aws ecr create-repository --repository-name sagemaker-rapids
$(aws ecr get-login --no-include-email --region {region})
image = '{}.dkr.ecr.{}.amazonaws.com/sagemaker-rapids:latest'.format(account, region)
docker tag sagemaker-rapids:latest {image}
docker push {image}

In the code above, you first create a new repository called sagemaker-rapids. You are of course free to choose a different name; just make sure you use the same name in subsequent steps.

After that you log in to the registry. Then you tag the container image with the path and name of your registry, which includes the account and region information. Finally, push it, and you should see the container image in the registry.

Navigate to the AWS Console > Amazon ECR > sagemaker-rapids and you should see your freshly pushed container in this registry.

Define hyperparameter ranges and run a large-scale search experiment

There aren’t a lot of code changes required to go from local training to training at scale.

First, rather than defining a fixed set of hyperparameters, you’ll define ranges using the SageMaker SDK.

Excerpt from sagemaker-rapids.ipynb
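With the SageMaker SDK, the ranges are expressed with parameter classes such as IntegerParameter and ContinuousParameter. The bounds below are illustrative assumptions; check the cuML documentation for valid values before widening them.

from sagemaker.tuner import IntegerParameter, ContinuousParameter

hyperparameter_ranges = {
    'n_estimators': IntegerParameter(10, 200),
    'max_depth': IntegerParameter(1, 22),
    'n_bins': IntegerParameter(5, 24),
    'max_features': ContinuousParameter(0.01, 0.5)
}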

Next, you’ll change the instance type from local_gpu to the actual GPU instance you want to train on in the cloud. Here you’ll choose an Amazon SageMaker compute instance with a single NVIDIA Tesla V100 GPU: ml.p3.2xlarge. If you have a training script that can leverage multiple GPUs, you can choose up to 8 GPUs per instance for faster training.

Excerpt from sagemaker-rapids.ipynb
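The only changes to the estimator are the image, now the ECR URI pushed in the previous section, and the instance type. A sketch, reusing the placeholder names from the earlier snippets:

from sagemaker.estimator import Estimator

estimator = Estimator(image_name=image,                    # ECR image URI pushed earlier
                      role=execution_role,
                      train_instance_count=1,
                      train_instance_type='ml.p3.2xlarge', # one NVIDIA Tesla V100 GPU
                      hyperparameters=hyperparams)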

Now you define a HyperparameterTuner object using the estimator you defined above.

Excerpt from sagemaker-rapids.ipynb
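A sketch of the tuner definition follows, with illustrative values for the number of jobs; the argument names assume the SageMaker SDK v1 API.

from sagemaker.tuner import HyperparameterTuner

tuner = HyperparameterTuner(estimator=estimator,
                            objective_metric_name='test_acc',
                            objective_type='Maximize',
                            metric_definitions=[
                                {'Name': 'test_acc', 'Regex': 'test_acc: ([0-9\\.]+)'}],
                            hyperparameter_ranges=hyperparameter_ranges,
                            strategy='Bayesian',
                            max_jobs=32,
                            max_parallel_jobs=4)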

Here are the key arguments that define the experiment:

  • objective_metric_name and objective_type: these define the metric for evaluating each training job in the experiment. In this example, the tuner is trying to maximize the test accuracy, since objective_type is set to Maximize and objective_metric_name is set to test_acc.
  • metric_definitions: this tells Amazon SageMaker how to extract the metric from your training logs. Have your training script print the metric to stdout using regular Python print statements, and update the metric_definitions argument with the right regular expression so that Amazon SageMaker can pick it up.
  • max_jobs and max_parallel_jobs: these define how many training jobs you want Amazon SageMaker to run in total and how many of them you want to run in parallel on separate GPU training instances. Each of these jobs runs with a different set of hyperparameters chosen by the Bayesian optimization algorithm. Setting max_parallel_jobs to 4 instructs Amazon SageMaker to run only 4 training jobs at a time. Select max_jobs and max_parallel_jobs based on how many experiments you want to run and what budget and resource constraints you have.
  • strategy: this defines the type of search to perform, and it can be Random or Bayesian. The following should give you a quick overview of the two strategies:

Random search selects random sets of values sampled from the search space defined by the hyperparameter_ranges variable. While it may seem that random search is akin to throwing darts in the dark, Bergstra and Bengio discuss in their paper, Random Search for Hyper-Parameter Optimization, why random search can find models as good as or better than searching over a grid of values, at a fraction of the compute cost. They argue that it should be considered a natural baseline when exploring hyperparameters. Bayesian search takes a different approach: it builds a surrogate (or proxy) model over the loss function and finds the optimal hyperparameters over that surrogate. It’s much easier to optimize over the surrogate model than the actual model, which requires retraining. The Bayesian approach may yield better results in fewer evaluations compared to grid and random search. When using Amazon SageMaker you don’t have to know how these are implemented, but it’s always a good idea to have some intuition about what’s happening behind the scenes.

Finally, upload your dataset to Amazon S3 so that Amazon SageMaker can copy it to each worker, and then initiate hyperparameter optimization by calling the fit function.

Excerpt from sagemaker-rapids.ipynb
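A sketch of those last two steps; the S3 key prefix is a placeholder.

import sagemaker

session = sagemaker.Session()
s3_dataset = session.upload_data(path='dataset', key_prefix='higgs-dataset')

# Kick off the hyperparameter tuning job; 'dataset' is the same channel name as before
tuner.fit({'dataset': s3_dataset})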

To monitor your experiments, head over to AWS console > Amazon SageMaker.

On the left, under Training, you’ll see Hyperparameter tuning jobs; click on it and then select your currently running tuning job.

Click on the Training Jobs tab and you’ll see all the jobs that were run with different hyperparameter combinations.

Click on the Best training job tab and you should see the training job that yielded the highest test accuracy and, below it, the corresponding hyperparameters.

The test accuracy for this run is 73.8% on the validation set, which is an improvement over the 62.8% achieved with the defaults I chose during local training.

“Best” here doesn’t mean best for the dataset. It’s the best of the 32 jobs that ran, with hyperparameters sampled from the ranges I specified, and for the validation set I chose. It basically means there’s always room for improvement. When I ran this example, 9 of the 32 jobs failed because Bayesian optimization selected hyperparameter values from the ranges I specified that happened to be above or below the limits of what the cuML RandomForest implementation could accept. If you see this, make the ranges more conservative, and always consult the documentation on what values are acceptable for each hyperparameter.

Recap

Machine learning involves a lot of experimentation; there is no question about it. As data scientists, researchers and machine learning practitioners, you spend days, weeks or months experimenting to find the right data transformations, machine learning algorithms and hyperparameters. In this blog post, you saw two ways to speed up this process:

  1. By employing faster implementations of popular algorithms that leverage GPUs using RAPIDS
  2. By running experiments in parallel in the cloud without having to deal with the complexity of setting up and managing clusters and scheduling experiments.

The complete example is available on GitHub:

https://github.com/shashankprasanna/sagemaker-rapids

The repository includes the training script, Dockerfile and a Jupyter notebook that walks you through the process step by step. You’ll need an AWS account to run experiments in the cloud; you don’t need one if you’re only testing the local mode. If you run the example end-to-end, it’s always a good idea to terminate any lingering AWS resources that can cost you money. Make sure you:

  • Delete Amazon S3 buckets and files you don’t need
  • Navigate to Amazon SageMaker Console > Training Jobs and stop training jobs that you don’t want running
  • Navigate to Amazon ECR and delete container images and the repositories you don’t need

Have fun experimenting with RAPIDS and Amazon SageMaker!
