30x Faster Hyperparameter Search with Ray Tune and RAPIDS

May 15, 2020

Increasing the accuracy of a machine learning model can translate into significant cost reductions, revenue increases, or even lives saved. But optimizing a model manually can be a time-consuming, labor-intensive process, particularly for models with a large number of different hyperparameters to tune by hand. Hyperparameter optimization (HPO) automates this process, using intelligent algorithms to explore model configurations to find one that maximizes accuracy.

While most data scientists are aware of options for hyperparameter optimization, they often find it impractical to incorporate HPO into everyday model-building for two key reasons:

  1. HPO can require significant development work, and
  2. It can increase model training time dramatically.

Ray Tune and RAPIDS have teamed up to address both concerns. Ray Tune is a scalable HPO library that allows the optimization to be performed in a distributed manner. It provides various search algorithms along with smarter ways to schedule them in order to arrive at the optimal solution quickly and efficiently. Tune is built on Ray, a system for easily scaling applications from a laptop to a cluster. RAPIDS is a suite of GPU-accelerated libraries for data science, including both ETL and machine learning tasks.

Many thanks to Michael Demoret from NVIDIA for the original notebook and to the team at Anyscale for their help with content review and feedback.

In this post, we will show how to both increase the accuracy of our Random Forest Classifier by 5 percentage points AND reduce tuning time by 30x. We’ll do this by walking through an end-to-end example of hyperparameter optimization with a Random Forest Classifier. The intended audience includes any data scientist or data engineer who wants to run HPO experiments faster and more easily.

Scaling with RAPIDS and Ray Tune

Complex models like XGBoost and RandomForest can provide excellent accuracy. However, training these models on large datasets can take hours and sometimes days on CPUs, especially since they rely on several hyperparameters that need to be tuned.

Moving data between CPU- and GPU-based workflows is frequently a bottleneck. RAPIDS set out to eliminate these transfers by loading the data, performing ETL tasks, and training the model while staying entirely within GPU memory. By keeping the whole workflow on the GPU, processing times are greatly reduced.

Figure 1. Traditional CPU workflow with GPU-accelerated ML libraries (left) and RAPIDS workflow (right)

Now, let’s look at how to use both Ray Tune and RAPIDS together to leverage their advantages.

Figure 2. Example workflow using both RAPIDS and Ray Tune

Example: HPO with Ray Tune + RAPIDS

For this demo, we use the Airline dataset, which contains historical departure and arrival time data for millions of flights from the FAA. The aim of the model is to predict whether each flight’s arrival is delayed. To do this, we’ll use the RAPIDS cuML RandomForestClassifier. cuML is the RAPIDS library that implements machine learning algorithms; it lets users run ML models on GPUs without knowing the details of CUDA. You can learn more about the library and contribute to the development here.

We will walk through a Jupyter Notebook to explain the approach we have taken. You can find the full details in the notebook here.

Below is a brief summary of the steps taken in the notebook:

  1. Download the dataset to a local directory and load it through cuDF into the GPU.
  2. Prepare the dataset for the problem by selecting the columns we are interested in and discarding the rest. In this step, we also introduce a field called “ArrDelayBinary”, which is set to True if the flight’s arrival delay exceeds the “delayed_threshold” and False otherwise. This turns the task into a binary classification problem.
  3. Set up Tune training with the Trainable API.
  4. Define the experiment parameters and run the experiment.
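Step 2 above can be sketched in a few lines. This is a minimal illustration rather than the notebook's code: it uses pandas as a CPU stand-in for cuDF (whose DataFrame API closely mirrors pandas), and the tiny table, the feature columns, and the 10-minute threshold are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical miniature of the airline table. The notebook loads millions
# of rows with cuDF; pandas is used here only as a CPU stand-in.
flights = pd.DataFrame({
    "Distance": [300, 1200, 650, 2400],
    "DepTime": [900, 1430, 2015, 615],
    "ArrDelay": [-5, 42, 10, 75],          # minutes late (negative = early)
    "TailNum": ["N1", "N2", "N3", "N4"],   # a column we do not need
})

delayed_threshold = 10                      # assumed threshold, in minutes
feature_columns = ["Distance", "DepTime"]   # assumed feature selection

# Keep only the columns of interest and discard the rest.
dataset = flights[feature_columns + ["ArrDelay"]].copy()

# Turn the numeric delay into a True/False label for binary classification.
dataset["ArrDelayBinary"] = dataset["ArrDelay"] > delayed_threshold
dataset = dataset.drop(columns=["ArrDelay"])

labels = dataset["ArrDelayBinary"].tolist()
print(dataset)
```

Only flights that arrive strictly later than the threshold are labeled True, so a flight that is exactly 10 minutes late counts as on time in this sketch.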

Setting up the Trainable API

One way to run experiments in Tune is to subclass the Trainable class and define methods on it that implement our experiment. We subclass tune.Trainable as BaseTrainTransformer. To do this, we define the methods _setup, _build, _train, and reset_config to create and run the experiment, and _save and _restore for checkpointing.

Let’s take a closer look at _train to see how we create the model and evaluate its performance. The notebook also offers a “CPU” mode, but we do not recommend it for larger parameter ranges and data sizes; it is provided only for studying the performance difference.

To keep a clean separation between static configuration and varying hyperparameters, we will then wrap our BaseTrainTransformer as follows:

class WrappedTrainable(BaseTrainTransformer):
    def __init__(self, *args, **kwargs):
        self._static_config = static_config
        super().__init__(*args, **kwargs)
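To make the wrapping pattern concrete, here is a self-contained sketch that runs without Ray installed. The BaseTrainTransformer below is a simplified stand-in for the notebook's class, and make_trainable, num_rows, cv_folds, and max_depth are hypothetical names chosen for illustration. The point it shows is that static_config is captured from the enclosing scope and must be attached before the parent constructor runs, because that constructor triggers _setup.

```python
class BaseTrainTransformer:
    """Simplified stand-in for the notebook's tune.Trainable subclass."""

    def __init__(self, config=None):
        self.config = config or {}
        # Mimics the base class calling _setup during construction.
        self._setup(self.config)

    def _setup(self, config):
        # Static settings come from the wrapper, tunable ones from `config`.
        self.n_rows = self._static_config["num_rows"]
        self.max_depth = config.get("max_depth")


def make_trainable(static_config):
    """Return a Trainable class with `static_config` baked in (hypothetical helper)."""

    class WrappedTrainable(BaseTrainTransformer):
        def __init__(self, *args, **kwargs):
            # Must be set before super().__init__, which calls _setup.
            self._static_config = static_config
            super().__init__(*args, **kwargs)

    return WrappedTrainable


# Static settings are fixed once; hyperparameters vary per trial.
Trainable = make_trainable({"num_rows": 2_500_000, "cv_folds": 3})
trial = Trainable(config={"max_depth": 12})
print(trial.n_rows, trial.max_depth)
```

With this split, the search algorithm only ever sees the hyperparameters in config, while dataset size and fold count stay constant across trials.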

Ray Tune provides various hyperparameter search algorithms to optimize the model efficiently. In this demo, we can choose between two search algorithms:

  1. Bayesian Optimization Search

BayesOpt in Ray Tune is powered by Bayesian optimization, which attempts to find the best-performing parameters in as few iterations as possible. The technique is based on Bayesian inference and Gaussian processes, and it attempts to find regions of the hyperparameter space that are worth exploring. At each step, a Gaussian process is fitted to the known samples, and the posterior distribution, combined with an exploration strategy, is used to determine the next point to explore. Eventually, it finds a combination of parameters that yields results close to the optimum.

  2. Scikit Optimization Search

Scikit-optimize is a library for sequential model-based optimization. It is built on NumPy, SciPy, and scikit-learn.

These options can be selected in the notebook with “BayesOpt” or “SkOpt” to run the appropriate optimizer. It is worth noting how these two differ in performance in terms of finding the optimal parameters within a search space. Figure 3 and Figure 4 below illustrate the difference in performance between the optimizers.
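The fit-then-pick loop behind Bayesian optimization can be illustrated with a small NumPy sketch. This is not the code used by BayesOpt or Ray Tune: the one-dimensional objective is a toy stand-in for test accuracy, and the RBF kernel, its length scale, and the upper-confidence-bound exploration rule are assumptions chosen to keep the example short.

```python
import numpy as np


def objective(x):
    """Toy stand-in for 'test accuracy as a function of one hyperparameter'."""
    return -(x - 0.3) ** 2


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential covariance between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_seen, y_seen, x_query, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_ss = rbf_kernel(x_seen, x_seen) + noise * np.eye(len(x_seen))
    k_sq = rbf_kernel(x_seen, x_query)
    solved = np.linalg.solve(k_ss, k_sq)        # K^-1 k* for every query point
    mean = solved.T @ y_seen
    var = 1.0 - np.sum(k_sq * solved, axis=0)   # prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


candidates = np.linspace(0.0, 1.0, 201)  # discretized search space
x_seen = np.array([0.0, 1.0])            # two initial trials at the range ends
y_seen = objective(x_seen)

for _ in range(10):
    # Fit the GP to the known samples, then score every candidate.
    mean, std = gp_posterior(x_seen, y_seen, candidates)
    ucb = mean + 2.0 * std               # exploration strategy (assumed: UCB)
    next_x = candidates[np.argmax(ucb)]
    x_seen = np.append(x_seen, next_x)
    y_seen = np.append(y_seen, objective(next_x))

best_x = x_seen[np.argmax(y_seen)]
print(f"best x after {len(x_seen)} trials: {best_x:.3f}")
```

Early iterations favor points far from existing samples, where the posterior uncertainty is large; later ones concentrate where the predicted mean is high, which is the exploration and exploitation trade-off described above.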

Trial Schedulers

In addition to search algorithms, Ray Tune provides trial schedulers, which can stop unpromising trials early or perturb parameters during a run to reach good parameters sooner. This makes the search more resource-efficient. We’ve included two options for scheduling in the demo:

  1. Median Stopping Rule

This method stops a trial if its performance falls below the median performance of other trials at similar stages.

  2. Asynchronous HyperBand

This scheduler enables early stopping using the HyperBand optimization algorithm, which divides the trials into brackets of varying sizes. Within each bracket, low-performing trials are periodically stopped early. Ray Tune also provides an implementation of standard HyperBand, but we recommend the asynchronous version because it provides more parallelism and avoids straggler issues.

These schedulers can be selected in the notebook with “MedianStop” and “AsyncHyperBand”.
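The decision each scheduler makes can be sketched in plain Python. Both functions below are simplified illustrations, not Ray Tune's implementations: median_stop compares a trial's best accuracy so far with the median of its peers' running averages, and asha_promote keeps a trial only if it ranks in the top 1/reduction_factor of the scores recorded so far at its rung. The function names, min_others, and reduction_factor=3 are assumptions for the example.

```python
from statistics import median


def median_stop(trial_history, other_histories, min_others=3):
    """Return True if the trial should be stopped at its current step."""
    step = len(trial_history)
    # Running average of every peer that has reached the same step.
    peers = [sum(h[:step]) / step for h in other_histories if len(h) >= step]
    if len(peers) < min_others:
        return False  # not enough evidence to judge yet
    return max(trial_history) < median(peers)


def asha_promote(score, rung_scores, reduction_factor=3):
    """Return True if `score` is in the top 1/reduction_factor of its rung.

    The decision uses only the scores recorded so far, without waiting for
    the rung to fill up, which is what makes the scheme asynchronous.
    """
    scores = sorted(rung_scores + [score], reverse=True)
    keep = len(scores) // reduction_factor
    return keep > 0 and score >= scores[keep - 1]


others = [[0.70, 0.72, 0.74], [0.68, 0.71, 0.73], [0.60, 0.62, 0.63]]
stop_weak = median_stop([0.55, 0.58], others)    # well below its peers
stop_strong = median_stop([0.71, 0.75], others)  # ahead of its peers

rung = [0.70, 0.65, 0.72, 0.60, 0.68]
promote_high = asha_promote(0.74, rung)
promote_low = asha_promote(0.66, rung)

print(stop_weak, stop_strong, promote_high, promote_low)
```

In both cases a weak trial gives up its GPU early, so the freed resources go to configurations that still look promising.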

Setting Up and Running the Experiment

The notebook cell under “setting up the experiment” has variables that define how many trials should be run, the number of rows to be selected for the run, the cross-validation folds, the search and scheduling algorithms, and the parameter ranges. Have a close look at this and select appropriate values before starting the experiment.

Once these values are selected, we are ready to run the experiment. The full code is in the notebook.

Notice how the config is defined to take the parameter ranges as Tune objects.

Figure 3. Comparison of Scikit-Learn Optimization and Bayesian Optimization per trial. The solid thick line represents the smoothed mean test accuracy and the thin line with points shows the standard deviation of the mean test accuracy.
Figure 4. Cumulative best test accuracy for each optimizer, along with the maximum test accuracy each one reaches.

Results and Next Steps

From this experiment, we can see that HPO readily improves model performance. With minimal effort, we raised accuracy from 72% to 77% using just 2.5M of the 115M rows in the dataset. The total runtime for the experiment with 50 trials was just under half an hour, whereas the CPU version with just 25 trials took over 17 times longer, which works out to more than 30 times longer per trial. Ray Tune provides various powerful options for metrics, optimization, and scheduling algorithms to help arrive at the optimal solution efficiently. The next step would be to experiment with different search and scheduling algorithms and different RAPIDS cuML models on other data science problems.

GTC Digital Live Webinar

Hear Josh Patterson discuss more on the RAPIDS and Ray collaboration during the upcoming GTC Digital live webinar State of RAPIDS: Bridging the GPU Data Science Ecosystem [S22181] on May 28th at 9AM PDT.
