Optimizing Machine Learning Models with Hyperopt and RAPIDS on Databricks Cloud

John Zedlewski
Jun 24, 2020 · 7 min read
Parallel coordinates plot from a hyperopt experiment

TL;DR A quick tutorial on how to use the Hyperopt HPO package with RAPIDS on the Databricks Cloud to optimize the accuracy of a random forest classifier. (Full code samples available here.)

In our earlier blog, we showed how to use MLFlow together with RAPIDS cuML to train models faster and manage them more efficiently. These tools allow individual data scientists and teams to iterate quickly on models and continuously improve them while maintaining a smooth flow to production. But as it becomes easier to train large numbers of models, we have to ask — why not let an algorithm do the job for us? Here, hyperparameter optimization (HPO) can step in and help us build more accurate models with minimal effort.

This blog will walk you through using the hyperopt HPO package together with RAPIDS on the Databricks cloud to optimize the accuracy of a simple random forest classifier. RAPIDS will accelerate model training, and the Databricks cloud will parallelize it over as many workers as you want. In the training example (shared here in the cloud-ml-examples repo), this leads to an optimized model about 40x faster than a similar model built on a CPU without RAPIDS.

What is Hyperparameter Optimization?

Most data scientists have looked at the documentation for their algorithm of choice and marveled at the number of configurable parameters. XGBoost (the popular gradient-boosted decision tree library) has nearly 30 parameters that can be set for training, and many ML algorithms are similarly tunable. Proper parameter tuning can often make the difference between a mediocre model and an excellent one. Outside of machine learning competitions or academic settings, few ML engineers have the time to explore every parameter.

HPO systems automatically build dozens or hundreds of model variants with different parameters, compare their accuracies, and allow users to select the best model. Algorithms for HPO have been an area of academic interest for decades. They have increased in popularity recently as high-performance computing, and GPU acceleration makes it feasible to train larger quantities of models.

Naive HPO systems can explore the available parameters by doing a “grid search” — evenly exploring combinations of parameters across their valid range. For instance, to explore max_depth within the range of (5, 14) and n_estimators within the range of (50, 500) with increments of 50, grid search would evaluate all 100 possible combinations of parameters:

This approach can be exhaustive but very slow and often inefficient. If we train a number of models and see that low max_depth always leads to poor performance for many values of num_trees, do we want to spend our compute time to keep trying that parameter value again and again?

Intelligent HPO algorithms continuously evaluate the accuracy of past models and use them to guide the next steps of parameter exploration. Hyperopt, one of the most popular HPO systems, relies on an algorithm called Tree-structured Parzen Estimators (TPE) to speed up this process. TPE models the relationship between the hyperparameters and the objective function with piecewise functions that have a tree-like structure. For more detail on the internals of TPE, see Bergstra et al. 2011.

But you don’t have to master the details of TPE, because the open source hyperopt package encapsulates the algorithm in a friendly Python API. Hyperopt can be used standalone on any system to optimize a model, but it gets most interesting when combined with a distributed platform like Databricks.

Getting started with RAPIDS and Hyperopt on Databricks

In the Databricks directory under the RAPIDSAI/cloud-ml-examples repo on GitHub, we have an example notebook that walks through all of the code necessary to optimize a simple model with hyperopt on Databricks and track it through MLFlow. The repo also contains init scripts that make it easy to install RAPIDS on a Databricks cluster.

To build your RAPIDS-on-Databricks cluster, start by choosing a Databricks runtime that supports GPUs. We tested these scripts on “Runtime 6.6 ML” (later runtime versions may not be compatible yet). Choose a worker type that supports modern NVIDIA GPUs (Pascal or later) and choose the same instance type for the driver. For AWS-backed clusters, any of the “p3.” instances or “g4dn.” instances will support RAPIDS — just steer clear of the much older “p2.” instances, which have an earlier generation of GPU.

Instance types with GPU support

To install RAPIDS, you’ll need to copy the RAPIDS initialization scripts to your DBFS datastore. Copy the rapids_install_cuml0.13_cuda10.0_ubuntu16.04.sh script from the cloud-ml-examples repo to a DBFS location of your choice, e.g., by running the following commands in a Databricks notebook:

In the Advanced Options section of the configuration, add the DBFS path of this script as an init script for your cluster:

Configuring the init script

Now, when the cluster launches, it will automatically install RAPIDS. This process may take 10–15 minutes during launch, but it will not add additional runtime when you kick off jobs post-launch.

That’s it! You can launch a notebook, attach it to this cluster, and start using RAPIDS interactively right away. For some smaller jobs, you can compute directly within the notebook, running on the driver node. But for large-scale HPO, we’ll configure hyperopt to launch jobs on multiple workers to allow true scale-out.

Integrating with Hyperopt

Databricks has already installed a custom version of hyperopt in the ML environment on its cloud so that we can import it right away. We’ll start with a simple training function based on the same dataset and estimator we used in the MLFlow blog. Again, we’re training a classifier to predict whether a flight will arrive on time, using a dataset of past flights from the US Federal Aviation Administration. We want to tune the n_estimators (number of trees to build), max_depth(how deep each tree can be), and max_features (fraction of features to consider at each split) hyperparameters to maximize accuracy.

Integrating a model with hyperopt requires four steps:

  1. Define your objective function,
  2. Define your parameter space,
  3. Create a SparkTrials object, and
  4. Launch hyperopt with the “hyperopt.fmin” function.

1. Define your objective function

Hyperopt will explore parameters to minimize this function, and we can freely define it however we like. In a machine learning problem, this function will typically entail training a model with the given parameters, then evaluating it on a held-out validation set to measure its accuracy.
To match the hyperopt API, this objective function should return a dictionary with a “loss” value for the system to minimize. In the sample notebook, this is the “train_rapids” function.

2. Define your parameter space

For each parameter we want to explore, we need to give hyperopt a range of values to consider, in the form of a probability distribution. The hyperopt API provides probability range functions, like hyperopt.hp.uniform(parameter_name, lower_bound, upper_bound) to sample uniformly between the bounds. Hyperopt also supports distributions with integer ranges or categorical values.

3. Create a “SparkTrials” object

This will track our progress and allow integration with Apache Spark.

4. Launch hyperopt with the hyperopt.fmin function

This will launch up to MAX_PARALLEL jobs in parallel (potentially limited by the number of workers in your Databricks cluster), track their results and return the best result achieved.

The Databricks notebook UI allows you to track runs in progress as it optimizes. In the “Runs” tab of the notebook, you can see a quick summary view, but by clicking on the “Experiment UI” link at the bottom of this sidebar, you can get to a much richer view.

Runs view sidebar

Here, we can select metrics from our recent runs to the graph. Here, we’ll plot how accuracy varies as we increase the max_depth parameter.

Graphing in the Experiment UI

To build a large-scale model with the sample notebook, use the “airline_20000000.parquet” input file for the file name parameter. Running on two p3.2xlarge nodes (with one V100 GPU in each), this will take about 17 minutes to run on GPU. Model accuracy varies from about 83% for worse models to around 86% for high-accuracy models. So you can get a significant accuracy boost with just a few minutes of runtime. Depending on your budget and quotas, you can crank up the parallelism even further to complete the jobs faster. Because hyperopt is embarrassingly parallel, it scales extremely well to large systems.

When you want to deploy the model, you can take advantage of hyperopt’s built-in integration with the MLFlow model repository on Databricks. See the MLFlow blog for more details on MLFlow model management.

The notebook also contains sample code to run hyperopt on CPUs with scikit-learn. Be forewarned, though, that running it will take about 12 hours with two parallel nodes.

Wrapping Up

This blog has only scratched the surface of what’s possible with HPO and RAPIDS. The cloud-ml-examples repo contains several other HPO demos. If you have comments, questions, or problems, please file an issue on that repo or raise the question in the RAPIDSAI Slack.

Also, John Zedlewski gave a talk on Optimizing models with hyperopt and RAPIDS on Databricks Cloud at Spark+AI Summit. Check here for the on-demand recording.

RAPIDS AI

RAPIDS Everywhere