From Hours to Seconds: 100x Faster Boosting, Bagging, and Stacking with RAPIDS cuML and Scikit-learn

Machine Learning Model Ensembling

Published in RAPIDS AI · 5 min read · Aug 18, 2020

By Nick Becker and Dante Gama Dessavre

Introduction

To achieve peak performance, data scientists often turn to a technique called model ensembling, in which multiple algorithms are combined in clever ways to achieve better results. Common examples include Random Forest (Bagging) and Gradient Boosted Decision Trees (Boosting), but we can use ensemble learning with arbitrary models, too. Scikit-learn provides straightforward APIs for common ensembling approaches so data scientists can easily get up and running. Unfortunately, these techniques are computationally expensive, often so much so that they aren't cost- or time-effective.

What if you could train these complex ensemble models faster than you’re currently training your single models?

In this post, we’ll walk through how you can now use RAPIDS cuML with scikit-learn’s ensemble model APIs to achieve more than 100x faster boosting, bagging, stacking, and more. This is possible because of the well-defined interfaces and use of duck typing in the scikit-learn codebase. Using cuML estimators as drop-in replacements means data scientists can have their cake and eat it, too.

Why Ensemble?

Different kinds of model ensembles can provide many benefits, including reduced variance and higher accuracy (see The Elements of Statistical Learning, sections 8.7 and 8.8). As a result, models like Random Forests and libraries like XGBoost have become very popular. But we can ensemble with non-tree algorithms, too. As a concrete example, consider boosting a Support Vector Regression model.

Standard Support Vector Regression (SVR) achieves an out-of-sample R² of 0.41. Boosted SVR achieves a noticeably higher out-of-sample R² of 0.50. Though it delivers better results, the boosted scikit-learn SVR is much slower to train and use. Data scientists shouldn’t have to choose between building ensemble models and fast training. Using cuML with scikit-learn gives data scientists the tools they need to do both.
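
The snippet below is a minimal sketch of that comparison using scikit-learn alone. The dataset (a synthetic one from make_regression) and the hyperparameters are illustrative assumptions, not the exact benchmark configuration behind the numbers above.

```python
# Illustrative sketch: standard SVR vs. AdaBoost-boosted SVR on a
# synthetic regression dataset (not the exact benchmark from this post).
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

X, y = make_regression(n_samples=20000, n_features=10, noise=10.0, random_state=12)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)

# Single SVR model
svr = SVR().fit(X_train, y_train)
print("SVR R^2:", r2_score(y_test, svr.predict(X_test)))

# Boosted SVR: AdaBoost fits a sequence of SVRs, upweighting hard samples
boosted = AdaBoostRegressor(SVR(), n_estimators=10).fit(X_train, y_train)
print("Boosted SVR R^2:", r2_score(y_test, boosted.predict(X_test)))
```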

Ensembling with cuML + Scikit-learn

We’ve recently enhanced cuML’s support for scikit-learn APIs and interoperability standards so that it can be used with scikit-learn’s ensemble APIs. Even when working with NumPy based CPU inputs and outputs (currently required for these ensemble model scikit-learn APIs), there are massive speedups. In the following sections, we’ll walk through several small examples that highlight both the ease of use and the impact of using cuML with datasets across a range of sizes.

Voting Classifier

Scikit-learn’s VotingClassifier lets your final prediction come from a vote between multiple independently trained models. In the following example, we vote between the predictions from Logistic Regression and Support Vector Classifier models, giving more weight to the predictions from the SVC model.
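
Here is a minimal sketch of that pattern, assuming a RAPIDS environment with cuML installed and a synthetic dataset from make_classification; the data, weights, and hyperparameters are illustrative.

```python
# Illustrative sketch: scikit-learn's VotingClassifier with cuML estimators
# dropped in as the base models (assumes a RAPIDS environment with cuML).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from cuml.linear_model import LogisticRegression
from cuml.svm import SVC

X, y = make_classification(n_samples=50000, n_features=10, random_state=12)

# Soft voting averages predicted probabilities, weighting the SVC 2x
clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("svc", SVC(probability=True)),
    ],
    voting="soft",
    weights=[1, 2],
)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```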

With just 50,000 records in the data, using cuML for the Logistic Regression and SVC estimators in the VotingClassifier provides a 100x speedup. By the time we hit 200,000 records, the speedup factor jumps to almost 300x. cuML’s algorithms scale more effectively than their CPU equivalents because of the GPU’s massive parallelism, high-bandwidth memory, and ability to process more data before saturating the available computational resources.

Stacking Classifier

Scikit-learn’s StackingClassifier takes the predictions from individual models as inputs to a “second-stage” classifier to make a final prediction. In the following example, we stack predictions from Logistic Regression and Support Vector Classifier models and use a Logistic Regression to make the final predictions.
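
As a minimal sketch (again assuming cuML is available and using an illustrative synthetic dataset), the second-stage Logistic Regression below is trained on the class predictions of the base estimators:

```python
# Illustrative sketch: scikit-learn's StackingClassifier with cuML base
# estimators and a cuML Logistic Regression as the final estimator.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from cuml.linear_model import LogisticRegression
from cuml.svm import SVC

X, y = make_classification(n_samples=100000, n_features=10, random_state=12)

# stack_method="predict" feeds the base models' class predictions to the
# final estimator; predicted probabilities can be stacked where supported.
clf = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression()),
        ("svc", SVC()),
    ],
    final_estimator=LogisticRegression(),
    stack_method="predict",
)
clf.fit(X, y)
print("Training accuracy:", clf.score(X, y))
```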

With a dataset of 100,000 rows and ten features, training the StackingClassifier is 35x faster, and scoring is more than 350x faster with cuML estimators.

Bagged Regression

Scikit-learn’s BaggingRegressor builds independent models on random samples drawn from the data (bootstrapping), and then aggregates the results to make a final prediction. This is quite similar to Random Forest but can be used with any estimator. In the following example, we bootstrap aggregate K-Nearest Neighbors Regression. KNN can easily overfit, so bagging is a great way to reduce the variance when using this high-capacity model.
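
A minimal sketch of this pattern, with an illustrative synthetic dataset and parameters:

```python
# Illustrative sketch: scikit-learn's BaggingRegressor wrapping cuML's
# KNeighborsRegressor (assumes a RAPIDS environment with cuML installed).
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from cuml.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=250000, n_features=10, random_state=12)

# Each of the 10 KNN regressors is fit on a bootstrap sample of the rows,
# and their predictions are averaged at predict time.
reg = BaggingRegressor(KNeighborsRegressor(n_neighbors=10), n_estimators=10)
reg.fit(X, y)
print("Training R^2:", reg.score(X, y))
```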

We’ve increased our data size to 250,000 records for this example. With 250,000 rows, using cuML for Bagged KNN Regression is 245x faster. From 1.3 hours down to 19 seconds by swapping one line of code.

Boosted Regression

Scikit-learn’s AdaBoostRegressor builds a boosted model using the AdaBoost algorithm. At a high level, this involves fitting and predicting on the data, increasing the weight of the “difficult” samples in the data, and continuing to train the model with the new sample weights. In the following example, we boost Support Vector Regression.
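
A minimal sketch, with illustrative data and parameters and assuming a RAPIDS environment with cuML installed:

```python
# Illustrative sketch: scikit-learn's AdaBoostRegressor boosting cuML's SVR.
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from cuml.svm import SVR

X, y = make_regression(n_samples=20000, n_features=10, random_state=12)

# AdaBoost fits SVRs sequentially, reweighting the hardest samples each round
reg = AdaBoostRegressor(SVR(), n_estimators=10)
reg.fit(X, y)
print("Training R^2:", reg.score(X, y))
```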

Even with just 20,000 rows and ten features, dropping cuML’s SVR into scikit-learn’s AdaBoostRegressor API gives a 140x speedup during training and a 400x speedup during scoring.

Conclusion

Ensemble modeling can lead to better models but is often too computationally expensive to justify. By integrating with scikit-learn’s meta-estimators and dramatically speeding them up, cuML now allows data scientists to train ensemble models faster than they could previously train individual models. Ensemble learning and AutoML libraries built around scikit-learn APIs can unlock speedups like those shown above by allowing users to swap scikit-learn estimators for cuML estimators explicitly or implicitly (via duck typing).

Today, these ensemble modeling APIs require using CPU-based inputs and outputs (e.g., NumPy arrays). The PyData community has been actively working on efforts to streamline using arbitrary arrays (including GPU arrays) in libraries relying on NumPy. Eventually, we hope to support all of these meta-estimators end-to-end on the GPU for even larger speedups.

Want to help drive data science software forward? Check out cuML and scikit-learn on GitHub and file a feature request or contribute a pull request. Want to get started with RAPIDS and access these 100x+ speedups? Check out the Getting Started webpage, with links to help you download pre-built Docker containers or install directly via Conda.
