Accelerating Random Forests up to 45x using cuML

Vishal Mehta
RAPIDS AI
Published in
11 min readNov 13, 2019

Random Forests are a popular machine learning technique for classification and regression problems. By building multiple independent decision trees, they reduce the problems of over-fitting seen with individual trees. In this post we will review the basic Random Forests algorithms, show how their training can be parallelized on NVIDIA GPUs and finally present benchmark numbers demonstrating the performance. If you’d like to drill into more detail on the Random Forests algorithm, you may want to read a more detailed explanation from Towards Data Science or watch a lesson from Fast.ai

Random Forests

The main idea behind Random Forests is to learn multiple independent decision trees and use a consensus method to predict the unknown samples. Additionally, Random Forests use the techniques of bagging and feature sub-sampling, in order to make sure that no two resulting decision trees are the same.

With bagging (bootstrap aggregation), each decision tree is trained upon a different sample, with replacement of the original dataset. In this bootstrapped dataset, a given sample (row) of the training data can exist multiple times due to replacement.

When deciding on a split during tree building, only a random sample of features (columns), without replacement, is considered. Using feature sub-sampling can highlight different aspects of the dataset that could otherwise be unnoticed if overpowered by more “important” features.

Training

Let us understand this with an example.

The above is a toy classification dataset of fruits, based on their physical appearance. Let’s assume that we want to build a Random Forest containing 3 trees to classify different fruits in this dataset.

The first step is to create 3 datasets (one for each tree) after applying bagging (with replacement) and feature sub-sampling (without replacement) on this input dataset. In this toy dataset we are not demonstrating feature sub-sampling due to small size. Let us say that these are as follows:

Dataset 1:

Dataset 2:

Dataset 3:

After this, we build 3 independent decision trees using each of these sub-sampled datasets. The tree building algorithm will be covered in detail in the next section.

Inference

To infer the output class (for classification problems) or predict the output (for regression problems) of a previously unknown input sample, the sample is passed through every decision tree in the forest and individual tree predictions are noted. In the case of classification, the output class label is decided based on the majority vote of all decision trees. In the case of regression, the prediction will be the mean of all individual tree predictions.

Taking the above toy dataset as an example, our goal now is to determine the name of the fruit, given a new measurement.

Example 1: {0.0, 1.0, 0.0, 18cm}. This sample has 1.0 for the green color and 18 as size. Classifying this using the below Decision Trees will lead to majority{Apple, Watermelon, Watermelon} = Watermelon

Example 2: {1.0, 0.0, 0.0, 1cm}. This sample has 1.0 for the color red and size of 1cm. Classifying this using the below Decision Trees will lead to majority{Strawberry, Strawberry, Cherry} = Strawberry

Decision Tree Algorithm

Since Random Forests are a collection of decision trees, they need to start with an efficient algorithm to build a single tree.

Finding Splits

Finding the best split at a particular node involves two choices: choosing (a) the feature and (b) a split value for that feature that will result in the highest improvement to the model. The datasets sent to each of the two children of this node should have lower “impurity” than the parent node — i.e., the examples within each node should have outcomes that are as similar to each other as possible.

Splitting nodes continues until either all values in the subset mapping to that node are pure (e.g., all fruits are strawberries) or some other conditions are met (e.g., maximum tree depth, maximum number of samples per tree node, etc.).

At each node, the algorithm uses a specified metric to estimate how much a potential split will improve the model. cuML supports a number of split goodness metrics (see the table below). For classification, either Gini Impurity or Entropy is used, while for regression either Mean Squared Error (MSE) or Mean Absolute Error (MAE). The user can specify which metric to use (split_criterion option in Python), but for most users, it’s not critical to know the details of these metrics and the default works well.

Let us assume we are using the Gini Impurity metric to decide how to split a tree node. A potential split S, defined by (feature split_value), will split this node’s dataset of ‘N’ rows into left and right subsets with N_left and N_right rows respectively. The improvement of Split S is computed as

improvement = Giniᴾᴬᴿᴱᴺᵀ — impurity, where

impurity = N_left/N * Giniᴸᵉᶠᵗ + N_right/N * Giniᴿⁱᵍʰᵗ

The algorithm computes the potential improvement of many potential splits, for all the features, and selects the split that yields the maximum information gain. Because the number of split options is so large, only a subset of potential split values is considered per feature.

Accelerating split calculation with quantiles and histograms

cuML’s Random Forest library contains two high-performance split algorithms to select which values are explored for each feature+node combination: (a) min/max histograms and (b) quantiles. In both cases, at most n_bins split values are considered per feature. The user can specify which split algorithm to use (split_algo option in Python), as well as the number of bins (n_bins option). These approaches draw inspiration from the algorithm used in GPU-accelerated XGBoost and greatly reduce the work needed for split computation relative to an exhaustive search.

Min/max histograms: A histogram is built for every feature for a tree node. Every feature’s data range [min, max] is split into n_bins equally sized bins. The end range of each bin is considered as a potential split value. With this approach, the split values for each feature are recomputed at each node, thus adapting to the data ranges at each tree node. The min/max algorithm also helps in isolating outliers in the data at an early stage during the tree building process.

Quantiles: The quantiles split algorithm precomputes the potential split values for each feature once per tree, i.e., at the root node. Each feature (column) is sorted in ascending order, and split into n_bins such that each bin contains an equal portion of the root node’s dataset. The end range of each bin is considered as a potential split value. Thus, unlike min/max histograms where all bins have the same width, in quantiles all bins, except for the last one, have the same height for the root note, but are of variable width. As split values are precomputed once per tree, the quantile approach is faster than the min/max histogram algorithm. If, for a node deep in the tree, all feature values fall under a single bin, then no splitting can take place for that feature. A further optimization (supported with the quantile_per_tree Python option) is to compute the split values once per random forest, i.e., for the original non-bootstrapped dataset, rather than once per decision tree.

Leaf Nodes

A regular, non-leaf, decision tree node holds a split condition such as (feature split_value), but a leaf tree node holds a prediction. If the leaf node is not pure, i.e., it contains samples with different target feature values, the prediction is computed as follows. For classification, the prediction is the label appearing more often (or one of them in case of a tie), while for regression the prediction is the arithmetic mean of all target values.

Building Decision Trees — putting it all together

Building individual decision trees is where the heavy lifting of Random Forest is done.

Individual trees are built using a list of bootstrapped samples, as was discussed above. Many algorithms use a top down approach, proceeding with depth-first splits of each node then each newly-created child node. In a GPU context, this can lead to launching an enormous number of CUDA kernels — one per node. These small kernels quickly get queued up as launch time begins to dominate the processing. To remove this bottleneck, cuML uses a breadth-first algorithm, building a full layer of the tree at a time. This makes the runtime of the algorithm scale roughly linearly with depth.

The decision tree building process has a simple structure:

As individual Decision Trees are completely independent, building multiple decision trees is embarrassingly parallel. In some cases, the work needed to build a single tree may be too small to fully occupy a large GPU with thousands of CUDA cores. To take advantage of the whole processor, the cuML algorithm can build several trees in parallel on a single GPU. Each tree is built in its own CUDA stream controlled by an OpenMP thread on the CPU.

Building forests across multiple GPUs

cuML recently added an experimental feature to take this parallelism one step further and construct trees in parallel across multiple GPUs on the same node or across a cluster. This approach builds on the Dask distributed processing library.

In the distributed random forest approach, the developer first uses Dask to distribute the training data to all worker GPUs and then fits a cuml.dask.ensemble.RandomForestClassifier. The data can be randomly split and shared equally across all workers, in which case each worker builds trees on a subset of the full data. Alternatively, training data can be replicated so each worker has a complete view of the dataset. In practice, the random sharing approach effectively expands the amount of available memory and typically works well, but it may very slightly reduce model accuracy.

For a random forest with T trees and W workers, each worker will build T/W trees on its 1/Wᵗʰ fraction of locally-available data. As very little communication is required, random forests can scale efficiently to many GPUs. At inference time, predictions from trees on all of the workers are combined, just as if the trees had all been trained on a single GPU.

The Dask RF features in cuML are still experimental, and the API is subject to change in future releases. But it’s a great chance to check out the future of distributed RF and see how it works for your application. For more API details, see the cuml.dask.ensemble.RandomForestClassifier docs page.

cuML Random Forest examples

Side by side Single GPU RF with scikit-learn

As with other modules in cuML, the random forest implementation follows the scikit-learn API closely. So you just need to instantiate a random forest object and then call the “fit” and “predict” methods.

Building on multiple GPUs with Dask

Parallelizing to multiple GPUs with the experimental Dask interface is straightforward. This approach starts by distributing the data evenly across all the workers and then fits a cuml.dask.ensemble.RandomForestClassifier.

Benchmarks

Single GPU

Let us start by looking at the performance of random forest training in cuML comparing against sklearn. In the following tests, we use the release branch-0.10 for cuML and version 0.21.2 for sklearn, running on an NVIDIA DGX-1 server with eight V100–16GB GPUs and dual Xeon E5–2698v4@2.20GHz CPUs with 40 CPU cores in total. We use one V100–16GB GPU for single-GPU cuML runs and the maximum number of threads available, i.e., 80 CPU threads on the DGX-1 server, for sklearn runs. To ensure the best performance, we use a GPU dataframe as input to cuML and a numpy array as input to sklearn.

To analyze the performance in a real-world scenario, we train models on the Higgs dataset, which has 28 columns and 11M rows. We randomly pick 95% of the total rows (10.5M) for training. We use 1000 rows for testing. The following chart shows speedup in training time of cuML over sklearn, as well as the accuracy achieved by each model during testing. In all cases, higher is better. Even though there are quite a few training parameters that we can adjust, we only consider two in this blog post example:

  • n_trees — the number of trees in the random forest
  • max_depth — the maximum depth of each tree

From these examples, we can see a 20x — 45x speed-up by switching from sklearn to cuML for random forest training. Random forest in cuML is faster, especially when the maximum depth is lower, and the number of trees is smaller. Moreover, in the case when there are 1000 trees and the maximum depth is 16, cuML still has ~20x speed-up compared to sklearn. It is worth mentioning that the speedup we get from cuML comes without sacrificing accuracy. The accuracy difference between cuML and sklearn is minimal for the Higgs dataset.

We also repeated the experiments with make_classification() dataset available from sklearn. For a dataset of 1M samples and 100 features, we see speed up in the range of 25x — 60x. The difference in accuracy between sklearn and cuml is minimal here as well.

Multi-GPU with Dask

To fully understand the best performance we can get from random forest training using GPUs, we extend our test to multi-GPU runs using the Dask based distributed approach mentioned earlier. In the following chart, we showcase the speed-up of multi-GPU runs compared to single-GPU. We use the same dataset (Higgs with 8.8M rows to train and 1000 rows to test) and the same hardware (DGX-1 server with eight V100–16GB GPUs) in this section. For the multi-GPU tests, we chose to use 1000 trees per model and maximum depth equal to 8, 12, or 16.

The graph above shows ~40x — 50x speedup for max_depth 16 when using eight V100 GPUs instead of one. As discussed in Section Building forests across multiple GPUs, both the dataset and trees are distributed across multiple GPUs. This is why we can get better than linear speedup scaling across multiple devices. Noticeably, there are indeed some small differences in terms of accuracy between single-GPU and multi-GPU runs, which is to be expected given the different data distributions. However, these small accuracy differences can almost be neglected with the large performance gain we get by scaling the training to multiple devices.

Known limitations and Future Work

The random forest module is new in cuML and has a few limitations planned for development before the next release (0.11):

  • In the Python layer, Random Forest objects do not yet support “pickling” for persistence. We plan to support both Python pickling and easy treelite-based persistence for the underlying model that will make it easy to deploy with the cuML Forest Inference Library. (See this blog for more details on forest inference.)
  • When building very deep trees for wide datasets, the memory footprint of the Random Forest builder can start to exceed available GPU memory. We’re working on a number of interesting improvements to squeeze deeper forests into less GPU memory.

References

Blog Notebook:

Multi-node multi-GPU RF demo Notebook:

--

--

Vishal Mehta
RAPIDS AI

Vishal Mehta works as Developer Technology with NVIDIA, working with performance optimizations of HPC and Machine Learning Applications on GPUs.