RAPIDS Forest Inference Library: Prediction at 100 million rows per second

RAPIDS AI · Aug 29, 2019

By Andrey Adinets and John Zedlewski

Introduction

Random forests (RF) and gradient-boosted decision trees (GBDTs) have become workhorse models of applied machine learning. XGBoost and LightGBM, popular packages implementing GBDT models, consistently rank among the most commonly used tools by data scientists on the Kaggle platform. We see similar interest in forest-based models in industry, where they are applied to problems ranging from inventory forecasting, to ad ranking, to medical diagnostics.

Using cuML’s random forest module, XGBoost, or LightGBM, data scientists have access to several high-performance, GPU-accelerated options for training forest models. But many users have found that carrying out inference (prediction) is now becoming a bottleneck. This can be an issue both in low-latency cases, where you need to apply a forest model to each instance arriving in a streaming context, and in high-throughput cases, where you run a forest model over millions or billions of samples from a database.

The new, open source Forest Inference Library (FIL) in RAPIDS cuML 0.9 can eliminate these bottlenecks by allowing users to accelerate GBDT and RF inference with GPUs. Users can train models as usual in XGBoost or LightGBM, save them to disk, then use FIL to reload those models and apply them to new data. Using FIL, a single V100 GPU can deliver up to 35x more inference throughput than a CPU-only node with 40 cores.

In this blog, we’ll dive into the details of what makes FIL fast, what features are improving in future releases, and how you can incorporate FIL into your pipelines.

What is a Decision Tree?

A simple decision tree

Let’s start with a refresher on how a decision forest model, or forest for short, works. While RF and GBDT algorithms differ in the way they train the models, they both produce a decision forest as their output. A decision forest is a collection of decision trees. Each decision tree is a binary tree, in which every node has either 0 children (a leaf node) or 2 children (an inner node).

Each inner node references one of the features, or columns, of the data, and contains a threshold and a comparison operator. If, for a particular data sample, the comparison of its feature value against the threshold is true, the path to one child node is taken; if it is false, the path to the other child node is taken. One of the paths is also marked as the default; it is taken if the feature value for the data sample is not defined.

Each leaf node of a decision tree contains an output value. For a particular data sample, a path from the root always reaches a specific leaf. The output value of that leaf is the value produced by the decision tree for that sample. For ML applications, it is useful to view a tree as a piecewise-constant function of the data sample.
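To make the traversal concrete, here is a minimal sketch in plain Python, using a hypothetical flat node layout (not FIL’s internal representation):

import math

# Hypothetical node layout: inner nodes hold a feature index, a threshold,
# child indices, and a default child for missing values; leaves hold a value.
tree = [
    {"feature": 0, "threshold": 0.5, "left": 1, "right": 2, "default": 1},  # root
    {"value": -1.2},                                                        # leaf
    {"feature": 1, "threshold": 3.0, "left": 3, "right": 4, "default": 4},
    {"value": 0.4},
    {"value": 2.7},
]

def predict_tree(tree, sample):
    node = tree[0]
    while "value" not in node:            # descend until a leaf is reached
        x = sample[node["feature"]]
        if x is None or math.isnan(x):    # missing value: take the default path
            node = tree[node["default"]]
        elif x < node["threshold"]:       # comparison true: go left
            node = tree[node["left"]]
        else:                             # comparison false: go right
            node = tree[node["right"]]
    return node["value"]

print(predict_tree(tree, [0.8, float("nan")]))  # 2.7, via the default path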

What is a Decision Forest?

A decision forest

A decision forest is just a collection of decision trees together with an operation that combines the outputs of the individual trees into the output of the forest. The exact operation can differ, but the most widely used choices are summation, averaging, and voting. Summation is typically used with GBDT-trained forests, whereas averaging and voting are used with RF-trained forests. Averaging is typically used for regression problems, whereas voting is used for classification, where the output of each tree is the class of the data sample.

For regression, the raw forest output is typically used, i.e. the sum of tree outputs for GBDT and the average for RF. For binary classification, additional transformations may be required. For example, with GBDT, the forest output is often fed into the sigmoid function σ(x) = 1 / (1 + exp(-x)) to obtain the probability of the data sample belonging to class 1. The class is then obtained by comparing that probability with a predetermined threshold. For RF, voting produces the class of the data sample directly.
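As a concrete sketch with made-up numbers, here is how raw tree outputs might be combined in each case:

import math

tree_outputs = [0.3, -0.1, 0.5]              # per-tree outputs for one sample

# GBDT binary classification: sum, then sigmoid, then threshold
margin = sum(tree_outputs)                   # 0.7
prob = 1.0 / (1.0 + math.exp(-margin))       # ~0.67, probability of class 1
label = int(prob > 0.5)                      # 1

# RF regression: average the tree outputs
rf_prediction = sum(tree_outputs) / len(tree_outputs)

# RF classification: each tree votes for a class, and the majority wins
votes = [1, 0, 1]
rf_label = max(set(votes), key=votes.count)  # 1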

Using FIL

FIL is built on a C++ and CUDA core library, but most users will access it through the Python interface. You can find the full documentation on the cuML docs site, but the basic steps are pretty straightforward:

(1) Install cuML 0.9 or later (which contains FIL)

You can install cuML with the conda package manager, by building from source, or by using the pre-built RAPIDS Docker containers. Easy installation instructions are always available on the rapids.ai getting started page.

(2) Build and save your model with XGBoost or LightGBM

With either the XGBoost or LightGBM Python APIs, just call the save_model(<filename>) method on your model after training.
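For example, with XGBoost (a minimal sketch; the hyperparameters are illustrative):

import xgboost as xgb
import sklearn.datasets

# Train a small GBDT model on synthetic data
X, y = sklearn.datasets.make_classification(n_samples=1000)
dtrain = xgb.DMatrix(X, label=y)
params = {'objective': 'binary:logistic', 'max_depth': 10}
bst = xgb.train(params, dtrain, num_boost_round=100)
# Save the model to disk so FIL can load it later
bst.save_model('xgb.model')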

(3) Load your model into FIL

The ForestInference.load() method takes in a model filename and returns an FIL instance. This uses the open source treelite library to parse the package-specific saved model format and convert it to FIL’s internal tree representation.

(4) Use the FIL instance to infer on your data

FIL can take either data stored in CPU memory (e.g. numpy arrays) or GPU-based arrays. As with the rest of cuML, these GPU arrays can come from any library that supports the __cuda_array_interface__ API, including cuDF, Numba, CuPy, and others. You can eliminate data conversion work by passing in a row-major, FP32 GPU array, as shown below.
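For example, with CuPy, you could prepare the fast-path input like this (a sketch; the data here is synthetic):

import cupy as cp
import numpy as np

X = np.random.rand(1000, 16)  # example CPU data (float64, row-major)
# Convert to a row-major ('C' order), FP32 GPU array for FIL's fast path
X_gpu = cp.ascontiguousarray(cp.asarray(X, dtype=cp.float32))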

Bringing it all together

The whole process only takes a few lines of code, as you can see in the example below:

from cuml import ForestInference
import sklearn.datasets
# Load the classifier previously saved with xgboost's save_model()
model_path = 'xgb.model'
fm = ForestInference.load(model_path, output_class=True)
# Generate random sample data
X_test, y_test = sklearn.datasets.make_classification()
# Generate predictions (as a GPU array)
fil_preds_gpu = fm.predict(X_test.astype('float32'))

For more detailed examples, check out the FIL example notebook and the FIL tests in cuML.

FIL Design

FIL is the part of the cuML library dedicated specifically to forest inference. As such, it provides only inference, not training, so it needs to import forest models trained with other frameworks.

There is currently no standard file format to store forest-based models; every framework tends to define its own format. In order to support multiple input file formats, FIL imports data using the open source treelite library. This enables FIL to support models trained in popular frameworks, such as XGBoost and LightGBM (without categorical data).

The current version of FIL uses only dense trees internally. A dense tree of depth d is a tree in which the distance from the root to any of the leaves is equal to d. Such a tree contains exactly 2 ⋅ 2ᵈ − 1 nodes. In this way, the index of the child nodes can be computed from the index of the current node, and so need not be stored in the node itself. Thus, a forest of sparse trees can be imported into FIL by converting it into the dense layout. The dense tree structure provides a clean design and high performance, but it increases memory consumption for deep trees.
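In this implicit layout, child indices follow directly from the parent index, as in a standard heap-style array (a sketch of the indexing scheme, not FIL’s exact internal code):

def left_child(i):
    return 2 * i + 1

def right_child(i):
    return 2 * i + 2

# A dense tree of depth d stores 2^(d+1) - 1 nodes in a flat array,
# so child pointers need not be stored in the nodes themselves.
def num_nodes(depth):
    return 2 ** (depth + 1) - 1

assert num_nodes(2) == 7  # root + 2 inner nodes + 4 leaves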

FIL provides several inference algorithms, as different algorithms can be optimal for different forests. However, all algorithms follow the same basic scheme: at its core is computing the sum of the predictions of all trees, which needs to be done for each input data sample.

This stage of the basic scheme is shown in the figure below. One data sample is assigned to each thread block, and all threads in a block cooperate in computing the prediction for that sample. First, the threads load the data sample. The trees are then distributed cyclically among the threads in the block. Each thread computes the sum of predictions of the trees assigned to it, and then all threads perform a collective reduction to compute the overall sum, which is written to global memory.

Work distribution among threads in a block in FIL
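The sketch below simulates this scheme in plain Python: one “block” of threads, cyclic tree assignment, and a final reduction. It mirrors the structure of the GPU kernel rather than its actual CUDA implementation; here each tree is a callable that maps a sample to a tree output.

def block_predict(trees, sample, num_threads=4):
    """Simulate one thread block computing the forest sum for one sample."""
    partial_sums = [0.0] * num_threads
    # Trees are distributed cyclically: thread t handles trees t, t+T, t+2T, ...
    for t in range(num_threads):
        for i in range(t, len(trees), num_threads):
            partial_sums[t] += trees[i](sample)
    # Collective reduction across the block (a parallel reduction on the GPU)
    return sum(partial_sums)

# Example with three constant "trees"
print(block_predict([lambda s: 0.3, lambda s: -0.1, lambda s: 0.5], None))  # 0.7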

Using a single data sample per thread block enables FIL to reduce the amount of shared memory required per block, and thus provides support for larger numbers of columns. For instance, for a single row with 256 columns, only 1 KiB of shared memory is required if each element is a 32-bit float. In contrast, with 1 thread per data sample and 128 threads per block, even 96 KiB of shared memory per block is not enough to store all of the samples assigned to the block.
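The arithmetic behind these numbers is straightforward:

# One sample of 256 float32 columns, one sample per block:
print(256 * 4)        # 1024 bytes = 1 KiB of shared memory
# One sample per thread with 128 threads per block:
print(128 * 256 * 4)  # 131072 bytes = 128 KiB, more than the 96 KiB available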

Tree layout used in the NAIVE algorithm.

The NAIVE algorithm in FIL is the simplest implementation of the basic scheme described above. It uses a common tree layout, in which the nodes of each tree are stored adjacent to each other, in breadth-first, left-to-right order. Different trees of the forest are then stored next to each other.

This layout is widely used, but it is not ideal from the GPU perspective. Recall that in FIL, neighboring trees are assigned to neighboring threads. Thus, with the NAIVE layout, the threads access tree data in a non-coalesced fashion, which results in suboptimal memory performance.

Tree layout used in TREE_REORG and BATCH_TREE_REORG algorithms.

The TREE_REORG algorithm addresses this problem by using a different tree layout. For each tree, the nodes are still stored in breadth-first, left-to-right order. However, instead of storing the nodes of the same tree adjacently, the roots of all trees (node 0) are stored first, followed by the left children of all roots (node 1), followed by the right children of all roots (node 2), and so on. This is possible because the dense layout gives every tree the same number of nodes. In this way, if neighboring threads take the same path through neighboring trees, they access the tree data at adjacent addresses, in a coalesced fashion. This mostly helps at small node depths; at larger node depths, the paths of the threads through the trees diverge anyway, and the effect of the TREE_REORG layout diminishes.
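In other words, node j of tree t moves from offset t * nodes_per_tree + j in the NAIVE layout to j * num_trees + t in the reorganized layout (a sketch of the indexing, using the dense node count from above):

def naive_index(t, j, nodes_per_tree):
    # NAIVE: all nodes of tree t are stored contiguously
    return t * nodes_per_tree + j

def reorg_index(t, j, num_trees):
    # TREE_REORG: node j of every tree is stored contiguously, so neighboring
    # threads reading node j of neighboring trees access adjacent addresses
    return j * num_trees + t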

The BATCH_TREE_REORG algorithm uses the same tree layout as TREE_REORG, but in addition to that, assigns more than one data sample to each thread block. This number remains small (<= 4), and the trees are still distributed cyclically among the threads. However, each thread performs inference on all trees assigned to it, and on all samples assigned to the thread block.

After the forest inference proper, additional transformations are applied in a separate kernel; they can include any combination of the following steps:

  1. Averaging, i.e. dividing the sum of predictions by the number of trees; this is required to support RF.
  2. Adding global_bias, which is often used in GBDT models.
  3. Applying the sigmoid function to the output of the previous step.
  4. Thresholding, i.e. comparing the output of the previous step to the specified threshold, and outputting 0 or 1 based on the result of the comparison.

These additional transformations are required to support typical use cases of forest models, and the exact set of steps depends on the use case. For example, GBDT regression often requires no additional steps, while GBDT classification uses steps 2, 3, and 4. RF classification is typically performed using voting, which is implemented in FIL as a combination of averaging and thresholding, i.e. steps 1 and 4.
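A plain-Python sketch of how these steps compose (the parameter names here are illustrative, not FIL’s API):

import math

def postprocess(raw_sum, num_trees, average=False, global_bias=0.0,
                apply_sigmoid=False, threshold=None):
    out = raw_sum / num_trees if average else raw_sum  # step 1: averaging
    out += global_bias                                 # step 2: global bias
    if apply_sigmoid:                                  # step 3: sigmoid
        out = 1.0 / (1.0 + math.exp(-out))
    if threshold is not None:                          # step 4: thresholding
        out = 1 if out > threshold else 0
    return out

# GBDT binary classification: steps 2, 3, and 4
postprocess(0.7, num_trees=1000, apply_sigmoid=True, threshold=0.5)
# RF classification by voting (620 of 1000 trees vote for class 1): steps 1 and 4
postprocess(620.0, num_trees=1000, average=True, threshold=0.5)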

Performance Analysis

Performance in FIL depends primarily on four factors:

  • Number of trees (runtime is roughly linear in number of trees)
  • Storage format of input data (row-major GPU arrays are fastest, column-major arrays will need potentially-expensive conversion)
  • Batch size of input data (larger batches are slower but cost less on a per-row basis, as startup costs are amortized)
  • Tree depth (deeper is slower)

Increasing the number of columns also increases runtime, but this is typically not as significant as the factors above.

To analyze performance in more detail, we trained XGBoost classifiers on several real-world datasets from the GBM-Bench suite.

  • Higgs (28 columns, 11M rows)
  • Airline (13 columns, 115M rows)
  • Bosch (968 columns, 1.184M rows)
  • Epsilon (2000 columns, 500K rows)

We compared FIL (0.9) with XGBoost (version 0.90.rapidsdev1), running on an NVIDIA DGX-1 server with a V100-32GB GPU and dual Xeon E5-2698v4 @ 2.20GHz CPUs. Where applicable, we also compare against treelite 0.9, a CPU-based implementation of forest inference that is particularly fast at small-batch inference. XGBoost and treelite are set to run in parallel over all CPU cores. Before running inference, we converted the input data to each framework’s preferred input format: DMatrix for XGBoost, a row-major CPU array for treelite, and a row-major GPU array for FIL.

There is a huge matrix of possible parameters to compare, but for the purposes of this blog, we wanted to call out just a few slices of data that can help users better understand FIL and its performance implications.

Comparing across datasets

We start by looking at one target FIL application — high throughput inference for a reasonably-large model. In this case, we use a 1000-tree GBDT trained by XGBoost on several different datasets with max tree depth of 10, inferring on 1 million rows. For now, we’ll just include the default FIL algorithm (TREE_REORG).

CPU and GPU performance across datasets

For this problem size, FIL is consistently the fastest, exceeding the throughput of XGBoost on CPU or treelite by over 28x. On the airline dataset, a single V100 GPU can infer 15M rows per second, or well over 100M rows per second on an 8-GPU DGX-1 box.

XGBoost’s GPU-based inference is competitive with FIL in some cases, but it struggles on the very wide datasets (Bosch and epsilon). XGBoost uses a fixed thread block size with each thread processing all trees for a single row. With small column counts, this enables storing more rows in shared memory. However, when the column count becomes large, it doesn’t have enough shared memory, and falls back to reading data from global memory. With the shared memory size of 48 KiB and 128 threads, this happens when the number of columns exceeds 96. This leads to a significant performance drop of XGBoost prediction on wide datasets.
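The cutoff follows from the same shared-memory arithmetic as before:

shared_mem = 48 * 1024   # 48 KiB of shared memory per block
threads = 128            # threads per block, one row per thread
bytes_per_value = 4      # float32
print(shared_mem // (threads * bytes_per_value))  # 96 columns at most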

Comparing FIL Algorithms

As noted earlier, FIL supports three different algorithms: NAIVE, TREE_REORG, and BATCH_TREE_REORG, so it’s worth comparing them in turn.

Small-batch comparison of FIL algorithms
Large-batch comparison of FIL algorithms

In these examples, we see a fairly typical pattern. For small batches (<= 100 rows), inference is dominated by startup time, and all of the algorithms perform roughly the same. For larger batches, NAIVE is significantly worse, while TREE_REORG and BATCH_TREE_REORG both perform better. For the remainder of this blog, we’ll omit NAIVE, as it is typically not the preferred algorithm for large batch sizes.

Impact of tree depth

To measure the impact of tree depth on inference time, we can zoom in and focus only on models trained on the Higgs dataset. We can see that increasing depth does not have much impact beyond roughly a depth of 8 in this example. However, the deeper trees do consume significantly more GPU memory in FIL’s dense format.

Inference time scaling with increasing tree depth
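As a rough sketch of that memory growth (assuming, purely for illustration, 16 bytes per dense node):

def dense_forest_bytes(num_trees, depth, bytes_per_node=16):
    # bytes_per_node is an assumed illustrative figure, not FIL's exact size
    return num_trees * (2 ** (depth + 1) - 1) * bytes_per_node

for d in (8, 10, 16):
    print(d, dense_forest_bytes(1000, d) / 2**20, 'MiB')
# depth 8 -> ~7.8 MiB; depth 10 -> ~31 MiB; depth 16 -> ~2000 MiB (~2 GiB)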

Scaling — number of rows

Forest inference can be useful in both small-batch/online and large-batch/high throughput cases. The graph below shows how inference time per row scales as the batch size increases for various inference platforms, again using an example with 1000 trees and a max depth of 10.

Inference time scaling as batch size increases

Runtime per row for the GPU-based implementations continues to improve as batch sizes grow, while the CPU-based implementations plateau at around 2.8 μs/row. Treelite outperforms the other implementations at a batch size of 100, as it minimizes startup costs.

Tips for maximizing FIL performance

Based on these analyses and extensive internal testing, we’ve found a few guidelines that forest inference users should consider.

  • If possible, provide FIL input data as a row-major (also called ‘C’ order) GPU array. Other formats are supported, but they will require conversion internally.
  • When building a model, consider that increasing tree depth will increase GPU memory consumption.
  • Inferring on larger batch sizes can decrease cost-per-row significantly.
  • If the input data is very sparse (almost all zeros or missing data), XGBoost’s DMatrix format may be more efficient than FIL’s dense format.
  • The TREE_REORG and BATCH_TREE_REORG algorithms have similar, typically-efficient performance. You may get incremental gains by measuring which performs best for your particular model and workload.

Future Work

FIL is currently most efficient for workloads with either large batches, wide column counts, or both, as those workloads allow us to fully utilize the GPU. Future versions of FIL will continue to improve performance on “narrow” (few-column) and small-batch workloads.

While FIL supports importing sparse trees, it uses the dense layout internally, which is not sustainable for very deep trees. In an upcoming version of FIL, we will add optional support for sparse trees, which will reduce memory consumption for deep trees. Next, FIL will add support for multi-class classification. We’re also evaluating native support for categorical variables in models (without a need to one-hot-encode the variable). Finally, for users who need to transfer data from CPU memory, we’re exploring options to overlap this transfer with compute time.

Apart from that, there are some further directions worth looking into. One such direction is supporting sparse data efficiently. While FIL currently treats NaNs as missing data, a dedicated input format, e.g. compressed sparse row (CSR), would be required to handle very sparse data efficiently. Another direction of work could be using reduced precision, e.g. 16-bit floats, for either the input data or the model.

These upcoming features are based on feedback from early testers, but we want your input too! If you have a feature request for FIL or find a problem, please file an issue on cuML at https://github.com/rapidsai/cuml/issues.
