New Features and Optimizations for GPUs in XGBoost 1.1

Rory Mitchell
RAPIDS AI
May 21, 2020

XGBoost’s GPU algorithms have become a mainstay for data science pipelines where GPUs are available. XGBoost and its data structures have recently undergone several stages of redesign and refinement. This post provides a brief overview of the features landing in the XGBoost 1.1 release that improve the speed and reliability of training pipelines and reduce their memory consumption, with small code examples for each new feature.

New Features:

  • Integration with RAPIDS cuDF, CuPy, and PyTorch
  • Determinism
  • Releasing GPU memory
  • Improved multi-GPU training with Dask
  • In-place prediction
  • Accelerated ranking

Integration with RAPIDS cuDF, CuPy, and PyTorch

The training speed of XGBoost has greatly improved, in many cases moving the bottleneck of data science pipelines to the ETL (Extract, Transform, Load) phase. Python packages such as pandas can be excruciatingly slow with datasets larger than about 1 GB. In previous versions, XGBoost could accept RAPIDS cuDF dataframes as a data source, but not in a memory-efficient way. XGBoost 1.1 provides a new interface for GPU training, DeviceQuantileDMatrix, which offers a memory-efficient pathway from GPU data structures to XGBoost training. It ingests data directly from device memory sources and converts it to an efficient quantized training format, skipping the intermediate steps. The result is dramatically improved memory utilization in device-memory data science pipelines.

The following script shows the difference in device memory usage between the two data structures when training from a CuPy input.

The memory savings in the example below are up to 5x.
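
The sketch below illustrates such a comparison; the dataset size, parameters, and the memory query via the CUDA runtime are illustrative stand-ins rather than the original benchmark code.

import cupy as cp
import xgboost as xgb

def device_memory_used_mb():
    # Device memory in use according to the CUDA runtime (approximate;
    # CuPy's memory pool may hold on to freed blocks).
    free, total = cp.cuda.runtime.memGetInfo()
    return (total - free) / 1e6

rows, cols = 1_000_000, 50
X = cp.random.random((rows, cols)).astype(cp.float32)
y = cp.random.random(rows).astype(cp.float32)
params = {"tree_method": "gpu_hist"}

# Standard DMatrix: the CuPy data is copied and converted before quantisation.
base = device_memory_used_mb()
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=10)
print("DMatrix: {:.0f}mb".format(device_memory_used_mb() - base))
del dtrain, booster

# DeviceQuantileDMatrix: quantised directly from device memory, skipping copies.
base = device_memory_used_mb()
dtrain = xgb.DeviceQuantileDMatrix(X, label=y)
booster = xgb.train(params, dtrain, num_boost_round=10)
print("DeviceQuantileDMatrix: {:.0f}mb".format(device_memory_used_mb() - base))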

Additionally, support is added for the DLPack open tensor structure. From XGBoost 1.1 onward it is possible to train directly on PyTorch tensors.
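
For example, a PyTorch CUDA tensor can be handed to XGBoost as a DLPack capsule. The snippet below is a hedged sketch: the tensor shapes and parameters are illustrative, and the label is passed as a NumPy array purely to keep the example simple.

import torch
import xgboost as xgb
from torch.utils.dlpack import to_dlpack

X = torch.rand(100_000, 20, device="cuda")
y = torch.rand(100_000, device="cuda")

# The feature matrix is handed over via the DLPack open tensor format; the
# label is moved to host memory here only for brevity.
dtrain = xgb.DMatrix(to_dlpack(X), label=y.cpu().numpy())
booster = xgb.train({"tree_method": "gpu_hist"}, dtrain, num_boost_round=10)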

Determinism

Parallel computation using floating point arithmetic has always been challenging due to the non-associativity of floating point operations. In simple terms, this means (A + B) + C != A + (B + C): changing the order of floating point operations produces very slightly different answers. In older versions of XGBoost, this resulted in some small variance in the output model when training with GPUs. XGBoost 1.1 now uses floating point pre-rounding techniques to achieve associativity and run-to-run determinism.

We run the following script using XGBoost 0.9 and 1.1 to measure the variance in training loss.
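
The original script is not reproduced here; the following is a rough reconstruction under assumed data and parameters (a synthetic classification dataset and 50 boosting rounds), repeating identical GPU training runs and reporting the spread of the final training log loss.

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)
dtrain = xgb.DMatrix(X, label=y)
params = {"tree_method": "gpu_hist", "objective": "binary:logistic",
          "eval_metric": "logloss", "seed": 0}

losses = []
for _ in range(5):
    # Identical training runs: any spread comes from floating point ordering.
    evals_result = {}
    xgb.train(params, dtrain, num_boost_round=50,
              evals=[(dtrain, "train")], evals_result=evals_result,
              verbose_eval=False)
    losses.append(evals_result["train"]["logloss"][-1])

print("Xgboost version:", xgb.__version__)
print("logloss mean:", np.mean(losses))
print("logloss std:", np.std(losses))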

The following results show the mean and standard deviation of the model log loss over several identical training runs. XGBoost 0.9 shows some variance between runs, whereas training is fully deterministic in version 1.1.

Xgboost version: 0.90
logloss mean: 0.09878086640499531
logloss std: 1.3877787807814457e-17
Xgboost version: 1.1.0
logloss mean: 0.1015225832015276
logloss std: 0.0

Releasing GPU memory

A pain point of GPU training in previous versions was that device memory allocated for training would be associated with booster objects in Python, persisting after training had finished until the booster object was deleted. The following script would quickly result in out of memory errors.
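
A sketch of such a loop is given below; the dataset, parameters, and the use of pynvml for memory reporting are assumptions standing in for the original script.

import pynvml
import xgboost as xgb
from sklearn.datasets import make_regression

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def device_memory_used_mb():
    # Total device memory currently in use, as reported by NVML.
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1e6

X, y = make_regression(n_samples=500_000, n_features=100, random_state=0)
params = {"tree_method": "gpu_hist"}

boosters = []
print("Xgboost version:", xgb.__version__)
for i in range(3):
    dtrain = xgb.DMatrix(X, label=y)
    # Keep a reference to every trained booster, as a model-selection loop might.
    boosters.append(xgb.train(params, dtrain, num_boost_round=10))
    print("Iteration {} device memory utilisation: {}mb".format(
        i, device_memory_used_mb()))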

Xgboost version: 0.90
Iteration 0 device memory utilisation: 1780.0mb
Iteration 1 device memory utilisation: 2572.0mb
Iteration 2 device memory utilisation: 3368.0mb

Running the same script with XGBoost 1.1 shows the memory being freed, allowing many models to be trained and persisted in a loop.

Xgboost version: 1.1.0
Iteration 0 device memory utilisation: 1006.0mb
Iteration 1 device memory utilisation: 1006.0mb
Iteration 2 device memory utilisation: 1006.0mb

Improved multi-GPU training with Dask

Multi-GPU training in python environments has become easier than ever thanks to new APIs utilizing Dask as the backend scheduler. See this blog post for an in-depth example.
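
As a brief illustration, the sketch below shows the shape of the Dask API (cluster setup, data sizes, and parameters are illustrative): LocalCUDACluster from dask-cuda provides one worker per local GPU, and xgboost.dask.train distributes training across them.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask.array as da
import xgboost as xgb

if __name__ == "__main__":
    with LocalCUDACluster() as cluster, Client(cluster) as client:
        # Random training data, partitioned into chunks across the workers.
        X = da.random.random((1_000_000, 50), chunks=(100_000, 50))
        y = da.random.random(1_000_000, chunks=100_000)

        dtrain = xgb.dask.DaskDMatrix(client, X, y)
        output = xgb.dask.train(client, {"tree_method": "gpu_hist"},
                                dtrain, num_boost_round=100)
        booster = output["booster"]  # the trained model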

Thread-Safe In-place Prediction

In-place prediction has been added to the Python API in XGBoost 1.1. This enables prediction directly on selected data formats (numpy.ndarray/scipy.sparse.csr_matrix/cupy.ndarray) without the creation of a DMatrix object. In this example, we demonstrate the improved latency of predicting on a number of test datasets using the new API.
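
The script below is a small sketch of such a comparison on a NumPy input (data size and parameters are illustrative): the standard path constructs a DMatrix for every call, while Booster.inplace_predict operates on the array directly.

import time
import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000_000, n_features=50, random_state=0)
booster = xgb.train({"tree_method": "gpu_hist"}, xgb.DMatrix(X, label=y),
                    num_boost_round=100)

# Standard prediction: every call builds a DMatrix from the input.
start = time.time()
booster.predict(xgb.DMatrix(X))
print("Standard prediction took:", time.time() - start)

# In-place prediction: operates directly on the numpy array.
start = time.time()
booster.inplace_predict(X)
print("Inplace prediction took:", time.time() - start)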

Xgboost version: 1.1.0
Standard prediction took: 0.1771543025970459
Inplace prediction took: 0.06605768203735352

GPU-Accelerated Ranking

XGBoost 1.1 comes with GPU acceleration for the ‘rank:pairwise’, ‘rank:ndcg’, and ‘rank:map’ objectives. The following example shows the improvement in training time between versions 0.9 and 1.1.
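
A hedged sketch of a comparable benchmark is shown below; the synthetic dataset, group sizes, and parameters are illustrative rather than the original code. GPU acceleration is selected with tree_method='gpu_hist', and query groups are supplied via DMatrix.set_group.

import time
import numpy as np
import xgboost as xgb

rows, cols, group_size = 1_000_000, 50, 100
X = np.random.random((rows, cols)).astype(np.float32)
y = np.random.randint(0, 5, size=rows)             # graded relevance labels
groups = np.full(rows // group_size, group_size)   # documents per query

dtrain = xgb.DMatrix(X, label=y)
dtrain.set_group(groups)

params = {"objective": "rank:ndcg", "tree_method": "gpu_hist"}

start = time.time()
xgb.train(params, dtrain, num_boost_round=100)
print("Xgboost version:", xgb.__version__)
print("Ranking time: {}s".format(time.time() - start))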

Xgboost version: 0.90
Ranking time: 26.08590006828308s
Xgboost version: 1.1.0
Ranking time: 13.082854270935059s

Conclusion and Next Steps

In version 1.1, XGBoost on GPUs is better than ever, integrating more tightly with the data science ecosystem, using less memory, and improving the reliability and usability of training. Find more information in the documentation. Stay tuned for future releases with a greater focus on improved training times for GPU algorithms.

Thanks to everyone who made significant contributions to the above features.

Rory Mitchell
RAPIDS AI

Senior software engineer at NVIDIA and XGBoost maintainer