xgboost GPU performance on low-end GPU vs high-end CPU

Laurae
Data Science & Design
13 min read · Dec 30, 2018

xgboost CPU with the fast histogram method is extremely fast compared to old school methods such as the exact (non-histogram) method.

How well does xgboost on a very high-end CPU fare against a low-end GPU? Let’s find out, in a very unfair comparison.

GPU support was added to xgboost last year to provide higher performance. Let’s see how much better it is here!

Note: to run the benchmark on GPU, you will need an NVIDIA GPU. There are no workarounds to that. NCCL is also mandatory for multi-GPU training.

CPUs vs CPUs: previous results

Here is a table of contents in case you get lost.

Content:

  1. Potential criticism of using GPU xgboost
  2. Hardware & Software
  3. What do we benchmark?
  4. Benchmark results run by hand
  5. More specific benchmark results
  6. More about RAM usage

Potential criticism of using GPU xgboost

You may have many criticisms about why (not) to use GPU xgboost…

  • GPU is not providing reproducible results: this is actually the truth in most cases. xgboost GPU does not provide reproducible results.
Logloss of xgboost models for the benchmark run by hand for 12.5 million rows x 100 features. GPU xgboost seems to provide a random best logloss. Only CPU xgboost provides reproducible results.
  • GPUs have less RAM than CPUs: sorry, just purchase the expensive Titan / Quadro / Volta cards. Your business may thank you later (or fire you for reckless purchase orders).
NVIDIA RTX meme: this is not professional
NVIDIA DGX-2 refresh ($490K price tag on a 36 month lease): 32GB RAM per GPU!!!
  • You can’t fit all the data in the GPU: sorry, but first you need to know you can use multiple GPUs, and second, if the software you use is properly coded (and the underlying algorithm allows it), the data can be spread across GPUs thanks to NCCL (distributed computing).
NCCL example: https://devblogs.nvidia.com/fast-multi-gpu-collectives-nccl/
  • Machine learning on GPU is good only for “deep learning”: sorry, but this is plain wrong. One could say “deep learning is not yet at the level of Johnny Depp”. A small refresher for those of you who think “deep learning > machine learning”:
Deep learning is just a very small subset of machine learning. LOL !!!!!!!!!!!!
  • Multiple GPUs do not scale. Just look at how poor SLI performance is in games!: you are comparing apples and oranges. This is the same as comparing Geekbench on Android vs iOS (or Windows vs macOS): pure nonsense.
Geekbench 4 multithreaded test: Same CPU spotted with very varying results (Intel Xeon W-2191B: 47240 on iMac Pro, 35536 on Windows?)
Singlethreading meme
  • A GPU does not work well when it runs too fast: if you cool it sufficiently, it will work well.
GPU toaster: not giving enough cooling to your toaster will make it die.
  • Why should I use xgboost on GPU when deep learning is always the best tool?: a tool is just a means to meet a real need. In most cases, a neural network does not work well on tabular and business data. It’s like using a bazooka to kill a bee!
Are you deeper than deep learning?
  • I can just use a cluster and do the work faster than everyone else in the world!: not really, if your workload does not scale and someone else found a gem of an optimization for the single-threaded scenario.
Optimize the innermost element first, until you can’t anymore; then switch to the outer element, and repeat over and over.
  • GPU is always the fastest tool for everything! No need to test!: sorry, it depends on the use case.
Do you even need a nuclear radiation protection for your smartphone? Better get Helium protection first.

Hardware & Software

To compare xgboost CPU and GPU, we will be using the following unfair hardware worth over $15K:

  • CPU: Dual Intel Xeon Gold 6154 (2x 18 cores / 36 threads, 3.7 GHz all turbo)
  • RAM: 4x 64GB RAM at 2666 MHz (good for roughly 80 GB/s of bandwidth)
  • GPU: 4x NVIDIA Quadro P1000 4GB RAM (very similar to an NVIDIA GeForce 1050 4GB RAM; four of them are roughly similar to a 1080)
  • BIOS: NUMA enabled, Sub NUMA Clustering disabled
  • Operating System: Pop!_OS 18.10 (like Ubuntu 18.10)
  • R: 3.5.1, compiled with -O3 -march=native
  • NVIDIA versions: CUDA 10.0, NCCL 2.3.7
  • xgboost version: a2dc929

You might wonder why we are comparing a miserable Quadro P1000 with a super high-end CPU. You will find out later.

And yes, the hardware you see below is slower than you may think (compared to our server):

Beware: dual Xeon E5–2699v4 (44 cores, 88 threads, 2.8 GHz all turbo) is slower overall!

Rationale: xgboost fast histogram does not scale well with many threads. This has already been observed many times…

Compile xgboost for GPU in R

To compile xgboost in R with GPU support (and multi-GPU support through NCCL), we can use a one-liner in R, assuming you have my xgbdl package installed:

xgbdl::xgb.dl(compiler = "gcc", commit = "a2dc929", use_avx = FALSE, use_gpu = TRUE, CUDA = list("/usr/lib/cuda", "/usr/bin/gcc-6", "/usr/bin/g++-6"), NCCL = "/usr/lib/x86_64-linux-gnu")

Note: the AVX option is deprecated. Drop the NCCL argument if you are using a single GPU. We assume you have already installed CUDA and NCCL.

CUDA 10 requires gcc version 6, and NCCL must be pointed to the right folder. xgb.dl takes all those inputs and performs the work on your behalf, so you do not have to do it manually in R.

Installing the GPU version of xgboost still allows you to use the CPU: you are not restricted to the GPU once the GPU version is installed (whereas the CPU-only version lets you use only the CPU).
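
For instance, here is a minimal sketch of switching between the two on the same build (toy data with arbitrary sizes); the only change is the tree_method parameter:

library(xgboost)

# Toy data, just to illustrate switching tree_method (sizes are arbitrary)
set.seed(1)
x <- matrix(rnorm(10000 * 20), nrow = 10000)
y <- as.numeric(rowSums(x[, 1:3]) > 0)
dtrain <- xgb.DMatrix(data = x, label = y)

# CPU fast histogram, works even on a GPU-enabled build
model_cpu <- xgb.train(params = list(objective = "binary:logistic",
                                     tree_method = "hist",
                                     max_depth = 6, nthread = 4),
                       data = dtrain, nrounds = 50)

# GPU histogram, requires the GPU build and a CUDA-capable GPU
model_gpu <- xgb.train(params = list(objective = "binary:logistic",
                                     tree_method = "gpu_hist",
                                     max_depth = 6, nthread = 4),
                       data = dtrain, nrounds = 50)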

Monitoring GPUs in Linux

Other than using nvidia-smi, you might be interested in nvtop:

nvtop is like htop but for GPUs!

PuTTY users, please use the following to run nvtop:

NCURSES_NO_UTF8_ACS=1 nvtop

Otherwise, you may see garbled characters.

I could not find anything better than nvtop (other than nvidia-smi for more details at a given point in time). If you have something interesting for GPU monitoring, feel free to share it in the comments (but do not recommend something like glances, etc.: using a bazooka to solve a problem is not a solution).

What do we benchmark?

There is a very nice script created by kholitov to benchmark xgboost on GPU. We will adapt it to also run on CPU, using the following setup:

  • 1 billion elements: 10 million rows, 100 columns
  • 90% of data is used for training (9 million rows)
  • 10% of data is used for validation (1 million rows)
  • 500 training iterations
  • 64 bins
  • CPU vs GPU modes: hist vs gpu_hist

We will use the following script to benchmark xgboost CPU vs GPU.
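
A minimal sketch of what such a benchmark script can look like in R, assuming the GPU build from earlier (run_bench is a made-up helper name for illustration; n_gpus and max_bin follow the xgboost parameter names of that era):

library(xgboost)

# Synthetic data: 10M rows x 100 features (~8 GB as a dense double matrix,
# so plenty of CPU RAM is needed), 90% training / 10% validation as described above
n_rows <- 10e6
n_cols <- 100
set.seed(11111)
x <- matrix(runif(n_rows * n_cols), nrow = n_rows)
y <- as.numeric(rowSums(x[, 1:10]) > 5)

train_idx <- seq_len(0.9 * n_rows)
dtrain <- xgb.DMatrix(x[train_idx, ], label = y[train_idx])
dvalid <- xgb.DMatrix(x[-train_idx, ], label = y[-train_idx])

run_bench <- function(tree_method, nthread, n_gpus = 0, max_depth = 6) {
  params <- list(objective = "binary:logistic",
                 tree_method = tree_method,  # "hist" (CPU) or "gpu_hist" (GPU)
                 max_bin = 64,
                 max_depth = max_depth,
                 nthread = nthread)
  if (tree_method == "gpu_hist") params$n_gpus <- n_gpus
  elapsed <- system.time(
    xgb.train(params = params, data = dtrain, nrounds = 500,
              watchlist = list(valid = dvalid), verbose = 0)
  )
  elapsed[["elapsed"]]
}

cpu_time <- run_bench("hist", nthread = 36)                 # all physical cores
gpu_time <- run_bench("gpu_hist", nthread = 4, n_gpus = 4)  # one thread per GPU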

Benchmark results run by hand

The benchmark results run by hand are a bit different from the real benchmark:

  • We are using 12.5 million rows instead of 10 million rows, and a depth of 6 only (this fits better within the 4GB of GPU RAM of a Quadro P1000)
  • Other hardware is tested (i7–7700 + NVIDIA GeForce 1080, E5–1650 v3)

Tradeoffs of using GPU

There are multiple tradeoffs for using xgboost on GPU:

  • GPU models are not reproducible: you will always get different results. If you are hunting for lucky runs, at least learn to report the expected value (mean) over several runs, or do statistical testing to compare means.
  • GPU models are not cleared from memory after being run: you need to remove the model from memory, then run gc() (see the sketch after this list).
  • xgboost crashes when using fewer threads than the number of GPUs used: set nthread to at least the number of GPUs used.
  • xgboost crashes when changing the (number of) GPUs used after training a model on an identical xgb.DMatrix: remove the dataset and model from memory, run gc(), and reconstruct the xgb.DMatrix you need…
  • xgboost GPU crashes for max_depth > X: use a maximum depth lower than or equal to X, otherwise xgboost GPU crashes. A rule of thumb I found: do not go above 12 for approximately 100 features. The maximum depth at which it crashes seems to be linked to the number of features.
  • xgboost cross-validation with GPU crashes after training multiple folds: most likely you are running out of GPU RAM; extract what you need from the models, then delete them.
  • xgboost ignores my hyperparameters: most likely you are using hyperparameters that are unavailable on GPU (not every hyperparameter is available with GPU; this is actually also true for fast histogram).
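
As mentioned in the list, here is a minimal sketch of the memory hygiene this implies in R (toy data; the gpu_hist and n_gpus parameters assume the GPU build):

library(xgboost)

# Small toy data, just to make the pattern runnable
set.seed(1)
x <- matrix(runif(100000 * 100), nrow = 100000)
y <- as.numeric(rowSums(x[, 1:10]) > 5)
dtrain <- xgb.DMatrix(x, label = y)

params <- list(objective = "binary:logistic", tree_method = "gpu_hist",
               max_bin = 64, max_depth = 6, nthread = 4, n_gpus = 1)
model <- xgb.train(params = params, data = dtrain, nrounds = 50)

# 1. Free the GPU RAM held by a finished model before training the next one
rm(model); gc()

# 2. Before changing the number of GPUs, also drop and rebuild the xgb.DMatrix,
#    otherwise the next training call may crash
rm(dtrain); gc()
dtrain <- xgb.DMatrix(x, label = y)
params$n_gpus <- 2
model <- xgb.train(params = params, data = dtrain, nrounds = 50)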

Example of non-reproducible results:

xgboost GPU is not reproducible and you will get different results every time! Only CPU is! Note that CPU results are not reproducible across computers either: you need the exact same compiler to get reproducible results!
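
A hedged way to see this for yourself in R (toy data, identical seed on purpose, assuming a GPU-enabled build):

library(xgboost)

set.seed(1)
x <- matrix(runif(100000 * 100), nrow = 100000)
y <- as.numeric(rowSums(x[, 1:10]) > 5)
dtrain <- xgb.DMatrix(x, label = y)

train_once <- function(tree_method) {
  set.seed(11111)  # identical seed for every run, on purpose
  model <- xgb.train(params = list(objective = "binary:logistic",
                                   eval_metric = "logloss",
                                   tree_method = tree_method,
                                   max_depth = 6, max_bin = 64, nthread = 4),
                     data = dtrain, nrounds = 100,
                     watchlist = list(train = dtrain), verbose = 0)
  tail(model$evaluation_log$train_logloss, 1)  # final training logloss
}

replicate(2, train_once("hist"))      # two CPU runs: identical loglosses
replicate(2, train_once("gpu_hist"))  # two GPU runs: usually slightly different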

Benchmark results run by hand

For the benchmark results run by hand (12.5 million rows, 100 features), we have guest hardware provided by a contributor:

  • Server 1: i7–7700 + 64GB RAM (4x 16GB RAM) + NVIDIA GeForce 1080 8GB RAM
  • Server 2: E5–1650 v3 + 128GB RAM (4x 32GB RAM)
xgboost training time for 12,500,000 rows x 100 features, 500 iterations

Main conclusions to take away here:

  • xgboost CPU with a very high end CPU (2x Xeon Gold 6154, 3.7 GHz all cores) is slower than xgboost GPU with a low-end GPU (1x Quadro P1000)
  • xgboost GPU seems to scale linearly
  • Four Quadro P1000s are faster than a single GeForce 1080

Extra conclusions, for those using CPU:

  • 2x Xeon Gold 6154 (2x $3,543) gets you a training time of 700 seconds: 25% faster than an i7–7700 (for 2,339% of the price) and 20% faster than an E5–1650 v3 (for 1,215% of the price)

How much should I spend for ML stuff?

Is it worth purchasing 2x Xeon Gold 6154 when you can purchase an i7–7700 (for 4% of the price) to train a single model as fast as possible? It depends: if you value your time, then yes, it is worth it. Otherwise, it is a waste of money (find other use cases to justify the 2x Gold 6154).

The random guy who says R only works single-threaded is wrong

If we can call you “The Parallelizer”, then you know that with a 2x Gold 6154 you can train 72 xgboost models fairly quickly at the same time. This is a great working use case for your server. R parallelizes extremely well, and it is available by default (and my package LauraeParallel provides load balancing in a functional programming fashion).
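
A minimal sketch with base R’s parallel package (LauraeParallel itself is not shown; the toy data and the one-depth-per-worker grid are made up for illustration):

library(xgboost)
library(parallel)

# Toy data; each forked worker trains its own model,
# e.g. one hyperparameter combination per thread
set.seed(1)
x <- matrix(runif(100000 * 100), nrow = 100000)
y <- as.numeric(rowSums(x[, 1:10]) > 5)

depths <- 2:13  # 12 models here; scale the grid up to 72 on 72 threads

raw_models <- mclapply(depths, function(d) {
  dtrain <- xgb.DMatrix(x, label = y)  # one DMatrix per worker
  model <- xgb.train(params = list(objective = "binary:logistic",
                                   tree_method = "hist",
                                   max_depth = d,
                                   nthread = 1),  # 1 thread per model, many models at once
                     data = dtrain, nrounds = 100)
  xgb.save.raw(model)  # return a serialized copy, safe to collect from a forked worker
}, mc.cores = length(depths))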

Also, a small note: do not buy a server just to boast. The next image proves it is just a waste of your money if you do not use it for a real task:

htop: Time to boast those “72 high performance threads on the server for super duper fast artificial intelligence machine learning data science crypto (insert more buzzwords)”

Don’t do that unless you want to show something interesting (like purchasing a server and keeping cores unused for instance…)

More specific benchmark results

I suppose you were here for the GPU benchmarks? If we plot the raw data, we may end up with something that looks very wrong at first sight:

Using no GPU seems so slow!!!!!

We have to understand several things from the plots, which are specific to our scenario (10M rows, 100 features):

  • Using no GPU is significantly slower.
  • CPUs have negative scalability with a large number of threads; this is even more visible for larger maximum depths
  • More GPUs means faster training (which seems correct), but it does not seem to scale linearly (because the chart is visually misleading)
  • More CPU threads while using the GPU is not faster; it is essentially a flat line (in practice this is not exactly true when using so many threads that scalability turns negative, but for this experiment we keep seeing a flat line)

To focus on the essentials, we have to invert what we are actually plotting: to get an idea of the speedup of GPU over CPU, we have to analyze the speedup relative to CPU. Three different charts for GPU speedup are provided (a sketch after the list below shows how such speedup and free-axes views can be built):

  • Without free axes, and all data:
  • With free axes
  • With free axes, and restricted from 6 threads on CPU:
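
As mentioned above, here is a sketch of how such speedup charts can be built with ggplot2 (placeholder timings only, not the actual benchmark numbers; “free axes” simply means each depth panel gets its own y scale):

library(ggplot2)

# Placeholder timings only, to show the shape of the data and the plot
results <- expand.grid(threads = c(1, 2, 4, 8, 16, 32),
                       max_depth = c(2, 4, 6, 8, 10, 12),
                       n_gpus = 0:4)
set.seed(1)
results$time <- runif(nrow(results), 100, 1000)

# Speedup of each GPU configuration against the CPU-only run
# at the same thread count and maximum depth
cpu <- subset(results, n_gpus == 0, select = c(threads, max_depth, time))
names(cpu)[3] <- "cpu_time"
speedup <- merge(subset(results, n_gpus > 0), cpu, by = c("threads", "max_depth"))
speedup$speedup <- speedup$cpu_time / speedup$time

ggplot(speedup, aes(x = threads, y = speedup, colour = factor(n_gpus))) +
  geom_line() +
  facet_wrap(~ max_depth, scales = "free_y") +  # the "free axes" version
  labs(colour = "GPUs", y = "Speedup vs CPU (hist)")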

Better conclusions can be made using those charts:

  • A single GPU provides an excellent speedup against a small number of threads (5x or more)
  • Multiple GPUs provide a huge speedup against a small number of threads (up to 20x)
  • GPU speedup decreases as the maximum depth increases
  • Against peak CPU performance, the GPU speedup remains flat (but it is still a performance gain)
  • Adding more GPUs increases performance roughly linearly as long as the maximum depth is low; otherwise it increases performance only marginally (see: depth 2 vs depth 12 scaling)

From this point of view, it is very easy to draw the following conclusion:

For small trees and if reproducibility is not an issue, using a weak GPU is faster than using a monster CPU, as long as the data fits in GPU RAM. Otherwise, using CPU remains the best choice.

Just to reiterate the hypotheses behind our GPU > CPU conclusion:

  • Small trees (small maximum depth)
  • Not reproducible results
  • Data fits in GPU RAM
  • Weak GPU > Strong CPU

If you can live with non-reproducible results to do cross-validation and compare feature performance, that’s another story where doing proper statistics can help you.

More about RAM usage

xgboost GPU is pretty smart at using multiple GPUs. Taking our benchmark script and using a 500K row x 100 feature matrix with 10% as validation (343 MB training set), we get the following setup:
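
A minimal sketch of that setup in R might look like this (single GPU; parameter names as in the earlier benchmark sketch):

library(xgboost)

n_rows <- 500000
n_cols <- 100
set.seed(11111)
x <- matrix(runif(n_rows * n_cols), nrow = n_rows)
y <- as.numeric(rowSums(x[, 1:10]) > 5)

train_idx <- seq_len(0.9 * n_rows)   # 450,000 rows x 100 features ~ 343 MB as doubles
dtrain <- xgb.DMatrix(x[train_idx, ], label = y[train_idx])
dvalid <- xgb.DMatrix(x[-train_idx, ], label = y[-train_idx])

model <- xgb.train(params = list(objective = "binary:logistic",
                                 tree_method = "gpu_hist",
                                 max_bin = 64, max_depth = 6,
                                 nthread = 4, n_gpus = 1),
                   data = dtrain, nrounds = 500,
                   watchlist = list(valid = dvalid), verbose = 0)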

Using nvidia-smi, you can pinpoint the exact GPU RAM usage per process (we have to include the xgboost GPU process, which takes an additional 55 MB):

nvidia-smi showing exactly how much GPU RAM each process takes. A fixed 55 MB (actually variable depending on the GPUs used…) is mandatory in xgboost for GPU with R (it might be identical for Python).

A more complete sample test script is below:
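
A sketch of the idea: loop over depths and GPU counts, and read the total GPU memory in use from nvidia-smi after each training run (gpu_ram_used is a made-up helper name):

library(xgboost)

# Total GPU memory currently in use across all GPUs, in MiB, read from nvidia-smi
gpu_ram_used <- function() {
  out <- system("nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits",
                intern = TRUE)
  sum(as.numeric(out))
}

set.seed(11111)
x <- matrix(runif(450000 * 100), nrow = 450000)
y <- as.numeric(rowSums(x[, 1:10]) > 5)

results <- expand.grid(max_depth = 1:12, n_gpus = 1:4)
results$gpu_mb <- NA_real_

for (i in seq_len(nrow(results))) {
  dtrain <- xgb.DMatrix(x, label = y)   # rebuilt each time (see the tradeoff list above)
  model <- xgb.train(params = list(objective = "binary:logistic",
                                   tree_method = "gpu_hist",
                                   max_bin = 64,
                                   max_depth = results$max_depth[i],
                                   nthread = 4,
                                   n_gpus = results$n_gpus[i]),
                     data = dtrain, nrounds = 50)
  results$gpu_mb[i] <- gpu_ram_used()   # measure while the booster still holds GPU RAM
  rm(model, dtrain); gc()               # then release before the next configuration
}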

GPU RAM when modifying row count

We get the following RAM results when using xgboost GPU with a matrix of size 450,000 x 100:

How much GPU RAM is used per GPU for 450K row x 100 feature matrix?
How much total GPU RAM is used for 450K row x 100 feature matrix?

As we can see, the total GPU RAM used increases dramatically as the maximum depth increases:

  • The lowest GPU RAM usage is below depth 5 (between 1 and 4)
  • The GPU RAM usage spikes from depth 9
  • The GPU RAM usage for depth 12 is very high (3 times higher than the lowest RAM usage for our small data)

Let’s try again, but with a matrix of size 1,000,000 x 100 (763 MB):

How much GPU RAM is used per GPU for 1M row x 100 feature matrix?

What about 5,000,000 x 100, which is closer to the limit of 4GB (the matrix is of size 3,814.7 MB):

How much GPU RAM is used per GPU for 5M row x 100 feature matrix?

And 10,000,000 x 100, a matrix of size 7,629.4 MB?:

How much GPU RAM is used per GPU for 10M row x 100 feature matrix?

Stepping up the game to crash when using 1 GPU: let’s go for 25,000,000 x 100, a matrix of size 19,073.5 MB:

How much GPU RAM is used per GPU for 25M row x 100 feature matrix?

We can go further and crash when using 2 GPUs, with a 50,000,000 x 100 matrix (38,147 MB):

How much GPU RAM is used per GPU for 50M row x 100 feature matrix?

Do you think a 75,000,000 x 100 matrix (57,220.5 MB) will work with 4 GPUs? It crashed!

The depth-related GPU RAM overhead seems to be a roughly fixed amount, independent of the data size, which explains why GPU RAM seemed to explode with high depth on our small data, while it does not for large data.

GPU RAM when modifying feature count

1,000,000 x 500 (3,814.7 MB): it crashes after depth 10!

How much GPU RAM is used per GPU for 1M row x 500 feature matrix?

1,000,000 x 1,000 (7,629.4 MB): it crashes after depth 9!:

How much GPU RAM is used per GPU for 1M row x 1K feature matrix?

1,000,000 x 2,500 (19,073.5 MB): it crashes after depth 7, and requires at least 2 GPUs!:

How much GPU RAM is used per GPU for 1M row x 2.5K feature matrix?

1,000,000 x 5,000 (38,147 MB): it crashes after depth 6, and requires at least 4 GPUs!:

How much GPU RAM is used per GPU for 1M row x 5K feature matrix?

For an equal total number of elements, adding more features seems to cost more xgboost GPU RAM than adding more observations.

Conclusions about GPU RAM usage

If we consider the number of elements in a matrix (number of rows x number of features), then we can conclude the following:

  • Multiple GPUs scale pretty well (nearly linearly) for GPU RAM usage
  • More depth means higher GPU RAM usage
  • The number of features weighs more than the number of observations (roughly 5-15% more?): adding features requires more GPU RAM than adding observations, for an equal number of elements in the matrix
  • A higher number of features increases the risk of crashing xgboost GPU when using a large maximum depth (a magic threshold seems to exist)
  • There seems to be a formula to predict the GPU RAM required depending on the number of observations, the number of features, and the maximum depth.

Mega Conclusion

This is a very simple conclusion:

xgboost GPU is fast. Very fast. As long as the data fits in GPU RAM and you do not care about getting reproducible results (and about the occasional crash).

To keep getting those epic, stable and reproducible results (or if data is just too big for GPU RAM), keep using the CPU. There’s no real workaround (yet).
