xgboost GPU performance on low-end GPU vs high-end CPU
xgboost on CPU with the fast histogram method is extremely fast compared to old school methods such as the exact method.
How well does xgboost with very high-end CPU fare against a low-end GPU? Let’s find out, in a very unfair comparison.
GPU support has been available in xgboost since last year to provide higher performance. Let’s see how much better it is here!
Note: to run the benchmark on GPU, you will need an NVIDIA GPU. There are no workarounds to that. NCCL is also mandatory for multi-GPU training.
CPUs vs CPUs: previous results
Here is a summary table of contents in case you are lost.
Content:
- Potential criticism of using GPU xgboost
- Hardware & Software
- What do we benchmark?
- Benchmark results run by hand
- More specific benchmark results
- More about RAM usage
Potential criticism of using GPU xgboost
You may have many criticisms about why (not) to use GPU xgboost…
- GPU is not providing reproducible results: this is actually the truth in most cases. xgboost GPU does not provide reproducible results.
- GPUs have less RAM than CPUs: sorry, just purchase the expensive Titan / Quadro / Volta cards. Your business may thank you later (or fire you for reckless purchase orders).
- You can’t fit all data in GPU: sorry, but first you need to know you can use multiple GPUs, and second if the software you use is properly coded (and the algorithm behind allows it), the memory is shared across GPUs thanks to NCCL (distributed computing).
- Machine learning on GPU is good only for “deep learning”: sorry, but this is plain wrong. One could say “deep learning is not yet at the level of Johnny Depp”. Small refresher for you who think “deep learning > machine learning”:
- Multiple GPUs do not scale. Just see how poor the performance is when using SLI in games!: you are comparing apples and oranges. This is the same as comparing Geekbench on Android vs iOS (or Windows vs macOS): pure nonsense.
- GPU does not work well when it is too fast: if you cool it sufficiently, it will work well.
- Why should I use xgboost on GPU when deep learning is always the best tool?: a tool is just a way to achieve an objective to meet a real need. In most cases, a neural network does not work well on tabular and business data. It’s similar to using a Bazooka to kill a bee!
- I can just use a cluster and do the work faster than everyone else in the world!: not really if there is no scalability and someone found the gem for a single-threaded scenario.
- GPU is always the fastest tool for everything! No need to test!: sorry, it depends on the use case.
Hardware & Software
To compare xgboost CPU and GPU, we will be using the following unfair hardware worth over $15K:
- CPU: Dual Intel Xeon Gold 6154 (2x 18 cores / 36 threads, 3.7 GHz all turbo)
- RAM: 4x 64GB RAM 2666 MHz (good to go for 80 GBps bandwidth)
- GPU: 4x NVIDIA Quadro P1000 4GB RAM (very similar to NVIDIA GeForce 1050 4GB RAM, 4 of them is similar to a 1080)
- BIOS: NUMA enabled, Sub NUMA Clustering disabled
- Operating System: Pop!_OS 18.10 (like Ubuntu 18.10)
- R: 3.5.1, compiled with -O3 -march=native
- NVIDIA versions: CUDA 10.0, NCCL 2.3.7
- xgboost version: a2dc929
You might wonder: why compare a miserable Quadro P1000 with a super high-end CPU? You will find out later.
And yes, the following hardware you see below is slower than you may think (against our server):
Rationale: xgboost fast histogram does not scale well with threads. This was already seen so many times…
Compile xgboost for GPU in R
To compile xgboost in R with GPU support (and multi-GPU support through NCCL), we can use a one-liner in R, assuming you have my xgbdl package installed:
xgbdl::xgb.dl(compiler = "gcc", commit = "a2dc929", use_avx = FALSE, use_gpu = TRUE, CUDA = list("/usr/lib/cuda", "/usr/bin/gcc-6", "/usr/bin/g++-6"), NCCL = "/usr/lib/x86_64-linux-gnu")
Note: AVX option is deprecated. Get rid of NCCL if you are using a single GPU. We assume you already installed CUDA and NCCL.
CUDA 10 requires gcc (version 6), and NCCL must be pointed to the right folder. xgb.dl takes all those inputs and performs the work on your behalf, so you do not have to do it manually in R.
Installing xgboost for GPU still allows you to use the CPU. You are not restricted to the GPU once you install the GPU version (but the CPU version only allows you to use the CPU).
Monitoring GPUs in Linux
Other than using nvidia-smi, you might be interested in nvtop:
PuTTY users, please use the following to run nvtop:
NCURSES_NO_UTF8_ACS=1 nvtop
Otherwise, you may have funny stuff.
I could not find anything better than that (other than nvidia-smi for more details “at X time”). If you have something interesting for GPU monitoring, feel free to share in the comments (do not recommend something like glances, etc.: using a bazooka to solve a problem is not a solution).
What do we benchmark?
There is a very nice script created by kholitov to benchmark xgboost on GPU. We will adapt it to run on CPU using the following:
- 1 billion elements: 10 million rows, 100 columns
- 90% of data is used for training (9 million rows)
- 10% of data is used for validation (1 million rows)
- 500 training iterations
- 64 bins
- CPU vs GPU modes: hist vs gpu_hist
We will use the following script to benchmark xgboost CPU vs GPU.
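As a minimal sketch of how the settings listed above map to xgboost parameters (this is not the benchmark script itself): the objective, the placeholder max_depth, the thread counts, and the n_gpus parameter name are assumptions for illustration, and dtrain / dvalid in the commented call are hypothetical xgb.DMatrix objects.

```r
# Benchmark dimensions taken from the list above.
n_rows <- 10e6
n_cols <- 100
n_elements <- n_rows * n_cols      # 1 billion elements
n_valid <- n_rows / 10             # 10% for validation
n_train <- n_rows - n_valid        # 90% for training

# Settings shared by both modes. The objective and max_depth are assumptions:
# the benchmark varies the depth, 6 is only a placeholder.
common <- list(objective = "binary:logistic", max_bin = 64, max_depth = 6)

# CPU fast histogram vs GPU histogram. n_gpus is assumed to be the multi-GPU
# switch of the xgboost build used here; check your own version.
cpu_params <- c(common, list(tree_method = "hist", nthread = 36))
gpu_params <- c(common, list(tree_method = "gpu_hist", n_gpus = 1, nthread = 1))

# Hypothetical training call (requires xgboost and existing DMatrix objects):
# model <- xgboost::xgb.train(params = gpu_params, data = dtrain, nrounds = 500,
#                             watchlist = list(valid = dvalid))
str(gpu_params)
```

Keeping the shared settings in one list makes it harder to accidentally benchmark the two modes with different bin counts or depths.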
Benchmark results run by hand
The benchmark results run by hand are a bit different than the real benchmark:
- We are using 12.5 million rows instead of 10 million rows and depth 6 only (fits better the 4GB GPU RAM of a Quadro P1000)
- Other hardware is tested (i7–7700 + NVIDIA GeForce 1080, E5–1650 v3)
Tradeoff for using GPU
There are multiple tradeoffs for using xgboost on GPU:
- GPU models are not reproducible: you will always get different results. If you are testing for lucky runs, then you will learn to at least run for the expected value (mean/average) over several runs. Or do statistical testing to compare means.
- GPU models are not cleared from memory after being run: you need to remove the model from memory then run gc().
- xgboost crashes when using a lower number of threads than the number of GPUs used: use at least nthread equal to the number of GPUs used.
- xgboost crashes when changing the (number of) GPUs used after training a model on an identical xgb.DMatrix: remove the dataset and model from memory, run gc(), and reconstruct the needed xgb.DMatrix…
- xgboost GPU crashes for max_depth > X: use a maximum depth lower than or equal to X, otherwise you crash xgboost GPU. Rule of thumb I found: do not use more than 12 for approximately 100 features. The maximum depth for crashing seems to be linked to the number of features.
- xgboost cross-validation with GPU crashes after training multiple folds: most likely you are running out of GPU RAM, so get what you need from the models, then delete them.
- xgboost ignores my hyperparameters: most likely you are using unavailable hyperparameters for GPU (not every hyperparameter is available with GPU, this is also true for fast histogram actually)
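The advice on non-reproducible runs (compare expected values over several runs, or test the difference of means) could look roughly like this. The scores below are made-up placeholders, not benchmark results, and the choice of a Welch t-test is only one reasonable option.

```r
# Hypothetical validation scores (e.g. AUC) from five repeated GPU runs of two
# candidate settings. These numbers are invented for illustration only.
runs_a <- c(0.8412, 0.8407, 0.8415, 0.8409, 0.8411)
runs_b <- c(0.8420, 0.8416, 0.8423, 0.8418, 0.8421)

# Compare the expected value (mean) of each setting rather than a single run.
mean_a <- mean(runs_a)
mean_b <- mean(runs_b)

# Welch two-sample t-test (the default of t.test): do the means differ?
tt <- t.test(runs_a, runs_b)

cat(sprintf("mean A = %.5f, mean B = %.5f, p-value = %.4g\n",
            mean_a, mean_b, tt$p.value))
```

With only a handful of runs per setting the test has little power, so small differences may need more repetitions before they show up.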
Example of non-reproducible results:
Benchmark results run by hand
For the benchmark results run by hand (12.5 million rows, 100 features), we have guest hardware provided by:
- Server 1: i7–7700 + 64GB RAM (4x 16GB RAM) + NVIDIA GeForce 1080 8GB RAM
- Server 2: E5–1650 v3 + 128GB RAM (4x 32GB RAM)
Main conclusions to get here:
- xgboost CPU with a very high end CPU (2x Xeon Gold 6154, 3.7 GHz all cores) is slower than xgboost GPU with a low-end GPU (1x Quadro P1000)
- xgboost GPU seems to scale linearly
- 4 Quadro P1000 is faster than a single GeForce 1080
Extra conclusions, for those using CPU:
- 2x Xeon Gold 6154 (2x $3,543) gets you a training time of 700 seconds, 25% faster than an i7–7700 (for 2,339% of the price) and 20% faster than an E5–1650 v3 (for 1,215% of the price)
How much should I spend for ML stuff?
Is it worth purchasing 2x Xeon Gold 6154 when you can purchase an i7–7700 (for 4% of the price) to train a single model as fast as possible? It depends: if you value your time, then yes, it is worth it. Otherwise, it is a waste of your money (find other use cases to justify the 2x Gold 6154).
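The price percentages quoted above can be cross-checked with a few lines of arithmetic. The only inputs are the figures from the text; the unit prices of the two cheaper CPUs are derived from those percentages, not quoted list prices.

```r
# 2x Xeon Gold 6154 at $3,543 each, as stated in the text.
xeon_pair_price <- 2 * 3543

# Price of the Xeon pair as a percentage of each alternative (from the text).
ratio_pct <- c("i7-7700" = 2339, "E5-1650 v3" = 1215)

# Unit prices implied by those percentages.
implied_price <- xeon_pair_price / (ratio_pct / 100)

# Conversely: each alternative's price as a percentage of the Xeon pair.
share_pct <- 100 / (ratio_pct / 100)

print(round(implied_price))
print(round(share_pct, 1))
```

The i7–7700 share comes out at roughly 4%, which matches the figure used in the question above.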
The random guy who says R is only working singlethreaded is wrong
If we can call you “The Parallelizer”, then you know that with a 2x Gold 6154, you can train 72 xgboost models fairly quickly at the same time. This is a godly working use case for your server. R works extremely well for parallelization, and it is available by default (and my package LauraeParallel provides load balancing in a functional programming fashion).
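A minimal sketch of that idea using only the parallel package that ships with R (not LauraeParallel). The function fit_one is a hypothetical stand-in for training one single-threaded xgboost model, and the worker count of 2 is chosen so the sketch runs on any machine.

```r
library(parallel)

# Hypothetical stand-in for "train one xgboost model". In real use this would
# call xgboost with nthread = 1 and return the model or its validation score.
fit_one <- function(max_depth) {
  Sys.sleep(0.1)    # pretend to train
  max_depth * 2     # pretend result
}

depths <- 1:8

# One single-threaded task per worker. On the 2x Gold 6154 this could go up
# to 72 workers; 2 is used here so the sketch runs anywhere.
cl <- makeCluster(2)
results <- parLapply(cl, depths, fit_one)
stopCluster(cl)

print(unlist(results))
```

When tasks have very uneven run times, parLapplyLB from the same package hands out work one task at a time instead of in fixed chunks.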
Also, a small note: do not buy a server just to boast. The next image proves it is just a waste of your money if you do not use it for a real task:
Don’t do that unless you want to show something interesting (like purchasing a server and keeping cores unused for instance…)
More specific benchmark results
I suppose you were here for the GPU benchmarks? If we plot the raw data, we may end up with something very wrong at first sight:
We have to understand several things from the plots, which are specific to our scenario (10M rows, 100 features):
- Using no GPU is significantly slower.
- CPUs show negative scalability with a large number of threads, and this is even more visible for larger maximum depths
- More GPU means faster training (seems correct), but it does not seem to scale linearly (because the charting is actually so wrong visually)
- More CPU threads using GPU is not faster, it is actually a flat line (in practice this is not exactly true when using too many threads for negative scalability, but for this experiment, we are keeping a flat line)
To focus on the essentials, we have to invert what we are looking at: instead of raw training times, we have to analyze the speedup of GPU against CPU. Three different charts for GPU speedup are provided:
- Without free axes, and all data:
- With free axes
- With free axes, and restricted from 6 threads on CPU:
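For reference, the speedup plotted in these charts is the CPU training time divided by the GPU training time for the same maximum depth. A minimal sketch, using invented timings that are not the benchmark data:

```r
# Hypothetical timings in seconds, for illustration only (not benchmark data).
timings <- data.frame(
  max_depth   = c(2, 2, 12, 12),
  cpu_threads = c(1, 36, 1, 36),
  cpu_seconds = c(1000, 200, 4000, 900),
  gpu_seconds = c(100, 100, 800, 800)
)

# Speedup > 1 means the GPU run is faster than the CPU run it is compared to.
timings$speedup <- timings$cpu_seconds / timings$gpu_seconds

print(timings)
```

Expressed this way, a flat GPU timing line turns into a speedup curve that falls as the CPU thread count rises, which is what makes the comparison readable.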
Better conclusions can be made using those charts:
- A single GPU provides an excellent speedup against a small number of threads (5x or more)
- Multiple GPUs provide a very huge speedup against a small number of threads (up to 20x)
- GPU speedup decreases as the maximum depth increases
- Against peak CPU performance, GPU performance increase remains flat (but is still an increase in performance)
- Adding more GPUs increases the performance linearly as long as the maximum depth is low, otherwise it increases the performance only marginally (see: depth 2 vs depth 12 scaling)
From this point of view, it is very easy to draw the following conclusion:
For small trees and if reproducibility is not an issue, using a weak GPU is faster than using a monster CPU, as long as the data fits in GPU RAM. Otherwise, using CPU remains the best choice.
Just to reiterate the hypotheses behind our GPU > CPU conclusion:
- Small trees (small maximum depth)
- Not reproducible results
- Data fits in GPU RAM
- Weak GPU > Strong CPU
If you can live with non-reproducible results to do cross-validation and compare feature performance, that’s another story where doing proper statistics can help you.
More about RAM usage
xgboost GPU is pretty smart at using multiple GPUs. By taking our benchmark script, and using a 500K rows x 100 features matrix with 10% as validation (343 MB training set), we get the following script:
Using nvidia-smi, you can exactly pinpoint the GPU RAM usage per process (we have to include the xgboost GPU process which takes 55 additional MB):
A more complete sample test script is below:
GPU RAM when modifying row count
We get the following RAM results when using xgboost GPU with a matrix of size 450,000 x 100:
As we can see, the total GPU RAM used increases dramatically as the maximum depth increases:
- The lowest GPU RAM usage is below depth 5 (between 1 and 4)
- The GPU RAM usage spikes from depth 9
- The GPU RAM usage for depth 12 is very high (3 times higher than the lowest RAM usage for our small data)
Let’s try again, but with a matrix of size 1,000,000 x 100 (763 MB):
What about 5,000,000 x 100, which is closer to the limit of 4GB (the matrix is of size 3,814.7 MB):
And 10,000,000 x 100, a matrix of size 7,629.4 MB?:
Let’s step up the game to crash when using 1 GPU, with 25,000,000 x 100, a matrix of size 19,073.5 MB:
We can go ahead and crash when using 2 GPUs, with a 50,000,000 x 100 matrix (38,147 MB):
Do you think a 75,000,000 x 100 matrix (57,220.5 MB) will work for 4 GPUs? It crashed!
GPU RAM seems to increase by a fixed amount when using a larger depth, which explains why we may have thought our small data GPU RAM explodes with high depth, while it does not for large data.
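The matrix sizes quoted in this section are consistent with a dense matrix of 8-byte doubles, with 1 MB taken as 1024^2 bytes. That storage assumption is inferred from the numbers rather than stated in the text, and it describes the size of the matrix in R, not the GPU RAM that xgboost ends up using. A short check:

```r
# Size in MB (1 MB = 1024^2 bytes) of a dense matrix, assuming 8-byte doubles.
matrix_mb <- function(rows, cols, bytes_per_value = 8) {
  rows * cols * bytes_per_value / 1024^2
}

# Row counts used in this section, all with 100 features.
rows <- c(450e3, 1e6, 5e6, 10e6, 25e6, 50e6, 75e6)
sizes <- round(matrix_mb(rows, 100), 1)

print(data.frame(rows = rows, cols = 100, size_mb = sizes))
```

The same function covers the wide matrices tested next, since only the product of rows and columns matters: 1,000,000 x 500 has as many elements as 5,000,000 x 100.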
GPU RAM when modifying feature count
1,000,000 x 500 (3,814.7 MB): it crashes after depth 10!
1,000,000 x 1,000 (7,629.4 MB): it crashes after depth 9!:
1,000,000 x 2,500 (19,073.5 MB): it crashes after depth 7, and requires at least 2 GPUs!:
1,000,000 x 5,000 (38,147 MB): it crashes after depth 6, and requires at least 4 GPUs!:
For an equal number of elements in the matrix, adding more features seems to cost more xgboost GPU RAM than adding more observations.
Conclusions about GPU RAM usage
If we consider the number of elements in a matrix (number of rows x number of features), then we can conclude the following:
- Multiple GPUs scale pretty well (nearly linearly) for GPU RAM usage
- More depth means higher GPU RAM usage
- The number of features has a higher weight than the number of observations (roughly 5-15% more?): it requires more GPU RAM when adding more features than adding more observations for an equal number of elements in the matrix
- Higher number of features increases the risk of crashing xgboost GPU when using a large maximum depth (a magic number seems to exist)
- There seems to be a formula to predict the GPU RAM required depending on the number of observations, the number of features, and the maximum depth.
Mega Conclusion
This is a very simple conclusion:
xgboost GPU is fast. Very fast. As long as it fits in RAM and you do not care about getting reproducible results (and getting crashes).
To keep getting those epic, stable and reproducible results (or if data is just too big for GPU RAM), keep using the CPU. There’s no real workaround (yet).