This post benchmarks LightGBM against xgboost (exact method) on a customized Bosch data set. During the Bosch competition I saw xgboost run about 10 times slower than LightGBM, and now we are back with actual numbers to compare! Our next benchmark will cover the Fast Histogram method of xgboost. The setup is identical to the one used in my first benchmarks on Bosch + xgboost.
Cross-posted on Imploding Gradients.
Let’s go straight to the chart to relieve any impatience!
As we can see, on average LightGBM (binning) is between 11x and 15x faster than xgboost (without binning).
We also notice that the ratio gets smaller when using many threads. This is expected: if you cannot keep the threads 100% busy, threading inefficiency kicks in (some threads may sit idle because the scheduling of the next task is not fast enough).
Let’s take a look at the first 12 threads.
For xgboost, we still gain performance when going beyond the 6 physical cores: using all 12 logical cores cuts the training time by about 28.3%, from 577.9 seconds to 414.3 seconds.
Is this the same for LightGBM? Yes! We dropped from 45.1 seconds to 33.6 seconds, which is a massive performance gain (25.5%).
Conclusion for this part: use all logical cores for threading, as this helps tremendously. If you want your machine learning training pipeline to finish about 25% faster (this varies by CPU, obviously), you now know what to do: set the thread count to the number of logical cores rather than the number of physical cores.
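The percentages quoted above can be re-derived from the timings in a few lines. This is only a minimal sketch: the function names are made up for illustration, and it assumes the "before" timings (577.9 s and 45.1 s) correspond to 6 threads and the "after" timings to 12 threads, as the text suggests.

```python
def time_reduction(before_s: float, after_s: float) -> float:
    """Relative reduction in wall-clock time, as a fraction of the 'before' time."""
    return (before_s - after_s) / before_s


def speed_ratio(slow_s: float, fast_s: float) -> float:
    """How many times faster the 'fast' run is compared to the 'slow' run."""
    return slow_s / fast_s


# Timings quoted in the post (assumed: 6 threads -> 12 threads).
xgboost_gain = time_reduction(577.9, 414.3)
lightgbm_gain = time_reduction(45.1, 33.6)

print(f"xgboost time reduction:  {xgboost_gain:.1%}")
print(f"LightGBM time reduction: {lightgbm_gain:.1%}")

# LightGBM vs xgboost at the same (assumed) thread counts.
print(f"ratio at 6 threads:  {speed_ratio(577.9, 45.1):.1f}x")
print(f"ratio at 12 threads: {speed_ratio(414.3, 33.6):.1f}x")
```

Both ratios land inside the 11x to 15x range read off the chart.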
What if we look specifically at 13 to 24 threads? The results for up to 12 threads are kept as a reference for comparison.
We can quickly notice:
- No improvement for xgboost, only more or less noisy variance
- Negative improvements for LightGBM, with boosting time increasing from 33.6 seconds to 38+ seconds
Therefore, a quick conclusion is the following: do not allocate more threads than you have logical cores, as it is not good practice. Keep using the number of logical cores as the thread count, and do not go over that number.
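One way to apply this rule is to cap the requested thread count at the number of logical cores before handing it to the library. The sketch below is an illustration in Python rather than the R setup used for the benchmark, and `capped_thread_count` is a hypothetical helper. The parameter names `nthread` and `tree_method` (xgboost) and `num_threads` (LightGBM) are the usual ones, but check them against the version you run.

```python
import os
from typing import Optional


def capped_thread_count(requested: int, logical_cores: Optional[int] = None) -> int:
    """Clamp a requested thread count to the range [1, logical cores]."""
    if logical_cores is None:
        # os.cpu_count() reports logical cores and may return None.
        logical_cores = os.cpu_count() or 1
    return max(1, min(requested, logical_cores))


# Example with the benchmark machine's 12 logical cores (6 physical cores).
threads = capped_thread_count(24, logical_cores=12)

# Parameter dictionaries only; no training is run here.
xgboost_params = {"nthread": threads, "tree_method": "exact"}
lightgbm_params = {"num_threads": threads}

print(xgboost_params)
print(lightgbm_params)
```

Passing `logical_cores` explicitly keeps the example reproducible; leaving it out falls back to whatever the current machine reports.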
Quick look at LightGBM specifically
Let's take a quick look at the LightGBM curve.
This looks like a linear improvement: from 202 seconds (1 full core used, 1 thread), we dropped to 33.6 seconds (6 full cores used, 12 threads), a roughly 6x speedup on 6 physical cores, which is nearly 100% multithreading efficiency. Once we hit the wall with more threads, the multithreading efficiency drops drastically and training gets slower instead of faster.
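The efficiency figure follows from the usual definition: speedup divided by the number of processing units. As a minimal sketch (the function name is illustrative), using the two LightGBM timings quoted above and counting the 6 physical cores as the processing units:

```python
def parallel_efficiency(single_s: float, parallel_s: float, units: int) -> float:
    """Speedup over the single-threaded run, divided by the number of units used."""
    speedup = single_s / parallel_s
    return speedup / units


speedup = 202 / 33.6  # 1 thread vs 12 threads
per_physical_core = parallel_efficiency(202, 33.6, units=6)
per_logical_core = parallel_efficiency(202, 33.6, units=12)

print(f"speedup: {speedup:.2f}x")
print(f"efficiency per physical core: {per_physical_core:.1%}")
print(f"efficiency per logical core:  {per_logical_core:.1%}")
```

Counted per logical core the efficiency is about half that value, which is the expected picture when two hardware threads share one physical core.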
Data RAM efficiency?
A quick look at the RAM usage shows the following, measured after calling gc() twice once the matrices are created:
- Initial data (dense, unused): approx. 8,769 MB (358.2% vs original; put the other way, the original is 27.9% of the dense size)
- Original data (dgCMatrix): approx. 2,448 MB (100% vs original)
- xgboost (xgb.DMatrix): approx. 1,701 MB (69.5% vs original)
- LightGBM (lgb.Dataset): approx. 2,512 MB (102.6% vs original)
It seems LightGBM has a higher memory footprint than xgboost for the stored data.
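The percentages in this list are each object's size relative to the sparse dgCMatrix. A small sketch to recompute them from the rounded MB figures above (the dictionary keys are just labels, not API names):

```python
# Sizes in MB as reported above (rounded values).
sizes_mb = {
    "dense": 8769,
    "dgCMatrix": 2448,
    "xgb.DMatrix": 1701,
    "lgb.Dataset": 2512,
}

baseline = sizes_mb["dgCMatrix"]

# Size of each object relative to the sparse original.
relative = {name: size / baseline for name, size in sizes_mb.items()}

for name, ratio in relative.items():
    print(f"{name:12s} {sizes_mb[name]:>6,d} MB  {ratio:.1%} of dgCMatrix")

# The dense figure can also be read the other way around.
sparse_vs_dense = baseline / sizes_mb["dense"]
print(f"dgCMatrix is {sparse_vs_dense:.1%} of the dense size")
```

Read this way, the dense matrix is roughly 3.6 times the size of the sparse one, and the lgb.Dataset is slightly larger than the dgCMatrix it was built from.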
Training RAM efficiency
We use 12 threads to check the RAM efficiency. Usage is measured at the end of the 50 boosting iterations, with gc() called before boosting but not after:
- xgboost: approx. 1,684 MB
- LightGBM: approx. 1,425 MB (84.6% of xgboost memory usage)
We can notice that LightGBM has a lower RAM usage during training, at the cost of a higher RAM usage for the data held in memory. Potential modifications to the R LightGBM package could give it a more efficient way to store data.
The next benchmark will come once xgboost’s fast histogram method is usable in R. It is currently up and running, but not yet usable in R. This would be the closest “Apples to Apples” comparison between xgboost and LightGBM.
We will also be comparing the logarithmic loss between xgboost and LightGBM.