Benchmarking new xgboost fast histogram: xgboost and the compiler story

We saw previously that the new xgboost fast histogram method had an issue: it was awfully slow. But we fixed it, by recompiling R with gcc 7.1.

What do you call someone who compiles R from scratch on Windows?

Compiling R was tough, but I now have an executable I can use on all my servers to deploy R with gcc 7.1 without any issue:

  • Even better, all libraries are compiled with gcc 7.1!
  • It makes the new xgboost fast histogram fly!

Therefore, we are going to benchmark two different things from xgboost (see the timing sketch after this list):

  • xgboost old fast histogram with gcc 4.9 (Rtools) and gcc 7.1 (MinGW)
  • xgboost new fast histogram with gcc 7.1 (MinGW)
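To make the benchmark setup concrete, here is a minimal R sketch of what a single timing run can look like. The agaricus demo data, hyperparameters, and number of rounds below are placeholders of my own, not the actual benchmark configuration:

```r
library(xgboost)

# Demo data shipped with the xgboost R package, used here as a stand-in
# for the real benchmark dataset.
data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)

# Time one fast histogram training run with a fixed thread count.
timing <- system.time(
  model <- xgb.train(
    params = list(
      objective   = "binary:logistic",
      tree_method = "hist",   # fast histogram method
      max_depth   = 6,
      eta         = 0.1,
      nthread     = 1         # vary this to build the per-thread comparison
    ),
    data    = dtrain,
    nrounds = 50
  )
)
print(timing["elapsed"])
```

Repeating such a run for each thread count, compiler, and xgboost version gives the per-thread comparisons shown below.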

Comparing xgboost old fast histogram with gcc 4.9 and gcc 7.1

To compare the xgboost old fast histogram with different compilers, we will use:

  • R/xgboost compiled with gcc 4.9
  • R/xgboost compiled with gcc 7.1

And no, do not tell me to compile it with something else. It is already difficult enough to compile R on Windows.


Intel i7–3930K: gcc 4.9 vs gcc 7.1

tl;dr: gcc 7.1 wins.

Normalization per thread comparison:

  • gcc 7.1 is the winner overall.

Cumulated Normalization per thread comparison:

  • gcc 7.1 clearly wins.

Detailed Data Chart:

  • gcc 7.1 is the winner 11 times out of 12.

Intel i7–7700K: gcc 4.9 vs gcc 7.1*

tl;dr: gcc 4.9 wins, but… (* read the conclusion before drawing conclusions: there was a Linux kernel version issue)

Normalization per thread comparison:

  • gcc 4.9 is the winner overall.

Cumulated Normalization per thread comparison:

  • gcc 4.9 clearly wins.

Detailed Data Chart:

  • gcc 4.9 is the winner 100% of times (8 out of 8).

Dual Quanta Freedom Ivy Bridge: gcc 4.9 vs gcc 7.1

tl;dr: gcc 7.1 wins.

Normalization per thread comparison:

  • gcc 7.1 is the winner overall.

Cumulated Normalization per thread comparison:

  • gcc 7.1 clearly wins.

Detailed Data Chart:

  • gcc 7.1 is the winner 18 times out of 20.

Conclusion about gcc and xgboost old fast histogram

  • i7–3930K: gcc 7.1 won
  • i7–7700K: gcc 4.9 won*
  • 20 core server: gcc 7.1 won

In the case of the i7–7700K, I reinstalled the whole virtualization host machine, which also changed the Linux kernel (4.10 for the gcc 4.9 runs, 4.9 for the gcc 7.1 runs). Running the same benchmark with kernel 4.9 and gcc 4.9 leads to gcc 7.1 winning 100% of the time.

So the real conclusion would be: gcc 7.1 “won all the time” against gcc 4.9 (if not losing a little bit somewhere).


Comparing xgboost fast histogram: old vs new

Now we are interested in comparing the old and new versions of the xgboost fast histogram. Will the new version reign supreme? This is what we will check.

You can install the xgboost versions used with the commands below (see the sketch after this list):

  • old xgboost fast histogram: devtools::install_github("Laurae2/ez_xgb/R-package@2017-02-15-v1")
  • new xgboost fast histogram: devtools::install_github("Laurae2/ez_xgb/R-package@2017-05-02-v2")
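A minimal sketch of the install-and-check workflow, assuming devtools is available; restarting R between installs is my own precaution to avoid benchmarking a stale loaded package, not a requirement from the commands above:

```r
# install.packages("devtools")  # if devtools is not installed yet

# Old fast histogram build:
devtools::install_github("Laurae2/ez_xgb/R-package@2017-02-15-v1")

library(xgboost)
packageVersion("xgboost")  # confirm which build is loaded before benchmarking

# Restart R, then install the new fast histogram build and rerun the benchmark:
# devtools::install_github("Laurae2/ez_xgb/R-package@2017-05-02-v2")
```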

I think I will not even have to comment; the results are obvious.


Intel i7–3930K: old vs new xgboost fast histogram

tl;dr: new fast histogram wins.

Normalization per thread comparison:

Cumulated Normalization per thread comparison:

Detailed Data Chart:


Intel i7–7700K: old vs new xgboost fast histogram

tl;dr: new fast histogram wins.

Normalization per thread comparison:

Cumulated Normalization per thread comparison:

Detailed Data Chart:


Dual Quanta Freedom Ivy Bridge: old vs new xgboost fast histogram

tl;dr: new fast histogram wins.

Normalization per thread comparison:

Cumulated Normalization per thread comparison:

Detailed Data Chart:


Old vs New Fast Histogram: all servers together

Need to compare the performance visually with big charts? Here you are served:

  • i7–7700K is just the “KING” (or the QUEEN if you want it that way)
  • The new xgboost fast histogram is just smoking everything

Clearly, going over 1 thread already provides a poor ROI (return on investment, applied here to CPU threads). For instance, on the i7–7700K, you are better off doing a 4-fold cross-validation as a parallelized cross-validation:

  • Parallelized cross-validation: less than 5 minutes to run 4 parallel trainings using 1 thread each.
  • Sequential cross-validation: about 12 minutes to run the trainings one by one using 3 threads each (assuming you found the sweet spot).

Did you ever want a cross-validation speedup? Assuming you have enough RAM, here you have it (a minimal sketch follows).
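Here is a minimal R sketch of that parallelized cross-validation, assuming 4 folds and enough RAM to hold a copy of the data in each worker; the agaricus demo data, the number of rounds, and the train_fold helper are placeholders of my own, not code from the benchmark:

```r
library(xgboost)
library(parallel)

# Demo data as a stand-in; the real benchmark dataset is not reproduced here.
data(agaricus.train, package = "xgboost")
X <- agaricus.train$data
y <- agaricus.train$label

# Assign each row to one of 4 folds.
set.seed(1)
folds <- sample(rep(1:4, length.out = nrow(X)))

# Train on everything except fold k, using a single thread.
train_fold <- function(k, x_mat, y_vec, fold_id) {
  library(xgboost)
  dtrain <- xgb.DMatrix(x_mat[fold_id != k, ], label = y_vec[fold_id != k])
  xgb.train(
    params = list(objective   = "binary:logistic",
                  tree_method = "hist",  # fast histogram
                  nthread     = 1),      # 1 thread per fold
    data    = dtrain,
    nrounds = 50
  )
}

# Parallelized cross-validation: the 4 folds train at the same time,
# 1 thread each. makeCluster() also works on Windows (no forking needed).
cl <- makeCluster(4)
models <- parLapply(cl, 1:4, train_fold, x_mat = X, y_vec = y, fold_id = folds)
stopCluster(cl)

# Sequential alternative for comparison (one fold at a time):
# models <- lapply(1:4, train_fold, x_mat = X, y_vec = y, fold_id = folds)
```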

Conclusion

VERY simple key takeaways:

  • New xgboost fast histogram is crushing everything.
  • With 1 thread, the new xgboost fast histogram is 75% faster than the old xgboost fast histogram.
  • gcc 7.1 is approximately 3% faster than gcc 4.9 for xgboost fast histogram.

Still using the old xgboost fast histogram? Switch to the new one!

But are you satisfied enough?

We have ONE blog post which will follow in this series:

  • Benchmarking Baremetal Linux vs Virtualized Windows: how slow are we? AMD Ryzen showing up!

Previous post in this series:
