Investigating xgboost Exact scalability

xgboost is a very well-known machine learning library based on Gradient Boosted Trees. The default xgboost uses the exact method, which does not use binning, and is significantly slower than the histogram-based version (fast histogram).

Two questions were asked recently about xgboost:

“Do I see it right that fewer threads than physical cores can provide the fastest runtime?”
“Interesting. So without fast hist, it’s a different story?”
It is a whole new story.

Context

You may have seen my recent blog post (Getting the most of xgboost and LightGBM speed: Compiler, CPU pinning) which compares two compilers (Visual Studio and MinGW) and CPU pinning versus roaming, to find the best software configuration for running xgboost and LightGBM as fast as possible under Windows:

  • We should use Visual Studio to compile xgboost and LightGBM
  • CPU pinning seems useful for xgboost
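As an illustrative sketch only (assuming Linux): the original benchmark pinned CPUs on Windows, outside the process, but on Linux a process can pin itself from Python with the standard library's `os.sched_setaffinity`. The choice of CPU 0 here is a placeholder, not the benchmark's actual mask.

```python
# Sketch (assumes Linux): pin the current process to logical CPU 0.
# The original benchmark pinned CPUs on Windows by other means.
import os

os.sched_setaffinity(0, {0})            # pid 0 means "the calling process"
print(sorted(os.sched_getaffinity(0)))  # -> [0]
```

Pinning keeps threads from roaming across cores (and across NUMA nodes on a dual-socket machine), which is exactly the effect being benchmarked.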

The question of the day

What happens if we now look at xgboost Exact? This is today's topic, and we will go straight to the results, as the benchmark setup is identical to the previous blog post's.

The only differences are the following:

  • Exact xgboost instead of fast histogram
  • Every run is repeated twice
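The two configurations being compared can be sketched as xgboost parameter dicts (the thread count shown is a placeholder, not the benchmark's thread grid):

```python
# The two tree-building configurations compared in this benchmark,
# expressed with xgboost's parameter names (nthread value is illustrative).
exact_params = {
    "objective": "binary:logistic",
    "tree_method": "exact",   # the default at the time: exact greedy splits
    "nthread": 8,             # number of threads, varied in the benchmark
}
hist_params = dict(exact_params, tree_method="hist")  # fast histogram
```

Either dict would then be passed to `xgb.train(params, dtrain, ...)`; everything else in the run stays the same.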

Benchmark results: from xgboost Exact to Fast Histogram

The Bosch dataset is very large: 6,000+ seconds is what a user should expect to spend on training, even with the fastest machine learning libraries available at the time.

Big Data software (Hadoop / Spark, etc.) is not what will save you from long runtimes; what matters here are the algorithm and its performance optimizations. xgboost Exact can be viewed as approximately 10 times faster than R’s gbm and scikit-learn’s Gradient Boosting.

The Bosch dataset is very large for machine learning

Looking directly at the runtimes, we see a large drop when increasing the number of threads:

  • Roaming CPU: 6,255.4s (1 thread) to 263.7s (23.7x faster)
  • Pinned CPU: 6,487.5s (1 thread) to 266.3s (24.4x faster)

xgboost literally takes forever to learn
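The speedup multiples come straight from dividing the single-thread time by the multithreaded time, using the timings above:

```python
# Reproducing the speedup multiples from the timings quoted above
# (times in seconds).
def speedup(t_single: float, t_multi: float) -> float:
    return t_single / t_multi

print(round(speedup(6255.4, 263.7), 1))  # roaming CPU -> 23.7
print(round(speedup(6487.5, 266.3), 1))  # pinned CPU  -> 24.4
```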

xgboost seems to scale very well as more and more threads are added. We will quantify exactly how well it scales with a more appropriate chart.

A quick reminder of the evolution from xgboost Exact to xgboost Fast Histogram can be visualized below:

Approximately 8 threads and 120 seconds?!

With the histogram technique, available first in LightGBM, xgboost Fast Histogram slashes the training time from 260 seconds (using a monster 56 threads) to 120 seconds (using a mere 8 threads): a 117% performance increase, with similar model performance and 7 times less computing power, seems like the best of both worlds!

LightGBM takes the crown of speed

LightGBM, the direct competitor to xgboost, is significantly faster and drops the computation time from 120 seconds to 55 seconds for the same number of threads: a 118% performance increase.
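The percentage gains quoted in these two sections are relative throughput increases, computed from the runtimes:

```python
# How the "% performance increase" figures above are computed:
# the relative throughput gain from cutting the runtime (in seconds).
def perf_increase_pct(t_old: float, t_new: float) -> float:
    return 100.0 * (t_old / t_new - 1.0)

print(round(perf_increase_pct(260.0, 120.0)))  # hist vs exact    -> 117
print(round(perf_increase_pct(120.0, 55.0)))   # LightGBM vs hist -> 118
```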


xgboost Exact efficiency curve

So far we have talked about the history of xgboost up to the arrival of LightGBM. Let’s look at the computing efficiency of xgboost Exact when scaling to more threads:

Ideally, we should have more than 28x efficiency using 56 threads

We can notice the following:

  • Not only does xgboost Exact scale very well (over 2,800% efficiency at 56 threads versus a single thread)
  • But xgboost Exact still benefits a lot from hyperthreading (from threads 15 to 28, and 43 to 56, the efficiency keeps increasing)
  • And xgboost still manages to scale properly when NUMA issues arise (we are using a dual Xeon, so managing memory improperly would cause slowdowns)
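The “efficiency” plotted here is simply the speedup over a single thread, expressed as a percentage. A minimal sketch (the timings below are hypothetical round numbers, not the benchmark's):

```python
# Efficiency as plotted on the chart: speedup over the single-thread
# run, expressed as a percentage. Timings here are made up for illustration.
def efficiency_pct(t_single: float, t_n: float) -> float:
    return 100.0 * t_single / t_n

# A hypothetical 28x speedup reads as 2,800% on the chart:
print(efficiency_pct(5600.0, 200.0))  # -> 2800.0
```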

We can’t say this is true for every situation, but this chart shows how well xgboost Exact scales on large datasets.

We are counting in hours for a single thread, and in days for R’s gbm and Python’s scikit-learn.

Conclusion

xgboost Exact scales very well: this is a good example of a very well-made program, tailored to scale on servers. Although xgboost is a sequential algorithm (Gradient Boosting is sequential, not parallel by nature), it still runs extremely fast when throwing more and more threads at it.

Recent advancements (over the last 4 years) slashed the training time from 100,000+ seconds to 50 seconds (a 2,000x, “two thousand”, performance improvement) thanks to the following:

  • Parallelization / multithreading of the sequential task of Gradient Boosting
  • Code/Cache optimization of xgboost
  • Histogram/sketching idea of LightGBM

All of this while improving, then maintaining, the high performance of the original machine learning algorithms, and while providing a proper and stable path to industrializing them (a task at which H2O still excels).

The next part, if possible, will test LightGBM on dense data, as outlined on GitHub.