Getting the most out of xgboost and LightGBM speed: Compiler, CPU pinning

Why should I change my computer setup if it works? To cut a third of the time you spend waiting for results!

Currently, xgboost and LightGBM are the two best performing machine learning algorithms for large datasets (both in speed and metric performance). They scale very well up to billions of observations and/or elements (e.g. the Reputation dataset, 53,181,000,000 elements).

xgboost and LightGBM were made primarily for speed: it is better to iterate quickly at high accuracy and try more things than to wait hours for your neural network to finish.

However, although they can be used on large datasets, the question of scalability has only been partially answered: how well do xgboost and LightGBM scale? Do they prefer high-frequency cores or more cores?

  • xgboost exact likes both many cores and high frequency, with no clear preference between the two
  • xgboost fast histogram needs high frequency
  • LightGBM likes both many cores and high frequency, with a preference for high frequency

As we already know the answer to this question, we are going to look at a more exotic situation: changing the compiler and pinning CPUs.

Do xgboost and LightGBM get faster when swapping the compiler from MinGW to Visual Studio? Is CPU pinning a good thing to do?

This was also partially answered in this GitHub issue. Therefore, we are back with our Windows machine to do some benchmarks.

Interactive documents:

The conclusion also includes a brief look at GPU xgboost.


A quick review of the definition of a compiler and CPU pinning

Defining a compiler and CPU pinning

The place of the compiler for a source code and an executable
Haswell EP Xeon CPU die configuration: there are four RAM banks, and their latency differs depending on which group of cores accesses them!
  • Compiler: the compiler transforms code written in a source language into code in a target language (usually to generate an executable). A compiler is similar to a translator, and we all know translators do not perform equally: some produce gibberish, some produce excellent translations, which in turn makes your reading of the text slower or quicker.
  • CPU pinning: CPU pinning is the binding of a process (or thread) to a specific range of CPU cores. This way, the process cannot roam across cores as freely as it would without CPU pinning. When a process roams across CPUs, it incurs significantly higher RAM and cache latency, and this is even more severe on multi-socket systems.

CPU pinning is also named CPU affinity, although the wording is inexact (“affinity” could mean “preference”, although it is not in this case: it is “this process uses this range and only this range of CPU cores”).
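To make the "this range and only this range of CPU cores" idea concrete, here is a minimal sketch of how a set of logical CPUs can be encoded as an affinity bitmask, one bit per CPU. The function names are hypothetical and the CPU indices are zero-based (the numbering used later in this article is one-based). The Windows `start /affinity` command accepts such a mask in hexadecimal; the article does not say which mechanism was used for the benchmark, so treat this only as an illustration.

```python
def affinity_mask(cpus):
    """Return an integer bitmask with one bit set per zero-based logical CPU."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU indices must be non-negative")
        mask |= 1 << cpu
    return mask


def affinity_hex(cpus):
    """Format the mask as bare hexadecimal digits (no 0x prefix)."""
    return format(affinity_mask(cpus), "X")


# Hypothetical example: allow only the first four logical CPUs (indices 0 to 3).
print(affinity_hex([0, 1, 2, 3]))  # F
```

A process started with that mask can only be scheduled on those four logical CPUs, which is the strict meaning of "affinity" described above.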


Benchmarking the differences

We are going to benchmark the difference between compilers and CPU pinning, for each number of threads available (1 to 56) on our server:

  • Two compilers to test: Visual Studio (Windows’ native) and MinGW (gcc)
  • Two CPU behaviors: CPU roaming (no pinning) and CPU pinning (by socket, then by physical core, then by hyperthreaded core).

The latter means the following: if we have 2 sockets, 4 physical cores on each socket, and hyperthreading enabled, we try to keep all CPUs in one socket, first adding physical (yellow) cores, then adding logical (orange) cores, before moving on to the second socket:

Activation order of CPUs: 1, 3, 5, 7, 2, 4, 6, 8, 9, 11, 13, 15, 10, 12, 14, 16
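The activation order above can be generated programmatically. The sketch below assumes a one-based numbering in which the two logical CPUs of a physical core are adjacent (odd number for the first thread, even number for its hyperthreaded sibling); that layout is inferred from the order listed above, and the function name is hypothetical.

```python
def activation_order(sockets, cores_per_socket, hyperthreading=True):
    """List logical CPUs in pinning order: socket by socket, physical cores
    first, then their hyperthreaded siblings."""
    threads_per_core = 2 if hyperthreading else 1
    order = []
    for socket in range(sockets):
        # Number of logical CPUs that belong to the previous sockets.
        base = socket * cores_per_socket * threads_per_core
        # First thread of every physical core on this socket.
        order.extend(base + threads_per_core * core + 1
                     for core in range(cores_per_socket))
        if hyperthreading:
            # Second (hyperthreaded) thread of every core on this socket.
            order.extend(base + 2 * core + 2
                         for core in range(cores_per_socket))
    return order


print(activation_order(sockets=2, cores_per_socket=4))
```

Pinning a run with N threads then means taking the first N entries of this list. On the benchmark server the same function would be called with 2 sockets and 14 cores per socket, giving 56 entries.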

We are benchmarking xgboost and LightGBM under the following environment:

  • CPU: Dual Intel Xeon E5-2697 v3 (14 cores and 28 threads per CPU, 3.6 GHz singlethread, 3.1 GHz multithread)
  • RAM: 128 GB DDR4, 2133 MHz
  • GPU: none
  • OS: Windows Server 2012 R2 Datacenter, without Meltdown/Spectre patch
  • R version: default 3.4.3
  • Compiler: Visual Studio 2017, MinGW 4.9 (R)
  • xgboost: commit 3f3f54b (Jan 16, 2018, 5:16 PM GMT+1)
  • LightGBM: commit 3dc5716 (Jan 18, 2018, 2:16 AM GMT+1)

The dataset:

The algorithm parameters:

  • Number of boosting iterations: 200
  • Learning rate: 0.05
  • Maximum depth: 8
  • Maximum leaves: 255
  • Max bins: 255
  • Minimum hessian: 1
  • xgboost only: fast histogram, depth-wise
  • LightGBM only: minimum split loss of 1 (due to loss-guided optimization)
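As a rough sketch, the parameters above could be written as plain dictionaries using the parameter names each library documents. The mapping from the list to these names is an assumption (the article does not show its configuration), and the objective and dataset are left out because they are not given here. `with_threads` is a hypothetical helper for sweeping the thread count.

```python
NUM_BOOST_ROUNDS = 200  # number of boosting iterations

# xgboost: fast histogram method, depth-wise growth.
xgboost_params = {
    "eta": 0.05,               # learning rate
    "max_depth": 8,
    "max_leaves": 255,
    "max_bin": 255,
    "min_child_weight": 1,     # minimum hessian
    "tree_method": "hist",
    "grow_policy": "depthwise",
}

# LightGBM: loss-guided growth, hence the minimum split loss.
lightgbm_params = {
    "learning_rate": 0.05,
    "max_depth": 8,
    "num_leaves": 255,
    "max_bin": 255,
    "min_sum_hessian_in_leaf": 1,  # minimum hessian
    "min_gain_to_split": 1,        # minimum split loss
}


def with_threads(params, thread_key, n_threads):
    """Return a copy of params with the thread count set, leaving the base dict intact."""
    updated = dict(params)
    updated[thread_key] = n_threads
    return updated


# The thread parameter is "nthread" in xgboost and "num_threads" in LightGBM.
print(with_threads(xgboost_params, "nthread", 28))
print(with_threads(lightgbm_params, "num_threads", 28))
```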

Each run was repeated at least twice, and up to 10 times. It took approximately 1 week to run the benchmark, thanks to having so many threads!
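A repeated-run measurement of this kind can be sketched as follows. This is not the harness used for the benchmark: `time_runs` and `train_once` are hypothetical names, and summarizing the repeats with the median is an assumption, since the article does not say how repeats were aggregated.

```python
import statistics
import time


def time_runs(train_once, repeats=2):
    """Call train_once() `repeats` times and return the wall-clock durations in seconds."""
    if not 2 <= repeats <= 10:
        raise ValueError("repeats must be between 2 and 10")
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_once()
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in workload; a real benchmark would train a model here.
durations = time_runs(lambda: sum(range(100_000)), repeats=3)
print(f"median of {len(durations)} runs: {statistics.median(durations):.6f} s")
```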


Benchmark Results

Reminder: xgboost and LightGBM do not scale linearly at all.

At best, xgboost is up to 154% faster than with a single thread, while LightGBM is up to 1,116% faster than with a single thread.

If you have a workstation:

  • If you have 56 threads, do not expect 56 threads to be 5,500% more efficient than 1 thread (it will not train 55 times faster).
  • If you have 28 cores, do not expect 28 threads to be 2,700% more efficient than 1 thread (it will not train 27 times faster).
  • If you have a small dataset, do not expect many threads to scale well (scaling can even turn negative).
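The percentages above convert to speedup factors with simple arithmetic: "N% faster" corresponds to a speedup of 1 + N/100. The sketch below uses hypothetical helper names. The parallel-efficiency figures divide the best speedups by 56 threads purely for illustration, since the thread count at which each best speedup was reached is not stated here.

```python
def speedup_from_percent_faster(percent):
    """Convert 'percent faster than 1 thread' into a speedup factor."""
    return 1.0 + percent / 100.0


def parallel_efficiency(speedup, threads):
    """Fraction of ideal linear scaling actually achieved with `threads` threads."""
    return speedup / threads


linear_56 = speedup_from_percent_faster(5500)   # ideal scaling on 56 threads
xgboost_best = speedup_from_percent_faster(154)
lightgbm_best = speedup_from_percent_faster(1116)

print(f"linear scaling on 56 threads: {linear_56:.2f}x")
print(f"xgboost best speedup: {xgboost_best:.2f}x")
print(f"LightGBM best speedup: {lightgbm_best:.2f}x")

# Illustrative only: assumes all 56 threads were in use at the best speedup.
print(f"xgboost efficiency at 56 threads: {parallel_efficiency(xgboost_best, 56):.1%}")
print(f"LightGBM efficiency at 56 threads: {parallel_efficiency(lightgbm_best, 56):.1%}")
```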

The results for the best-case scenario (Visual Studio, roaming CPUs) are shown below:


Compiler Performance

By far, Visual Studio is the compiler to go with on Windows. It is worth installing Visual C++ Build Tools to get the fastest training speed possible.

With roaming CPUs:

xgboost is much faster with Visual Studio than with MinGW/gcc.
LightGBM is a bit faster with Visual Studio than with MinGW/gcc. Keep in mind that, unfortunately, the MinGW slowdown happens at large depth.

With CPU pinning:

xgboost with MinGW shows huge RAM latencies when the CPU pinning is spread over physical cores on 2 sockets at the same time.
LightGBM still prefers Visual Studio over MinGW.

CPU pinning Performance

CPU pinning increases the performance of xgboost with MinGW significantly. Otherwise, we are seeing performance degradation.

Moral of the story:

  • Use CPU pinning if you are using xgboost with MinGW.
  • Another case: if you are training xgboost and LightGBM models in parallel on the same machine, pin the CPUs so that each process keeps the benefit of its CPU cache (e.g. if you are training 4 xgboost models at the same time on a 4 core machine, pin each model process to a separate core).

With Visual Studio:

xgboost with Visual Studio requires CPU pinning for performance increases.
LightGBM seems faster without CPU pinning. Strange?

With MinGW:

With MinGW, xgboost does not need CPU pinning, it seems.
LightGBM does not need CPU pinning either, it seems.

Conclusion

Using Visual Studio without CPU pinning seems the best choice by far.

Recommendations for power users who want the most out of xgboost/LightGBM:

  • Use Visual Studio whenever possible
  • Train models without CPU pinning
  • And attempt to get higher CPU frequencies…

If you are forced to use xgboost on Windows, then force CPU pinning to increase performance.

If you have single models to train, GPU xgboost seems the way to go given how stable it has become. You do not even need a powerful server: even a laptop’s NVIDIA 1050 Ti outperforms our monster server.

NVIDIA 1050 Ti + GPU xgboost is FAST!
For the curious: using an NVIDIA 1050 Ti (1.75 GHz) on a laptop with GPU xgboost, it takes 92 seconds to train a model. That’s 28 seconds faster than the fastest CPU xgboost run (Visual Studio + CPU pinning + 9 physical cores). An overclocked workstation would slash that time to about 60 seconds.

Find below the most brutal comparison in efficiency, when using xgboost and CPU pinning:

Which one do you prefer? A tool with 349% efficiency or a tool with 180% efficiency? The answer is very easy!