LightGBM on Windows: Visual Studio vs MinGW (gcc), R with Visual Studio

Laurae
Data Science & Design
4 min read · Jun 10, 2017

Thinking of using LightGBM on Windows? You face a hard choice between two toolchains: Visual Studio or MinGW (gcc).

Visual Studio 2017 alone is a whopping 2GB, excluding external dependencies.

Everyone knows Visual Studio is a pain to install. Even the Microsoft Build toolset does not spare you a large download before you can compile anything.

Even so, at about 2GB (you may want the GUI to test R/Python integration, after all), the Visual Studio 2017 installation is significantly lighter than the 8GB previously needed for Visual Studio 2015!

Meanwhile, MinGW (x86_64-posix-seh, i.e. 64-bit + POSIX threads + SEH exception handling) is a simple 50MB file to download and extract, which makes life much easier!

MinGW x86_64-posix-seh is big? Think again.

But do you lose anything by using MinGW and going the “easy way”? This is what we are going to check (quickly)…

What sparked the need to compare Visual Studio and MinGW?

I think the picture speaks for itself; there is no need to explain.

MinGW/gcc (left) vs Visual Studio (right): CPU usage differs under the same settings, with only one difference: the compiler.

This comparison makes it obvious that we have a major issue with MinGW/gcc: the CPUs are not busy enough on large datasets, while Visual Studio keeps all cores busy!

Some benchmark comparisons of Visual Studio and MinGW

You can find all detailed benchmarks on the following links:

Laptop benchmark (2 physical cores)

My main laptop has an i7-4600U CPU with 16GB RAM. We can quickly check its performance on the Bosch dataset (1M observations and 1K features), which fits nicely in RAM.

We are testing LightGBM under the following scenarios:

  • Visual Studio 2017 on CLI (master)
  • MinGW 7.1 on R (master and v2.0)
  • MinGW 7.1 on CLI (master and v2.0)

Unexpectedly, Visual Studio is slower than MinGW. With a small number of threads, MinGW seems faster than Visual Studio, even with the R callback and processing overhead.

When comparing the two CLIs (Visual Studio and MinGW), the difference is a sizeable 5%.

R overhead is approximately 3% of the computation time.
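
The CLI scenarios above can be reproduced with a small timing harness. What follows is only a minimal sketch, not the script used for these benchmarks: it assumes each build produces a `lightgbm` executable that accepts `key=value` arguments on the command line (such as `config=` and `num_threads=`), and the function names and the `runner` argument are made up for illustration.

```python
import subprocess
import time


def build_command(executable, config_path, num_threads):
    """Build a LightGBM CLI call; key=value arguments override the config file."""
    return [executable, f"config={config_path}", f"num_threads={num_threads}"]


def time_run(command):
    """Run one command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def benchmark(executables, config_path, thread_counts, runner=time_run):
    """Time every (build label, thread count) pair.

    `executables` maps a label (e.g. "visual-studio", "mingw") to the path of
    that build's executable. Returns {(label, threads): seconds}.
    """
    results = {}
    for label, executable in executables.items():
        for threads in thread_counts:
            command = build_command(executable, config_path, threads)
            results[(label, threads)] = runner(command)
    return results
```

To use it, pass a dictionary mapping each label to the path of your own Visual Studio and MinGW builds, the path to your training config file, and the list of thread counts to try. Repeating each run several times and keeping the median would give more stable numbers than a single measurement.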

Server benchmark (20 physical cores)

My main server is a dual Xeon Ivy Bridge (Quanta Freedom), with 80GB RAM allocated to a virtual machine. Performance is again measured on the Bosch dataset.

We quickly notice that the more threads we throw at it, the better the performance. The gap is large, reaching:

  • 15% worse when not using hyperthreaded cores
  • Up to 40% worse when using MinGW without hyperthreaded cores instead of Visual Studio with hyperthreaded cores

It is obvious who the winner is here: Visual Studio.
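
To make the percentages above concrete, here is a minimal sketch of the arithmetic. It assumes that "X% worse" means X% more wall-clock time than the fastest configuration, which may not be exactly how the detailed benchmarks define it. The timings in the usage lines are placeholders, not measured values.

```python
def percent_slower(seconds, baseline_seconds):
    """Return how much longer `seconds` is than `baseline_seconds`, in percent."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return (seconds - baseline_seconds) * 100.0 / baseline_seconds


# Placeholder timings in seconds, NOT measurements from the benchmarks above.
baseline = 200.0   # hypothetical fastest configuration
candidate = 230.0  # hypothetical slower configuration
print(f"{percent_slower(candidate, baseline):.1f}% slower than the baseline")
```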

Versus xgboost?

Just for a quick look, using my laptop with 2 physical cores (4 threads):

xgboost (fast histogram) has nearly closed the performance gap with LightGBM: they are only 5% apart in this case.

Conclusion

A quick conclusion could be the following:

Windows users should use MinGW for LightGBM on low-end machines, such as laptops with only 2 cores. With more cores (such as 4 physical cores), it is recommended to use Visual Studio to reach maximum performance.

This is the reason the pull request “Compile R package by custom tool chain” exists: if you have a high-performance tool, make sure you use that performance to its fullest! In our case, that means: compile with Visual Studio, but use in R.

Apparently, that pull request also eases installation, especially for Mac OS users.

If you do not know what you are doing, use Visual Studio.

It is as simple as basic addition: set up your PATH environment variable correctly!
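
A quick way to check whether PATH is set up correctly is to ask which build tools can actually be resolved from it. The sketch below is only an illustration: the tool names in the usage lines (`cmake`, `cl` for the Visual Studio compiler, `g++` and `mingw32-make` for MinGW) are assumptions about what your build needs, and the helper names are made up.

```python
import shutil


def find_tools(names):
    """Map each tool name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


def missing_tools(found):
    """Return the sorted names of tools that could not be resolved from PATH."""
    return sorted(name for name, path in found.items() if path is None)


# Assumed tool names; adjust the list to the toolchain you are using.
found = find_tools(["cmake", "cl", "g++", "mingw32-make"])
for name, path in found.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Note that `cl` is normally only visible from a Visual Studio developer command prompt, so a "not found" there may simply mean the check was run from a regular shell.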
