Data Science & Design - Medium

xgboost GPU performance on low-end GPU vs high-end CPU

Laurae — Sun, 30 Dec 2018 14:01:00 GMT

xgboost on CPU with fast histogram is extremely fast compared to old-school methods such as the exact method.

How well does xgboost with a very high-end CPU fare against a low-end GPU? Let’s find out, in a very unfair comparison.

GPU xgboost has been available since last year to provide higher performance. Let’s see how much better it is here!

Note: to run the benchmark on GPU, you will need an NVIDIA GPU. There are no workarounds to that. NCCL is also mandatory for multi-GPU training.

CPUs vs CPUs: previous results

Here is a summary table of contents in case you are lost.

Content:

Potential criticism of using GPU xgboost
Hardware & Software
What do we benchmark?
Benchmark results run by hand
More specific benchmark results
More about RAM usage

Potential criticism of using GPU xgboost

You may have many criticisms about (not) using GPU xgboost…

GPU is not providing reproducible results: this is actually the truth in most cases. xgboost GPU does not provide reproducible results.

Logloss of xgboost models for the benchmark run by hand for 12.5 million rows x 100 features. GPU xgboost seems to provide a random best logloss. Only CPU xgboost provides reproducible results.

GPUs have less RAM than CPUs: sorry, just purchase the expensive Titan / Quadro / Volta cards. Your business may thank you later (or fire you for reckless purchase orders).

NVIDIA RTX meme: this is not professional

NVIDIA DGX-2 refresh ($490K price tag on a 36 month lease): 32GB RAM per GPU!!!

You can’t fit all data in GPU: sorry, but first you need to know you can use multiple GPUs, and second if the software you use is properly coded (and the algorithm behind allows it), the memory is shared across GPUs thanks to NCCL (distributed computing).

NCCL example: https://devblogs.nvidia.com/fast-multi-gpu-collectives-nccl/

Machine learning on GPU is good only for “deep learning”: sorry, but this is plain wrong. One could say “deep learning is not yet at the level of Johnny Depp”. Small refresher for you who think “deep learning > machine learning”:

Deep learning is just a very small subset of machine learning. LOL !!!!!!!!!!!!

Multiple GPUs do not scale. Just see how poor the performance is when using SLI in games!: you are comparing apples and oranges. This is the same as comparing Geekbench on Android vs iOS (or Windows vs macOS): pure nonsense.

Geekbench 4 multithreaded test: Same CPU spotted with very varying results (Intel Xeon W-2191B: 47240 on iMac Pro, 35536 on Windows?)

Singlethreading meme

GPU does not work well when it is too fast: if you cool it sufficiently, it will work well.

GPU toaster: not giving enough cooling to your toaster will make it die.

Why should I use xgboost on GPU when deep learning is always the best tool?: a tool is just a way to achieve an objective to meet a real need. In most cases, a neural network does not work well on tabular and business data. It’s similar to using a Bazooka to kill a bee!

Are you deeper than deep learning?

I can just use a cluster and do the work faster than everyone else in the world!: not really if there is no scalability and someone found the gem for a singlethreaded scenario.

Optimize first the inner most element, until you can’t anymore: switch to the outer element, and do it again over and over.

GPU is always the fastest tool for everything! No need to test!: sorry, it depends on the use case.

Do you even need a nuclear radiation protection for your smartphone? Better get Helium protection first.

Hardware & Software

To compare xgboost CPU and GPU, we will be using the following unfair hardware worth over $15K:

CPU: Dual Intel Xeon Gold 6154 (2x 18 cores / 36 threads, 3.7 GHz all turbo)
RAM: 4x 64GB RAM 2666 MHz (good to go for 80 GBps bandwidth)
GPU: 4x NVIDIA Quadro P1000 4GB RAM (very similar to NVIDIA GeForce 1050 4GB RAM, 4 of them is similar to a 1080)
BIOS: NUMA enabled, Sub NUMA Clustering disabled
Operating System: Pop!_OS 18.10 (like Ubuntu 18.10)
R: 3.5.1, compiled with -O3 -march=native
NVIDIA versions: CUDA 10.0, NCCL 2.3.7
xgboost version: a2dc929

You might wonder: why compare a miserable Quadro P1000 with a super high-end CPU? You will find out later.

And yes, the following hardware you see below is slower than you may think (against our server):

Beware: dual Xeon E5–2699v4 (44 cores, 88 threads, 2.8 GHz all turbo) is slower overall!

Rationale: xgboost fast histogram does not scale well with threads. This was already seen so many times…

Compile xgboost for GPU in R

To compile xgboost in R with GPU support (and multi-GPU support through NCCL), we can use a one-liner in R, assuming you have my xgbdl package installed:

xgbdl::xgb.dl(compiler = "gcc", commit = "a2dc929", use_avx = FALSE, use_gpu = TRUE, CUDA = list("/usr/lib/cuda", "/usr/bin/gcc-6", "/usr/bin/g++-6"), NCCL = "/usr/lib/x86_64-linux-gnu")

Note: AVX option is deprecated. Get rid of NCCL if you are using a single GPU. We assume you already installed CUDA and NCCL.

CUDA 10 requires gcc (version 6), and NCCL must be pointed to the right folder. xgb.dl takes all those inputs and performs the work on your behalf, so you do not have to do it manually in R.

Installing xgboost for GPU still allows you to use the CPU: you are not restricted to GPU training once the GPU version is installed (whereas the CPU version only allows you to use the CPU).

Monitoring GPUs in Linux

Other than using nvidia-smi, you might be interested in nvtop:

nvtop is like htop but for GPUs!

PuTTY users, please use the following to run nvtop:

NCURSES_NO_UTF8_ACS=1 nvtop

Otherwise, you may get garbled line-drawing characters in the interface.

I could not find something better than that (other than nvidia-smi for more details “at X time”), if you have something interesting for GPU monitoring, feel free to share in the comments (do not recommend something like glances , etc.: using a bazooka to solve a problem is not a solution).

What do we benchmark?

There is a very nice script created by kholitov to benchmark xgboost on GPU. We will adapt it to run on CPU using the following:

1 billion elements: 10 million rows, 100 columns
90% of data is used for training (9 million rows)
10% of data is used for validation (1 million rows)
500 training iterations
64 bins
CPU vs GPU modes: hist vs gpu_hist
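As a rough sketch of the setup listed above (not the original benchmark script, which is written in R), the sizes and the CPU/GPU switch can be written down as follows. The helper names `split_sizes` and `make_params` are hypothetical; `tree_method` and `max_bin` are the xgboost parameter names for the training mode and the number of bins at the time of writing.

```python
# Sketch of the benchmark configuration described above.
# split_sizes and make_params are hypothetical helper names.
N_ROWS = 10_000_000
N_COLS = 100
VALID_FRACTION = 0.10
N_ROUNDS = 500  # training iterations


def split_sizes(n_rows, valid_fraction):
    """Return (training rows, validation rows) for a simple holdout split."""
    n_valid = round(n_rows * valid_fraction)
    return n_rows - n_valid, n_valid


def make_params(use_gpu, max_depth):
    """Build the parameters that differ between the CPU and GPU runs."""
    return {
        "tree_method": "gpu_hist" if use_gpu else "hist",
        "max_bin": 64,
        "max_depth": max_depth,
    }


n_train, n_valid = split_sizes(N_ROWS, VALID_FRACTION)
print(N_ROWS * N_COLS, n_train, n_valid)
print(make_params(use_gpu=True, max_depth=6))
```

Only the parameters named in the list are filled in; everything else (objective, learning rate, and so on) is left out because the list does not specify it.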

We will use the following script to benchmark xgboost CPU vs GPU.

https://medium.com/media/9dbb6cd398539e294c67000f01341abc/href

Benchmark results run by hand

The benchmark results run by hand are a bit different than the real benchmark:

We are using 12.5 million rows instead of 10 million rows and depth 6 only (fits better the 4GB GPU RAM of a Quadro P1000)
Other hardware is tested (i7–7700 + NVIDIA GeForce 1080, E5–1650 v3)

Tradeoff for using GPU

There are multiple tradeoffs for using xgboost on GPU:

GPU models are not reproducible: you will always get different results. If you are testing for lucky runs, then you will learn to at least run for the expected value (mean/average) over several runs. Or do statistical testing to compare means.
GPU models are not cleared from memory after being run: you need to remove the model from memory then run gc() .
xgboost crashes when using fewer threads than the number of GPUs used: set nthread to at least the number of GPUs used.
xgboost crashes when changing the (number of) GPUs used after training a model on an identical xgb.DMatrix: remove the dataset and model from memory, run gc() , and reconstruct the needed xgb.DMatrix…
xgboost GPU crashes for max_depth >X: use a maximum depth lower than or equal to X, otherwise you crash xgboost GPU. Rule of thumb I found: do not use more than 12 for approximately 100 features. The maximum depth for crashing seems to be linked to the number of features.
xgboost cross-validation with GPU crashes after training multiple folds: more likely you are running out of GPU RAM, then you should get what you want from the models, then delete the models.
xgboost ignores my hyperparameters: most likely you are using unavailable hyperparameters for GPU (not every hyperparameter is available with GPU, this is also true for fast histogram actually)
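Since GPU runs are not reproducible, the first tradeoff suggests comparing means over several runs. A minimal sketch of that idea, using only the Python standard library: the logloss values below are made up purely for illustration, and Welch's t statistic is one possible choice of test, not something the benchmark itself used.

```python
from math import sqrt
from statistics import mean, variance


def summarize(runs):
    """Mean and sample standard deviation of a metric over repeated runs."""
    return mean(runs), sqrt(variance(runs))


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


# Made-up logloss values for two hypothetical setups, five GPU runs each.
baseline = [0.4310, 0.4305, 0.4312, 0.4308, 0.4311]
candidate = [0.4301, 0.4299, 0.4304, 0.4300, 0.4303]

print(summarize(baseline))
print(summarize(candidate))
print(welch_t(baseline, candidate))
```

A large absolute t value suggests the difference between the two setups is more than run-to-run noise; turning it into a p-value additionally requires the Welch degrees of freedom, which is left out here.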

Example of non-reproducible results:

xgboost GPU is not reproducible and you will get different results every time! Only CPU is! — Note the CPU is not reproducible when changing computer: you need the same exact compiler to have reproducible results!

Benchmark results run by hand

For the benchmark results run by hand (12.5 million rows, 100 features), we have guest hardware provided by Miguel Perez Michaus:

Server 1: i7–7700 + 64GB RAM (4x 16GB RAM)+ NVIDIA GeForce 1080 8GB RAM
Server 2: E5–1650 v3 + 128GB RAM (4x 32GB RAM)

xgboost training time for 12,500,000 rows x 100 features, 500 iterations

Main conclusions to get here:

xgboost CPU with a very high end CPU (2x Xeon Gold 6154, 3.7 GHz all cores) is slower than xgboost GPU with a low-end GPU (1x Quadro P1000)
xgboost GPU seems to scale linearly
4 Quadro P1000 is faster than a single GeForce 1080

Extra conclusions, for those using CPU:

2x Xeon Gold 6154 (2x $3,543) gets you a training time of 700 seconds, 25% faster than an i7–7700 (for 2,339% of the price) and 20% faster than an E5–1650 v3 (for 1,215% of the price)
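The price percentages above can be sanity-checked with a few lines of arithmetic. This sketch back-calculates the CPU prices implied by the quoted percentages; the individual prices of the i7–7700 and the E5–1650 v3 are not stated in the text, so the results are implied values, and `implied_price` is a hypothetical helper name.

```python
XEON_GOLD_6154_PRICE = 3543                   # USD, as quoted above
server_cpu_cost = 2 * XEON_GOLD_6154_PRICE    # dual-socket server: 7086 USD


def implied_price(reference_cost, percent_of_price):
    """Price of the cheaper CPU if reference_cost is percent_of_price % of it."""
    return reference_cost / (percent_of_price / 100)


i7_7700 = implied_price(server_cpu_cost, 2339)
e5_1650_v3 = implied_price(server_cpu_cost, 1215)

print(round(i7_7700), round(e5_1650_v3))
# Share of the server CPU cost represented by the implied i7 price, in percent.
print(round(100 * i7_7700 / server_cpu_cost))
```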

How much should I spend for ML stuff?

Is it worth purchasing 2x Xeon Gold 6154 when you can purchase an i7–7700 (for 4% of the price), if the goal is to train a single model as fast as possible? It depends: if you value your time, then yes, it is worth it. Otherwise, it is a waste of your money (find other use cases to justify the 2x Gold 6154).

The random guy who says R is only working singlethreaded is wrong

If we can call you “The Parallelizer”, then you know that with a 2x Gold 6154, you can train 72 xgboost models fairly quickly at the same time. This is a godly use case for your server. R works extremely well for parallelization, and it is available by default (and my package LauraeParallel provides load balancing in a functional programming fashion).

Also, a small note: do not buy a server just to boast. The next image proves it is just a waste of your money if you do not use it for a real task:

htop: Time to boast those “72 high performance threads on the server for super duper fast artificial intelligence machine learning data science crypto (insert more buzzwords)”

Don’t do that unless you want to show something interesting (like purchasing a server and keeping cores unused for instance…)

More specific benchmark results

I suppose you were here for the GPU benchmarks? If we plot the raw data, we may end up with something very wrong at first sight:

Using no GPU seems so slow!!!!!

We have to understand several things from the plots, which are specific to our scenario (10M rows, 100 features):

Using no GPU is significantly slower.
CPUs have negative scalability with a large number of threads; this is even more visible for larger maximum depths
More GPU means faster training (seems correct), but it does not seem to scale linearly (because the charting is actually so wrong visually)
Using more CPU threads alongside the GPU is not faster: it is actually a flat line (in practice this is not exactly true, because too many threads cause negative scalability, but for this experiment we treat it as a flat line)

To focus on the essentials, we have to invert what we are looking at: instead of raw training times, we analyze the speedup of GPU against CPU. Three different charts for GPU speedup are provided:

Without free axes, and all data:

With free axes

With free axes, and restricted from 6 threads on CPU:

Better conclusions can be made using those charts:

A single GPU provides an excellent speedup against a small number of threads (5x or more)
Multiple GPUs provide a very huge speedup against a small number of threads (up to 20x)
GPU speedup decreases as the maximum depth increases
Against peak CPU performance, GPU performance increase remains flat (but is still an increase in performance)
Adding more GPUs increases the performance linearly as long as the maximum depth is lower, otherwise it increases the performance very marginally (see: depth 2 vs depth 12 scaling)
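The two quantities behind these conclusions, speedup against CPU and multi-GPU scaling, can be sketched as below. The function names are hypothetical and the timings in the usage lines are placeholders, not measurements from the benchmark.

```python
def speedup(cpu_seconds, gpu_seconds):
    """How many times shorter the GPU training time is than the CPU one."""
    return cpu_seconds / gpu_seconds


def scaling_efficiency(one_gpu_seconds, n_gpu_seconds, n_gpus):
    """1.0 means perfectly linear scaling when going from 1 to n_gpus GPUs."""
    return (one_gpu_seconds / n_gpu_seconds) / n_gpus


# Placeholder timings in seconds, for illustration only.
print(speedup(cpu_seconds=700, gpu_seconds=140))
print(scaling_efficiency(one_gpu_seconds=200, n_gpu_seconds=50, n_gpus=4))
print(scaling_efficiency(one_gpu_seconds=200, n_gpu_seconds=100, n_gpus=4))
```

An efficiency near 1.0 corresponds to the "linear" case at small depths, while a value well below 1.0 corresponds to the marginal gains seen at large depths.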

From this point of view, it is very easy to draw the following conclusion:

For small trees and if reproducibility is not an issue, using a weak GPU is faster than using a monster CPU, as long as the data fits in GPU RAM. Otherwise, using CPU remains the best choice.

Just to reiterate the assumptions behind our GPU > CPU conclusion:

Small trees (small maximum depth)
Not reproducible results
Data fits in GPU RAM
Weak GPU > Strong CPU

If you can live with non-reproducible results to do cross-validation and compare feature performance, that’s another story where doing proper statistics can help you.

More about RAM usage

xgboost GPU is pretty smart at using multiple GPUs. By taking our benchmark script, and using a 500K rows x 100 features matrix with 10% as validation (343 MB training set), we get the following script:

https://medium.com/media/66aeb2e8c658e9824d29650f45d141ec/href

Using nvidia-smi , you can exactly pinpoint the GPU RAM usage per process (we have to include the xgboost GPU process which takes 55 additional MB):

nvidia-smi showing exactly which process takes how much GPU RAM. A fixed 55MB (variable actually depending on GPUs used…) is mandatory in xgboost for GPU with R (might be identical for Python).
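For those who prefer to collect these per-process numbers programmatically, here is a small sketch that parses the CSV output of nvidia-smi. It assumes the `--query-compute-apps=pid,used_memory` query with `--format=csv,noheader,nounits` is available in your driver version; the function names are hypothetical. A process using several GPUs shows up on several rows, so the rows are summed per PID.

```python
import subprocess

# Assumed nvidia-smi query: one CSV row per (process, GPU), memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Sum the GPU memory (MiB) used by each PID across all GPUs."""
    usage = {}
    for line in csv_text.strip().splitlines():
        pid, used_mib = (field.strip() for field in line.split(","))
        if not used_mib.isdigit():
            continue  # some setups report a non-numeric placeholder here
        usage[int(pid)] = usage.get(int(pid), 0) + int(used_mib)
    return usage


def gpu_memory_by_pid():
    """Run nvidia-smi and return {pid: total MiB}; requires an NVIDIA driver."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

Calling `gpu_memory_by_pid()` before and after training a model gives the GPU RAM taken by the R process at each step.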

A more complete sample test script is below:

https://medium.com/media/5ebdfd949d759d7944cc20bbd1379cb6/href

GPU RAM when modifying row count

We get the following RAM results when using xgboost GPU with a matrix of size 450,000 x 100:

How much GPU RAM is used per GPU for 450K row x 100 feature matrix?

How much total GPU RAM is used for 450K row x 100 feature matrix?

As we can see, the total GPU RAM used for GPU increases dramatically as the maximum depth increases:

The lowest GPU RAM usage is below depth 5 (between 1 and 4)
The GPU RAM usage spikes from depth 9
The GPU RAM usage for depth 12 is very high (3 times higher than the lowest RAM usage for our small data)

Let’s try again, but with a matrix of size 1,000,000 x 100 (763 MB):

How much GPU RAM is used per GPU for 1M row x 100 feature matrix?

What about 5,000,000 x 100, which is closer to the limit of 4GB (the matrix is of size 3,814.7 MB):

How much GPU RAM is used per GPU for 5M row x 100 feature matrix?

And 10,000,000 x 100, a matrix of size 7,629.4 MB?:

How much GPU RAM is used per GPU for 10M row x 100 feature matrix?

Step up the game to crash when using 1 GPU, let’s go for 25,000,000 x 100, a matrix of size 19,073.5 MB:

How much GPU RAM is used per GPU for 25M row x 100 feature matrix?

We can go ahead to crash when using 2 GPU, with a 50,000,000 x 100 matrix (38,147 MB):

How much GPU RAM is used per GPU for 50M row x 100 feature matrix?

Do you think a 75,000,000 x 100 matrix (57,220.5 MB) will work for 4 GPU? It crashed!

GPU RAM seems to increase by a fixed amount when using a larger depth, which explains why we may have thought our small data GPU RAM explodes with high depth, while it does not for large data.
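The matrix sizes quoted in this section are consistent with a dense matrix stored as 8-byte double-precision values, with "MB" meaning 1024 x 1024 bytes. That storage assumption is inferred from the numbers rather than stated in the text. A quick sketch to reproduce them:

```python
BYTES_PER_VALUE = 8  # assumption: dense double-precision values


def dense_matrix_mib(n_rows, n_cols, bytes_per_value=BYTES_PER_VALUE):
    """Size of a dense n_rows x n_cols matrix in MiB (1024 * 1024 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1024**2


row_counts = (450_000, 1_000_000, 5_000_000, 10_000_000,
              25_000_000, 50_000_000, 75_000_000)
for n_rows in row_counts:
    print(f"{n_rows:>11,} x 100 -> {dense_matrix_mib(n_rows, 100):,.1f} MB")
```

The same function covers the feature-count experiments, since only the number of elements matters for the raw matrix size (for example 1,000,000 x 500 has as many elements as 5,000,000 x 100).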

GPU RAM when modifying feature count

1,000,000 x 500 (3,814.7 MB): it crashes after depth 10!

How much GPU RAM is used per GPU for 1M row x 500 feature matrix?

1,000,000 x 1,000 (7,629.4 MB): it crashes after depth 9!:

How much GPU RAM is used per GPU for 1M row x 1K feature matrix?

1,000,000 x 2,500 (19,073.5 MB): it crashes after depth 7, and requires at least 2 GPUs!:

How much GPU RAM is used per GPU for 1M row x 2.5K feature matrix?

1,000,000 x 5,000 (38,147 MB): it crashes after depth 6, and requires at least 4 GPUs!:

How much GPU RAM is used per GPU for 1M row x 5K feature matrix?

Adding more features (to reach the same number of elements as adding more observations would) seems to cost more xgboost GPU RAM than adding more observations.

Conclusions about GPU RAM usage

If we consider the number of elements in a matrix (number of rows x number of features), then we can conclude the following:

Multiple GPU scales pretty well (nearly linearly) for GPU RAM usage
More depth means higher GPU RAM usage
The number of features has a higher weight than the number of observations (roughly 5-15% more?): it requires more GPU RAM when adding more features than adding more observations for an equal number of elements in the matrix
Higher number of features increases the risk of crashing xgboost GPU when using a large maximum depth (a magic number seems to exist)
There seems to be a formula to predict the GPU RAM required depending on the number of observations, the number of features, and the maximum depth.

Mega Conclusion

This is a very simple conclusion:

xgboost GPU is fast. Very fast. As long as it fits in RAM and you do not care about getting reproducible results (and getting crashes).

To keep getting those epic, stable and reproducible results (or if data is just too big for GPU RAM), keep using the CPU. There’s no real workaround (yet).

xgboost GPU performance on low-end GPU vs high-end CPU was originally published in Data Science & Design on Medium, where people are continuing the conversation by highlighting and responding to this story.

Investigating xgboost Exact scalability

Laurae — Mon, 21 May 2018 10:51:39 GMT

xgboost is a very well known Machine Learning technique based on Gradient Boosted Trees. The default xgboost is an exact method, which does not use binning, and is significantly slower than the histogram-based version (fast histogram).

Two questions were asked recently about xgboost:

do I see it right that less threads than physical cores can provide fastes runtime?

interesting. so without fast hist, it’s a different story?

It is a whole new story.

Context

You may have seen my recent blog post (Getting the most of xgboost and LightGBM speed: Compiler, CPU pinning) which compares two compilers (Visual Studio and MinGW) and CPU pinning/roaming to find the best software setup configuration to run xgboost and LightGBM as fast as possible under Windows:

We should use Visual Studio to compile xgboost and LightGBM
CPU pinning seems useful for xgboost

The question of the day

What happens if we look now at xgboost exact? This is the topic of today, and we will go straight to the results as the benchmark setup is identical to the previous blog post.

The only differences are the following:

Exact xgboost instead of fast histogram
Every run is repeated twice

Benchmark results: from xgboost Exact to Fast Histogram

The Bosch dataset is very large: 6,000+ seconds is what a user should expect to spend training using the fastest available machine learning libraries at that time.

Big Data software (Hadoop / Spark etc.) is not what will save you from longer runtimes: what matters here are the algorithm and the performance optimizations. xgboost Exact can be viewed as approximately 10 times faster than R’s gbm and scikit-learn Gradient Boosting.

The Bosch dataset is very large for machine learning

Looking directly at the runtimes, we find a large drop in runtime when increasing the number of threads:

Roaming CPU: 6,255.4s (1 thread) to 263.7s (22.7x faster)
Pinned CPU: 6,487.5s (1 thread) to 266.3s (23.4x faster)

xgboost literally takes forever to learn

xgboost seems to scale very well when adding more and more threads. We are going to qualify whether it scales very well or not in a more appropriate chart.

An ample reminder of the evolution between xgboost Exact and xgboost Fast histogram can be visualized below:

Approximately 8 threads and 120 seconds?!

With the histogram technique first available in LightGBM, xgboost Fast Histogram slashes the training time from 260 seconds (using a monster 56 threads) to 120 seconds (using a mere 8 threads): a 117% performance increase for similar model performance while using 7 times less computing power seems like the best of both worlds!

LightGBM takes the crown of speed

LightGBM, the direct competitor to xgboost, is significantly faster and drops the computation time from 120 seconds to 55 seconds for the same number of threads: a 118% performance increase.
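The "x faster" and "% performance increase" figures quoted so far appear to follow the convention (old time / new time) - 1, so a run that takes half the time counts as "100% faster". A short sketch under that assumption reproduces the quoted numbers; the function names are hypothetical.

```python
def times_faster(old_seconds, new_seconds):
    """Relative gain: 1.0 means the new run takes half the time of the old one."""
    return old_seconds / new_seconds - 1


def percent_increase(old_seconds, new_seconds):
    """The same gain expressed as a percentage."""
    return 100 * times_faster(old_seconds, new_seconds)


print(round(times_faster(6255.4, 263.7), 1))   # roaming CPU, 1 thread vs best
print(round(times_faster(6487.5, 266.3), 1))   # pinned CPU, 1 thread vs best
print(round(percent_increase(260, 120)))       # xgboost Exact -> Fast Histogram
print(round(percent_increase(120, 55)))        # Fast Histogram -> LightGBM
```

Under this convention the raw runtime ratio is always one higher than the "x faster" figure: 6,255.4s against 263.7s is a ratio of about 23.7.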

xgboost Exact efficiency curve

So far we have talked about the history of xgboost up to the arrival of LightGBM. Let’s look at the computing efficiency of xgboost when scaling to more threads:

Ideally, we should have more than 28x efficiency using 56 threads

We can notice the following:

Not only does xgboost Exact scale very well (over 2800% efficiency at 56 threads against a single thread)
But xgboost Exact also still benefits a lot from hyperthreading (from threads 15 to 28, and 43 to 56, the efficiency keeps increasing)
And xgboost still manages to scale properly when NUMA issues arise (we are using a Dual Xeon, therefore managing memory improperly causes slowdowns)

We can’t say this is true for every situation, but this chart shows how well xgboost Exact is scaling when using it on large datasets.

We are counting in hours for a single thread, and in days for R’s gbm and Python’s scikit-learn.

Conclusion

xgboost Exact scales very well: this is a good example of a very well made program, tailored to scale on servers. Although xgboost is a sequential algorithm (Gradient Boosting is sequential, not parallel by nature), it still runs extremely fast when throwing more and more threads at it.

Recent advancements (throughout the last 4 years) slashed the training time from 100,000+ seconds to 50 seconds (2,000x, “two thousand” performance improvement) thanks to the following:

Parallelization / multithreading of the sequential task of Gradient Boosting
Code/Cache optimization of xgboost
Histogram/sketching idea of LightGBM

In addition, these advancements improved and then maintained the high performance of the original machine learning algorithms, while also providing a proper and stable path to industrializing them (H2O still excels at this task).

The next part, if possible, will test LightGBM on dense data, as outlined here in GitHub.

Getting the most of xgboost and LightGBM speed: Compiler, CPU pinning

Laurae — Mon, 21 May 2018 10:51:07 GMT

Why should I change my computer setup if it works? To cut 1/3 of the time you spend waiting for results!!!

Currently, xgboost and LightGBM are the two best performing machine learning algorithms for large datasets (both in speed and metric performance). They scale very well up to billions of observations and/or elements (ex: Reputation dataset, 53,181,000,000 elements).

xgboost and LightGBM were made primarily for speed: it is better to iterate quickly at high accuracy and try more things than to wait hours for your neural network to finish.

However, although they can be used on large datasets, the question of scalability was only partially answered: how well do xgboost and LightGBM scale? Do they prefer high-frequency cores or more cores?

xgboost exact likes both many cores and high frequency, with a preference on both
xgboost fast histogram needs high frequency
LightGBM likes both many cores and high frequency, with a preference on high frequency

As we already know the answer to this question, we are going to look at a more exotic situation: changing the compiler, and pinning CPU.

Are xgboost and LightGBM faster by swapping the compiler from MinGW to Visual Studio? Is CPU pinning a good thing to do?

This was also partially answered in this GitHub issue. Therefore, we are back with our Windows machine to do some benchmarks.

Interactive documents:

In the conclusion, an opening to GPU xgboost was included.

A quick review on the definition of a compiler and CPU pinning

Defining a compiler and CPU pinning

The place of the compiler for a source code and an executable

Haswell EP Xeon CPU die configuration: there are four RAM banks which do not have the same latency if you take two different groups of cores!

Compiler: the compiler transforms code in a source language into code in a target language (usually to generate an executable). Compilers are similar to translators, and we all know translators do not all have the same level of performance: some provide gibberish, some provide excellent translations, which in turn makes your interpretation of the words slower or quicker.
CPU pinning: CPU pinning is the binding of a process (or thread) to a specific range of CPU cores. This way, the process will not roam anywhere as easily as it could without CPU pinning. When the process roams across CPUs, it incurs significantly higher RAM and cache latency: this is even more severe with multi-socket CPUs.

CPU pinning is also named CPU affinity, although the wording is inexact (“affinity” could mean “preference”, although it is not in this case: it is “this process uses this range and only this range of CPU cores”).
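To make the definition concrete, here is a minimal sketch of pinning the current process to a set of logical CPUs. It uses `os.sched_setaffinity`, which Python only provides on Linux; the benchmark in this post ran on Windows, where a different mechanism is needed, so treat this as an illustration of the concept rather than the setup used here. The helper name `pin_to_cpus` is hypothetical.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the current process to the given logical CPUs (Linux only)."""
    os.sched_setaffinity(0, cpus)      # 0 means "the calling process"
    return os.sched_getaffinity(0)     # the set of CPUs now allowed


if hasattr(os, "sched_setaffinity"):
    allowed = os.sched_getaffinity(0)  # CPUs the process may currently roam on
    first = min(allowed)
    print(pin_to_cpus({first}))        # pinned: the process stays on one CPU
    pin_to_cpus(allowed)               # restore the original roaming behavior
```

Once pinned, the scheduler can no longer move the process to another core, which is exactly the "this range and only this range" behavior described above.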

Benchmarking the differences

We are going to benchmark the difference between compilers and CPU pinning, for each number of threads available (1 to 56) on our server:

Two compilers to test: Visual Studio (Windows’ native) and MinGW (gcc)
Two CPU behaviors: CPU roaming (no pinning) and CPU pinning (by socket, then by physical core, then by hyperthreaded core).

The latter means the following: if we have 2 sockets, 4 physical cores on each socket, and hyperthreaded activated, we will try to contain all CPUs in one socket, first adding physical (yellow) cores, then adding logical (orange) cores:

Activation order of CPUs: 1, 3, 5, 7, 2, 4, 6, 8, 9, 11, 13, 15, 10, 12, 14, 16
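The activation order above can be generated with a few lines. This sketch assumes the numbering used in the example: logical CPUs are numbered from 1, each socket owns a contiguous block, odd numbers within the block are physical cores and even numbers are their hyperthreaded siblings. The function name is hypothetical.

```python
def activation_order(sockets, cores_per_socket):
    """Order in which logical CPUs are added: socket by socket,
    physical cores first, then their hyperthreaded siblings."""
    order = []
    for socket in range(sockets):
        base = socket * cores_per_socket * 2
        physical = [base + 2 * core + 1 for core in range(cores_per_socket)]
        logical = [base + 2 * core + 2 for core in range(cores_per_socket)]
        order += physical + logical
    return order


print(activation_order(sockets=2, cores_per_socket=4))
```

With `sockets=2` and `cores_per_socket=14`, the same function gives an order for the 56 threads of a dual 14-core machine under the same numbering assumption.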

We are benchmarking xgboost and LightGBM under the following environment:

CPU: Dual Intel Xeon E5–2697v3 (14 cores, 28 threads, 3.6 GHz singlethread, 3.1 GHz multithread)
RAM: 128GB RAM DDR4 2133 MHz
GPU: none
OS: Windows Server 2012 R2 Datacenter, without Meltdown/Spectre patch
R version: default 3.4.3
Compiler: Visual Studio 2017, MinGW 4.9 (R)
xgboost: commit 3f3f54b (Jan 16, 2018, 5:16 PM GMT+1)
LightGBM: commit 3dc5716 (Jan 18, 2018, 2:16 AM GMT+1)

The dataset:

Kaggle Bosch training dataset
Number of observations: 1,183,747
Number of features: 969
Sparsity: approx 81%

The algorithm parameters:

Number of boosting iterations: 200
Learning rate: 0.05
Maximum depth: 8
Maximum leaves: 255
Max bins: 255
Minimum hessian: 1
xgboost only: fast histogram, depth-wise
LightGBM only: minimum split loss of 1 (due to loss-guided optimization)

Each run was repeated at least twice, up to 10 times. It took approximately 1 week to run the benchmark, thanks to having so many threads!!!

Benchmark Results

Reminder: xgboost and LightGBM do not scale linearly at all.

xgboost is up to 154% faster than a single thread, while LightGBM is up to 1,116% faster than a single thread.

If you have a workstation…:

If you have 56 threads, do not expect 56 threads to be 5,500% more efficient than 1 thread (it will not train 55x faster).
If you have 28 cores, do not expect 28 threads to be 2,700% more efficient than 1 thread (it will not train 27x faster).
If you have a small dataset, do not expect a lot of threads to scale well (it will scale negatively).

Showing the results taking the best case scenario (Visual Studio, Roaming CPUs) below:

Compiler Performance

By far, Visual Studio is the compiler to go with on Windows. It is worth installing Visual C++ Build Tools to get the fastest training speed possible.

With roaming CPUs:

xgboost is very fast using Visual Studio instead of MinGW/gcc

LightGBM is a bit faster with Visual Studio instead of MinGW/gcc. Keep in mind, unfortunately, the MinGW slowdown happens at large depth.

With CPU pinning:

xgboost with MinGW depicts huge RAM latencies when spreading the CPU pinning on the physical cores and using 2 sockets at the same time.

LightGBM still prefers Visual Studio over MinGW.

CPU pinning Performance

CPU pinning increases the performance of xgboost with MinGW significantly. Otherwise, we are seeing performance degradation.

Moral of the story:

Use CPU pinning if you are using xgboost with MinGW.
Another case: if you are training parallel xgboost and LightGBM on the same machine, pin the CPUs in order to make sure CPU cache effects can trigger properly (ex: if you are training 4 xgboost models at the same time on a 4 core machine, pin each model process to a separate core).

With Visual Studio:

xgboost with Visual Studio requires CPU pinning for performance increases.

LightGBM seems faster without CPU pinning. Strange?

With MinGW:

With MinGW, xgboost does not need CPU pinning IT SEEMS.

LightGBM does not need CPU pinning also IT SEEMS.

Conclusion

Using Visual Studio without CPU pinning seems the best choice by far.

The recommendations for the power users wanting the most of their xgboost/LightGBM:

Use Visual Studio whenever possible
Train models without CPU pinning
And attempt to get higher CPU frequencies…

If you were forced to use xgboost in Windows, then force CPU pinning to increase the performance.

If you have single models to train, GPU xgboost seems the way to go due to how stable it became today. You do not even need a powerful server, even a laptop’s NVIDIA 1050 Ti outperforms our monster server.

NVIDIA 1050 Ti + GPU xgboost is FAST!

For the curious: using an NVIDIA 1050 Ti (1.75 GHz) on a laptop with GPU xgboost, it takes 92 seconds to train a model. That’s 28 seconds faster than the fastest xgboost (Visual Studio + CPU pinning + 9 physical cores). An overclocked workstation would slash that time to about 60 seconds.

Find below the most brutal comparison in efficiency, when using xgboost and CPU pinning:

Which one do you prefer? A tool with 349% efficiency or a tool with 180% efficiency? The answer is very easy!

Next part: Investigating xgboost Exact scalability

Gigabyte Aero 14 review & benchmarks: laptop versus servers

Laurae — Tue, 30 Jan 2018 21:00:54 GMT

Introduction

I recently purchased a Gigabyte Aero 14K v7 (shortened as 14 in this post) after 6 months of scouring the Internet for THE laptop I wanted. I am very picky about laptop specifications and usage, which makes it very difficult to match my needs with what I can get (read more about my needs later in this post).

The previous laptop which met all my needs was the HP Elitebook 840 G1: IPS 14" Full HD screen, i7, 16GB RAM, two SSDs (SATA + M.2 2242), nearly fully silent… Fully spec-ed out, it cost over €4,000 five years ago with an international next day 3-year warranty (a 3/3/3 warranty) on-site (I got it under €700 due to the wrong keyboard in France).

Battle of Computers

Getting a laptop without comparing how good it is against its competition is futile. Actually, we will compare it against insanity which you can find below:

A simple ultraportable laptop under the name Acer Aspire 13 S5–371 (i7–7500U, 8GB RAM, 256GB SSD) with one of the most annoying fans in the world of laptops

https://medium.com/media/888a0a7d308f1ee97fde3220d55a8ce3/href

A workstation equipped with a i7–7700, 64GB RAM, 2x 500GB SSD, and a NVIDIA 1080
A server with a Dual Quanta Freedom (Ivy Bridge 2x 10 cores, 2.7 GHz), 128GB RAM, and 2x 500GB SSD
A server with a Dual Xeon E5–2697v3 (2x 14 cores, 3.1 GHz), 128GB RAM, and LSI MegaRAID with 4x 500GB SSD in RAID 10

The latter broke into the top 20 world ranking of Cinebench R11.5 on October 12th, 2017:

Crushing Cinebench R11.5 with a 56 thread monster

Specifications

You may wonder what the specs of my Gigabyte Aero 14 laptop are. Note that they include my upgrades (model name: Aero 14K v7):

Screen: 14", 2560x1440, matte, non touch
CPU: i7–7700HQ, undervolted (undervolt -130mV)
GPU: Intel HD Graphics, NVIDIA 1050 Ti 4GB RAM (undervolt -150mV)
Thermal Paste for CPU/GPU: Thermal Grizzly Kryonaut
RAM: 2x 16GB RAM 2400 MHz (Crucial CT16G4SFD824A)
SSD: Transcend 256GB MTS800 (default), Samsung 960 Evo 1TB
OS: Windows 8.1 Pro Update 3
Factory extras: USB to Ethernet cable, CPU and GPU -50mV undervolt
Weight: 1.9kg, about 500g for the charger
Thickness: 19mm

(From notebookcheck) Left side: Kensington lock, HDMI, USB 3.0, audio in/out, SD card

(From notebookcheck) Right side: USB 3.1 Type-C, Mini DisplayPort, 2 x USB 3.0, power

Cost & Upgrades

The cost was spread approximately the following way:

Laptop: €2,000
RAM upgrade (1x 16GB RAM): €190 (critical UPS shipping method is a bunch…)
SSD upgrade (Samsung 960 Evo 1TB): €450
Thermal Paste upgrade (Thermal Grizzly Kryonaut): €5
Operating System downgrade: €0 (got tons of MSDN licenses)
Grand total: €2,645
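The grand total can be double-checked with a few lines of Python. The figures are the ones listed above; the variable names are only illustrative.

```python
# Cost breakdown in euros, copied from the list above.
costs_eur = {
    "Laptop": 2000,
    "RAM upgrade (1x 16GB)": 190,
    "SSD upgrade (Samsung 960 Evo 1TB)": 450,
    "Thermal paste upgrade": 5,
    "Operating system downgrade": 0,
}

grand_total = sum(costs_eur.values())
upgrades_only = grand_total - costs_eur["Laptop"]

print(f"Grand total: €{grand_total:,}")      # Grand total: €2,645
print(f"Upgrades only: €{upgrades_only:,}")  # Upgrades only: €645
```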

End of Introduction

This post will be divided into multiple sections:

What are my laptop usages?
Why did I choose this laptop? Because magic?
Purchasing online is weird?
Some “synthetic” benchmarks?
Real world usage?

Some images were taken from the notebookcheck review of the Gigabyte Aero 14 (NVIDIA GTX 1060, NVIDIA GTX 1050 Ti). For pictures of the laptop, go check their review, as it is way better =)

How do I use my laptop?

This section was added after a question was asked about the usage of my laptop:

What is this laptop used for? Machine learning? Gaming? General usage? Or all of the above?

All of the above. I will describe the main usages of my laptop in more detail here, which should explain how I use it and why it was so difficult to find such a laptop.

Machine Learning & Data Analysis

An important point for me was to be able to use the machine for machine learning, for the following tasks:

Parallel xgboost (not xgboost multithreaded): requires a bunch of high-frequency CPU threads and lots of RAM
Deep learning / neural networks: GPU is mandatory
Lots of vertically scaling data analysis: more threads and more RAM help a bunch
OpenCL / CUDA optimized code in R: need dedicated GPU…
32-bit data analysis: we are still in the JMP 10 / SPSS 21 era
Business Intelligence: Tableau and Qlik are CPU bound

Example: when analyzing the Porto Seguro dataset, if I do not restrain myself from using too many resources, my server (56 CPU threads…) needs 20 minutes and 110GB RAM to produce meaningful automated reports for human analysis. On my new laptop, the same analysis takes 4 hours and 30GB RAM (this means you can go shopping and come back after food with a report ready).
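To make the "parallel xgboost (not xgboost multithreaded)" item more concrete: the idea is to run several independent training jobs side by side, each with a small thread count, instead of one job using every thread. Below is a minimal sketch of how one might split a thread budget. `split_threads` is a hypothetical helper, not part of xgboost, and the 8 and 56 thread counts are those of the laptop and server mentioned in this post.

```python
def split_threads(total_threads: int, n_jobs: int) -> list[int]:
    """Split a CPU thread budget as evenly as possible across parallel jobs.

    Hypothetical helper for illustration: each returned value would be the
    thread count given to one independent training process.
    """
    if n_jobs < 1 or total_threads < n_jobs:
        raise ValueError("need at least one thread per job")
    base, extra = divmod(total_threads, n_jobs)
    # The first `extra` jobs get one more thread so no thread is left idle.
    return [base + 1 if i < extra else base for i in range(n_jobs)]


# Laptop (i7-7700HQ, 8 threads): 4 jobs with 2 threads each.
print(split_threads(8, 4))   # [2, 2, 2, 2]
# Server (56 threads): 5 jobs do not divide evenly.
print(split_threads(56, 5))  # [12, 11, 11, 11, 11]
```

Each process typically holds its own copy of the data, which is one reason this style of parallelism needs so much RAM.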

Typing Everyday Anywhere / Programming

I use my laptop to type stuff every day and anywhere when not at work. This includes emails, chats (Slack, etc.), blogging (Medium), websites, programming…

When I need to write stuff, I use the following tools:

RStudio for R
Spyder for Python
Visual Studio / RStudio for C++
Git Bash for Git and Bash
Notepad++ and Visual Studio Code for other languages
Word / Excel / PowerPoint / Visio for documents
KiTTY, MobaXterm, Bitvise SSH Client… for SSH-ing
Remote Desktop (mstsc.exe) for remoting into another machine
Photoshop / Illustrator / InDesign for anything graphic related, visually critical on screen
Axure RP / Mockplus / JustInMind for anything UI/UX design related, visually critical on screen

Believe it or not, having keyboard macro keys helps tremendously in reaching a very high typing speed. And a very light laptop (under 2kg) is a gigantic plus when you cannot stay in the same place.

Rendering Scenes

I use Daz Studio Pro 4.9 and KeyShot 7, and require a beefy CPU and GPU depending on my needs.

CPU helps a lot when processing single elements sequentially (think: load app, load textures, etc.), while GPU is the bazooka for final rendering of scenes. When GPU cannot be used for any reason, LuxRender allows blazing fast renderings using CPU only.

Virtual Machines

I often need virtual machines, literally every day. I use Hyper-V and VirtualBox for virtualization. This allows me to:

Have multiple operating systems booted at the same time
Test distributed programs / machine learning properly
Test web services in an isolated and fully controlled environment
Test malware behavior
Run Windows 10 when applications do not run without it (eyes rolling at Adobe new software)

How did I choose my laptop?

Those who know me personally all know I am very picky when it comes to purchasing (actually, investing in) a new laptop. Typically, here are my required specs:

Screen size: 15" maximum, matte mandatory, touch screen optional

Good luck being able to read anything when you put this in direct sunlight (Dell XPS 15)

Screen resolution: 1920x1080 minimum

Comfortable reading (from Razer Insider)

Webcam: at the top mandatory (Dell is blacklisted due to this)

Dell XPS 15 webcam placement: LOL

Operating System: must be able to install Windows 8.1

Kill all those metro apps starting from Windows 8 (from TechNorms)

CPU: 7th gen (Kaby Lake), hyperthreading available, ultra low voltage (Intel U) or quad core (Intel Q)

Who needs more performance? Dual Xeon E5–2697v3 in action (28 cores / 56 threads, 3.1GHz)

GPU: dedicated NVIDIA Pascal GPU optional, with power savings (Optimus), Intel Iris Plus/Pro preferred

Holding an NVIDIA Volta GPU and letting it compute 24/7/365 makes your home burn

RAM: 16GB RAM minimum, 32GB or 64GB RAM preferred

Much RAM coming soon (by Incero)

Drives: SSD, preferred NVMe with 4x PCIe lanes (~4GBps), preferred two SSDs

SSDNodes SSDs (by Incero)

Network: Wi-Fi, non Killer versions (Intel-only network cards)

ATM running Windows XP: Killer Wi-Fi cards are prone to crashing under load and requiring an OS reboot

Mobile internet: a big plus but not mandatory (Huawei / Qualcomm preferred)

Slow mobile internet (by TechandGio) vs “fast Internet”

Ports: 3x USB any version is mandatory, VGA or HDMI is mandatory, mini Display Port is mandatory, Thunderbolt 3 is a big plus, charger with charging USB port is a big plus

Need MORE USB ports? Here are 40.

Keyboard: backlight mandatory (on all keys + all elements), macro keys is a big plus, mechanical feel preferred, centered touchpad preferred, Macbook keyboard style forbidden (no butterfly keys)

A keyboard like this (Gigabyte Aero 14)

Case: must be able to be opened

Gigabyte Aero 14 opened (NVIDIA 1060 version)

BIOS: must be able to edit more than what we get in a Surface Pro

Holy moly Surface Pro 1 BIOS is only this!

Battery life: at least 8.5 hours of battery life doing web browsing / work in Google Chrome

Dead battery by CollegeHumor

Weight: less than 2kg, less than 2.5kg with charger

Acer Predator 21X by linustechtips

Fan noise: next to none in any available silent mode after undervolting, manual throttling, CPU/GPU repasting, etc.

“FIX THE NOISE” (PS4 Fan Noise)

If you read the notebookcheck review of my laptop, you will find it ticks every box I listed as mandatory.
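The "~4GBps" figure quoted for an NVMe drive on 4 PCIe lanes in the list above can be recovered with a quick calculation. This sketch assumes a PCIe 3.0 link (8 GT/s per lane with 128b/130b encoding), which the post does not state explicitly. Protocol overhead is ignored, so real drives land lower.

```python
# Assumption: PCIe 3.0 link (not stated in the post).
GIGATRANSFERS_PER_SECOND = 8.0   # raw signalling rate per lane, in GT/s
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 4

# One transfer carries one bit per lane; divide by 8 to get bytes.
per_lane_gb_per_s = GIGATRANSFERS_PER_SECOND * ENCODING_EFFICIENCY / 8
link_gb_per_s = per_lane_gb_per_s * LANES

print(f"Per lane: {per_lane_gb_per_s:.3f} GB/s")  # Per lane: 0.985 GB/s
print(f"x{LANES} link: {link_gb_per_s:.2f} GB/s")  # x4 link: 3.94 GB/s
```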

Shopping Online and Refunds

Before settling on my laptop, I went through many poor laptop choices which fit nearly all my needs.

What did I try?

The laptops I tried (non-exhaustive list) include:

Razer Blade Stealth 4K 12.5", i7–7500U, 16GB RAM, 512GB SSD, which I refunded because the coil noise (on both the charger and the laptop) was driving me nuts even after the “BIOS update” which was supposed to fix it (hint: it just does some software tweaks but will not fix the coil noise)

Razer Blade Stealth: does it suck? (YouTube) — yes it does, far from a “Windows Macbook”

Apple Macbook Pro 13, 16GB RAM, 512GB SSD, which was delivered with the wrong keyboard (good luck doing development tasks using a French Apple keyboard!!!), with a small scratch on the back of the screen (small enough that at an Apple Store they could not even find it without looking carefully)

Apple Macbook Pro 13? — the keyboard!!!!!

Dell XPS 13, non Iris Plus version: screw this laptop. I returned it before Dell started to refuse returns due to coil noise (they preferred trying to keep my money and sending me a technician under their Premium Support; yes, not the ProSupport, it would have ended up like this)

Dell XPS 13 does not fit all my needs but it is better than nothing

Dell XPS 13, Iris Plus version, 8GB RAM: except the RAM issue (need more…), it could fit my needs if I could even order it…

A stronger Dell but could not even order it?! Only 8GB RAM though.

HP Spectre x2, Iris Plus, 16GB RAM: this laptop heats up very quickly and has poor battery life (4h or even less?), not recommended

HP Spectre x2 is just a toaster.

HP Spectre x360, 16GB RAM, 15" version: you just installed a private jet at home

HP Spectre x360 (15", with GPU) is the same as having a private jet at home

Dealing with Amazon, Apple, HP, Dell, etc.

I am only reporting my own experience dealing with Amazon, Apple, HP, and Dell. Note that it may apply to France only.

Amazon France (top tier support/behavior):

Delivery: same-day delivery (19h–22h delivery), which is perfect when you work during the day
Support hours: I could contact support from 6h to midnight, which is again perfect when you work during the day
Support behavior: so far I did not see any customer support which beats Amazon
Behavior towards laptops: laptops must be wiped before sending, and they allow you to fully wipe the drives before sending back the laptops (perfect for privacy-minded users)
Returns: print a prepaid paper, paste it on the original Amazon package, and send it for free (approximately 2 weeks to get refunded, which is fast)

Full format of drives!

Apple:

Delivery: configure-to-order (CTO) laptops take 2 weeks, but you are told the delivery day and hour by email/text
Support hours: any time an Apple Store is open
Support behavior: they listen first then they ask (appropriate) questions
Behavior towards laptops: as I did not use my laptop, no idea whether we should wipe the drives or not
Returns: in Apple Stores, takes 2 days ONLY to get the money back, otherwise same as Amazon

Apple Store is best

HP France:

Delivery: 2 days to 1 week, no control over when the delivery happens (have fun if you work, because you will struggle very hard to get your package)
Support hours: did not have to deal with them for a refund
Support behavior: unknown
Behavior towards laptops: unknown, but I wiped the drives before sending them back (no issue for refund)
Returns: same as Amazon

(HP got no physical shops in France?!)

Dell France: holy moly when trying to pay using PayPal (if you read French, read the Le Hollandais Volant post):

You need an ID
You need proof of residence for the delivery address
They attempt to charge you 20% more than what you were initially charged (you can view this as double VAT for payment authorization)
They say they tried to call you, when they NEVER attempted it

They want your ID card and proof that you live at the delivery address: good luck trying to send gifts, for instance

The whole block of text:

Dell — Internal Use — Confidential

Dear customer,

At Dell we strive to ensure the integrity of bank card transactions in order to protect our customers. We verify orders to validate the payment details.

Your order has been verified, but unfortunately we were unable to reach you at the phone numbers you provided when ordering. Consequently, we are obliged to cancel your order. The missing document is a proof of residence at the delivery address dated less than three months ago (latest EDF, France Télécom, or mobile phone bill) + a copy of an identity document of the credit card holder, or a Kbis. Upon receipt of the requested documents, your order will be validated and will go into production. Please send us these documents as soon as possible by email to the following address: SER_CC_Validation@dell.com. We apologize for the inconvenience we are causing you, but rest assured of our particular attention following your reply.

I would just put Dell in a blacklist and try to get their laptops through Amazon FR or Amazon DE.

Synthetic Benchmarks

Here, we will take our machines and make them fight against each other in benchmarks. We are taking useful (comparable) fighting cases for our machines.

What are we testing?

We are going to use three benchmarks:

Cinebench R11.5 and R15, on CPU: get the magnitude of difference between a laptop and a powerful server
Cinebench R11.5 and R15, on GPU: how powerful is our NVIDIA GTX 1050 Ti against Intel HD Graphics?
AS SSD: how crazy can the Samsung 960 Evo 1TB be?

What are we playing against?

CPU modern warfare: 28c/56t monster versus small machines

My Gigabyte Aero 14 is going to be tested against several machines, in the following increasing order of performance:

Acer Aspire 13 S5–371: i7–7500U (2c/4t, 3.5/3.5GHz), Intel HD Graphics 620
Gigabyte Aero 14, near-silent wattage (33dB): i7–7700HQ (4c/8t, 2.3/3.6GHz), NVIDIA GTX 1050 Ti 4GB
Gigabyte Aero 14, “Gaming” fans mode (37dB): i7–7700HQ (4c/8t, 3.5/3.9GHz), NVIDIA GTX 1050 Ti 4GB
Workstation: i7–7700 (4c/8t, 4.0/4.2GHz)
Server 1: Dual Quanta Freedom Ivy Bridge (2x 10c/20t, 2.7/3.3GHz)
Server 2: Dual Xeon E5–2697v3 (2x 14c/28t, 3.1(2.9)/3.5GHz)

Server 2 cannot sustain 3.1GHz because doing so exceeds its power limit (145W), so it throttles down to 2.9GHz (which is still higher than its base clock).

Extra additions

I applied a stronger undervolt but it did not improve anything (it actually consumed more watts at the wall!). Lowering the undervolt by 10mV was beneficial on the CPU and the integrated GPU (not the dedicated GPU).

The following was applied on our Gigabyte Aero 14 after our benchmarks, and it did not change any results (other than making the laptop less loud):

Thermal Paste: Thermal Grizzly Kryonaut
CPU undervolt: -130mV
GPU undervolt: -150mV

We also did the following on our Acer Aspire 13 S5–371:

CPU undervolt: -90mV
GPU undervolt: -90mV
Turbo Boost Power Max: 25W

Cinebench R15

As expected, our Aero 14 is getting crushed by our workstation and servers, but it very easily beats old Intel CPUs (in single-thread) and our ultra low voltage laptop (which is twice as slow for multithreaded tasks).

If CPU performance matters, don’t buy a laptop: purchase or rent a server. A desktop with a simple i7–7700K will not be up to the task, as you can get an overclockable Intel i7–7820HK CPU in a laptop under 3kg (check for Clevo chassis laptops).

Screenshot of results:

Cinebench R15: Acer Aspire (i7–7500U), Workstation (i7–7700), 20 core Quanta Freedom, 28 core Xeon E5–2697v3, Aero 14, Aero 14 Throttled

Cinebench R11.5

The Gigabyte Aero 14 is doing fairly well, as on Cinebench R15, despite being only a quad core mobile CPU. When it comes to GPU, the NVIDIA GTX 1050 Ti just crushes our Intel HD Graphics 620.

Screenshot of results:

Cinebench R11.5: Acer Aspire (i7–7500U), Workstation (i7–7700), 20 core Quanta Freedom, 28 core Xeon E5–2697v3, Aero 14, Aero 14 Throttled

AS SSD Scores

The Samsung 960 Evo makes every non-NVMe SSD look like an idiot. Cheap price, high capacity: what more do you expect for €450? (for a Samsung 960 Evo 1TB, supposing we ignore the Samsung 960 Pro 1TB at €600)

However, when the price of the laptop is €2,000, finding such a Transcend MTS800 SSD (which is also only 256GB) inside is unacceptable. I’ll contact Gigabyte to find out their point of view on this, as we all know we want those Samsung PM961 inside and not that poor Transcend SSD which cannot even go faster than SATA SSD speeds.

AS SSD General Results (MBps)

Just so you can check how the Samsung 960 Evo crushes the competition, and how bad the factory Transcend MTS800 is.

Screenshot of results:

AS SSD Benchmark: Acer Aspire, Workstation, 20 core Quanta Freedom, 28 core Xeon E5–2697v3, Aero 14 (Samsung 960 Evo), Aero 14 (Transcend)

AS SSD General Results (IOPS)

Just so you can check how the Samsung 960 Evo crushes the competition, and how bad the factory Transcend MTS800 is.

Screenshot of results:

AS-SSD Benchmark: Acer Aspire, Workstation, 20 core Quanta Freedom, 28 core Xeon E5–2697v3, Aero 14 (Samsung 960 Evo), Aero 14 (Transcend)

AS SSD Copy Results

The Transcend MTS800 is getting crushed by a Crucial MX300.

Screenshot of results:

AS-SSD Benchmark: Acer Aspire, Workstation, 20 core Quanta Freedom, 28 core Xeon E5–2697v3, Aero 14 (Samsung 960 Evo), Aero 14 (Transcend)

AS SSD Compression Results

How fast is the Samsung 960 Evo compressing data? Too fast!

Screenshot of results:

AS-SSD Benchmark: Acer Aspire, Workstation, 20 core Quanta Freedom, 28 core Xeon E5–2697v3, Aero 14 (Samsung 960 Evo), Aero 14 (Transcend)

Real world usage of the laptop

I have had my Gigabyte Aero 14 since last week, and so far I am very happy when it comes to the performance, noise, keyboard, screen, and battery. I did hit one major issue: drivers, and random crashes due to (old) NVIDIA drivers.

When opening the box, two DVDs are provided in addition to the laptop and the USB / Ethernet cable:

7GB DVD for drivers
Cyberlink PowerDVD 12

The major issue: drivers

First of all, this major driver issue could have been even worse if Gigabyte had not provided a 7GB DVD with all drivers and the 17GB backup “GIGAWIN10RC” USB (you need to make the USB yourself, but you get prompted right at the beginning to do it).

There is some Windows 8.1 stuff inside?!

Second, I am using the laptop in an unsupported scenario: Windows 8.1 Update 3 (however, they do have drivers for Windows 7 even when using “unsupported” Kaby Lake CPUs which is very rare!!!).

Broken hopes for finding a Windows 8.1 image =(

Third, all installable software is very easy to install.

Just click everywhere to install (for instance, it says I do not have Thunderbolt drivers installed at that time, for Bluetooth drivers it is bugged)

Bundled software — did not install Intel RST Premium as I use AHCI and Samsung 960 Evo drivers

When it comes to installing GPU drivers, this is another story:

Intel HD Graphics drivers cannot be installed on Windows 8.1 without some small hacks
NVIDIA drivers cannot be installed without installing Intel HD Graphics drivers first

Solution: do some Intel HD Graphics driver hacking (there are guides online), install the drivers, and install NVIDIA drivers afterwards. This solution alone would put off anyone who does not want to get into driver hacking.

oh my gawd: Windows 8.1 + Kaby Lake!!!

Performance and Noise

There is nothing wrong with the performance of this laptop, except some weird and buggy behavior of the laptop graphics cards:

Let me use PhysX using my NVIDIA GTX 1050 Ti please…

When it comes to the noise, the only issue is when putting the fans into Gaming Mode. I tend to use Quiet Mode instead, although it throttles the CPU quite a lot for a near silent operation (33dB at full load after undervolting):

Gigabyte Smart Manager — Clicking on help brings a nice PDF which explains everything you can do in the Smart Manager

The Gigabyte Smart Manager lets you control everything on the laptop, except the Bluetooth button which is broken in Windows 8.1.

The controls are the following:

Change volume (nothing exceptional)
Mute sounds (nothing exceptional)

Change brightness / Automatic brightness (the latter can only be done here)

Power Mode (same as Control Panel > Hardware & Sound > Power Options)
Wi-Fi on/off (nothing exceptional)
Bluetooth (is broken)
Camera (nothing exceptional)

Keyboard Backlight (can also do Fn+Space)
Monitor Switch (opens the charms)
Mouse Speed (faster than going into Control Panel)
Windows Key Lock (holy moly the amount of time you might accidentally press the Windows button)
Font Setting (DPI change…)

X-Rite Pantone (calibrated display, I measured less than 2 of difference)
White Color / Blue-light Killer (nice to have)

Fan Tweaks (this one is vital as it allows you to control the fans directly, and their modes: Quiet (silent or near fully silent), Normal (not silent but not loud), Gaming (not silent to loud), Custom Auto (not silent, maximum noise allowed), Custom Fixed (permanent noise))
Smart Dashboard

Gigabyte Smart Dashboard (yes, 0 RPM fans since I’m writing this post)

As for the fans, here are the settings:

Quiet fan: 0 RPM (until you go over 60°C CPU/GPU, CPU throttles, no GPU throttle — maximum fan seems 30% on CPU, 40% on GPU)
Normal / Gaming fan base: 2167 RPM (same as 30% auto fan noise)
30% auto fan (ex-25%): 2204 RPM (32dB, you will not even notice it from far)
30% fixed fan: 2551 RPM (33dB, a bit sleepy)
40% fixed fan: 3208 RPM (35dB, maximum you will encounter in quiet mode?)
50% fixed fan: 3660 RPM (37dB, your occasional peak)
70% fixed fan: 4647 RPM (you can feel the wind from your seat, 42dB)
100% fixed fan: 5615 RPM (oh my god you have a private jet at home, 50dB)
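The fixed-fan measurements above are easier to compare as data. The sketch below stores them as (RPM, dB) pairs copied from the list and looks up the fastest fixed setting that stays under a noise budget. `loudest_setting_under` is an illustrative helper name, not something from Gigabyte Smart Manager.

```python
# Fixed fan settings from the list above: percent -> (RPM, noise in dB).
FIXED_FAN = {
    30: (2551, 33),
    40: (3208, 35),
    50: (3660, 37),
    70: (4647, 42),
    100: (5615, 50),
}


def loudest_setting_under(max_db: float):
    """Return the highest fixed fan percentage whose noise is <= max_db.

    Returns None when even the lowest fixed setting is too loud.
    """
    allowed = [pct for pct, (_rpm, db) in FIXED_FAN.items() if db <= max_db]
    return max(allowed) if allowed else None


for budget in (32, 35, 40):
    pct = loudest_setting_under(budget)
    if pct is None:
        print(f"{budget} dB: no fixed setting is quiet enough")
    else:
        rpm, db = FIXED_FAN[pct]
        print(f"{budget} dB: {pct}% fixed fan ({rpm} RPM, {db} dB)")
```

With these numbers, a 35 dB budget lands on the 40% fixed setting, which matches the "maximum you will encounter in quiet mode?" remark.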

The laptop is full aluminium with a bit of plastic (a lot of plastic for the screen and the ports cough cough), and gets warm quickly if you stress the CPU and GPU a lot. In addition, the fans blow upwards onto the monitor, which might be unusual if you are not used to it.

When opening the case, due to how tight everything is, you may think you are exploding the laptop with the cracking noises.

Laptop Life

This laptop is a gem as it combines the following:

Reported battery life of the Gigabyte Aero 14 on notebookcheck

Not so bad GPU (NVIDIA 1050 Ti)
Small weight / high portability: 14", less than 2kg (1.9kg)
High performance mobile CPU: i7–7700HQ
Long battery life: I usually hit 10h to 11h battery life
High quality calibrated screen
Two drives with 4x PCIe
Mechanical keyboard and Macro keys

As for the keyboard, it feels close to a mechanical keyboard and will hurt anyone who uses only membrane keyboards.

I do not recommend this laptop to those who are not used to typing on mechanical keyboards, because they will get tired very quickly. However, if they keep at it, they will be rewarded with the Macro keys, which allow you to perform an action / series of actions automatically on a single key press.

Macro keys, when empty. You can record up to 88 macros (out of 100).

We can also switch the macro used by pressing the “G” button, effectively changing the column of the macros used (there are 5 columns with their respective colors, visible on the keyboard).

Note that when holding the laptop, it is common to cut the skin of your hands due to how sharp the edges of the case can be. The solution is to stop trying to be MacGyver and to learn to pick up and hold the laptop properly.

For the fans, as long as you use the Quiet mode, you will very rarely encounter the fan noise. Even watching YouTube videos does not trigger the fans.

Conclusion

This laptop is very expensive: it is a luxury for users who need to make the most out of their laptop.

Do not purchase this laptop if you are looking to do a single thing, because this laptop fits the following type of user:

https://medium.com/media/7e440bfffed6106651bef3ff7394de99/href

tl;dr: one size fits all, jack of all trades

Checklist:

Portable laptop (light, small screen, noiseless)
High quality matte screen (QHD IPS, 2560x1440, well placed webcam)
High battery life (8h+)
Many ports (USB 3, HDMI, mini Display Port, Thunderbolt 3, SD card)
Acceptable performance both on CPU (4 cores) and GPU (not Intel)
Self-serviceable / upgradable (2x DDR4–2400 RAM, 2x NVMe M.2 2280 storage)

If you do not need one of those, you can get an equivalent of that “Gigabyte Aero 14” for half the price (don’t say Dell XPS 15 is the answer, its noise is insanity).

What else would you need?

Oh wait. If you click the touchpad while the laptop is off, you can get an idea of how much battery you have left. Perfect for travelers.

Gigabyte Aero 14 review & benchmarks: laptop versus servers was originally published in Data Science & Design on Medium, where people are continuing the conversation by highlighting and responding to this story.

Is KVM virtualization slowing down CPU computations?

Laurae — Thu, 31 Aug 2017 10:12:06 GMT

When it comes to pure CPU computations, KVM does a great job at providing maximum performance when tuned properly. And when it comes to raw CPU performance, Cinebench R15 is just one of the best benchmarking tools: CPU-bound, slightly RAM dependent.

Not properly tuning CPU pinning and NUMA nodes may lower the Cinebench R15 scores by about 10 to 20% (Geekbench 4 may also get… 70% lower scores when it comes to using hugepages).

We will take our 20 core machine as the baseline:

CPU topology of our 20 core server

Quanta Freedom Ivy Bridge (2x 10c/20t, 3.1/2.7GHz) with aircooling
96GB RAM
2x 525GB SSDs
Host machine: Ubuntu 16.04 with stock kernel
Virtual machine: Windows Server 2012 R2 Datacenter, using KVM, with replicated host topology (2 sockets, 10 cores, 2 threads)
Baremetal machine: Windows Server 2008 R2 SP1 Datacenter, without Hyper-V role
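CPU pinning for a replicated topology like the one above (2 sockets, 10 cores, 2 threads) is usually expressed in libvirt as `<vcpupin>` entries inside `<cputune>`. The sketch below only generates those lines; it is not the configuration used for this benchmark. It assumes a host numbering where logical CPUs 0-19 are the first hardware thread of each core (0-9 on the first socket) and 20-39 are their hyperthread siblings. This must be checked on the actual host (for example with `lscpu -e`) before use.

```python
def vcpu_pins(sockets: int = 2, cores: int = 10, threads: int = 2):
    """Map each guest vCPU to one host logical CPU.

    Assumptions (verify on your host): guest vCPUs are enumerated thread by
    thread within each core, and host CPU `c + t * total_cores` is thread `t`
    of physical core `c`.
    """
    total_cores = sockets * cores
    pins = []
    for vcpu in range(total_cores * threads):
        core, thread = divmod(vcpu, threads)
        pins.append((vcpu, core + thread * total_cores))
    return pins


def to_cputune_xml(pins) -> str:
    """Render the pins as a libvirt <cputune> fragment."""
    lines = ["<cputune>"]
    lines += [f"  <vcpupin vcpu='{v}' cpuset='{h}'/>" for v, h in pins]
    lines.append("</cputune>")
    return "\n".join(lines)


print(to_cputune_xml(vcpu_pins()))
```

Under these assumptions, the two guest threads of a core land on the two hardware threads of one physical core, so the guest scheduler sees the same sibling relationships as the host.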

Cosmetic differences may be brutal for those who did not use Windows XP or Windows 7 for a while:

Linux htop, pretty clear

Windows XP (Windows 7 without Aero), and Windows 7 Task Managers

Windows 8 Task Manager

You may try to find out (without looking at the operating system…) which one is the virtual machine below:

The difference is unnoticeable, and the runs shown here are the median runs (11 runs, 6th best run). The benchmark ranged approximately from 2270 to 2310 on both machines.
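For an odd number of runs, the median is simply the middle value once the scores are sorted, hence "11 runs, 6th best run". A minimal sketch with made-up scores in the 2270 to 2310 range mentioned above (these are placeholders, not the measured results):

```python
import statistics

# Placeholder Cinebench R15 scores, NOT the measured ones: 11 runs
# spread over the 2270 to 2310 range mentioned in the text.
runs = [2291, 2270, 2305, 2288, 2310, 2279, 2296, 2284, 2301, 2275, 2293]

best_first = sorted(runs, reverse=True)
sixth_best = best_first[5]  # index 5 is the 6th best of 11 runs

print(sixth_best)               # 2291
print(statistics.median(runs))  # 2291
```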

Conclusion: KVM does not affect the CPU performance of your machine on CPU-bound tasks, provided your virtualization setup is correct.

Is KVM virtualization slowing down CPU computations? was originally published in Data Science & Design on Medium, where people are continuing the conversation by highlighting and responding to this story.

LightGBM on Windows: Visual Studio vs MinGW (gcc), R with Visual Studio

Laurae — Sat, 10 Jun 2017 12:21:16 GMT

Thinking of using LightGBM on Windows? You know you are given two hard choices: Visual Studio or MinGW (gcc).

Visual Studio 2017 alone is a whopping 2GB, excluding external dependencies.

But everyone knows Visual Studio is a pain to install. Even the Microsoft Build toolset does not alleviate the pain of having a large download to do before even being able to compile something.

Even though the installation is about 2GB for Visual Studio 2017 (because you may want the GUI to test R/Python integration after all), it is significantly better than the previous 8GB for Visual Studio 2015!

Meanwhile, with MinGW (x86_64-posix-seh, aka 64-bit + posix threads + SEH exception handling), a simple 50MB file to download and extract makes life easier!

MinGW x86_64-posix-seh is big? Think again.

But are you losing something when using MinGW and going the “easy way”? This is what we are going to check (quickly)…

What is sparking the need to check for Visual Studio vs MinGW?

I think you will understand visually, there is no need to explain.

MinGW/gcc (left) vs Visual Studio (right): different CPU usage under the same settings, with only one difference: the compiler?

It becomes obvious from this comparison picture that we have a major issue with MinGW/gcc: the CPUs are not busy enough on large datasets, while Visual Studio keeps all cores busy!

Some benchmark comparisons of Visual Studio and MinGW

You can find all detailed benchmarks on the following links:

GitHub: Microsoft/LightGBM#542 (Visual Studio reports higher CPU usage than MinGW)
GitHub: Laurae2/gbt_benchmarks#1 (some questions)

Laptop benchmark (2 physical cores)

My main laptop has an i7–4600U CPU with 16GB RAM. We can check very quickly its performance on the Bosch dataset (1M observations and 1K features), which fits nicely in our RAM.

We are testing LightGBM under the following scenarios:

Visual Studio 2017 on CLI (master)
MinGW 7.1 on R (master and v2.0)
MinGW 7.1 on CLI (master and v2.0)

Unexpectedly, Visual Studio is slower than MinGW. For a small number of threads, it seems MinGW is better (even with R callback and processing overhead) than Visual Studio.

When comparing CLIs (Visual Studio and MinGW), the difference is a well-sized 5%.

R overhead is approximately 3% of the computation time.

Server benchmark (20 physical cores)

My main server has a Dual Xeon Ivy Bridge (Quanta Freedom) with 80GB RAM allocated to a virtual machine. Performance checkup is done again on Bosch dataset.

We quickly notice that the more threads we throw at it, the more performance we get. The difference is so heavy that it reaches:

15% worse for not using hyperthreaded cores
Up to 40% worse for using MinGW and not hyperthreaded cores instead of Visual Studio with hyperthreaded cores

It is obvious who is the winner here: Visual Studio.

Versus xgboost?

Just for the eyes, obviously, using my laptop with 2 physical cores (4 threads):

xgboost (fast histogram) has bridged the performance gap with LightGBM. They are only 5% apart in this case.

Conclusion

A quick conclusion could be the following:

Windows users should use MinGW for LightGBM when they are using low-end machines, such as laptops with 2 cores only. When reaching more cores (like 4 physical cores), it is recommended to use Visual Studio to reach maximum performance.

This is the reason the pull request “Compile R package by custom tool chain” exists: if you have a high performance tool, then make sure you are using that high performance to its fullest! It means in our case: compile with Visual Studio, but use in R.

“I have no idea what I’m doing” meme

Apparently, it also eases the installation, especially for Mac OS users.

If you do not know what you are doing, use Visual Studio.

This is as simple as doing a simple math addition: set up your PATH environment variable correctly!
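Since the choice of compiler depends on what is reachable through PATH, a quick check can save some confusion. The sketch below uses Python's standard `shutil.which` to report which of a few tool names resolve. The names checked (`cl`, `g++`, `cmake`) are assumptions about a typical Visual Studio or MinGW setup, not a list taken from the LightGBM documentation.

```python
import shutil


def find_tools(names=("cl", "g++", "cmake")):
    """Return {tool name: full path or None} based on the current PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool:>6}: {path or 'not found on PATH'}")
```

If a name you expect comes back as "not found", fixing PATH before building is likely to be quicker than debugging a failed compilation afterwards.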

LightGBM on Windows: Visual Studio vs MinGW (gcc), R with Visual Studio was originally published in Data Science & Design on Medium, where people are continuing the conversation by highlighting and responding to this story.

Interview: AMD Ryzen as a workstation

Laurae — Thu, 25 May 2017 13:48:34 GMT

Today, we got an interview with drverzal about a brand new AMD Ryzen rig!

Is AMD Ryzen a good CPU? What can you do with it? What should you expect from it? We will have in a separate post a comparison against i7–7700K to check the performance difference!

Interview Questions

We are going to ask some questions to our new AMD Ryzen owner…!

What are your new rig's specs?

CPU: AMD Ryzen 1700
GPU: NVIDIA 1080 Ti
Motherboard: ASUS ROG Crosshair VI Hero
RAM: G.SKILL TridentZ RGB Series 32GB (4 x 8GB) 288-Pin DDR4 SDRAM DDR4 2400
CPU Cooling: NZXT Kraken X62
Hard Drive: M.2 Samsung EVO 960

What are you aiming to do with your rig?

I wanted my rig to be well-rounded. It’s not a monstrous computing machine, but it’s sufficient for exploring small to medium sized models in a timely manner.

Why choose AMD instead of Intel for this specific rig?

From all the research that I did, the price-to-performance ratio of the Ryzen CPUs was astounding in multithreaded applications. Thus far, my benchmarks seem to agree.

drverzal results with a stock AMD Ryzen 7 1700 on Laurae’s xgboost benchmarks: 247 seconds on exact with 16 threads (8% faster than an overclocked i7–7700K at 5.0GHz), 400 seconds on fast with 6 threads (100% slower than an overclocked i7–7700K at 5.0GHz).

See here for the detailed benchmarks: Benchmarking xgboost with and without virtualization

What would be equivalent pricing using Intel-only CPUs?

The two chips nearest my scores come in at $613 and $660, whereas the Ryzen 1700 came in at $330.

AMD Ryzen 1800X + 32GB RAM with 1Gbps network setups are available online to rent for $79.00/month. Is it better to own your desktop than to rent one?

I utilize my machine for much more than only machine learning. I produce/write music where I utilize Ableton as my DAW (Digital Audio Workstation) and I enjoy a video game or two. To me, it was a better decision to buy.

Advantages/Disadvantages of choosing AMD instead of Intel?

The biggest advantage is undoubtedly the price-to-performance, with a large caveat, for multithreaded applications!

Intel still takes the cake in single threaded performance.

What was your previous rig?

CPU: AMD FX 8350
GPU: RX 480
RAM: 8GB

Which would you prefer for data science: a desktop or a laptop?

Personally, I’d go with a desktop. There’s nothing better to me than having a nice home office. That and having a multi-monitor setup is always enjoyable!

How do AMD CPUs compare against Intel CPUs in raw performance?

Cinebench R15:

Singlethread: 127
Multithread: 1378

Screenshot for Cinebench R15 score with AMD Ryzen 7 1700

Comparison: an i7–7700K hits “only” 1000 in multithread.

Did you overclock? How far did you go? Was it easy?

That was easy, 4.0 GHz on 8 cores!

Comparison: i7–7700K can reach 5.0 GHz on 4 cores, but many reports online are showing dead CPUs.

How do AMD CPUs compare to your previous rig for raw machine learning speed?

To compare scores: Linus Tech Tips Cinebench R15 score list

Looking through the list of Cinebench scores from above, the highest overclock FX 8350 came in at #348 with a score of 842. The stock 1700 came in at 1378.

There are two added benefits I’m feeling so far:

One, small models are nearly instant and allow me to work without interruption;
Two, the reduction in training time for larger models means I’m not waiting on results as long.

Basically, I get results faster and more importantly, I can learn faster.

Opinion on AMD CPUs = too fast too furious?

I’m loving the Ryzen chip so far. As you might be able to tell from above, I’ve been supporting AMD for quite some time now.

I’m a big fan of keeping competition in there to keep Intel honest. It makes it a lot easier to support Team Red when they’re putting out killer products like the Ryzen lineup.

What about RAM? Are you a multiprocessing or multithreading user?

I upped to 32GB of RAM for training larger models. I’m typically not building models on anything much larger than this at home.

ANY COMPUTER PICS? FLASHY COLORS? VROOM VROOM NOISE? OVERCLOCKING MANIAC? WATERCOOLING MAGICIAN?

I’m still waiting on the mounting bracket for my water cooler :*(


Benchmarking xgboost with and without virtualization

Laurae — Thu, 25 May 2017 13:19:32 GMT

We have seen previously that xgboost had a new fast histogram method leading to blazing performance. All our tests were done in a virtualized environment. What if we compare it in the most unfair scenario?

Virtualized machine: Linux host, KVM virtualization, Windows client
Baremetal machine: Linux

This is what we are going to do. We have access to two extra machines, thanks to Yifan Xie (Intel machine) and drverzal (AMD machine) who helped for the benchmarking of xgboost exact and fast histogram:

Intel i7–7700K overclocked 5.0/4.7GHz, 64GB RAM, baremetal Linux
AMD Ryzen 7 1700 3.7/3.2GHz, 16GB RAM, baremetal Windows

Benchmarking

We are going to use the following to benchmark the three machines:

xgboost Exact

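library(xgboost)
library(R.utils)  # provides System$currentTimeMillis()

# xgb_data is the xgb.DMatrix holding the Bosch data; i is the thread count under test
gc(verbose = FALSE)  # free memory before timing
set.seed(11111)
StartTime <- System$currentTimeMillis()  # wall-clock start, in milliseconds
temp_model <- xgb.train(data = xgb_data,
                        nthread = i,
                        nrounds = 50,
                        max_leaves = 255,
                        max_depth = 6,
                        eta = 0.20,
                        tree_method = "exact",
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)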

xgboost Fast Histogram (old version)

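library(xgboost)
library(R.utils)  # provides System$currentTimeMillis()

# xgb_data is the xgb.DMatrix holding the Bosch data; i is the thread count under test
gc(verbose = FALSE)  # free memory before timing
set.seed(11111)
StartTime <- System$currentTimeMillis()  # wall-clock start, in milliseconds
temp_model <- xgb.train(data = xgb_data,
                        nthread = i,
                        nrounds = 200,
                        max_leaves = 255,
                        max_depth = 12,
                        eta = 0.05,
                        tree_method = "hist",
                        max_bin = 255,
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)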

Our xgboost tests consist of one training run with each of the parameter sets above on the numeric Bosch full dataset (1,183,747 observations, 969 features, unbalanced dataset with only 6,879 positive cases).
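The snippets above record a start time but do not show how the elapsed time per thread count is derived. Here is a minimal sketch of such a timing loop, not the author's actual harness: time_run and dummy_training are hypothetical names, the workload is a stand-in for xgb.train, the 1:4 thread sweep is an assumption, and base R's proc.time() replaces the R.utils timer.

```r
# Hypothetical helper: run a workload once and return wall-clock seconds.
time_run <- function(workload) {
  start <- proc.time()[["elapsed"]]
  workload()
  proc.time()[["elapsed"]] - start
}

# Stand-in workload (assumption); a real benchmark would call
# xgb.train(..., nthread = i) here instead.
dummy_training <- function(i) {
  function() invisible(sum(sqrt(seq_len(1e5))))
}

threads <- 1:4  # assumed sweep of thread counts
timings <- data.frame(
  nthread = threads,
  seconds = vapply(threads,
                   function(i) time_run(dummy_training(i)),
                   numeric(1))
)
print(timings)
```

Collecting one row per thread count is enough to build the per-thread charts that follow.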

Think it is hard to compile xgboost? Not at all:

devtools::install_github("Laurae2/ez_xgb/R-package@2017-02-15-v1")

Exact xgboost

tl;dr: baremetal wins.

Normalization per thread comparison:

Baremetal is faster overall.
AMD Ryzen is slower overall.
If we use AMD's simultaneous multithreading (SMT, the counterpart of Intel's hyperthreading), our virtualized Intel machine gets smoked (in fact, the baremetal machine also gets smoked).

Cumulated Normalization per thread comparison:

Seen in this cumulated way, AMD is not that slow.
In fact, we would expect Intel to do much better: a 47% higher clock for an approximately 30% faster average time is clearly not that efficient.

Detailed Data Chart:

Need details? Ranking is obvious: Baremetal Intel (Linux) > Virtualized Intel (Windows) > Baremetal AMD (Windows)

Fast Histogram xgboost

tl;dr: gcc 7.1 wins.

Normalization per thread comparison:

With fast histogram, the GHz showoff starts. A 35% higher clock rate for nearly 50% higher speed, isn’t it marvelous? (singlethread performance)
AMD comes nowhere near Intel (yet). Keep in mind that if you are looking for faster training than exact xgboost, fast histogram xgboost will do it 10x to 30x faster (or even more) on large datasets.
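The clock percentages quoted for exact (47%) and fast histogram (35%) can be recovered from the two clock figures listed for each machine. A small sketch follows; the names first and second simply follow the order of the "5.0/4.7GHz" and "3.7/3.2GHz" notation, and the speed gains of 30% and 50% are the approximate figures from the text, not measured here.

```r
# Clock figures as listed for each machine, in GHz.
intel_clock <- c(first = 5.0, second = 4.7)  # i7-7700K, overclocked
amd_clock   <- c(first = 3.7, second = 3.2)  # Ryzen 7 1700

# Intel clock advantage, in percent.
clock_advantage <- round((intel_clock / amd_clock - 1) * 100)
print(clock_advantage)

# Speed gained per unit of clock gained (approximate gains from the text).
gain_per_clock <- c(
  exact = 1.30 / (4.7 / 3.2),  # ~30% faster for ~47% more clock
  hist  = 1.50 / (5.0 / 3.7)   # ~50% faster for ~35% more clock
)
print(round(gain_per_clock, 2))
```

A value below 1 means the speed gain is smaller than the clock gain, which is the case for exact; a value above 1 means the gain exceeds what clock alone would explain, which is the case for fast histogram.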

Cumulated Normalization per thread comparison:

I think the conclusion is very easy to draw: the advantage of a virtualized Windows with Intel vs a baremetal AMD is 2/3 of the best scenario (baremetal Linux with Intel).

Detailed Data Chart:

I don’t think you can complain about doing 200 training iterations on Bosch in only 400 seconds (or less) these days.
Remember we are talking about training on 1,147,050,843 elements, and even 90% sparsity would still leave over a hundred million elements.
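The element count is simply rows times features, using the Bosch dimensions given earlier. A quick sketch, where the 90% sparsity is the hypothetical figure from the text and not a measured property of the dataset:

```r
rows     <- 1183747  # Bosch observations
features <- 969      # Bosch numeric features

elements <- rows * features
print(format(elements, big.mark = ","))

# Elements left if 90% of the matrix were empty (hypothetical sparsity).
remaining <- elements * (1 - 0.90)
print(format(round(remaining), big.mark = ","))
```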

Conclusion

Some simple key takeaways:

Use a baremetal machine if you want maximum performance. It does not really matter whether you want Linux or Windows, you already have plenty of performance.
Fast histogram xgboost already delivers plenty of performance, and you can get even more with the new fast histogram!
Throwing more cores at it is not ideal for fast histogram xgboost, while exact xgboost likes getting more cores.
When comparing two algorithms, take the same baseline. The charts shown before do not take into account the difference in the number of training iterations. You are actually doing 4 times more iterations with fast histogram xgboost than with exact xgboost, thus getting to the point of convergence. Also, the hyperparameters are RAM intensive for fast histogram xgboost (larger depth).
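One way to put both methods on a closer baseline is to divide each total time by its number of boosting rounds. The sketch below uses the stock Ryzen 7 1700 totals reported in the interview post (247 seconds for exact, 400 seconds for fast histogram) and the round counts from the parameter sets. It is only an illustration: the two runs also differ in depth and learning rate, so time per round is still not a like-for-like measure.

```r
total_seconds <- c(exact = 247, hist = 400)  # reported Ryzen 7 1700 totals
rounds        <- c(exact = 50,  hist = 200)  # nrounds in each parameter set

per_round <- total_seconds / rounds
print(per_round)

# How many times longer one exact round takes than one fast histogram round.
print(per_round[["exact"]] / per_round[["hist"]])
```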

Previous post in this series:


Benchmarking new xgboost fast histogram: xgboost and the compiler story

Laurae — Sun, 14 May 2017 16:07:06 GMT

We have seen previously that the new xgboost fast histogram method had an issue: it was awfully slow. But we fixed it. By recompiling R with gcc 7.1.

What do you call someone compiling R from scratch in Windows?

Compiling R was tough, but I now have an executable I can use on all my servers to deploy R with gcc 7.1 without any issue:

Even better, all libraries are compiled with gcc 7.1!
It makes the new xgboost fast histogram fly!

Therefore, we are going to benchmark two different things from xgboost:

xgboost old fast histogram with gcc 4.9 (Rtools) and gcc 7.1 (MinGW)
xgboost new fast histogram with gcc 7.1 (MinGW)

Comparing xgboost old fast histogram with gcc 4.9 and gcc 7.1

To compare the xgboost old fast histogram with different compilers, we will use:

R/xgboost compiled with gcc 4.9
R/xgboost compiled with gcc 7.1

And no, do not tell me to compile it with something else. It is already difficult enough to compile R in Windows.

Intel i7–3930K: gcc 4.9 vs gcc 7.1

tl;dr: gcc 7.1 wins.

Normalization per thread comparison:

gcc 7.1 is the winner overall.

Cumulated Normalization per thread comparison:

gcc 7.1 clearly wins.

Detailed Data Chart:

gcc 7.1 is the winner 11 times out of 12.

Intel i7–7700K: gcc 4.9 vs gcc 7.1*

tl;dr: gcc 4.9 wins but… (* read the conclusion before drawing your own, there was a Linux kernel version issue)

Normalization per thread comparison:

gcc 4.9 is the winner overall.

Cumulated Normalization per thread comparison:

gcc 4.9 clearly wins.

Detailed Data Chart:

gcc 4.9 is the winner 100% of times (8 out of 8).

Dual Quanta Freedom Ivy Bridge: gcc 4.9 vs gcc 7.1

tl;dr: gcc 7.1 wins.

Normalization per thread comparison:

gcc 7.1 is the winner overall.

Cumulated Normalization per thread comparison:

gcc 7.1 clearly wins.

Detailed Data Chart:

gcc 7.1 is the winner 18 times out of 20.

Conclusion about gcc and xgboost old fast histogram

i7–3930K: gcc 7.1 won
i7–7700K: gcc 4.9 won*
20 core server: gcc 7.1 won

In the case of the i7–7700K, I reinstalled the whole virtualization host machine, which means the Linux kernel also changed (4.10 for gcc 4.9, 4.9 for gcc 7.1). Running the same benchmark with kernel 4.9 and gcc 4.9 leads to gcc 7.1 winning 100% of the time.

gcc 7.1 won “all the times” against gcc 4.9.

So the real conclusion would be… gcc 7.1 “won all times” (if not losing a little bit somewhere).

Comparing xgboost fast histogram: old vs new

Now we will compare the old and new versions of xgboost fast histogram. Will the new version reign supreme? This is what we will check.

You can install the used xgboost versions using the commands below:

old xgboost fast histogram: devtools::install_github("Laurae2/ez_xgb/R-package@2017-02-15-v1")
new xgboost fast histogram: devtools::install_github("Laurae2/ez_xgb/R-package@2017-05-02-v2")

I think I will not even have to comment, results are obvious.

Intel i7–3930K: old vs new xgboost fast histogram

tl;dr: new fast histogram wins.

Normalization per thread comparison:

Cumulated Normalization per thread comparison:

Detailed Data Chart:

Intel i7–7700K: old vs new xgboost fast histogram

tl;dr: new fast histogram wins.

Normalization per thread comparison:

Cumulated Normalization per thread comparison:

Detailed Data Chart:

Dual Quanta Freedom Ivy Bridge: old vs new xgboost fast histogram

tl;dr: new fast histogram wins.

Normalization per thread comparison:

Cumulated Normalization per thread comparison:

Detailed Data Chart:

Old vs New Fast Histogram: all servers together

Need to compare the performance visually with big charts? Here you are served:

i7–7700K is just the “KING” (or the “QUEEN” if you want it that way)
The new xgboost fast histogram is just smoking everything

Clearly, going over 1 thread already provides a poor ROI (return on investment, but applied to CPU threads). For instance, on the i7–7700K, you are better off running a 4-fold cross-validation in parallel:

Parallelized cross-validation: less than 5 minutes for doing 4 parallel trainings using 1 thread each.
Sequential cross-validation: about 12 minutes for doing a training one by one using 3 threads each (assuming you found out the sweet spot).

Did you ever want to get a cross-validation speedup?

Assuming you have enough RAM, here you have it.
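The parallelized cross-validation idea can be sketched with base R's parallel package: one worker process per fold, each training with a single thread. This is a sketch under assumptions, not the author's setup. train_fold is a hypothetical stand-in that only builds the fold split; a real run would call xgb.train(..., nthread = 1) inside it, and every worker would hold its own copy of the data, hence the RAM requirement.

```r
library(parallel)

# Hypothetical stand-in for one single-threaded training run on fold k.
train_fold <- function(k, n_folds = 4, n = 1000) {
  set.seed(11111)  # same seed in every worker, so the fold split is identical
  fold_id <- sample(rep(seq_len(n_folds), length.out = n))
  train_rows <- which(fold_id != k)
  # A real run would train on train_rows with nthread = 1 here.
  length(train_rows)
}

cl <- makeCluster(4)                       # one worker process per fold
results <- parLapply(cl, 1:4, train_fold)  # the four folds run concurrently
stopCluster(cl)

print(unlist(results))
```

makeCluster starts separate R processes, so this approach also works on Windows, where fork-based mclapply is not available.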

Conclusion

VERY simple key takeaways:

New xgboost fast histogram is crushing everything.
With 1 thread, the new xgboost fast histogram is 75% faster than the old xgboost fast histogram.
gcc 7.1 is approximately 3% faster than gcc 4.9 for xgboost fast histogram.

Still using the old xgboost fast histogram? Switch to the new one!

But are you satisfied enough?

We have ONE blog post which will follow this series:

Benchmarking Baremetal Linux vs Virtualized Windows: how slow are we? AMD Ryzen showing up!

Previous post in this series:


Exact xgboost and Fast Histogram xgboost training speed comparison

Laurae — Sat, 29 Apr 2017 23:25:39 GMT

Did you ever want to “unfairly” compare Exact xgboost and Fast Histogram xgboost? Here you are served.

How unfair will our comparisons be? We are using the results from our series:

This post is mainly for the “eyes” of the reader.

Comparison Setup

Hardware Virtualization

Three servers with their best cumulated runs used:

i7–3930K, 6 cores, 12 threads, 3.9/3.5GHz, VMware virtualization
i7–7700K, 4 cores, 8 threads, 5.0/4.7GHz, KVM virtualization
Dual Quanta Freedom Ivy Bridge, 20 cores, 40 threads, 3.1/2.7GHz, KVM virtualization, NUMA fully optimized

Software Setup

Exact xgboost:

gc(verbose = FALSE)
set.seed(11111)
temp_model <- xgb.train(data = xgb_data,
                        nthread = i,
                        nrounds = 50,
                        max_leaves = 255,
                        #max_depth = 6,
                        eta = 0.20,
                        tree_method = "exact",
                        #max_bin = 255,
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)

Fast Histogram xgboost:

gc(verbose = FALSE)
set.seed(11111)
temp_model <- xgb.train(data = xgb_data,
                        nthread = i,
                        nrounds = 200,
                        max_leaves = 255,
                        max_depth = 12,
                        eta = 0.05,
                        tree_method = "hist",
                        max_bin = 255,
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)
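To see at a glance why the comparison is “unfair”, the two parameter sets can be written as lists and compared. A small sketch: the list names exact_params and hist_params are illustrative, and parameters commented out in the exact call are simply left out of its list.

```r
exact_params <- list(nrounds = 50, max_leaves = 255, eta = 0.20,
                     tree_method = "exact", booster = "gbtree",
                     objective = "binary:logistic")

hist_params <- list(nrounds = 200, max_leaves = 255, max_depth = 12,
                    eta = 0.05, tree_method = "hist", max_bin = 255,
                    booster = "gbtree", objective = "binary:logistic")

# A parameter differs if its value is not identical in both sets
# (a parameter missing from one set counts as different).
all_names <- union(names(exact_params), names(hist_params))
differs <- vapply(all_names,
                  function(p) !identical(exact_params[[p]], hist_params[[p]]),
                  logical(1))
changed <- names(differs)[differs]
print(changed)
```

Everything listed in changed varies between the two runs, so any timing difference reflects all of these choices together and not only the tree method.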

Benchmarking unfairly xgboost: Exact vs Fast Histogram

Remember you are doing the comparison for yourself and to please your mind! (or maybe you really want to compare because you want to know…)

i7–3930K: Best Runs

Unfair fast histogram xgboost is kicking exact xgboost as expected.

i7–3930K: All Runs

Unfair fast histogram xgboost is kicking exact xgboost as expected.

i7–7700K: Best Runs

Unfair fast histogram xgboost is kicking exact xgboost as expected.

i7–7700K: All Runs

Unfair fast histogram xgboost is kicking exact xgboost as expected.

Dual Xeon: Best Runs

Unfair fast histogram xgboost is kicking exact xgboost as expected.

Dual Xeon: All Runs

Unfair fast histogram xgboost is kicking exact xgboost as expected.

All Together: Best Runs

Unfair fast histogram xgboost is kicking exact xgboost as expected.

Need more? We will soon have a comparison against a baremetal Linux machine with an i7–7700K, and we will also be able to compare with an AMD Ryzen 7 1700!
