Benchmarking xgboost fast histogram: frequency versus cores, a many-core server is bad!

Laurae
Data Science & Design
16 min read · Apr 29, 2017

We saw previously that many cores help a great deal when using the xgboost exact method on large data: even an overclocked i7–7700K needs 306 seconds, versus 95 seconds on a 20 core Ivy Bridge server, which is 222% faster!

But we were assessing a specific scenario: exact xgboost. What about fast histogram, the super fast xgboost? This is what we are going to check here.

Installation Setup

Compiling and installing xgboost with fast histogram in R is straightforward. We do not even have to spend much time, as it is a one line compile & install:

devtools::install_github("Laurae2/ez_xgb/R-package@2017-02-15-v1")

Do not use the CRAN version unless you know what you are doing: some features are missing from it, especially the fast histogram method.
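If you want to make sure the build you just installed actually exposes the fast histogram method, here is a quick sanity check. This is not part of the original benchmark, just a hedged sketch using the toy agaricus dataset shipped with xgboost:

# Sanity check (illustrative): does the installed xgboost build accept tree_method = "hist"?
library(xgboost)
packageVersion("xgboost")  # check which build is actually loaded

data(agaricus.train, package = "xgboost")  # toy dataset bundled with xgboost
bst <- xgboost(data = agaricus.train$data,
               label = agaricus.train$label,
               nrounds = 2,
               tree_method = "hist",
               max_bin = 255,
               objective = "binary:logistic",
               verbose = 0)

If the call errors out on tree_method, you are most likely on the CRAN build.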

Now that we are setup, here is again a quick review of our machines:

  • i7–3930K, 6 cores, 12 threads, 3.9/3.5GHz, VMware virtualization
  • i7–7700K, 4 cores, 8 threads, 5.0/4.7GHz, KVM virtualization
  • Dual Quanta Freedom Ivy Bridge, 20 cores, 40 threads, 3.1/2.7GHz, KVM virtualization, NUMA fully optimized

We found out previously the following conclusion:

VMware: expose the number of physical cores as the number of sockets, and use 2 cores per socket to stand in for the threads (ex: an i7–3930K with 6 cores becomes a 6 sockets / 2 cores virtual server)

KVM: respect the host topology when NUMA is optimized, to squeeze maximum performance (ex: an i7–7700K with 4 cores becomes a 1 socket / 4 cores / 2 threads virtual server); a quick way to check what the guest actually exposes to R is sketched below.
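A minimal sketch of such a check, using detectCores() from the parallel package and lscpu on Linux guests (this is an illustration, not part of the original benchmark):

# What CPU layout does the guest VM expose to R?
library(parallel)
detectCores()  # number of logical CPUs visible inside the guest
# On Linux guests, lscpu reports the socket / core / thread split presented by the hypervisor
try(cat(system("lscpu | grep -E 'Socket|Core|Thread'", intern = TRUE), sep = "\n"))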

Experimental Design

We are again using xgboost on our Bosch dataset, except this time we are tuning the hyperparameters differently to let xgboost train "better", for a more realistic scenario:

  • 200 iterations instead of 50 (to make training longer!)
  • 255 leaves (twice as many nodes) instead of unlimited
  • Depth of 12 instead of 6
  • Fast histogram method instead of exact method

The R seed is independent of the operating system / installation, so reproducibility is not a problem.

As we are using the fast histogram xgboost, training does not scale linearly, and as we will quickly notice, linear scaling is simply out of reach.

In addition, we are using an older version of xgboost, because a recent xgboost version introduced a critical slowness issue. This is why we are using the "2017–02–15" version you can get directly from my GitHub repository.

As very few of us are using the fast histogram version of xgboost, let alone a recent version of xgboost, issues like these go unnoticed for a while:

  • Sparse format in Linux getting major threading issues (threading problem)
  • Dense / Sparse format in Windows getting major threading issues (who needs a 15x slower training…)

For the skeptics who told me scheduling might not be done correctly per socket, here is an 11 thread run of xgboost using the 1S/20C/2T topology with perfect scheduling:

11 thread run with perfect scheduling: 10 threads keep the 2nd socket busy (the last 20 logical cores = 10 physical cores), while the 11th thread is isolated on the 1st socket (the 8th + 9th logical cores = 1 physical core of the 1st socket)
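If you would rather not rely on the guest and host schedulers at all, the standard OpenMP affinity variables can be set before xgboost (and its OpenMP runtime) is loaded. This is not something done in this benchmark, just a hedged illustration:

# Hypothetical illustration: ask the OpenMP runtime to pin threads to physical cores.
# OMP_PROC_BIND and OMP_PLACES are standard OpenMP 4.0 environment variables;
# they must be set before the OpenMP runtime initializes, i.e. before library(xgboost).
Sys.setenv(OMP_PROC_BIND = "close",  # keep threads packed close together
           OMP_PLACES = "cores")     # one place per physical core
library(xgboost)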

Fast xgboost (per server)

We will do the same as we did for Cinebench R15, except with more details:

  • We report the timings per thread count, testing every thread count from 1 to the number of logical cores (except for our 20 core system, where we capped training at 20 threads; personal experience while preparing a meetup told me that going to 40 threads is useless).
  • We report the efficiency (core-unscaled) versus the single-threaded run.
  • We report the normalized efficiency (core-scaled) versus the single-threaded run rescaled to the frequency of a multithreaded run (because a single-threaded run runs at a higher frequency than a multithreaded run with Turbo Boost).

Our xgboost test consists of training with the following parameter set on the numeric Bosch full dataset (1,183,747 observations, 969 features, an unbalanced dataset with only 6,879 positive cases), with i being the number of threads used (the loop around this call is sketched right after the snippet):

gc(verbose = FALSE)       # clean up memory before timing
set.seed(11111)           # fixed seed for reproducibility
StartTime <- System$currentTimeMillis()   # from the R.utils package
temp_model <- xgb.train(data = xgb_data,  # Bosch numeric data as an xgb.DMatrix
                        nthread = i,      # i = number of threads for this run
                        nrounds = 200,
                        max_leaves = 255,
                        max_depth = 12,
                        eta = 0.05,
                        tree_method = "hist",   # fast histogram method
                        max_bin = 255,
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)
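For completeness, here is a minimal sketch of how the surrounding benchmark loop could look (it assumes xgb_data already holds the Bosch data as an xgb.DMatrix; this is an illustration, not the exact harness used for the measurements):

# Sketch of the outer harness: run the training once per thread count and record the time.
library(xgboost)

run_once <- function(i) {
  gc(verbose = FALSE)
  set.seed(11111)
  start <- Sys.time()
  xgb.train(data = xgb_data, nthread = i, nrounds = 200,
            max_leaves = 255, max_depth = 12, eta = 0.05,
            tree_method = "hist", max_bin = 255,
            booster = "gbtree", objective = "binary:logistic", verbose = 2)
  as.numeric(difftime(Sys.time(), start, units = "secs"))
}

timings <- sapply(seq_len(parallel::detectCores()), run_once)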

As a reference point, our theoretical performance is the following:

  • Timings: we do not have any baseline.
  • Core efficiency: we are using the single-threaded run as our baseline.
  • Normalized efficiency: we are using our scaled single-threaded run as our baseline (a rough sketch of how both metrics could be computed follows this list).
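As a rough illustration of how these two metrics could be derived, here is a hedged sketch with made-up timings and clock frequencies. The exact frequency correction used for the charts is not spelled out in this post, so treat the normalization below as my interpretation:

# Hypothetical numbers, for illustration only: not the measured results.
single_time <- 500                 # seconds for the 1 thread run
multi_time  <- c(310, 275, 250)    # seconds at 2, 3 and 4 threads
threads     <- c(2, 3, 4)
single_freq <- 5.0                 # GHz, single-core Turbo Boost
multi_freq  <- 4.7                 # GHz, all-core Turbo Boost

# Core efficiency (core-unscaled): plain speedup versus the 1 thread run, in %.
core_eff <- single_time / multi_time * 100

# Normalized efficiency (core-scaled): rescale the 1 thread baseline to the lower
# all-core frequency, then divide the speedup by the thread count (my interpretation).
scaled_single <- single_time * single_freq / multi_freq
norm_eff <- scaled_single / (multi_time * threads) * 100

data.frame(threads, multi_time, core_eff, norm_eff)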

Be warned: what follows is a very long, descriptive wall of text and pictures. Also, do not generalize this experimental design to any xgboost run: there are way too many parameters to control (and a lot of them cannot be controlled). However, you might find correlations between the different observations.

Will we be able to reach the same conclusion? As usual, we are performing a "blind" experiment, where we do not compare values directly but only one by one, sequentially, as they arrive per processor. Only when we put them together are we able to draw real conclusions.

Intel i7–3930K: 12 Sockets, 1 Core, 1 Thread

tl;dr: scales very poorly, fairly consistent.

Timings: as we can see, linear scaling is simply unfeasible: already at 2 threads we are stuck at around 500 seconds. Adding more threads provides some further help, with the peak at 4 threads (460.71 seconds). Being only 84% faster at the peak versus 1 thread is clearly inefficient. And hyperthreading provides even less performance, but this is expected when multithreading scales that poorly.

Core efficiency: this depicts exactly what we saw just before: our dataset is too small to scale well (max: 184% efficiency), and we are already hitting the performance peak without even using all the physical cores of our CPU. A simple quad core system would be maxed out using only its physical cores!

Normalized efficiency: the normalized efficiency is clearer about the reality we just saw: we get massively diminishing returns as we add more threads.

Intel i7–3930K: 6 Sockets, 2 Cores, 1 Thread

tl;dr: scales very poorly, but a bit more stable and better for 5 threads.

Timings: it seems our 5 thread run does much better than with the 12S/1C/1T topology (I reproduced that result 5 times in a row, so…). Overall, it looks more stable, and the hyperthreaded runs behave as expected (negative returns). Our 5 thread run is the best so far.

Core efficiency: again, we cannot make use of all our cores efficiently. From 5 cores onward, we have already reached peak performance.

Normalized efficiency: still massively diminishing returns as we add more threads.

Intel i7–3930K: 1 Socket, 12 Cores, 1 Thread

tl;dr: scales very poorly, but very stable, and peaks at 5 threads.

Timings: this seems to be the most stable result and the best result we will get on this CPU: the timing curve is "nicely" parabolic.

Core efficiency: again, we cannot make use of all our cores efficiently. From 5 cores onward, we have already reached peak performance.

Normalized efficiency: still massively diminishing returns as we add more threads.

Intel i7–3930K: Comparing the 3 Topologies

tl;dr: 1S/12C/1T setup beats the two other topologies.

Normalization per thread comparison:

  • The 1S/12C/1T and 6S/2C/1T are moving the same way in performance, except for 7 threads.
  • The baseline topology 12S/1C/1T is sometimes better, sometimes worse than the two other topologies: specifically at 2, 4, 8, and 11 threads.

Cumulated Normalization per thread comparison:

  • The 1S/12C/1T topology is relatively the best by far (up to an 8% cumulated difference to the first contender).
  • The 12S/1C/1T and 6S/2C/1T topologies are fighting each other without a clear winner.

Detailed Data Chart Tutorial:

Tutorial for the readability of the detailed data chart (a rough sketch of how such a chart could be built follows after the list):

  • Horizontal axis is representing topologies (they are at the bottom, under the form Sockets / Cores / Threads).
  • Vertical axis is representing the scaled performance versus the socket-only topology (here, 12S/1C/1T), where higher than 100% is better, and lower than 100% is worse. It affects the color.
  • Facets are representing the number of threads, from left to right, then from top to bottom. The top left is the minimum number of threads (1), while the bottom right is the maximum number of threads (12).
  • The interior of each bar has four different values: the time required for training under the combination of {Topology, Thread Count}, its scaled time versus the socket-only topology (12S/1C/1T), the rank of the topology under the same number of threads (1 is best, 2 is slower than 1, 3 is slower than 2), and the cumulated rank in the square brackets.
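For readers who want to reproduce this kind of chart, here is a rough ggplot2 sketch. The column names and the tiny data frame are made up for illustration; this is not the code that produced the figures in this post:

# Hypothetical sketch of a faceted topology-vs-performance chart with ggplot2.
library(ggplot2)

chart_data <- data.frame(
  topology = rep(c("12S/1C/1T", "6S/2C/1T", "1S/12C/1T"), times = 2),
  threads  = rep(c(1, 2), each = 3),
  scaled   = c(100, 101, 99, 100, 97, 103)  # % versus the socket-only topology
)

ggplot(chart_data, aes(x = topology, y = scaled, fill = scaled > 100)) +
  geom_col() +
  geom_hline(yintercept = 100, linetype = "dashed") +   # the 100% reference line
  facet_wrap(~ threads) +                               # one facet per thread count
  labs(x = "Topology (Sockets / Cores / Threads)",
       y = "Scaled performance vs socket-only topology (%)",
       fill = "Above baseline")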

Detailed Data Chart Conclusions:

  • Stopping our reading early is best (stop at the physical core count, skip the logical cores).
  • For small threaded tasks (2 to 4), using the 12S/1C/1T topology may be beneficial.
  • Medium threaded tasks (5 to 6) should be using the 1S/12C/1T topology.
  • Heavily threaded tasks (7 to 12) may go for 6S/2C/1T, but this is clearly not a recommended choice due to negative returns from adding more threads.

Intel i7–7700K: 8 Sockets, 1 Core, 1 Thread

tl;dr: scales poorly, but 1 thread is faster than any i7–3930K setup?!

Timings: with only one (1) thread, we are doing better than ANY threaded run from our i7–3930K server. This just shows that for fast histogram, GHz matters, not more cores. Using 4 cores (276.40s) yields only a 13% improvement over 2 cores (313.05s). We are already maxed out when using all physical cores.

Core efficiency: we reach only a 161.37% maximum efficiency at 4 threads. The Bosch dataset really seems too small, as hyperthreaded cores provide negative returns.

Normalized efficiency: the diminishing returns are less severe than on our i7–3930K server, but they are still massive (only about 40% with many threads).

Intel i7–7700K: 4 Sockets, 2 Cores, 1 Thread

tl;dr: scales poorly, potentially the fastest.

Timings: this run seems to be faster than the 8S/1C/1T topology, but the difference is too small to assess properly.

Core efficiency: can’t say much, not enough significant visual differences to spot.

Normalized efficiency: the diminishing returns are still as severe as the previous run.

Intel i7–7700K: 1 Socket, 8 Cores, 1 Thread

tl;dr: scales poorly, worse than the baseline.

Timings: does not seem stable, and probably worse than our 8S/1C/1T topology run.

Core efficiency: still not stable efficiency.

Normalized efficiency: a higher normalized return? (not really significant at less than 1%…)

Intel i7–7700K: 1 Socket, 4 Cores, 2 Threads

tl;dr: scales poorly, the worst.

Timings: probably the worst overall.

Core efficiency: still not stable efficiency.

Normalized efficiency: poor efficiency, poor timing: what else do we need to conclude about this topology?

Intel i7–7700K: Comparing the 4 Topologies

tl;dr: the 4S/2C/1T setup beats the three other topologies.

Normalization per thread comparison:

  • We have only two contenders: the 8S/1C/1T topology, and the 4S/2C/1T topology.
  • The 1S/8C/1T and 1S/4C/2T topologies are consistently out of the race, even if only by a 1% difference.

Cumulated Normalization per thread comparison:

  • The 4S/2C/1T topology is consistently doing better.
  • The 8S/1C/1T topology is a serious contender to consider.
  • The 1S/8C/1T and 1S/4C/2T topologies are “not interesting”.

Detailed Data Chart Tutorial:

Tutorial for the readability of the detailed data chart:

  • Horizontal axis is representing topologies (they are at the bottom, under the form Sockets / Cores / Threads).
  • Vertical axis is representing the scaled performance versus the socket-only topology (here, 8S/1C/1T), where higher than 100% is better, and lower than 100% is worse. It affects the color.
  • Facets are representing the number of threads, from left to right, then from top to bottom. The top left is the minimum number of threads (1), while the bottom right is the maximum number of threads (8).
  • The interior of each bar has four different values: the time required for training under the combination of {Topology, Thread Count}, its scaled time versus the socket-only topology (8S/1C/1T), the rank of the topology under the same number of threads (1 is best, 2 is slower than 1, 3 is slower than 2), and the cumulated rank in the square brackets.

Detailed Data Chart Conclusions:

  • Stopping our reading early is best (stop at the physical core count, skip the logical cores).
  • Use the 4S/2C/1T topology to squeeze out slightly less than 1% better performance overall.

Dual Quanta Freedom Ivy Bridge: 40 Sockets, 1 Core, 1 Thread

tl;dr: scales very poorly, 1 core of i7–7700K beats the 20 core server.

Timings: as we notice, we get negative returns from 6 threads onward, therefore we will skip all hyperthreaded runs. It is interesting that our latest generation Intel i7–7700K is wiping the floor with this server using only one core! Our 20 core system struggles.

Core efficiency: not efficient at all.

Normalized efficiency: very large diminishing returns.

Dual Quanta Freedom Ivy Bridge: 20 Sockets, 2 Cores, 1 Thread

tl;dr: scales very poorly, as poorly as our 40S/1C/1T run.

Timings: ahem… seems to be as bad as our 40S/1C/1T run.

Core efficiency: not efficient at all.

Normalized efficiency: very large diminishing returns.

Dual Quanta Freedom Ivy Bridge: 1 Socket, 40 Cores, 1 Thread

tl;dr: scales very poorly, probably one of the best runs.

Timings: significantly faster from 2 threads onward, and up to 60 seconds faster from 4 threads versus our previous runs!

Core efficiency: not efficient at all.

Normalized efficiency: very large diminishing returns.

Dual Quanta Freedom Ivy Bridge: 1 Socket, 20 Cores, 2 Threads

tl;dr: scales very poorly, probably one of the best runs.

Timings: seems to be as fast as our 1S/40C/1T run!

Core efficiency: not efficient at all.

Normalized efficiency: very large diminishing returns.

Dual Quanta Freedom Ivy Bridge: 2 Sockets, 20 Cores, 1 Thread

tl;dr: scales very poorly, worst run overall.

Timings: as poor as our first two slow runs, and it might even be worse in fact.

Core efficiency: not efficient at all.

Normalized efficiency: very large diminishing returns.

Dual Quanta Freedom Ivy Bridge: 2 Sockets, 10 Cores, 2 Threads

tl;dr: scales very poorly, one of the best runs.

Timings: faster than all our previous runs for our 20 core system!

Core efficiency: not efficient at all.

Normalized efficiency: very large diminishing returns.

Dual Quanta Freedom Ivy Bridge: Comparing the 6 Topologies

tl;dr: use the host topology (2S/10C/2T) for best performance.

Normalization per thread comparison:

  • 3 topologies are lagging behind: 40S/1C/1T, 20S/2C/1T, and 2S/20C/1T.
  • The 3 other topologies are all stepping ahead: 1S/40C/1T, 1S/20C/2T, 2S/10C/2T.

Cumulated Normalization per thread comparison:

  • 40S/1C/1T, 20S/2C/1T, and 2S/20C/1T topologies are lagging behind.
  • The 1S/20C/2T topology seems to have scalability issues with many threads, but wins most runs.
  • 2S/10C/2T and 1S/40C/1T are contenders for the best runs.

Detailed Data Chart Tutorial:

Tutorial for the readability of the detailed data chart:

  • Horizontal axis is representing topologies (they are at the bottom, under the form Sockets / Cores / Threads).
  • Vertical axis is representing the scaled performance versus the socket-only topology (here, 40S/1C/1T), where higher than 100% is better, and lower than 100% is worse. It affects the color.
  • Facets are representing the number of threads, from left to right, then from top to bottom. The top left is the minimum number of threads (1), while the bottom right is the maximum number of threads (20).
  • The interior of each bar has four different values: the time required for training under the combination of {Topology, Thread Count}, its scaled time versus the socket-only topology (40S/1C/1T), the rank of the topology under the same number of threads (1 is best, 2 is slower than 1, 3 is slower than 2), and the cumulated rank in the square brackets.

Detailed Data Chart Conclusions:

  • Not using all cores is extremely useful.
  • Squeezing 15%+ performance increase is possible using either 1S/40C/1T, 1S/20C/2T, or 2S/10C/2T topologies.
  • Maximum performance for this dataset is achieved using 6 threads with the 2S/10C/2T topology, which replicates the host topology!

Fast xgboost (all servers together)

Do you want to check raw performance? Here it is, for your eyes:

  • The i7–7700K is smashing everyone, no questions asked.
  • The i7–3930K sits between a well-tuned 2x Xeon and a badly tuned 2x Xeon.
  • Our 2x Xeon (Dual Quanta Freedom Ivy Bridge) has performance varying depending on whether the right or wrong topology is selected.

The i7–7700K is extremely efficient when it comes to making its GHz count:

Conclusion

If you have no idea about what topology to use, or if you want balanced server performance that is good for memory-heavy workloads (like fast histogram xgboost):

  • VMware: anything you want will fit, as performance degradation seems nearly non-existent… but try to simply pass the number of logical cores to the virtual machine (ex: an i7–7700K with 4 cores becomes a 1 socket / 8 cores virtual server)
  • KVM: respect the host topology when NUMA is optimized to squeeze maximum performance (ex: our 2x 10 core server should become 2 sockets / 10 cores / 2 threads), or replace cores with sockets and threads with cores when using single socket systems (ex: an i7–7700K with 4 cores becomes a 4 sockets / 2 cores / 1 thread virtual server); this gives a performance increase of about 15% to 20% over regular socket topologies when multiple threads are necessary.

You may notice our Dual Xeon has some issues… let's clear that up here:

By changing the CPU topology, you are changing how the memory is allocated and shared.

For instance, if you have 2 NUMA nodes from 2 socket CPUs with 10 cores each and 256GB RAM split evenly, each CPU can access a maximum of 128GB locally, and 128GB remotely via the slower QuickPath Interconnect (QPI).

If you tell the operating system to use a topology of 20 sockets with 2 cores each, you are telling the guest scheduler "we have 20 memory banks of 12.8GB each". You are therefore playing with smaller memory banks and very confined memory addresses, which speeds things up when little RAM is used (ahem, Cinebench R15), but fails very hard with memory-bound programs (ahem, xgboost fast histogram). Hence the 40 socket and 20 socket topologies failing hard in performance.

In the case of 2 sockets with 20 cores, the guest scheduler does not spread the work evenly because it cannot see that it should not schedule onto hyperthreaded cores. The host scheduler then tries to fix it, again causing slowness and wrong memory assignments. With the 1 socket, 40 core setup, you just schedule as-is and let the host scheduler do the scheduling job for you, so you get RAM allocated at the right spots out of the box.

As the 1 socket / 40 cores and the 2 sockets / 10 cores / 2 threads topologies are similar, they perform similarly well: with 40 cores you have no time to lose scheduling on the virtual server, because the host can schedule "by itself" correctly and fast. Meanwhile, for our 2S/10C/2T topology, the guest scheduler does the work of scheduling correctly, and gets it right most of the time (leaving the host scheduler with not much to do other than serving the appropriately requested jobs).

Not satisfied enough?

We have ONE blog post which will follow this series:

  • Benchmarking Baremetal Linux vs Virtualized Windows: how slow are we? AMD Ryzen showing up!

Previous post in this series:
