Benchmarking xgboost: 5GHz i7–7700K vs 20 core Xeon Ivy Bridge, and KVM/VMware Virtualization

Laurae
Data Science & Design
22 min read · Apr 27, 2017

Recently, I decided to spend $125 on renting (for a month) an overclocked system that showed promising performance compared to a friend’s rig I had helped build.

The system I rented was the following:

i7–7700K specs sheet anyone? (Intel)
  • i7–7700K (4c/8t, overclocked to 5.0/4.7 GHz) with watercooling
  • 64GB RAM
  • 2x 500 GB NVMe drives in RAID 0 (6 Gbps read…)
  • 1 Gbps download / 250 Mbps upload
  • Host machine: Ubuntu 16.04 with stock kernel
  • Virtual machine: Windows Server 2012 R2 Datacenter

Unfortunately, it does not hold 5.0 GHz with all cores active. However, we are still able to squeeze out 4.7 GHz with all cores on fire, a hefty 500 MHz bump over the original frequency table design!

I also spent a bit of money renting a 20 core machine, with the following settings:

Dual Socket System for our dual Intel Xeon: RAM banks are separated, therefore optimization for RAM is mandatory!
  • Quanta Freedom Ivy Bridge (2x 10c/20t, 3.1/2.7 GHz) with air cooling (see the end of this blog post for the answer to “what is this processor?!”)
  • 96GB RAM
  • 2x 525GB SSDs
  • 1 Gbps download/upload
  • Host machine: Ubuntu 16.04 with stock kernel
  • Virtual machine: Windows Server 2012 R2 Datacenter
RAM access is… nearly perfect when tuned!

The major issue with this server is the dual-socket design: RAM banks (NUMA nodes) are separate, therefore we have optimizations to perform in order to increase performance through better memory placement.

If you have read my Virtualization series (Part 1 and Part 2) about virtualization on VMware and the socket system, you know what is coming: we are going to try different virtualization topologies (changing the sockets, cores, and threads presented to the virtual machines) on this new system with KVM! We will also compare against our stock VMware virtualization setup as an extra reference point!

We will be able to test two different things:

  • Cinebench R15 just for raw CPU performance.
  • xgboost for “real world” performance.

Is this blog post too long (aka “tl;dr”)? Ctrl+F “Do you want to check for raw performance?” and you will get to the conclusion!

Testing Setup

We will try our setup under different configurations to find out which one performs best on KVM, the last configuration matching the real CPU topology of our i7–7700K 4 core setup (it has a single NUMA node); a quick sanity check of what the guest actually sees is sketched after the screenshots below:

  • 8 Sockets, 1 Core, 1 Thread (total: 8 cores, 8 threads)
  • 4 Sockets, 2 Cores, 1 Thread (total: 8 cores, 8 threads)
  • 1 Socket, 8 Cores, 1 Thread (total: 8 cores, 8 threads)
  • 1 Socket, 4 Cores, 2 Threads (total: 4 cores, 8 threads)
No CPU interconnect for our server, therefore we have a single node.
On the hypervisor, we are burning some cool CPU!
This is what our Virtual Machine depicts (it fails to detect the overclocking).
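
Whatever topology we pick, it is worth double-checking what the guest actually sees before benchmarking. Below is a minimal sanity-check sketch from R inside the virtual machine; the expected value of 8 is an assumption tied to the 8-thread configurations above, and the wmic call applies only to the Windows Server guest:

library(parallel)

detectCores(logical = TRUE)   # logical CPUs exposed to the guest (we expect 8 here)
detectCores(logical = FALSE)  # physical cores as reported by the guest topology

# On the Windows Server guest, the exposed topology can also be queried directly:
system("wmic cpu get NumberOfCores,NumberOfLogicalProcessors", intern = TRUE)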

We are also testing our old VMware virtualization test bed (VMware Workstation 12.5) with our i7–3930K 6 core setup (it has 1 NUMA node):

  • 12 Sockets, 1 Core (total: 12 cores, 12 threads)
  • 6 Sockets, 2 Cores (total: 12 cores, 12 threads)
  • 1 Socket, 12 Cores (total: 12 cores, 12 threads)
Burning all the 12 CPUs!
Exact xgboost eating 100% of all our CPUs! But it reports the wrong clockrate.

The following is for our Dual Quanta Freedom Ivy Bridge 20 core setup (with KVM tuned for maximum performance: CPU pinning, NUMA awareness, NUMA locking, 2 NUMA nodes):

  • 40 Sockets, 1 Core, 1 Thread (total: 40 cores, 40 threads)
  • 20 Sockets, 2 Cores, 1 Thread (total: 40 cores, 40 threads)
  • 1 Socket, 40 Cores, 1 Thread (total: 40 cores, 40 threads)
  • 1 Socket, 20 Cores, 2 Threads (total: 20 cores, 40 threads)
  • 2 Sockets, 20 Cores, 1 Thread (total: 40 cores, 40 threads)
  • 2 Sockets, 10 Cores, 2 Threads (total: 20 cores, 40 threads)
numactl --hardware => it’s twice as expensive to miss a NUMA R/W!
numastat => 6.82% of RAM fetches must be served from another NUMA node!
Burning ahem… 40 CPUs
Average load is 87% because the workload is not “fat enough” for our machine.
Exact xgboost can make use of our 20 physical cores (40 threads)! Not perfectly 100%, but still 87%!
Cinebench R15 can make use of 100% of our cores!

We will test under two scenarios:

  • Cinebench R15 (for perfect multithreaded scaling), run 11 times, keeping the median result (6th best run).
  • Our xgboost benchmark on the sparse Bosch dataset, but with 50 iterations to make it stable with the exact method (do not compare directly with the previous series: here we are using the exact method).

Obviously, Cinebench R15 will be run on all threads (8, 12, or 40), while the Bosch dataset will be used for training with xgboost over a wide range of threads (1 to 8, 1 to 12, or 1 to 40) to check scalability.

xgboost is installed with the following settings:

  • -O3 -mtune=native optimization for xgboost, using devtools::install_github("Laurae2/ez_xgb/R-package@2017-02-15-v1", force = TRUE)
  • Microsoft R Open 3.3.3 + Intel MKL
  • Unfortunately, our KVM is so “old” that it does not allow guests to detect Kaby Lake CPUs. Instead, it exposes them as Broadwell:
Kaby Lake detected as Broadwell in KVM!

We will also use many pictures… because they speak better than blocks of text!

Cinebench R15 Tests (per server)

We are presenting the screenshots from the runs that use sockets as the number of cores (the nS/1C/1T topologies).

It is important not to extrapolate the results too quickly. They are representative of our specific workload under our own specifications.

i7–3930K (VMware Virtualization)

Scores from left to right: 806, 777, 802.

The results are below:

  • 12S / 1C score: 806
  • 6S / 2C score: 777
  • 1S / 12C score: 802

We already know from the past that a 6-socket / 2-core topology is slower than a 12-socket / 1-core topology in our VMware virtualization case.

If you followed Destroying the Myth of “number of threads = number of physical cores”, you will quickly notice the score (806) is a bit below what we had at that time (816). That discrepancy is reproducible, which means we may have had a series of lucky runs to get 816 back then.

i7–3930K (3.9/3.3), 12 sockets, 1 core, 1 thread (12c/12t), scoring a median of 806.

i7–7700K (KVM Virtualization)

Scores from left to right: 1020, 1011, 1015, 1015

The results are below:

  • 8S / 1C / 1T score: 1020
  • 4S / 2C / 1T score: 1011
  • 1S / 8C / 1T score: 1015
  • 1S / 4C / 2T score: 1015

The score per GHz is insane: over 50 points per aggregate GHz (roughly 1020 / (4 × 4.7 GHz) ≈ 54), which is bleeding-edge performance and efficiency. It seems we may have hit a wall with the topology, as changes do not produce any large, meaningful difference in Cinebench R15. However, the quad-socket, dual-core topology performs slightly worse, followed by the quad-core, dual-thread topology.

i7–7700K (5.0/4.7), 8 sockets, 1 core, 1 thread (8c/8t), scoring a median of 1020.

20 core Ivy Bridge (KVM Virtualization)

Scores from left to right: 2292, 2248, 2261, 2251, 2242, 2284

The results are below:

  • 40S / 1C / 1T score: 2292
  • 20S / 2C / 1T score: 2248
  • 1S / 40C / 1T score: 2261
  • 1S / 20C / 2T score: 2251
  • 2S / 20C / 1T score: 2242
  • 2S / 10C / 2T score: 2284

Our Dual Quanta Freedom is smashing our Sandy Bridge-E i7–3930K in score per GHz, as expected given the architecture difference between them (Sandy Bridge vs Ivy Bridge).

We also notice our 40-socket, 1-core topology stands in front of every other topology tested. In addition, the results are fairly consistent with what we found with our i7–7700K:

  • Using a socket with dual core topology (20S / 2C / 1T) is slower overall;
  • Using a full core topology (1S / 40C / 1T) is slower overall, but still faster than using a socket with dual core topology;
  • Inverting the host socket/logical core topology (2S / 20C / 1T) leads to the worst performance;
  • Approximating the replication of the host topology (1S / 20C / 2T) is as slow as the dual core topology;
  • Specific to our dual socket system: replicating the host topology (2S / 10C / 2T) is very fast and comes right after our best setup in performance (and is less error prone);
  • Using only sockets (40S / 1C / 1T) to let the host scheduler do all the work is even faster!

In addition, our host topology replication gives very consistent results, landing in the [2272, 2288] range across the 11 runs, which is unexpected.

Also, even with our KVM tuning / NUMA optimizations, using a poor topology leads to worse performance than our untuned virtualization scheme (before optimizations, we had a score of 2270 in Cinebench R15).

Dual Quanta Freedom (3.1/2.7), 40 sockets, 1 core, 1 thread (40c/40t), scoring a median of 2292.
Dual Quanta Freedom (3.1/2.7), 2 sockets, 10 cores, 2 threads (20c/40t), scoring a peak of 2288!

Cinebench R15 (all servers together)

We are first plotting all topologies together according to their performance, to get a quick overview of all the data we gathered.

Cinebench R15 Absolute Scores and Scaled GHz Scores on i7–3930K, i7–7700K, and Dual Quanta Freedom

Obviously, this is difficult to read for our comparison: we are trying to assess performance depending on the topology, not depending on the processor! (Nevertheless, we quickly notice our i7–7700K is blazing past our i7–3930K.)

What if we set the 2n/1/1 topology as the reference? The results are below:

2n/1/1 topology seems the great winner?

It seems the 2n/1/1 topology is the winner on all 3 servers, even when taking the median of 11 runs. One “would” recommend the 2n/1/1 topology right away, but we have not hammered RAM reads/writes yet, which is exactly what xgboost is going to do for us.

Will we be able to reproduce this using xgboost? Quick answer: no.

Exact xgboost (per server)

We will do the same as we did for Cinebench R15, except with more details:

  • We report the timings per thread count, testing every thread count from 1 to the number of logical cores (if we have 40 logical cores… we have 40 tests).
  • We report the core efficiency (unscaled) versus the singlethreaded run.
  • We report the normalized efficiency (clock-scaled) versus the singlethreaded run rescaled to the frequency of a multithreaded run (because with Turbo Boost, a singlethreaded run clocks higher than a multithreaded run); see the sketch just below.
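
To make these three metrics concrete, here is a minimal R sketch of how they can be computed from the per-thread timings. Every number below is an illustrative placeholder (not a measured value), and the per-physical-core normalization is one reading of the definition above:

# Placeholder inputs: wall-clock seconds for 1..12 threads, the physical core
# count, and the single-core vs. all-core Turbo Boost clocks (all illustrative).
timings    <- c(2600, 1380, 960, 750, 640, 575, 545, 520, 495, 470, 445, 420)
phys_cores <- 6
clock_1t   <- 3.9   # GHz with one core loaded
clock_all  <- 3.5   # GHz with all cores loaded

threads <- seq_along(timings)
speedup <- timings[1] / timings   # core efficiency (unscaled): x times the singlethreaded run

# Rescale the singlethreaded baseline to the all-core clock, then express the
# speedup per physical core in use to obtain the normalized efficiency.
baseline_scaled <- timings[1] * clock_1t / clock_all
norm_eff        <- (baseline_scaled / timings) / pmin(threads, phys_cores)
round(100 * norm_eff, 1)          # in %, above 100% means beating linear scaling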

Our xgboost test consists of training with the following parameter set on the numeric Bosch full dataset (1,183,747 observations, 969 features, an unbalanced dataset with only 6,879 positive cases), with i being the number of threads used:

gc(verbose = FALSE)
set.seed(11111)
StartTime <- System$currentTimeMillis()   # millisecond timer from the R.utils package
temp_model <- xgb.train(data = xgb_data,  # xgb.DMatrix holding the sparse Bosch data
                        nthread = i,      # number of threads for this run
                        nrounds = 50,
                        max_leaves = 255,
                        max_depth = 6,
                        eta = 0.20,
                        tree_method = "exact",
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)
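
For completeness, here is a hypothetical harness around that timed block: the file names, the object layout, and the sparse-matrix preparation are invented for illustration (only the training call and its parameters are taken from above, and System$currentTimeMillis() is assumed to come from the R.utils package):

library(Matrix)    # sparse dgCMatrix support
library(xgboost)
library(R.utils)   # System$currentTimeMillis()

# Hypothetical data preparation: a sparse matrix of the 969 numeric Bosch
# features and the 0/1 response (file names are placeholders).
bosch_X  <- readRDS("bosch_numeric_sparse.rds")
bosch_y  <- readRDS("bosch_response.rds")
xgb_data <- xgb.DMatrix(data = bosch_X, label = bosch_y)

# Sweep the thread count from 1 to the number of logical cores and store
# the elapsed wall-clock seconds of each training run.
timings <- sapply(seq_len(parallel::detectCores(logical = TRUE)), function(i) {
  gc(verbose = FALSE)
  set.seed(11111)
  StartTime <- System$currentTimeMillis()
  xgb.train(data = xgb_data, nthread = i, nrounds = 50, max_leaves = 255,
            max_depth = 6, eta = 0.20, tree_method = "exact", booster = "gbtree",
            objective = "binary:logistic", verbose = 2)
  (System$currentTimeMillis() - StartTime) / 1000
})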

As a reference point, our theoretical performance is the following:

  • Timings: we do not have any baseline.
  • Core efficiency: we are using the singlethreaded run as our baseline.
  • Normalized efficiency: we are using our scaled singlethreaded run as our baseline.

Be warned: what follows is a very long, descriptive wall of text and pictures. Also, do not generalize this experimental design to any xgboost run: there are way too many parameters to control (and many of them cannot be controlled). However, you may find correlations between the different observations.

We will be using the i7–3930K run as our “tutorial” example.

DISCLAIMER: the analysis was done sequentially and blindly. The comparisons were made visually first, before digging into the numbers for a more accurate analysis and building targeted comparison charts.

Intel i7–3930K: 12 Sockets, 1 Core, 1 Thread

tl;dr: scales well as our baseline. Up to 118% efficiency per core.

Timings: we can see below that we need a staggering 2608 seconds to perform our 50-iteration test on a single thread. Our best run requires 409 seconds, while our physical-core run (6 cores) requires 540.14 seconds (538% faster or 24.3% slower, depending on how you look at it).

Core efficiency: we notice we do not have linear scaling with the thread count. However, when hyperthreading kicks in, we are able to overachieve the theoretical performance of 6 physical cores by an absolute 38.36% at 12 threads. That’s a 6.38x speed-up using 12 threads!

Normalized efficiency: our singlethreaded run has an efficiency of 111.43%, as we have an increased 11.43% clock rate when using a single core. Our worst efficiency is achieved at 6 physical cores (89.69%), while we are able to overachieve our theoretical performance from 9 threads (up to 18.55% faster cores at 12 threads).

Intel i7–3930K: 6 Sockets, 2 Cores, 1 Thread

tl;dr: up to 117% efficiency per core, but our comparison point is 8.3% faster to begin with. Therefore, we are beating the 12S/1C/1T setup.

Timings: our run needs only 2409 seconds (8.3% faster) using the 6S/2C/1T topology versus 2608 seconds using the 12S/1C/1T topology. Compared to our previous best run, we are able to nail 383 seconds (6.8% faster) versus 409 seconds. We may notice our 11-thread run is slower than our 10-thread run; this behavior was reproduced 10 times in a row (to check) and has no explanation. Our physical core run (6 cores) is 1 second faster, but we can discard that as a sampling anomaly to avoid making wrong inferences.

Core efficiency: we are able to beat the theoretical performance of 6 physical cores by an absolute 28.76% at 12 threads. That’s a 6.29x speed up using 12 threads!

Normalized efficiency: our worst efficiency is achieved at 6 physical cores (82.97%), while we are able to overachieve our theoretical performance from 9 threads again (up to 16.77% faster cores at 12 threads).

Intel i7–3930K: 1 Socket, 12 Cores, 1 Thread

tl;dr: up to 122% efficiency per core, but it is as slow as the 12S/1C/1T setup initially. Therefore, it is slower than our 6S/2C/1T setup.

Timings: our run is as slow as our run using the 12S/1C/1T topology, but let’s not infer too quickly.

Core efficiency: we are able to beat the theoretical performance of 6 physical cores by an absolute 56.34% at 12 threads. That’s a 6.56x speed up using 12 threads!

Normalized efficiency: our worst efficiency is achieved at 6 physical cores (87.93%), while we are able to overachieve our theoretical performance a bit earlier, with 8 threads (up to 21.89% faster cores at 12 threads).

Intel i7–3930K: Comparing the 3 Topologies

tl;dr: 6S/2C/1T setup beats the two other topologies.

Normalization per thread comparison:

  • The topology 6S/2C/1T seems to be faster overall.
  • The topology 1S/12C/1T is slower without hyperthreading, but faster with hyperthreading than the baseline.
  • The baseline topology 12S/1C/1T is underperforming for hyperthreading.

Cumulated Normalization per thread comparison:

  • The topology 6S/2C/1T is faster by over a cumulated 40% over 12 threads, beating the other topologies. It is the recommended setup.
  • The topology 12S/1C/1T is probably the most balanced topology if the choice had to be made between 12S/1C/1T and 1S/12C/1T. The former is recommended for non-hyperthreaded tasks, while the latter is recommended for hyperthreaded tasks.

Detailed Data Chart Tutorial:

Tutorial for the readability of the detailed data chart:

  • The horizontal axis represents the topologies (shown at the bottom, in the form Sockets / Cores / Threads).
  • The vertical axis represents the performance scaled against the socket-only topology (here, 12S/1C/1T): higher than 100% is better, lower than 100% is worse. It also drives the bar color.
  • The facets represent the number of threads, from left to right, then from top to bottom. The top left is the minimum number of threads (1), while the bottom right is the maximum number of threads (12).
  • The interior of each bar shows four values: the training time for the {Topology, Thread Count} combination, its time scaled against the socket-only topology (12S/1C/1T), the rank of the topology at that thread count (1 is best, 2 is slower than 1, and so on), and the cumulated rank in square brackets.
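
As a side note, a chart of this kind can be rebuilt in R with ggplot2. The sketch below uses a made-up data frame just to show the structure (topology, thread count, scaled performance); it is not the code that produced the charts here:

library(ggplot2)

# Illustrative data: one row per {topology, thread count} with the performance
# scaled against the socket-only topology (1.00 = baseline). Values are made up.
df <- data.frame(
  topology = rep(c("12/1/1", "6/2/1", "1/12/1"), times = 2),
  threads  = rep(c(1, 12), each = 3),
  scaled   = c(1.00, 1.08, 1.00, 1.00, 1.05, 1.10)
)

ggplot(df, aes(x = topology, y = scaled, fill = scaled >= 1)) +
  geom_col() +
  geom_hline(yintercept = 1, linetype = "dashed") +
  facet_wrap(~ threads, labeller = label_both) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Sockets / Cores / Threads",
       y = "Performance vs. socket-only topology",
       fill = "Faster than baseline")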

Detailed Data Chart Takeaways:

  • 6S/2C/1T is best hands down. Nothing else to say.

Intel i7–7700K: 8 Sockets, 1 Core, 1 Thread

tl;dr: scales well as our baseline. Hyperthreading matters for an extra 85% in performance.

Timings: we need only 1235 seconds to do our 50-iteration test on a single thread. Our best run requires 310 seconds, while our physical-core run (4 cores) requires 392 seconds (215% faster or 26.4% slower, depending on how you look at it).

Core efficiency: still no linear scaling with the thread count. We nearly achieve linear scaling using hyperthreading, reaching a 3.98x speed-up on 8 threads.

Normalized efficiency: our singlethreaded run has an efficiency of 106.38%, as we have a 6.38% increased clock rate when using a single core. Our worst efficiency is achieved at 4 physical cores (83.74%). Our hyperthreaded runs do not beat a single core in normalized efficiency.

Intel i7–7700K: 4 Sockets, 2 Cores, 1 Thread

tl;dr: worse than our 8S/1C/1T run overall, not the best at all.

Timings: our run is too similar to the 8S/1C/1T setup to infer differences visually.

Core efficiency: we reach only a 3.86x speed-up on 8 threads. This is not good, given that our timings did not change much.

Normalized efficiency: the efficiency is worse than our 8S/1C/1T run. Poor performance…

Intel i7–7700K: 1 Socket, 8 Cores, 1 Thread

tl;dr: should be the fastest run because it beats our 8S/1C/1T run on the metrics by a very small margin (estimated: about 1% better per thread).

Timings: the run seems better than our 8S/1C/1T run, but we cannot easily infer this from the timings alone.

Core efficiency: we manage to beat linear scaling by reaching a 4.01x speed-up on 8 threads. We can suppose this run is faster than our 8S/1C/1T run.

Normalized efficiency: our efficiency seems better overall than in our 8S/1C/1T run. This run should be the fastest, not counting the 1S/4C/2T run we have not looked at yet.

Intel i7–7700K: 1 Socket, 4 Cores, 2 Threads

tl;dr: it seems to be the fastest run when looking at the timings, but the efficiency charts tell us otherwise. We have to look deeper with a direct comparison between all four runs.

Timings: this run seems even better than our 1S/8C/1T run, but we cannot infer much, as the difference is too small.

Core efficiency: we reach only a 3.94x speed-up on 8 threads. We cannot conclude yet versus our best run.

Normalized efficiency: can’t infer yet versus our best run.

Intel i7–7700K: Comparing the 4 Topologies

tl;dr: choose between 1S/8C/1T (heavy threading) and 1S/4C/2T (light threading), but they are so similar you will not see much difference (1%?).

Normalization per thread comparison:

  • The baseline topology 8S/1C/1T already seems very fast.
  • For maximum overall performance, the topology 1S/8C/1T is the recommended choice.
  • The contender topology 1S/4C/2T is a direct competitor to 1S/8C/1T.
  • Do not ever use the 4S/2C/1T topology: it is slower by a large margin.

Cumulated Normalization per thread comparison:

  • It is clear our 4S/2C/1T is getting smoked by every other topology.
  • Only 1S/8C/1T and 1S/4C/2T are fighting each other. We recommend choosing whichever suits you best for single CPUs with only one NUMA node.

Detailed Data Chart Tutorial:

Tutorial for the readability of the detailed data chart:

  • The horizontal axis represents the topologies (shown at the bottom, in the form Sockets / Cores / Threads).
  • The vertical axis represents the performance scaled against the socket-only topology (here, 8S/1C/1T): higher than 100% is better, lower than 100% is worse. It also drives the bar color.
  • The facets represent the number of threads, from left to right, then from top to bottom. The top left is the minimum number of threads (1), while the bottom right is the maximum number of threads (8).
  • The interior of each bar shows four values: the training time for the {Topology, Thread Count} combination, its time scaled against the socket-only topology (8S/1C/1T), the rank of the topology at that thread count (1 is best, 2 is slower than 1, and so on), and the cumulated rank in square brackets.

Detailed Data Chart Takeaways:

  • 1S/8C/1T and 1S/4C/2T are the preferred choices. For singlethreaded tasks, 1S/4C/2T may squeeze even more performance (nearly negligible difference, from 0.2% to 1.7%).

Dual Quanta Freedom Ivy Bridge: 40 Sockets, 1 Core, 1 Thread

tl;dr: scales to 80–90% of our singlethreaded efficiency due to the NUMA nodes and heavy multithreading, but exhibits exceptional performance (198% faster).

Timings: we need 1652 seconds to do our 50-iteration test on a single thread, which is far faster than our i7–3930K (57.9% faster). Our best run requires ONLY 103 seconds, and hyperthreading kicks in for an extra 58.3% over the physical-core performance (which is huge!). We may notice the plateau from 11–20 threads, which is due to the second NUMA node joining during training: a memory fetch served from the wrong NUMA node is twice as expensive as usual.

Core efficiency: we can’t expect to scale linearly with so many cores and two NUMA nodes. Our best efficiency is 16.06x the performance of a single core.

Normalized efficiency: our singlethreaded run has an efficiency of 114.8%, as we have a 14.8% increased clock rate when using a single core. The NUMA nodes heavily impact multithreaded performance, as we fall into a 58.2% efficiency abyss when using all physical cores. Hyperthreading brings it back up to 92.2%, therefore this is not a multithreading overhead issue in the code (if it were, performance would have kept decreasing as we added logical cores on top of physical cores).

Dual Quanta Freedom Ivy Bridge: 20 Sockets, 2 Cores, 1 Thread

tl;dr: beats 40S/1C/1T overall for multithreading.

Timings: hard to infer with so many numbers, but we may notice we are much faster for heavy multithreading than our 40S/1C/1T run (up to 8% faster).

Core efficiency: our best efficiency increases to 17.31x the performance of a single core. We notice a threading penalty at even thread counts when hyperthreading kicks in, which then flips to the odd thread counts as we add even more threads.

Normalized efficiency: we nearly get back to linear scaling, with 99.4% efficiency!

Dual Quanta Freedom Ivy Bridge: 1 Socket, 40 Cores, 1 Thread

tl;dr: difficult to infer, we skip comments.

Timings: similar to 20S/2C/1T run.

Core efficiency: extremely strange behavior at odd thread counts when hyperthreading is in use!

Normalized efficiency: we eventually beat linear scaling, reaching 100.4%. But the picture is really strange for hyperthreading (39 threads is better than 40 threads), and the behavior at even thread counts is very consistent.

Dual Quanta Freedom Ivy Bridge: 1 Socket, 20 Cores, 2 Threads

tl;dr: difficult to infer, we skip comments.

Timings: similar to 20S/2C/1T run.

Core efficiency: can’t decipher the hyperthreaded behavior.

Normalized efficiency: we cannot decipher the hyperthreaded behavior, nor the peak at 39 threads (again), which beats our linear scaling at 100.7%.

Dual Quanta Freedom Ivy Bridge: 2 Sockets, 20 Cores, 1 Thread

tl;dr: most consistent/stable run and very similar to our 20S/2C/1T run. Therefore, this run might be the best.

Timings: similar to 20S/2C/1T run.

Core efficiency: the hyperthreading behavior is more consistent early on, but still inconsistent from 32 threads onward.

Normalized efficiency: we cannot explain the hyperthreading consistency from 32 threads onward.

Dual Quanta Freedom Ivy Bridge: 2 Sockets, 10 Cores, 2 Threads

tl;dr: difficult to infer, we skip comments.

Timings: similar to 20S/2C/1T run.

Core efficiency: hyperthreading improvements are way more consistent here!

Normalized efficiency: hyperthreading improvements are way more consistent here!

Dual Quanta Freedom Ivy Bridge: Comparing the 6 Topologies

tl;dr: 20S/2C/1T is the balanced choice, 2S/10C/2T is best for many threads, and 2S/20C/1T for few threads.

Normalization per thread comparison:

  • The baseline topology 40S/1C/1T is slow, especially for hyperthreading.
  • If using hyperthreaded tasks, do not go for the baseline topology 40S/1C/1T.
  • For a small number of threads, replicating the host topology without hyperthreads (2S/20C/1T) seems the best way to go.
  • 1S/20C/2T and 1S/40C/1T have issues with physical core performance.
  • 20S/2C/1T and 2S/10C/2T seem to wipe the floor with the rest. We can check this in the next picture.

Cumulated Normalization per thread comparison:

  • 2S/10C/2T, the host topology replication, wipes the floor with everything else.
  • 20S/2C/1T, the contender we found just before, is providing excellent performance for physical cores, and lower performance for heavy threading than the host topology (2S/10C/2T).
  • 2S/20C/1T is excellent for a low amount of threads.
  • There is no practical reason to choose 1S/40C/1T, 1S/20C/2T, or 40S/1C/1T over the three other topologies described above.

Detailed Data Chart Tutorial:

Tutorial for the readability of the detailed data chart:

  • The horizontal axis represents the topologies (shown at the bottom, in the form Sockets / Cores / Threads).
  • The vertical axis represents the performance scaled against the socket-only topology (here, 40S/1C/1T): higher than 100% is better, lower than 100% is worse. It also drives the bar color.
  • The facets represent the number of threads, from left to right, then from top to bottom. The top left is the minimum number of threads (1), while the bottom right is the maximum number of threads (40).
  • The interior of each bar shows four values: the training time for the {Topology, Thread Count} combination, its time scaled against the socket-only topology (40S/1C/1T), the rank of the topology at that thread count (1 is best, 2 is slower than 1, and so on), and the cumulated rank in square brackets.

Detailed Data Chart Takeaways:

  • The 20S/2C/1T topology is the balanced choice.
  • The 2S/10C/2T topology is very strong for many threads.
  • The 2S/20C/1T topology is very strong for few threads.

Exact xgboost (all servers together)

I do not think we need to compare all topologies together, as we already analyzed each topology for each CPU and reached similar conclusions (use the host topology for KVM; replicate the CPU topology using sockets/cores for VMware).

Do you want to check for raw performance? Here it is:

  • The Dual Xeon is just smoking the floor.
  • The i7–7700K is faster overall with only 4 physical cores versus the 6 physical cores of the i7–3930K (the virtualization could be doing a poor job there).

In addition, running at a 61% higher frequency (i7–7700K vs Dual Xeon) yields only a 36.2% improvement:
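
That 61% presumably comes from the single-core Turbo clocks quoted earlier (5.0 GHz for the i7–7700K versus 3.1 GHz for the Dual Xeon); a quick check:

5.0 / 3.1   # ~1.61, i.e. about a 61% higher clock for the i7-7700K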

Conclusion

If you have no idea what topology to use, or if you want balanced server performance:

  • VMware: use the number of physical cores as the number of sockets, with 2 cores per socket standing in for the threads (ex: an i7–3930K with 6 cores becomes a 6-socket / 2-core virtual server)
  • KVM: replicate the host topology when NUMA is optimized, to squeeze out maximum performance (ex: an i7–7700K with 4 cores becomes a 1-socket / 4-core / 2-thread virtual server)

For pure raw CPU performance, one should use the 2n/1/1 scheme: as many sockets as there are logical cores, only 1 core per socket, and only 1 thread per core.

As for which virtualization you should select… you know you will go for KVM and not VMware just by looking at the results. There is no reason not to.

I also get a frequent question: “how much did it cost?” I spent about 500€ on this benchmark and on another, longer benchmark (3 months long) for the xgboost vs LightGBM Meetup.

Two more blog posts will follow this one.
