Destroying the Myth of “number of threads = number of physical cores”

tl;dr: max performance => number of threads = number of logical/virtual cores

Edits (16/04/2017) due to requests

  • Added some comparable benchmarks
  • Added more pictures
  • Added extra explanations

Introduction

People frequently say things like:

Set the number of threads to the number of physical cores, not the number of logical cores, for maximum performance!

This is simply wrong: it is advice designed for decade-old systems, where multithreading overhead was so large that dual-core systems already struggled to get the best out of 2 physical cores; imagine with 4 logical cores!

You will see this (poor) advice in popular repositories; xgboost and LightGBM are no exceptions.

We are going to debunk this with some benchmarks :)

However, if you prefer a more intuitive argument, think about it this way:

  • One thread being one core (physical core).
  • Two threads being one core (hyperthreading, 2 logical cores).
  • The synergy effect means “1+1 > 2”.
  • Read the Gaussian Correlation Inequality, and apply the synergy effect to the dart distance example.
  • Generalize it to “IBM Power” CPUs for higher dimensions (up to 8 threads/dimensions, 8 logical cores).

Benchmarking Method

What are we going to benchmark?

  • We should benchmark three configurations: one with a small number of threads, one with a medium number, and one with many threads (this lets us understand the scaling more appropriately)
  • A benchmark that scales very well with a large number of threads (low overhead per thread)
  • A comparable benchmark similar to mathematical machine learning (optimization problems) / rendering (optimization problems)
  • A benchmark not really sensitive to memory speed/quantity

In addition, two of our machines run in virtualized environments, which makes the comparison even harder (and even more fun to compare against bare-metal results!).

We will have three machines today for our test:

  • Small (baremetal): Intel i7–4600U (2c/4t, 3.3GHz single, 2.7GHz multi = “5.4GHz”) + 16GB RAM
  • Medium (virtualized): Intel i7–3930K (6c/12t, 3.9GHz single, 3.5GHz multi = “21GHz”) + 54GB RAM
  • Large (virtualized): Dual Quanta Freedom Ivy Bridge (2x 10c/20t, 3.1GHz single, 2.7GHz multi = “54GHz”) + 80GB RAM

We will use a single benchmark, used by a lot of people, that fits the optimization-problem requirement. Guess what? Cinebench R15 (version 15.038, to be exact).

To benchmark the threads, we will use the following scenarios:

  • “Single” => Threads = 1: a single core is used.
  • “MultiP” => Threads = number of physical cores: what you are usually recommended to do (P = physical).
  • “MultiL” => Threads = number of logical cores: what you are never recommended to do (L = logical).
  • “MultiO” => Threads = number of logical cores * 2: overcommitted threads, a bad situation (O = overcommit).

Notation for CPUs: NAME (Cores/Threads, Single core frequency, Max Multi core frequency).
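The four scenarios map mechanically to thread counts. Here is a minimal Python sketch of that mapping (note that `os.cpu_count()` reports logical cores; physical core counts need something like `psutil.cpu_count(logical=False)`):

```python
def thread_scenarios(physical_cores: int, logical_cores: int) -> dict:
    """Map each benchmark scenario name to its thread count."""
    return {
        "Single": 1,                  # one core only
        "MultiP": physical_cores,     # the commonly recommended setting
        "MultiL": logical_cores,      # one thread per logical core
        "MultiO": logical_cores * 2,  # overcommitted threads
    }

# The Small machine (i7-4600U, 2c/4t):
print(thread_scenarios(2, 4))
# {'Single': 1, 'MultiP': 2, 'MultiL': 4, 'MultiO': 8}
```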

For comparable benchmarks, I took scores from the Linus Tech Tips Cinebench R15 Scores thread.

You can check the second-fastest Cinebench R15 score, from a quad-socket motherboard filled with four (4) Intel Xeon E7–8890v4 24c/48t, scoring 8299 (about four times faster than our best benchmark).


Small Baremetal Benchmark

Computer specs:

  • Baremetal setup
  • i7–4600U (2c/4t, 3.3GHz single, 2.7GHz multi = “5.4GHz”)
  • Haswell generation
  • 16GB RAM

Benchmark results:

  • Single = 1 thread (3.3GHz): 121 [Total: “3.3GHz” => 36.7 per GHz]
  • MultiP = 2 threads (2.7GHz): 197 [Total: “5.4GHz” => 36.5 per GHz]
  • MultiL = 4 threads (2.7GHz): 260 [Total: “5.4GHz” => 48.2 per GHz]
  • MultiO = 8 threads (2.7GHz): 257 [Total: “5.4GHz” => 47.6 per GHz]
Where does 260 fit, and what is the closest CPU in both architecture and frequency? The i5–3210M (2c/4t, 3.1/2.9GHz).
2 threads is very slightly slower per GHz due to the multithreading overhead; so is 8 threads.

Conclusion: you are better off using hyperthreading, by setting the number of threads to the number of logical cores.

i7–4600U (3.3/2.7), single thread, scoring 121.
i7–4600U (3.3/2.7), MultiP (2 threads), scoring 197.
i7–4600U (3.3/2.7), MultiL (4 threads), scoring 260.
i7–4600U (3.3/2.7), MultiO (8 threads), scoring 257.
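The per-GHz figures above divide each score by the machine's total nominal frequency: cores times the multi-core clock for the multithreaded runs, and the single-core clock alone for the single-threaded run. A minimal sketch of that normalization:

```python
def score_per_ghz(score: float, cores: int, ghz: float) -> float:
    """Normalize a Cinebench score by total nominal GHz (cores * clock)."""
    return score / (cores * ghz)

# i7-4600U figures from the Small benchmark:
single = score_per_ghz(121, 1, 3.3)   # ~36.7 per GHz
multi_p = score_per_ghz(197, 2, 2.7)  # ~36.5 per GHz
```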

Medium Virtualized Benchmark

Computer specs:

  • Virtualized setup (VMware) with 12 sockets of 1 core each
  • i7–3930K (6c/12t, 3.9GHz single, 3.5GHz multi = “21GHz”)
  • Sandy Bridge-E generation (should be significantly slower than Haswell per GHz)
  • 54GB RAM

Benchmark results:

  • Single = 1 thread (3.9GHz): 116 [Total: “3.9GHz” => 29.7 per GHz]
  • MultiP = 6 threads (3.5GHz): 613 [Total: “21GHz” => 29.2 per GHz]
  • MultiL = 12 threads (3.5GHz): 816 [Total: “21GHz” => 38.9 per GHz]
  • MultiO = 24 threads (3.5GHz): 786 [Total: “21GHz” => 37.4 per GHz]
As fast as an i5–3570K overclocked to 5.703GHz?!
Adding 500MHz seems to yield substantial improvements, but that chip reaches a 42.2 per-GHz score! Our virtualization must be handling something poorly to hurt performance that much (by about 10%).
6 threads is slightly slower per GHz due to the multithreading overhead; so is 24 threads.

Conclusion: you are better off using hyperthreading, by setting the number of threads to the number of logical cores. Virtualization hurts performance by about 10%, and/or the CPU generation took a large hit; the latter seems unlikely given independent benchmarks (i7–3930K at 4GHz scores 42.2 per GHz, versus our 38.9 per GHz).

i7–3930K (3.9/3.5), single thread, scoring 116.
i7–3930K (3.9/3.5), MultiP (6 threads), scoring 613.
i7–3930K (3.9/3.5), MultiL (12 threads), scoring 816.
i7–3930K (3.9/3.5), MultiO (24 threads), scoring 786.

Large Virtualized Benchmark

Computer specs:

  • Virtualized setup (KVM) with 40 sockets of 1 core each
  • Dual Quanta Freedom Ivy Bridge (2x 10c/20t, 3.1GHz single, 2.7GHz multi = “54GHz”)
  • Ivy Bridge generation
  • 80GB RAM

Benchmark results:

  • Single = 1 thread (3.1GHz): 111 [Total: “3.1GHz” => 35.8 per GHz]
  • MultiP = 20 threads (2.7GHz): 1777 [Total: “54GHz” => 32.9 per GHz]
  • MultiL = 40 threads (2.7GHz): 2270 [Total: “54GHz” => 42.0 per GHz]
  • MultiO = 80 threads (2.7GHz): 2114 [Total: “54GHz” => 39.1 per GHz]
There isn’t real competition here, only a question of how many bucks you have to spend: all of these CPUs, at launch prices, cost over €1,500 each for a single piece. Imagine buying two, or even four. The E5–2687W is an extremely powerful Xeon, designed to maximize frequency while keeping many cores, retailing at launch for about $1,900.
As we now scale to many threads, the scaling is significantly poorer for 20 threads than for a single thread (only 91.9% efficiency). The efficiency when overcommitting threads is also lower (93.1% relative to MultiL).

Conclusion: you are better off using hyperthreading, by setting the number of threads to the number of logical cores. Virtualization does not seem to hurt performance, or the effect is indiscernible (thanks, Reverse Hyper-V in KVM!).

Dual Quanta Freedom Ivy Bridge (3.1/2.7), single thread, scoring 111.
Dual Quanta Freedom Ivy Bridge (3.1/2.7), MultiP (20 threads), scoring 1777.
Dual Quanta Freedom Ivy Bridge (3.1/2.7), MultiL (40 threads), scoring 2270.
Dual Quanta Freedom Ivy Bridge (3.1/2.7), MultiO (80 threads), scoring 2114.
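The efficiency figures quoted above are simply ratios of per-GHz scores. A short sketch:

```python
def efficiency(per_ghz_multi: float, per_ghz_ref: float) -> float:
    """Scaling efficiency: per-GHz score relative to a reference run."""
    return per_ghz_multi / per_ghz_ref

# Large machine figures from above:
multi_p = efficiency(32.9, 35.8)  # 20 threads vs 1 thread: ~0.919
multi_o = efficiency(39.1, 42.0)  # 80 threads vs 40 threads: ~0.931
```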

Conclusion: Benchmarks Together

Always scale your threads up to your algorithm’s ability to parallelize well. The best way to know whether it parallelizes well is to… benchmark.

For instance, you could benchmark 100 iterations of xgboost with different numbers of threads to find out how many threads maximize performance!
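A minimal harness for that kind of sweep might look like the sketch below. The `train_fn` callable and the thread counts are placeholders; with xgboost, you would wrap something like `xgb.train({"nthread": n, ...}, dtrain, num_boost_round=100)` in it.

```python
import time

def sweep_thread_counts(train_fn, thread_counts):
    """Time train_fn(n_threads) once per candidate count; lower is better."""
    timings = {}
    for n in thread_counts:
        start = time.perf_counter()
        train_fn(n)  # the workload under test, parameterized by thread count
        timings[n] = time.perf_counter() - start
    return timings

# Usage sketch with a placeholder workload:
timings = sweep_thread_counts(lambda n: sum(range(10**5)), [1, 2, 4, 8])
best = min(timings, key=timings.get)
```

In practice you would run each count several times and keep the median, since single timings are noisy.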

In which situation are you? Vertically (how many seconds)? Horizontally (how well it scales)? In which facet (how well)? In the case of xgboost’s exact vs fast histogram methods, the exact method has the advantage of scaling well as long as the dataset is large enough (and “large enough” comes very quickly). The fast histogram method needs a significantly larger dataset to scale well: even with a 1M x 1K matrix, you will not saturate the CPUs with fast histogram, but you will with the exact method.

As for our previous results, we can summarize them in charts that speak for themselves.

When it comes to efficiency, the small machine is the unbeatable king. Although a generation separates our Small machine from our Large machine, the latter performs exceptionally well and would even beat our Small machine if we could scale the frequency to match (38.12 instead of 35.81, versus 36.67).
We can clearly see hyperthreading providing a hefty 31% performance boost! This is far better than the “poor performance” we were promised when committing to logical cores. However, overcommitting logical cores should be avoided: it lowers performance as the thread count grows.
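The 31% figure can be reproduced from the MultiP and MultiL per-GHz scores of the three machines:

```python
# (MultiP per GHz, MultiL per GHz) for each machine, from the sections above.
per_ghz = {
    "Small":  (36.5, 48.2),
    "Medium": (29.2, 38.9),
    "Large":  (32.9, 42.0),
}
# Relative boost of hyperthreading (MultiL) over physical cores only (MultiP).
boosts = {name: l / p - 1 for name, (p, l) in per_ghz.items()}
mean_boost = sum(boosts.values()) / len(boosts)
print(f"{mean_boost:.0%}")  # 31%
```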

How do the Quanta Freedom CPUs compare?

I personally have no idea, as the chip has no identifiable name. I ran some Geekbench 4 tests and found that the multithreaded performance is extremely similar to the E5–2650v3. The generation and frequency differences might explain the “huge” single-threaded gap (90.7%), while the multithreaded tests are nearly identical.

Nearly identical CPUs except for frequency: the E5–2650v3 runs at 3.0/2.6GHz, while the Quanta Freedom chips I use run at 3.1/2.7GHz and are one generation behind (Ivy Bridge, v2 Xeons).
Mostly identical? Some subscores differ widely, but still…