Destroying the Myth of “number of threads = number of physical cores”
tl;dr: max performance => number of threads = number of logical/virtual cores
Edits (16/04/2017) due to requests
- Added some comparable benchmarks
- Added more pictures
- Added extra explanations
People frequently seem to say:
Set the number of threads to the number of physical cores, not the number of threads / logical cores for maximum performance!
This is simply wrong: that advice dates back to decade-old systems, where multithreading overhead was so large that dual-core machines already struggled to get the best out of 2 physical cores, let alone 4 logical cores!
You will see this (poor) advice in popular repositories; xgboost and LightGBM are no exceptions.
We are going to debunk this with some benchmarks :)
However, if you want a more intuitive argument grounded in statistics, just think about:
- One thread being one core (physical core).
- Two threads being one core (hyperthreading, 2 logical cores).
- The synergy effect means “1+1 > 2”.
- Read about the Gaussian correlation inequality, and apply the synergy effect to its dart-throwing example.
- Generalize it to IBM POWER CPUs for higher dimensions (up to 8 threads/dimensions per core, 8 logical cores).
What are we going to benchmark?
- We should benchmark three configurations: one with a small number of threads, one with a medium number, and one with many threads (this lets us understand the scaling properly)
- A benchmark that scales very well with a large number of threads (low overhead per thread)
- A benchmark comparable to mathematical machine learning and rendering workloads (both optimization problems)
- A benchmark not really sensitive to memory speed/quantity
In addition, we are running in virtualized environments, which makes the comparison even harder (and even more fun to compare against bare-metal results!).
We will have three machines today for our test:
- Small (baremetal): Intel i7–4600U (2c/4t, 3.3GHz single, 2.7GHz multi = “5.4GHz”) + 16GB RAM
- Medium (virtualized): Intel i7–3930K (6c/12t, 3.9GHz single, 3.5GHz multi = “21GHz”) + 54GB RAM
- Large (virtualized): Dual Quanta Freedom Ivy Bridge (2x 10c/20t, 3.1GHz single, 2.7GHz multi = “54GHz”) + 80GB RAM
We will use a single benchmark, used by a lot of people, that fits the optimization-problem requirement: Cinebench R15 (15.038, to be exact).
To benchmark the threads, we will use the following scenarios:
- “Single” => Threads = 1: a single core is used.
- “MultiP” => Threads = Number of physical cores: what you are usually recommended to do (P=physical).
- “MultiL” => Threads = Number of logical cores: what you are never recommended to do (L=logical).
- “MultiO” => Threads = Number of logical cores * 2: overcommitted threads, a bad situation (O=overcommit).
Notation for CPUs: NAME (Cores/Threads, Single core frequency, Max Multi core frequency).
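The four scenarios above can be derived programmatically. This is a minimal sketch: the `physical = logical // 2` line is an assumption that holds for the 2-way Hyper-Threaded Intel CPUs tested in this article, not for CPUs without SMT (or POWER chips with SMT8).

```python
import os

# Logical core count as reported by the OS (includes hyperthreads).
logical = os.cpu_count() or 1

# Assumption: 2-way SMT, so physical cores = logical cores / 2.
# Holds for the Hyper-Threaded CPUs benchmarked here.
physical = max(1, logical // 2)

scenarios = {
    "Single": 1,            # one thread, one core
    "MultiP": physical,     # threads = physical cores (the common advice)
    "MultiL": logical,      # threads = logical cores (what we recommend)
    "MultiO": logical * 2,  # overcommitted: twice the logical cores
}
print(scenarios)
```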
For comparable benchmarks, I used the Linus Tech Tips “Cinebench R15 Scores” spreadsheet on docs.google.com.
You can check the second-fastest Cinebench R15 score in that spreadsheet: a quad-socket motherboard filled with four (4) Intel Xeon E7–8890v4 (24c/48t each), scoring 8299 (about four times faster than our best benchmark).
Small Baremetal Benchmark
- Baremetal setup
- i7–4600U (2c/4t, 3.3GHz single, 2.7GHz multi = “5.4GHz”)
- Haswell generation
- 16GB RAM
- Single = 1 thread (3.3GHz): 121 [Total: “3.3GHz” => 36.7 per GHz]
- MultiP = 2 threads (2.7GHz): 197 [Total: “5.4GHz” => 36.5 per GHz]
- MultiL = 4 threads (2.7GHz): 260 [Total: “5.4GHz” => 48.1 per GHz]
- MultiO = 8 threads (2.7GHz): 257 [Total: “5.4GHz” => 47.6 per GHz]
Conclusion: you are better off using hyperthreading, by setting the number of threads to the number of logical cores.
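The bracketed score-per-GHz figures are simply the Cinebench score divided by the aggregate clock (active physical cores times per-core frequency). A quick sketch, using the Small benchmark's numbers:

```python
def score_per_ghz(score, cores, ghz):
    """Cinebench score normalized by aggregate clock (active physical
    cores times per-core frequency), rounded to one decimal."""
    return round(score / (cores * ghz), 1)

# i7-4600U (Small benchmark) figures from above:
print(score_per_ghz(121, 1, 3.3))  # Single: 1 core at 3.3GHz
print(score_per_ghz(197, 2, 2.7))  # MultiP: 2 cores at 2.7GHz
print(score_per_ghz(260, 2, 2.7))  # MultiL: same 2 cores, 4 threads
```

Note how MultiL gets far more work out of the same “5.4GHz” of aggregate clock than MultiP does.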
Medium Virtualized Benchmark
- Virtualized setup (VMware) with 12 sockets of 1 core each
- i7–3930K (6c/12t, 3.9GHz single, 3.5GHz multi = “21GHz”)
- Sandy Bridge-E generation (should be significantly slower than Haswell per GHz)
- 54GB RAM
- Single = 1 thread (3.9GHz): 116 [Total: “3.9GHz” => 29.7 per GHz]
- MultiP = 6 threads (3.5GHz): 613 [Total: “21GHz” => 29.2 per GHz]
- MultiL = 12 threads (3.5GHz): 816 [Total: “21GHz” => 38.9 per GHz]
- MultiO = 24 threads (3.5GHz): 786 [Total: “21GHz” => 37.4 per GHz]
Conclusion: you are better off using hyperthreading, by setting the number of threads to the number of logical cores. Virtualization hurts performance by about 10%, and/or the older CPU generation takes a large hit. The latter seems unlikely, though, given independent benchmarks (an i7–3930K at 4GHz scores 42.2 per GHz vs our 38.86 per GHz).
Large Virtualized Benchmark
- Virtualized setup (KVM) with 40 sockets of 1 core each
- Dual Quanta Freedom Ivy Bridge (2x 10c/20t, 3.1GHz single, 2.7GHz multi = “54GHz”)
- Ivy Bridge generation
- 80GB RAM
- Single = 1 thread (3.1GHz): 111 [Total: “3.1GHz” => 35.8 per GHz]
- MultiP = 20 threads (2.7GHz): 1777 [Total: “54GHz” => 32.9 per GHz]
- MultiL = 40 threads (2.7GHz): 2270 [Total: “54GHz” => 42.0 per GHz]
- MultiO = 80 threads (2.7GHz): 2114 [Total: “54GHz” => 39.1 per GHz]
Conclusion: you are better off using hyperthreading, by setting the number of threads to the number of logical cores. Virtualization does not seem to hurt performance, or the difference is indiscernible (thanks, reverse Hyper-V in KVM!).
Conclusion: Benchmarks Together
Always scale your threads up to your algorithm's ability to parallelize well. The best way to know whether it parallelizes well is to… benchmark.
For instance, you could benchmark 100 iterations of xgboost with different numbers of threads to find out how many threads maximize performance!
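A minimal sketch of such a sweep, using a generic CPU-bound function as a stand-in for an xgboost training run (`busy_work`, the task count, and the sizes are all placeholders, not xgboost API calls):

```python
import os
import time
from concurrent.futures import ProcessPoolExecutor

def busy_work(n):
    # Stand-in for one CPU-bound training iteration.
    total = 0
    for i in range(n):
        total += i * i
    return total

def time_with_workers(workers, tasks=16, n=200_000):
    """Wall-clock time to run `tasks` work chunks on `workers` processes."""
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        list(pool.map(busy_work, [n] * tasks))
    return time.perf_counter() - start

if __name__ == "__main__":
    logical = os.cpu_count() or 1
    # Sweep the same scenarios as the benchmarks above.
    for workers in (1, max(1, logical // 2), logical, logical * 2):
        print(f"{workers:3d} workers: {time_with_workers(workers):.2f}s")
```

Whichever worker count yields the lowest wall time is the one to use; on the hyperthreaded machines above, expect it to be the logical core count.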
As for our previous results, we can summarize them in charts that speak for themselves.
How do the Quanta Freedom CPUs compare?
I personally have no idea, as the CPU has no identifiable name. I ran some Geekbench 4 tests and found that the multithreaded performance is extremely similar to an E5–2650v3. Generation and frequency differences might explain the “huge” single-threaded difference (90.7%) while the multithreaded tests are nearly identical.