Virtualization & (Hyperthreading) Machine Learning Performance (Windows) (Part 2)

Laurae
Data Science & Design
4 min read · Nov 25, 2016

Non-Kaggle post about the impact of virtualized CPU cores and sockets on Machine Learning and optimization problems, specifically on xgboost under VMware (Linux host, Windows guest). I found that this also applies to VirtualBox.

In the previous part, we found that the reverse of a long setup (12 cores, 1 socket) performed best when we could not use a long setup directly (due to licensing issues, cf. Windows desktop editions).

What if we get around the licensing issues and use Windows Server? This is what we are going to explore now.

I am still using the same configuration as previously:

  • Linux host (Ubuntu 16.04) with Intel i7-3930K (6 Physical Cores, 12 Logical Cores, 3.20 GHz, 3.80 GHz turbo boost) + 64GB RAM
  • Windows Server 2012 R2 Datacenter as guest with 54GB RAM (I have as many licenses as I want; I was a Microsoft Certified Trainer for 2 years)

The quick conclusion is that using more sockets helps a lot. The difference, however, is negligible for most tasks unless you are:

  • Trying to allow single-threaded applications to run on a physical core instead of a logical core
  • Trying to do long tasks (where even 1% better performance is worth taking)
  • Not running multiple concurrent VMs with conflicting resources in a single host (like Virtual Private Server reselling)

You should get something similar to this when running with 6 or 12 threads in a Virtual Machine on a Linux host:

First run: 12 Sockets, 12 Threads, 1 Core

Best: 15 Threads, 31.464s (could be fluctuation)
6 Threads: 39.845s
12 Threads: 32.186s
Average overcommit (>12): 32.199s

As we can see, not only did we get a 10% performance boost when using many threads (or overallocating threads), but we also see a major performance boost when a single thread gets a full physical core (260s vs 178s single-threaded, anyone?).
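To put numbers on these gains, here is a minimal Python sketch that turns the timings quoted in this run into relative figures. The helper names `time_saved` and `speedup` are mine, introduced only for illustration; the inputs are the 260s/178s single-thread times and the 6-thread and 12-thread times from the first run.

```python
def time_saved(baseline_s: float, new_s: float) -> float:
    """Fraction of wall-clock time saved relative to the baseline."""
    return (baseline_s - new_s) / baseline_s


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster the new timing is than the baseline."""
    return baseline_s / new_s


# Single-threaded xgboost: 260s before vs 178s on a full physical core.
single_saved = time_saved(260.0, 178.0)
single_speedup = speedup(260.0, 178.0)

# First run (12 sockets): 6 threads vs 12 threads.
ht_saved = time_saved(39.845, 32.186)

print(f"single thread: {single_saved:.1%} less time ({single_speedup:.2f}x)")
print(f"6 -> 12 threads: {ht_saved:.1%} less time")
```

With these inputs, the single-threaded case comes out at roughly 31.5% less time, and going from 6 to 12 threads saves roughly 19%.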

Is there any rationale for this? If the hypervisor has the freedom to schedule on your CPU, then you should see this on hyperthreaded cores (screenshot from glances):

For instance, here we ran a single-threaded xgboost. It used 2 logical cores and finished in 178 seconds.

It might not explain why Linux is unable to see the proper workload on each core (as Windows does), but it still does its job.

Even worse is the reporting when starting a 3-threaded xgboost (Windows Task Manager vs Linux Glances):

What should we believe? 26% CPU usage (VM) vs … 5–6 busy cores? The answer is easy: Windows cannot properly detect hyperthreaded cores, and thus reports half the CPU utilization. In reality, 3 physical cores were used and the workload took 68 seconds. Running a separate workload on the Linux host (to try to cover all physical cores) was slower than running it without the extra Virtual Machine workload.
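The two readings can be reconciled with simple arithmetic. The sketch below assumes each of the 3 xgboost threads saturates one physical core (my reading of the paragraph above, not a measurement); the function name `utilization` is illustrative.

```python
LOGICAL_CORES = 12   # what the Windows guest counts
PHYSICAL_CORES = 6   # what the i7-3930K actually has
BUSY_THREADS = 3     # 3-threaded xgboost run


def utilization(busy: int, cores: int) -> float:
    """Share of cores that are busy, as a fraction between 0 and 1."""
    return busy / cores


guest_view = utilization(BUSY_THREADS, LOGICAL_CORES)      # per logical core
physical_view = utilization(BUSY_THREADS, PHYSICAL_CORES)  # per physical core

print(f"guest view:    {guest_view:.0%}")
print(f"physical view: {physical_view:.0%}")
```

Counted against logical cores this gives 25%, close to the 26% Task Manager shows; counted against physical cores it gives 50%, which matches the "half the CPU utilization" remark and the 5–6 busy logical cores seen on the host.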

When compared to the previous benchmarks (3 threads = 92s, 6 threads = 50s, 12 threads = 37.5s), a simple adjustment shows the 68-second benchmark is in line with what we should have had: scaling the 50s result by the hyperthreading ratio (50/37.5 = 1.33) gives 66.7s. Also, as we noticed earlier, overcommitting physical cores improves the performance of the VM. Hence, we get diminishing returns as we increase the number of threads on our long VM (12 Sockets).
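The adjustment can be written out explicitly. This is only a worked version of the arithmetic in the paragraph above; the variable names are mine.

```python
# Previous benchmarks (seconds), as quoted above.
t_6_threads = 50.0
t_12_threads = 37.5

# Ratio between the 6-thread and 12-thread timings.
ht_ratio = t_6_threads / t_12_threads

# Expected time for 3 threads pinned on 3 physical cores.
expected_s = t_6_threads * ht_ratio

measured_s = 68.0
deviation = (measured_s - expected_s) / expected_s

print(f"ratio:     {ht_ratio:.2f}")
print(f"expected:  {expected_s:.1f}s")
print(f"deviation: {deviation:.1%}")
```

The measured 68 seconds lands about 2% above the 66.7s estimate, which is well within what one would call "in line".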

Therefore, a simple rule of thumb: use as many single-core sockets as you can. However, licensing issues may arise. This is THE reason why Windows Server 2016 will be licensed per core and not per socket anymore!

Second run: 6 Sockets, 12 Threads, 2 Cores

Best: 13 Threads, 31.512s (could be fluctuation)
6 Threads: 42.543s
12 Threads: 33.197s
Average overcommit (>12): 33.536s

Put simply: you should respect the long architecture (1 Socket = 1 Core) when dealing with virtualization and Virtual Machines if you want to maximize the performance of your VM and replicate the CPU architecture of your host. A long architecture allows the hypervisor to schedule single threads onto 2 logical cores (1 physical core) correctly, for maximum performance. The fewer sockets, the lower the performance.
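For a side-by-side view of the two runs, the following sketch computes how much slower the 6-socket layout is than the 12-socket one at the same thread count. The dictionary layout and the `slowdown` helper are illustrative; the timings are the ones reported in the first and second runs.

```python
# Timings in seconds, keyed by xgboost thread count.
runs = {
    "12 sockets x 1 core": {6: 39.845, 12: 32.186},
    "6 sockets x 2 cores": {6: 42.543, 12: 33.197},
}


def slowdown(reference_s: float, other_s: float) -> float:
    """Extra time taken by `other_s` relative to `reference_s`."""
    return (other_s - reference_s) / reference_s


long_run = runs["12 sockets x 1 core"]
short_run = runs["6 sockets x 2 cores"]

# Same thread counts, different socket layout.
penalty = {n: slowdown(long_run[n], short_run[n]) for n in (6, 12)}

for n, p in penalty.items():
    print(f"{n:>2} threads: 6-socket layout is {p:.1%} slower")
```

On these numbers the 6-socket layout is about 6.8% slower at 6 threads and about 3.1% slower at 12 threads, so the gap narrows once all logical cores are busy.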

You will squeeze about:

  • 10% extra multithreading performance
  • 30% extra singlethreaded performance

Isn’t that nice to get for free?

Another rule of thumb: always run with as many threads as you have logical cores. Your free time will thank you for the extra 10% performance boost. So when you have 6 physical cores + hyperthreading, run xgboost with 12 threads, not 6.
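In Python, this rule can be applied by setting xgboost's `nthread` parameter to the logical core count reported by the operating system. A minimal sketch, assuming `os.cpu_count()` returns the logical core count inside the VM; the helper name `xgb_params_for_host` and the other parameter values are hypothetical, chosen only for illustration.

```python
import os


def xgb_params_for_host(extra=None):
    """Build an xgboost parameter dict using every logical core.

    `extra` holds any other training parameters (illustrative only).
    """
    # os.cpu_count() can return None, so fall back to a single thread.
    logical_cores = os.cpu_count() or 1
    params = {"nthread": logical_cores}
    if extra:
        params.update(extra)
    return params


params = xgb_params_for_host({"objective": "binary:logistic", "eta": 0.1})
print(params)

# Usage with xgboost installed (dtrain is a hypothetical xgboost.DMatrix):
#   import xgboost as xgb
#   model = xgb.train(params, dtrain, num_boost_round=100)
```

On the guest described in this post, `nthread` would come out as 12 rather than 6.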
