Virtualization & (Hyperthreading) Machine Learning Performance (Windows) (Part 1)

Published in Data Science & Design, Nov 14, 2016

Non-Kaggle post about the impact of virtualized CPU cores / sockets on Machine Learning / Optimization problems, specifically with xgboost on VMware (Linux host, Windows client). I found that this also applies to VirtualBox.

I’ve experimented a lot with virtualization and hyperthreading over the last 3 years while creating VMs (Virtual Machines). The difference in performance between configurations can be large, although it is not significant for most workloads.

Some workloads run for a very long time. A typical present-day example is parallelized Machine Learning & Optimization problem solving, and xgboost (Extreme Gradient Boosting) is a good instance of it. In its best-known form, it is a well-parallelized gradient boosted trees algorithm that outperforms almost all of its competitors (except the more recent LightGBM by Microsoft).

I ran different workloads on a server equipped with an Intel i7-3930K (6 Physical Cores, 12 Logical Cores, 3.20 GHz, 3.80 GHz turbo boost), 64GB RAM, and a Linux host operating system. I then installed VMware on the host, and a Windows 8 client to benchmark the performance of xgboost on a large data set (Bosch’s numeric data set). I pre-parsed the file myself for this competition into an RDS file for R, as a sparse matrix in which blanks replace NAs (90%+ sparse). As a dense matrix, the data set requires approximately 9GB RAM.
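As a rough sanity check of the 9GB figure, here is a minimal Python sketch of the dense-storage arithmetic. The row and column counts are assumptions on my side (the commonly cited shape of Bosch's numeric training file); they are not stated in this post. The 8 bytes per value matches how R stores numeric values as doubles.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory needed to hold every cell of a dense matrix, in bytes."""
    return n_rows * n_cols * bytes_per_value


# ASSUMED shape, not taken from the post: roughly 1.18 million rows
# and 968 numeric feature columns.
ASSUMED_ROWS = 1_183_747
ASSUMED_COLS = 968

size_bytes = dense_matrix_bytes(ASSUMED_ROWS, ASSUMED_COLS)
print(f"Dense size: {size_bytes / 1e9:.2f} GB")
```

With a matrix that is 90%+ sparse, a sparse representation only stores the non-blank cells, which is why the sparse form is the practical choice here.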

To make sure caching does not skew the results, a simple xgboost run (compiled from source on 09/22/2016) with 12 threads is performed every time before any benchmark. Then, 5 consecutive runs of 5 rounds are performed for N threads, with N = [1, 2, …, 17, 18]. Although going over 12 should be futile, I wanted to check whether overcommitting threads could be useful (tip: it is not, although that is not exactly true). Also, the client was restarted after every benchmark to clear caches. In addition, the Windows client was left idle for 5 minutes after each boot to avoid the aggressive post-boot OS activity (see the 99% disk usage and 20% CPU usage after boot in the picture). Superfetch, Scheduled Tasks, and co. were disabled.
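The procedure above can be sketched as a small timing harness. This is only an illustration in Python, not the script used for this post (the benchmark was run from R). `dummy_train` is a hypothetical stand-in for a 5-round xgboost training call, and the post does not say how the 5 runs were aggregated, so the sketch simply keeps every timing.

```python
import statistics
import time
from typing import Callable, Dict, Iterable, List


def benchmark(train: Callable[[int], None],
              thread_counts: Iterable[int],
              repeats: int = 5,
              warmup_threads: int = 12) -> Dict[int, List[float]]:
    """Time `train(n_threads)` `repeats` times for each thread count."""
    # Untimed warm-up run, mirroring the 12-thread run done before any benchmark.
    train(warmup_threads)
    timings: Dict[int, List[float]] = {}
    for n_threads in thread_counts:
        runs: List[float] = []
        for _ in range(repeats):
            start = time.perf_counter()
            train(n_threads)
            runs.append(time.perf_counter() - start)
        timings[n_threads] = runs
    return timings


def dummy_train(n_threads: int) -> None:
    """Hypothetical placeholder workload; it ignores n_threads entirely."""
    sum(i * i for i in range(20_000))


if __name__ == "__main__":
    # N = 1..18, as in the post: 12 logical cores plus some overcommit.
    results = benchmark(dummy_train, range(1, 19))
    for n_threads, runs in results.items():
        print(f"{n_threads:2d} threads: mean {statistics.mean(runs):.4f}s")
```

To reproduce something closer to the real test, `dummy_train` would be replaced by a function that trains xgboost for 5 rounds with the given thread count. The VM restart and the 5-minute idle period happen outside of such a script.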

I will do the following runs:

1 Socket, 12 Threads, 12 Cores
2 Sockets, 6 Threads each, 12 Cores
1 Socket, 6 Threads, 6 Cores
1 Socket, 11 Threads, 11 Cores

Here are the benchmarks.

1 Socket, 12 Threads, 12 Cores

Best: 15 threads, 35.939s (could be fluctuation)
6 Threads: 50.012s
12 Threads: 37.335s
Average overcommit (>12): 36.836s

2 Sockets, 6 Threads each, 12 Cores

Best: 17 threads, 36.703s (could be fluctuation)
6 Threads: 53.708s
12 Threads: 37.991s
Average overcommit (>12): 37.687s

1 Socket, 6 Threads, 6 Cores

Best: 14 threads, 43.671s (could be fluctuation)
6 Threads: 43.890s
12 Threads: 44.103s
Average overcommit (>6): 44.464s

1 Socket, 11 Threads, 11 Cores

Best: 13 threads, 36.623s (could be fluctuation)
6 Threads: 52.156s
12 Threads: 37.638s
Average overcommit (>11): 37.554s
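To make the four result sets easier to compare, the timings can be turned into speedup ratios. The short Python sketch below uses only the numbers reported above; the dictionary layout and the `speedup` helper are my own naming for illustration.

```python
# Timings in seconds, copied from the results above.
# Inner keys are the number of xgboost threads used.
TIMINGS = {
    "1 socket, 12 cores": {6: 50.012, 12: 37.335},
    "2 sockets, 12 cores": {6: 53.708, 12: 37.991},
    "1 socket, 6 cores": {6: 43.890, 12: 44.103},
    "1 socket, 11 cores": {6: 52.156, 12: 37.638},
}


def speedup(slow_seconds: float, fast_seconds: float) -> float:
    """How many times faster the second timing is than the first."""
    return slow_seconds / fast_seconds


# Within each VM configuration: 12 threads versus 6 threads.
for name, t in TIMINGS.items():
    print(f"{name}: 12 vs 6 threads = {speedup(t[6], t[12]):.2f}x")

# Across configurations: provisioning all 12 logical cores (12 threads)
# versus provisioning only the 6 physical cores (6 threads).
logical_vs_physical = speedup(TIMINGS["1 socket, 6 cores"][6],
                              TIMINGS["1 socket, 12 cores"][12])
print(f"12 logical cores vs 6 physical cores = {logical_vs_physical:.2f}x")
```

A ratio close to 1.0 means the extra threads made no measurable difference, which is what happens when the VM only has 6 cores provisioned.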

So, what is the conclusion? Do overcommit physical cores (6 in our example) on your Virtual Machine, but do not go beyond the number of logical cores (12 in our example):

Number of Sockets: 1 (or more if you have an Intel Xeon multi-CPU setup)
Number of Cores: total amount of logical cores
Number of Cores per Socket: respect your logical topology (if you have 4 Intel Xeons with 8 cores and hyperthreading enabled, then you have 16 cores per socket)
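The three settings above follow mechanically from the physical topology, as this minimal Python sketch shows. The `VmTopology` class and `recommended_topology` function are hypothetical names used for illustration, and the sketch assumes hyperthreading exposes exactly 2 logical cores per physical core.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmTopology:
    sockets: int
    total_cores: int
    cores_per_socket: int


def recommended_topology(physical_sockets: int,
                         physical_cores_per_socket: int,
                         hyperthreading: bool = True) -> VmTopology:
    """Mirror the host's logical topology in the VM settings."""
    # Assumption: 2 logical cores per physical core when hyperthreading is on.
    threads_per_core = 2 if hyperthreading else 1
    cores_per_socket = physical_cores_per_socket * threads_per_core
    return VmTopology(
        sockets=physical_sockets,
        total_cores=physical_sockets * cores_per_socket,
        cores_per_socket=cores_per_socket,
    )


# The i7-3930K used in this post: 1 socket, 6 physical cores, hyperthreading.
print(recommended_topology(1, 6))
# The Xeon example from the list above: 4 sockets, 8 physical cores each.
print(recommended_topology(4, 8))
```

The resulting values are what would go into the sockets / cores fields of the hypervisor's CPU settings.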

You may be able to squeeze 25% more performance from xgboost just by not following the “recommended settings” of VirtualBox / VMware, although the latter does sometimes recommend provisioning logical threads. Due to licensing limitations in Windows clients, I did not run the Virtual Machine in a long setup (12 Sockets, 1 Core per Socket, 12 Threads).

And no, leaving one core for the host / overhead does not increase performance of your Virtual Machine. It slightly decreases it, although it is barely noticeable.

But yes, the Linux host was faster than the Virtual Machine. One could also give Clear Linux and the Intel C++ Compiler a shot, as both are designed for maximum performance on Intel CPUs.

See Part 2 for the long architecture setup test.

Written by Laurae