Benchmarking xgboost with and without virtualization

We have seen previously that xgboost's new fast histogram method leads to blazing performance. However, all our tests were done in a virtualized environment. What if we compared it in the most unfair scenario possible?

  • Virtualized machine: Linux host, KVM virtualization, Windows guest
  • Baremetal machine: Linux

This is what we are going to do. We have access to two extra machines, thanks to Yifan Xie (Intel machine) and drverzal (AMD machine), who helped benchmark xgboost exact and fast histogram:

  • Intel i7-7700K overclocked to 5.0/4.7 GHz, 64 GB RAM, baremetal Linux
  • AMD Ryzen 7 1700 at 3.7/3.2 GHz, 16 GB RAM, baremetal Windows

Benchmarking

We are going to use the following settings to benchmark the three machines (a sketch of the surrounding timing loop is shown right after the list):

  • xgboost Exact
# xgb_data is an xgb.DMatrix holding the numeric Bosch dataset;
# i is the number of threads being benchmarked in the current run
gc(verbose = FALSE)
set.seed(11111)
StartTime <- System$currentTimeMillis()  # from the R.utils package
temp_model <- xgb.train(data = xgb_data,
                        nthread = i,
                        nrounds = 50,
                        max_leaves = 255,
                        max_depth = 6,
                        eta = 0.20,
                        tree_method = "exact",
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)
  • xgboost Fast Histogram (old version)
gc(verbose = FALSE)
set.seed(11111)
StartTime <- System$currentTimeMillis()  # from the R.utils package
temp_model <- xgb.train(data = xgb_data,
                        nthread = i,
                        nrounds = 200,
                        max_leaves = 255,
                        max_depth = 12,
                        eta = 0.05,
                        tree_method = "hist",
                        max_bin = 255,
                        booster = "gbtree",
                        objective = "binary:logistic",
                        verbose = 2)
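
The snippets above are run inside a loop over thread counts (the i variable) and timed with System$currentTimeMillis(). The actual harness is not shown in this post, so here is a minimal sketch of how such a loop could look for the exact method; the loop bounds and the timing bookkeeping are illustrative assumptions, and xgb_data is assumed to be already built (see the next snippet):

library(xgboost)
library(R.utils)  # provides System$currentTimeMillis()

timings <- data.frame(threads = integer(0), seconds = numeric(0))

for (i in 1:8) {  # thread counts to test; adapt to the machine's core count
  gc(verbose = FALSE)
  set.seed(11111)
  StartTime <- System$currentTimeMillis()
  temp_model <- xgb.train(data = xgb_data,
                          nthread = i,
                          nrounds = 50,
                          max_leaves = 255,
                          max_depth = 6,
                          eta = 0.20,
                          tree_method = "exact",
                          booster = "gbtree",
                          objective = "binary:logistic",
                          verbose = 2)
  EndTime <- System$currentTimeMillis()
  timings <- rbind(timings,
                   data.frame(threads = i, seconds = (EndTime - StartTime) / 1000))
}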

Our xgboost tests consist of training with the parameter sets above on the full numeric Bosch dataset (1,183,747 observations, 969 features, an unbalanced dataset with only 6,879 positive cases).
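
For reference, here is a hedged sketch of how xgb_data could be built from the Kaggle Bosch numeric file; the file name and the use of data.table::fread() are assumptions, not the original preprocessing script:

library(data.table)
library(xgboost)

bosch <- fread("train_numeric.csv")  # hypothetical path to the Bosch numeric CSV
label <- bosch$Response              # binary target (the 6,879 positive cases)
features <- as.matrix(bosch[, !c("Id", "Response"), with = FALSE])

# Missing values dominate the Bosch data, so flag them explicitly
xgb_data <- xgb.DMatrix(data = features, label = label, missing = NA)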

Think it is hard to compile xgboost? Not at all:

devtools::install_github("Laurae2/ez_xgb/R-package@2017-02-15-v1")

Exact xgboost

tl;dr: baremetal wins.

Normalization per thread comparison:

  • Baremetal is faster overall.
  • AMD Ryzen is slower overall.
  • Once AMD's SMT (hyperthreading) kicks in, our virtualized Intel machine gets smoked (in fact, the baremetal Intel machine also gets smoked).

Cumulated Normalization per thread comparison:

  • Seen in this cumulated way, AMD is not that slow.
  • In fact, we would expect Intel to do much better: a 47% higher clock rate for only about 30% faster average training time is clearly not that efficient.

Detailed Data Chart:

  • Need details? The ranking is obvious: baremetal Intel (Linux) > virtualized Intel (Windows) > baremetal AMD (Windows).

Fast Histogram xgboost

tl;dr: gcc 7.1 wins.

Normalization per thread comparison:

  • With fast histogram, the GHz show-off begins: a 35% higher clock rate for nearly 50% higher single-threaded speed, isn’t that marvelous?
  • AMD is nowhere near Intel (yet). Keep in mind that if you are looking for faster training than exact xgboost, fast histogram xgboost will simply do it 10x to 30x faster (or even more) on large datasets.

Cumulated Normalization per thread comparison:

  • I think the conclusion is very easy to draw: the advantage of virtualized Intel (Windows) over baremetal AMD is about 2/3 of the advantage of the best scenario (baremetal Intel) over the same AMD machine.

Detailed Data Chart:

  • I don’t think you can complain about doing 200 training iterations on Bosch in only 400 seconds (or less) these days.
  • Remember we are talking about training on 1,147,050,843 elements; even at 90% sparsity, that still leaves over a hundred million elements (quick arithmetic check below).
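
As a quick sanity check on those numbers (plain arithmetic, not code from the original benchmark):

# 1,183,747 observations x 969 features
1183747 * 969        # 1,147,050,843 elements in the dense view
1183747 * 969 * 0.1  # ~114.7 million elements left even at 90% sparsity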

Conclusion

Some simple key takeaways:

  • Use a baremetal machine if you want maximum performance. Whether you run Linux or Windows does not really matter; you already get plenty of performance.
  • Fast Histogram xgboost already delivers plenty of performance. And you can get even more performance using the new fast histogram!
  • Throwing more cores at fast histogram xgboost is not ideal, while exact xgboost scales well with additional cores.
  • When comparing two algorithms, use the same baseline. The picture shown before does not take into account the difference in the number of training iterations: you are actually doing 4 times more iterations with fast histogram xgboost than with exact xgboost (200 versus 50 rounds), thus getting much closer to convergence. Also, the hyperparameters used for fast histogram xgboost are more RAM-intensive (larger depth). A matched-baseline sketch follows this list.
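
To illustrate the same-baseline point, here is a hedged sketch using identical rounds, depth, and learning rate for both tree methods; these particular values are illustrative and are not the settings used in the benchmarks above:

params_shared <- list(booster = "gbtree", objective = "binary:logistic",
                      eta = 0.05, max_depth = 6, nthread = 8)

model_exact <- xgb.train(data = xgb_data, nrounds = 50,
                         params = c(params_shared, list(tree_method = "exact")))
model_hist <- xgb.train(data = xgb_data, nrounds = 50,
                        params = c(params_shared, list(tree_method = "hist", max_bin = 255)))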

Previous post in this series:
