I did not see any impact of nthread (called n_jobs in Python, for those who only know the R name) on xgboost GPU other than:
- Oversubscribing the CPU slows xgboost down. I tried nthread from 1 to 72 with 1 to 4 GPUs on my server before writing this long blog post, and it was just a waste of time: every result was identical, except for the oversubscribed CPU runs, which took longer to train.
- Setting nthread lower than the number of requested GPUs crashes xgboost.
- Note: the number of threads might be used elsewhere, but I do not see where exactly for xgboost GPU. You would probably have to search the xgboost source code for #pragma omp.
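The two constraints above (do not oversubscribe the CPU, do not go below the number of GPUs) can be folded into a small helper. This is only a sketch of the rule of thumb, not something xgboost provides: `safe_nthread` and its arguments are hypothetical names, and the "cores divided by concurrent workers" budget is my assumption about what counts as oversubscription.

```python
import os


def safe_nthread(requested, n_gpus, n_workers=1, cpu_count=None):
    """Pick an nthread value for one xgboost GPU run (hypothetical helper).

    requested: the nthread you would like to use
    n_gpus:    GPUs requested by this run (nthread must not be lower)
    n_workers: how many runs share the machine at the same time
    cpu_count: override for the detected core count (useful for testing)
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # Assumed thread budget per run: an equal share of the cores.
    budget = max(1, cores // max(1, n_workers))
    # Cap the request so concurrent runs do not oversubscribe the CPU.
    capped = min(requested, budget)
    # Never go below the number of GPUs, since that crashed for me.
    return max(capped, n_gpus)


# One run on a 36-core box asking for 72 threads is capped to 36.
print(safe_nthread(72, n_gpus=4, cpu_count=36))
# Asking for 1 thread with 4 GPUs is raised to 4.
print(safe_nthread(1, n_gpus=4, cpu_count=72))
```

Since every non-oversubscribed setting gave me identical timings, the exact value matters little as long as it stays inside these two bounds.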
As long as everything fits in RAM and the data is small enough, you should get good parallelization, especially because xgboost GPU spends a lot of time waiting for computations. This is why you see such good scaling: the GPU stays busy longer.
However, if you fully saturate the GPU, you might get negative efficiency (you probably have to reach the same level as I did, 50+ xgboost runs at the same time?).
It’s still better to parallelize a cross-validation (per fold) than to parallelize inside a fold (the xgboost model), as you noticed:
- Training multiple xgboost models at the same time (per fold): fast, because the efficiency issue sits at the data level, which is easy to solve since it’s a non-recursive, embarrassingly parallel task. This should be the preferred scenario.
- Training parallel xgboost models sequentially (per model): slow, because xgboost scales poorly (the efficiency issue sits at the model level, which is hard to solve).
In most cases I’m seeing a parallel 5-fold CV run about 50% faster, on CPU or GPU, than training the folds sequentially with a multi-threaded/multi-GPU model.
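A minimal sketch of the per-fold approach, using only the Python standard library. The names `kfold_indices`, `parallel_cv` and `toy_train_one_fold` are hypothetical, and the toy function only returns fold sizes; in a real run it would train one single-threaded xgboost model on the fold. I use threads here on the assumption that the training call releases the GIL (I believe xgboost's native calls do); if that is not the case for your setup, swap in `ProcessPoolExecutor`.

```python
from concurrent.futures import ThreadPoolExecutor


def kfold_indices(n_rows, n_folds):
    """Yield (train_idx, valid_idx) for a simple contiguous k-fold split."""
    base, extra = divmod(n_rows, n_folds)
    start = 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


def parallel_cv(train_one_fold, n_rows, n_folds=5, max_workers=None):
    """Run train_one_fold on every fold concurrently, results in fold order."""
    folds = list(kfold_indices(n_rows, n_folds))
    with ThreadPoolExecutor(max_workers=max_workers or n_folds) as pool:
        futures = [pool.submit(train_one_fold, tr, va) for tr, va in folds]
        return [f.result() for f in futures]


def toy_train_one_fold(train_idx, valid_idx):
    # Stand-in for real training: here you would slice your data with the
    # indices, train one model with nthread=1, and return its validation score.
    return len(train_idx), len(valid_idx)


scores = parallel_cv(toy_train_one_fold, n_rows=10, n_folds=5)
print(scores)
```

The point of the design is that each fold is an independent task, so the parallelism lives outside the model and the model itself does not need to scale.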