Thinking Before Building: XGBoost Parallelization
In recent years, many jobs in tech have been transformed by new tools and open source libraries that are easy to use. In data science, Scikit-learn, Keras, Jupyter and TensorFlow, to name a few, have significantly improved the pace at which we can deliver. But in some cases, not knowing the internal mechanics can lead to counter-productive solutions.
At BlaBlaCar, we mainly use the famous XGBoost to build our machine learning algorithms. XGBoost is made to build gradient boosted trees, a highly performant class of algorithms. Read the fast machine learning prediction post by Eduardo to see why we chose this solution.
A gradient boosted tree model is a succession of decision trees where the N-th tree is built to correct the errors left by the first N-1 trees combined. It is defined by several hyperparameters: the depth of each tree, the number of trees, the percentage of samples used to train each tree, the percentage of features used at each split within the trees, etc.
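To make these hyperparameters concrete, here is how the four of them map onto parameter names in XGBoost's scikit-learn style interface. The names are XGBoost's own; the values are purely illustrative assumptions, not the settings we use in production.

```python
# Illustrative values only (assumptions); the keys are XGBoost parameter names.
xgb_hyperparameters = {
    "max_depth": 6,           # depth of each tree
    "n_estimators": 300,      # number of trees
    "subsample": 0.8,         # fraction of samples used to train each tree
    "colsample_bynode": 0.8,  # fraction of features considered at each split
}

for name, value in xgb_hyperparameters.items():
    print(f"{name} = {value}")
```

A dictionary like this could be unpacked into an estimator such as `xgboost.XGBRegressor(**xgb_hyperparameters)`; tuning then means trying many such dictionaries and keeping the best one.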
Hyperparameters are the parameters of the algorithm itself and not the parameters that the algorithm will learn during the training stage. To build the best algorithm, we need to find the best hyperparameters, which means training and scoring a model for every candidate combination. This is why hyperparameter tuning usually requires a lot of computing power and time.
Fortunately, a great gift was recently given to data scientists: cloud computing, a vast amount of computing power easily accessible at a (reasonably) low price. So we decided to run this task on a 64-core machine in the cloud and take advantage of parallel computing.
Now, anyone who has used XGBoost knows that the parallelisation of the algorithm is extremely well implemented. If you have not used it, trust me, it is mind-blowing :). Well… there is a catch.
In XGBoost, the parallelisation happens during the construction of each tree, at a very low level. Each independent branch of the tree is trained separately. Hyperparameter tuning requires many branches per tree, many trees per model, several models per hyperparameter value and many hyperparameter values to be tested…
Moreover, parallelisation has one main drawback: the overhead. Before any parallel computation, data has to be sent to each core, and with tasks this small that cost is paid over and over again. Thus parallelising at such a low level might not be the best idea. Instead, we should give the entire dataset to each core and let it train a full model by itself.
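The idea of "one full model per core" can be sketched with the standard library alone. This is a minimal sketch, not our actual tuning code: `train_and_score` and `search` are hypothetical names, and the scoring function is a made-up stand-in so the example runs without XGBoost installed.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def train_and_score(params):
    """Hypothetical stand-in for 'train one full model on one core'.

    With XGBoost, this is where a single-threaded model (n_jobs=1)
    would be fitted and its validation score returned. Here we return
    a made-up score so the sketch has no third-party dependency.
    """
    return -abs(params["max_depth"] - 6) - abs(params["subsample"] - 0.8)


def search(combinations, max_workers, executor_cls=ProcessPoolExecutor):
    """Evaluate every combination, one whole model per worker."""
    with executor_cls(max_workers=max_workers) as pool:
        # Each worker receives a full combination and trains a full model.
        scores = list(pool.map(train_and_score, combinations))
    # Keep the combination with the highest score.
    best_score, best_params = max(
        zip(scores, combinations), key=lambda pair: pair[0]
    )
    return best_params, best_score
```

On a 64-core machine this would be called as `search(combinations, max_workers=64)` from inside an `if __name__ == "__main__":` block. The important design choice is that the model inside each worker stays single-threaded, so the two levels of parallelism do not compete for the same cores.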
To test our hypothesis, we built a small protocol to compare the two methods. We tested 200 hyperparameter combinations; for each one, we trained the algorithm in a 5-fold manner and computed the average score.
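For readers less familiar with the 5-fold part, here is a rough sketch of what evaluating one combination involves. The helper names and the `fit_and_score` callback are hypothetical; in practice a library routine would do this for you.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Split range(n_samples) into n_folds shuffled, disjoint validation folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]


def cross_validated_score(params, n_samples, fit_and_score, n_folds=5):
    """Average the score over n_folds train/validation splits.

    fit_and_score(params, train_idx, valid_idx) is a hypothetical callback
    that trains on train_idx and returns a score measured on valid_idx.
    """
    scores = []
    for held_out in k_fold_indices(n_samples, n_folds):
        held_out_set = set(held_out)
        train = [i for i in range(n_samples) if i not in held_out_set]
        scores.append(fit_and_score(params, train, held_out))
    return mean(scores)


n_combinations, n_folds = 200, 5
print(n_combinations * n_folds)  # model fits needed per method: 1000
```

So each of the two methods had to train 1000 models, which is enough work for the difference in overhead to show up clearly.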
The numbers were compelling: with more than 8 cores, it is clearly faster to parallelise at the hyperparameter level. For the real optimisation, we searched over a total of 6000 hyperparameter combinations, which would have taken about 7 times longer with the wrong parallelisation (11 hours instead of 1.5 hours).
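The arithmetic behind these figures is quick to check. The per-combination numbers below are derived from the totals quoted above; they are amortised wall-clock times for the whole machine, not measurements of a single model.

```python
hours_tree_level = 11.0    # parallelising inside each tree (from the text)
hours_search_level = 1.5   # parallelising across combinations (from the text)
n_combinations = 6000

speedup = hours_tree_level / hours_search_level
seconds_each_tree_level = hours_tree_level * 3600 / n_combinations
seconds_each_search_level = hours_search_level * 3600 / n_combinations

print(f"speedup: {speedup:.1f}x")                                      # speedup: 7.3x
print(f"tree-level: {seconds_each_tree_level:.1f} s per combination")    # 6.6 s
print(f"search-level: {seconds_each_search_level:.1f} s per combination")  # 0.9 s
```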
All the great tools that are now available to data scientists will never replace good engineering and thinking. There are still many use cases that require thinking to generate great results.
More generally, this is true and applicable to many other disciplines: software engineering, data engineering, UX design, marketing and finance. Many tools have been developed to make our jobs more efficient, but this often comes at the cost of understanding what the tool is doing. It remains crucial to be able to deep dive when necessary.
Always think first. 💡