XGBoost: Fast and High on Performance
Who wouldn’t like to have a Machine Learning tool that is fast and yet high on performance? One such tool, which belongs in the toolkit of anyone who wishes to do machine learning, is the XGBoost library. XGBoost is short for Extreme Gradient Boosting. It is a library for developing fast, high-performance gradient boosted tree models. The credit for creating this library goes to Tianqi Chen. Its usage took off after the winners of various Kaggle competitions voiced their love for it on open platforms.
This library creates tree-based models, which are well known for their performance on tabular data.
XGBoost employs an ensemble method that seeks to create a strong classifier from “weak” classifiers. Here, weak and strong refer to the magnitude of the correlation between the learners and the target variable.
The models are added on top of each other iteratively, such that every successive model learns from the previous one and corrects its errors. This process is repeated until the model accurately predicts (reproduces) the training data.
Gradient Boosting is also an ensemble method that sequentially adds predictors, each one correcting the models before it.
Instead of assigning weights to the classifiers after every iteration, this method fits the new model to the residuals of the previous predictions and, in the process, minimizes the loss when adding the latest prediction.
In effect, you are updating your model using gradient descent. XGBoost supports both regression and classification problems.
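The residual-fitting loop described above can be sketched in a few lines of plain Python. This is a simplified illustration, not XGBoost's actual algorithm: it assumes squared-error loss (for which the residual is exactly the negative gradient of the loss, hence the link to gradient descent), uses one-split "stumps" as the weak learners, and runs on made-up toy data. The names `fit_stump`, `gradient_boost` and `learning_rate` are chosen for this example only.

```python
# Toy 1-D regression data (hypothetical values, for illustration only).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]


def mse(targets, predictions):
    """Mean squared error between two equally long lists."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a weak learner) to the residuals."""
    best = None
    # Try every split point except the largest x, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boost stumps on squared-error residuals; return (predict, mse_history)."""
    base = sum(ys) / len(ys)              # initial model: predict the mean
    predictions = [base] * len(ys)
    stumps = []
    history = [mse(ys, predictions)]
    for _ in range(n_rounds):
        # Residuals = negative gradient of 0.5 * (y - prediction)^2.
        residuals = [t - p for t, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction the new stump suggests.
        predictions = [p + learning_rate * stump(x)
                       for p, x in zip(predictions, xs)]
        history.append(mse(ys, predictions))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


predict, history = gradient_boost(X, y)
print(f"training MSE before boosting: {history[0]:.4f}")
print(f"training MSE after {len(history) - 1} rounds: {history[-1]:.4f}")
```

Each round shrinks the training error a little, which is exactly why the training data ends up being reproduced so closely if the loop runs long enough.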
After going through the above, one might wonder what happens when this technique builds a model that keeps improving itself until its predictions on the training data closely match the training targets.
There is a high probability that such a model is overfitted (i.e. the test error is considerably greater than the training error).
We do have a solution for this. One can tune the hyperparameters to balance variance and bias, or in other words, to avoid overfitting the model to the data. For instance, bagging (i.e. averaging the trees, since an average has lower variance than the individual trees) can be used along with boosting.
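The claim that an average has lower variance than its individual members can be checked with a small simulation. The sketch below makes a strong simplifying assumption: each "tree" is modeled as the true value plus independent Gaussian noise. Real bagged trees are correlated, so the reduction in practice is smaller than what this toy setup shows. All constants are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_VALUE = 10.0   # the quantity every model tries to predict
NOISE_STD = 2.0     # spread of a single model's prediction
N_MODELS = 25       # number of models averaged together
N_TRIALS = 2000     # number of repetitions of the experiment


def noisy_prediction():
    """One model's prediction: the truth plus independent noise."""
    return random.gauss(TRUE_VALUE, NOISE_STD)


# Predictions from a single model, repeated over many trials.
single = [noisy_prediction() for _ in range(N_TRIALS)]

# Predictions from an average of N_MODELS models, repeated over many trials.
averaged = [
    statistics.fmean([noisy_prediction() for _ in range(N_MODELS)])
    for _ in range(N_TRIALS)
]

var_single = statistics.variance(single)
var_averaged = statistics.variance(averaged)
print(f"variance of a single model:   {var_single:.3f}")
print(f"variance of the averaged one: {var_averaged:.3f}")
```

Under the independence assumption, the variance of the average is roughly the single-model variance divided by the number of models.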
If overfitting persists, one can try sub-sampling, i.e. reducing the fraction of the data used to fit each tree.
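As a rough sketch of how these ideas map onto the library, the settings below use XGBoost's scikit-learn style wrapper, `XGBRegressor`. The values are hypothetical starting points rather than tuned recommendations, and the helper name `fit_regressor` is made up for this example. It assumes the `xgboost` package is installed when the helper is called.

```python
# Hypothetical starting values, not tuned recommendations.
params = {
    "n_estimators": 200,      # number of boosting rounds (trees)
    "learning_rate": 0.1,     # shrinks each tree's contribution
    "max_depth": 4,           # shallower trees are weaker learners
    "subsample": 0.8,         # fraction of rows sampled for each tree
    "colsample_bytree": 0.8,  # fraction of columns sampled for each tree
    "reg_lambda": 1.0,        # L2 penalty on leaf weights
}


def fit_regressor(X_train, y_train):
    """Fit an XGBoost regressor using the settings above."""
    # Imported here so the settings can be inspected without xgboost installed.
    from xgboost import XGBRegressor

    model = XGBRegressor(**params)
    model.fit(X_train, y_train)
    return model
```

Calling `fit_regressor(X_train, y_train)` returns a fitted model whose `predict` method can then be evaluated on held-out data. Lowering `subsample`, `max_depth` or `learning_rate` generally makes each tree weaker and the ensemble less prone to overfitting, at the cost of needing more rounds. The right values depend on the data and are best chosen by validation.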
I hope this article helps you understand the reasons behind the wide usage and acceptance of XGBoost in the Machine Learning fraternity.