Boost your position on the Kaggle Leaderboard: How?

Sakshi Manga · Published in Analytics Vidhya · 5 min read · Jun 3, 2020

Use XGBoost

Extreme Gradient Boosting (XGBoost) is one of the most effective ensemble machine learning algorithms. Introduced by Tianqi Chen in 2014, it has been a darling of machine learning hackathons and competitions ever since, thanks to its prediction performance and processing speed. It can be used for supervised learning tasks such as regression, classification, and ranking. In this article we will try to understand how it works by breaking it down into:

Boosting | Gradient Boosting | Extreme Gradient Boosting

It all started with Boosting…

Boosting (originally called hypothesis boosting) comes under ensemble learning, which follows the Wisdom of Crowds principle. The general idea of most boosting methods is to train predictors sequentially, each one trying to correct its predecessor.
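As an illustration of this idea, here is a minimal sketch using scikit-learn's AdaBoost, the classic realization of hypothesis boosting. The synthetic dataset and parameter values are assumptions made for the example, not something from the original article.

```python
# A minimal sketch of classic (hypothesis) boosting with scikit-learn's AdaBoost.
# The default weak learner is a depth-1 decision stump; each new stump is trained
# on data re-weighted to emphasize the examples its predecessors got wrong.
# The synthetic dataset and parameter values below are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

booster = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
booster.fit(X, y)
print("training accuracy:", booster.score(X, y))
```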


Gradient Boosting

It is called gradient boosting because it uses the gradient descent algorithm to minimize the loss function when adding new models. However, it suffers from the limitation of building one weak learner (decision tree) at a time, which makes training computationally expensive since it is a sequential process.
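To make the mechanism concrete, here is a minimal hand-rolled sketch of gradient boosting with squared-error loss: each new tree is fit to the residuals, which are (up to a constant factor) the negative gradient of the loss with respect to the current prediction. The dataset and hyper-parameters are illustrative assumptions.

```python
# A minimal sketch of gradient boosting "by hand" with squared-error loss:
# each new tree is fit to the residuals left by the trees built so far,
# i.e. the negative gradient of the loss (up to a constant factor).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=5.0, random_state=0)

learning_rate = 0.1
prediction = np.zeros_like(y, dtype=float)   # start from a constant (zero) prediction
trees = []

for _ in range(100):                         # trees are built one after another
    residuals = y - prediction               # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                   # the new weak learner corrects its predecessors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("training MSE:", np.mean((y - prediction) ** 2))
```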

To overcome this limitation, Extreme Gradient Boosting came into existence, designed with the following features (a minimal usage sketch follows the list).

  1. It supports parallelization: the construction of each boosted tree is parallelized across CPU cores (the boosting itself remains sequential).
  2. It implements distributed computing methods for training on large and complicated datasets.
  3. It supports regularization.
  4. It uses cache-aware optimization.
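Here is the promised minimal usage sketch with the xgboost Python package, showing where parallelism (n_jobs) and regularization (reg_lambda, reg_alpha) plug in. The synthetic dataset and parameter values are illustrative assumptions.

```python
# A minimal sketch of the points above using the xgboost scikit-learn wrapper.
# n_jobs controls parallel tree construction; reg_lambda / reg_alpha add regularization.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=30, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

model = XGBClassifier(
    n_estimators=300,      # number of boosted trees
    max_depth=4,
    learning_rate=0.1,
    n_jobs=-1,             # use all cores for parallel tree construction
    reg_lambda=1.0,        # L2 regularization on leaf weights
    reg_alpha=0.0,         # L1 regularization on leaf weights
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```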

Now, let’s try to comprehend it:

As we know, supervised machine learning models find parameters (θ) that best fit the training data xi and labels yi. To train the model, we need to define an objective function that measures how well the model fits the training data.

The salient characteristic of the objective function of any supervised machine learning algorithm is that it consists of two parts: a loss function and a regularization term.

obj(θ) = L(θ) + Ω(θ)

L(θ) = Loss function: the training loss measures how well the model fits the training data.

Ω(θ) = Regularization term: it measures the complexity of the model.
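As a concrete, illustrative instance of this decomposition, take squared-error training loss together with an L2 penalty of weight λ on the parameters:

```latex
% A worked (illustrative) instance of obj(θ) = L(θ) + Ω(θ):
% squared-error training loss plus an L2 penalty with weight λ on the parameters.
\mathrm{obj}(\theta) =
  \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{L(\theta):\ \text{training loss}}
  \;+\;
  \underbrace{\lambda \lVert \theta \rVert^{2}}_{\Omega(\theta):\ \text{regularization}}
```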

Trees (what are we learning?)

XGBoost primarily uses decision tree ensemble models, which predominantly include classification and regression trees (CART), depending on whether the target variable is categorical or continuous.

Here’s a simple example of a CART that classifies whether someone will like a hypothetical computer game X or not; each leaf holds a single score.

Decision Tree Ensemble:

Most of the time, a single tree is not strong enough to give good results on a dataset, so we build multiple trees and sum their predictions together.

Here is an example:
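In place of the figure from the XGBoost documentation, here is a small illustrative sketch of a two-tree ensemble whose predictions are simply summed; the second tree is fit on the residuals of the first. The data is made up for the example.

```python
# A minimal sketch of a tree ensemble: two small regression trees whose
# predictions are added together. The second tree is fit on the residuals of the
# first, which is exactly how boosted ensembles are built. Data is made up.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
tree2 = DecisionTreeRegressor(max_depth=2).fit(X, y - tree1.predict(X))

# The ensemble prediction is the sum of the individual trees' leaf scores.
ensemble_prediction = tree1.predict(X) + tree2.predict(X)
print("MSE, one tree :", np.mean((y - tree1.predict(X)) ** 2))
print("MSE, two trees:", np.mean((y - ensemble_prediction) ** 2))
```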

Mathematically, the general tree ensemble model can be represented as

ŷi = f1(xi) + f2(xi) + … + fK(xi), fk ∈ F,

where K is the number of trees and F is the space of all possible CARTs. Since we are working on a supervised problem, our objective function can be written as

obj(θ) = Σi l(yi, ŷi) + Σk Ω(fk),

i.e. the training loss summed over the data points plus the complexity of each tree.

Possible ways to define Ω?

  1. Number of nodes in the tree or Depth
  2. L2 norm of the leaf weights

This defines what to learn (the model and the objective). But now comes something very important: how to learn?

Solution: Additive Training (Boosting)

It’s very simple to understand and works just as its name suggests: start from a constant prediction and keep adding one new function at a time.
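Written out in the notation of the XGBoost documentation this article draws from, the additive strategy looks like:

```latex
% Additive training: start from a constant prediction (zero) and add one new function per round.
\hat{y}_i^{(0)} = 0, \qquad
\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)
```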

With this, a new question arrives: how do we decide which f to add? The answer is simple: just optimize the objective!

For optimization, if we consider mean squared error (MSE) as our loss function, the objective becomes:
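Reconstructed from the XGBoost documentation this article is based on, the step-t objective with squared-error loss works out to:

```latex
% Step-t objective with MSE loss; expanding the square separates terms that are
% constant with respect to the new tree f_t from the terms that actually matter.
\mathrm{obj}^{(t)}
  = \sum_{i=1}^{n} \left( y_i - \left( \hat{y}_i^{(t-1)} + f_t(x_i) \right) \right)^{2}
    + \Omega(f_t) + \mathrm{constant}
  = \sum_{i=1}^{n} \left[ 2\left( \hat{y}_i^{(t-1)} - y_i \right) f_t(x_i) + f_t(x_i)^{2} \right]
    + \Omega(f_t) + \mathrm{constant}
```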

Extreme Gradient Boosting uses a second-order Taylor expansion of the loss for both regression and classification. The genius of the Taylor approximation is that it splits the objective into relatively simple parts.

The Taylor expansion of the objective function is:
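Reconstructed from the same documentation, the second-order expansion reads:

```latex
% Second-order Taylor expansion of the loss around the previous prediction,
% where g_i and h_i are the first- and second-order derivatives of the loss.
\mathrm{obj}^{(t)} \simeq \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
    + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t) + \mathrm{constant}

g_i = \partial_{\hat{y}_i^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right), \qquad
h_i = \partial^{2}_{\hat{y}_i^{(t-1)}}\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
```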

After we remove all the constants, the specific objective function at step t becomes:
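That is, again in the documentation's notation:

```latex
% The step-t objective once the terms that do not involve the new tree f_t are dropped:
\sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^{2}(x_i) \right] + \Omega(f_t)
```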

This becomes our optimization goal for the new tree. One important advantage of this definition is that the value of the objective function only depends on gi and hi.

The discussion above gives us a conceptual idea of how XGBoost is fit to training data and how it makes predictions.

Now, let’s consider another important part: optimization.

Now, without diving deep into the maths of it, the regularization term for XGBoost can be written as

Ω(f) = γT + (1/2) λ Σj wj²,

where T is the number of leaves in the tree and wj is the score on leaf j.

The term Ω penalizes the complexity of the model. The additional term helps to smooth the final learnt weights to avoid over-fitting; when Ω = 0, the objective falls back to traditional gradient tree boosting.

In practice it is impossible to enumerate all possible tree structures, so a greedy approach is employed: the tree is grown one split at a time, and at each step the candidate split with the best gain is chosen.
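For reference, in the standard XGBoost derivation the greedy algorithm scores each candidate split by its gain, where G and H denote the sums of gi and hi over the instances falling into the left (L) and right (R) children:

```latex
% Gain from splitting a leaf into left (L) and right (R) children; λ is the L2
% penalty on leaf weights and γ the cost of adding one more leaf.
\mathrm{Gain} = \frac{1}{2}\left[ \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda} \right] - \gamma
```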


Ah! We have come a long way, so now let’s finally summarize XGBoost.

  • Add a new tree in each iteration.
  • At the beginning of each iteration, calculate gi and hi for every training instance.
  • Use the greedy algorithm to grow a tree that optimizes the objective.
  • Add the tree to the model: ŷi(t) = ŷi(t−1) + ε ft(xi), where ε is a step size (shrinkage) that helps prevent over-fitting.
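To connect these symbols back to the library, here is a hedged sketch of how they map onto parameters of xgboost's native training API (eta corresponds to ε, lambda to λ, gamma to γ). The data and values are illustrative.

```python
# A minimal sketch mapping the math above to xgboost's native API.
# eta is the shrinkage ε, lambda the L2 penalty λ on leaf weights, and gamma the
# per-leaf penalty γ used in the split gain. Data and values are illustrative.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=500)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "reg:squarederror",
    "eta": 0.1,        # step size / shrinkage (ε)
    "lambda": 1.0,     # L2 regularization on leaf weights (λ)
    "gamma": 0.1,      # minimum loss reduction (γ) required to make a split
    "max_depth": 4,
}
booster = xgb.train(params, dtrain, num_boost_round=100)
print(booster.predict(dtrain)[:5])
```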

That’s all about my take on XGBoost. If you have any suggestions to improve it or would like to add something to it, feel free to drop me a message at my Twitter handle @DSsakshi.

Disclaimer:

This article is heavily inspired by the official XGBoost documentation: https://xgboost.readthedocs.io/en/latest/


Sakshi Manga

I am a passionate Data Scientist with a strong interest in developing and maintaining ML/DL models. Teaching DS in my free time is my favorite avocation.