XGBoost Math Intuition Summary
A complete explanation for XGBoost - Most popular Gradient Boosting Model
XGBoost is derived from Gradient Boosting Model(GBM), compared with GBM, XGBoost introduces a different way to train the ensemble weaker leaner, so let’s start from here.
Suppose we have the below K trees boosting mode
Where fₖ is k-th tree in the ensemble, xᵢ is i-th data point
Similarly, the prediction at the t-th step can be defined as
Training Objective
Below is the object function to train the t-th weak learner, let’s call it step t-th object function
L is loss function(ex. MSE for regression, Cross-entropy for classification)
Ω( fᵢ) is the i-th weak learner’s regularization term
Taylor Expansion on Training Objective
In the above step t-th object function, previous t-1 prediction function y head can be think as a variable and t-th weak leaner fₜ(x) is the delta change, so XGBoost uses second order Taylor Expansion to approximate the step t-th object function
In the above equation, at current t-th step, t-1 step prediction y head and all before t regularization are known values, so they are constant values in t-th step object function. Remove the constant terms(because they not impact object function optimization), we get
Some Tree Mapping Function Definition
j represents a tree j-th leaf (assume all leaves of a tree are indexed from 1)
Iⱼ represents a set contains all data instances which locate in tree’s j-th leaf
q(xᵢ) represents a function which maps data instance xᵢ into a tree leaf and return the leaf index
wᵢ represents i-th leaf score(weight or output)
So f(x) actually represents a tree output for instance x
Objective Function Rewrite With Regularization
T is the number of leaves in the t-th weak leaner tree fₜ(x)
γ, λ are regularization hyper-parameters
Optimize Objective Function
Now the t-th step object function is a function of (wᵢ), gᵢ and hᵢ values are known because they only related to loss function and step t-1 prediction values.
So we can use below equation to get best wᵢ to minimize object function
The optimal w is
And the corresponding minimal object value is
Weak Learner Tree Splitting Criteria
So far, we got the t-th step object function, next step is to build the t-th tree, and this tree should be constructed to reduce object function value as much as possible.
In order to build a tree to reduce object function value, we only allow node split which can reduce object function value, and looking for a best split which can reduce the most.
So in each split we measure the object function value reduce by Tree Object function value(After Node Split) — (Before Node Split)
Gain is how much object function value reduced in the split.
Iₗ is left splitting child leaf
Iᵣ is right splitting leaf
I is parent leaf
For simplicity, each leaf can calculate its Similarity Score
Splitting gain can be expressed as
Left(Similarity Score)+ Right(Similarity Score) - Parent(Similarity Score)
Simplified Summary For Regression and Classification
We know calculating tree node similarity and tree leaf output wᵢ will base on the chosen loss function, because gᵢ and hᵢ are 1-order and 2-order derivatives from loss function.
StatQuest with Josh Starmer gives a a good simplified summary for quick reference.
- Similarity Score is applied for every node in the tree
- Output Value normally is for leaf node output(wᵢ)
For regression, the chosen loss function is based on MSE
For classification, the loss function is a custom logloss function
Where pᵢ = Sigmoid(yᵢ) The output class probability is calculated via logistic transformation on tree output prediction yᵢ
XGBoost Training Features
- When searching for best feature value for node split, XGBoost provides an option to search on the feature value’s quantiles or histogram instead of try all the feature values to split node.
- When building feature histogram, XGBoost may split feature data into multiple computers to calculate histogram, then merge back to generate a aggregate histogram, this like Hadoop Map-reduce operation, and the generated histogram will be cached for next split.
- XGBoost can automatically handle missing values in feature. In tree node split step, XGBoost will either assign all missing value instances to left or right child, depend on which side has larger gain.
- XGBoost provide lots hyper-parameters to deal with overfitting
XGBoost Example with Python
Install Library
pip install xgboost
Example
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_splitboston = load_boston()
data = pd.DataFrame(boston.data)
data['PRICE'] = boston.targetX, y = data.iloc[:,:-1], data.iloc[:,-1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=22)xg_reg = xgb.XGBRegressor(
n_estimators = 50,
max_depth = 4,
learning_rate = 0.3)xg_reg.fit(X_train,y_train)preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))#RMSE: 3.509045
Plot the last tree (total 50 trees) in ensemble
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz/bin'
xgb.plot_tree(xg_reg,num_trees=49)
plt.rcParams['figure.figsize'] = [50, 10]
plt.show()
Plot feature importance
feature_imp= xg_reg.get_booster().get_score(importance_type= 'gain')
sorted_feature_imp = np.array(sorted(feature_imp.items(), key=lambda kv: int(kv[1]), reverse=False))
plt.barh(
boston.feature_names[sorted_feature_imp[:,0].astype(int)],
sorted_feature_imp[:,1].astype(float))
plt.xlabel("Feature Importance")
The importance_type can be
- weight : The number of times a feature is used to split the data across all trees.
- gain : The average gain across all splits the feature is used in.
- cover : The average coverage across all splits the feature is used in.
- total_gain: The total gain across all splits the feature is used in.
- total_cover: The total coverage across all splits the feature is used in.
REFERENCE
- XGBoost: A Salable Tree Boosting System
- Kaggle Winning Solution Xgboost Algorithm — Learn from Its Author, Tong He
- XGBoost Documentation
- StatQuest with Josh Starmer XGBoost Part 1: Regression
- StatQuest with Josh Starmer XGBoost Part 2: Classification
- StatQuest with Josh Starmer XGBoost Part 3: Mathematical Details