Tree & Boosting Methods in Machine Learning

Soumallya Bishayee
11 min read · Mar 19, 2023


Tree-based methods are widely used because their decision-making structure often gives the model more accuracy. They are supervised machine learning models, trained on a labelled dataset of inputs and outputs. There are three main tree-based models:

  1. Decision tree
  2. Random Forest
  3. Boosted trees

Decision tree: It is mostly used in classification problems, where we have to choose a particular class, because of its decision-making structure, but it can also be used in regression problems, where the target is a numerical value. A tree consists of a root node, decision nodes, and leaf (terminal) nodes.

Advantages:

  1. Any decision process can be represented as a tree, which helps us make decisions in a transparent way.
  2. The logic behind the tree is easy to understand.
  3. Less time is required for data cleaning.

Disadvantages: The tree may grow unwanted branches that overfit the training data; these have to be pruned, for example by limiting the max depth or by using Gini impurity as a splitting threshold.

Structure:

The root node is the point from which the decision tree starts; it is also called the parent node. Decision nodes are children of the root node (or of other decision nodes), where further splits are made.

A leaf (terminal) node holds the final output and is not split any further.

A branch/sub-tree is a tree formed by cutting a part of the parent tree.

Pruning is the process of cutting the unwanted branches of the tree.

Working method: In the example above, the model first asks for the value of X. If X is less than 1, it checks a further condition; if that condition holds, the prediction is Y = 1.5, and if not, it looks at the value of X again. If X is still less than 2, the prediction is Y = 2.2, and if that last condition also fails, the prediction is Y = 1.5.
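To make the branching concrete, here is a minimal sketch of such a tree written as plain Python conditionals. It is simplified to a single feature x, and the thresholds and leaf values simply mirror the example above, so treat them as illustrative:

def predict(x):
    # root node: first split on x
    if x < 1:
        return 1.5   # leaf: prediction for x < 1
    # decision node: second split on x
    elif x < 2:
        return 2.2   # leaf: prediction for 1 <= x < 2
    else:
        return 1.5   # leaf: prediction for x >= 2

print(predict(0.5), predict(1.5), predict(3.0))   # -> 1.5 2.2 1.5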

We find the best attribute of the dataset to split on using an attribute selection measure, typically one of these two methods: 1. information gain, 2. Gini impurity.

Information gain measures the change in entropy after a split: it tells us how much information a feature provides about the class. Based on this information we decide how to split the node.

Information Gain = Entropy(S) − Σ (weighted average) × Entropy(each subset after the split)

Here entropy measures the impurity of a set S, and is calculated as:

Entropy(S) = −P(yes)·log₂P(yes) − P(no)·log₂P(no), where S is the total set of samples.

The other method, Gini impurity, also measures impurity when making a split in tree-based methods, and is used as a splitting criterion:

G = Σc Pc(1 − Pc), where Pc is the probability of a sample belonging to class c.

We can calculate the Gini impurity G of any split from the data.

If G = 0, the node contains only a single class, i.e. it is perfectly pure.
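As a quick illustration of both formulas, here is a minimal sketch (plain NumPy, no scikit-learn) that computes the entropy and Gini impurity of an array of labels:

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_c P(c) * log2 P(c)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # G = sum_c P(c) * (1 - P(c))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

labels = np.array(['yes', 'yes', 'no', 'no'])   # a perfectly mixed node
print(entropy(labels))                 # 1.0 (maximum impurity for two classes)
print(gini(labels))                    # 0.5
print(gini(np.array(['yes', 'yes'])))  # 0.0 -> a pure node, G = 0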

Steps of decision trees: To construct a tree, we first decide which feature will be used as the root node. Each node is then split on the feature whose split gives the lowest Gini impurity (equivalently, the highest information gain) on the training data. We calculate the G value for each candidate split and repeat the process down the tree until no split lowers the impurity any further. For multi-categorical features, we evaluate all possible groupings of the categories.

Precautions against overfitting: 1. Gini impurity helps with pruning, which decreases the chance of overfitting. 2. We can apply a grid search to minimize the error and find the values of max depth and max leaf nodes at which the error is minimized.
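A minimal grid-search sketch over those two hyperparameters; the synthetic data and the parameter ranges are illustrative assumptions, not recommendations:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data just for the sketch
X, y = make_classification(n_samples=400, n_features=4, random_state=0)

# Illustrative ranges; tune them for your own dataset
param_grid = {'max_depth': [2, 3, 5, 10],
              'max_leaf_nodes': [5, 10, 20, None]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)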

Now let us build the model in Python.

# importing the libraries for matrix calculation and plotting
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing the dataset
data_set= pd.read_csv('user_data.csv')

#Extracting independent and dependent variables
x= data_set.iloc[:, [2,3]].values
y= data_set.iloc[:, 4].values

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.25, random_state=0)

#feature scaling
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)
#Fitting a decision tree classifier to the training set
from sklearn.tree import DecisionTreeClassifier
classifier= DecisionTreeClassifier(criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)
y_pred= classifier.predict(x_test)
#Importing the error-metric libraries
# (plot_confusion_matrix was removed from newer scikit-learn; use ConfusionMatrixDisplay)
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
print(classification_report(y_test, y_pred))
ConfusionMatrixDisplay.from_estimator(classifier, x_test, y_test)
#Feature importances, labelled with the column names used above
feature_names = list(data_set.columns[[2,3]])
pd.DataFrame(index=feature_names, data=classifier.feature_importances_, columns=['feature_importance'])
#Visualising the tree
from sklearn.tree import plot_tree
plt.figure(figsize=(12,8), dpi=300)
plot_tree(classifier, feature_names=feature_names, filled=True);

2. Random Forest: It is based on ensemble learning, which is the process of combining two or more decision trees. Instead of relying on one tree, it takes the predictions of many trees and makes the final decision based on the majority prediction.


Advantages: 1. It takes less training time. 2. It achieves high accuracy, even on large datasets. 3. It deals well with missing values.

Disadvantages: It is mainly suited to classification tasks; it is less suitable for regression, since its predictions are averages of training targets and cannot extrapolate beyond their range.

Working Method:

Step 1: First we divide the dataset into training and test sets.

Step 2: We select K random data points from the training set.

Step 3: We build n trees with the help of these K data points.

Then we repeat steps 2 & 3 until the accuracy score is maximized.

Last step: For a new data point, the model collects the prediction of each decision tree and assigns the point to the category that wins the majority vote.

Hyperparameters of a random forest are tuned to reduce the correlation among the decision trees:

n_estimators = number of trees to be used (generally 64 < n < 100)

max_features = how many features to include in each subset (generally n/3 for regression or √n for classification, where n is the total number of features)

The ensemble helps us reduce both the bias and the variance error.
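A minimal sketch of these two knobs in scikit-learn; the synthetic data and the chosen values are just examples:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

# n_estimators = number of trees; max_features = features tried at each split
rf = RandomForestClassifier(n_estimators=100,    # number of trees
                            max_features='sqrt', # ~sqrt(n) features per split
                            random_state=0)
rf.fit(X, y)
print(rf.score(X, y))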

https://www.kaggle.com/code/paulrohan2020/implementation-of-bootstrap-in-random-forest/notebook

Bootstrap: It resamples the original dataset with replacement many times to create simulated datasets, and each tree in the forest is trained on one of these resampled datasets. By allowing bootstrapping we tell the forest to draw samples from the same pool of data again and again, so that we can estimate how much accuracy we have over the whole dataset.

Bagging: Bagging (bootstrap aggregating) applies bootstrapping to model training. It reduces the high variance of the decision trees in a forest: if an error occurs at an upper level of a tree, the entire subtree below that level can come out differently, because tree methods are sensitive to the training data. By bootstrapping we create B variants X1, X2, …, XB of the training set X, drawing samples uniformly with replacement. For each training set variant a separate tree is constructed, and the output is the majority output of the sub-classifier trees.
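A minimal bagging sketch with scikit-learn's BaggingClassifier, using decision trees as the base learners; the synthetic data and the number of trees are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# B = 50 bootstrapped variants of the training set, one tree each;
# the final prediction is the majority vote of the 50 trees
# (the keyword is `base_estimator` in scikit-learn < 1.2)
bag = BaggingClassifier(estimator=DecisionTreeClassifier(),
                        n_estimators=50, bootstrap=True, random_state=0)
bag.fit(X, y)
print(bag.score(X, y))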


Out-of-Bag Error: With bootstrapping, multiple samples are drawn from the original sample X to create new training sets, and multiple decision trees are constructed from them to form the random forest, which averages the outputs of all trees to predict the result. Because of this random sampling, not every sample is used by every tree; the samples a tree never sees are called its out-of-bag (OOB) samples. Evaluating each tree on its own OOB samples and averaging the results gives the out-of-bag error, an estimate of test error that requires no separate validation set.

OOB Score: The fraction of OOB samples that are correctly predicted is called the OOB score. Though it is a convenient accuracy estimate, computing it is time-consuming and not ideal for very large datasets.
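A minimal sketch of the OOB score in scikit-learn (synthetic data; note that oob_score=True requires bootstrap=True):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Each tree is scored on the samples its bootstrap draw left out
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            oob_score=True, random_state=0)
rf.fit(X, y)
print(rf.oob_score_)   # accuracy estimated from the out-of-bag samples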

Now let us implement a random forest in Python.

#.....Importing the required libraries
import pandas as pd, numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#.....importing the dataset
df= pd.read_csv('disease_data.csv')

#.....Extracting independent and dependent variables
X= df.drop('heart disease', axis=1)
y= df['heart disease']
#..... now let's split the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=42)
X_train.shape, X_test.shape
#.....Fitting a random forest classifier to the training set
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42, n_jobs=-1)
params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100, 200],
    'n_estimators': [10, 25, 30, 50, 100, 200]
}
from sklearn.model_selection import GridSearchCV
#..... Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv=4,
                           n_jobs=-1, verbose=1, scoring="accuracy")
grid_search.fit(X_train, y_train)
grid_search.best_score_
rf_best = grid_search.best_estimator_
rf_best
#.....Visualising one tree of the tuned forest
from sklearn.tree import plot_tree
plt.figure(figsize=(80,40))
plot_tree(rf_best.estimators_[5], feature_names=X.columns,
          class_names=['Disease', 'No Disease'], filled=True)
#.....Feature importances of the tuned forest
rf_best.feature_importances_
imp_df = pd.DataFrame({"Varname": X_train.columns, "Imp": rf_best.feature_importances_})
imp_df.sort_values(by="Imp", ascending=False)

3. Boosting Methods: Boosting is an ensemble learning process that transforms a set of weak learners into a strong learner in order to minimize the model error. A model is trained on randomly chosen training data, and each subsequent model tries to compensate for the weaknesses of its predecessor: at every iteration, the weak rules of the individual classifiers are combined into a new, stronger classifier. This is the main fundamental principle. In ensemble learning we have already discussed the random forest method, which is based on bagging (suited to low-bias, high-variance base learners) and is used to avoid overfitting. Boosting (built on high-bias, low-variance weak learners) is also ensemble learning, but it is more prone to overfitting. There are several types of boosting methods, including:

  1. Adaboost (Adaptive Boosting)
  2. Gradient Boost
  3. XGBoost

Adaboost: It combines multiple weak learners into a single strong learner.

https://www.geeksforgeeks.org/boosting-in-machine-learning-boosting-and-adaboost/amp/

B1 contains 10 points: 2 plus signs in the blue region and 8 points (3 plus and 5 minus) in the red region. Our goal is to separate plus from minus using blue 🔵 and red 🔴.

In the next iteration, B2 sees 10 points: 8 points (5 plus and 3 minus) in blue and 2 minus signs in red. In the next step B3, there are 3 plus signs in the blue region and 7 points (2 plus and 5 minus) in the red region. Finally (B4 = B1 + B2 + B3), the model combines the values of all previous iterations into an average. Here we can see the model has corrected itself, with the 5 plus signs in the blue region and the 5 minus signs in the red region. This is how the model reaches its highest accuracy.
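The mechanism behind those iterations is a re-weighting of the training points: misclassified points get heavier weights so the next weak learner focuses on them. A minimal sketch of the classic AdaBoost update in plain NumPy, with labels in {-1, +1} and made-up predictions for illustration:

import numpy as np

y_true = np.array([1, 1, -1, -1, 1])       # true labels
y_pred = np.array([1, -1, -1, -1, -1])     # a weak learner's (imperfect) output
w = np.full(len(y_true), 1 / len(y_true))  # start with uniform weights

err = np.sum(w[y_true != y_pred])          # weighted error of this learner
alpha = 0.5 * np.log((1 - err) / err)      # the learner's say in the final vote
w = w * np.exp(-alpha * y_true * y_pred)   # up-weight the mistakes
w = w / w.sum()                            # renormalize
print(alpha, w)   # misclassified points now carry larger weights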

Gradient Boosting: It can work on very large datasets. It combines the concepts of gradient descent and boosting. In the AdaBoost method the weights of the data points are changed, but gradient boosting instead fits each new model to the residual errors of the previous iteration. The key idea is to minimize the error of the next iteration: if a small change in the prediction for a case produces a significant drop in error, the next prediction moves in that direction; if it does not reduce the error, the prediction stays where it is.
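A minimal sketch of that residual-fitting idea, done by hand with two shallow regression trees on synthetic data; a real library adds a learning rate and many more stages:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# Stage 1: fit a weak learner to the targets
t1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residual = y - t1.predict(X)            # what the first tree got wrong

# Stage 2: fit the next weak learner to those residuals
t2 = DecisionTreeRegressor(max_depth=2).fit(X, residual)

# Combined prediction = first tree + correction from the second
pred = t1.predict(X) + t2.predict(X)
print(np.mean((y - t1.predict(X))**2), np.mean((y - pred)**2))  # error drops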

Tips:

  1. We should always check the accuracy on both the train & test datasets, because of the chance of overfitting.
  2. We should try cross-validation & grid search to select the best parameters and minimize the model error.

# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

#.....importing the dataset
df= pd.read_csv('disease_data.csv')

#.....Extracting independent and dependent variables
X= df.drop('heart disease', axis=1)
y= df['heart disease']
#.....Exploring and cleaning the data
df.head()
df.info()
df.isnull().sum()
df=df.dropna()   ## drop rows with null values (or impute them with df.fillna(...))
df.info()
df.head()
#.....Plotting the spread of each feature (e.g. its standard deviation)
feature=df.describe().transpose().reset_index().sort_values('std')
plt.figure(figsize=(10,6), dpi=200)
sns.barplot(data=feature, x='index', y='std')
plt.xticks(rotation=90);
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='viridis')
X=pd.get_dummies(X, drop_first=True)
#..... now let's split the data into train and test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.15, random_state=101)

#Trying AdaBoostClassifier with n_estimators=1

from sklearn.ensemble import AdaBoostClassifier
model=AdaBoostClassifier(n_estimators=1)
model.fit(X_train, y_train)
preds=model.predict(X_test)
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)
model.feature_importances_
model.feature_importances_.argmax()
X.columns[model.feature_importances_.argmax()]   # the single most important feature
# sns.barplot(data=df, x=..., hue=...)   # column names elided in the original
len(X.columns)

#AdaBoostClassifier over a range of n to check the best value of n

from sklearn.metrics import accuracy_score
error_rates=[]
for n in range(1,96):
    model=AdaBoostClassifier(n_estimators=n)
    model.fit(X_train, y_train)
    preds=model.predict(X_test)
    err=1-accuracy_score(y_test, preds)
    error_rates.append(err)
plt.plot(range(1,96), error_rates)
feats=pd.DataFrame(index=X.columns, data=model.feature_importances_, columns=['Importance'])
imp_feats=feats[feats['Importance']>0]
sns.barplot(data=imp_feats.sort_values('Importance').reset_index(), x='index', y='Importance')

#GradientBoostingClassifier with grid search to check the best values of n, max_depth and learning_rate

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
param_grid={'n_estimators':[50,100], 'learning_rate':[0.1,0.05,0.2], 'max_depth':[3,4,5]}
gb_model=GradientBoostingClassifier()
grid=GridSearchCV(gb_model, param_grid)
grid.fit(X_train, y_train)
from sklearn.metrics import classification_report, ConfusionMatrixDisplay
preds=grid.predict(X_test)
grid.best_estimator_
grid.best_params_
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_estimator(grid, X_test, y_test)
imp_feats=grid.best_estimator_.feature_importances_
grid.best_estimator_.feature_importances_.argmax()
imp_feats=pd.DataFrame(index=X.columns, data=imp_feats, columns=['Importance'])
imp_feats=imp_feats[imp_feats['Importance']>0].sort_values('Importance')
sns.barplot(data=imp_feats.reset_index(), x='index', y='Importance')
plt.xticks(rotation=90)
len(X.columns)

XGBoost: It stands for extreme gradient boosting. Here decision trees are created sequentially, one by one, and weights are assigned to the training instances. Instances that a tree predicts wrongly are given higher weight and fed into the next decision tree. These individual classifiers are then combined into a single strong model that predicts the output with minimal error.

Advantages:

  1. It can work on large datasets.
  2. It handles missing values very well.
  3. It is able to resist overfitting, thanks to built-in regularization.
  4. It can do both regression & classification tasks.

XGBoost's ability to outperform others is due to:

  1. Cache awareness & out-of-core computing: The algorithm is designed to use hardware resources efficiently. Gradient statistics are stored in an internal buffer in each thread, and when the dataset is too large to fit into memory, out-of-core computing makes optimal use of the available disk space.
  2. Tree pruning: The algorithm grows trees depth-first up to the max depth parameter and then prunes branches backward, so we do not lose information to premature stopping.
  3. Parallelized tree building: XGBoost parallelizes the otherwise sequential tree-building process. When building the base learner, the outer loop enumerates the leaf nodes of the tree while the inner loop calculates the features, and the outer loop cannot proceed until the inner loop finishes. To improve computation speed, the order of the loops is interchanged using an initialization pass with a global scan, performing the cheaper loops first. This increases computation speed.
  4. Regularization: It penalizes the loss function with a regularization penalty term (both L1 & L2) to prevent overfitting.
  5. Built-in cross-validation: We do not need to wire up cross-validation manually, because XGBoost comes with a built-in cross-validation method (see the sketch below).
  6. Dealing with missing values: It accepts sparse features as input and learns the best direction for missing values from the training loss.
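To make points 4 and 5 concrete, here is a minimal sketch of xgboost's built-in cross-validation routine xgb.cv, with the L1/L2 penalty terms reg_alpha and reg_lambda set explicitly. The synthetic data and parameter values are illustrative assumptions, not a tuned configuration:

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 300)

dtrain = xgb.DMatrix(X, label=y)
params = {'max_depth': 3, 'eta': 0.1,
          'reg_alpha': 0.1,    # L1 penalty on leaf weights
          'reg_lambda': 1.0}   # L2 penalty on leaf weights
# Built-in cross-validation: no manual fold handling needed
cv_results = xgb.cv(params, dtrain, num_boost_round=100,
                    nfold=5, metrics='rmse', early_stopping_rounds=10)
print(cv_results.tail())
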
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#.....importing the dataset
df= pd.read_csv('train.csv')
#.....Extracting independent and dependent variables
# (keep the numeric columns only; the target is 'SalePrice')
X = df.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])
y = df['SalePrice']
df.head()
df.info()
#..... now let's split the data into train and test
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.25)
#..... imputing missing values
# (Imputer was replaced by SimpleImputer in newer scikit-learn)
from sklearn.impute import SimpleImputer
my_imputer = SimpleImputer()
train_X = my_imputer.fit_transform(train_X)
test_X = my_imputer.transform(test_X)
from xgboost import XGBRegressor
my_model = XGBRegressor()
# verbose=False avoids printing an update with each boosting round
my_model.fit(train_X, train_y, verbose=False)
# make predictions
predictions = my_model.predict(test_X)
#..... now let's import the error metric
from sklearn.metrics import mean_absolute_error
print("Mean Absolute Error : " + str(mean_absolute_error(test_y, predictions)))
#..... model tuning with n_estimators & early stopping
# (in xgboost >= 2.0, pass early_stopping_rounds to the constructor instead of fit)
my_model = XGBRegressor(n_estimators=1000)
my_model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X, test_y)], verbose=False)
#..... model tuning with n_estimators & learning_rate
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
my_model.fit(train_X, train_y, early_stopping_rounds=5, eval_set=[(test_X, test_y)], verbose=False)


Soumallya Bishayee

Data scientist, enthusiastic about & interested in the world of ML, DL & AI