Machine Learning and AI have been growing, expanding fields of interest and work for several years, attracting many undergraduates and working professionals eager to join and explore. So, if you are a beginner who needs help with Decision Trees and their family of ensemble methods, this story is for you.
Introduction:
The aim is to correctly classify the types of glass based on the amounts of elements (like Ca, Mg, etc.) they contain and their refractive index.
As you can see above, we have 214 rows and 10 columns. The first nine columns are the features (independent variables), and the last column, ‘Type’, is our target variable: it tells us the kind (class) of the glass.
We have six classes in our data set. It can also be seen that there is a high class imbalance, i.e. the number of examples per class is not the same. This could cost our model some accuracy, as the model might become biased towards the majority classes.
In cases where we have more than two classes, it is better to use a classifier with a non-linear decision boundary so that we can make more accurate predictions. Some examples of non-linear classification algorithms are kernelized Support Vector Machines, Decision Trees, and even a Logistic Regression model with polynomial features.
After some data preprocessing and a train-test split, we will create our models. We will use the same training and test sets for each of the models.
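As a sketch of this step, here is a stratified train-test split. The real article would load the glass CSV (e.g. `pd.read_csv("glass.csv")`); since that file isn't shown, this sketch builds a synthetic stand-in with the same shape and the UCI glass column names, which are an assumption here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the real CSV (an assumption): 214 rows, the nine
# feature columns of the UCI Glass data set, and six imbalanced classes.
features = ["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]
X = pd.DataFrame(rng.normal(size=(214, 9)), columns=features)
y = pd.Series(rng.choice([1, 2, 3, 5, 6, 7], size=214,
                         p=[0.33, 0.35, 0.08, 0.06, 0.04, 0.14]),
              name="Type")

# stratify=y keeps the class proportions similar in train and test,
# which matters when the classes are this imbalanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

print(X_train.shape, X_test.shape)
```

Stratifying the split is a small safeguard against the class imbalance noted above: without it, a rare class could end up almost entirely in one side of the split.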
Part I: Decision Trees
Decision Trees, as described in the CS-229 class, are a greedy, top-down, recursive partitioning algorithm. Their advantages are non-linearity, support for categorical variables, and high interpretability.
We use scikit-learn to create our model. The parameter values used were found through a comprehensive grid-search cross-validation.
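A minimal sketch of this step, pairing `DecisionTreeClassifier` with `GridSearchCV`. The parameter grid and the stand-in data are illustrative assumptions; the article's actual tuned values are not shown.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data with the glass data set's shape (214 samples, 9 features,
# 6 classes); the article would use the real features and labels here.
X, y = make_classification(n_samples=214, n_features=9, n_informative=6,
                           n_classes=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Illustrative grid: depth and split-size limits control over-fitting,
# and the criterion picks the impurity measure used at each split.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5, 10, None],
                "min_samples_split": [2, 5, 10],
                "criterion": ["gini", "entropy"]},
    cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)
print("test accuracy:", grid.score(X_test, y_test))
```

`GridSearchCV` refits the best parameter combination on the full training set, so `grid` can be used directly as the final model.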
Part II: Random Forest
Random Forest is an example of Ensemble Learning.
Ensemble Learning, as described by the SuperDataScience Team, is when you take multiple machine learning algorithms and put them together to create a bigger algorithm, in such a way that the resulting algorithm leverages the ones used to create it. In the case of Random Forest, many Decision Trees are combined.
This is how a Random Forest works:
STEP 1: Pick ‘N’ data points at random (with replacement) from the training set.
STEP 2: Build a Decision Tree on those ‘N’ data points.
STEP 3: Choose the number of trees you want to build and repeat steps 1 and 2.
STEP 4: For a new data point, have each of the trees predict the category the point belongs to, and assign the new data point to the category that wins the majority vote.
We use scikit-learn to create our Random Forest model. The parameter values were found through a comprehensive grid-search cross-validation to boost accuracy.
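The steps above can be sketched as follows with `RandomForestClassifier`. The parameter values and stand-in data are illustrative assumptions, not the article's tuned configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data with the glass data set's shape.
X, y = make_classification(n_samples=214, n_features=9, n_informative=6,
                           n_classes=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# n_estimators is the number of trees (step 3); each tree is fit on a
# bootstrap sample (steps 1-2). max_features="sqrt" makes each split
# consider only a random subset of features, which decorrelates the trees.
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            random_state=42)
rf.fit(X_train, y_train)

# predict() implements step 4: a majority vote over all the trees.
print("test accuracy:", rf.score(X_test, y_test))
```

Note that `bootstrap=True` is the default, so the sampling-with-replacement of step 1 happens automatically inside `fit`.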
As you can see, the Random Forest has a score of 81.5%, compared to the Decision Tree’s 72.2%. This improvement in accuracy comes from several factors: bootstrap aggregation, and the fact that only a random subset of features is considered at each split, which decreases the correlation between trees and reduces over-fitting.
Random Forests also tend to be robust to missing values.
Part III: XGBoost
XGBoost stands for Extreme Gradient Boosting and is another example of Ensemble Learning. We take the derivative of the loss and perform gradient descent. As described in the CS-229 class, in Gradient Boosting we compute the gradient of the loss at each training point with respect to the current predictor.
In Boosting, we often turn Decision Trees into weak learners by allowing each tree to make only one decision before predicting; such a tree is called a Decision Stump.
This is how Boosting works:
STEP 1: Start with the data set and train a single decision stump.
STEP 2: Track which examples the classifier got wrong, and increase their relative weight compared to the correctly classified examples.
STEP 3: Train a new decision stump, which is now incentivized to correctly classify these ‘hard negatives’; then repeat these steps.
Two well-known boosting approaches for Decision Trees are AdaBoost and Gradient Boosting; XGBoost is a popular implementation of the latter.
We use the XGBoost library. The parameter values were found through a comprehensive grid-search cross-validation to boost accuracy.
XGBoost even outperformed the Random Forest, with a score of 83.34%. This works because the final model is a weighted sum of all the weak learners, with each new learner trained to correct the mistakes of the ones before it.
Part IV: Comparison with a Neural Network
XGBoost and Random Forest are two of the most powerful classification algorithms. XGBoost has generated a lot of buzz on Kaggle and is a favorite of data scientists for classification problems. Although a little computationally expensive, they both make up for it with their accuracy, and they run smoothly on GPU-powered machines.
I used a neural network on this same data set: scikit-learn’s Multi-Layer Perceptron classifier.
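A minimal sketch of this comparison with `MLPClassifier`. The layer sizes follow the 50–100-node range mentioned below; the other settings and the stand-in data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Stand-in data with the glass data set's shape.
X, y = make_classification(n_samples=214, n_features=9, n_informative=6,
                           n_classes=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Two hidden layers; max_iter is raised because the default (200) often
# fails to converge on a data set this small.
mlp = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=2000,
                    random_state=42)
mlp.fit(X_train, y_train)

print("test accuracy:", mlp.score(X_test, y_test))
```

In practice the MLP also benefits from feature scaling (e.g. `StandardScaler` in a `Pipeline`), which tree-based models do not need.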
You will notice that the MLP underperformed compared to Random Forest and XGBoost. Even with 50–100 nodes in its hidden layers and a time-consuming grid search, we only reached an accuracy of 77.78%. This could be due to two factors: the small size of the data set, and the high class imbalance.
Conclusion:
Even on a small data set with high class imbalance, Random Forest and XGBoost were able to do well. With more parameter tuning, they could achieve even higher accuracy.
As for the comparison between XGBoost and Random Forest: the two take different approaches. Random Forest relies on bagging (bootstrap aggregation) and uses fully grown trees, while XGBoost relies on boosting and uses weak learners. Random Forest is faster, less computationally expensive, and requires fewer parameters to tune.
The Data-set and the code can be found here.