Ensemble Models in Machine Learning

Mehul Gupta
Data Science in your pocket
4 min read · Jul 17, 2019


Introduction to Decision Trees

The decision tree is among the most widely known models in Machine Learning and Data Science. In this blog post, we will learn what ensemble techniques are and how they work, using a decision tree as our base model.

Note: This article assumes that you are familiar with the decision tree algorithm and how it works. You can refer to this link to refresh your concepts on the same: Decision Tree Simplified

Although the decision tree is a simple and powerful algorithm, it has a few problems. One of the major issues is that it tends to overfit, especially when the tree is allowed to grow deep (see the sketch below). Is there any way to overcome this? What if, instead of using a single decision tree, we build multiple trees? Let's learn more about this in the next section.
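Here is a minimal sketch (my own toy illustration, not code from the article) of that overfitting behaviour on synthetic data: an unrestricted tree scores almost perfectly on the training set, while a depth-limited tree generalizes better.

```python
# Hypothetical illustration: compare a fully grown tree with a depth-limited one.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)   # grows until leaves are pure
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)   # depth is capped

for name, model in [("deep tree", deep), ("shallow tree", shallow)]:
    print(name, "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```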

Introduction to Ensemble Learning

This is where ensemble modeling comes into the picture. An ensemble model is nothing but an aggregation of a number of weak learners, where a weak learner is a model that performs only slightly better than random guessing. Most of the time, decision trees are used as the weak learners: we train tens or hundreds of different decision trees, get their individual results, and combine these results into a final prediction (a minimal hand-rolled example follows below). Let us now talk about various ensemble models.
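As a rough sketch of the idea (my own toy example, not from the article), we can train a handful of shallow decision trees on random bootstrap samples and combine their predictions by majority vote:

```python
# Toy illustration of ensembling: many weak (shallow) trees + majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.RandomState(0)
trees = []
for _ in range(25):                                                    # 25 weak learners
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)    # bootstrap sample
    trees.append(DecisionTreeClassifier(max_depth=2).fit(X_train[idx], y_train[idx]))

votes = np.array([t.predict(X_test) for t in trees])       # shape: (25, n_test_samples)
ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)    # majority vote per sample

print("single shallow tree accuracy:", trees[0].score(X_test, y_test))
print("ensemble accuracy:", (ensemble_pred == y_test).mean())
```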

1. RandomForestClassifier-

Random Forest is among the most widely used ensemble models and follows the concept of Bagging. Here we build a number of decision trees, say hundreds or thousands, all independent of each other. Each tree may use the entire training dataset or a random part of it (a bootstrap sample) and produces its own prediction. These individual results are then aggregated, by averaging or by majority vote for classification, to give the model's final prediction. This aggregation helps ensure the model doesn't overfit.

Example: suppose we have 100 decision trees, of which 60 predict 1 and 40 predict 0 (considering binary classification). Since more trees predict 1, the final result is 1.
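A minimal sketch of sklearn's RandomForestClassifier on toy data follows; the hyperparameter values are purely illustrative, not recommendations from the article.

```python
# Random Forest: many independent trees, each on a bootstrap sample and a random
# subset of features; predictions are aggregated (majority vote for classification).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rfc = RandomForestClassifier(
    n_estimators=100,     # number of independent trees
    max_features="sqrt",  # each split looks at a random subset of features
    bootstrap=True,       # each tree sees a random bootstrap sample of the data
    random_state=42,
)
rfc.fit(X_train, y_train)
print("test accuracy:", rfc.score(X_test, y_test))
```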

Documentation Link: RandomForestClassifier

2. Gradient Boosting Machine-

Unlike the Random Forest Classifier, which works on the concept of Bagging, GBM uses Boosting. Here too we take tens of decision trees, but they are not independent: the trees work in sequential order. The output of one tree is used by the next to focus on the errors, i.e. each new tree is fit on the residuals of the previous ones. The common problem is that GBM overfits quickly, so keep the number of trees comparatively lower than for a Random Forest.

Example: let us have 5 decision trees. The 1st one, call it F1, takes the training data and produces output Y1. Now the 2nd tree, call it H1, takes X as input but uses Y - Y1 (the residual of tree F1's prediction) as its target. The combined output of F1 and H1 is the final output. If there are more trees, the same chain continues.

Y2 = F1(X) + H1(X), where F1's target is Y and H1's target is Y - Y1, and:

X=input/training data

Y=Target value

F1=a weak learner

H1=Booster for F1, the new decision tree model

Y1=output of F1(X)

Y2=Improved results

Now for the next boosting round, we use

Y3 = Y2 + H2(X), where H2's target is Y - Y2

Here, all notations remain the same, except that H2 is the new booster and Y3 is an improved version of Y2.

The same step can be repeated for the specified number of trees to get progressively better results (a minimal sketch of this residual-fitting chain follows below). The rest of the models described below also use the Boosting technique for ensembling.
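A minimal regression sketch of this residual-fitting chain (my own illustration of the F1/H1 walkthrough above, not the exact library implementation):

```python
# Hand-rolled boosting illustration: each new tree fits the residuals Y - current prediction.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, Y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

F1 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, Y)
Y1 = F1.predict(X)                                                       # output of the first weak learner

H1 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, Y - Y1)   # booster for F1
Y2 = Y1 + H1.predict(X)                                                  # improved prediction

H2 = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, Y - Y2)   # next booster
Y3 = Y2 + H2.predict(X)                                                  # further improved prediction

for name, pred in [("Y1", Y1), ("Y2", Y2), ("Y3", Y3)]:
    print(name, "mean squared error:", round(np.mean((Y - pred) ** 2), 2))
```

In practice, library implementations such as sklearn's GradientBoostingClassifier also shrink each booster's contribution by a learning rate instead of adding it in full.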

Documentation Link: Gradient Boosting Classifier

3. eXtreme Gradient Boosting Machine-

It is the most popular model when it comes to Kaggle competitions. It is an upgraded version of GBM, hence faster and more memory efficient: rather than evaluating every possible split, it considers only the most useful ones, i.e. if 1000 split points are possible, it may evaluate only the 100 best candidates, saving both space and time (it uses a pre-sorted splitting algorithm). It is often described as Regularized GBM: a term lambda (call it L for now) scales the booster used in the example above, so the update becomes L*H1() instead of H1().
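A minimal sketch with the xgboost package follows; the parameter values are illustrative, with reg_lambda being xgboost's L2 regularization term and learning_rate shrinking each booster's contribution.

```python
# XGBoost via its sklearn-style wrapper; hyperparameters here are only illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

xgb = XGBClassifier(
    n_estimators=200,
    max_depth=4,
    learning_rate=0.1,   # scales each tree's contribution (the L*H1() idea above)
    reg_lambda=1.0,      # L2 regularization term
    random_state=42,
)
xgb.fit(X_train, y_train)
print("test accuracy:", xgb.score(X_test, y_test))
```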

Documentation Link: Extreme Gradient Boosting Classifier

4. Light Gradient Boosting Machine-

LGBM is another emerging model gaining popularity in the data science domain. Though the accuracy of XGB and LGBM is usually quite close, their implementations differ slightly: to find the best splits among all possible ones (the "100 out of 1000 split points" idea above, which reduces extra work), LGBM uses Gradient-based One-Side Sampling (GOSS), while XGB uses a pre-sorted algorithm for splitting.

For an explanation of GOSS and pre-sorted splitting, kindly check here
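A minimal sketch with the lightgbm package follows; the parameter values are illustrative, and you should check your lightgbm version's documentation for how to enable GOSS explicitly, as the relevant parameter name has changed across versions.

```python
# LightGBM via its sklearn-style wrapper; hyperparameters here are only illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lgbm = LGBMClassifier(
    n_estimators=200,
    learning_rate=0.1,
    num_leaves=31,       # LightGBM grows trees leaf-wise, controlled by num_leaves
    random_state=42,
)
lgbm.fit(X_train, y_train)
print("test accuracy:", lgbm.score(X_test, y_test))
```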

Documentation Link: Light Gradient Boosting

5. CatBoost-

Though less popular and comparatively slower than LGBM and XGB, CatBoost has an unbeatable advantage: it can take categorical data in text form directly (you only need to mention which columns are categorical) and train the model, hence the name Categorical Boosting. In other words, it understands categorical data, while other models only accept it once it has been converted to numeric form. No preprocessing step such as OneHotEncoder or LabelEncoder is required for categorical columns, which often leads to better results.

To know how this categorical data intake is handled internally, refer here.
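A minimal sketch with the catboost package follows; the DataFrame columns here are made up purely for illustration.

```python
# CatBoost accepts raw categorical columns; just list them via cat_features.
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "city":   ["delhi", "mumbai", "delhi", "pune", "mumbai", "pune"],          # categorical, kept as text
    "device": ["mobile", "desktop", "mobile", "mobile", "desktop", "desktop"], # categorical, kept as text
    "age":    [23, 35, 41, 29, 52, 33],                                        # numeric
    "bought": [1, 0, 1, 0, 0, 1],                                              # target
})
X, y = df.drop(columns="bought"), df["bought"]

model = CatBoostClassifier(iterations=50, verbose=0)
model.fit(X, y, cat_features=["city", "device"])   # no OneHot/LabelEncoding needed
print(model.predict(X))
```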

Documentation Link: CatBoost Documentation

Apart from these, many new ensemble models keep coming up, showing better results than traditional models. Each model has its merits and demerits as well. The right model depends on the problem and the dataset available. According to the No Free Lunch Theorem, no single model is best for every problem, and hence experimenting on your own data is the only way to find the right one.
