Ensemble methods and why they are preferred by machine learning practitioners

Shafil Ahamed · AlmaBetter · Aug 26, 2021
source: https://freedesignfile.com/219961-dense-forest-landscape-vector-material/

Introduction

One of the major disadvantages of tree-based algorithms is their tendency to overfit the training data.

  1. A decision tree tends to split on the features that are most important to the model, while features that hold slightly less value are left largely unused.
  2. Allowing a decision tree to split down to a very granular level is the behavior that makes this model prone to learning every point extremely well, to the point of perfect classification on the training set, i.e. overfitting.

Now, before getting into how ensembling techniques overcome this drawback of decision trees, let us first define what these ensembling techniques are.

Ensemble techniques are meta algorithms that combine several machine learning techniques into one supervised model in order to decrease variance through bagging, bias through boosting, or improve predictions through stacking.

Ensemble methods can be divided into two sub-categories:

  • Sequential ensemble methods train weak learners (often stumps) one after another, with each successive learner fit on the errors left behind by the previous one (e.g. AdaBoost, CatBoost, GBM).
    Overall performance can also be boosted by assigning higher weights to previously mislabeled examples.
  • Parallel ensemble methods train multiple weak learners in parallel and combine them into a strong prediction, either by averaging their outputs or by voting (e.g. Random Forest).
    The basic motivation of parallel methods is that the error can be reduced dramatically by averaging.

Why are ensembling methods so popular?

  • Performance: An ensemble can make better predictions and achieve better performance than any single base model.
  • Robustness: An ensemble reduces the variance of the predictions and of the model's performance.
  • Stability: We use bagging when the base model has low bias and high variance, and boosting when it has high bias and low variance.

Bagging

source : https://corporatefinanceinstitute.com/resources/knowledge/other/bagging-bootstrap-aggregation/

Bagging, also known as bootstrap aggregation, is an ensemble technique that is often used to reduce the variance of a model. The technique is useful for both regression and classification. Bagging is typically used with multiple decision trees, where it raises the stability of the model by improving accuracy and reducing variance, which mitigates the problem of overfitting. Let us understand this with the most common bagging algorithm, Random Forest.

Random Forest takes its name in a quite literal sense. Just as a forest consists of multiple trees, the Random Forest algorithm uses multiple decision trees in a bagging fashion. This in turn reduces the variance by averaging the predicted values of all the trees.
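
To make this concrete, here is a minimal sketch of a bagged forest using scikit-learn's RandomForestClassifier; the breast-cancer dataset and the hyperparameter values are illustrative choices, not part of the original article.

    # Minimal Random Forest sketch (illustrative dataset and hyperparameters).
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # Each of the 100 trees is trained on a bootstrap sample of the rows and a
    # random subset of the features; their votes are aggregated for the prediction.
    forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    forest.fit(X_train, y_train)
    print("Test accuracy:", accuracy_score(y_test, forest.predict(X_test)))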

Bagging consists of two steps: bootstrapping and aggregation.

Bootstrap is a sampling method that works on the principle of sampling with replacement. Let's say we have M rows and N features. We sample out a subset with m rows and n features, where M > m and N > n. Because the sampling is done with replacement, every row we pick is put back into the original data and can be picked again, which keeps each sample random. We draw a large number of such samples and feed each one to an individual decision tree. Another advantage of this sampling is that all features get an opportunity to be split upon in some decision tree.
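
A tiny sketch of sampling with replacement using NumPy; the array of ten row indices is purely illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    rows = np.arange(10)  # pretend these are the indices of the M rows

    # Sampling with replacement: some rows appear more than once, others not at all.
    bootstrap_sample = rng.choice(rows, size=len(rows), replace=True)
    print("Bootstrap sample:", bootstrap_sample)
    print("Rows not drawn this time:", sorted(set(rows) - set(bootstrap_sample)))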

Source: https://www.researchgate.net/figure/Bootstrap-Sampling-Procedure-The-classical-procedure-of-bootstrapping-involves-sampling_fig1_332849902

Aggregation is the way the predictions of the individual decision trees are combined, and it differs between regression and classification. In regression, the outputs predicted by the individual trees are simply averaged. In classification, the class that receives the majority of votes is chosen; this is known as hard voting. Alternatively, the predicted class probabilities can be averaged, which is known as soft voting.
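
The two aggregation rules can be illustrated with a couple of lines of NumPy; the per-tree outputs below are made up for the example.

    import numpy as np

    # Hypothetical outputs from individual trees.
    regression_preds = np.array([2.9, 3.1, 3.4])   # three regression trees
    class_votes = np.array([1, 0, 1, 1, 0])        # five classification trees voting 0/1

    print("Regression (average of outputs):", regression_preds.mean())
    print("Classification (hard vote):", np.bincount(class_votes).argmax())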

Boosting

Source : https://en.wikipedia.org/wiki/Boosting_(machine_learning)#/media/File:Ensemble_Boosting.svg

Boosting is an ensembling technique used specifically when the model has high bias and low variance. Unlike bagging, this technique works sequentially: it takes a weak learner and converts it into a strong learner by successively modeling the residuals of the previous learner. Let us understand it in layman's terms; a small code sketch of this residual-fitting loop follows the steps below.

  1. The model first takes a weak learner, preferably a stump (a decision tree with a single split), and the data is fed into it.
  2. This weak learner underfits the data, leaving residuals.
  3. Because the model was unable to capture all the patterns in the data, there is still signal present in the residuals (error margins).
  4. These residuals are given as input to the next weak learner, which again captures some of the remaining signal.
  5. The process goes on until the residuals hold no more learnable signal and the cost function reaches its minimum.
  6. The final value is calculated as the sum of the predictions of all the weak learners.
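
Here is a rough sketch of that residual-fitting loop using decision stumps from scikit-learn; the synthetic data, the number of rounds, and the learning rate are all illustrative assumptions.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    prediction = np.zeros_like(y)   # start with a trivial model that predicts 0
    learning_rate = 0.1
    stumps = []

    for _ in range(100):
        residual = y - prediction                        # what the ensemble still misses
        stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
        stumps.append(stump)
        prediction += learning_rate * stump.predict(X)   # add the weak learner's contribution

    # The final prediction is the (scaled) sum of all the weak learners' outputs.
    print("Training MSE after boosting:", np.mean((y - prediction) ** 2))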

Some examples of boosting techniques are AdaBoost and the Gradient Boosting Machine (GBM); a brief description of how they work is given below:

AdaBoost, or adaptive boosting, is a technique that is used as an ensemble method in machine learning. It is known as adaptive boosting because the weights are re-assigned at each iteration, with higher weights assigned to incorrectly classified instances.

It works on the principle that learners are added sequentially. Except for the first, each subsequent learner is grown from the previously added learners. In simple words, weak learners are combined into a strong learner by making each new learner focus on the mistakes of the previous ones.

The AdaBoost algorithm works on the same principle as boosting in general, with a slight difference in how it treats the data: rather than effectively focusing only on the incorrectly classified data points, AdaBoost passes all records on to the next base learner, increasing the weight of wrongly classified records and decreasing the weight of correctly classified records.
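
A minimal AdaBoost sketch with scikit-learn; the synthetic dataset and the number of estimators are illustrative (by default the weak learner is a decision stump).

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=42)

    # AdaBoost re-weights the training points after every round, so each new
    # stump concentrates on the examples the previous ones got wrong.
    ada = AdaBoostClassifier(n_estimators=50, random_state=42)
    print("Cross-validated accuracy:", cross_val_score(ada, X, y, cv=5).mean())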

To get a deeper knowledge of AdaBoost and see its working in a stepwise example, please refer to this link: https://www.mygreatlearning.com/blog/adaboost-algorithm/

Gradient Boosting Machine (GBM) is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

source: https://data-flair.training/blogs/gradient-boosting-algorithm/

Input requirements for GBM:

1. A loss function to optimize.

2. A weak learner to make predictions (generally a decision tree).

3. An additive model that adds base learners to minimize the cost function.

Working: Gradient Boosting relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error.

Just like AdaBoost, GBM works by sequentially adding base learners to an ensemble, each one correcting its predecessor. However, instead of changing the weights of every incorrectly classified observation after every iteration as AdaBoost does, gradient boosting fits the new base learner to the residual errors made by the previous predictor.

Some important features of the Gradient Boosting Machine:

1) Gradient Boosting is one of the boosting algorithms used to minimize the bias error of a model.

2) We can tune the n_estimators parameter of the gradient boosting algorithm. If we do not specify a value for n_estimators, the default of 100 is used (n_estimators is the number of decision trees).

3) When it is used as a regressor, the cost function is mean squared error (MSE); when it is used as a classifier, log loss is taken as the cost function. A short scikit-learn sketch is given below.
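
The following is a short scikit-learn sketch of a Gradient Boosting regressor; the diabetes dataset and the explicit hyperparameter values (which happen to match the library defaults) are illustrative.

    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    # n_estimators defaults to 100; the regressor minimizes squared error,
    # while the classifier version uses log loss.
    gbm = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1,
                                    max_depth=3, random_state=42)
    gbm.fit(X_train, y_train)
    print("Test MSE:", mean_squared_error(y_test, gbm.predict(X_test)))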

For further information on Gradient Boosting Machine please refer to the link: https://www.analyticsvidhya.com/blog/2021/04/how-the-gradient-boosting-algorithm-works/

Now, to answer the question with which we started this blog: how do ensembling techniques overcome the drawbacks of decision trees?

For decision trees, ensemble methods are one of the few efficient means of regularizing the model.

While using decision trees we face a conundrum: in order to model a complex set of data, we often need many levels of the tree, but as the tree grows deeper, it becomes prone to overfitting.

That's because every time you add a new level to your tree, you add a new predicate to the sequence. The chain becomes so specific that each leaf is likely to apply to only a few training examples. Also, because the number of leaves grows exponentially with the number of levels, each successive level becomes significantly more prone to overfitting.

So instead of using one tree with, for example, depth 50, we might use 10 trees, each of depth 5, in a random forest. Each tree comes to its own prediction, and then a voting method is used for the final prediction, where each tree gets a vote. That way our model becomes more robust and less overly specific.
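
To make that trade-off concrete, the sketch below compares a single very deep tree with a small forest of shallow trees; the synthetic dataset and the exact depths are illustrative, and on most datasets the forest generalizes better.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    deep_tree = DecisionTreeClassifier(max_depth=50, random_state=0)
    shallow_forest = RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0)

    # Compare the generalization of the two models with cross-validation.
    print("Single deep tree:      ", cross_val_score(deep_tree, X, y, cv=5).mean())
    print("Forest of shallow trees:", cross_val_score(shallow_forest, X, y, cv=5).mean())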

Advantages of using ensembling techniques

  1. It acts as a regularizer for overfitting models.
  2. It increases the stability of our final model.
  3. It significantly increases the accuracy of our model.

Disadvantages of using ensembling techniques

  1. An ensemble is considered a black-box model, so in the trade-off between accuracy and interpretability it becomes very difficult to interpret the model, and interpretability is a vital part of any real-world solution.
  2. Ensemble techniques also cost more to create, train, and deploy. The ROI (return on investment) of an ensemble approach should be weighed against its computational cost.

Conclusion

Apart from the methods discussed in this article, it is common to use ensembles in deep learning by training diverse and accurate classifiers. Diversity can be achieved by varying model architectures, hyperparameter tuning, and training techniques.

Ensemble methods have been setting record performance on big datasets and are regularly among the winners of Kaggle competitions. At the same time, we need to keep in mind the high computational cost and low interpretability of these models.
