Ensemble Methods: A Beginner’s Guide

Ravindra Sharma
Published in Analytics Vidhya
Jul 22, 2020 · 5 min read

Introduction

When I started my Data Science journey, terms like ensemble and boosting kept popping up. Whenever I opened the discussion forum of a Kaggle competition or looked at a winner's solution, it was mostly filled with these things.

At first these discussions sounded totally alien, and this class of ensemble models looked like fancy stuff not meant for newbies, but trust me, once you have a basic understanding of the concepts you are going to love them!

Ensemble

So let's start with a very simple question: what exactly is an ensemble? I googled it and came across this definition:

“A group of separate things/people that contribute to a coordinated whole”

In a way, this is the core idea behind the entire class of ensemble learning! To put it more technically, ensembles are a class of algorithms where a group of individual models (usually called base learners) are combined in certain ways to take an overall decision, thereby improving the predictive power of the overall model.

Broadly speaking, there are four types of ensemble techniques:

  • Bagging
  • Boosting
  • Cascading
  • Stacking

(Here I will mostly discuss the first two.)

Source: Datacamp

At its core, the idea behind all four types is the same: a group of base learners is combined to take a decision. The differences lie in two aspects:

  • What type of base learners are used
  • How are these base learners combined to take an overall decision

Now, let us try to answer these two questions for the techniques above.

Bagging

Let's rewind the clock a bit and go back to school for a while. Remember the report card you used to get with an overall grade? How exactly was that overall grade calculated? Your teachers gave feedback based on their own sets of criteria: your math teacher assessed you on algebra, trigonometry and so on, your sports teacher judged how you performed on the field, and your music teacher judged your vocal skills. The point is that each of these teachers had their own rules for judging a student's performance, and all of these were later combined into an overall grade for the student.

In layman's terms, bagging does something very similar. Think of each teacher as a base model with its own criteria/features; the decision from each base model/teacher is aggregated to calculate the overall prediction/grade.

So bagging is a technique where multiple base learners (for example, decision trees) are trained in parallel, and at prediction time the results of these base learners are aggregated through methods like majority vote, mean, etc. to get a net outcome. Now imagine you train every base learner (say, a decision tree) on the entire dataset: there is a high chance these models would give the same results, so aggregating them would not add much value. To solve this, each base learner gets to see a slightly different version of the original training dataset, so that each model is trained differently and aggregating them yields a meaningful result. Two techniques are used so that each base learner sees a different version of the original dataset:

  • Row Sampling
  • Column Sampling

Now, instead of giving the entire dataset to each base learner, a random subset of rows and columns is given to it. One key thing to note is that we do bootstrapping (sampling with replacement) for rows only, not for columns, i.e. each bootstrapped subset can contain the same row multiple times, but its columns are unique. (If we did bootstrap sampling for columns as well, the base learners could face the problem of collinearity.)
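As a rough illustration of row and column sampling, here is a minimal sketch using scikit-learn's BaggingClassifier on a synthetic dataset (the library, dataset, and hyperparameters are my own choices for illustration, not something prescribed above): rows are drawn with replacement, while columns are drawn without replacement.

```python
# A minimal bagging sketch (illustrative only): bootstrapped rows,
# non-bootstrapped columns, decision trees as base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagger = BaggingClassifier(
    DecisionTreeClassifier(),   # base learner
    n_estimators=50,            # number of base learners trained in parallel
    max_samples=0.8,            # row sampling: 80% of the rows per learner
    bootstrap=True,             # rows are drawn WITH replacement
    max_features=0.6,           # column sampling: 60% of the columns per learner
    bootstrap_features=False,   # columns are drawn WITHOUT replacement
    random_state=42,
)
bagger.fit(X_train, y_train)
print("Bagging accuracy:", bagger.score(X_test, y_test))
```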

Source: ScienceDirect

Each base learner is thus trained differently, and this helps reduce the net variance of the overall model: even if the individual base learners have high variance, once their results are aggregated, the variance of the combined model goes down. As an example, Random Forest is one of the most popular bagging techniques and uses all the principles discussed above.

The basic structure of a Random Forest model (Random Forests, Decision Trees, and Ensemble Methods Explained, by Dylan Storey)
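To make that variance-reduction claim a bit more tangible, here is a small, hedged comparison of a single decision tree against a Random Forest on synthetic data (the dataset and settings are again arbitrary illustrative choices); typically the aggregated forest scores higher and varies less across folds than the lone tree.

```python
# Comparing a single high-variance tree with a Random Forest (illustration only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

# Cross-validated accuracy: mean +/- standard deviation across folds.
scores_tree = cross_val_score(tree, X, y, cv=5)
scores_forest = cross_val_score(forest, X, y, cv=5)
print("Single tree  : %.3f +/- %.3f" % (scores_tree.mean(), scores_tree.std()))
print("Random Forest: %.3f +/- %.3f" % (scores_forest.mean(), scores_forest.std()))
```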

Boosting

Now let's move to one of the most widely used techniques in data science hackathons and competitions. The underlying principle is similar to bagging: in bagging, base learners take decisions in parallel and those decisions are later aggregated, whereas in boosting, base learners take decisions sequentially (the output of one model is used as an input to the next), so that each new model learns from the errors/mistakes of the previous models.

Let us try to understand this with an intuitive example. Say a golfer is trying to hit the ball into a hole 500 m away. With the first shot he tries to cover the maximum distance and come as close to the hole as possible. Let's say after the first shot he is 80 m away; the golfer then reassesses the target and adjusts his swing to cover this residual distance. Assume he covers a further 20 m; now, being closer to the hole, he taps the ball more softly on the next shot.

This is broadly how a boosting algorithm works: each shot can be considered a sequential base learner that is trained on the output/residual distance left by the previous shot/base learner.

By the way, doesn't this approach sound familiar: converging towards the target gradually, taking smaller and smaller steps as you get closer to the optimum? If it reminds you of Gradient Descent, you are absolutely correct! In fact, there is a very popular boosting algorithm called Gradient Boosted Decision Trees (GBDT); you are free to draw your own conclusions about the nomenclature of that model.
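To see the "learn from the previous model's errors" idea in code, below is a toy, from-scratch sketch of gradient boosting for regression with squared loss: shallow trees are fitted to the residuals left by the ensemble so far, and a learning rate shrinks each step. The data, tree depth, and learning rate are made up for illustration; real GBDT implementations are considerably more sophisticated.

```python
# Toy gradient boosting for regression: each tree fits the residuals
# (errors) left by the ensemble so far, like the golfer's successive shots.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # start from a constant prediction
trees = []

for _ in range(100):
    residuals = y - prediction                     # what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2)      # a weak, shallow base learner
    tree.fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # take a small step towards y
    trees.append(tree)

print("Final training MSE:", np.mean((y - prediction) ** 2))
```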

AdaBoost, or Adaptive Boosting, is another example of boosting. The underlying principle is still the same: each model learns from the errors made by the previous models. The only difference is that instead of learning on residual errors, higher weights are assigned to the points misclassified by the previous model. XGBoost is essentially a more optimized implementation of GBDT.
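For completeness, here is a short AdaBoost sketch with scikit-learn's AdaBoostClassifier (the reweighting of misclassified points happens inside the library; the dataset and parameters below are purely illustrative):

```python
# AdaBoost sketch: weak learners trained sequentially, with misclassified
# points up-weighted for the next learner (handled internally by scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=1)
ada.fit(X_train, y_train)
print("AdaBoost accuracy:", ada.score(X_test, y_test))
```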

End Note

Ensemble models are very powerful techniques that help us build strong Machine Learning models. Having said that, they are not magic wands you can wave to produce the desired results every time; they have limitations as well. Increasing the complexity of a model comes at the cost of its interpretability, so to develop good models you should not rely blindly on these complex methods. You still need to invest time in the other important parts of a Machine Learning project, like Exploratory Data Analysis, Feature Engineering, etc.


Ravindra Sharma
Analytics Vidhya

Data Science professional with experience in the Finance domain, who enjoys uncovering insights and creating data-driven stories to empower the business.