Ensemble Learning and Its Methods

Sai Varun Immidi
Published in Analytics Vidhya · Aug 28, 2020 · 14 min read


Image Source: aitimejournal.com

The word Ensemble refers to a group of objects viewed as a whole. The same definition applies to Ensemble modeling in machine learning, in which a group of models is considered together to make predictions. As soon as we hear about Ensemble modeling, we remember one of the most popular Ensemble models, Random Forests, which is based on the Bagging technique. In this article, we will not discuss Random Forests in depth; instead, we will focus on the concepts surrounding Ensembles and the popular Ensemble techniques.

That’s enough of an introduction; let’s start looking at the topics involved in Ensembles. Rather than diving directly into the popular Ensemble techniques, we will first understand the conditions that individual models need to satisfy in order to form an Ensemble, and then discuss why an Ensemble that satisfies them performs better than any individual model.

Conditions or Criteria that Individual Models Need to Satisfy to Form an Ensemble

An Ensemble can be formed from different kinds of models performing the same classification or regression task. For a classification task, for example, we can form an Ensemble from a Logistic Regression model, a Decision Tree Classifier, a KNN Classifier, a Support Vector Classifier, and so on; the same goes for regression. But in order to form an Ensemble, whether from different models or from copies of the same model, the individual models need to satisfy two conditions: Diversity and Acceptability.

Diversity means that the individual models must be complementary to each other, so that their strengths and weaknesses cancel out. In machine learning terms, if one individual model overfits and another performs well, the well-performing model offsets the overfitting, and the Ensemble as a whole performs better. When the models are diverse, their predictions are largely independent: the predictions made by one model are not affected by the predictions made by another, and vice versa. As a result, the overall variance of the Ensemble is reduced, making it more resistant to overfitting.

Having heard the story about Diversity, a natural question is how to achieve it between the models. Some of the common practices are:

  1. Building different individual models on different subsets, or bootstrap samples, of the training data.
  2. Building individual models of the same model class with different hyperparameter combinations.
  3. Using entirely different model classes, for example Logistic Regression, KNN, SVM, and a neural network for a classification task; the same applies to regression.
  4. Building individual models on different subsets of features.
  5. Combining the above: building individual models on different subsets of both the data and the features.

These are some of the many methods by which Diversity can be achieved between the models used to form an Ensemble.
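As a small illustration of the third approach, here is a minimal sketch (assuming scikit-learn; this code is not from the original article) that builds diverse base models from different model classes and checks that each one does better than random guessing, which is the Acceptability condition discussed next:

```python
# A minimal sketch (assumed, not from the article): diverse base models from different model classes.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

base_models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=42),
}

for name, model in base_models.items():
    acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    # Acceptability: each individual model should beat a random guess (> 0.5 accuracy).
    print(f"{name}: mean CV accuracy = {acc:.3f}, acceptable = {acc > 0.5}")
```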

Acceptability means that each individual model considered for the Ensemble should be acceptable for the task on its own. In simple statistical terms, the probability of an individual model making a correct prediction should be better than that of a random model; for a binary classification task this can be quantified as a probability of correct prediction greater than 0.5.

Diversity and Acceptability are both necessary; otherwise the Ensemble will perform no better than any of its individual models, and there would be no point in building it. A simple practical example shows why these conditions matter for forming an effective, well-performing group. A football team needs diverse players: defenders, attackers, and a goalkeeper. Having all these kinds of players helps the team perform better; if every player were a defender, a proper team could not be formed. Similarly, the players must be acceptable for the shared task of winning the game, with each individual player performing better than an ordinary person. I hope this example gives you some intuition about Diversity and Acceptability.

Once the individual models satisfy these two conditions, they can form an Ensemble. But at this point you might ask: what guarantees that an Ensemble formed from such models will perform better than any individual model? This drives us to our next discussion.

Will the Ensemble model perform better than any individual model?

This question is best answered by drawing a parallel between the Ensemble model and a biased coin toss. You might wonder why a biased coin rather than an unbiased one: remember that when building the Ensemble we considered individual models whose probability of a correct prediction is greater than 0.5. Also, coin tosses are independent events, meaning the outcome of the first trial does not affect the outcome of the second, just as diverse models make independent predictions. These reasons let us draw the parallel between an Ensemble and a coin toss.

Before diving into the parallel between the Ensemble model and a biased coin toss, we need to understand how an Ensemble makes predictions in a classification or regression task. In classification, the Ensemble predicts by majority vote: if more than half of the models in the Ensemble predict a particular class label for a test data point, that label is the Ensemble’s final prediction. The strategy of taking the majority vote turns out to be effective and valid, which we can show by continuing the parallel with the biased coin. Since the coin is biased, let the probability of heads be much greater than the probability of tails, and map heads to a correct prediction by an individual model and tails to a wrong prediction. If we toss this biased coin N times, the probability of getting tails in more than half of the trials is very low, while the probability of getting heads in more than half of the trials is high. Translating back through our mapping: the probability that more than half of the models in the Ensemble make a wrong prediction is low, and the probability that more than half make a correct prediction is high. Hence taking the majority vote as the aggregation technique helps the Ensemble make correct predictions more often than any of the individual models. The same reasoning extends to the regression task with averaging.

If all this still feels abstract, let’s work through it with numbers. Consider a biased coin with P(Heads) = 0.7 and P(Tails) = 0.3, and three individual models (m1, m2, m3) that satisfy the conditions and form an Ensemble. Map heads to a correct prediction and tails to a wrong prediction. With majority-vote aggregation, the prediction made by at least half of the models is taken as the final prediction, so in our example the prediction made by ≥ 2 models is the final one. Tossing the coin for three trials, the possible combinations are:

Ensemble Model Coin Toss Analogy

Let’s now compute the probability of a correct prediction (p) and of an incorrect prediction (q) by the Ensemble. The results are as follows:

Probability of correct and incorrect predictions by the Ensemble
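Since the figure is not reproduced inline, the computation it summarizes is the standard binomial calculation (my reconstruction from the numbers above, not taken verbatim from the article):

p = P(≥ 2 heads) = C(3,2)·(0.7)²·(0.3) + C(3,3)·(0.7)³ = 0.441 + 0.343 = 0.784

q = P(≥ 2 tails) = C(3,2)·(0.3)²·(0.7) + C(3,3)·(0.3)³ = 0.189 + 0.027 = 0.216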

From the above results, we can see that the probability of a correct prediction by the Ensemble under majority-vote aggregation (0.784) is about 8 percentage points higher than the probability of a correct prediction by any individual model (0.70). Likewise, the probability of an incorrect prediction by the Ensemble (0.216) is about 8 percentage points lower than that of any individual model (0.30). This demonstrates that the Ensemble performs better than any individual model.

Adding more models to the Ensemble widens the gap between the Ensemble’s probabilities of correct and incorrect predictions and those of any individual model. So, as long as the models remain acceptable and reasonably independent, the greater the number of models in the Ensemble, the better its performance.
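As a quick check, here is a small sketch (assuming SciPy; not from the original article) that computes the majority-vote probability for any odd ensemble size from the binomial distribution. It reproduces the 0.784 figure for three models and keeps growing as models are added:

```python
# A sketch (assumed, not from the article): probability that a majority of n independent
# models is correct, when each model is correct with probability p.
from scipy.stats import binom

def majority_correct_prob(n_models: int, p_correct: float) -> float:
    # Strict majority for odd n: P(X >= n//2 + 1) where X ~ Binomial(n, p).
    k_needed = n_models // 2 + 1
    return binom.sf(k_needed - 1, n_models, p_correct)  # sf(k-1) = P(X >= k)

for n in (3, 5, 11, 25):
    print(n, round(majority_correct_prob(n, 0.7), 3))
# 3 -> 0.784, 5 -> 0.837, 11 -> 0.922, and the probability keeps approaching 1 as n grows.
```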

Having understood the foundational concepts of Ensembles, it is now time to look at the most popular Ensemble techniques.

Popular Ensemble Modeling techniques

Some of the popular Ensemble techniques are:

  1. Voting / Average
  2. Stacking and Blending
  3. Bagging
  4. Boosting

Let’s discuss each one of the methods.

Voting / Average

In the Voting method, we build the Ensemble from different individual models that satisfy the conditions, and the final prediction for a test data point is the majority vote: the class predicted by more than half of the models. The voting strategy is used for classification tasks; for regression, the Ensemble’s final prediction is the average of the predictions made by the individual models. We have already seen why voting and averaging are effective ways to combine predictions. One important caveat to remember is that, in this technique, the predictions of all individual models carry equal weight in the final prediction. But some models in the Ensemble may perform better than others, so treating them equally means the Ensemble does not reach its full potential. To overcome this drawback we can assign weights to the predictions made by the individual models, which leads us to the next popular Ensemble techniques: Stacking and Blending.

Voting and Average Ensemble Technique
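For illustration, a hard-voting classifier over the three base models from earlier could look like the following (a minimal sketch assuming scikit-learn, not from the original article); the commented-out weights parameter is where unequal importance could be introduced:

```python
# A minimal sketch (assumed, not from the article): majority-vote ensemble in scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
    ],
    voting="hard",       # majority vote; "soft" would average predicted probabilities
    # weights=[2, 1, 1], # optional: give better models more say in the vote
)

print(cross_val_score(voting_clf, X, y, cv=5).mean())
```

For a regression task, the analogous VotingRegressor simply averages the individual models’ predictions.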

Stacking and Blending

There is a subtle difference between the Stacking and Blending techniques in how the training data, or meta-features, for the level-2 model are sampled. In these techniques we assign weights to the predictions made by the individual models, either manually or by training a level-2 model. The predictions made by the level-1 models are collected and given as training data to the level-2 model; the level-2 model is thus trained on the outputs of the level-1 models and produces the final prediction by effectively learning how much weight to give each level-1 model. If all this seems to be going over your head, here is a pictorial representation of the process.

Stacking and Blending

There is no restriction to only two levels of models; the process can be made more complex with more levels and more models within each level. In practice, however, two levels are usually sufficient and already yield good results.

Having understood the high-level process of Stacking and Blending, let’s discuss each in detail, starting with Stacking and then moving on to Blending.

Stacking is also called “Stacked Generalization”. The overall technique is as described above, with one addition in how the meta-features that serve as training data for the level-2 model are sampled. Rather than directly using the level-1 models’ predictions on the training data, we use cross-validated (out-of-fold) predictions. The original training data on which the level-1 models are trained is split into k folds, giving k rounds: in each round, one fold is held out as the validation set and the remaining (k-1) folds are used to train the different level-1 models, which then make predictions on the held-out fold. Repeating this across all k rounds, the predictions made by all level-1 models on their respective validation folds are collected as the training data for the level-2 model. Hence the training data for the level-2 model has shape (m x M), where m is the number of rows in the original training data and M is the number of level-1 models. For a better understanding, refer to the image below, which shows the main training data split into 5 folds across 5 rounds.

K Fold cross-validation in Stacking
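A compact way to sketch this (assuming scikit-learn; not from the original article) is either with StackingClassifier, which performs the k-fold out-of-fold prediction internally, or manually with cross_val_predict to build the (m x M) meta-feature matrix:

```python
# A sketch (assumed, not from the article): stacking with out-of-fold meta-features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

level1 = [
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=42)),
]

# Option 1: let scikit-learn handle the k-fold sampling of meta-features (cv=5).
stack = StackingClassifier(estimators=level1, final_estimator=LogisticRegression(), cv=5)
stack.fit(X, y)

# Option 2: build the (m x M) meta-feature matrix by hand from out-of-fold predictions.
meta_features = np.column_stack([
    cross_val_predict(model, X, y, cv=5) for _, model in level1
])
level2 = LogisticRegression().fit(meta_features, y)
print(meta_features.shape)  # (500, 2): m rows, M level-1 models
```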

In Blending the overall approach remains the same, again with a difference in how the meta-features, the training data for the level-2 model, are sampled. Instead of k-fold out-of-fold predictions, we use predictions made on a single validation set, and these are supplied as the meta-features for the level-2 model. Because this yields fewer meta-features, the level-2 model generally does not perform as well as in Stacking. Also, the fixed single validation split eats into the data available for training the level-1 models. Hence Stacking is usually preferred over Blending. For a better understanding, refer to the image below, which shows a single split of the main training data into a training set and a validation (holdout) set.

Single Validation split in Blending
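A manual sketch of Blending (assumed, not from the article) could look like the following, with the level-1 models trained on one split and the level-2 model trained on their predictions over a single holdout set:

```python
# A sketch (assumed, not from the article): blending with a single validation split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=42)

level1 = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=5, random_state=42),
]

# Train level-1 models on the training split only.
for model in level1:
    model.fit(X_train, y_train)

# Their predictions on the single holdout set become the meta-features for level 2.
meta_features = np.column_stack([model.predict(X_hold) for model in level1])
level2 = LogisticRegression().fit(meta_features, y_hold)

# At prediction time, level-1 predictions on new data feed the level-2 model.
new_meta = np.column_stack([model.predict(X_hold[:5]) for model in level1])
print(level2.predict(new_meta))
```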

Bagging

Bagging is a popular and powerful Ensemble technique used to reduce variance, so it is preferably used with high-variance models. It can be used with any model, but it makes the most sense with high-variance ones. Random Forests, one of the most popular and powerful Ensemble models, is based on the Bagging technique.

Bagging is short for “Bootstrap Aggregation”: different bootstrap samples of the training data are created, that is, the training data is split into different subsets sampled with replacement, so the subsamples may or may not share some data points. On each bootstrap sample, a separate model of the same model class with the same hyperparameter combination is fitted; the only thing that differs between the models is the bootstrap sample they are trained on. Once all the models are fitted, the final prediction is the majority vote for a classification task or the average of the predictions for a regression task. Hence the name “Bootstrap Aggregation”.

As already discussed among the methods for bringing in Diversity, here the diversity comes from training the models on different subsets of the training data, which leads to different predictions from each model. For a better understanding, let’s look at a pictorial representation of the process.

Bagging

Here C represents different models of the same model class, with the same hyperparameter combination, trained on different subsets of the training data.
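A minimal sketch of this (assumed, not from the article) is scikit-learn’s BaggingClassifier with decision trees, where n_estimators bootstrap samples are drawn and the models can be fitted in parallel across cores; the estimator parameter name assumes a recent scikit-learn version (older releases call it base_estimator):

```python
# A sketch (assumed, not from the article): bagging decision trees on bootstrap samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # same model class and hyperparameters for every member
    n_estimators=100,                    # number of bootstrap samples / models
    bootstrap=True,                      # sample the training data with replacement
    n_jobs=-1,                           # members are independent, so training parallelizes
    random_state=42,
)
print(cross_val_score(bagging, X, y, cv=5).mean())
```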

Having understood the process, let’s look at its advantages and disadvantages.

The advantages are as follows:

  1. It is best suited to reducing the high variance of models like Decision Trees, SVMs, neural networks, etc.
  2. The process is parallelizable: the models can be built on multiple cores instead of one after another. Since each model is trained independently on its own bootstrap sample and the predictions are independent of each other, the process parallelizes well.
  3. Since the building process is parallelizable, the overall training is fast.

Now let’s look at the disadvantages:

  1. Loss of interpretability: since the Ensemble is a collection of different individual models, we cannot draw inferences from it as a whole. Suppose we build an Ensemble of Decision Trees using the Bagging technique. We can inspect any single tree, but the interpretation drawn from one tree cannot be generalized to the entire Ensemble; if it could, all the models would be behaving the same and we would have lost Diversity. Although we cannot make such inferences from the Ensemble, we can still obtain feature importances, which tell us which features contributed most to predicting the target variable (see the sketch after this list).
  2. Feature Dominance: if one feature dominates, it will appear at the top splits of every Decision Tree, so all the models end up behaving the same and, again, Diversity between the models is lost.
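One way to recover feature importances from a bagged ensemble of trees (a sketch under the assumption that the fitted model exposes its members via estimators_, as scikit-learn’s BaggingClassifier does; not from the original article) is to average the importances of the individual trees:

```python
# A sketch (assumed, not from the article): averaging feature importances across bagged trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(), n_estimators=100, random_state=42
).fit(X, y)

# Each fitted tree exposes feature_importances_; averaging them gives an ensemble-level view.
importances = np.mean([tree.feature_importances_ for tree in bagging.estimators_], axis=0)
for idx in np.argsort(importances)[::-1][:5]:
    print(f"feature {idx}: importance {importances[idx]:.3f}")
```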

Boosting

Boosting is another powerful Ensemble technique. Some of the well-known boosting algorithms are:

  1. AdaBoost
  2. Gradient Boosting
  3. Extreme Gradient Boosting aka XG Boosting and many more.

Coupled with these Boosting methods, models can perform at their best potential; Boosting combined with classical machine learning methods has long featured at the top of Kaggle leaderboards. We will not go into each boosting technique here, as they deserve a separate discussion involving the full math. If you are interested specifically in XG Boosting, do check out my article on it: https://medium.com/@varunimmidi/xg-boosting-algorithm-cf99fd8f7468?source=friends_link&sk=67da9904a586270d48662c669c751cf7 which explains everything involved in Boosting and XG Boosting in detail.

In this article, let’s limit our discussion to a high-level view of Boosting.

In Boosting we combine different weak learners in sequence, so that the errors made by the previous model are corrected by the subsequent model. A weak learner is a model that can only pick up the dominant patterns in the data, with a probability of correct prediction just above 0.5. The performance of each weak learner is modest, but combining them sequentially produces a strong learner. Weak learners are chosen deliberately: combining them does not make the Ensemble overfit, each one needs relatively little data to train, and training each is fast. For a better understanding of the process, have a look at the image below, in which successive models are built so that each one corrects the errors made by the previous model.

Boosting
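As a brief sketch (assuming scikit-learn; not from the original article), AdaBoost combining shallow decision-tree “stumps” as weak learners could look like this, with each successive stump putting more weight on the examples the previous ones got wrong; the estimator parameter name assumes a recent scikit-learn version:

```python
# A sketch (assumed, not from the article): AdaBoost with decision stumps as weak learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a depth-1 "stump" is a classic weak learner
    n_estimators=200,    # weak learners are added one after another, each correcting the last
    learning_rate=0.5,   # shrinks each learner's contribution
    random_state=42,
)
print(cross_val_score(boosting, X, y, cv=5).mean())
```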

Finally, we have come to the end of this article!

References

  1. upGrad Learning Platform.
  2. In article image source: upGrad Learning Platform.
