Ensemble Methods in Machine Learning

Yagnik Pandya
Published in Analytics Vidhya · Feb 14, 2021

In this article, we will get familiar with the different ensemble techniques and some of the common algorithms used in each.

An ensemble method uses an algorithm or model multiple times on multiple samples from the training dataset. “The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm to improve generalizability/robustness over a single estimator.”

For example, suppose you were given a task ‘A’ along with a team of 5 members. You would have several options for getting it done.
Option 1: You try to do it alone, find the easiest way to do it, and end up not producing good results (underfitting).
Option 2: You do that one task perfectly with full dedication, but when a similar new task ‘B’ is assigned, you are clueless and have to train yourself all over again (overfitting).
Option 3: You divide the task among your team members and then combine their work, producing good results. When a similar task is assigned again, the same members can do it even more efficiently.

From the above three options, you are smart enough to go with Option 3, right? What if we could train models in a similar way? Ensemble methods allow us to do exactly that.

Ensemble learning is one way to tackle the bias-variance trade-off. A good model should maintain a balance between these two types of errors; this is known as the trade-off management of bias and variance errors.

Ensemble methods can mainly be divided into the following groups:
1. Parallel ensemble methods
2. Sequential ensemble methods
3. Stacking

1. Parallel ensemble methods:

In parallel ensemble methods, the base learners are generated in parallel: several estimators or models are built independently and their predictions are averaged. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
If the individual models’ errors were completely independent, averaging n models would reduce the variance of the combined prediction to 1/n of a single model’s variance (in practice the models are correlated, so the reduction is smaller).
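As a quick illustration of the 1/n variance argument, here is a minimal sketch using NumPy with simulated, independent predictors (the numbers are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
true_value = 3.0
n_models, n_trials = 10, 10_000

# Each "model" predicts the true value plus independent noise (std = 1).
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

single_var = predictions[:, 0].var()           # variance of one model
ensemble_var = predictions.mean(axis=1).var()  # variance of the average of 10 models

print(f"single model variance:    {single_var:.3f}")   # ~1.0
print(f"ensemble (n=10) variance: {ensemble_var:.3f}")  # ~0.1, i.e. roughly 1/n
```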

One of the most common parallel ensemble techniques is Bootstrap Aggregating (Bagging).

Bootstrap Aggregating (Bagging):

In this method, models are generated using the same algorithm on random sub-samples of the dataset drawn with the bootstrap sampling method, in order to reduce variance. In bootstrap sampling (sampling with replacement), some original examples appear more than once in a sample and some are not present in the sample at all.

The bagging technique is useful for both regression and classification.

In regression, it takes the mean of the individual models’ predictions, and in classification, it takes a majority vote across the models.

[Figure: Voting in classification]
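Here is a minimal bagging sketch using scikit-learn’s BaggingClassifier (the dataset and hyperparameter values are illustrative assumptions, not from the original article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# BaggingClassifier trains each base learner (a decision tree by default)
# on a bootstrap sample and combines them by majority vote.
bag = BaggingClassifier(n_estimators=50, random_state=42)
bag.fit(X_train, y_train)

print("bagging accuracy:", accuracy_score(y_test, bag.predict(X_test)))
```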

Random forest is one of the most popular bagging algorithms.

Random Forest (Bagging Algorithm):
In a random forest, a decision tree is built on each bootstrap sample, and these trees collectively form a forest, hence the name Random Forest. Each tree also considers only a random subset of the features at each split, which further decorrelates the trees. The rest works much like a simple decision tree.

Main hyperparameters to tune while using Random Forest:
n_estimators: number of trees in the forest.
max_features: number of features considered when looking for the best split.

Other hyperparameters are:
max_depth: maximum depth of a tree (default = None).
min_samples_split: minimum number of samples required to split a node (default = 2).
criterion: function to measure the quality of a split (default = ‘gini’ for classification).
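A minimal Random Forest sketch in scikit-learn using the hyperparameters above (the dataset and the specific values are illustrative assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # features considered at each split
    max_depth=None,        # grow trees until leaves are pure
    min_samples_split=2,   # minimum samples required to split a node
    criterion="gini",      # split-quality measure
    random_state=42,
)
rf.fit(X_train, y_train)

print("random forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))
```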

However, bagging mainly reduces variance: if the base learners are themselves biased, or the proper sampling procedure is ignored, the resulting model can still carry a lot of bias. It can also be computationally expensive.

2. Sequential ensemble methods (Boosting Algorithms):

In sequential ensemble methods, the base learners are generated sequentially. The overall performance can be boosted by giving previously mislabeled examples a higher weight; overall, this reduces the bias of the combined estimator.
In simple words, boosting refers to a family of algorithms that convert weak learners into strong learners. The main principle of boosting is to fit a sequence of weak learners/models, each trained on a different version of the data, with more weight given to the examples that were misclassified in earlier rounds.
Here, each model depends on the previous model.

In boosting, a model is built at each stage that focuses on the data misclassified by the previous stages. Finally, the stage-wise classifiers are aggregated (typically as a weighted combination) to form a strong learner.

Algorithms:

1. Gradient Boosted Decision Trees (GBDT)

It applies the gradient descent idea to decision trees to reduce the loss at each stage: the next tree is built on the errors (the negative gradient of the loss) of the model built so far. It generalizes boosting by allowing optimization of an arbitrary differentiable loss function.
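A minimal GBDT sketch using scikit-learn’s GradientBoostingClassifier (dataset and hyperparameter values are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fit to the negative gradient of the loss (the "errors")
# of the ensemble built so far; learning_rate shrinks each tree's contribution.
gbdt = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, max_depth=3)
gbdt.fit(X_train, y_train)

print("GBDT accuracy:", accuracy_score(y_test, gbdt.predict(X_test)))
```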

2. XGBoost

XGBoost works similarly to GBDT, but it has some additional features (below) that make it more efficient.
i) Regularization: the standard GBM implementation has no built-in regularization, whereas XGBoost does, which helps reduce overfitting. In fact, XGBoost is also known as a ‘regularized boosting’ technique.
ii) Parallel Processing: XGBoost parallelizes the split-finding within each tree, making it faster than GBM.
iii) High Flexibility: XGBoost allows custom evaluation criteria and objective functions.
iv) Handling Missing Values: it can handle missing values on its own.
v) Tree Pruning: XGBoost prunes trees on its own, which reduces variance.
vi) Built-in Cross-Validation: XGBoost allows the user to run cross-validation at each iteration of the boosting process.
vii) Continue on Existing Model: the user can continue training an XGBoost model from the last iteration of a previous run.
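A minimal XGBoost sketch (assuming the xgboost package is installed; the hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

xgb = XGBClassifier(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=4,
    reg_lambda=1.0,  # L2 regularization on leaf weights
    n_jobs=-1,       # parallel split finding
)
xgb.fit(X_train, y_train)

print("XGBoost accuracy:", accuracy_score(y_test, xgb.predict(X_test)))
```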

3. AdaBoost

Here, each weak learner is usually a decision stump (a tree with just a single split and two terminal nodes) used to classify the observations.
After each classifier is trained, the classifier’s weight is calculated based on its accuracy: a more accurate classifier gets a larger weight in the final vote, and vice versa.
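A minimal AdaBoost sketch with scikit-learn (by default, AdaBoostClassifier uses depth-1 decision trees, i.e. stumps, as weak learners; the dataset and values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The default base learner is a decision stump (a max_depth=1 tree);
# each stump's vote is weighted by its accuracy.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)
ada.fit(X_train, y_train)

print("AdaBoost accuracy:", accuracy_score(y_test, ada.predict(X_test)))
```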

4. LightGBM

It splits the tree leaf-wise, choosing the leaf with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise; by growing the best leaf first, it tries to reach pure leaves as soon as possible. This makes LightGBM faster than other boosting algorithms, hence the word ‘Light’.
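A minimal LightGBM sketch (assuming the lightgbm package is installed; the hyperparameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from lightgbm import LGBMClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# num_leaves controls the leaf-wise tree growth that gives LightGBM its speed.
lgbm = LGBMClassifier(n_estimators=300, learning_rate=0.1, num_leaves=31)
lgbm.fit(X_train, y_train)

print("LightGBM accuracy:", accuracy_score(y_test, lgbm.predict(X_test)))
```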

5. CatBoost

The name “CatBoost” comes from two words, “Category” and “Boosting”. CatBoost converts categorical values into numbers using various statistics on combinations of categorical features and on combinations of categorical and numerical features. Hence, we need not convert categorical variables to numerical ones ourselves, as we do for other models.
It grows a balanced tree and can handle missing values, as XGBoost does.
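A minimal CatBoost sketch on a toy dataset with a raw (string) categorical column (assuming the catboost package is installed; the data is made up for illustration):

```python
import pandas as pd
from catboost import CatBoostClassifier

# Toy data: one raw categorical feature and one numeric feature.
df = pd.DataFrame({
    "city": ["delhi", "mumbai", "delhi", "pune", "mumbai", "pune", "delhi", "pune"],
    "income": [40, 85, 52, 60, 90, 33, 47, 72],
    "bought": [0, 1, 0, 1, 1, 0, 0, 1],
})
X, y = df[["city", "income"]], df["bought"]

# cat_features tells CatBoost which columns to encode internally,
# so no manual one-hot or label encoding is needed.
model = CatBoostClassifier(iterations=100, depth=4, verbose=False)
model.fit(X, y, cat_features=[0])  # column index of the categorical feature

print(model.predict(pd.DataFrame({"city": ["mumbai"], "income": [80]})))
```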

3. Stacking

Stacking is an ensemble learning technique that combines multiple classification or regression models via a meta-classifier or a meta-regressor. The base-level models are trained on the complete training set; the meta-model is then trained using the outputs of the base-level models as features.
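A minimal stacking sketch with scikit-learn’s StackingClassifier (the base models and meta-learner here are illustrative choices; scikit-learn builds the meta-features from cross-validated predictions of the base models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gbdt", GradientBoostingClassifier(n_estimators=100, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # meta-classifier trained on base-model outputs
)
stack.fit(X_train, y_train)

print("stacking accuracy:", accuracy_score(y_test, stack.predict(X_test)))
```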

I hope this article was helpful in building a basic understanding of the different ensemble methods and the common algorithms used. Please leave your queries, if any, below.
