A brief idea on Ensemble Models-I (Random Forest)

A comprehensive guide to the Bagging ensemble technique in machine learning.

Bhanu
Analytics Vidhya
4 min read · Aug 17, 2020


Ropes indicate the base learners of the model, and the statue indicates the performance of the model.

Why an Ensemble model?

You might have gone through various machine learning models like KNN, Logistic Regression, Decision Trees, etc. (No worries if you haven't gone through any of them; we shall discuss them in upcoming blogs. Either way, I have kept this explanation of Ensemble models simple.) You might have wondered: is there a way to combine all these models and build a new one? Yes, that is possible, and this idea of combining models to build a more powerful one is the concept of an Ensemble model.

We can relate the Ensemble model to the CSK team in the IPL (Indian Premier League), with the players as base learners: the best players from different countries are grouped together to build a powerful team. (This is just to give an idea of how powerful an Ensemble can be.)

There are four Ensemble strategies:

  1. Bagging (Bootstrap Aggregation)
  2. Boosting
  3. Stacking
  4. Cascading

Bagging:

The word Bagging comes from two words: Bootstrapping (row sampling with replacement) and Aggregation.

Bootstrapping is sampling the data points with replacement. If the given dataset has 'n' points, we create 'k' samples, each of size m (< n), from it.
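For instance, here is a minimal sketch of bootstrapping with NumPy (the toy dataset and the values of k and m below are illustrative assumptions, not values from this blog):

    import numpy as np

    rng = np.random.default_rng(42)
    X = np.arange(10)      # a toy dataset with n = 10 points

    k, m = 3, 7            # k bootstrap samples, each of size m < n
    samples = [rng.choice(X, size=m, replace=True) for _ in range(k)]
    # replace=True means the same point may appear more than once in a sample
    print(samples)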

Typically, the aggregation step means applying a majority vote for a classification problem. For regression, it means computing the mean or median of the predicted values.

The core idea is: given n data points, we sample them into k bootstrap samples m1, m2, m3, …, mk. Each sample is given to its own model, called a Base learner (each base learner sees a different sample of the data, because only a subset of the data is given to it), and all these models are combined at the Aggregation stage. If it is a classification task, we simply take a majority vote: say there are 10 base models and 6 of them predict class label 1, then we conclude the class label is 1. If it is a regression problem, we take the mean of the predictions.
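To make the pipeline concrete, here is a minimal from-scratch sketch of Bagging, using scikit-learn decision trees as base learners on a synthetic classification dataset (the dataset, the choice of k = 10 base learners, and the 80% bootstrap sample size are illustrative assumptions):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    k = 10                                   # number of base learners
    sample_size = int(0.8 * len(X_train))    # each bootstrap sample has size < n
    base_learners = []
    for _ in range(k):
        # Bootstrapping: row sampling with replacement
        idx = rng.choice(len(X_train), size=sample_size, replace=True)
        base_learners.append(DecisionTreeClassifier().fit(X_train[idx], y_train[idx]))

    # Aggregation: majority vote over the k predictions
    # (for a regression problem we would average the predictions instead)
    all_preds = np.array([tree.predict(X_test) for tree in base_learners])
    majority_vote = (all_preds.mean(axis=0) >= 0.5).astype(int)
    print("Bagged accuracy:", (majority_vote == y_test).mean())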

Since we are using aggregation with bootstrapping, even if some data points change, only a subset of the samples (and hence only some base learners) is impacted, so the overall result does not change much. With this, we are able to build a reduced-variance model. (Variance is nothing but how much a model changes with changes in the training dataset; if the model changes a lot, it is a high-variance model.)

Now that you have the intuition behind Bagging, let's discuss Random Forest.

Random Forest (RF):

Random Forest is one of the most popular bagging techniques.

The name Random Forest comes about because a forest is nothing but a group of trees; hence the base learners we use here are Decision Trees (with low bias and high variance). And since we are doing random sampling (bootstrapping) of the data, we get the "Random" part. Combined, it is Random Forest.

Therefore, Random Forest takes Decision Trees as base models, applies Bagging (row sampling with replacement) on top of them, and also does column sampling (feature sampling).

RF: Decision Trees + Bagging (row sampling with replacement) + Column Sampling (feature sampling)

With row sampling and feature sampling, the models are trained on different datasets, i.e., if there are any changes in the data, only a few models are affected because of the sampling, resulting in good overall performance.

Each base learner is grown to a reasonable depth so that it is trained fully on its data, which results in high-variance models that overfit (perform well on training data but fail on test data). With aggregation, we then reduce this variance. As the number of base models (k) increases, variance reduces (less overfitting), and vice-versa.

Hence k, the number of base models, can be termed a hyperparameter.
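As a rough sketch of how one might tune k (the dataset and candidate values below are illustrative assumptions; in scikit-learn, k corresponds to the n_estimators parameter):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, random_state=0)

    # k = number of base models, exposed as n_estimators in scikit-learn
    for k in [5, 20, 100]:
        rf = RandomForestClassifier(n_estimators=k, random_state=0)
        score = cross_val_score(rf, X, y, cv=5).mean()
        print(f"n_estimators={k}: mean CV accuracy = {score:.3f}")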

NOTE: Column sampling (feature sampling) is done without replacement, since having the same feature twice introduces collinear features, which degrades performance.

Code implementation: sklearn.ensemble.RandomForestClassifier()
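A minimal usage sketch (the synthetic dataset and parameter values are illustrative; see the scikit-learn documentation in the references for the full API):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rf = RandomForestClassifier(
        n_estimators=100,      # k base decision trees
        max_features="sqrt",   # column (feature) sampling at each split
        bootstrap=True,        # row sampling with replacement
        n_jobs=-1,             # trees are independent, so train them in parallel
        random_state=0,
    )
    rf.fit(X_train, y_train)
    print("Test accuracy:", rf.score(X_test, y_test))

Note that scikit-learn applies the column sampling per split (via max_features) rather than once per tree, but it provides the same feature-level randomness described above.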

Random Forest is trivially parallelizable (each tree can be trained on a separate core) since each decision tree is independent of the others.

Therefore, with Bagging we reduce variance, using high-variance, low-bias Decision Trees (trees with more depth) as base models. Since this approach reduces variance, you might wonder: is there an Ensemble model that reduces bias? Yes, Boosting is the strategy we use to reduce bias, which we will discuss in upcoming blogs. So, there is always a trade-off between bias and variance.

This is all about one of the Ensemble models. Please excuse any mistakes, and feel free to provide your valuable feedback so that I can improve. Thank you :)

Please refer to this link for the next blog on Ensemble models, which covers GBDT.

References:

  1. https://thecommonmanspeaks.com/statue-used-massage-egos/baahubali-statue/
  2. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn-ensemble-randomforestclassifier
