Bagging, Pasting, Random Subspaces, Random Patches, Random Forest, Extra-Trees, Out of Bag, Feature Importance

Ensemble Techniques Part 1: Bagging & Pasting

Theoretical intuition behind ensemble techniques, with implementation in scikit-learn

Deeksha Singh
Geek Culture


Photo by charlesdeluvio on Unsplash

“The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator.” — scikit-learn documentation

Suppose we train a single ML model on our data; on its own, it may give only modest accuracy. But what if we aggregate the predictions of a group of models (regressors or classifiers)? We will often get better predictions than the best individual predictor. Grouping individual predictors (weak learners) with low accuracy produces an ensemble (a strong learner) with higher accuracy and reduced variance, provided there are enough weak learners and they are sufficiently diverse (weakly correlated).

Source: O’Reilly’s Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow

The most popular Ensemble methods are:

  1. Bagging (Bootstrap Aggregation)
  2. Boosting Technique
  3. Stacking

In this article, we will look only at the bagging & pasting methods, along with their special cases, the Random Forest algorithm and the Extra-Trees method. But before that, let's see how the predictions of the individual predictors are aggregated. For regression it is usually the average of the predictions of the individual, independent estimators, while for classification it can be either hard voting or soft voting.

Hard Voting Classifiers: This is a majority-vote classifier for predicting class labels. The idea is that the predictions of all the individual classifiers are collected, and the class that receives the most votes becomes the final prediction. In the figure referenced below, the majority of classifiers predict class 1 and one classifier predicts class 2, so by majority vote class 1 is the final predicted output.

Source: O’Reilly’s Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow

Soft Voting Classifiers: This averages the predicted class probabilities (a soft vote). It predicts the class with the highest class probability, averaged over all the individual classifiers. In the example below, the individual models predict probabilities for class 1 and class 2; most of the models assign their highest probability to class 1, and the averaged probability comes out to 0.65 for class 1, so class 1 is predicted. A weight can also be assigned to each classifier via the weights parameter; weights are usually chosen in proportion to the accuracy of the individual classifiers.

Credit: Author

Soft voting often achieves higher performance than hard voting because it gives more weight to highly confident votes.
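
As a concrete illustration, here is a minimal sketch of hard vs. soft voting using scikit-learn's VotingClassifier. The make_moons toy dataset and the particular choice of base classifiers are assumptions made just for this example:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative toy dataset and split
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

estimators = [
    ("lr", LogisticRegression()),
    ("dt", DecisionTreeClassifier()),
    ("svc", SVC(probability=True)),  # probability=True is required for soft voting
]

# Train the same set of diverse classifiers with hard and soft voting
for voting in ("hard", "soft"):
    clf = VotingClassifier(estimators=estimators, voting=voting)
    clf.fit(X_train, y_train)
    print(voting, accuracy_score(y_test, clf.predict(X_test)))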

Note: Ensemble techniques work best when the individual models are as independent from one another as possible, making uncorrelated errors. That is hard to achieve if they are all trained on the same data in the same way, since they then make the same types of errors, which reduces the ensemble's accuracy. One good way to get diverse classifiers is to train them using different algorithms, which leads to different types of errors and improves the ensemble's accuracy.

Now, moving on, let's look into the bagging technique, Random Forests and the Extra-Trees method.

1. Bagging: In this approach, the same training algorithm is used for every predictor in the ensemble, but each predictor is trained on a different random subset of the training set. The minor difference between bagging and pasting is that in bagging the training instances are sampled with replacement (bootstrap=True), while in pasting they are sampled without replacement (bootstrap=False). So only bagging allows a training instance to be sampled several times for the same predictor.

Source: O’Reilly’s Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow

Generally, each individual predictor has a higher bias than if it were trained on the full training set, but aggregation nets out to an ensemble with a similar bias and a lower variance than a single predictor trained on the same data. BaggingClassifier automatically performs soft voting when the base classifier can estimate class probabilities. The predictors can be trained in parallel, and predictions can also be computed in parallel, across different CPU cores (the n_jobs parameter in scikit-learn sets the number of CPU cores to use for training and prediction; -1 means use all available cores).
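
Here is a minimal sketch contrasting bagging and pasting; only the bootstrap flag changes between the two (X_train and y_train are assumed to come from whatever classification dataset you are working with):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging: each tree sees 100 instances sampled WITH replacement
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)

# Pasting: each tree sees 100 instances sampled WITHOUT replacement
paste_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=False, n_jobs=-1)

bag_clf.fit(X_train, y_train)    # X_train, y_train assumed to exist
paste_clf.fit(X_train, y_train)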

Just as we sample instances, we can also sample features, so that each predictor is trained on a random subset of the input features. BaggingClassifier provides this through the bootstrap_features and max_features parameters. The two resulting techniques, with a configuration sketch after them, are:

Random Subspaces: Keeping all the training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features set to a value smaller than 1.0) is called the Random Subspaces method.

Random Patches: Sampling both training instances and features is called the Random Patches method.
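
A rough configuration sketch of both setups with BaggingClassifier; the concrete values such as max_features=0.5 are illustrative assumptions, not recommendations:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Random Subspaces: keep every training instance, sample only the features
subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,           # all training instances
    bootstrap_features=True, max_features=0.5,  # random subset of features
    n_jobs=-1)

# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=True, max_samples=0.7,            # illustrative fraction of instances
    bootstrap_features=True, max_features=0.5,
    n_jobs=-1)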

Out-of-Bag Evaluation (OOB): As we have seen, in bagging the training instances are sampled with replacement. So there is a chance that some instances get sampled many times while others are not sampled at all. Mathematically, about 63% of the training instances are sampled on average for each predictor, while the remaining ~37% are never sampled; these are called out-of-bag (oob) instances. Since a predictor never sees its oob instances during training, its performance can be evaluated on them without a separate validation set. The following code checks the evaluation score using 500 decision tree predictors, each trained on 100 sampled training instances, with all cores used for parallel processing.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 500 trees, each fit on 100 instances sampled with replacement (bootstrap=True);
# oob_score=True evaluates the ensemble on the out-of-bag instances after training
bagging_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=500,
                                max_samples=100, bootstrap=True,
                                n_jobs=-1, oob_score=True)
bagging_clf.fit(X_train, y_train)
y_pred = bagging_clf.predict(X_test)
bagging_clf.oob_score_
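
As a quick sanity check, the OOB estimate can be compared with the accuracy on a held-out test set; the two usually land close to each other. This sketch assumes X_test and y_test come from the same split as above:

from sklearn.metrics import accuracy_score

# OOB score is computed from training data the trees never saw;
# test accuracy uses the separate held-out set
print("OOB score:", bagging_clf.oob_score_)
print("Test accuracy:", accuracy_score(y_test, y_pred))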

Random Forests Algorithm

A Random Forest is an ensemble of Decision Trees trained via the bagging technique, typically with max_samples set to the size of the training set. The main goal of the forest estimator is to reduce variance, because an individual tree tends to overfit and exhibits high variance. The algorithm introduces extra randomness when growing the trees: when splitting a node, it searches for the best feature among a random subset of features instead of among all features. This leads to greater tree diversity, which trades a slightly higher bias for a lower variance, generally yielding an overall better model.

Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, we can directly use RandomForestClassifier from sklearn.ensemble.
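
A rough equivalent sketch, reusing the X_train/y_train from earlier; the hyperparameter values here are illustrative assumptions:

from sklearn.ensemble import RandomForestClassifier

# Roughly equivalent to a BaggingClassifier over decision trees,
# but with feature sub-sampling at every split
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)
y_pred_rf = rnd_clf.predict(X_test)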

Note: Random Forests also help with feature selection by measuring feature importance: roughly, how much the tree nodes that use a given feature reduce impurity on average across all trees in the forest.
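
For example, after fitting a forest, these scores are exposed through the feature_importances_ attribute; the iris dataset below is used only as an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
rnd_clf.fit(iris["data"], iris["target"])

# Higher score => the feature reduces impurity more, on average, across the forest
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)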

Extremely Randomized Trees(Extra-Trees)

Here, extra randomness is added in the way splits are computed. Rather than searching for the best possible threshold for each feature (as regular Decision Trees do), random thresholds are drawn for each candidate feature, and the best of these randomly generated thresholds is picked as the splitting rule. This technique trades a small increase in bias for a further reduction in variance.
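
The API mirrors RandomForestClassifier; a minimal sketch with the same illustrative hyperparameters as before:

from sklearn.ensemble import ExtraTreesClassifier

# Same interface as RandomForestClassifier, but splits use random thresholds
extra_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
extra_clf.fit(X_train, y_train)
y_pred_extra = extra_clf.predict(X_test)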

So this brings us to the end of the bagging techniques. In the next few articles we will dive into boosting and its different variants.

If you find any mistake, please do comment. Also, DMs are open. Happy learning.

References:

  1. https://scikit-learn.org/stable/modules/ensemble.html
  2. O’Reilly Media, Inc. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.
  3. Krish Naik's Machine Learning playlist on YouTube.
