Bagging. Unraveled.

Sanju Prabhath Reddy
Data Science Group, IITR
4 min read · Mar 22, 2017

In our previous post on Decision Trees, we introduced the world of Tree-Based Modeling to you. Here, we extend that to explain a new concept known as Bagging or Bootstrap Aggregating.

It is as simple as putting things in a bag. Of course, with some technicalities.

What is Bagging?

  • Ensemble Modelling: Divide and conquer. A group of predictive models is combined to achieve better accuracy and model stability.

Ensemble basically refers to a group of items seen as a whole.

Bagging is an ensemble technique mainly used to reduce the variance of our predictions, by combining the results of multiple classifiers modelled on different sub-samples of the same data set.

As simple as it gets!

Intuitively, let’s say you want to carry 1000 kg of potatoes from Delhi to Mumbai. Obviously, you can’t carry them all in one go. You divide them into bags of 100 kg each, transport them, and then aggregate them after reaching. Same with data.

Main Steps

  • Creating multiple datasets: Sampling is done with replacement on the original data, and new datasets are formed.
  • Building multiple classifiers: On each of these smaller datasets, a classifier is built, usually the same classifier on all the datasets.
  • Combining Classifiers: The predictions of all the individual classifiers are now combined to give a better classifier, usually with much lower variance than before (see the sketch below).
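
To make these three steps concrete, here is a minimal sketch in Python. The use of scikit-learn decision trees and a synthetic dataset is my own assumption for illustration; in practice, sklearn.ensemble.BaggingClassifier automates this kind of loop for you.

```python
# A minimal bagging sketch: bootstrap samples, one tree per sample, majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.RandomState(0)

trees = []
for _ in range(25):
    # Step 1: create a new dataset by sampling with replacement
    idx = rng.randint(0, len(X), size=len(X))
    # Step 2: build the same kind of classifier on each sub-sample
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# Step 3: combine the classifiers by majority vote
votes = np.stack([t.predict(X) for t in trees]).astype(int)   # (n_trees, n_samples)
final = np.apply_along_axis(lambda c: np.bincount(c).argmax(), axis=0, arr=votes)
print("Bagged training accuracy:", (final == y).mean())
```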

Random Forests

Wikipedia says “A forest is a large area dominated by trees.”

Need I say more? A Random Forest (RF) is a collection of the Decision Trees we explained yesterday.

  • Used for both classification and regression.
  • A versatile model: it can be applied to almost any problem and produces fair results.

Working

  • Multiple trees are grown, unlike the single tree in a simple decision tree.
  • While predicting on a new object, we predict with each of these trees separately. The trees’ votes are then combined to give the final prediction (see the sketch below).
    (a) Classification: the class with the most votes.
    (b) Regression: the average output of all the trees.
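
A quick sketch of these two aggregation rules (scikit-learn and synthetic data are my assumptions here; note that scikit-learn’s classifier actually averages class probabilities rather than counting hard votes, though the idea is the same):

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# (a) Classification: each tree votes, the class with the most votes wins
Xc, yc = make_classification(n_samples=300, random_state=1)
clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(Xc, yc)
tree_votes = np.stack([t.predict(Xc) for t in clf.estimators_]).astype(int)
majority = np.apply_along_axis(lambda c: np.bincount(c).argmax(), axis=0, arr=tree_votes)
print("Vote accuracy on training data:", (majority == yc).mean())

# (b) Regression: the outputs of all trees are simply averaged
Xr, yr = make_regression(n_samples=300, random_state=1)
reg = RandomForestRegressor(n_estimators=50, random_state=1).fit(Xr, yr)
averaged = np.mean([t.predict(Xr) for t in reg.estimators_], axis=0)
print(np.allclose(averaged, reg.predict(Xr)))  # should print True
```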

Example

Let the total number of data points be N and the number of features be M.

  1. A sample of n of these N points is taken at random with replacement. This sample will be the training set for growing the tree.
  2. If there are M input variables, some m < M is specified such that at each node, m variables are selected at random out of the M. The best split on these m is used to split the node. m is held constant while we grow the forest.
  3. New/test data is predicted by aggregating the predictions of all the trees (see the sketch below).
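
To show how this recipe maps onto scikit-learn’s RandomForestClassifier (the data and parameter values below are illustrative assumptions, not from the original post):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)  # N = 1000, M = 20

forest = RandomForestClassifier(
    n_estimators=100,     # how many trees to grow
    bootstrap=True,       # step 1: each tree sees n points sampled with replacement
    max_features="sqrt",  # step 2: m = sqrt(M) features tried at each node, held constant
    random_state=0,
).fit(X, y)

print(forest.predict(X[:5]))  # step 3: aggregates the votes of all 100 trees
```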

Is it really that good?

After DTs, we have read about a forest. It should be better. Let’s find out!

Pros

  • Higher-dimensional data: at each node, only a subset of the features is considered. Less overfitting.
  • Since it predicts by aggregating the predictions of many smaller predictors, the variance decreases. Less overfitting.
  • Useful when a large portion of the data is missing.

Cons

  • Better with classification than with regression.
    Since it averages predictions (like a DT), it cannot predict beyond the range seen in the training data.
  • Black-box approach: many factors are random, making the model hard to interpret.
  • Slight increase in bias: usually more than compensated for by the decrease in variance, thus maintaining the trade-off.

Most Important Parameters

As is customary, without tweaking the main parameters it’s nearly impossible to achieve a decent accuracy. The important ones:

  • n_estimators: the number of trees in the model. The larger the better, but the longer it will take to compute (up to a limit).
  • max_features: the size of the random subset of features considered when splitting a node. The lower the number of features, the greater the reduction in variance, but the greater the increase in bias.
    Regression: max_features = n_features
    Classification: max_features = sqrt(n_features)
  • feature_importances_: the relative importance of each feature to the model. Features used near the top of a tree are relatively more important, as more data points depend on them (see the sketch below).
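
A quick sketch of these three in use (again with scikit-learn on a synthetic dataset, purely as an assumption for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # more trees: better, but slower
    max_features="sqrt",   # random subset of features tried at each split
    random_state=0,
).fit(X, y)

# Importances sum to 1; higher values mean the feature drove more influential splits
for i in np.argsort(forest.feature_importances_)[::-1]:
    print(f"feature {i}: {forest.feature_importances_[i]:.3f}")
```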

Other parameters can be understood here.

Implementation

The dataset used contains data from the English Premier League (EPL/BPL) for the 2010–11 to 2013–14 seasons. A Random Forest Classifier is applied to predict the outcome of each match. The code can be found here.

Footnotes

This series is making you intelligent. Believe me. Just follow it sincerely!
Tomorrow, we bring you Boosting, an even more powerful technique.

Thanks for reading. :)
And, ❤ if this was a good read. Enjoy!

Editor: Akhil Gupta
