Bootstrapped Aggregation (Bagging):

Hema Anusha
4 min read · Jan 28, 2022


Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It decreases the variance and helps to avoid overfitting.

Description of the Technique:

Suppose we have a set D of d tuples. At each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample). A classifier model Mi is then learned on each training set Di. Each classifier Mi returns its class prediction for an unknown sample X, and the bagged classifier M* counts the votes and assigns to X the class with the most votes.
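
A minimal from-scratch sketch of this procedure in Python, assuming numeric feature arrays and integer class labels; a scikit-learn decision tree stands in for the base learner Mi, and names such as fit_bagged_models and bagged_predict are purely illustrative:

```python
# Minimal sketch of the bagging procedure described above.
# Assumes X, y are numpy arrays and y holds integer class labels.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_models(X, y, k=10, random_state=0):
    """Train k classifiers M_i, each on a bootstrap sample D_i drawn from D."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # d tuples sampled with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagged_predict(models, X_new):
    """The bagged classifier M*: count the votes and return the majority class per sample."""
    votes = np.stack([m.predict(X_new) for m in models])  # shape (k, n_samples)
    # assumes integer class labels; take the most common vote in each column
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```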

Implementation Steps of Bagging:

  • Step 1: Multiple subsets are created from the original data set, each with the same number of tuples, by selecting observations with replacement.
  • Step 2: A base model is created on each of these subsets.
  • Step 3: Each model is learned in parallel from its own training set, independently of the others.
  • Step 4: The final predictions are determined by combining the predictions from all the models (see the scikit-learn sketch after the illustration below).
An illustration for the concept of bootstrap aggregating (Bagging)
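
The same four steps are available off the shelf in scikit-learn's BaggingClassifier; the dataset and parameter values below are only illustrative:

```python
# The four steps above, using scikit-learn's BaggingClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # Step 2: base model (base_estimator in older scikit-learn)
    n_estimators=50,                     # Step 1: 50 bootstrap subsets
    bootstrap=True,                      # observations are sampled with replacement
    n_jobs=-1,                           # Step 3: the models are trained in parallel
    random_state=42,
).fit(X_train, y_train)                  # Step 4: predictions are combined internally

print("test accuracy:", bagging.score(X_test, y_test))
```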

Example of Bagging:

The Random Forest model uses bagging with decision trees, which are high-variance base models. It also performs random feature selection when growing each tree. Many such random trees together make a Random Forest.

Random Forest Algorithm:

Random Forest is a popular machine learning algorithm that belongs to the supervised learning family. It can be used for both Classification and Regression problems in ML.

“Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset.” Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, produces the final output.

A greater number of trees in the forest generally leads to higher accuracy and helps prevent overfitting.

Assumptions for Random Forest:

Since the random forest combines multiple trees to predict the class of the dataset, it is possible that some decision trees may predict the correct output, while others may not. But together, all the trees predict the correct output. Therefore, below are two assumptions for a better Random forest classifier:

  • There should be some actual signal in the feature variables of the dataset, so that the classifier can predict accurate results rather than guessed ones.
  • The predictions from each tree must have very low correlations.

Working of Random Forest Algorithm:

We can understand the working of the Random Forest algorithm with the help of the following steps:

Step 1: First, start with the selection of random samples from the given dataset.

Step 2: Next, the algorithm constructs a decision tree for every sample and obtains a prediction result from every decision tree.

Step 3: In this step, voting is performed for every predicted result.

Step 4: Finally, the most-voted prediction result is selected as the final prediction.

Example: Suppose there is a dataset that contains multiple fruit images, and this dataset is given to the Random Forest classifier. The dataset is divided into subsets and given to each decision tree. During the training phase, each decision tree produces a prediction result, and when a new data point arrives, the Random Forest classifier predicts the final decision based on the majority of those results.
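
A minimal sketch of these steps with scikit-learn's RandomForestClassifier, using the iris dataset as a stand-in for the fruit images; the parameter choices are illustrative:

```python
# The four steps above with scikit-learn's RandomForestClassifier.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # Steps 1-2: 100 bootstrap samples, one decision tree each
    max_features="sqrt",  # random feature selection when growing each tree
    random_state=0,
).fit(X_train, y_train)

# Steps 3-4: each tree votes on a new data point and the majority wins
print(forest.predict(X_test[:5]))
print("test accuracy:", forest.score(X_test, y_test))
```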

Bias and Variance in Random Forest:

Random forest is an ensemble bagging technique in which a number of decision trees are combined to give the result. The process is a combination of bootstrapping and aggregation. The main idea behind random forest is that many high-variance, low-bias trees combine to produce a low-bias, low-variance forest. Since the variance is distributed over different trees and each tree sees a different set of data, random forests in general do not overfit. And since they are made of low-bias trees, underfitting also does not happen.
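
A rough way to see this variance reduction empirically is to compare the spread of cross-validation scores for a single deep tree and for a forest; this is only a proxy for model variance, and the dataset below is just an example:

```python
# Rough empirical check: compare the spread of cross-validation scores
# for a single deep decision tree vs. a forest of such trees.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

tree_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
forest_scores = cross_val_score(
    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=10
)

# The forest's scores are typically both higher and less spread out (lower variance).
print("tree:   mean %.3f  std %.3f" % (tree_scores.mean(), tree_scores.std()))
print("forest: mean %.3f  std %.3f" % (forest_scores.mean(), forest_scores.std()))
```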

The complexity of Random Forest:

Training time complexity = O(n * log(n) * d * k)
n = number of training samples, d = number of features, k = number of decision trees
Note: When we have a large amount of data with a reasonable number of features, we can use multiple cores to parallelize the model and train the different decision trees at the same time.
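
In scikit-learn this parallelization is a single parameter; a small sketch on synthetic data:

```python
# Using multiple cores to train the k decision trees in parallel (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A synthetic "large data, reasonable features" example.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

# n_jobs=-1 uses all available cores; each core fits a share of the k trees.
forest = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0).fit(X, y)
```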

Run-time complexity = O(depth of tree * k)
Space complexity = O(depth of tree * k)
Note: At prediction time, Random Forest is comparatively fast, since each sample only has to traverse k trees of limited depth.

Extremely randomized trees:

In regular decision trees, for a real-valued feature we try out all candidate threshold values and keep the one that maximizes the information gain.

In extremely randomized trees, a few values from the sample are instead selected at random as candidate thresholds, and the best of these becomes the final threshold of the base model.
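
scikit-learn ships this variant as ExtraTreesClassifier (which, unlike Random Forest, does not bootstrap by default); a minimal sketch:

```python
# Extremely randomized trees in scikit-learn: split thresholds are drawn at
# random per candidate feature instead of exhaustively searched.
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

extra = ExtraTreesClassifier(n_estimators=100, random_state=0)
print("CV accuracy:", cross_val_score(extra, X, y, cv=5).mean())
```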

Advantages and Disadvantages of Random Forest:

Advantages:

  1. It reduces overfitting in decision trees and helps to improve accuracy.
  2. It is flexible enough to handle both classification and regression problems (see the regression sketch after this list).
  3. It works well with both categorical and continuous values.
  4. It can handle missing values present in the data.
  5. Normalisation of the data is not required, as it uses a rule-based approach.
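
For the regression side mentioned in point 2, a minimal sketch with RandomForestRegressor on synthetic data; all names and parameters are illustrative:

```python
# Point 2 above: the same ensemble idea applied to a regression problem.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# A synthetic regression problem stands in for real data here.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# For regression, the trees' predictions are averaged instead of voted on.
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
print("R^2 on the test split:", reg.score(X_test, y_test))
```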

Disadvantages:

  1. It requires a lot of computational power and resources, as it builds numerous trees and combines their outputs.
  2. It also requires a long training time, as it combines many decision trees to determine the class.

Reference:

  • GeeksforGeeks
  • JavaTpoint
  • Towards Data Science
  • Great Learning
