Random Forest

Understanding Random Forest

Gajendra
Jul 7, 2022

Random, chosen without method or conscious decision; Forest, a collection of trees: Random Forest is an ensemble learning method for classification and regression tasks, built from a multitude of decision trees.

Before we try to understand Random Forest, let’s understand ensemble learning.

Ensemble Learning

Ensemble learning is a machine learning technique that combines several base models in order to produce one optimal predictive model (powerful model).

In Random Forest, ensemble methods allow us to build a collection of decision trees, decide which features to use (which questions to ask) at each split, and make a final prediction based on the aggregated results of those sampled decision trees.

Ensemble methods are mainly used to balance the bias and variance of a model.
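
To make this concrete, here is a minimal, illustrative scikit-learn sketch (not from the article) that combines three base models into one ensemble by majority vote; the synthetic dataset and the choice of base models are assumptions for the example.

```python
# A minimal ensemble-learning sketch: several base models combined into one predictor.
# Assumes scikit-learn; the dataset and base models are illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Three different base models, combined by majority vote
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=3)),
        ("knn", KNeighborsClassifier()),
    ],
    voting="hard",
)
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))
```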

Bias and Variance

Bias

Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the relationship, which leads to high error on both training and test data.

Variance

Variance is the variability of the model’s prediction for a given data point; it tells us how spread out the predictions are. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data.

Ideally, the purpose of optimization is to achieve both low bias and low variance. The tension between the two is known as the Bias-Variance Tradeoff.

Overfitting is one of the most common problems with decision trees.

A model is said to have high variance if it changes a lot with changes in the training data.
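
As a rough illustration (assuming scikit-learn and a synthetic dataset), the sketch below trains the same fully grown decision tree on two different random subsets of the data and measures how often their test predictions disagree; a large disagreement is one symptom of high variance.

```python
# High-variance illustration: the same deep decision tree, trained on two slightly
# different samples of the data, can disagree on many test predictions.
# Dataset and sample sizes are arbitrary choices for this sketch.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train, X_test = X[:1500], y[:1500], X[1500:]

rng = np.random.default_rng(0)
preds = []
for _ in range(2):
    # Train on a different random subset of the training data each time
    idx = rng.choice(len(X_train), size=1000, replace=False)
    tree = DecisionTreeClassifier()  # fully grown tree: low bias, high variance
    tree.fit(X_train[idx], y_train[idx])
    preds.append(tree.predict(X_test))

disagreement = np.mean(preds[0] != preds[1])
print(f"Test points where the two trees disagree: {disagreement:.1%}")
```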

Types of Ensemble Methods

  1. BAGGing or Bootstrap AGGregation
  2. Boosting
  3. Stacking
  4. Cascading

BAGGing or Bootstrap AGGregation

Bagging is one of the most common techniques to prevent decision tree models from overfitting. Each model is built on a different subset of the data, so bagging reduces variance in the model without impacting bias.

Bagging is a parallel ensemble learning model where we train multiple models in parallel.

Bagging is a combination of bootstrap and aggregation.

Bootstrap is a sampling technique with replacement: we take a sample from the original dataset and then put the sample back into the original dataset before drawing the next sample.

Aggregation is a method of combining the outputs of the different models to generate one final, better output.

Let’s understand it with an example,

Original Dataset

Let’s not confuse the original dataset with the population dataset. The original dataset is a sample taken from the population.

Bootstrap

Bootstrapping is applied automatically by ensemble models. It takes the original dataset and creates a training set by randomly selecting rows from the original dataset. The rows not selected for training are used for validation as the test dataset.

Some of the rows in the training set are repeated. This is expected with bootstrapping and is considered normal.

Also, the training set remains the same size as the original dataset, while the test set size can vary, since only the rows that were never selected for the training set go into the test dataset.
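
A tiny sketch of this idea, assuming numpy and using made-up row indices, is shown below: rows are drawn with replacement, some repeat, and the rows that are never drawn can serve as the test set.

```python
# Bootstrap sketch: draw rows with replacement; the rows never drawn are "out-of-bag".
# The ten row indices are made up for illustration.
import numpy as np

rng = np.random.default_rng(7)
original = np.arange(10)                                        # pretend row indices 0..9
boot = rng.choice(original, size=len(original), replace=True)   # same size, with replacement
out_of_bag = np.setdiff1d(original, boot)                       # never drawn -> test rows

print("Bootstrap sample:", np.sort(boot))    # some indices repeat
print("Out-of-bag rows :", out_of_bag)       # size varies from sample to sample
```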

Aggregation

Models are trained on the training dataset and validated against the test dataset. The outputs are then aggregated:

  • For classification tasks, the output of the random forest is the class selected by most trees.
  • For regression tasks, the mean or average prediction of the individual trees is returned.
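
As a hedged, minimal sketch of bagging in practice, scikit-learn’s BaggingClassifier trains each tree on a bootstrap sample and aggregates the predictions by majority vote; the dataset and settings below are illustrative.

```python
# Bagging sketch with scikit-learn's BaggingClassifier (default base model: a decision tree).
# Each of the n_estimators trees sees its own bootstrap sample; predictions are aggregated.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)

bag = BaggingClassifier(
    n_estimators=50,    # number of bootstrapped trees
    oob_score=True,     # score each tree on the rows it never saw
    random_state=1,
)
bag.fit(X, y)
print("Out-of-bag accuracy:", bag.oob_score_)
```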

Random Forest

Note: If you are not familiar with decision trees, please check out CART.

Random Forest is an application of ensemble learning: it is a collection of a multitude of decision trees. Decision trees are very sensitive to even small changes in the data. Random Forest counteracts this by letting a whole bunch of decision trees work together to produce a better and more robust prediction.

Bagging only considers row sampling with replacement; Random Forest also adds column, or feature, bagging.

In general, the prediction of a Random Forest built from B trees T_1, …, T_B can be expressed as the aggregate of the individual trees: for regression, f(x) = (1/B) Σ T_b(x), the average of the tree predictions; for classification, the majority vote among T_1(x), …, T_B(x).

Let’s understand it with an example,

Original Dataset

Similar to the bagging example, we have a dataset here, but it includes some additional features instead of the single feature used in the bagging example.

Bootstrap

After bootstrapping we get the training and test datasets below. In each bootstrap, a random subset of the features is chosen as part of feature selection, along with row sampling as part of bagging:

  • Bootstrap 1: three of the four features (1, 2 and 3) are randomly chosen.
  • Bootstrap 2: three of the four features (1, 2 and 4) are randomly chosen.
  • Bootstrap 3: two of the four features (1 and 4) are randomly chosen.

This is how Random Forest applies feature bagging along with row bagging.
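
A small sketch of this combined sampling, assuming numpy and the four-feature setup from the example above, might look like the following; note that implementations such as scikit-learn re-draw the feature subset at every split rather than once per bootstrap.

```python
# Row bagging plus feature bagging: each bootstrap gets its own random subset of rows
# (with replacement) and its own random subset of features. Sizes follow the example above.
import numpy as np

rng = np.random.default_rng(3)
n_rows, n_features, m = 7, 4, 3   # 4 features in total, pick 3 per bootstrap

for b in range(1, 4):
    rows = rng.choice(n_rows, size=n_rows, replace=True)   # row sampling with replacement
    cols = rng.choice(n_features, size=m, replace=False)   # feature sampling without replacement
    print(f"Bootstrap {b}: rows {np.sort(rows)}, features {np.sort(cols) + 1}")
```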

Aggregation

Models are trained on the training dataset and validated against the test dataset. The outputs are then aggregated:

  • For classification tasks, the output of the random forest is the class selected by most trees.
  • For regression tasks, the mean or average prediction of the individual trees is returned.
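
For completeness, here is a hedged scikit-learn sketch of a Random Forest classifier: max_features controls the feature bagging and bootstrap=True gives the row sampling, while the dataset and hyperparameters are arbitrary choices.

```python
# Random Forest sketch: row sampling with replacement plus a random feature subset per split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=2)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features="sqrt",   # features considered at each split (feature bagging)
    bootstrap=True,        # row sampling with replacement (bagging)
    random_state=2,
)
print("Cross-validated accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```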

How Does Random Forest Work?

Suppose there are N observations and M features in the training dataset DS(N, M).

Step 1: Bagging (Row Sampling With Replacement)

Draw a bootstrap sample Z of size n from the training data DS(N, M).

Step 2: Feature Bagging (Column Sampling)

Grow a random forest tree T on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size set by a threshold is reached.

Step 2.1

Select m variables/features at random from the M features. A common default is m ≈ √M for classification and m ≈ M/3 for regression.

Step 2.2

Node splitting, or simply splitting, is the process of dividing a node into multiple sub-nodes to create relatively pure child nodes. There are multiple ways of doing this, which can be broadly divided into two categories based on the type of target variable:

Continuous Target Variable
- Reduction in Variance

Categorical Target Variable
- Gini Impurity
- Information Gain
- Chi-Square
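
As a small illustration with made-up numbers, the sketch below computes two of the criteria listed above, Gini impurity for a categorical target and variance for a continuous one, weighted across the two child nodes of a candidate split.

```python
# Split-criterion sketch: Gini impurity (categorical target) and variance (continuous target).
# The child-node labels below are made up for illustration.
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def variance(values):
    """Variance of a set of continuous target values (reduction-in-variance criterion)."""
    return float(np.var(values))

# Toy split: the left and right child nodes produced by a candidate threshold
left, right = np.array([0, 0, 1]), np.array([1, 1, 1, 0])
n = len(left) + len(right)
weighted_gini = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
print("Weighted Gini after the split:", round(weighted_gini, 3))
```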

Step 3: Aggregation

Random Forest uses an aggregation technique to calculate the final output:

  • For Regression: Mean/Median
  • For Classification: Majority Votes
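
Putting Steps 1 to 3 together, here is a bare-bones teaching sketch of a Random Forest built from scikit-learn decision trees; it is not a replacement for RandomForestClassifier, and the dataset is synthetic.

```python
# Steps 1-3 by hand: bootstrap rows, grow trees with per-split feature sampling, aggregate.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=5)
X_train, y_train, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

rng = np.random.default_rng(5)
trees = []
for b in range(50):
    # Step 1: bootstrap sample of the rows (with replacement)
    idx = rng.choice(len(X_train), size=len(X_train), replace=True)
    # Step 2: grow a tree that considers a random subset of features at each split
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X_train[idx], y_train[idx])
    trees.append(tree)

# Step 3: aggregation by majority vote across all trees (labels are 0/1 here)
all_preds = np.array([t.predict(X_test) for t in trees])
majority = (all_preds.mean(axis=0) >= 0.5).astype(int)
print("Forest accuracy:", (majority == y_test).mean())
```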

Extremely Randomized Tree

Extremely Randomized Trees involve an additional step: randomization in selecting the split threshold while building each decision tree, rather than searching for the best threshold. Extremely Randomized Trees are not as widely used as Random Forest.
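
A hedged scikit-learn sketch using ExtraTreesClassifier is shown below; compared with a Random Forest, split thresholds are drawn at random rather than optimized, and by default each tree is trained on the full dataset instead of a bootstrap sample. The dataset and settings are illustrative.

```python
# Extremely Randomized Trees sketch: random split thresholds instead of exhaustive search.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=10, random_state=4)

et = ExtraTreesClassifier(n_estimators=200, random_state=4)
print("Extra Trees cross-validated accuracy:", cross_val_score(et, X, y, cv=5).mean())
```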

I hope this article provides you with a good understanding of Random Forest.

If you have any questions or if you find anything misrepresented please let me know.

Thanks!

Credit: Andrew Ng, StatQuest and Krish Naik; special thanks for their great content.

