Random Forest
Random Forest ("random" as in chosen without method or conscious decision, "forest" as in a collection of trees) is an ensemble learning method for classification and regression tasks. It is built from a multitude of decision trees.
Before we try to understand random forest, let's first understand ensemble learning.
Ensemble Learning
Ensemble learning is a machine learning technique that combines several base models in order to produce one optimal predictive model (powerful model).
In a Random Forest, ensemble methods let us take a sample of decision trees into account, decide which features to use or which questions to ask at each split, and make a final prediction based on the aggregated results of the sampled decision trees.
Ensemble methods are mainly used to manage the bias and variance of a model.
Bias and Variance
Bias
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem, which leads to high error on both training and test data.
Variance
Variance is the variability of model predictions for a given data point, which tells us how spread out our predictions are. A model with high variance pays too much attention to the training data and does not generalize to data it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
Ideally, the goal of optimization is to achieve both low bias and low variance. Balancing the two is known as the bias-variance tradeoff.
Overfitting is one of the most common problems with decision trees.
A model is said to have high variance if it changes a lot with small changes in the training data.
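The bias-variance contrast above can be seen directly with decision trees. Here is a minimal sketch (not from the article; the synthetic dataset and depths are illustrative) comparing a shallow, high-bias tree with a deep, high-variance one:

```python
# Illustrative sketch: a depth-1 "stump" underfits (high bias),
# while an unpruned tree memorizes the training data (high variance).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# High bias: the stump oversimplifies and scores poorly on BOTH sets.
stump = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_tr, y_tr)
# High variance: the unpruned tree fits training data perfectly,
# but its test score lags behind its training score.
deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_tr, y_tr)

for name, model in [("stump (high bias)", stump), ("deep (high variance)", deep)]:
    print(f"{name}: train={model.score(X_tr, y_tr):.2f} "
          f"test={model.score(X_te, y_te):.2f}")
```

Running this typically shows the stump with similar, mediocre train and test accuracy, while the deep tree reaches 100% on training data with a visible gap on the test set.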
Types of Ensemble Methods
- BAGGing or Bootstrap AGGregation
- Boosting
- Stacking
- Cascading
BAGGing or Bootstrap AGGregation
Bagging is one of the most common techniques to prevent decision tree models from overfitting. Each model is built on a different subset of the data, so bagging reduces variance in the model without increasing bias.
Bagging is a parallel ensemble learning model where we train multiple models in parallel.
Bagging is a combination of bootstrap and aggregation.
Bootstrap is a sampling technique with replacement: we take a sample from the original dataset and put it back into the original dataset before drawing the next sample.
Aggregation is the method of combining the outputs of the different models into one final output.
Let’s understand it with an example,
Original Dataset
Let's not confuse the original dataset with the population dataset: the original dataset is a sample taken from the population.
Bootstrap
Bootstrap is applied automatically by ensemble models. It takes the original dataset and creates a training set by randomly selecting rows from it, with replacement. The rows not selected for training (the out-of-bag samples) are used for validation as a test dataset.
Notice how some of the data in the training set is repeated. This can happen with bootstrap sampling and is considered normal.
Also, the training set remains the same size as the original dataset, while the test set size can vary, since only the rows left out of the training set go into the test dataset.
Aggregation
Models are trained on the training dataset and validated against the test dataset. The outputs are then aggregated:
- For classification tasks, the output of the random forest is the class selected by most trees.
- For regression tasks, the mean or average prediction of the individual trees is returned.
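Both aggregation rules are simple to express in code. A small sketch with made-up tree outputs (the class names and values are illustrative):

```python
# Illustrative aggregation step for classification and regression.
import numpy as np
from collections import Counter

# Classification: each tree votes for a class; the majority wins.
tree_votes = ["cat", "dog", "cat", "cat", "dog"]
majority = Counter(tree_votes).most_common(1)[0][0]
print(majority)                 # -> cat

# Regression: average the individual trees' predictions.
tree_preds = np.array([3.1, 2.9, 3.3, 3.0])
print(tree_preds.mean())        # -> 3.075
```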
Random Forest
Note: If you are not familiar with decision trees, please check out CART first.
Random Forest is an extension of ensemble learning. It is a collection of a multitude of decision trees. Decision trees are very sensitive to even small changes in the data; Random Forest mitigates this by letting a whole bunch of decision trees work together to produce a better and more robust prediction.
Bagging only considers row sampling with replacement, but random forest also performs column (feature) bagging.
In general, the prediction of a Random Forest built from B trees T_1, ..., T_B can be expressed as the aggregate of the individual trees: for regression, ŷ = (1/B) · Σ (b = 1..B) T_b(x), and for classification, ŷ is the majority vote over {T_b(x)}.
Let’s understand it with an example,
Original Dataset
Similar to bagging, we start with a dataset, but this time it includes several features instead of the single feature used in the bagging example.
Bootstrap
After bootstrap we get the training and test datasets below:
Three features (1, 2 and 3) out of four are randomly chosen as part of feature selection, along with row sampling as part of bagging.
Three features (1, 2 and 4) out of four are randomly chosen as part of feature selection, along with row sampling as part of bagging.
Two features (1 and 4) out of four are randomly chosen as part of feature selection, along with row sampling as part of bagging.
This is how random forest applies feature bagging along with bagging.
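A small sketch of what the combined row and feature sampling looks like per tree (the sizes and seed are illustrative; note that scikit-learn's random forest actually re-samples features at each *split* rather than once per tree, but the per-tree version shown here matches the example above):

```python
# Illustrative sketch: each tree gets a bootstrap sample of rows
# plus a random subset of columns (feature bagging).
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_features = 8, 4

for tree in range(3):
    row_idx = rng.choice(n_rows, size=n_rows, replace=True)    # bagging
    col_idx = rng.choice(n_features, size=3, replace=False)    # feature bagging
    print(f"tree {tree}: rows={sorted(row_idx)} features={sorted(col_idx)}")
```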
Aggregation
Models are trained on the training dataset and validated against the test dataset. The outputs are then aggregated:
- For classification tasks, the output of the random forest is the class selected by most trees.
- For regression tasks, the mean or average prediction of the individual trees is returned.
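In practice, all of the above (bagging, feature sampling, and aggregation) is handled internally by a library. A short usage sketch with scikit-learn's `RandomForestClassifier` (the iris dataset here is just a stand-in):

```python
# Using scikit-learn's RandomForestClassifier, which performs bagging
# and per-split feature sampling internally.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees to aggregate
    max_features="sqrt",  # features considered at each split
    oob_score=True,       # validate on the out-of-bag rows
    random_state=0,
).fit(X_tr, y_tr)

print("test accuracy:      ", forest.score(X_te, y_te))
print("out-of-bag accuracy:", forest.oob_score_)
```

The `oob_score_` attribute is exactly the out-of-bag validation idea from the bootstrap section: each tree is evaluated on the rows it never saw during training.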
How Does Random Forest Work?
Suppose there are N observations and M features in the training dataset.
Step 1: Bagging (Row Sampling With Replacement)
Draw a bootstrap sample Z of size n from the training data DS(N, M).
Step 2: Feature Bagging (Column Sampling)
Grow a random forest tree T on the bootstrapped data by recursively repeating the following steps for each terminal node of the tree, until the minimum node size (set by a threshold) is reached.
Step 2.1
Select m variables/features at random from the M features.
Step 2.2
Node splitting, or simply splitting, is the process of dividing a node into multiple sub-nodes to create relatively pure nodes. There are multiple ways of doing this, which can be broadly divided into two categories based on the type of target variable:
Continuous Target Variable
- Reduction in Variance
Categorical Target Variable
- Gini Impurity
- Information Gain
- Chi-Square
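Of the categorical criteria above, Gini impurity is the most common default. A minimal sketch of how it is computed (the label lists are illustrative):

```python
# Illustrative sketch of Gini impurity, a splitting criterion
# for categorical target variables.
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini(["a", "a", "a", "a"]))   # pure node  -> 0.0
print(gini(["a", "a", "b", "b"]))   # 50/50 node -> 0.5
```

A split is chosen to minimize the weighted impurity of the resulting child nodes: pure children (impurity near 0) are preferred.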
Step 3: Aggregation
Random Forest uses an aggregation technique to compute the final output:
- For Regression: Mean/Median
- For Classification: Majority Votes
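The three steps above can be combined into a small hand-rolled forest. This is only an illustrative sketch (using scikit-learn's `DecisionTreeClassifier` as the base tree and iris as a stand-in dataset), not the exact algorithm of any library:

```python
# Illustrative random forest built by hand from the three steps above:
# bagging + feature bagging + majority-vote aggregation.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)
N, M = X.shape
m = int(np.sqrt(M))            # features drawn per tree
trees, feats = [], []

for _ in range(25):
    rows = rng.choice(N, size=N, replace=True)    # Step 1: bagging
    cols = rng.choice(M, size=m, replace=False)   # Step 2: feature bagging
    tree = DecisionTreeClassifier(random_state=0).fit(X[rows][:, cols], y[rows])
    trees.append(tree)
    feats.append(cols)

def predict(x):
    # Step 3: aggregation by majority vote across all trees
    votes = [t.predict(x[None, cols])[0] for t, cols in zip(trees, feats)]
    return np.bincount(votes).argmax()

train_acc = np.mean([predict(x) == yi for x, yi in zip(X, y)])
print("training accuracy:", train_acc)
```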
Extremely Randomized Tree
An Extremely Randomized Tree adds one more step of randomization: the split threshold for each candidate feature is chosen at random while building the decision tree, rather than searched for exhaustively. Extremely Randomized Trees are less commonly used than random forests.
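scikit-learn ships this variant as `ExtraTreesClassifier`, so the two can be compared side by side (iris is a stand-in dataset; scores will vary with data and seed):

```python
# Comparing a random forest against extremely randomized trees,
# which draw split thresholds at random instead of optimizing them.
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for Model in (RandomForestClassifier, ExtraTreesClassifier):
    scores = cross_val_score(Model(n_estimators=100, random_state=0), X, y, cv=5)
    print(Model.__name__, round(scores.mean(), 3))
```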
I hope this article provides you with a good understanding of Random Forest.
If you have any questions or if you find anything misrepresented please let me know.
Thanks!
Credit: Andrew Ng, StatQuest and Krish Naik. Special thanks to them for their great content.