Random Forest — A Concise Technical Overview
Random forest is one of the most popular and powerful machine learning algorithms. Because it can be used for both classification and regression tasks, it is also one of the most widely used algorithms in the machine learning space.
Random Forest is a supervised learning algorithm. So what exactly is ‘Random Forest’? As the name suggests, the algorithm creates a ‘forest’ made up of many trees. The underlying idea is that combining more trees generally yields more accurate and more stable results, although the gains level off once the forest is large enough. Simply put, Random Forest builds an ensemble of decision trees and merges their outputs into a single, more accurate and stable prediction.
The trees in the forest are not plain decision trees but bagged decision trees in which every split considers only a random subset of the features. Now, it is getting interesting, isn’t it?
Let us understand what bagged trees are. But before that, we need to understand what bootstrapping is.
Bootstrapping is a powerful statistical method for estimating a quantity, such as the mean, from a data sample. For instance, we can repeatedly draw samples with replacement from the data, calculate the mean of each sample, and average those means to get a better estimate of the true mean. Bagging (bootstrap aggregating) uses a similar approach, but instead of estimating a single value we fit entire statistical models (decision trees). Multiple bootstrap samples are drawn with replacement from the training set, typically each as large as the training set itself, and a model is built on each sample. Each model then makes a prediction for a new data point, and these predictions are averaged (for regression) or put to a majority vote (for classification) to give a better estimate of the true output value.
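The bootstrap estimate of a mean can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `bootstrap_mean`, the number of resamples, and the sample values are assumptions made for the example.

```python
import random
import statistics


def bootstrap_mean(data, n_resamples=1000, seed=0):
    """Estimate the mean of `data` by averaging the means of bootstrap resamples."""
    rng = random.Random(seed)
    n = len(data)
    resample_means = []
    for _ in range(n_resamples):
        # Draw n observations *with replacement* from the original sample.
        resample = rng.choices(data, k=n)
        resample_means.append(statistics.fmean(resample))
    # The average of the resample means is the bootstrap estimate; their
    # spread indicates how uncertain that estimate is.
    return statistics.fmean(resample_means), statistics.stdev(resample_means)


# Made-up sample values, for illustration only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
estimate, spread = bootstrap_mean(sample)
print(f"sample mean    = {statistics.fmean(sample):.3f}")
print(f"bootstrap mean = {estimate:.3f} (spread of resample means = {spread:.3f})")
```

Bagging follows the same resampling loop, except that a decision tree is fitted to each resample instead of a mean being computed.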
Usually a single decision tree suffers from over-fitting (high variance), but bagging brings many decision trees into the prediction process. Each tree on its own is a high-variance learner; averaging their predictions cancels out much of that variance and so reduces the risk of over-fitting, although it does not eliminate it.
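A short calculation shows why averaging helps. The symbols here are assumptions introduced for the illustration: suppose each of the B trees' predictions T_b(x) has the same variance σ², and any two trees' predictions have correlation ρ. The variance of the averaged prediction is then:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as more trees are added, while the first term stays fixed. So averaging lowers the variance, but only down to a floor set by how correlated the trees are.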
Random Forest improves on plain bagging by decorrelating the trees: at each split, the model considers only a small random subset of the features rather than all of them. If the dataset contains a few strong predictors, plain bagged trees will consistently choose those predictors at the top of the trees and so end up with very similar structures, and averaging similar trees removes little variance. Restricting each split to a random subset of features prevents a few strong predictors from dominating, introduces randomness into predictor selection, and thereby reduces the variance of the combined prediction.
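The per-split feature sampling can be sketched as follows. This is a minimal illustration, not a full tree builder: the helper name `candidate_features` is hypothetical, and the default of roughly the square root of the number of features is an assumed but common choice for classification.

```python
import math
import random


def candidate_features(n_features, rng, max_features=None):
    """Return the indices of the features a single split is allowed to consider."""
    if max_features is None:
        # Assumed default: about sqrt(p) features per split.
        max_features = max(1, int(math.sqrt(n_features)))
    # Sample without replacement so every candidate feature is distinct.
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(42)
n_features = 16
# Each split draws its own fresh subset, so a strong predictor is simply
# unavailable at many splits and other features get a chance to be chosen.
for split in range(3):
    print(f"split {split}: candidates = {candidate_features(n_features, rng)}")
```

A tree builder would then search for the best split among these candidate features only, rather than among all of them.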
Let us see, in pseudocode, how a trained Random Forest makes a prediction.
Step 1: Take the test features and use the rules of each randomly created decision tree to predict the outcome, storing each predicted outcome (target).
Step 2: Calculate the votes for each predicted target.
Step 3: Take the predicted target with the most votes as the final prediction of the random forest algorithm.
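The three steps map directly onto a few lines of Python. In this sketch the "trees" are hypothetical hand-written rules standing in for trained decision trees, and the feature names (`links`, `caps_ratio`) are invented for illustration.

```python
from collections import Counter


def forest_predict(trees, features):
    """Classify `features` by majority vote over the trees' predictions."""
    # Step 1: every tree applies its own rules to the same test features.
    predictions = [tree(features) for tree in trees]
    # Step 2: count how many trees voted for each target.
    votes = Counter(predictions)
    # Step 3: the target with the most votes is the forest's prediction.
    winner, _ = votes.most_common(1)[0]
    return winner, votes


# Three hypothetical "trees", each reduced to a single hand-written rule.
toy_trees = [
    lambda f: "spam" if f["links"] > 3 else "ham",
    lambda f: "spam" if f["caps_ratio"] > 0.5 else "ham",
    lambda f: "spam" if f["links"] > 1 and f["caps_ratio"] > 0.2 else "ham",
]

label, votes = forest_predict(toy_trees, {"links": 5, "caps_ratio": 0.3})
print(label, dict(votes))
```

If two targets tie, this sketch returns whichever was counted first; a real implementation would need an explicit tie-breaking rule.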
To make a prediction with the trained random forest, we pass the test features through the rules of each randomly created tree. Suppose we have formed 100 random decision trees.
The trees may predict different targets (outcomes) for the same test features. Votes are then tallied over the predicted targets. Suppose the 100 random decision trees predict 3 unique targets x, y and z; the votes in favor of x are the number of trees, out of 100, that predicted x. Likewise, we count the votes for the other 2 targets, y and z. For instance, if 60 of the 100 trees predict x, then x has the most votes and the random forest returns x as the predicted target.
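The 100-tree example can be worked through numerically. The 60 votes for x come from the example above; the 25/15 split between y and z is an assumption made only to complete the tally.

```python
from collections import Counter

# 60 votes for x is taken from the text; the 25/15 split between y and z
# is assumed purely for illustration.
tree_predictions = ["x"] * 60 + ["y"] * 25 + ["z"] * 15

votes = Counter(tree_predictions)
total = sum(votes.values())
# Fraction of trees that voted for each target.
vote_share = {target: count / total for target, count in votes.items()}

# The forest returns the target with the largest number of votes.
final_prediction = max(votes, key=votes.get)
print(final_prediction, vote_share)
```

The vote shares can also be read as a rough indication of how strongly the trees agree on the winning target.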
Random Forest has several advantages:
- Random forest tends to maintain its accuracy even when the data is inconsistent.
- It is very handy and easy to use because its default hyper-parameters often produce good predictions.
- It has methods for balancing error on data sets with unbalanced class populations. These matter because Random Forest tries to minimize the overall error rate, so on an unbalanced data set, left uncorrected, the larger class gets a low error rate while the smaller class gets a higher one.
- Random Forest is relatively insensitive to outliers and can capture non-linear relationships in the data.
It also has some disadvantages:
- On large data sets, the trees can grow large enough to exhaust the available memory.
- Interpretation is tricky: Random Forests behave like black boxes, so it is hard to explain how an individual prediction was reached.
- It is a predictive modelling tool and not a descriptive tool. If the goal is to find a description of the relationships in the data, other approaches would be preferred.
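The earlier point about unbalanced data sets is easier to see with numbers. The sketch below uses invented counts to show how a low overall error rate can hide a high error rate on the smaller class. It then computes inverse-frequency class weights, one common way of rebalancing; this is an assumption, since the text does not say which balancing method it has in mind. The function names are hypothetical.

```python
def error_rates(counts):
    """Compute overall and per-class error from {label: (n_samples, n_errors)}."""
    total = sum(n for n, _ in counts.values())
    wrong = sum(e for _, e in counts.values())
    per_class = {label: e / n for label, (n, e) in counts.items()}
    return wrong / total, per_class


def inverse_frequency_weights(counts):
    """Weight each class by n_samples / (n_classes * class_size)."""
    total = sum(n for n, _ in counts.values())
    return {label: total / (len(counts) * n) for label, (n, _) in counts.items()}


# Hypothetical results on an unbalanced test set (numbers invented for illustration).
results = {"majority": (950, 19), "minority": (50, 20)}

overall, per_class = error_rates(results)
weights = inverse_frequency_weights(results)
print(f"overall error = {overall:.3f}")
for label in results:
    print(f"{label}: error = {per_class[label]:.3f}, weight = {weights[label]:.3f}")
```

With these counts the overall error is under 4% even though 40% of the minority class is misclassified. Giving each minority sample a larger weight makes its errors count for more during training.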