Deep Dive on Random Forest.

Akhilesh Rai · Published in MindOrks · Nov 16, 2018 · 3 min read

(Understanding Random Forest in a step-by-step manner.)

I often hear people talk a lot about random forest; even in interviews, candidates name Random Forest as their preferred algorithm. Yet even after using Random Forest quite often in their work, many people do not understand its basics.

In this article I assume that the reader is well versed with Decision Trees and has good Machine Learning basics. My approach is to explain each step of the Random Forest algorithm with examples.

Random Forest is an ensemble algorithm. Algorithms like Decision Trees and Neural Networks tend to have high variance or high bias, and to tackle this we use ensemble techniques like Bagging and Boosting. There is also a non-ensemble way to tackle the problem: a segmentation approach that divides the data into different clusters and then applies a different model to each cluster.

Random Forest is a Bagging technique, which reduces high variability in the model. One question that often arises is where we should apply Bagging. To answer this we need to distinguish stable from unstable models. Take Decision Trees as an example: a fully grown tree usually overfits the data, so it has low bias but high variance, and a small change in the training data or initial parameters causes its predictions to vary a lot. That qualifies it as an unstable model. This is why we apply Bagging only on unstable models such as Decision Trees and Neural Networks.
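For intuition, here is a minimal sketch (assuming scikit-learn is installed) that compares a single fully grown decision tree with a bagged ensemble of the same trees on a synthetic dataset. The exact scores are only illustrative, but the bagged ensemble typically generalizes better because averaging washes out the variance of the individual trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A single fully grown tree: low bias, high variance (an "unstable" learner).
single_tree = DecisionTreeClassifier(random_state=0)

# Bagging: the same kind of tree trained on 50 bootstrap samples, predictions combined.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                                 n_estimators=50, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```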

The term bagging was coined by Leo Breiman and stands for Bootstrap Aggregating. Let's understand Random Forest step by step:

Step 1: Samples are drawn repeatedly from the training data so that each data point has an equal probability of being selected, and each sample has the same size as the original training set.

Let's say we have the following data:

x = 0.1, 0.5, 0.4, 0.8, 0.6 and y = 1, 0, 1, 0, 0, where x is an independent variable with 5 data points and y is the dependent variable.

Now bootstrap samples are drawn with replacement from the above data set.

Let's say n_estimators is set to 5 (the number of trees in the random forest). Then:

The first tree will get a bootstrap sample of size 5 (the same size as the original dataset); suppose it is x1 = {0.5, 0.1, 0.1, 0.6, 0.6}, and likewise (a code sketch of this sampling follows the list):

x2 = {0.4, 0.8, 0.6, 0.8, 0.1}

x3 = {0.1, 0.5, 0.4, 0.8, 0.8}

x4 = {0.4, 0.4, 0.1, 0.8, 0.8}

x5 = {0.1, 0.6, 0.1, 0.4, 0.8}
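For concreteness, here is a minimal sketch of the sampling step using NumPy. The seed is arbitrary, so the printed samples will not match the hand-written x1…x5 above; what matters is that each sample is the same size as the original data and is drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)                  # seed chosen only so the output is repeatable

x = np.array([0.1, 0.5, 0.4, 0.8, 0.6])
y = np.array([1, 0, 1, 0, 0])

n_estimators = 5                                # number of trees, as above
for i in range(n_estimators):
    idx = rng.integers(0, len(x), size=len(x))  # indices drawn with replacement
    print(f"x{i + 1} =", x[idx], f"  y{i + 1} =", y[idx])
```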

Step 2: A classification or prediction model is trained on each bootstrap sample drawn in the above step, and a prediction is recorded for each sample.
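A minimal sketch of this step, again assuming scikit-learn: one DecisionTreeClassifier is fitted per bootstrap sample of the toy data, and its prediction on the original points is recorded. (A real Random Forest additionally samples a random subset of features at each split, which this single-feature sketch cannot show.)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
x = np.array([0.1, 0.5, 0.4, 0.8, 0.6]).reshape(-1, 1)    # one feature, as a column vector
y = np.array([1, 0, 1, 0, 0])

trees, predictions = [], []
for _ in range(5):                                        # n_estimators = 5
    idx = rng.integers(0, len(x), size=len(x))            # bootstrap sample, with replacement
    tree = DecisionTreeClassifier().fit(x[idx], y[idx])   # one tree per bootstrap sample
    trees.append(tree)
    predictions.append(tree.predict(x))                   # record its prediction on the data
```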

Step 3: The ensemble prediction is then computed by taking a majority vote over the per-tree predictions from the above step (for classification) or by averaging them (for regression).
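A minimal sketch of the aggregation, with made-up per-tree predictions for the five toy points (one row per tree):

```python
import numpy as np

# Hypothetical per-tree predictions for the 5 training points, one row per tree.
predictions = np.array([
    [1, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
])

# Classification: majority vote per point (the mean >= 0.5 trick works for 0/1 labels).
ensemble_class = (predictions.mean(axis=0) >= 0.5).astype(int)
print(ensemble_class)                     # -> [1 0 1 0 0]

# Regression would instead average the raw per-tree outputs:
# ensemble_value = predictions.mean(axis=0)
```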

Thus, if the classifier or estimation model is unstable, the ensemble helps reduce the variance of the model by averaging the predictions or taking a majority vote. But if the model is stable, the prediction error arises from the bias of the model, and bagging cannot fix that, because each bootstrap sample contains only about 63% of the original data points on average.
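That 63% figure follows from the bootstrap itself: the chance that a given point appears in a sample of size n drawn with replacement is 1 - (1 - 1/n)^n, which tends to 1 - 1/e ≈ 0.632 for large n. A quick check:

```python
import math

n = 5                                    # size of the toy training set above
p_in_sample = 1 - (1 - 1 / n) ** n       # chance a given point appears in one bootstrap sample
print(round(p_in_sample, 3))             # 0.672 for n = 5
print(round(1 - math.exp(-1), 3))        # 0.632 as n grows large, hence the ~63% figure
```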

The downside of Random Forest, or any other bagging algorithm, is its lack of interpretability.
