Disclosing the Secret of Randomness in Random Forests
Random forest is an ensemble machine learning algorithm that operates by constructing multiple decision trees during the training process.

Use cases of Random Forests:
- Detecting fraudulent and loyal customers in banks.
- Predicting diseases by analyzing a patient's medical reports.
- Stock price prediction.
How does the random forest algorithm work? →
As mentioned, a random forest is a collection of decision trees. First, when we get the dataset, we draw n random subsets from it (strictly speaking, standard random forests draw these as bootstrap samples, i.e. sampled with replacement, rather than as n disjoint equal parts). The number of subsets, n, corresponds to the n_estimators parameter of the random forest algorithm.
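As a minimal sketch (Python assumed; the toy fruit dataset and the value of `n_estimators` are illustrative assumptions), drawing the per-tree training subsets as bootstrap samples might look like this:

```python
import random

# Toy dataset: 6 fruits as (color, label) pairs, mirroring the worked example.
dataset = [("RED", "APPLE"), ("RED", "APPLE"),
           ("PURPLE", "GRAPE"), ("PURPLE", "GRAPE"),
           ("YELLOW", "LEMON"), ("YELLOW", "LEMON")]

random.seed(0)
n_estimators = 3  # number of trees (one bootstrap sample per tree)

# Standard random forests sample WITH replacement (bootstrap sampling),
# so each tree trains on a slightly different view of the same data.
bootstrap_samples = [random.choices(dataset, k=len(dataset))
                     for _ in range(n_estimators)]

for i, sample in enumerate(bootstrap_samples):
    print(f"tree {i}: {sample}")
```

Each sample has the same size as the original dataset but may repeat some rows and omit others, which is what injects randomness into the forest.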
Now we will start constructing a decision tree for each sub-dataset of the main dataset. While constructing a decision tree, note that at the first level all the attributes (features) of the dataset are candidates for the root node. From there we have to select the best attribute among all the attributes of the dataset. For this there are two common criteria → one is Information Gain and the other is the Gini Index. We will use the Gini Index method here to select the best attribute out of all the attributes for making decisions in the decision tree.
So, let’s take this dataset for understanding random forests.

So, the formula for calculating the Gini Index is: →

Gini(t) = 1 − Σᵢ p(i)²

where,
- p(i) refers to the probability of group i, i.e. (number of elements in group i) / t,
- i ranges over the different groups present in the feature column,
- t refers to the total number of elements present in the feature column.
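The formula translates directly into code. As a minimal illustration (Python assumed; the LABEL column below comes from the worked example), the impurity of a perfectly balanced three-class column works out to 1 − 3·(2/6)² = 2/3:

```python
from collections import Counter

def gini(column):
    """Gini impurity: 1 minus the sum of squared group probabilities."""
    t = len(column)                       # total elements in the column
    counts = Counter(column)              # size of each group i
    return 1 - sum((c / t) ** 2 for c in counts.values())

# LABEL column of the toy dataset: 2 apples, 2 grapes, 2 lemons.
label = ["APPLE", "APPLE", "GRAPE", "GRAPE", "LEMON", "LEMON"]
print(round(gini(label), 2))  # 0.67
```

A column containing only one class gives an impurity of 0, the purest possible split.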
Gini Index → The Gini Index is the criterion for selecting the best attribute among the attributes present in the dataset, with whose help we generate further sub-trees. It measures the impurity (inequality) that a split on an attribute leaves in the dataset. For each feature column, the binary split with the lowest impurity is taken to represent that column's Gini Index value. A Gini score of 0 is the ideal case (a perfectly pure split), while higher scores mean the classes are more mixed (the maximum is 1 − 1/k for k equally frequent classes). So we will pick the attribute with the lowest weighted Gini value when generating sub-trees.
So, after getting the dataset we draw n subsets from it, and for each subset we construct a decision tree. As we took a small dataset here, we will construct 3 different decision trees from it.
For constructing our first decision tree, we start by calculating the Gini Index of the dependent variable of the dataset, which is "LABEL". The Gini Index of the LABEL column is 1 − (2/6)² − (2/6)² − (2/6)² ≈ 0.67, as there are 2 LEMONS, 2 APPLES and 2 GRAPES among the 6 LABELS in our dataset. Next we calculate the best Gini Index for each feature column. For that, we have to divide the values of each column into two groups. For the "COLOR" column the number of possible subsets is 2³ = 8 — the Gini Index uses a binary split for each attribute, and the exponent is 3 because the COLOR column has 3 distinct values: RED, PURPLE and YELLOW. Out of all 8 possible subsets, the meaningful binary sub-splits are → {(RED, PURPLE), YELLOW}, {(PURPLE, YELLOW), RED} and {(RED, YELLOW), PURPLE}. Now we calculate the Gini Index of each binary sub-split through the formula:

Gini_split(D) = (|D1|/|D|) · Gini(D1) + (|D2|/|D|) · Gini(D2)

where, if we take the split {(RED, PURPLE), YELLOW} first, D1 refers to the (RED, PURPLE) partition, D2 refers to the YELLOW partition, and |D| refers to the total number of training examples.
Therefore, the Gini Index of {(RED, PURPLE), YELLOW} is (4/6)·(1 − (2/4)² − (2/4)²) + (2/6)·(1 − (2/2)²) = 1/3. For a detailed walkthrough of the calculation, check this video: Gini Index calculation in detail.
After calculating the Gini Index of the three sub-splits ({(RED, PURPLE), YELLOW}, {(PURPLE, YELLOW), RED} and {(RED, YELLOW), PURPLE}), take the sub-split with the lowest Gini Index value among the three. Similarly, calculate the lowest Gini Index for the other features of the dataset using binary splits. Finally, among the features, the feature whose best split has the lowest weighted Gini Index value is selected as the root node for classification.
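The split search above can be sketched as follows (Python assumed; the dataset, helper names and split enumeration are illustrative): enumerate the binary sub-splits of COLOR, compute the weighted Gini of each, and keep the one with the lowest impurity.

```python
from collections import Counter
from itertools import combinations

def gini(labels):
    """Gini impurity of a list of class labels."""
    t = len(labels)
    return 1 - sum((c / t) ** 2 for c in Counter(labels).values())

def weighted_gini(rows, left_values):
    """Gini_split(D) = |D1|/|D| * Gini(D1) + |D2|/|D| * Gini(D2)."""
    d1 = [lbl for val, lbl in rows if val in left_values]
    d2 = [lbl for val, lbl in rows if val not in left_values]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

# COLOR column paired with LABEL for the toy dataset.
rows = [("RED", "APPLE"), ("RED", "APPLE"),
        ("PURPLE", "GRAPE"), ("PURPLE", "GRAPE"),
        ("YELLOW", "LEMON"), ("YELLOW", "LEMON")]

values = {"RED", "PURPLE", "YELLOW"}
# The 3 binary sub-splits of a 3-valued attribute: each pair vs. the singleton.
splits = [frozenset(pair) for pair in combinations(sorted(values), 2)]

for split in splits:
    print(set(split), "vs", values - split, "->",
          round(weighted_gini(rows, split), 3))

# Pick the split with the lowest weighted impurity.
best = min(splits, key=lambda s: weighted_gini(rows, s))
```

Note that in this perfectly balanced toy dataset all three sub-splits tie at 1/3, matching the hand calculation; on real data the splits would differ and the minimum would single out the best one.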
In the subsequent steps we repeat the same Gini Index calculations to find the next best feature for classification and construct the next node of the decision tree, leaving out the features already used for earlier splits.
So we will get our first decision tree as:

And when we repeat the same steps on the other sub-parts of our dataset, the other 2 decision trees that are formed are:


Now, if we want to classify a new fruit as APPLE, LEMON or GRAPE (NOTE: LEMON is represented by an orange and GRAPE by a cherry in the pictures), we run its feature values through the 3 different trees, and the class that receives the maximum votes will be the answer for the new test object.


As orange got the maximum votes among the predicted answers of the decision trees, the final prediction for the test object is LEMON (orange).
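The final voting step can be sketched as follows (Python assumed; the per-tree predictions are hypothetical values mirroring the example):

```python
from collections import Counter

# Hypothetical predictions from the 3 decision trees for one test fruit.
tree_predictions = ["LEMON", "LEMON", "APPLE"]

# The forest's answer is the majority vote across the trees.
votes = Counter(tree_predictions)
prediction, count = votes.most_common(1)[0]
print(prediction, count)  # LEMON 2
```

This majority-vote aggregation is what makes the forest more robust than any single tree: one tree's mistake is outvoted by the others.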

This is how the random forest algorithm works with a bunch of decision trees. I hope you have enjoyed reading this blog. If you have any comments, queries or questions, please let me know in the comments section. Until then, enjoy learning.

