Random Forest — A Model Designed to Provide Structure in Chaos
Picture this — you need to make prediction based on a massive amount of features. What algorithm comes to mind first? For me, I usually always go with random forest. When faced with the problem of over-fitting, the machine learning technique that comes to rescue (more often than not) is random forest again. When we want an easy solution to a problem that a greedy algorithm causes, random forest seems to be the answer.
Until not too long ago, random forest was one of the most widely used predictive modelling techniques in various data science competitions. Of late, the boosting algorithm, XGBoost has taken over but random forest remains a very useful technique to know and use.
This should be enough to convince any individual aspiring to make inroads into the Data Science world, to have random forest in his/her tool kit.
What is Random Forest then? Why is it considered to be so good?
There are tons of material in circulation explaining the concept of random forest. My attempt here is to explain the concept but with a touch of simplification by juxtaposing it with a few social science scenarios. Those who are new to data science, can get intimidated by the technical deluge and lose interest. Hence I am attempting to give a snapshot with a few relatable comparisons. Once the context is set, we can tear into the technicalities.
Table of Contents
- Random Forest and Democracy - A Casual Comparison
- Rome was not built in a day - Decision Tree, the bedrock of Random Forest
- Decision Tree - The Unfinished Solution
- Random Forest - The Messiah
- Case Studies
1. Random Forest and Democracy - A Casual Comparison
One will be surprised to know that random forest’s modus operandi seems to have a lot of similarities with how a democracy functions. Democracy seems to have highest resonance with the phrase,“Structure in Chaos”. Random Forest, as the name suggests, is a forest of hundreds of decision trees, but there is a structure and method as to how it operates.
Conceptually, what are the major ideals, that democracy attempts to achieve and which aspect of random forest finds a resonance with it?
- In democracy, governance is decided by the majority opinion. Therefore the final decision is made on the basis of voting, unlike the case in a Monarchy or Dictatorship. The parallel in random forest is that, the predictive model is built with multiple decision trees and the outcome from each tree is aggregated to decide the final outcome.
- In a democratic set up, like let’s say India, the elections occur all over the country spreading across many constituencies. Considering the diversity of the population, it is not possible for a party to win the majority by having a manifesto that appeals to only a few sections/communities/cultures of the country. Thus democracy tests every party’s manifesto. I would say that if a manifesto is favorable to only a few sections of the society, it is not inclusive in nature, as the aim is to win the election without looking into the needs of the entire population. Hence if a party’s manifesto is comparable with a model in data science, the manifesto resonates with the cross validation concept used in the random forest model. The key goal of cross validation is to prevent any greedy algorithm(The concept of greedy algorithm explained in latter sections) to take over. Cross-validation is a statistical method used for assessing how the results of a statistical analysis will generalize to an independent data set. In the absence of cross validation, the model output can be very dependent on the data on which the model was trained.
- As mentioned earlier, in a democratic set up like India, elections occur in many constituencies in every state. So the election result in each constituency has a role to play in the final election result. Random forest is an ensemble of decision trees, where the decision from each tree is accounted for in the final decision of prediction for each observation. The elections are synonymous with decision trees in random forest i.e. the effort to elect the ruling party is an ensemble of many elections (individual decision trees) at a constituency level.
- In Lok Sabha elections, no candidate can contest from more than two constituencies. This removes possible bias from voters’ mind towards a popular candidate and helps in establishing accountability with all contesting members to perform. In Random Forest, not all predictors are considered in each decision tree built. A subset of the predictors is randomly picked by each decision tree for prediction. This helps in all predictors making their significant contribution towards the final decision.
Random forest thrives even in scenarios, when there is an abundance of chaos, i.e. many predictors. It’s hard to know which predictor is important and which is not. All the other traditional statistical techniques might fail or struggle when we have an incredibly high number of independent variables.
2. Rome was not built in a day - Decision Tree, the bedrock of Random Forest
The context having been set with a very generic understanding of random forest, we will now move to technical intricacies of the algorithm. Building a forest means, building many individual trees, as a conglomeration. Hence, it is imperative to understand the working of a decision tree to understand the modus vivendi of random forest.
In real life, the decision making process is generally a subconscious step-wise activity, in which, each option (which is equivalent to a feature/predictor in data science parlance) is weighed upon with it’s pros and cons and the option that gives the best output is chosen. In other words, it’s a step-wise process, where at each step, the factors that give a better separation amongst all levels of decisions, are chosen for further analysis, until the final decision is made. Here, we see a real life example(on left) of a decision tree, where the Levels of Decision are Yes (I should Quit) and No(I should Not) and based on different situations, the answer is different for the individual.
In data science, the decision tree algorithms work towards splitting the data into homogeneous groups. This is achieved by selecting the predictor that helps in getting the best split. It works on both categorical and regression problems. The criterion that are generally used to identify the significant predictors, which ultimately help in splitting the data are “Entropy”, “Information Gain” and “Gini Index”. The purpose of all these three techniques is to identify the best split which helps us move towards a clear separation between various levels of the target variable. The moment we have a split which cannot be further homogenized, we have our decision for that branch. In a decision tree, we can have multiple branches which ultimately help in defining the rules that gives the information required for the prediction. For a more detailed technical explanation of how these concepts work, please refer to a very interesting blog I found in Medium itself, Decision Trees.Decoded, written by Akhil Gupta.
I am providing a small example of how the decision tree works, which when read with the above referenced blog, will help in better understanding. See below. For this I am using the famous Titanic dataset which has captured the details of passengers who were traveling in the ship and met with the fatal accident mid sea. See below the R section which ends with a decision tree that can be used to predict whether a passenger died or survived, if we had the information about the rest of the variables.
3. Decision Tree - The Unfinished Solution
There are many advantages associated with how the decision trees operate, but individually, decision trees introduce many problems. With these problems, we rarely find a robust predictive model. Some of the reasons why I say so are given below as disadvantages:
I. Very unstable since a variation in the data-set can lead to a very different tree taking shape. The tree is very dependent on the training data being used.
II. Over-fitting is a common problem with decision trees. They follow the pattern of the training data too closely, which cannot be replicated by all data-sets, resulting in poor performance on unseen data.
III. They are prone to get affected by greedy algorithm, since the goal at each node is to find the best split at that node. We may not find the globally optimum tree. As a result of the greedy strategy applied by decision trees in finding the right starting point of the tree, the final result can be greatly impacted. i.e small changes early on can have big impacts later.
IV. Decision Trees do not work well if you have smooth boundaries. i.e they work best when you have discontinuous piece wise constant model. The boundaries drawn by trees are always parallel to X-axis (or Y-axis). If you truly have a linear target function decision trees are not the best.
4. Random Forest - The Messiah
Individually, decision trees are not the best choice for a predictive model, as was seen in the earlier section. However the strengths that can be derived from individual trees can be aggregated to give a very robust model. That is precisely what Random Forest does. Random Forest is essentially an ensemble of many trees, that beats all negatives highlighted in the previous section. The working of a Random Forest can be elaborated in a few steps, as explained below.
Step 1: Random Forest works by building multiple trees, where each tree is built by taking bootstrapped training samples from the original data. Bootstrapping is a type of resampling where large numbers of smaller samples of the same size are repeatedly drawn, with replacement, from a single original sample. See a simple visual representation of bootstrap sampling on the left. This is the first introduction of randomness to the model.
Step 2: Each tree is built by taking a subset of the predictors available in the original data. This is second randomness introduced to the algorithm. While building these decision trees, each time a split in a tree is considered, a random sample of “m” predictors is chosen as split candidates from the full set of “p” predictors. The split is allowed to use only one of those m predictors. A fresh sample of m predictors is taken at each split, and typically m ≈√p in a classification problem. The defalut ratio of m:p in a regression problem is 1/3.
Step 3: Classifying
Based on ’n’ sample, ’n’ trees are built. Each record is classified by all the trees built. So with ’n’ trees built, we will have ’n’ predictions for each record. In a classification problem, the final class for each record is decided based on voting i.e. the mode of the predictions is the final prediction. In the example below, Tree1 is built with observations inside the green boundary. Similarly Tree2 with observations highlighted in orange and Tree3 with observations inside purple boundaries. Each of them have different predictors. The final output, for the first row of observation, as you can see, is the mode of the 3 predictions achieved from the three trees.
Advantages gained from Random Forest:
- Cross Validation comes free with Random Forest. By default random forest picks up 2/3rd data for training and rest for testing for regression and almost 70% data for training and rest for testing during classification. Cross Validation helps in preventing over-fitting as the model generalizes with experiences from multiple samples of data.
- Since at each node a subset of the features are selected for splitting as explained above, the trees are generally de-correlated. This also helps in preventing any greedy algorithm from taking over, since, if there was a very strong predictor, it will not be there in all trees to dominate the result.
- Variance that can be exhibited by individual trees is vastly reduced in random forest by averaging the results from all trees. The variance is expected to be reduced by 1/n, where n is the number of trees grown.
- We can get the relative importance of all the predictors in terms of how they affect the target variable. This in turn helps in feature selection. We can see it in the case studies provided in the next section.
- Decision boundaries can become much more accurate with more flexibility. See below an example with decision boundaries drawn by a decision tree and by a random forest model on the same dataset. The miss-classifications seen in a decision tree have mostly got addressed in the random forest.
5. Case Studies: Two classification case studies taken with data-sets with extremely high number of predictors.
Case Study I — Prediction of Human Activity based (560 predictors)
The data set used for the project is collected from recordings of 30 human subjects captured via smartphones enabled with embedded inertial sensors. Many machine learning courses use this data for teaching purposes. This is a multi-classification problem. The data set has 10,299 rows and 561 columns.
For the dataset, 30 people were used to perform 6 different activities. Each of them was wearing a Samsung Galaxy SII on their waist. Using the smartphone’s embedded sensors (the accelerometer and the gyroscope), the user’s speed and acceleration were measured in 3-axial directions. The sensor’s data was used to predict user’s activity. The user activities could be one of the six below:
Walking, Walking — Upstairs, Walking — Downstairs, Sitting, Standing and Laying.
The data can be imported from the following UCI Machine Learning repository. Dataset. The case study along with the R code and applicable commentary is below
Case Study II — Building a random Forest Model to Predict Customer Behavior — Customer will purchase the product or not (85 Predictors)
This case study is worked out by taking a data set from a very popular book, Introduction to Statistical Learning (ISLR). The data contains 5822 real customer records. Each record consists of 86 variables, containing sociodemographic data (variables 1–43) and product ownership (variables 44–86). The sociodemographic data is derived from zip codes. All customers living in areas with the same zip code have the same sociodemographic attributes. Variable 86 (
Purchase) indicates whether the customer purchased a caravan insurance policy. Further information on the individual variables can be obtained at http://www.liacs.nl/~putten/library/cc2000/data.html
Apart from learning how to build a random forest model, this case study helps us with methods to deal with imbalanced data. I used SMOTE technique to balance the data. I referred to an Analytics Vidhya blog, Practical Guide to deal with Imbalanced Classification Problems in R, to deal with imbalanced data-set problem. The workings in r to build the model alongwith associated commentary is provided below.
We saw two examples of how random forest comes to rescue of analysts who would have struggled to deal with such a high number of independent variables, in the absence of it. We also saw how random forest uses decision trees to build a very robust model. In the second case study, we tackled the problem that can be seen when we have an imbalanced dataset. The accuracy rate achieved is around 87%, which can be probably further improved by using a boosting model on a random forest model. We also saw that random forest can be an extremely useful tool for feature selection. This helps in eliminating unwanted features from the final model.