What are Random Forests and how do they really work?

Aman Arora
6 min read · Dec 26, 2018


Disclaimer: The ideas presented in this article draw on my personal experience and are heavily influenced by, if not directly quoted from, the ideas presented by Jeremy Howard in his Introduction to Machine Learning course at fast.ai.

I am an ardent follower of Jeremy Howard and a big fan of the way he explains the intricacies of Random Forests, which are otherwise often treated as a Black Box.

When I first implemented Random Forests during the Data Science course at General Assembly in Aug–Oct 2018, which covered a breadth of topics in ML, it simply amazed me how creating binary splits on certain variables and values could give such amazing results in practice! I was astonished to see how Decision Trees (weak classifiers) can come together to form a strong classifier, and how interpretable the results are!

The idea of Random Forests was first formally introduced by Leo Breiman in this research paper. Unfortunately, he passed away not many years after that, and not much was written about Random Forests until about three years ago, when the idea of practical Machine Learning took over. Theoretical formulas and explanations dominated the Machine Learning domain in the early 2000s, and ideas such as Support Vector Machines were really popular because a lot could be written about them and they could be explained neatly with mathematical formulas. I have myself made the mistake of working through all the mathematical formulas and theory behind Support Vector Machines from MIT. While it is one of my favourite lectures from Patrick Winston, it did not help me much when I tried to implement a Machine Learning algorithm for one of my projects at General Assembly, Imbalanced Classification for Rare CHD Disease. To my surprise, Random Forests, an idea I was unfamiliar with at the time, worked really well and gave much better results than SVM. What was even more surprising was that ExtraTreesClassifier, another variation on Random Forests, worked even better! If only I had known then what I know now.

So let’s get started!

What are Random Forests?

The first question you might ask is: what are Random Forests, and why bother with them? Random Forests are a class of tree ensembles where decisions are made by a collection of decision trees rather than a single decision maker.
Think of it this way: in a classroom, each student knows a little about the Universe, and what each student knows is a little different from every other student. When the teacher asks a question about the Universe, say the number of planets in the Universe, each student gives his or her best answer. The final answer is the average of all the students' answers. If the answers are very close to each other, we would say we have high confidence in the answer, because each student is pointing to pretty much the same value. However, for a question like the distance between the Earth and the Moon, the answers might vary considerably, and this is when we would say we have less confidence in our answer.
Mathematically speaking, if the standard deviation of the answers is high, confidence is low, and vice versa.
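To make the analogy concrete, here is a minimal sketch of that idea in scikit-learn. It assumes a RandomForestRegressor that has already been fitted (I call it model here, a hypothetical name) and a set of validation features X_valid; each tree in the forest plays the role of one student.

```python
import numpy as np

def forest_prediction_with_confidence(model, X_valid):
    """model: a fitted sklearn RandomForestRegressor; X_valid: validation features."""
    # Each tree in model.estimators_ is one "student" giving its best answer
    all_answers = np.stack([tree.predict(X_valid) for tree in model.estimators_])

    prediction = all_answers.mean(axis=0)  # the class-average answer
    spread = all_answers.std(axis=0)       # high standard deviation => low confidence

    return prediction, spread
```

A house where the spread is large is one where the trees disagree, just like the Earth-to-Moon question, so we trust that prediction less.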

What are Random Forests made of?

Each Random Forest consists of Decision Trees, and each Decision Tree consists of binary splits. The students in the example above are the Decision Trees.

To explain a binary split, I first need to show how a Decision Tree works. For this purpose, I have picked the Ames Housing Data, which can be found at this Kaggle Competition.

In the Kaggle Competition, we are predicting the ‘Sale Price’ of the house using a number of independent variables such as Overall Quality, Living Area, Total Basement Area etc. A complete list of variables can be found here.

If we fit a single Decision Tree of depth 3 on this data, to give us the best possible prediction of Sale Price, the Decision Tree creates a series of binary splits.

Ames Housing Data Decision Tree
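A tree very similar to the one above can be grown with the short sketch below. The file name, the small subset of columns, and the log transform of SalePrice are my assumptions (the node means of roughly 11.8 and 12.3 in the figure suggest the target is log SalePrice); your exact splits and sample counts will depend on how you split the data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, export_text

# Assumed: train.csv downloaded from the Kaggle House Prices competition
df = pd.read_csv("train.csv")

features = ["OverallQual", "GrLivArea", "TotalBsmtSF"]  # small illustrative subset
X = df[features]
y = np.log(df["SalePrice"])  # node means around 11.8-12.3 suggest a log target

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

tree = DecisionTreeRegressor(max_depth=3)  # depth = 3, as in the figure
tree.fit(X_train, y_train)

# Print the learned binary splits, e.g. "OverallQual <= 6.50"
print(export_text(tree, feature_names=features))
```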

Intuitively, this can be thought of as grouping similar houses together, that is, separating expensive houses from cheap ones so that better predictions can be made. Inside the Decision Tree, this process is driven by information gain. At each node, the tree predicts the mean of the target values of the samples present in that node.

Let me try to explain the Decision Tree above. Let's start with the root node, which contains 960 samples. This is the entirety of our training data; no split has been made yet. At this stage our naive model predicts the mean of all the houses as the predicted value for every house in the validation set, and this gives us a Mean Squared Error of 0.164.

The Decision Tree now decides to split on OverallQual at the value 6.5. Two branches are formed: training samples whose OverallQual is less than or equal to 6.5 fall into the upper branch, and the rest fall into the lower branch. 596 samples have OverallQual ≤ 6.5, whereas 364 samples have OverallQual > 6.5. The prediction for every validation house with OverallQual ≤ 6.5 is now the mean of those 596 training values, which is 11.8, whereas for the lower branch it is 12.3. At this stage, the MSE for the upper branch is 0.083 and for the lower branch is 0.092. So with just one split there is already a significant reduction in MSE.
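You can check this arithmetic by hand. Continuing from the variables in the sketch above, the snippet below splits the data at OverallQual ≤ 6.5, predicts the branch mean for each validation house and measures the error; the exact numbers will differ from the figure depending on your train/validation split.

```python
import numpy as np

# Root node: predict one global mean for every validation house
root_pred = y_train.mean()
root_mse = ((y_valid - root_pred) ** 2).mean()

# First split: OverallQual <= 6.5
left = X_train["OverallQual"] <= 6.5
left_pred = y_train[left].mean()     # ~11.8 in the article's run
right_pred = y_train[~left].mean()   # ~12.3 in the article's run

# Error of each branch on the validation set
valid_left = X_valid["OverallQual"] <= 6.5
left_mse = ((y_valid[valid_left] - left_pred) ** 2).mean()
right_mse = ((y_valid[~valid_left] - right_pred) ** 2).mean()

print(f"before split: {root_mse:.3f}")
print(f"upper branch: {left_mse:.3f}, lower branch: {right_mse:.3f}")
```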

Similarly, the Decision Tree continues to split until a single value remains at each leaf, that is, until each observation becomes its own node, unless a stopping condition (such as a maximum depth) is specified.

How does the Decision Tree decide the Variable and Value of split?

So now we know that a Decision Tree makes a binary split at each step and predicts the mean of the values falling in each node. The next question you might ask is: how does a Decision Tree decide which variable to split on in the first place, and how is the value of the split chosen?

How did the Decision Tree know to split OverallQual at 6.5?

To answer this, a Decision Tree tries a split on each variable at each value present for that variable. So our Decision Tree would have tried splitting the data on OverallQual, GrLivArea, TotalBsmtSF and every other variable, at every value of that variable present in the data, and it chose the variable-value pair that gave the best information gain, i.e. the least error.
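In other words, finding a split is a brute-force search. Here is a minimal sketch of that search, again reusing X_train and y_train from the earlier snippet; real implementations are cleverer about candidate thresholds, but the idea is the same.

```python
def best_split(X, y):
    """Try every variable at every observed value and keep the split
    that gives the lowest weighted MSE, i.e. the highest information gain."""
    best_var, best_value, best_score = None, None, float("inf")
    for col in X.columns:
        for value in sorted(X[col].unique()):
            left = X[col] <= value
            if left.all() or not left.any():
                continue  # a useful split must send samples to both sides
            # Size-weighted average of each side's MSE around its own mean
            left_mse = ((y[left] - y[left].mean()) ** 2).mean()
            right_mse = ((y[~left] - y[~left].mean()) ** 2).mean()
            score = (left.sum() * left_mse + (~left).sum() * right_mse) / len(y)
            if score < best_score:
                best_var, best_value, best_score = col, value, score
    return best_var, best_value

print(best_split(X_train, y_train))  # likely OverallQual with a threshold near 6
```

In practice, scikit-learn places the threshold halfway between neighbouring observed values, which is why the figure shows 6.5 rather than 6, but the exhaustive try-everything idea is exactly this.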

How is the Information Gain decided?

Now that we know that the best variable-value combination, the one that gives us the most information gain, is chosen, the next question is: how is information gain calculated?

Well, this may be measured using Gini impurity or entropy (for classification), MSE/RMSE (for regression), or any other loss function specified by the user who implements the algorithm.

Essentially, the split that leads to the maximum reduction in error is considered the one that leads to the maximum information gain, and the loss function is chosen by the user when the Random Forest is initialised.
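In scikit-learn, for example, this choice is exposed through the criterion argument; the exact strings accepted below reflect recent versions and may differ in older releases.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Regression: splits are scored by the reduction in squared error (MSE)
reg = RandomForestRegressor(n_estimators=100, criterion="squared_error")

# Classification: splits are scored by Gini impurity or entropy
clf_gini = RandomForestClassifier(n_estimators=100, criterion="gini")
clf_entropy = RandomForestClassifier(n_estimators=100, criterion="entropy")
```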

Hopefully this gives you a better idea of how Random Forests work under the hood, and it helps you as much as it has helped me realise the power of this idea. Random Forests generalise really well; however, they also have a downside: it is very hard for a Random Forest to predict values that lie outside the range of data it has seen before. I shall be talking about this problem of extrapolation in my next article.

Please feel free to reach out to me via LinkedIn should you have any questions about the explanations presented above, or any suggestions to make them better.
