Why Implement a Random Forest Over a Decision Tree?
“You’re either at the table, or on the menu.”
Comparing algorithms' efficiency & effectiveness in achieving business goals is a must.
Algorithms keep evolving, and every so often a new one is invented or optimized to help us improve on or replace an algorithm we are currently using.
You can access the resources I explored here & here.
If you aren't familiar with decision trees, please check out this article.
You will see a sum-up of what I learned from those resources & an elaboration on the random forest algorithm, so you can grasp the topic much more easily.
Random Forest VS Decision Tree
First of all, you should know that random forests are built from decision trees.
“Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy” — The Elements of Statistical Learning
So decision trees fit our training data perfectly, but they show some inaccuracy when exposed to new data (testing data). Meanwhile, one of the main pros of random forests is that they combine the simple workflow of decision trees with the ability to improve accuracy.
The question now: what mechanism does a Random Forest follow?
Step 1: Bootstrapped Dataset Creation
Let's say we have this dataset. We randomly select records from the original dataset in order to get a new one (bootstrapped) of the same size as the original. Note that the same record can be selected more than once.
ID  Writable  Updated  Size   Class
1   yes       no       small  infected
2   yes       yes      large  infected
3   no        yes      med    infected
1   yes       no       small  infected
4   no        no       med    clean
6   no        no       large  clean
We ignored the 5th record and used the 1st record twice.
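The bootstrapping step can be sketched in a few lines of Python (the records here are a hypothetical stand-in for the table above):

```python
import random

def bootstrap(dataset):
    """Draw len(dataset) records with replacement from the original dataset."""
    return [random.choice(dataset) for _ in range(len(dataset))]

# Hypothetical records: (id, writable, updated, size, class)
records = [
    (1, "yes", "no",  "small", "infected"),
    (2, "yes", "yes", "large", "infected"),
    (3, "no",  "yes", "med",   "infected"),
    (4, "no",  "no",  "med",   "clean"),
    (5, "yes", "no",  "large", "clean"),
    (6, "no",  "no",  "large", "clean"),
]

sample = bootstrap(records)
print(len(sample))  # same size as the original: 6
```

Because we sample with replacement, some records appear multiple times and others (like record 5 above) may not appear at all.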
Step 2: Creating a decision tree from the bootstrapped dataset, using a random subset of variables at each step
Here we will take Writable & Updated as an example and build a decision tree from them, and we will repeat this random selection of features at each node of the tree.
Now we iterate on steps 1 and 2 to generate multiple trees.
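The per-node feature sampling can be sketched as follows (the feature names are assumed from the example dataset):

```python
import random

def candidate_features(all_features, m):
    """At each node, consider only a random subset of m features for the split."""
    return random.sample(all_features, m)

# Hypothetical feature names from the example dataset.
features = ["writable", "updated", "size"]

# In a real tree-growing loop this runs at every node, so different
# nodes (and different trees) end up splitting on different features.
print(candidate_features(features, 2))
```

This restriction is what de-correlates the trees: no single strong feature can dominate every split in every tree.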
Step 3: Using the random forest we just created
To test it, we need a new sample to feed it.
For simplicity, let's assume we have only 4 trees.
1st tree: predicts => infected
2nd tree: predicts => infected
3rd tree: predicts => clean
4th tree: predicts => infected
We track each tree's prediction and assign the most frequent one to our test sample: infected got 3 votes and clean got 1, so the final prediction is infected.
Note: bootstrapping the data + using the aggregate of the predictions to make a decision is called Bagging.
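The voting step can be sketched like this (the four votes mirror the example above):

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate the trees' predictions: the most common label wins."""
    return Counter(predictions).most_common(1)[0][0]

# Predictions of the four trees from the example above.
votes = ["infected", "infected", "clean", "infected"]
print(majority_vote(votes))  # -> infected
```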
Step 4: Evaluating
If you remember from step 1, we ignored the 5th sample; it never entered the 1st bootstrapped dataset.
Note: the samples left out this way are called the Out-of-Bag Dataset.
To evaluate the random forest we just made, we run it on records that weren't present in a bootstrapped dataset and check whether it classifies them correctly (in our case we are predicting the CLASS).
As before, let's assume that we have only 4 trees:
1st tree: predicts => clean
2nd tree: predicts => infected
3rd tree: predicts => clean
4th tree: predicts => clean
3 out of 4 trees predict it correctly, so the random forest correctly classifies the out-of-bag sample!
The same procedure is applied to the remaining out-of-bag samples.
In conclusion, we can measure how accurate the random forest is by the proportion of out-of-bag samples it classifies correctly.
Note: the proportion of out-of-bag samples that are wrongly classified is called the Out-of-Bag Error.
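A minimal sketch of computing the out-of-bag error, assuming we have collected (true label, prediction) pairs for the out-of-bag samples:

```python
def oob_error(results):
    """results: (true_label, forest_prediction) pairs for out-of-bag samples."""
    wrong = sum(1 for truth, pred in results if truth != pred)
    return wrong / len(results)

# Hypothetical out-of-bag results: one of three samples misclassified.
results = [("clean", "clean"), ("infected", "infected"), ("clean", "infected")]
print(oob_error(results))
```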
Back in step 2, we took Writable & Updated as an example in order to create our decision trees, and in step 4 we calculated the out-of-bag error of the resulting forest. Long story short: to boost the forest's accuracy, we can repeat step 2 with a different number of features per split, for example 3 or 4 instead of 2.
Then we choose the most accurate random forest, i.e. the one with the lowest out-of-bag error.
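In scikit-learn this tuning loop is a few lines: RandomForestClassifier exposes a max_features parameter and an oob_score_ attribute (1 minus the out-of-bag error). The toy dataset here is an assumption, standing in for real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data standing in for the infected/clean dataset.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Compare forests that consider 2, 3, or 4 features per split,
# using the out-of-bag score to pick the best.
for m in (2, 3, 4):
    forest = RandomForestClassifier(
        n_estimators=100, max_features=m,
        oob_score=True, random_state=0,
    ).fit(X, y)
    print(m, round(forest.oob_score_, 3))
```

The value of max_features that gives the highest oob_score_ is the one to keep.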
To sum up, that's why a random forest is more accurate than a single decision tree: having many trees vote is much better than relying on a single one.
Thanks for your time and let’s boost our knowledge!