Why Implement a Random Forest Over a Decision Tree?

Amjad El Baba
4 min read · Jan 5, 2022


“You’re either at the table, or on the menu.”

Comparison is a must when it comes to an algorithm's efficiency & effectiveness in achieving business goals.

Algorithms are evolving, and every so often a new one is invented or optimized to help us surpass or replace an algorithm we are currently using.

You can access the resources I explored here & here.

If you don’t know decision trees yet, please check out this article first.

You will see a sum-up of what I learned from those resources & an elaboration on the random forest algorithm, so you can find the topic much easier to grasp.

Random Forest vs. Decision Tree

(Source: towardsdatascience.com)

First of all, you should know that random forests are built from decision trees.

“Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy” — The Elements of Statistical Learning

So decision trees are perfect when we apply our training data to them, but they show some inaccuracy when exposed to new data (testing data). Meanwhile, one of the main pros of random forests is that they combine the simple workflow of decision trees with the ability to improve accuracy.

The question now is: what mechanism does a Random Forest follow?

Step 1: Bootstrapped Dataset Creation

(Image: the original dataset, source: users.sussex.ac.uk)

Let's say we have this dataset. We will randomly select records from the original dataset in order to get a new one (the bootstrapped dataset) with the same size as the original. Note that we can use the same record more than once.

ID   Writable   Updated   Size    Class
1    yes        no        small   infected
2    yes        yes       large   infected
3    no         yes       med     infected
1    yes        no        small   infected
4    no         no        med     clean
6    no         no        large   clean

We ignored the 5th sample and used the 1st sample twice.
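To make the sampling concrete, here is a minimal Python sketch of building a bootstrapped dataset with pandas. The column names mirror the table above, and record 5's values are made up for illustration since they don't appear in the article.

```python
import pandas as pd

# Toy dataset mirroring the table above
# (record 5's values are invented here purely for illustration)
data = pd.DataFrame({
    "Writable": ["yes", "yes", "no", "no", "yes", "no"],
    "Updated":  ["no",  "yes", "yes", "no", "yes", "no"],
    "Size":     ["small", "large", "med", "med", "small", "large"],
    "Class":    ["infected", "infected", "infected", "clean", "clean", "clean"],
}, index=[1, 2, 3, 4, 5, 6])

# Sample rows with replacement until the bootstrapped set is as large as the original;
# some records may appear more than once, others may be left out entirely.
bootstrapped = data.sample(n=len(data), replace=True, random_state=42)
print(bootstrapped)
```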

Step 2: Creating a decision tree from the bootstrapped dataset, using a random subset of variables at each step

Here we will take Writable & Updated as an example and create a decision tree from them, and we will repeat this random choice of features at each node of the tree.

Now we should iterate on step 1 and step 2 to generate multiple trees.
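A rough sketch of that loop, assuming scikit-learn and NumPy arrays X (features) and y (labels): each tree gets its own bootstrapped sample, and the max_features parameter of DecisionTreeClassifier limits how many randomly chosen features are considered at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_forest(X, y, n_trees=100, max_features="sqrt", random_state=0):
    """Grow n_trees trees, each on its own bootstrapped sample, with a
    random subset of features considered at every split."""
    rng = np.random.default_rng(random_state)
    forest = []
    for _ in range(n_trees):
        # Step 1: bootstrapped dataset (draw row indices with replacement)
        idx = rng.integers(0, len(X), size=len(X))
        # Step 2: a tree that only considers `max_features` candidates per split
        tree = DecisionTreeClassifier(
            max_features=max_features,
            random_state=int(rng.integers(0, 1_000_000)),
        )
        tree.fit(X[idx], y[idx])
        forest.append(tree)
    return forest
```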

Step 3: Using the random forest we just created

To test it, we need a new sample to feed it.

For simplicity's sake, let's assume that we have only 4 trees.

1st tree: predicts => infected

2nd tree: predicts => infected

3rd tree: predicts => clean

4th tree: predicts => infected

We keep track of each tree's prediction and assign the most frequent prediction to our test sample: infected got 3 votes and clean got 1, so the final prediction will be infected.
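The vote itself is a one-liner; a minimal sketch with the four predictions above:

```python
from collections import Counter

predictions = ["infected", "infected", "clean", "infected"]  # one prediction per tree
final = Counter(predictions).most_common(1)[0][0]
print(final)  # infected (3 votes against 1)
```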

Note: bootstrapping data + using the aggregate to make a decision is called Bagging.

Step 4: Evaluating

If you remember from Step 1, we ignored the 5th sample; it didn't enter the bootstrapped dataset.

Note: the samples left out this way form the Out-of-Bag Dataset.

In our case, that is the 5th record.

To evaluate the random forest we just built, we test it on the records that weren't present in the bootstrapped dataset and check whether it classifies them correctly (in our case, we are predicting the CLASS).

As before, let us assume that we have only 4 trees:

1st tree: predicts => clean

2nd tree: predicts => infected

3rd tree: predicts => clean

4th tree: predicts => clean

3 out of 4 trees predict it correctly, so the random forest correctly classifies the out-of-bag sample!

The same procedure is applied to the remaining out-of-bag samples.

In conclusion, we can tell how accurate the random forest is from the proportion of out-of-bag samples it classifies correctly.

Note: the proportion of out-of-bag samples that are wrongly classified is called the Out-of-Bag Error.
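In practice you rarely compute this by hand; for instance, scikit-learn's RandomForestClassifier can report it directly when trained with oob_score=True (a sketch, assuming X and y hold your features and labels):

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
rf.fit(X, y)

print("Out-of-bag accuracy:", rf.oob_score_)      # proportion classified correctly
print("Out-of-bag error:   ", 1 - rf.oob_score_)  # proportion classified wrongly
```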

Back to Step 2: we took Writable & Updated as an example in order to create multiple decision trees, and in Step 4 we calculated the out-of-bag error of the random forest we created. Long story short, if we want to boost the accuracy of our forest, we can repeat Step 2 with a different number of features considered at each step, for example 3 or 4 instead of 2.

Then, we can choose the most accurate random forest based on the lowest out-of-bag error.
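A sketch of that tuning loop, again assuming scikit-learn: try a few values for the number of features considered at each split and keep the forest with the lowest out-of-bag error.

```python
from sklearn.ensemble import RandomForestClassifier

best_rf, best_oob_error = None, 1.0
for n_features in [2, 3, 4]:  # number of features tried at each split
    rf = RandomForestClassifier(n_estimators=100, max_features=n_features,
                                oob_score=True, random_state=0)
    rf.fit(X, y)
    oob_error = 1 - rf.oob_score_
    if oob_error < best_oob_error:
        best_rf, best_oob_error = rf, oob_error

print("Best out-of-bag error:", best_oob_error)
```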

To sum up, that's why using a random forest is much better and more accurate than a decision tree: having multiple toys to play with is much better than having a single one.

Thanks for your time and let’s boost our knowledge!

