Decision Trees and Random Forest — Just If-Else Repeatedly

Dharineesh Ram
10 min read · May 25, 2020


When it comes to Machine Learning, we should never forget to talk about Foresting (tree-based) algorithms, as they solve problems in a rather different way from other algorithms. Foresting algorithms can solve both Regression and Classification problems. Not only that: to push accuracy further, we can combine Foresting algorithms with Boosting techniques (like AdaBoost, XGBoost, Gradient Boost and so on). In this article we will go through the basics of Foresting algorithms and their approach to Classification and Regression.

ANALOGY:

Before going into Decision Trees, let's understand what they really mean and how they actually work. Consider that someone gives you a problem to solve. The problem is stated as follows:

In a random health study, it was shown that people who eat pizza are labelled 'Unfit' and people who eat spinach are labelled 'Fit'. Write a program to display whether a person is 'Fit' or 'Unfit' based on their food.

For this problem we would code something simple like this:

# python code
food = input()
if food == 'pizza':
    print("Unfit")
elif food == 'spinach':
    print("Fit")

We have finished this in a simple if-else condition. Now, what if you did not have to write this code at all, and the machine itself generated those if-else statements? Yes, Foresting algorithms basically work in this fashion.

Coming to a practical situation, you cannot confirm whether a person is Fit or Unfit based on only these two foods. Other parameters, such as age, sex, exercise duration, meditation, PSS level and so on, also contribute to the prediction. In that case we would not be able to analyse the data and hand-code the rules as we did above, and that is where Foresting algorithms lend a good hand. The following image shows what a Decision Tree is.

Fig:1

This is just for our understanding; in general, Decision Trees are more complex. But this simple case makes it clear what Decision Trees are.
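To make the "machine writes the if-else itself" idea concrete, here is a minimal sketch (assuming scikit-learn is available; the toy dataset and its 0/1 encoding are made up purely for illustration) that fits a decision tree on the pizza/spinach example and prints the rules it learned:

# Minimal sketch: let a decision tree learn the if-else rules from toy data.
# Assumes scikit-learn is installed; the dataset below is purely illustrative.
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding: food is 0 for pizza, 1 for spinach
X = [[0], [0], [1], [1]]
y = ["Unfit", "Unfit", "Fit", "Fit"]

tree = DecisionTreeClassifier().fit(X, y)

# The learned rules mirror the hand-written if-else logic from above
print(export_text(tree, feature_names=["food"]))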

CONTENTS:

  1. Decision Trees for Regression
  2. Decision Trees for Classification
  3. What makes a Random Forest?
  4. Why to use Boosting Algorithms?

Decision Trees for Regression

Consider the graph below

Fig 2: StatQuest — Josh Starmer

Suppose we need to predict drug effectiveness based on the dosage of the drug; we obtain a Decision Tree as follows:

Fig 3: StatQuest — Josh Starmer

The Decision Tree algorithm has chosen the thresholds (<14.5, ≥29, ≥23.5) based on the dataset (as shown in the graph).

Case 1 (Dosage<14.5):

The leaf value, 4.2%, is calculated as the average of all the data points in the range Dosage < 14.5.

Case 2 (Dosage≥29):

The leaf value, 2.5%, is the average of all the data points in the range Dosage ≥ 29 (these points have already failed the Dosage < 14.5 check).

Case 3 (Dosage ≥ 23.5):

The leaf value, 52.8%, is the average of all the points in the range 23.5 ≤ Dosage < 29. Similarly, the 100% leaf is the average of the data points in the range 14.5 ≤ Dosage < 23.5.
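As a quick illustration of how those leaf values come about, here is a small sketch (using numpy; the data points below are hypothetical and only roughly mimic the shape of the graph, not the actual values in the figures). Each leaf is simply the average of the points that fall in its dosage range:

import numpy as np

# Hypothetical (dosage, effectiveness) points, for illustration only
dosage = np.array([2, 5, 10, 18, 20, 22, 25, 27, 31, 35])
effectiveness = np.array([0, 5, 3, 100, 100, 100, 55, 50, 3, 2])

# Each leaf prediction is the average effectiveness of the points in its region
leaf_1 = effectiveness[dosage < 14.5].mean()                       # Dosage < 14.5
leaf_2 = effectiveness[dosage >= 29].mean()                        # Dosage >= 29
leaf_3 = effectiveness[(dosage >= 23.5) & (dosage < 29)].mean()    # 23.5 <= Dosage < 29
leaf_4 = effectiveness[(dosage >= 14.5) & (dosage < 23.5)].mean()  # 14.5 <= Dosage < 23.5

print(leaf_1, leaf_2, leaf_3, leaf_4)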

But how are we selecting these Threshold values?

Here the algorithm starts picking candidate thresholds from the data points themselves. In the beginning the algorithm takes the two points with the lowest dosages; their average dosage, which works out to 3, becomes the first candidate threshold. The points on each side of the threshold are then summarised by an average: for Dosage < 3 we get 0% effectiveness (there is only one point on the left, as shown in the diagram below), and for Dosage ≥ 3 we get 38.8% effectiveness (the average of all the points on the right side of the diagram below).

Fig 4: StatQuest — Josh Starmer

Now we must calculate the Sum of Squared Residuals for this threshold, i.e. the sum, over every point, of the square of the observed value minus the predicted value:

SSR = Σ (observed - predicted)²

Now calculating for our case we get the following:

Fig 5: StatQuest — Josh Starmer

But that's not the end. Now take the next two points, set their average as the threshold, and calculate the residuals again. The next step would be as follows:

Fig 6: StatQuest — Josh Starmer

In a similar fashion, we need to calculate the sum of squared residuals for all the candidate thresholds. After finding the residuals for all the thresholds, we pick the one having the least sum of squared residuals.

Fig 7
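Here is a rough sketch of that threshold search (numpy assumed; the function name is my own): every average of two consecutive dosages is tried as a candidate threshold, each side is predicted by its own average, and the candidate with the smallest sum of squared residuals wins.

import numpy as np

def best_threshold(dosage, effectiveness):
    # Sort the points by dosage so that consecutive pairs are neighbours
    order = np.argsort(dosage)
    dosage, effectiveness = dosage[order], effectiveness[order]

    best_t, best_ssr = None, np.inf
    for i in range(len(dosage) - 1):
        t = (dosage[i] + dosage[i + 1]) / 2  # candidate threshold: average of two consecutive dosages
        left = effectiveness[dosage < t]
        right = effectiveness[dosage >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        # Each side is predicted by its own average; residuals are measured against that
        ssr = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ssr < best_ssr:
            best_t, best_ssr = t, ssr
    return best_t, best_ssr

On data shaped like the graph above, the returned threshold is the one used for the first split.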

Hence 14.5 is set as the threshold and the split is made. We could keep splitting and make the tree as deep as possible, but that would lead to a high-variance model: the training set would be fitted perfectly, yet the model would fail to answer well on the test set.

To avoid that situation we can require a minimum number of data points before a node is allowed to split. Suppose we choose 7 as the minimum number of data points; then the algorithm will not split any node containing fewer than 7 points. E.g., consider our case:

Fig 8

The left side will not be split further as there are only 6 data points. But the right side has more than 7 points, so we need to split it further. This increases the depth of the tree, and we finally obtain a tree like the one in Figure 3.
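If you prefer not to hand-roll this restriction, it is available as a hyperparameter in libraries. A minimal sketch with scikit-learn's DecisionTreeRegressor (the data below is again hypothetical) would look like this:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data, reshaped into a single-feature column for scikit-learn
X = np.array([2, 5, 10, 18, 20, 22, 25, 27, 31, 35]).reshape(-1, 1)
y = np.array([0, 5, 3, 100, 100, 100, 55, 50, 3, 2])

# min_samples_split=7: nodes with fewer than 7 points become leaves instead of splitting
reg = DecisionTreeRegressor(min_samples_split=7).fit(X, y)
print(reg.predict([[15]]))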

What if there is more than one independent variable? Which variable should we choose for the decision node?

When there is more than one independent variable, we need to choose the one that helps us take the best decision.

Consider that we have Dosage, Age and Sex as our independent variables. We repeat the same steps as above and find the Sum of Squared Residuals for each of these independent variables. Finally we choose the one which has the lowest Sum of Squared Residuals.

Fig: 9

In our case the SSR (Sum of Squared Residuals) of Age is the lowest. Hence Age is chosen as the decision node, together with the threshold that gave this least residual.
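A sketch of that feature comparison, reusing the best_threshold() function from the earlier snippet (X is assumed to be a numeric 2-D numpy array, so a categorical column like Sex would need to be encoded as 0/1 first):

def best_feature(X, y, feature_names):
    # For every feature, find its best threshold and the SSR that threshold achieves
    results = {}
    for j, name in enumerate(feature_names):
        threshold, ssr = best_threshold(X[:, j], y)
        results[name] = (threshold, ssr)
    # The decision node is the feature whose best split has the lowest SSR
    return min(results.items(), key=lambda item: item[1][1])

Called as best_feature(X, y, ['Dosage', 'Age', 'Sex']), it would return the winning feature together with its threshold; in the example above that would be Age.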

Decision Trees for Classification

This is similar to Regression Trees but comparatively easier. There are several criteria for deciding which independent variable is the most important one to split on (meaning, the decision node). Some of them are:

  • ID3
  • Gini index
  • Chi-Square

But we will focus only on Gini Index.

Consider the following data:

Fig: 10

Consider predicting if a person has Heart Disease or not based on the independent variables:

  1. Chest Pain
  2. Good Blood Circulation
  3. Blocked Arteries

We need to find which one will be considered as the decision node.

Case 1:

Consider that Chest Pain is chosen as the decision node. Just count the number of people who

  • got Heart Disease and had Chest Pain.
  • did not get Heart Disease but had Chest Pain.
  • got Heart Disease without having Chest Pain.
  • did not get Heart Disease and did not have Chest Pain.

Confusing? Check the following diagram. The left node represents the people who have Chest Pain; the right node represents the people who do not have Chest Pain.

Fig: 11

GINI INDEX:

It is given by the following formula

Gini index = 1 - p² - q²

'p' refers to the probability of 'yes' and 'q' refers to the probability of 'no'.
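In code this is a one-liner; the counts below are hypothetical and only show the calculation:

def gini(p_yes, p_no):
    # Gini index of a single node: 1 - p^2 - q^2
    return 1 - p_yes ** 2 - p_no ** 2

# Example with hypothetical counts: 100 people with heart disease, 40 without
n_yes, n_no = 100, 40
total = n_yes + n_no
print(gini(n_yes / total, n_no / total))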

Using this formula, we measure the Gini Index of the left node and then of the right node. Now we need to find the Gini Index for Chest Pain as a whole, which is calculated as the weighted average of the two node impurities, weighted by the number of people in each node:

Fig: 12
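A small sketch of that calculation (the leaf counts here are hypothetical, not the ones in the figure): the Gini Index of the split is the weighted average of the two leaf impurities, weighted by how many people end up in each leaf.

def gini_of_split(left_counts, right_counts):
    def node_gini(yes, no):
        total = yes + no
        return 1 - (yes / total) ** 2 - (no / total) ** 2

    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    # Weighted average of the two leaf Gini indices
    return (n_left / n) * node_gini(*left_counts) + (n_right / n) * node_gini(*right_counts)

# Hypothetical (heart disease, no heart disease) counts for the two Chest Pain leaves
print(gini_of_split((100, 40), (35, 125)))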

We have now done this for Chest Pain. Similarly we need to do it for all the independent variables. We obtain the following:

Fig: 13

Among all three, we choose the one with the minimum Gini Impurity (Index). Here Good Blood Circulation is chosen as the Decision Node.

Now we need to repeat the following steps to perform the splits for the tree:

  1. Calculate all of the Gini Impurities
  2. If the node itself has the lowest score, there is no point in separating the patients any more and it becomes a leaf node.
  3. If separating the data results in an improvement, then pick the separation with the lowest impurity value.

Here we handled data whose columns were classes (meaning that Chest Pain, Good Blood Circulation and Blocked Arteries each take the classes 'Yes' or 'No'). What if a column held continuous values (e.g. the weight of a person)? I would like to leave that to the reader to explore.

What makes a Random Forest?

So far, we have discussed building a single decision tree. But do you think this algorithm will always do its best? No, and we can broaden our approach. Think about a dance show on TV. The dance is performed by a variety of people from various countries, and the show has 5 judges to evaluate the best dance team. But why does the show have 5 judges? It could have just one, right? I hope you get the point: we want the show/competition to be fair, and a decision made by more than one person ensures a better judgement than one made by a single judge.

Similarly, we need not stick to only one Decision Tree. We can create a group of Decision Trees, which is called a Random Forest. Just like the judges in the show, these Decision Trees together help to make a good decision (i.e. prediction).

Why is it called Random Forest?

From all the training examples in the dataset, we first draw a random sample with replacement; this is called the Bootstrapped Dataset. After creating this bootstrapped dataset, we need to create a decision tree. But this time we will not be using all the independent variables to create the tree; rather, we take only a random subset of them. Then we perform the same steps as discussed above for creating a decision tree.

For each tree, we create a new random bootstrapped dataset and build a decision tree on it. Keep repeating these steps for n trees (generally 100 trees is a good number), and our Random Forest is ready.
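Putting those two ideas (bootstrapping the rows, a random subset of the columns) together, a bare-bones sketch might look like the following. It assumes numpy and scikit-learn, follows the per-tree feature subset described above for simplicity (library implementations such as scikit-learn's RandomForestClassifier actually re-sample the candidate features at every split), and the function name is my own.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_forest(X, y, n_trees=100, n_features=None):
    rng = np.random.default_rng(0)
    n_samples, n_total_features = X.shape
    # A common default is to consider roughly sqrt(total features) per tree
    n_features = n_features or max(1, int(np.sqrt(n_total_features)))

    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n_samples, size=n_samples)  # bootstrap: sample rows WITH replacement
        cols = rng.choice(n_total_features, size=n_features, replace=False)  # random feature subset
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))  # remember which features each tree saw
    return forest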

Now time to evaluate:

Now we take a sample and run it through our random forest. We count the votes of the trees and conclude that the predicted class is the one with the maximum votes, as shown below:

Out of 100 trees, 82 trees have predicted that the patient will have heart disease. Hence, our random forest algorithm will predict “Yes” in this case. This is how a random forest works.
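Continuing the sketch from the previous section, the vote counting itself is only a few lines (the forest argument here is the list of (tree, cols) pairs built above, and x is a single sample as a 1-D numpy array):

from collections import Counter

def forest_predict(forest, x):
    # Each tree votes using only the features it was trained on
    votes = [tree.predict(x[cols].reshape(1, -1))[0] for tree, cols in forest]
    # The class with the most votes is the forest's prediction
    return Counter(votes).most_common(1)[0][0]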

Why to use Boosting Algorithms?

Till now we have spoken about Decision Trees and Random Forests. But that's not the end. I said at the beginning that Foresting algorithms are not like other basic Machine Learning algorithms; there are several more details to discuss.

The first thing is that we can create a Random Forest that fits our training set exactly but does very badly at predicting new data. This is quite a common problem with Foresting algorithms and must be avoided. It can be tackled in several ways: we can increase the minimum number of data points per split, reduce the depth of the trees, and fine-tune several other parameters.

But a common way to sort out such problems is to go with Boosting techniques. The major problem with random forests is that each decision tree is built independently of the others (i.e. all the trees are built randomly and there is no relation between one tree and another).

But in boosting algorithms, each decision tree is created by learning from the errors of the previous decision tree. There are several boosting techniques, such as:

  • AdaBoost
  • Gradient Boost
  • XGBoost

and a lot more. By the way, people use these kinds of boosting techniques to win Kaggle competitions. Hence Foresting algorithms, combined with boosting, help to achieve good accuracy on some challenging datasets.
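To make the "learn from the previous tree's errors" idea concrete, here is a bare-bones sketch of the boosting loop for regression. It is not the full AdaBoost/XGBoost machinery; it assumes numpy and scikit-learn, and the function name and defaults are my own. Each new shallow tree is fit on the residuals left behind by the trees before it.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_trees=100, learning_rate=0.1):
    base = y.mean()                          # start from the plain average
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction                                   # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # a small tree fit on the errors
        prediction = prediction + learning_rate * tree.predict(X)    # nudge the predictions a little
        trees.append(tree)
    return base, trees

A new point is then scored as base plus learning_rate times the sum of the trees' predictions for it. The contrast with a random forest is that the trees are built sequentially, each one depending on its predecessors.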

Conclusion

It is very important for us to know how to approach a problem. It all starts with the dataset: visualising and understanding the data well is very important. Once we have understood the key points of a dataset, we need to decide on a good algorithm for it. Once we have decided on the algorithm, we are almost done: just preprocess the data and build the model.

“Start playing with the data. Follow the norms of ML, then the predictions become real”



