In this post, we are going to talk about one of the most important algorithms in machine learning: Decision Trees. They are important because without them, state-of-the-art ensemble methods like Bagging and Boosting wouldn't have been invented.
First, let's list the pros and cons, and then walk through them one by one.
Pros:
1- Can be used for both classification and regression
2- Can be displayed graphically, so they are easily interpretable
3- Non-parametric
4- Features don't need scaling
5- Automatically account for interactions
Cons:
1- Tend not to perform as well as state-of-the-art ML models
2- Recursive binary splitting makes "locally optimal" decisions that may not produce a globally optimal tree (at each step it picks the single best split; a different combination of splits might give a better overall tree, but the algorithm never considers that)
3- Prone to overfitting
Let’s discuss the advantages in more detail:
1- Decision Trees can be used for both regression and classification.
This is a scatter plot of a dataset with two features:
The idea of a decision tree is to figure out how to split your dataset into distinct, non-overlapping regions. The prediction shown in each region is the mean of the responses in that region, and it comes from the terminal nodes. (I will explain this terminology shortly.)
For instance, look at the decision tree in the following picture
The first question it considers is whether X2 is less than 0.302548. If yes, it asks whether X1 is less than 0.800113; if no, whether X1 is less than 0.883535. Every question has a True/False answer, and each answer leads to a separate branch. Whatever the answers, we eventually reach a prediction (a leaf node). You start at the root node at the top and work your way down the tree, answering the questions along the way.
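This root-to-leaf walk is just a chain of True/False questions. Here is a minimal sketch of the tree described above as plain Python. The thresholds are the ones from the figure; the leaf predictions are hypothetical placeholder values, since the actual values in the figure are not given in the text.

```python
def predict(x1, x2):
    # Root node: is X2 less than 0.302548?
    if x2 < 0.302548:
        # "Yes" branch: is X1 less than 0.800113?
        if x1 < 0.800113:
            return 0.93   # leaf node (hypothetical prediction)
        return 0.59       # leaf node (hypothetical prediction)
    # "No" branch: is X1 less than 0.883535?
    if x1 < 0.883535:
        return 0.21       # leaf node (hypothetical prediction)
    return 0.74           # leaf node (hypothetical prediction)

print(predict(0.5, 0.1))  # follows root -> yes -> yes, so 0.93
```

Any input falls into exactly one of the four leaves, which is precisely what "distinct and non-overlapping regions" means.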
The first question that comes to mind is: how do we decide on the best split? This is where the cost function comes in.
The cost function used to train regression trees is the Mean Squared Error (MSE):

MSE = (1/n) * Σ (yi − ŷ)²

The best split is the one where the overall squared error across the two regions is lowest:

Σ_{xi ∈ R1} (yi − ŷR1)² + Σ_{xi ∈ R2} (yi − ŷR2)²

I know these formulas look a little intimidating, so let me explain how they work. After each candidate split, we take the average of the responses on each side of the split (R1 and R2), which gives us a value for each side (ŷR1 for R1 and ŷR2 for R2). Next, for every data point we subtract the average of its own side and square the difference (for R1 this is (yi − ŷR1)²), then sum up all these values. Wherever this total is lowest is the best place to split.
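The procedure above can be sketched in a few lines of plain Python for a single feature: try every candidate threshold, compute the mean of each side, sum the squared errors, and keep the split with the lowest total. The data below is a made-up toy example, not from the post.

```python
def best_split(xs, ys):
    """Return (threshold, total squared error) of the best binary split."""
    best = (None, float("inf"))
    for t in sorted(set(xs))[1:]:             # candidate thresholds
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        sse = 0.0
        for region in (left, right):
            mean = sum(region) / len(region)  # y-hat for this region
            sse += sum((y - mean) ** 2 for y in region)
        if sse < best[1]:                     # keep the lowest overall error
            best = (t, sse)
    return best

# Two well-separated clusters: the split at x = 10 separates them cleanly.
xs = [1, 2, 3, 10, 11, 12]
ys = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
print(best_split(xs, ys))
```

A real implementation repeats this search over every feature and every region, but the core comparison is exactly this one.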
Note: the approach is top-down and greedy. Top-down means you go from big to small: you start with the whole space and can only split regions that resulted from earlier splits, so each region gets smaller and smaller (with many splits you tend to overfit).
Greedy means the algorithm picks the best split at that particular moment in time, without looking ahead.
The difference between classification and regression is the nature of the output: in classification the outputs are categories (discrete variables), while in regression you are predicting a continuous variable.
This is again a scatter plot of only 2 features of the iris dataset:
The X-axis shows petal width, the Y-axis shows petal length, and the target variable is the flower species, shown by color (three shades of purple for the three species).
The question is: what is the best split? This leads us to the two cost functions for classification trees. The first is Gini impurity, which ranges from 0 to 0.5 (for two classes), and the second is Entropy, which ranges from 0 to 1. The main idea behind both is to maximize purity after the split: the algorithm tries different splits, measures the Gini or Entropy of each, and picks the one that gives the purest regions.
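Both measures can be written directly from the class proportions in a node: Gini is 1 − Σp², and entropy is −Σp·log₂(p). Here is a small sketch; both are 0 for a perfectly pure node and reach their maximum (0.5 and 1.0 for two classes) when the classes are evenly mixed.

```python
import math

def gini(ps):
    # ps: class proportions in a node, summing to 1
    return 1.0 - sum(p * p for p in ps)

def entropy(ps):
    # skip p == 0 terms, since 0 * log2(0) is taken to be 0
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # most impure two-class node
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # perfectly pure node
```

To score a candidate split, you compute the impurity of each child node and take the weighted average; the best split is the one that lowers it the most.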
Splits using Gini:
On the scatter plot, those splits look like this:
2- They can be displayed graphically. Therefore, they are easily interpretable: as long as the tree is not too deep, it can be visualized directly. Since we are talking about how to display Decision Trees graphically, let me introduce some useful terminology:
- Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
- Splitting: It is a process of dividing a node into two or more sub-nodes.
- Decision Node: When a sub-node splits into further sub-nodes, then it is called a decision node.
- Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
- Pruning: Removing the sub-nodes of a decision node is called pruning. You can think of it as the opposite of splitting.
- Branch / Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
- Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its children.
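This vocabulary maps naturally onto a data structure. Below is a minimal sketch (the class and field names are my own, purely illustrative): a decision node stores the question (feature index and threshold) plus its two child sub-nodes, while a leaf/terminal node stores only a prediction.

```python
class Node:
    def __init__(self, feature=None, threshold=None,
                 left=None, right=None, prediction=None):
        self.feature = feature        # which feature the split asks about
        self.threshold = threshold    # cut point of the True/False question
        self.left = left              # child sub-node for "yes" (feature < threshold)
        self.right = right            # child sub-node for "no"
        self.prediction = prediction  # set only on leaf/terminal nodes

    def is_leaf(self):
        return self.prediction is not None

# Root node with two children: one leaf and one further decision node.
root = Node(feature=0, threshold=0.5,
            left=Node(prediction="A"),
            right=Node(feature=1, threshold=0.3,
                       left=Node(prediction="B"),
                       right=Node(prediction="C")))

def predict(node, x):
    while not node.is_leaf():         # walk from the root node down to a leaf
        node = node.left if x[node.feature] < node.threshold else node.right
    return node.prediction

print(predict(root, [0.9, 0.1]))  # root -> "no" branch -> "yes" leaf -> "B"
```

Pruning, in these terms, just means replacing a decision node's sub-tree with a single leaf.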
3- Non-parametric means the algorithm makes no assumptions about the underlying distribution of the data.
4- Features don't need scaling, because the algorithm only looks for the best place to split your data; rescaling a feature doesn't change where that split falls.
5- They automatically account for interactions: each decision along the path depends on which side of a threshold a feature falls, so different combinations of features being above or below certain values lead to different predictions.