Decision trees are some of the most popular ML algorithms used in industry, as they are quite interpretable and intuitive. Indeed, they mimic the way people logically reason.
The basic recipe of any decision tree is very simple: we start electing as root one feature, split it into different branches which terminate into nodes, and then, if needed, proceed with further splitting on other features. Finally, we go back and “prune the leaves” to try to reduce overfitting.
In this article, I’m not going to dwell on the final procedure of ‘pruning leaves’, since I want to focus on the splitting criteria: how can your algorithm decide which feature needs to be elected as root? There are different answers to this question, depending on the decision tree algorithm you are going to employ. Here, I will explain the typical criterion of ID3 decision tree, that is the Information Gain criterion.
Hence, we need to first introduce the concept of information.
Information is determined by observing the occurrence of an event. Namely, given a random variable X with possible outcomes x1,…….,xn and their respective probabilities of occurrence p1,…….,pn, the information we want to gain is the value of that variable X.
The size or amount of information we are provided with is measured by bits, and there is a very intuitive formula to derive that amount.
Indeed, given a random variable X and one of its possible output xi with probability pi, then the information content of a message specifying X=xi is:
If we dwell on this formula, we realize it makes an intuitive sense. Let’s consider a certain event, with a probability of occurrence equal to 1.
It means that the amount of information you need in order to know the value of our certain occurrence is equal to zero, and it makes sense since it’s a certain event!
Now if we had many events x1,…….,xn with probability p1,…….,pn what is their mean information? It’s simply the average of the information content of each observation weighted by the probability of its occurrence:
Now that we introduced the concept of entropy, we can define the splitting criterion of our tree. Note that there are different strategies you can build your tree, depending on the algorithm you are going to employ. However, here I will explain the typical criterion of ID3 decision tree, that is the Information Gain (IG) criterion, since it is, in my opinion, the most intuitive.
The idea of Information Gain criterion is that we start splitting our tree from that feature which returns the highest information gain (that is, in poor words, the difference in entropy before and after the splitting procedure).
Let’s see an example with the following dataset:
We want to classify the e-mail we receive as Spam or Not Spam, depending on their object, sending time and length of the message. In order to decide which attribute we should elect as the root of our tree, we have to follow these steps:
- Compute the Entropy of our target variable:
- Compute the Entropy with respect to each feature/attribute:
- Compute the Information Gain (IG) for each feature:
Well, now we can use our dataset to show the full procedure. I will dwell on each step just for the first feature, ‘Object’, then for the remaining ones I will display only the results. However, keep in mind that the procedure is exactly the same.
So the first thing to do is creating the relative frequency table for our feature:
Then, we compute the entropy for both the values of our attribute, ‘job’ and ‘publicity’:
Moving on, we compute the Entropy with respect to our Object feature:
Finally, we compute the Information Gain of our feature:
So, if we elect as root of our tree the feature ‘Object’, we will obtain 0.04 bits of information. If we repeat this procedure also for ‘Sending time’ and ‘Length of message’, we obtain, respectively, 0.01 and 0.11. Hence, we now know that our tree should be split starting from ‘Length of message’. We can visualize this first stage of our tree, as well as the two new datasets we are provided after splitting our feature:
As you can see, we now have two branches. A branch with entropy equal to 0 should be converted into a leaf node: no further splits are needed in that case. On the contrary, a branch with entropy greater than 0 needs further splitting. Then the same reasoning is repeated for the new nodes until we reach our final output. Namely, it resulted that, for branch ‘short’, splitting the feature ‘Object’ guarantees a greater IG than ‘Sending time’ (0.025 versus 0.015).
The same reasoning is valid for each decision node, so you can proceed with your computations of Information Gain and split accordingly.
Now let’s apply this concept while building our Decision Tree algorithm on our well-known dataset Iris:
Now let’s install the tree-classifier package and train our model on our data:
tree<- rpart(Species~., data = data_train, method = 'class')
As you can see, our model decided to elect as root the feature petal.Length: that’s because it was the feature whose splitting guarantee the highest gain of information. Indeed, by using as threshold the value 2.5, one branch (<2.5) has already become a leaf and no further splittings are required. On the other hand, the other branch (>2.5) still requires a further splitting on the feature Petal.Width: once done that, the model was able to classify all the data in our three classes.
As I anticipated, Information Gain is not the only splitting criterion you can set for your decision tree (to quote some other criteria, we have Gini coefficient, Chi-squared, and Reduction in Variance criteria). Furthermore, once built, your tree needs to be optimized in order to reduce overfitting (the so-called ‘leaves pruning’ phase).
However, Information Gain is definitely a valid criterion and it is probably the most intuitive.