A-Z of Decision Trees
Part 1: Key Terms and Intuition behind Decision Trees
Hello my dear philomaths!! Many of us understand the concepts behind a decision tree, but few of us grasp the math behind it.
In this two-part tutorial, I will explain the basics of a decision tree: how it works, its advantages and disadvantages, how to construct one, and related ideas such as overfitting and underfitting. Let's get started.
What is a Decision Tree??
A Decision Tree is a supervised Machine Learning algorithm that can be used for both Classification and Regression. It is non-parametric in the sense that it makes no assumption about the underlying distribution of the dataset. A decision tree learns a set of if-then-else rules from your dataset to arrive at a prediction. The deeper the tree, the more likely the model is to overfit (more on that later). You can think of it simply as a flowchart.
Geometric Intuition of a Decision Tree
For a multi-dimensional dataset (multiple features), the decision tree is split at each level based on one of the features.
For example, for the IRIS dataset below, the tree can split on sepal length, sepal width, petal length or petal width at each subsequent level, until we reach a conclusion as to which class label to assign (refer to Fig 2).
However, a question might arise: why did we choose petal length for the initial split and not petal width? We will know soon enough.
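As a concrete sketch, here is how such a tree could be fit to the Iris dataset with scikit-learn (the library and the `max_depth=3` setting are illustrative choices, not part of the article; the exact splits the learner picks depend on the implementation and its criterion):

```python
# Fit a small decision tree to the Iris dataset and print its splits.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Inspect which feature each level splits on, as a text flowchart.
print(export_text(clf, feature_names=iris.feature_names))
```

Printing the tree as text makes it easy to see, level by level, which feature and threshold each node uses.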
Root Node and what not!!
Now, since we have a simple DT to look at, let’s get a few terminologies out of the way.
In Fig 2, “Petal length” is the root node, since all branching starts from this node; this is also Level 1 of the tree.
Subsequently, “Petal width” at Level 2 is a child node, since it branches off the root, and at Level 3 we have the leaf nodes, where the final prediction is made.
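The flowchart in Fig 2 can be written out as plain if-then-else rules. The thresholds below (2.45 cm and 1.75 cm) are the commonly cited splits for the Iris dataset and are assumed here for illustration; an actual learned tree may differ:

```python
# A hand-written sketch of the Fig 2 flowchart (thresholds are illustrative).
def predict(petal_length, petal_width):
    # Level 1 (root node): split on petal length
    if petal_length <= 2.45:
        return "setosa"          # leaf node
    # Level 2 (child node): split on petal width
    if petal_width <= 1.75:
        return "versicolor"      # leaf node
    return "virginica"           # leaf node

print(predict(1.4, 0.2))  # a typical setosa-like sample -> "setosa"
```

This is exactly the "flowchart" view of a decision tree: each internal node is one comparison, and each leaf returns a class label.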
From a geometrical viewpoint, if we consider a 2D scatter plot of petal length vs. petal width, each split divides the 2D plane into axis-parallel regions, one level at a time, like in the below image.
Understanding Entropy
Just as in thermodynamics, in information theory we define entropy for a distribution. In simple terms, entropy is a measure of the ‘unpredictability’ or ‘randomness’ of data. For example, a fair coin has two equally likely outcomes (Head, Tail); since neither outcome is more likely than the other, the coin is maximally unpredictable and has an entropy of 1 bit.
In statistical terms, the definition is: H(X) = -Σ P(xᵢ) * log2(P(xᵢ)), summed over the n distinct values of X. Here P(xᵢ) represents the proportion of occurrences of the i-th value. Because P(xᵢ) ≤ 1, each log2(P(xᵢ)) ≤ 0, so the leading negative sign makes the entropy non-negative.
Consider the below dataset. We have 4 features (Outlook, Temperature, Humidity and Windy) and one class label variable (Play).
The decision column (‘Play’) consists of 14 instances and includes two labels: yes and no. There are 9 decisions labeled ‘yes’ and 5 labeled ‘no’, so the entropy of the dataset is
H(Play) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) ≈ 0.940
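This calculation is easy to check in code. The small helper below (my own, not from the dataset's source) takes a list of class counts and returns the Shannon entropy in bits:

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# 9 'yes' and 5 'no' decisions:
print(round(entropy([9, 5]), 3))  # approximately 0.940 bits
```

Skipping zero counts (`if c > 0`) follows the usual convention that 0 · log 0 = 0, so the function also handles pure nodes.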
Intuition behind Entropy
Now that we understand how to compute entropy, let’s understand what it tells us about the data.
The curve below plots H(X) against P(x) for a two-class variable. We consider 3 cases to understand the curve.
Case 1 : When P(x) ~ (1 -P(x))
If the class labels are almost equally distributed, then the entropy assumes its maximum value, which is 1 bit:
H(X) = -(0.5 * log2(0.5)) - (0.5 * log2(0.5)) = 1
Case 2 : When P(x) >> (1 -P(x))
In such cases, where one class label predominates over the other, the entropy decreases.
H(X) = -(0.99 * log2(0.99)) - (0.01 * log2(0.01)) ≈ 0.0808
Case 3: When P(x) = 1
When there is only one class label, the entropy is 0:
H(X) = -(1 * log2(1)) = 0
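The three cases above can be verified with a binary entropy function (a small sketch of my own):

```python
import math

def binary_entropy(p):
    """Entropy H(p) in bits for a two-class variable; H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0  # only one class label: no uncertainty
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))             # Case 1: maximum entropy, 1.0
print(round(binary_entropy(0.99), 4))  # Case 2: 0.0808
print(binary_entropy(1.0))             # Case 3: 0.0
```

Plotting `binary_entropy(p)` for p from 0 to 1 reproduces the curve discussed above, peaking at p = 0.5.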
We see from the above cases that entropy gives us an idea about the randomness of the data. A high entropy means the class labels are nearly equally distributed, and the goal of our algorithm will be to reduce this uncertainty with each split.
Conclusion
Even for a Gaussian distribution, we can determine the entropy from its PDF. A ‘peaked’ PDF (small variance) concentrates values near the mean and therefore has a lower entropy than a Normal distribution with a higher spread.
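For a Normal distribution the (differential) entropy has a closed form, 0.5 · ln(2πe·σ²), which depends only on the spread σ. A quick check of the peaked-vs-spread claim (the helper is my own sketch):

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy of a Normal distribution: 0.5 * ln(2*pi*e*sigma^2)."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

# A peaked Gaussian (small sigma) has lower entropy than a spread-out one.
print(gaussian_entropy(0.5) < gaussian_entropy(2.0))  # True
```

Note that unlike the discrete case, differential entropy can be negative for very small σ, but the ordering still matches the intuition: more spread means more entropy.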
I hope the basics of a Decision Tree are clear by now. We have also learned what entropy means and how to calculate it. In the next part, we will learn a few more definitions related to decision trees and see how to construct one.