Classification Algorithms-3: Decision Tree

Sathya Krishnan Suresh
5 min read · Apr 8, 2022


Sathya Krishnan Suresh, Shunmugapriya P

The Decision Tree (DT) classifier is one of the most popular classification algorithms in use today. It is versatile, intuitive and powerful, and one of its biggest advantages is that it is easy to understand. With that said, let's take a look at how the decision tree algorithm works in this article (the notebook that goes along with the article can be found here).

A decision tree works based on the decisions it develops, or identifies, from the input data. Let's understand this through an example where we use a DT to classify the iris dataset (petal width and petal length only). Take a look at the following figure.

DT with max_depth=2

In the above figure, the DT first splits the samples based on the condition that petal length should be less than or equal to 2.45. If an input vector satisfies that condition, it is likely to belong to the 'setosa' class. We can confirm this by taking a look at the dataset below, where we can clearly see that samples with petal length less than 2.45 belong to class 0 (setosa).

Once it has separated setosa completely, it goes on to split the other two classes, this time based on the 'petal width' feature. If the max_depth value had not been set to 2, the DT would have kept looking for more decision boundaries until it could put each separate group into its own leaf, or in DT terms, until gini becomes 0.
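If you want to reproduce a tree like the one in the figure, a minimal sketch along these lines should work (the random_state is my choice and may differ from the accompanying notebook):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]  # petal length, petal width only
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# Print the learned rules, e.g. "petal length (cm) <= 2.45"
print(export_text(tree_clf, feature_names=iris.feature_names[2:]))
```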

So what is gini?

Gini is a measure of the impurity of a node/leaf, and it is the metric the decision tree uses to form its decisions. The DT picks a candidate split value from one of the features. That value is used to split the dataset into two leaves: one leaf contains the data points that lie below the selected value, and the other contains the data points that lie above it. The gini impurity of each leaf is then calculated using the following formula.

Gini impurity: G = 1 − Σᵢ pᵢ²

'pᵢ' in the above formula is the proportion of data points in that leaf belonging to class i. Once the gini values are calculated for the two leaves, their weighted average becomes the gini value of the split. The above process is repeated for the other candidate values, and the split for which the gini value is lowest is selected. If the 'max_depth' parameter is not set, this process is repeated until every leaf contains only a single class or a new decision point cannot be found.
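To make the formula concrete, here is a small self-contained sketch of how a leaf's gini and a split's weighted gini can be computed by hand (the helper names gini and weighted_gini are mine, not scikit-learn's):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a single leaf: 1 - sum_i p_i**2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def weighted_gini(left_labels, right_labels):
    """Weighted average of the two leaf impurities, used to score a split."""
    n_left, n_right = len(left_labels), len(right_labels)
    n = n_left + n_right
    return (n_left / n) * gini(left_labels) + (n_right / n) * gini(right_labels)

print(gini([0, 0, 0, 0]))                    # 0.0  (pure leaf)
print(gini([0, 0, 1, 1]))                    # 0.5  (50/50 leaf)
print(weighted_gini([0, 0, 0], [0, 1, 1]))   # about 0.22
```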

One of the biggest problems with a DT is that the decision boundaries it develops are always orthogonal to one of the axes; it never forms a decision boundary inclined at an angle other than 90 degrees to an axis. For splits that could be done with a single angled line, a DT needs three or four lines perpendicular to the axes.
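A quick way to see this limitation is to fit an unconstrained tree on data whose true boundary is a diagonal line: the tree needs many axis-parallel splits, and therefore many leaves, to approximate it. The toy dataset and seed below are my own illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # true boundary is the diagonal x1 + x2 = 1

tree_clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print("depth:", tree_clf.get_depth(), "leaves:", tree_clf.get_n_leaves())
```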

Another problem with a DT is that it can easily overfit the data if left unchecked. There are a number of regularization parameters for a DT, and we will take a look at three of them.

Regularization:
1. max_depth
2. min_samples_split
3. min_samples_leaf

1. max_depth:
This parameter specifies the maximum depth of the DT. Its default value is None in scikit-learn. If you train a DT with max_depth set to None, the model will keep building the tree until every leaf contains samples belonging to one particular class, which consequently leads to overfitting. To overcome this problem, you can tune max_depth to a low value so that the model does not overfit the data. The following plots compare two models with different max_depth values: you can clearly see that the model on the left is overfitting and will not generalize well, whereas the model on the right allows for one or two misclassifications and will generalize well.
max_depth comparison
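The comparison in the plot can be approximated with a sketch like the following; the moons dataset, seed and depth values are assumptions of mine, not necessarily what the notebook uses:

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

deep = DecisionTreeClassifier(max_depth=None, random_state=42).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

for name, model in [("max_depth=None", deep), ("max_depth=3", shallow)]:
    print(name,
          "train:", round(model.score(X_train, y_train), 3),
          "test:", round(model.score(X_test, y_test), 3))
```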

2. min_samples_split:
Sometimes a node contains, say, 10 samples belonging to two different classes out of 1 million training samples, and the DT will still try to split that node if the max_depth limit has not been reached. Splitting that node is pointless, as it covers only a tiny part of the whole training dataset. To prevent such situations, the min_samples_split parameter can be used; it specifies the minimum number of samples a node must have for it to be split further. This, in combination with max_depth, greatly reduces overfitting. Once again, a plot is made to support the above statements.
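As a rough sketch of the effect (the dataset and the value 20 are illustrative choices of mine), you can compare the number of leaves with and without min_samples_split:

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

unrestricted = DecisionTreeClassifier(random_state=42).fit(X, y)
restricted = DecisionTreeClassifier(min_samples_split=20, random_state=42).fit(X, y)

print("leaves without min_samples_split:", unrestricted.get_n_leaves())
print("leaves with    min_samples_split:", restricted.get_n_leaves())
```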

3. min_samples_leaf:
This hyperparameter follows from min_samples_split discussed above. min_samples_leaf specifies the minimum number of samples a leaf must have, thereby preventing a split if either of the resulting leaves would not meet that condition.
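A short sketch combining all three parameters might look like this (the specific values are illustrative, not tuned):

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.3, random_state=42)

pruned = DecisionTreeClassifier(max_depth=5,
                                min_samples_split=20,
                                min_samples_leaf=10,
                                random_state=42).fit(X, y)
print("leaves:", pruned.get_n_leaves(), "depth:", pruned.get_depth())
```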

All these regularization methods are referred to as 'pruning the decision tree', as you are cutting out (or, more precisely, never growing) the nodes that lead to overfitting.

Conclusion:
We covered a lot of concepts in this article. Take a look at the notebook that goes along with it. I hope you had as much fun reading this article as I had writing it. Clap, subscribe and leave a comment.
