DECISION TREE

Nikita Malviya
Published in Analytics Vidhya · 5 min read · Aug 24, 2020


The decision tree falls under the category of supervised machine learning techniques and is also referred to as CART (Classification and Regression Trees). It uses a tree structure to model the relationships between the features and the outcome. It consists of nodes, which represent decision functions, and branches, which represent the outputs of those decision functions. Thus, it is a flow chart for deciding how to classify a new data point.

The decision tree selects the best attribute using Attribute Selection Measures (ASM) to split the records. The split criterion divides the data into subsets, and those subsets into further smaller subsets. The algorithm stops splitting when the data within a subset is sufficiently homogeneous. At each node, the tree considers splits on all available variables and then selects the split which results in the most homogeneous sub-nodes.
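
To make this concrete, here is a minimal sketch of fitting a classification tree with scikit-learn and printing the learned flow chart of decision rules (the Iris dataset and the hyperparameters are only illustrative, not part of the original article):

```python
# Minimal sketch: fit a decision tree classifier and inspect its decision rules.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# criterion="gini" or criterion="entropy" chooses the Attribute Selection Measure
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))  # accuracy on held-out data
print(export_text(clf))           # the tree as a flow chart of decision rules
```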

The decision tree can be used for both classification and regression problems, but the two cases work differently.

Decision Tree for Classification Problem :

  • The posterior probabilities of all the classes are reflected in the leaf node, and the leaf node is labelled with the majority class. At prediction time, the class of a data point is decided by the leaf node it reaches (see the sketch after this list).
  • The objective is to minimise the impurity as much as possible at the leaf nodes.
  • The loss function is a measure of the impurity in the target column of the nodes belonging to a parent. Impurity at a node is a measure of the mixture of different classes in the target column of that node.
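
Continuing the sketch above (clf and X_test come from that snippet), the leaf a sample reaches supplies both its class probabilities and its predicted majority class:

```python
import numpy as np

# Posterior probability of each class at the leaf each sample falls into
probs = clf.predict_proba(X_test[:3])
# Predicted label: the majority class of that leaf
preds = clf.predict(X_test[:3])

print(probs)
# The predicted class is always the most probable class at the leaf
print(preds == clf.classes_[np.argmax(probs, axis=1)])
```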

Decision Tree for Regression Problem :

  • The average or median value of the target attribute in the leaf is assigned to the query point.
  • The objective is to minimise the variance (the dissimilarity of data points from the central value) in the target column at each node.
  • A decrease in variance is equivalent to an increase in homogeneity or purity (a short sketch follows this list).
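
A minimal regression sketch (scikit-learn assumed, with synthetic data made up for illustration): each prediction is the mean of the target values in the leaf the query point falls into, and the squared-error criterion minimises the variance within each node.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# "squared_error" (recent scikit-learn) minimises the variance of the target in each node
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=3, random_state=0)
reg.fit(X, y)

print(reg.predict([[2.5], [7.5]]))  # each prediction is a leaf's mean target value
```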

Measures for evaluating the quality of a split :

Impurity:

When the tree splits the data into subsets that are insufficiently homogeneous (they still contain a mixture of classes), those subsets are referred to as impure.

Why does this matter? Depending on which impurity measure is used, the tree's classification results can vary, which can have a small or sometimes large impact on your model.

Entropy :

Entropy controls how the decision tree decides where to split the data. It is a measure of the impurity or randomness in the data points, defined as Entropy(S) = −Σᵢ pᵢ log₂(pᵢ), where pᵢ is the proportion of data points in S belonging to class i.

For a two-class problem, entropy lies between 0 and 1. The smaller the entropy, the better.

For example, let’s say we have only two classes, a positive class and a negative class, so ‘i’ here could be either (+) or (-). If we had a total of 100 data points in our dataset, with 30 belonging to the positive class and 70 belonging to the negative class, then ‘P+’ would be 3/10 and ‘P-’ would be 7/10. Plugging these values into the formula above gives the entropy of the classes in this example.

The entropy here is approximately 0.88. This is considered a high entropy, i.e. a high level of disorder (meaning a low level of purity, or a highly impure split).
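
A small sketch of that calculation in Python (NumPy assumed), using the entropy formula above:

```python
import numpy as np

def entropy(class_probabilities):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present."""
    p = np.asarray(class_probabilities, dtype=float)
    p = p[p > 0]  # log2(0) is undefined; empty classes contribute nothing
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.3, 0.7]))  # ~0.881, the high-entropy (impure) split from the text
print(entropy([0.5, 0.5]))  # 1.0, maximum impurity for two classes
print(entropy([1.0, 0.0]))  # 0.0, a perfectly pure node
```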

Information Gain :

Information gain computes the difference between the entropy before a split and the weighted average entropy after the split of the dataset on the given attribute's values.

  • It is used to quantify which feature provides the most information about the classification, based on the notion of entropy, i.e. by measuring the impurity, with the aim of decreasing the entropy from the root node down to the leaf nodes.
To calculate the reduction in uncertainty about Y given an additional piece of information X about Y, we simply subtract the entropy of Y given X from the entropy of Y alone. This is called information gain. The greater the reduction in uncertainty, the more information is gained about Y from X.
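
A quick sketch of information gain for a hypothetical binary split, reusing the entropy() helper from the previous snippet (the child-node counts below are made up for illustration):

```python
import numpy as np

def information_gain(parent_labels, left_labels, right_labels):
    """Entropy of the parent minus the weighted average entropy of the children."""
    def node_entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        return entropy(counts / counts.sum())

    n = len(parent_labels)
    weighted_child_entropy = (
        len(left_labels) / n * node_entropy(left_labels)
        + len(right_labels) / n * node_entropy(right_labels)
    )
    return node_entropy(parent_labels) - weighted_child_entropy

parent = [1] * 30 + [0] * 70                            # the 30/70 node from the entropy example
left, right = [1] * 25 + [0] * 10, [1] * 5 + [0] * 60   # a hypothetical candidate split
print(information_gain(parent, left, right))            # the larger this is, the better the split
```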

Gini Index or Gini Impurity :

It calculates the probability that a specific element would be classified incorrectly if it were chosen at random and labelled according to the distribution of classes at the node. If all the elements belong to a single class, the node can be called pure.

The Gini index is computed as Gini = 1 − Σⱼ (Pj)², where Pj denotes the probability of an element being classified into a distinct class.

The Gini index varies between 0 and 1, where 0 expresses perfect purity of classification (for a two-class node the maximum is 0.5). The Classification and Regression Tree (CART) algorithm uses the Gini index to generate binary splits.
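
A minimal sketch of the Gini impurity calculation (NumPy assumed), reusing the 30/70 example from the entropy section:

```python
import numpy as np

def gini(class_probabilities):
    """Gini impurity = 1 - sum(p_j ** 2); 0 means a perfectly pure node."""
    p = np.asarray(class_probabilities, dtype=float)
    return float(1.0 - np.sum(p ** 2))

print(gini([0.3, 0.7]))  # 0.42 for the 30/70 split used earlier
print(gini([0.5, 0.5]))  # 0.5, maximum impurity for a two-class node
print(gini([1.0, 0.0]))  # 0.0, perfect classification
```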

Gini Index VS Information Gain :

  • The Gini index favours larger partitions and is easy to compute, whereas information gain tends to favour smaller partitions with many distinct values.
  • The Gini index operates on categorical target variables in terms of “success” or “failure” and performs only binary splits; information gain, in contrast, computes the difference between the entropy before and after the split and thereby measures the impurity in the classes of elements.

Gini Impurity vs Entropy vs Misclassification Error :

Misclassification error is not used as a splitting criterion in decision trees; the comparison graph applies only to binary classification.
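
For reference, here is a small sketch (NumPy and matplotlib assumed) that reproduces this comparison for a binary node, plotting entropy, Gini impurity and misclassification error against the positive-class proportion p:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 200)            # proportion of the positive class
entropy_vals = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
gini_vals = 2 * p * (1 - p)                   # 1 - p**2 - (1 - p)**2
misclass_vals = np.minimum(p, 1 - p)          # misclassification error

plt.plot(p, entropy_vals, label="Entropy")
plt.plot(p, gini_vals, label="Gini impurity")
plt.plot(p, misclass_vals, label="Misclassification error")
plt.xlabel("Proportion of positive class (p)")
plt.ylabel("Impurity")
plt.legend()
plt.show()
```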

Advantages and Disadvantages :

Advantages :

  • Simple, fast to process, and effective.
  • Does reasonably well with noisy data and missing data.
  • Handles both numeric and categorical variables.
  • Interpretation of the results does not require mathematical or statistical knowledge.

Disadvantages :

  • Overfits very easily.
  • Often biased towards splits on features that have a large number of levels.
  • Small changes in training data can result in large changes to the logic.
  • Large trees can be difficult to interpret.

Hope this blog helped you get a better understanding of Decision Trees. If you liked it, support it with a clap. Happy Learning… :)
