The AI Odyssey: Decision Trees

Nicolò Chen · Published in AI Odyssey · 5 min read · Oct 31, 2023

What are Decision Trees?

Have you ever heard about Decision Trees? You have probably heard about them in decision analysis, where they are useful tools for identifying the best strategy to achieve the desired objective. However, Decision Trees are also popular in the fields of statistics, data mining, and machine learning because of their ease of interpretation, visualization, and implementation.

Within supervised machine learning, they are used as predictive models that draw conclusions about a set of observations, for both classification and regression tasks.

How they look and how they work:

A decision tree has a flowchart-like structure: it starts with the root node, which is linked to internal nodes and leaf nodes through branches. The root node is the initial node at which the dataset starts dividing into branches based on features or attributes (represented by the internal nodes), and these branches eventually lead to the leaf nodes, or terminal nodes, which represent all the possible outputs of the dataset.

(source: IBM)
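To make this structure concrete, here is a minimal hand-written sketch of such a flowchart as nested if/else statements in Python. The "go outside?" features and values are made up purely for illustration and are not from any real dataset:

```python
def classify(sample):
    # "outlook" plays the role of the root node; "humidity" and "wind"
    # are internal nodes; the returned strings are the leaf nodes.
    if sample["outlook"] == "sunny":        # root node
        if sample["humidity"] == "high":    # internal node
            return "stay inside"            # leaf node
        return "go outside"                 # leaf node
    else:
        if sample["wind"] == "strong":      # internal node
            return "stay inside"            # leaf node
        return "go outside"                 # leaf node

print(classify({"outlook": "sunny", "humidity": "high", "wind": "weak"}))  # stay inside
print(classify({"outlook": "rainy", "humidity": "low", "wind": "weak"}))   # go outside
```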

The algorithm selects the most suitable feature for each internal node using an attribute selection measure and splits the dataset into subsets accordingly. This splitting is repeated recursively until all the records are classified into specific classes, there are no more attributes left to split on, or a specified number of nodes is reached.

(source: IBM)
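As a quick example of this process in practice, the sketch below uses scikit-learn (assuming it is installed, together with its bundled Iris dataset): the criterion argument plays the role of the attribute selection measure, while max_depth acts as a stopping rule for the recursive splitting.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion = attribute selection measure ("gini" or "entropy");
# max_depth limits how long the recursive splitting is allowed to continue.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned splits as a text "flowchart" of the tree.
print(export_text(tree, feature_names=load_iris().feature_names))
```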

Methods to choose the best attribute:

There are several ways to select the best feature at each level of the decision tree; in particular, Information Gain and Gini Impurity are two of the most widely used splitting criteria for this algorithm.

Information Gain: before talking about it, we need to clarify the concept of Entropy. Some of you have probably heard this word in chemistry; in machine learning, Entropy is also related to “randomness”, but here it measures the level of disorder or uncertainty in a given dataset or system.

It is represented by the following formula:

Entropy(S) = − Σ p(c) · log2(p(c))

where:

  • “S” is the dataset;
  • “c” is the classes in S;
  • “p(c)” is the proportion of data points in c to the total number of data points in S.

For a two-class problem, entropy values fall between 0 (lowest level of entropy) and 1 (highest level of entropy). For example, if all samples in a dataset belong to one class, there is no randomness, so the entropy is equal to zero. If, instead, the samples are split half in one class and half in the other, randomness is at its peak and the entropy is equal to 1.

Now that we have understood what entropy is, we can define Information Gain, which is the difference in entropy before and after a split on a given attribute. To find the optimal split, we should use the attribute that leaves the lowest (weighted) entropy after the split; in other words, the attribute with the highest Information Gain produces the best split.
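As a small illustration (a sketch in plain Python/NumPy, not taken from any particular library), the functions below compute the entropy of a set of labels and the Information Gain of a candidate split:

```python
import numpy as np

def entropy(labels):
    # Entropy(S) = -sum over classes c of p(c) * log2(p(c))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, subsets):
    # Entropy before the split minus the weighted entropy after the split.
    n = len(parent)
    after = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - after

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])     # 50/50 classes -> entropy 1.0
left, right = parent[:4], parent[4:]            # a split that separates the classes perfectly
print(entropy(parent))                          # 1.0
print(information_gain(parent, [left, right]))  # 1.0, the best possible gain here
```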

Gini Impurity: it measures the level of impurity in a dataset or, more simply, the probability of a random data point being classified incorrectly in the dataset. It is represented by the following formula:

Gini(S) = 1 − Σ p(c)²

Similarly to entropy, if all the data points belong to one class, the set S is pure, meaning no data point can be classified incorrectly. The probability p(c) for that class is equal to one and, if we plug it into the formula, the Gini Impurity is equal to zero.
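A matching sketch for Gini Impurity (again plain NumPy, same assumptions as above) shows that a pure set scores 0 while a perfectly mixed two-class set scores 0.5:

```python
import numpy as np

def gini_impurity(labels):
    # Gini(S) = 1 - sum over classes c of p(c)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini_impurity([0, 0, 0, 0]))  # 0.0 -> pure set, nothing can be misclassified
print(gini_impurity([0, 0, 1, 1]))  # 0.5 -> maximum impurity for two balanced classes
```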

Brief recap: to find the attribute that determines the best split for a particular node in the tree, we choose the one that provides the highest Information Gain or, equivalently, the one that leaves the lowest Gini Impurity.

Types of Decision Trees:

ID3 and CART are the two main algorithms used to create decision trees.

ID3: which stands for “Iterative Dichotomiser 3”, was developed by Ross Quinlan in 1986 and is usually used for classification tasks. It uses Information Gain to choose the best attribute to split on. One of its distinctive features is that it does not produce binary trees but multi-way trees, where each node can have more than two children (sub-nodes in the tree).

CART: which is the abbreviation of “Classification and Regression Trees”, was developed by four professors: Breiman, Friedman, Olshen, and Stone, and was published in 1984. Unlike ID3, it can be used for both classification (using Gini Impurity to select the best feature to split on) and regression tasks (where splits are chosen to reduce an error measure such as the squared error); furthermore, it produces binary trees, meaning every node has at most two children.
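For a concrete feel of CART-style trees, the sketch below uses scikit-learn (whose tree module is based on an optimised version of CART) on a tiny made-up dataset, once for classification and once for regression:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one numeric feature

# Classification: Gini Impurity is the default splitting criterion.
y_class = np.array([0, 0, 0, 1, 1, 1])
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
print(clf.predict([[2.5], [5.5]]))  # expected: [0 1]

# Regression: splits are chosen to reduce an error measure (squared error by default).
y_reg = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.2])
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y_reg)
print(reg.predict([[2.5], [5.5]]))
```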

Advantages of Decision Trees:

Even though other algorithms often outperform decision trees in predictive accuracy, they have some positive aspects:

- It’s easy to interpret: the hierarchical structure, the Boolean logic, and the visual representation are the key factors that make decision trees intuitively understandable, which is also why they are a great tool for data exploration.

- Little data preparation is needed: they can handle different data types (discrete or continuous values, categorical and numerical data) and, in many implementations, missing values.

- It’s more flexible: decision trees can be used for both classification and regression tasks, which makes them more flexible than some other algorithms.

Disadvantages of Decision Trees:

Even though they are easy to visualize and interpret, decision trees also have some negative aspects:

- Problem of overfitting: trees can easily become too complex, capturing noise in the data and leading to overfitting. This issue can be addressed by pruning, which consists of removing branches that rely on too little data. By pruning, we can improve the tree’s predictive performance on new data (see the sketch after this list).

- High instability: small variations in the data can lead to a very different decision tree. Reducing the variance of decision trees is the usual remedy; bagging or boosting are methods commonly used to deal with it (again, see the sketch after this list).

- High cost: because decision trees take a greedy search approach during construction, they are more expensive to train than some other algorithms.
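To illustrate the first two points, here is a short scikit-learn sketch (the dataset and hyperparameter values are arbitrary choices for demonstration): ccp_alpha turns on cost-complexity pruning to fight overfitting, and bagging averages many trees to reduce the instability of a single one.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pruning: a non-zero ccp_alpha removes branches that add complexity
# without enough benefit, which usually helps on unseen data.
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)
print("pruned tree accuracy:", pruned.score(X_test, y_test))

# Bagging: training many trees on bootstrap samples and averaging their votes
# reduces the variance caused by small changes in the training data.
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                           random_state=0).fit(X_train, y_train)
print("bagged trees accuracy:", bagged.score(X_test, y_test))
```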

Conclusion:

As we have seen so far, decision trees are one of the fundamental supervised learning algorithms in machine learning, and the fact that they are simple to use and interpret makes them an incredibly powerful tool.
