# Binary Decision Trees

A Binary Decision Tree is a structure based on a sequential decision process. Starting from the root, a feature is evaluated and one of the two branches is selected. This procedure is repeated until a final leaf is reached, which normally represents the classification target you’re looking for.

One of the first formulations of Decision Trees is called **Iterative Dichotomizer 3 (ID3)**, and it required categorical features. This condition restricted its use and led to the development of **C4.5**, which could also manage continuous (but binned and discretized) values. Moreover, C4.5 was also known for its ability to transform a tree into a sequence of conditional expressions (if <condition> then <…> else <…>).

Compared to other algorithms, Decision Trees are simpler in their dynamics. Moreover, if the dataset is splittable while keeping an internal balance, the overall process is intuitive and rather fast in its predictions. Decision Trees can also work efficiently with unnormalized datasets because their internal structure is not influenced by the scale of each feature.

The following graph plots an unnormalized bidimensional dataset, as well as the cross-validation scores obtained using logistic regression and a Decision Tree:

The Decision Tree always achieves a score close to **1.0**, while the logistic regression has an average slightly greater than **0.6**. However, without proper limitations, a Decision Tree could potentially grow until a single sample (or a very low number) is present in every node. This situation leads to overfitting, and the tree becomes unable to generalize correctly.
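A minimal sketch reproducing a comparison like the one described above with scikit-learn; the synthetic dataset, feature scaling factor, and random seeds are illustrative assumptions, not the original experiment:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic bidimensional dataset, deliberately left unnormalized:
# the second feature is rescaled to a much larger range.
X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=1000)
X[:, 1] *= 100.0

# 10-fold cross-validation scores for both models on the same data.
lr_scores = cross_val_score(LogisticRegression(), X, y, cv=10)
dt_scores = cross_val_score(DecisionTreeClassifier(random_state=1000), X, y, cv=10)

print('Logistic regression CV mean: {:.3f}'.format(lr_scores.mean()))
print('Decision Tree CV mean: {:.3f}'.format(dt_scores.mean()))
```

Since the tree splits on raw thresholds, rescaling a feature changes the thresholds but not the resulting partition, which is why normalization is unnecessary here.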

Using a consistent test set or cross-validation, together with a maximum allowed depth, can help in avoiding this problem. Another important element to take into account is class balancing. Decision Trees are sensitive to unbalanced classes and can yield poor accuracy when a class is dominant. To mitigate this problem, it’s possible to employ one of the resampling methods or use the class_weight parameter, which is provided by the scikit-learn implementations. In this way, a dominant class can be proportionally penalized, avoiding bias.
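A short sketch of the `class_weight` option mentioned above; the unbalanced dataset is a hypothetical construction for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical unbalanced dataset: class 0 dominates class 1 (450 vs 50).
rng = np.random.RandomState(1000)
X = np.vstack([rng.normal(0.0, 1.0, size=(450, 2)),
               rng.normal(2.5, 1.0, size=(50, 2))])
y = np.array([0] * 450 + [1] * 50)

# 'balanced' reweights each class inversely to its frequency,
# so the dominant class is proportionally penalized during splitting.
dt = DecisionTreeClassifier(class_weight='balanced', random_state=1000)
dt.fit(X, y)
```

Alternatively, `class_weight` accepts an explicit dictionary (for example, `{0: 1.0, 1: 9.0}`) when you want full control over the penalization.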

### Binary decisions

Consider an input dataset, **X**:
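The equation itself appears to have been lost in extraction; a standard formulation consistent with the following paragraph (n samples, each with m features) is:

```latex
X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n\}
\quad \text{where} \quad \bar{x}_i \in \mathbb{R}^m
```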

Every vector is made up of **m** features, so each of them is a candidate for the creation of a node based on a tuple (feature, threshold):

#### Single splitting node

According to the feature and the threshold, the structure of the tree will change. Intuitively, you should pick the feature that best separates your data. In other words, a perfectly separating feature will only be present in a node, and the two subsequent branches won't be based on it anymore. These conditions guarantee the convergence of the tree toward the final leaves, whose uncertainty is minimized. In real problems, however, this is often impossible, so it's necessary to find the feature that minimizes the number of subsequent decision steps.

For example, consider a class of students where all male students have dark hair and all female students have blonde hair, while both subsets have samples of different sizes. If your task is to determine the composition of the class, you can start with the following subdivision:

However, the **Dark color?** block will contain both males and females (which are the targets you want to classify). This concept is expressed using the term **purity** (or, more often, its opposite concept, **impurity**). An ideal scenario is based on nodes where the impurity is null so that all subsequent decisions will be taken only on the remaining features. You can simply start from the color block:

The two resulting sets are now pure according to the color feature and this will be enough for your task. If you need further details, such as hair length, other nodes must be added; their impurity won’t be null because you know that there are, for example, both **Male** and **Female** students with long hair.

More formally, suppose you define the selection tuple as follows:
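The tuple's definition is missing here; a reconstruction consistent with the following description (feature index **i** and threshold **tk**) is:

```latex
\sigma = \langle i, t_k \rangle
```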

Here, the first element is the index of the feature you want to use to split your dataset at a certain node (it will be the entire dataset only at the beginning; after each step, the number of samples decreases), while the second is the threshold that determines left and right branches. The choice of the best threshold is a fundamental element because it determines the structure of the tree and therefore its performance. The goal is to reduce the residual impurity in the least number of splits to have a very short decision path between the sample data and the classification result.

You can also define a total impurity measure by considering the two branches:
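The formula seems to have been an image; a standard weighted formulation matching the symbols explained below is:

```latex
I(D, \sigma) = \frac{N_{left}}{N_D} \, I(D_{left}) + \frac{N_{right}}{N_D} \, I(D_{right})
```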

Here, **D** is the whole dataset at the selected node, **Dleft** and **Dright** are the resulting subsets (by applying the selection tuple), and **I** represents the impurity measures.

### Impurity measures

To define the most frequently used impurity measures, you need to consider the total number of target classes:
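The expression is missing; assuming **P** classes, as implied by the index range given below, it presumably read:

```latex
y \in \{0, 1, \ldots, P-1\}
```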

In a certain node, **j**, you can define the probability **p(y = i|Node = j)**, where **i** is an index in **[0, P-1]** associated with each class. In other words, according to a frequentist approach, this value is the ratio between the number of samples belonging to class **i** and assigned to node **j**, and the total number of samples belonging to the selected node:
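The ratio described above can be written as follows (with **N(i, j)** denoting the count of class-**i** samples in node **j**, and **N(j)** the total count in node **j**; the count notation is assumed):

```latex
p(y = i \,|\, Node = j) = \frac{N(i, j)}{N(j)}
```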

#### Gini impurity index

The Gini impurity index is defined as follows:
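The definition itself is missing from the text; the standard Gini impurity, consistent with the explanation that follows, is:

```latex
I_{Gini}(j) = \sum_{i} p(i \,|\, j) \left( 1 - p(i \,|\, j) \right)
```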

Here, the sum is always extended to all classes. This is a very common measure and it’s used as a default value by scikit-learn. Given a sample, the Gini impurity measures the probability of a misclassification if a label is randomly chosen using the probability distribution of the branch. The index reaches its minimum **(0.0)** when all the samples of a node are classified into a single category.

#### Cross-entropy impurity index

The cross-entropy measure is defined as follows:
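The formula is absent here; the standard cross-entropy impurity, matching the properties described below (null for a pure split, maximal for a uniform distribution), is:

```latex
I_{Cross\text{-}entropy}(j) = -\sum_{i} p(i \,|\, j) \log p(i \,|\, j)
```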

This measure is based on information theory, and assumes null values when samples belonging to a single class are present in a split, while it is at its maximum when there’s a uniform distribution among classes (which is one of the worst cases in Decision Trees because it means that there are still many decision steps until the final classification). This index is very similar to the Gini impurity, even though, more formally, cross-entropy allows you to select the split that minimizes uncertainty about the classification, while the Gini impurity minimizes the probability of misclassification.

The concept of mutual information is defined as **I(X; Y) = H(X) - H(X|Y)**: the amount of information shared by both variables or, equivalently, the amount of information about **X** that you can obtain through knowledge of **Y**. It's also helpful to consider the data generating process, **p(x)**, and the conditional probability, **p(x|y)**. It's easy to prove that the mutual information is the expected value of the Kullback-Leibler divergence between them, with respect to the conditioned variable **y**:
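The equation is missing; in standard form, the expectation described above reads:

```latex
I(X; Y) = \mathbb{E}_{y} \left[ D_{KL} \left( p(x \,|\, y) \,\|\, p(x) \right) \right]
= \sum_{y} p(y) \sum_{x} p(x \,|\, y) \log \frac{p(x \,|\, y)}{p(x)}
```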

Therefore, when **I(X; Y) → 0**, **p(x) ≈ p(x|y)** and knowledge of **Y** doesn't provide any useful information about **X**. Conversely, higher **I(X; Y)** values imply an average strong divergence between the marginal distribution **p(x)** and the conditional **p(x|y)**. The definition of conditional probability can provide better insight:
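The definition referenced here is missing; it was presumably the standard one, which reduces to the marginal exactly when the variables are independent:

```latex
p(x \,|\, y) = \frac{p(x, y)}{p(y)}, \qquad
p(x, y) = p(x)\,p(y) \;\Longrightarrow\; p(x \,|\, y) = p(x)
```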

Here, the two variables are not statistically independent and knowledge of **Y** must provide proportional information about **X**. You can use this concept to define the **information gain** provided by a split (which is formally equivalent to the mutual information):
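The formula is absent; using the total impurity **I(D, σ)** defined earlier and the parent impurity **I(D)**, the information gain of a split is:

```latex
IG(\sigma) = I(D) - I(D, \sigma)
```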

In this case, you are interested in the distributions of the parent node and the children. Therefore, when growing a tree, start by selecting the split that provides the highest information gain. This guarantees the minimization of the impurities of the subsequent nodes because, in general, the most relevant features are selected at the beginning, and the next ones are reserved for fine-tuning. The growth of the tree proceeds until one of the following conditions is verified:

- All nodes are pure
- The information gain is null
- The maximum depth has been reached

#### Misclassification impurity index

The misclassification impurity index is the simplest index and is defined as follows:
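The definition is missing from the text; the standard form is:

```latex
I_{Misclassification}(j) = 1 - \max_{i} p(i \,|\, j)
```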

The interpretation is straightforward but, unfortunately, this index is not the best choice in terms of performance because it's not particularly sensitive to different probability distributions. In all these cases, the Gini or cross-entropy indexes are the most natural choice.
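The three indexes are easy to compare numerically. A self-contained sketch implementing them from their standard definitions (function names are mine):

```python
import numpy as np

def gini(p):
    """Gini impurity for a class-probability vector p."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def cross_entropy(p):
    """Cross-entropy impurity (natural log); 0*log(0) is treated as 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0.0]
    return float(-np.sum(nz * np.log(nz)))

def misclassification(p):
    """Misclassification impurity: 1 minus the dominant class probability."""
    return float(1.0 - np.max(np.asarray(p, dtype=float)))

# A pure node has null impurity under all three measures.
pure = [1.0, 0.0]
print(gini(pure), cross_entropy(pure), misclassification(pure))  # 0.0 0.0 0.0

# A uniform distribution maximizes each index (worst case for a split).
uniform = [0.5, 0.5]
print(gini(uniform), cross_entropy(uniform), misclassification(uniform))
```

Evaluating the functions on slightly skewed distributions (for example, `[0.6, 0.4]` versus `[0.9, 0.1]`) shows how the misclassification index reacts less sharply than the other two.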

### Feature importance

When growing a Decision Tree with a multidimensional dataset, it can be useful to evaluate the importance of each feature in predicting the output values. Decision Trees offer a different approach based on the impurity reduction determined by every single feature. In particular, considering a feature, **x(i)**, its importance can be determined as follows:
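The formula is missing here; a reconstruction consistent with the description below (sum over the nodes **k** that split on **x(i)**, weighted by the fraction of samples **Nk/N** reaching each node, with **ΔIk** denoting the impurity reduction at node **k**) is:

```latex
Importance\left(x^{(i)}\right) = \sum_{k} \frac{N_k}{N} \, \Delta I_k
```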

The sum is extended to all nodes where **x(i)** is used, and **Nk** is the number of samples reaching the node, **k**. Therefore, the importance is a weighted sum of all impurity reductions computed, considering only the nodes where the feature is used to split them. If the Gini impurity index is adopted, this measure is also called **Gini importance**.
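As a quick check of this idea, scikit-learn exposes these importances (normalized to sum to 1) through the `feature_importances_` attribute; a minimal sketch on the Iris dataset, chosen purely for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Fit a tree with the default Gini criterion and inspect
# the resulting Gini importances, one value per feature.
iris = load_iris()
dt = DecisionTreeClassifier(random_state=1000)
dt.fit(iris.data, iris.target)

for name, importance in zip(iris.feature_names, dt.feature_importances_):
    print('{}: {:.3f}'.format(name, importance))
```

Features that never appear in a split receive an importance of exactly zero, which makes this attribute a cheap first pass at feature selection.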

*This article is written by **Packt Publishing**, the leading UK provider of technology eBooks, coding eBooks, videos and blogs, helping IT professionals to put software to work.*