Introduction to decision trees

SeniorQuant
Website categorization
4 min read · Apr 13, 2022


Decision trees are methods which use a tree-like structure of conditional statements and decisions.

They are most commonly used in decision analysis, data mining, web content filtering, and machine learning.

In machine learning, decision trees work well both as standalone models and as building blocks of more complex ensemble models, e.g. random forests or gradient boosting machines.

Decision trees are a type of supervised machine learning model and can be employed for both classification and regression tasks. We have used them for finding categories of websites.
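As a quick illustration, here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the bundled iris data set stands in for real website features and categories, which are not part of this example.

```python
# Minimal sketch of a decision tree classifier with scikit-learn.
# The iris data set is a toy placeholder, not website-categorization data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```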

An often used umbrella term for them is CART (Classification and Regression Trees).

As their name suggests, decision trees have a tree-like structure (see Fig. 1), built from:

  • decision nodes, which partition the data into two or more branches,
  • leaf or terminal nodes, which are the end points of the branches (there are no further splits).

The node at the top of the tree is the root node; it splits on the best predictor.

Figure 1: Decision tree

Decision trees are “grown” by searching at each decision node for the feature that best separates the data into the most homogeneous subsets. This is done recursively until each leaf of the tree is assigned a class, as shown in the sketch below.
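The following is a simplified sketch of this recursive procedure; the nested-dict node format and the helper names (gini, grow) are hypothetical, and a production implementation such as CART adds more stopping criteria and optimizations.

```python
import numpy as np

def gini(y):
    # Gini impurity of a label vector.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def grow(X, y, depth=0, max_depth=5):
    # Stop when the node is pure or the depth limit is reached: create a leaf.
    if depth >= max_depth or len(np.unique(y)) == 1:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    # Try every feature/threshold pair and keep the split with the
    # lowest weighted impurity of the two child subsets.
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            right = ~left
            if left.sum() == 0 or right.sum() == 0:
                continue
            score = (left.sum() * gini(y[left]) + right.sum() * gini(y[right])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        values, counts = np.unique(y, return_counts=True)
        return {"leaf": values[np.argmax(counts)]}
    _, j, t, left, right = best
    return {"feature": j, "threshold": t,
            "left": grow(X[left], y[left], depth + 1, max_depth),
            "right": grow(X[right], y[right], depth + 1, max_depth)}

# Tiny made-up example: two features, two classes.
X = np.array([[2.0, 1.0], [3.0, 1.0], [10.0, 2.0], [11.0, 2.5]])
y = np.array(["news", "news", "shop", "shop"])
print(grow(X, y))
```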

Large decision trees often overfit the training data and can then perform poorly when applied to new, unseen samples.

To deal with this, we need regularization; one approach is to control the size of the tree.

One way to control size is pruning: after the tree has been fully grown, or “trained,” branches are cut back so that only those with high predictive value are kept and used on future data sets.

Pruning can proceed top-down from the root or bottom-up from the leaves. One bottom-up approach is reduced error pruning, which starts at the leaves and replaces each node with its most popular class, keeping the change if prediction accuracy does not degrade.
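As a sketch of size control in practice: scikit-learn exposes minimal cost-complexity pruning via the ccp_alpha parameter, which is a different pruning strategy than reduced error pruning but serves the same purpose. The data set below is just a bundled example.

```python
# Sketch: comparing an unpruned tree with a cost-complexity-pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("unpruned leaves:", unpruned.get_n_leaves(), "test acc:", unpruned.score(X_test, y_test))
print("pruned leaves:  ", pruned.get_n_leaves(), "test acc:", pruned.score(X_test, y_test))
```

A smaller pruned tree will often generalize better to unseen data, even if its training accuracy is slightly lower.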

Decision tree inference is performed by following the tree from the root node, testing the appropriate attribute at each decision node until reaching a leaf node and determining the class.

The predicted class can be, e.g., a product category as part of a product classifier, or a website category in website categorization.
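A minimal sketch of such inference on a hand-built tree follows; the nested-dict node format and the example categories are hypothetical, not a library API.

```python
# Hand-built example tree in a hypothetical nested-dict format:
# decision nodes hold "feature", "threshold", "left", "right"; leaves hold "leaf".
tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"leaf": "news"},
    "right": {
        "feature": 1, "threshold": 2.5,
        "left": {"leaf": "e-commerce"},
        "right": {"leaf": "blog"},
    },
}

def predict(node, x):
    # Follow decision nodes from the root until a leaf is reached, then return its class.
    while "leaf" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["leaf"]

print(predict(tree, [3.2, 4.1]))  # -> "news"
```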

Growing the decision trees

The most important task in growing decision trees is deciding which attribute or feature best splits the data into the most homogeneous subsets at each node.

There are several metrics useful for this: Gini impurity, information gain, and variance reduction.

Gini impurity

Gini impurity measures the likelihood of misclassifying a new data point, randomly chosen from the set, if that data point were randomly labelled according to the distribution of labels in the data set. Gini impurity is calculated with the following formula:

Gini = Σ_i p_i (1 − p_i) = 1 − Σ_i p_i²

where p_i is the fraction of instances labelled with class i in the data set.
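A direct implementation of this formula might look as follows; the label list is a made-up example.

```python
# Gini impurity computed from the class proportions p_i.
from collections import Counter

def gini_impurity(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

labels = ["sports", "sports", "news", "news", "news", "tech"]
print(gini_impurity(labels))  # 1 - (2/6)^2 - (3/6)^2 - (1/6)^2 ≈ 0.611
```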

Information gain

Information gain is based on the concept of entropy, which in our case measures the degree of homogeneity in a group of data points.

Entropy can be calculated with:

H = − Σ_i P_i log₂(P_i)

where P_i is the proportion of data points that belong to class i. Entropy is 0 if all samples belong to the same class and is largest when the classes occur in equal proportions (see Fig. 2).

Figure 2: Entropy for two different distributions

In the context of decision trees, information gain measures the decrease in entropy after we split the data set on a specific attribute.
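A sketch of entropy and of the information gain of a split might look as follows; the example labels and the split into left/right subsets are made up.

```python
# Entropy of a label set and information gain of a binary split.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    # Entropy of the parent minus the weighted entropy of the two children.
    n = len(parent)
    return entropy(parent) - (len(left) / n * entropy(left) + len(right) / n * entropy(right))

parent = ["a", "a", "a", "b", "b", "b"]
left, right = ["a", "a", "a"], ["b", "b", "b"]
print(entropy(parent))                        # 1.0 (maximal for two equally frequent classes)
print(information_gain(parent, left, right))  # 1.0 (a perfect split)
```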

Variance reduction

Variance reduction is usually applied to regression problems. In this approach, the best splits are those that most decrease the combined weighted variance of the child nodes.
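A sketch of this criterion, with made-up target values:

```python
# Variance reduction: parent variance minus the weighted variance of the children.
import numpy as np

def variance_reduction(parent, left, right):
    n = len(parent)
    weighted_child_var = (len(left) / n) * np.var(left) + (len(right) / n) * np.var(right)
    return np.var(parent) - weighted_child_var

parent = np.array([1.0, 1.2, 0.9, 5.1, 4.8, 5.0])
left, right = parent[:3], parent[3:]
print(variance_reduction(parent, left, right))  # large reduction: the split separates low and high values
```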

For most CART problems, Gini impurity and information gain deliver very similar results. Gini impurity does, however, have the advantage of being slightly faster to compute, as it requires no logarithm calculations.

Decision tree algorithms

Decision trees can be built with many different algorithms, including:

  • ID3,
  • CART,
  • 5,
  • Chi-square automatic interaction detection (CHAID).
