Decision Trees — Easily Explained

ZHOU Rui
Titansoft Engineering Blog
8 min read · Mar 31, 2020


A decision tree is a decision support tool that uses a tree-like representation of decisions and their possible consequences. It is a fundamental method in Machine Learning (ML) for classification.

Before digging further into decision trees, there are a few basic ML concepts to be clarified if you are new to ML☺.

Classification vs Regression

  • Binary Classification: to predict a binary output with exactly one out of two possible values, given inputs which could be either continuous or categorical. E.g., Will it be sunny tomorrow?
  • Multi-class Classification (not to be confused with multi-label): to predict a categorical output with more than two classes of possible values, given inputs which could be either continuous or categorical. E.g., Will it be sunny, cloudy, or rainy tomorrow?
  • Regression: to predict a continuous value, given inputs which could be either continuous or categorical. E.g., What will the temperature be tomorrow?

Fundamentally, classification is about predicting a class label and regression is about predicting a continuous quantity.

Predictive Modeling

A predictive model f(x|𝜽) uses historical data to make a prediction of future events, where

  • x is the feature vector of input attributes
  • the output (also called the prediction) is denoted as ŷ
  • and 𝜽 represents all the parameters of the model

Supervised Learning

Supervised machine learning algorithms are designed to “learn by example”: the training data consists of inputs paired with the correct outputs.

Customer credit rating of bank A
  • Training data: A dataset of pairs Dtrain = {(x, y)}, where y is the true label for the corresponding x.
  • Then, create a model f() and tune 𝜽 to minimize the discrepancy between y and ŷ. The function that measures this discrepancy is called the loss function: it maps predictions to their associated costs, in other words, it quantifies how costly forecast errors are (see the sketch below).
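To make this concrete, here is a minimal sketch (not from the original article) using a toy linear model and a squared-error loss; the names f, theta, and squared_loss are purely illustrative.

```python
import numpy as np

# Toy training set Dtrain = {(x, y)}: x is a feature vector, y the true output.
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([2.0, 4.0, 6.0])

def f(x, theta):
    """A toy predictive model f(x|theta): here just a linear function."""
    return x @ theta

def squared_loss(y_true, y_pred):
    """A loss function: quantifies how costly the forecast errors are."""
    return np.mean((y_true - y_pred) ** 2)

theta = np.array([1.5])              # one candidate setting of the parameters
y_hat = f(X_train, theta)            # the predictions ŷ
print(squared_loss(y_train, y_hat))  # training means tuning theta to make this small
```

Training then amounts to searching for the 𝜽 that makes this loss as small as possible on Dtrain.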

Now that you’ve got a good sense of the basic concepts in ML, let’s learn how exactly a decision tree works.

Decision Tree

A Decision Tree is a tree-structured model, where

  • The internal nodes represent tests against the attributes of the input sample, determining the traversal path from the root to a leaf.
  • The leaf nodes give the categorical prediction.

Here’s how we use an existing decision tree to predict whether a customer’s credit rating is good or bad, based on various features:

Example of predicting ŷ based on x upon an existing decision tree
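To make the traversal concrete, here is a minimal sketch of predicting with a hand-built tree. The attributes, thresholds, and labels are invented for illustration; they are not the exact tree from the figure.

```python
# Each internal node tests one attribute of x; each leaf holds the prediction.
tree = {
    "attribute": "income",
    "threshold": 50_000,
    "left":  {"leaf": "bad"},        # income < 50,000
    "right": {                       # income >= 50,000
        "attribute": "years_employed",
        "threshold": 2,
        "left":  {"leaf": "bad"},
        "right": {"leaf": "good"},
    },
}

def predict(node, x):
    """Walk from the root to a leaf, following the attribute tests."""
    if "leaf" in node:
        return node["leaf"]
    branch = "left" if x[node["attribute"]] < node["threshold"] else "right"
    return predict(node[branch], x)

print(predict(tree, {"income": 80_000, "years_employed": 5}))  # -> "good"
```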

The illustration above is easy enough to understand, but here’s the problem: how do we build the tree in the first place?

Let Dt denote the set of training records that reach a node t, with D0 being the full training set at the root node. The general procedure is as follows:

  1. If Dt contains records that belong to the same class yt, then t is a leaf node labeled as yt.
  2. If Dt contains records that belong to more than one class, then use an attribute test to split it into smaller subsets, i.e., subtrees.
  3. Recursively apply the same procedure to each subtree (a sketch of this procedure in code follows below).
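In pseudocode-like Python, the procedure looks roughly like this; choose_best_split is a deliberately naive placeholder (it just picks the first attribute that still has more than one distinct value), since the impurity-based selection is covered in the sections below.

```python
from collections import Counter, defaultdict

def choose_best_split(records, labels):
    """Placeholder split selection: split on the first attribute that still
    has more than one distinct value (impurity-based criteria come later)."""
    for attribute in records[0]:
        if len({r[attribute] for r in records}) > 1:
            partitions = defaultdict(lambda: ([], []))
            for r, y in zip(records, labels):
                partitions[r[attribute]][0].append(r)
                partitions[r[attribute]][1].append(y)
            return attribute, dict(partitions)
    return None, None

def build_tree(records, labels):
    """Recursive tree growing, following the three steps above."""
    # 1. All records at this node share one class -> leaf labeled with that class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # 2. Otherwise, use an attribute test to split the records into subsets.
    attribute, partitions = choose_best_split(records, labels)
    if partitions is None:  # nothing left to split on -> majority-class leaf
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # 3. Recursively apply the same procedure to each subtree.
    return {
        "attribute": attribute,
        "children": {value: build_tree(*subset) for value, subset in partitions.items()},
    }

records = [{"owns_home": "yes", "income": "high"},
           {"owns_home": "no",  "income": "high"},
           {"owns_home": "no",  "income": "low"}]
labels = ["good", "good", "bad"]
print(build_tree(records, labels))
```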

You may then wonder: how should the training data be split, how should we evaluate each split, and when should we stop splitting?

How do we split data to build decision trees?

  1. Splitting for nominal attributes
  • Multi-way split: use as many partitions as distinct values
Multi-way split for nominal attributes
  • Binary split: divides values into two subsets
Binary split for nominal attributes

2. Splitting for ordinal attributes

  • Multi-way split: use as many partitions as distinct values
Multi-way split for ordinal attributes
  • Binary split: divides values into two subsets while preserving the order among attribute values
Binary split for ordinal attributes

3. Splitting for continuous attributes

  • Multi-way split
Multi-way split for continuous attributes
  • Binary split
Binary split for continuous attributes
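For a binary split on a continuous attribute, we need a threshold v to test x < v versus x ≥ v. One common way to enumerate candidate thresholds (a sketch, not necessarily what the figures above use) is to take the midpoints between consecutive distinct sorted values:

```python
def candidate_thresholds(values):
    """Midpoints between consecutive distinct values of a continuous attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

incomes = [30_000, 45_000, 45_000, 60_000, 90_000]
print(candidate_thresholds(incomes))  # [37500.0, 52500.0, 75000.0]
```

Each candidate threshold is then scored with one of the impurity measures described next, and the best one is kept.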

Evaluation of splitting

Customer classification of bank A

Given the table above, how can we best split the classes of customers?

Splitting by different attributes (C0/C1: number of samples from Class 0/1)

Naturally, we prefer nodes with purer class distributions; in other words, we want the proportion of a single class within a node to be as high as possible. Based on the picture above, splitting by Customer ID seems to give the most certain results (the purest class distribution). However, is it really the best choice? We need measurements to quantify the impurity (or purity, if you prefer) of nodes.

Measuring impurity of nodes

There are three widely used measures of impurity: Gini, Entropy, and Classification Error.

  1. Gini Impurity
  • Gini impurity of a single node:

Gini(t) = 1 − Σⱼ [P(j|t)]²

where P(j|t) is the relative frequency of class j at node t. We can also read it another way: Gini is the probability that a sample at node t would be misclassified if it were labeled randomly according to the class distribution at t.

Gini reaches its maximum value of 0.5 (highest impurity, for a two-class problem) when samples are equally distributed among the classes in a node; it reaches its minimum value of 0 (lowest impurity) when all samples in a node belong to one class. Here's an example of how we calculate the Gini of a node:

Example of calculation of Gini of a single node
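Here is the same calculation as a small sketch in code, with the node's class counts passed in as a list [C0, C1, …]:

```python
def gini(counts):
    """Gini impurity of a node, given the class counts at that node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([5, 5]))   # 0.5  -> maximum impurity for two classes
print(gini([10, 0]))  # 0.0  -> a pure node
print(gini([9, 1]))   # 1 - 0.81 - 0.01 = 0.18
```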
  • Gini impurity after splitting:

We now know how to calculate Gini for a single node; next, let's look at the calculation of Gini after splitting node t by attribute j.

When a node t is split into k children by attribute j,

Gini_split = Σᵢ (nᵢ/n) × Gini(i), summing over the k children

where nᵢ is the number of samples at the i-th child node and n is the number of samples in the parent node t. Obviously, we should pick the attribute that minimizes Gini_split (and maximizes Information Gain).

Example of calculation of Gini_split

After splitting, Information Gain = 0.486 − 0.361 = 0.125
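The weighted sum is straightforward in code. The class counts below are hypothetical, chosen so that the arithmetic reproduces the 0.486, 0.361, and 0.125 above; the actual figure may be based on different counts.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children_counts):
    """Weighted Gini after splitting: sum of (n_i / n) * Gini(child i)."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * gini(child) for child in children_counts)

parent = [7, 5]                        # hypothetical [C0, C1] counts at node t
children = [[5, 1], [2, 4]]            # hypothetical counts after a candidate split
print(round(gini(parent), 3))          # 0.486 -> impurity before splitting
print(round(gini_split(children), 3))  # 0.361 -> weighted impurity after splitting
print(round(gini(parent) - gini_split(children), 3))  # 0.125 -> Information Gain
```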

2. Entropy

  • The entropy of a distribution P over symbols (like a, b, c, d) is

H(S) = −Σᵢ pᵢ log₂ pᵢ

where S denotes the random variable for the symbol and pᵢ is the probability of symbol i being sampled.

  • Entropy at a given node t is

Entropy(t) = −Σⱼ P(j|t) log₂ P(j|t)

where the sum runs over the class labels and P(j|t) is the relative frequency of class j at node t.

It reaches its maximum value (1 for a two-class problem, log₂ k for k classes) when samples are equally distributed among all classes; it reaches its minimum of 0 when all samples belong to one class.

  • Entropy impurity after splitting:

Entropy_split = Σᵢ (nᵢ/n) × Entropy(i)

where the parent node t (with n samples) is split into k children and nᵢ denotes the number of samples in child i. Information Gain is then

Gain_split = Entropy(t) − Entropy_split
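A minimal sketch of entropy and entropy-based Information Gain; the class counts in the last line are hypothetical:

```python
from math import log2

def entropy(counts):
    """Entropy of a node, given the class counts at that node."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """Entropy of the parent minus the weighted entropy of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(child) / n * entropy(child) for child in children_counts)
    return entropy(parent_counts) - weighted

print(entropy([5, 5]))                             # 1.0 -> maximum for two classes
print(entropy([10, 0]))                            # 0.0 -> a pure node
print(information_gain([7, 5], [[5, 1], [2, 4]]))  # gain of a hypothetical split
```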

  • Problem with a large number of children

Node impurity measurements tend to prefer splits which result in a large number of children, each being small but pure:

Customer ID has the highest information gain because the entropy of every child is zero, but it is not a good split because it does not generalize: it is useless for a new customer with an unseen ID.

Gain Ratio

We can adjust Information Gain (IG) by dividing it by the entropy of the partitioning (SplitINFO):

GainRatio = Gain_split / SplitINFO, where SplitINFO = −Σᵢ (nᵢ/n) log₂(nᵢ/n)

so an IG obtained from a high-entropy partitioning (a large number of small partitions) is penalized by dividing by a larger value.
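A small sketch of the idea, with hypothetical numbers: the same Information Gain looks far less attractive once it is divided by the SplitINFO of a partitioning into many tiny children (as a split by Customer ID would produce).

```python
from math import log2

def split_info(children_sizes):
    """Entropy of the partitioning itself: many small children -> large SplitINFO."""
    n = sum(children_sizes)
    return sum(-(s / n) * log2(s / n) for s in children_sizes if s > 0)

def gain_ratio(information_gain, children_sizes):
    """Information Gain divided by SplitINFO."""
    return information_gain / split_info(children_sizes)

print(gain_ratio(0.5, [6, 6]))    # two balanced children: SplitINFO = 1.0
print(gain_ratio(0.5, [1] * 12))  # twelve one-sample children: SplitINFO ≈ 3.58, ratio shrinks
```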

3. Classification Error

  • Classification Error at node t:

Error(t) = 1 − maxⱼ P(j|t)

Classification Error reaches its maximum (1 − 1/k for k classes) when samples are equally distributed among all classes, and its minimum of 0 when all samples belong to one class.

Like Gini and Entropy, the error after splitting is the weighted sum over the k children:

Error_split = Σᵢ (nᵢ/n) × Error(i)

  • Computing Error of a Single Node
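A minimal sketch of the same calculation, given the class counts at a node:

```python
def classification_error(counts):
    """Classification error at a node: 1 - max_j P(j|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n

print(classification_error([5, 5]))   # 0.5 -> maximum for two classes
print(classification_error([10, 0]))  # 0.0 -> a pure node
print(classification_error([7, 3]))   # 0.3
```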

Comparison among impurity measurements

For a 2-class problem:

In practice, Entropy and Gini usually generate the same (or very similar) trees, which tend to perform better than trees grown with Classification Error. Let's look at an example:

Comparison between Gini and Classification Error

In this example, the Gini impurity improves under both ways of splitting, but as you can easily see from the calculations, the classification error (0.3) stays the same in every case!
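You can verify this behavior numerically. The class counts below are hypothetical (a parent with 7 vs 3 samples, split in two different ways, in the spirit of the figure) and are not necessarily its exact numbers.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def classification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

def weighted(measure, children_counts):
    """Weighted impurity of a split under the given measure."""
    n = sum(sum(child) for child in children_counts)
    return sum(sum(child) / n * measure(child) for child in children_counts)

parent = [7, 3]                # hypothetical [C0, C1] counts
split_a = [[3, 0], [4, 3]]     # one candidate split
split_b = [[3, 1], [4, 2]]     # another candidate split

# Gini drops after both splits (and prefers split A)...
print(gini(parent), weighted(gini, split_a), weighted(gini, split_b))
# ...while the classification error stays at 0.3 in every case.
print(classification_error(parent),
      weighted(classification_error, split_a),
      weighted(classification_error, split_b))
```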

Advantages and disadvantages of using tree based classification

Advantages

  • Inexpensive to construct.
  • Extremely fast at classifying unknown records.
  • Easy to interpret for small-sized trees.
  • Robust to noise (especially when methods to avoid overfitting are employed).
  • Can easily handle redundant or irrelevant attributes (unless the attributes are interacting).

Disadvantages

  • The space of possible decision trees is exponentially large, and greedy approaches are often unable to find the best tree.
  • Does not take into account interactions between attributes.
  • Each decision boundary involves only a single attribute.

Summary

This article introduced the concept of decision trees, a fundamental method for classification in Machine Learning. Beyond that, there are more advanced algorithms based on bagging (Random Forest) and boosting (AdaBoost, XGBoost, LightGBM, etc.), which combine multiple decision trees in different ways to form a strong learner, achieving much better performance than a single decision tree (a weak learner).

At Titansoft, embracing Machine Learning has proved effective for the different products we build. For example, in user behavior detection, XGBoost works well at identifying users with desirable or undesirable behaviors from a large pool of users. Based on that, we can go on to optimize and customize our product for different groups of users to achieve different results.
