An Introduction to Decision Tree Learning: ID3 Algorithm

Hafidz Jazuli
Machine Learning Guy
8 min read · Mar 12, 2018

This model is very simple and easy to implement. But if you would like to get more insight, below are some important prerequisites related to this model.

Prerequisite:
- Data structure (Tree)
- Searching algorithms (greedy algorithm, heuristic search, hill climbing, alpha-beta pruning)
- Logic (OR, AND rules)
- Probability (Dependent and Independent)
- Information Theory (Entropy)

Applications:
- Operational Research
- Finance
- Scheduling problems
- etc

In this episode on Decision Trees, I will give you a complete guide to understanding the concept behind Decision Trees and how they work, using an intuitive example. In the next episodes, I will show you the easiest way to implement a Decision Tree in Python using the sklearn library and in R using the C50 library (an improved version of the ID3 algorithm).

Motivation

Decision Tree is a very simple model. It is widely known and used in many businesses to support decision-making processes and risk analysis. It is also one of the legendary learning models that were heavily used in the '60s–'80s to build expert systems. One very popular expert system that adopted decision trees (almost every CS and informatics student has heard of it) is Mycin, developed in the 1970s by Buchanan and Shortliffe. But, like other classic expert systems, Mycin did not operate fully automatically. In those years, human experts still needed to input hard-coded rules into expert systems. After the '80s, this model lost popularity, since it seemed it could not be extended with more sophisticated mathematics.

But now, Decision Tree learning is regaining popularity, since some machine learning practitioners have shown that a simpler algorithm with more data can beat a sophisticated algorithm. Besides that, it is worth learning the Decision Tree model before jumping into more abstract models such as Neural Networks and SVMs (Support Vector Machines). By learning Decision Trees, you will gain better insight into how to apply basic probability theory and how to turn a basic searching algorithm into a machine learning algorithm.

In this episode, I will introduce the ID3 method, which is one of the most widely used decision tree algorithms.

What is Decision Tree in One Sentence?

Simply put, it is an enhanced greedy search algorithm that implements a heuristic function using probabilities as comparison values, but does not implement backtracking (because it is greedy!).

Why Decision Tree in Two Sentences?

Suppose you want to build a strategy that contains many decisions; then it is not a good idea to use a brute-force algorithm to decide which combination of decisions is best. Instead, it is a good idea to use a decision tree algorithm, which uses a heuristic function to choose a good combination of decisions.

When to Use Decision Tree Learning?

  1. The most common reason to use a Decision Tree is when you realize that Naive Bayes will not satisfy your goal: everybody knows it is naive to treat all features as independent variables.
  2. If we have categorical data, for example: hot, mild, and cold.
  3. If we have attribute-value pairs, for example: an attribute 'temperature' may have the discrete values 'hot', 'mild', and 'cold'.
  4. If the target has discrete output values, for example 'yes' or 'no'.

Mathematical Tool to Use

Basically, we only need two mathematical tools to implement the complete ID3 algorithm:

  • Entropy
    It is a fundamental quantity in information theory that measures the uncertainty (or impurity) of a collection of examples. Let x be our training set, containing positive and negative examples; then the entropy of x relative to this classification is:
    Entropy(x) = −p₊·log₂(p₊) − p₋·log₂(p₋)
    where p₊ and p₋ are the proportions of positive and negative examples in x.
  • Information Gain
    In multivariate calculus, we learn to use the partial derivative of each variable, holding the others fixed, to find a local optimum. In information theory we use a similar idea: starting from the entropy of the whole population, we measure how much each attribute reduces that entropy. That reduction is the information gain of the attribute.
    For a training set x and one of its attributes y, the formula for Information Gain is:
    Gain(x, y) = Entropy(x) − Σ_v (|x_v| / |x|)·Entropy(x_v)
    where the sum runs over every value v of attribute y, and x_v is the subset of x for which y takes the value v.

I will give you an easy example to make sense of the formulas above. Suppose we face a binary classification, 'yes' or 'no'; we use the bit 1 as the label for yes and the bit 0 as the label for no. One of our feature attributes is 'Outlook' (O), which has three possible values: 'Sunny' (Os), 'Overcast' (Oo), and 'Rain' (Or). Then, the information gain of Outlook is:

Gain(x, O) = Entropy(x) − (|x_Os| / |x|)·Entropy(x_Os) − (|x_Oo| / |x|)·Entropy(x_Oo) − (|x_Or| / |x|)·Entropy(x_Or)
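To make the two formulas concrete, here is a minimal Python sketch of how they can be computed. The function names and the rows/labels data layout are my own choices for illustration, not the code from the article's GitHub repository.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels, e.g. ['yes', 'no', 'yes', ...].
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    # Information gain obtained by splitting `rows` on the attribute
    # stored in column `attr_index`.
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(subset) / len(rows) * entropy(subset)
                    for subset in subsets.values())
    return entropy(labels) - remainder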

ID3 Algorithm

For simplicity, I chose to write the ID3 algorithm in pseudo code, because it is more compact and cleaner. Pseudo code is also easier to read, even for readers who have not studied algorithms before.

# x is the set of examples in the training set
# y is the set of attributes
# labels is the labeled data (classifications)
# Node is a class with the properties value, childs, and next
# root is the top node of the decision tree

# Declare:
x = # multi-dimensional array of examples
y = # column names of x
labels = # classification values, for example {0, 1, 0, 1},
         # meaning that row 1 is false, row 2 is true, and so on
root = ID3(x, y, labels, root)

# Define:
ID3(x, y, labels, node):
    initialize node as a new node instance
    if all rows in x have the same classification c, then:
        insert label c into node
        return node
    if the attribute list y is empty, then:
        insert the dominant label of x into node
        return node
    bestAttr = the attribute in y with maximum information gain on x
    insert attribute bestAttr into node
    for each value vi of bestAttr:
        # for example, Outlook has three values: Sunny, Overcast, and Rain
        insert value vi as a branch of node
        viRows = the rows of x whose value of bestAttr is vi
        if viRows is empty, then:
            end this branch with a leaf whose value is the dominant label of x
        else:
            newY = the attribute list y with bestAttr removed
            nextNode = a new node connected to this branch
            nextNode = ID3(viRows, newY, labels, nextNode)
    return node

The Python version of the pseudo code above can be found at github. If you struggle with how to implement the ID3 algorithm, it is worth playing with that Python version. I also provide a data set that can be used for testing purposes.
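If you just want the overall shape of such an implementation, below is a small self-contained sketch that follows the pseudo code, reusing the entropy and information_gain helpers (and the Counter import) from earlier and representing the tree as nested dictionaries. It is my own simplification for illustration, not the code in the GitHub repository.

def id3(rows, labels, attributes):
    # rows:       list of examples, each a list of attribute values
    # labels:     class label of each row, e.g. 'Yes' / 'No'
    # attributes: list of (name, column_index) pairs still available for splitting
    # Case 1: all examples share a single label -> leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no attributes left -> leaf with the dominant label.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise, split on the attribute with maximum information gain.
    best_name, best_idx = max(attributes,
                              key=lambda attr: information_gain(rows, labels, attr[1]))
    remaining = [attr for attr in attributes if attr[1] != best_idx]
    tree = {best_name: {}}
    for value in sorted(set(row[best_idx] for row in rows)):
        branch_rows = [r for r in rows if r[best_idx] == value]
        branch_labels = [l for r, l in zip(rows, labels) if r[best_idx] == value]
        tree[best_name][value] = id3(branch_rows, branch_labels, remaining)
    return tree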

An Intuitive Example: What Is the Best Day to Play Tennis?

Consider the data set in the table below. The column "Play Tennis" is the target attribute, and each example is a day described by the condition attributes Outlook, Temperature, Humidity, and Wind. We want to know which days are best for playing tennis.

Training set contains 14 examples
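For readers who want to recompute the numbers below: the row indices and information gains used in this walkthrough match the classic PlayTennis data set from Tom Mitchell's Machine Learning textbook, so I will assume that data set here. Written as Python lists, it can be fed straight into the helpers sketched earlier (this layout is my own assumption, not taken from the article's table image).

# Columns: Outlook, Temperature, Humidity, Wind (one row per day, 14 days).
rows = [
    ['Sunny',    'Hot',  'High',   'Weak'],
    ['Sunny',    'Hot',  'High',   'Strong'],
    ['Overcast', 'Hot',  'High',   'Weak'],
    ['Rain',     'Mild', 'High',   'Weak'],
    ['Rain',     'Cool', 'Normal', 'Weak'],
    ['Rain',     'Cool', 'Normal', 'Strong'],
    ['Overcast', 'Cool', 'Normal', 'Strong'],
    ['Sunny',    'Mild', 'High',   'Weak'],
    ['Sunny',    'Cool', 'Normal', 'Weak'],
    ['Rain',     'Mild', 'Normal', 'Weak'],
    ['Sunny',    'Mild', 'Normal', 'Strong'],
    ['Overcast', 'Mild', 'High',   'Strong'],
    ['Overcast', 'Hot',  'Normal', 'Weak'],
    ['Rain',     'Mild', 'High',   'Strong'],
]
# 'Play Tennis' labels for the 14 rows above.
labels = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
          'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']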

We want to construct a decision tree for the training set above using the ID3 algorithm described earlier.

# Declaration
For simplicity, we give each attribute in the data set a short name:

O is Outlook attribute
T is Temperature attribute
H is Humidity attribute
W is Wind attribute

# First iteration
In the first iteration, we need to know which attribute is the best choice for the root of our decision tree. To do that, ID3 finds the attribute with the maximum information gain. The information gains for each attribute are:

attributes = [O, H, W, T]
G(x, O) = 0.246
G(x, H) = 0.151
G(x, W) = 0.048
G(x, T) = 0.029

So, based on the information gains calculated above, we choose attribute Outlook as the root node, which has three branches: Sunny, Rain, and Overcast. Our decision tree now looks like the image below.

Initial decision tree

For the next iterations, we remove O from the attribute list.
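These gains can be reproduced with the helpers sketched earlier, assuming the hypothetical data layout above:

attributes = [('Outlook', 0), ('Temperature', 1), ('Humidity', 2), ('Wind', 3)]
for name, idx in attributes:
    print(name, round(information_gain(rows, labels, idx), 3))
# Prints roughly: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048
# (the last digits differ slightly from the values above because of rounding).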

# Second iteration
We want to examine which attribute is best for the Sunny branch. Remember that the new x contains only the rows whose Outlook value is Sunny.

x = [x[0], x[1], x[7], x[8], x[10]]
attributes = [H, W, T]
G(x, H) = .970 - (3/5) * 0 - (2/5) * 0 = .970
G(x, T) = .970 - (2/5) * 0 - (2/5) * 1 - (1/5) * 0 = .570
G(x, W) = .970 - (2/5) * 1 - (3/5) * .918 = .019

So, based on the information gains calculated above, we choose Humidity as the attribute for the Sunny branch. Our decision tree now looks like the image below.

Add new attribute Humidity

For the next iterations, we remove H from the attribute list.
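The Sunny branch can be checked the same way: keep only the Sunny rows and drop Outlook from the candidate attributes (again assuming the data layout sketched above):

sunny = [i for i, row in enumerate(rows) if row[0] == 'Sunny']
sunny_rows = [rows[i] for i in sunny]
sunny_labels = [labels[i] for i in sunny]
for name, idx in [('Humidity', 2), ('Temperature', 1), ('Wind', 3)]:
    print(name, round(information_gain(sunny_rows, sunny_labels, idx), 3))
# Prints roughly: Humidity 0.971, Temperature 0.571, Wind 0.020,
# matching the gains above up to rounding.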

# Third iteration
The next node is the attribute Humidity, which has two possible values {High, Normal}. The High branch contains only the label No, so this branch ends with a leaf containing the label No. The same goes for the Normal branch, which ends with a leaf containing the label Yes.

Humidity’s leaves

# Fourth iteration
All rows containing the value Overcast have the single label Yes, so the Overcast branch ends with a leaf containing the label Yes.

Overcast’s leaf

# Fifth iteration
We want to examine which attribute is best for the Rain branch. Remember that the new x contains only the rows whose Outlook value is Rain.

x = [x[3], x[4], x[5], x[9], x[13]]
attributes = [W, T]
G(x, W) = .970 - (2/5) * 0 - (3/5) * 0 = .970
G(x, T) = .970 - (0/5) * 0 - (3/5) * .918 - (2/5) * 1.0 = .019

So, based on the information gains calculated above, we choose Wind as the attribute for the Rain branch. Our decision tree now looks like the image below.

Add new attribute Wind

For the next iteration, we remove W from the attribute list.

# Sixth iteration
The next node is the attribute Wind, which has two possible values {Weak, Strong}. The Strong branch contains only the label No, so this branch ends with a leaf containing the label No. The same goes for the Weak branch, which ends with a leaf containing the label Yes.

Wind’s leaves

# Seventh iteration
Actually, there is no iteration left, since every branch of our decision tree ends with a leaf. In other words, the attribute Temperature is never needed; it is effectively pruned from our decision tree.
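Putting everything together, the whole construction above can be reproduced with a single call to the id3 sketch from earlier (same hypothetical data layout):

attributes = [('Outlook', 0), ('Temperature', 1), ('Humidity', 2), ('Wind', 3)]
print(id3(rows, labels, attributes))
# Output (formatted for readability):
# {'Outlook': {'Overcast': 'Yes',
#              'Rain': {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
#              'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}}}}

This matches the manual walkthrough: Temperature never appears in the tree.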

Conclusion

Decision Tree is a very simple model that you can easily build from scratch. One of the most popular Decision Tree algorithms is ID3. Basically, we only need a tree data structure and two mathematical formulas to build the complete ID3 algorithm. The complete implementation of the ID3 algorithm in Python can be found at github.

Suggestion

This article is not intended to go deep into the analysis of Decision Trees. My goal in this tutorial is just to introduce you to the important concepts behind the ID3 algorithm, which was first introduced by John Ross Quinlan in 1986. Later, he developed the C4.5 algorithm, an improved version of ID3, which was in turn improved into the C5.0 algorithm. There is also another popular algorithm called CART (Classification And Regression Trees).

Resources

  1. Introduction to AI by Carnegie Mellon University
  2. Decision Tree Learning by Princeton University
  3. Artificial Intelligence: A Modern Approach
  4. Induction of Decision Trees
