# An Introduction to Decision Tree Learning: ID3 Algorithm

This model is simple and easy to implement. But if you would like deeper insight, below are some important prerequisites for this model.

Prerequisites:

- Data structure (Tree)

- Searching algorithms (greedy algorithm, heuristic search, hill climbing, alpha-beta pruning)

- Logic (OR, AND rules)

- Probability (Dependent and Independent)

- Information Theory (Entropy)

Applications:

- Operational Research

- Finance

- Scheduling problems

- etc

In this episode on Decision Trees, I will give you a complete guide to the concept behind the Decision Tree and how it works, using an intuitive example. In the next episodes, I will show you the easiest way to implement a Decision Tree in Python using the **sklearn** library and in R using the **C50** library (an improved version of the ID3 algorithm).

## Motivation

The Decision Tree is a very simple model. It is widely known and used in many businesses to support decision making and risk analysis. It is also a legendary learning model that was heavily used from the 1960s through the 1980s to build expert systems. One very popular expert system that adopted decision trees (almost every computer science and informatics student knows about it) is MYCIN, developed in the early 1970s at Stanford by Edward Shortliffe and Bruce Buchanan. But, like other classic expert systems, MYCIN did not operate fully automatically: in those years, human experts still needed to input hard-coded rules into expert systems. After the 1980s, this model lost popularity, since it seemed it could not be extended with more sophisticated mathematics.

But now, Decision Tree learning is regaining popularity, since machine learning practitioners have shown that a simpler algorithm with more data may beat a sophisticated algorithm. Besides that, it is worth learning the Decision Tree model before jumping into more abstract models such as Neural Networks and SVMs (Support Vector Machines). By learning Decision Trees, you will gain better insight into how to apply basic probability theory and how to turn a basic search algorithm into a machine learning algorithm.

In this episode, I will introduce **ID3**, one of the most widely used decision tree algorithms.

## What is Decision Tree in One Sentence?

Simply put, it is an enhanced greedy search algorithm that implements a heuristic function using probabilities as comparison values, but does not implement *backtracking* (because it is greedy!).

## Why Decision Tree in Two Sentences?

Suppose you want to build a strategy that contains many decisions. It is *not a good idea* to use a brute-force algorithm to decide which combination of decisions is best. Instead, it is a *good idea* to use a decision tree algorithm, which uses a heuristic function to choose a good combination of decisions.

**When to Use Decision Tree Learning?**

- The most common reason to use a Decision Tree is when you realize that *Naive Bayes* will not satisfy your goal. Everybody knows it is *naive* to treat all features as independent variables.
- If we have categorical data, for example: hot, mild, and cold.
- If we have attribute-value pairs, for example: an attribute ‘temperature’ may have the discrete values ‘hot’, ‘mild’, and ‘cold’.
- If the target has discrete output values, for example ‘yes’ or ‘no’.

**Mathematical Tool to Use**

Basically, we only need two mathematical tools to implement the complete ID3 algorithm:

**Entropy**

Entropy is a fundamental quantity in information theory, commonly used to measure the impurity of a collection of information relative to its *size*. Let x be our training set, containing positive and negative examples. Then the entropy of x relative to this boolean classification is:

`E(x) = -(p+) log2 (p+) - (p-) log2 (p-)`

where p+ is the proportion of positive examples in x and p- is the proportion of negative examples.

**Information Gain**

In multivariate calculus, we learn how to use the partial derivative with respect to each variable to find a local optimum. In information theory we use a loosely similar idea: we take the entropy of the whole population and subtract the weighted entropies of its partitions to measure the information gain of each attribute.

For a training set x and one of its attributes y, the formula for Information Gain is:

`G(x, y) = E(x) - Σv (|xv| / |x|) * E(xv)`

where v ranges over the possible values of attribute y, and xv is the subset of rows of x whose value of y is v.

I will give you an easy example to make sense of the formula above. Suppose we face a binary classification, ‘yes’ or ‘no’; we label ‘yes’ with bit 1 and ‘no’ with bit 0. One of our attributes is ‘Outlook’ (O), which has three possible values: ‘Sunny’ (Os), ‘Overcast’ (Oo), and ‘Rain’ (Or). Then the information gain of Outlook is:

`G(x, O) = E(x) - (|Os|/|x|) * E(Os) - (|Oo|/|x|) * E(Oo) - (|Or|/|x|) * E(Or)`
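The two formulas above can be sketched directly in Python. This is a minimal sketch; the function names and the list-of-rows representation are my own assumptions, not taken from the article's implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Information gain from splitting (rows, labels) on the attribute at attr_index."""
    # Partition the labels by the attribute's value...
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    # ...then subtract each partition's entropy, weighted by its relative size.
    return entropy(labels) - sum(
        (len(part) / len(labels)) * entropy(part) for part in partitions.values())
```

For example, four rows split evenly on one attribute that perfectly predicts a balanced binary label give a gain of exactly 1 bit, the starting entropy of the labels.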

## ID3 Algorithm

For simplicity, I chose to write the ID3 algorithm in pseudocode, because it is more compact and cleaner. Pseudocode is also easier to read, even for those who have not studied algorithms before.

```
# x is the examples in the training set
# y is the set of attributes
# labels is the labeled data
# Node is a class with properties value, childs, and next
# root is the top node of the decision tree

# Declare:
x = ...       # multi-dimensional array of examples
y = ...       # column names of x
labels = ...  # classification values, for example {0, 1, 0, 1},
              # meaning row 1 is false, row 2 is true, and so on
root = ID3(x, y, labels, root)

# Define:
ID3(x, y, labels, node)
    initialize node as a new node instance
    if all rows in x have a single classification c, then:
        insert label c into node
        return node
    if y is empty, then:
        insert the dominant label in x into node
        return node
    bestAttr = the attribute in y with maximum information gain on x
    insert attribute bestAttr into node
    for vi in values of bestAttr:
        # For example, Outlook has three values: Sunny, Overcast, and Rain
        insert value vi as a branch of node
        viRows = the rows of x whose value of bestAttr is vi
        if viRows is empty, then:
            end this branch with a leaf whose value is the dominant label in x
        else:
            newY = the list of attributes y with bestAttr removed
            nextNode = the node connected by this branch
            nextNode = ID3(viRows, newY, labels, nextNode)
    return node
```

The Python version of the pseudocode above can be found on GitHub. If you struggle with implementing the ID3 algorithm, it is worth playing with that Python version. I also provide a data set that can be used for testing.
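The pseudocode above can also be condensed into a short Python sketch. Here the tree is represented as nested dicts rather than Node objects, and the helper names are my own, so treat this as an illustration of the algorithm rather than the article's implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attrs):
    """Return the attribute name with maximum information gain."""
    def gain(attr):
        parts = {}
        for row, label in zip(rows, labels):
            parts.setdefault(row[attr], []).append(label)
        return entropy(labels) - sum(
            (len(p) / len(labels)) * entropy(p) for p in parts.values())
    return max(attrs, key=gain)

def id3(rows, labels, attrs):
    """Build a decision tree as nested dicts: {attr: {value: subtree-or-label}}.

    rows are dicts mapping attribute name -> value.
    """
    # All examples share one label: return that label as a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left to split on: return the dominant (majority) label.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attrs)
    remaining = [a for a in attrs if a != attr]
    tree = {attr: {}}
    # One branch per value of the chosen attribute that occurs in the data.
    for value in set(row[attr] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        tree[attr][value] = id3([r for r, _ in subset],
                                [l for _, l in subset], remaining)
    return tree
```

Because the branch loop only visits values that actually occur in the current rows, the "viRows is empty" case of the pseudocode never arises in this representation.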

## An Intuitive Example: What the best day to Play Tennis?

Consider the data set in the table below. The column “Play Tennis” is the **target attribute**, and each example is a day with the conditions *Outlook*, *Temperature*, *Humidity*, and *Wind* as attributes. We want to know which days are best for playing tennis.

We want to construct a decision tree of the training set above using ID3 algorithm which has been described before.

# Declaration

For simplicity, we will give each attribute in the data set a name:

```
O is the Outlook attribute
T is the Temperature attribute
H is the Humidity attribute
W is the Wind attribute
```

# First iteration

In the first iteration, we need to determine the best attribute to serve as the root of our decision tree. To do that, ID3 finds the *best attribute*, the one with maximum information gain. The information gains for the attributes are:

```
attributes = [O, H, W, T]

G(x, O) = 0.246
G(x, H) = 0.151
G(x, W) = 0.048
G(x, T) = 0.029
```
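Since the data set table above is shown as an image, here is a sketch that reproduces these gains, assuming the standard Play Tennis data set from Mitchell's *Machine Learning* textbook, which matches the row indices and gain values used in this walkthrough:

```python
from collections import Counter
from math import log2

# Columns: Outlook, Temperature, Humidity, Wind; last field is the
# "Play Tennis" target. Rows D1..D14 of Mitchell's Play Tennis data.
data = [
    ('Sunny',    'Hot',  'High',   'Weak',   'No'),
    ('Sunny',    'Hot',  'High',   'Strong', 'No'),
    ('Overcast', 'Hot',  'High',   'Weak',   'Yes'),
    ('Rain',     'Mild', 'High',   'Weak',   'Yes'),
    ('Rain',     'Cool', 'Normal', 'Weak',   'Yes'),
    ('Rain',     'Cool', 'Normal', 'Strong', 'No'),
    ('Overcast', 'Cool', 'Normal', 'Strong', 'Yes'),
    ('Sunny',    'Mild', 'High',   'Weak',   'No'),
    ('Sunny',    'Cool', 'Normal', 'Weak',   'Yes'),
    ('Rain',     'Mild', 'Normal', 'Weak',   'Yes'),
    ('Sunny',    'Mild', 'Normal', 'Strong', 'Yes'),
    ('Overcast', 'Mild', 'High',   'Strong', 'Yes'),
    ('Overcast', 'Hot',  'Normal', 'Weak',   'Yes'),
    ('Rain',     'Mild', 'High',   'Strong', 'No'),
]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index):
    parts = {}
    for r in rows:
        parts.setdefault(r[attr_index], []).append(r[-1])
    return entropy([r[-1] for r in rows]) - sum(
        (len(p) / len(rows)) * entropy(p) for p in parts.values())

for name, i in [('O', 0), ('H', 2), ('W', 3), ('T', 1)]:
    print(f"G(x, {name}) = {gain(data, i):.3f}")
```

Up to rounding in the third decimal place, this reproduces the four gains above, with Outlook the clear winner.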

So, based on the information gains calculated above, we choose the attribute *Outlook* as the root node, with three branches: *Sunny*, *Rain*, and *Overcast*. Our decision tree will look like the image below.

For the next iterations, we remove *O* from attribute list.

# Second iteration

We want to examine the best attribute for the *Sunny* branch. Remember that the new x consists of the rows that contain the value *Sunny*.

```
x = [x[0], x[1], x[7], x[8], x[10]]
attributes = [H, W, T]

G(x, H) = .970 - (3/5) * 0 - (2/5) * 0 = .970
G(x, T) = .970 - (2/5) * 0 - (2/5) * 1 - (1/5) * 0 = .570
G(x, W) = .970 - (2/5) * 1 - (3/5) * .918 = .019
```
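The same calculation can be checked in code on just the five *Sunny* rows (again assuming Mitchell's standard Play Tennis data; the helper names are my own):

```python
from collections import Counter
from math import log2

# The five Sunny rows (D1, D2, D8, D9, D11 of the Play Tennis data).
# Columns: Temperature, Humidity, Wind; last field is the target.
sunny = [
    ('Hot',  'High',   'Weak',   'No'),
    ('Hot',  'High',   'Strong', 'No'),
    ('Mild', 'High',   'Weak',   'No'),
    ('Cool', 'Normal', 'Weak',   'Yes'),
    ('Mild', 'Normal', 'Strong', 'Yes'),
]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, attr_index):
    parts = {}
    for r in rows:
        parts.setdefault(r[attr_index], []).append(r[-1])
    return entropy([r[-1] for r in rows]) - sum(
        (len(p) / len(rows)) * entropy(p) for p in parts.values())

# Humidity separates the two labels perfectly on this branch, so its gain
# equals the full branch entropy (~0.970) and it wins.
print(gain(sunny, 1))  # Humidity
print(gain(sunny, 0))  # Temperature
print(gain(sunny, 2))  # Wind
```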

So, based on the information gains calculated above, we choose the attribute *Humidity* for the *Sunny* branch. Our decision tree will look like the image below.

For the next iterations, we remove *H* from the attribute list.

# Third iteration

The next node is the attribute *Humidity*, which has two possible values {*High*, *Normal*}. The *High* branch is dominated by a single label, *No*, so this branch ends with a leaf containing the label *No*. The same goes for the *Normal* branch, which ends with a leaf containing the label *Yes*.

# Fourth iteration

All rows containing the value *Overcast* are dominated by the single label *Yes*, so the *Overcast* branch ends with a leaf containing the label *Yes*.

# Fifth iteration

We want to examine the best attribute for the *Rain* branch. Remember that the new x consists of the rows that contain the value *Rain*.

```
x = [x[3], x[4], x[5], x[9], x[13]]
attributes = [W, T]

G(x, W) = .970 - (3/5) * 0 - (2/5) * 0 = .970
G(x, T) = .970 - (0/5) * 0 - (3/5) * .918 - (2/5) * 1.0 = .019
```

So, based on the information gains calculated above, we choose the attribute *Wind* for the *Rain* branch. Our decision tree will look like the image below.

For the next iteration, we remove the attribute *W* from the attribute list.

# Sixth iteration

The next node is the attribute *Wind*, which has two possible values {*Weak*, *Strong*}. The *Strong* branch is dominated by a single label, *No*, so this branch ends with a leaf containing the label *No*. The same goes for the *Weak* branch, which ends with a leaf containing the label *Yes*.

# Seventh iteration

Actually, there is no iteration left, since all branches of our decision tree now end with leaves. In other words, the attribute *Temperature* is effectively pruned from our decision tree; it is never needed.
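The finished tree from the walkthrough can be written down as nested dicts together with a small prediction helper. The representation and function name are my own, a sketch rather than the article's implementation:

```python
# Final tree: Outlook at the root, Humidity under Sunny, Wind under Rain,
# and Overcast ending directly in a Yes leaf.
tree = {'Outlook': {
    'Sunny':    {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
    'Overcast': 'Yes',
    'Rain':     {'Wind': {'Weak': 'Yes', 'Strong': 'No'}},
}}

def predict(tree, example):
    """Walk the tree until a leaf (a plain label string) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))           # attribute tested at this node
        tree = tree[attr][example[attr]]  # follow the branch for this value
    return tree

print(predict(tree, {'Outlook': 'Sunny', 'Humidity': 'Normal', 'Wind': 'Weak'}))  # -> Yes
```

Note that *Temperature* never appears in the tree, which is exactly what the seventh iteration concluded.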

## Conclusion

A decision tree is a very simple model that you can easily build from scratch. One popular decision tree algorithm is ID3. Basically, we only need to construct a tree data structure and implement two mathematical formulas to build the complete ID3 algorithm. The complete implementation of the ID3 algorithm in Python can be found on GitHub.

## Suggestion

This article is not intended to go deep into the analysis of Decision Trees. My goal in this tutorial is just to introduce you to the important concepts of the ID3 algorithm, which was introduced by John Ross Quinlan in 1986. Later, he developed the C4.5 algorithm, an improved version of ID3; C5.0 is, in turn, an improved version of C4.5. There is also another popular algorithm called CART (Classification and Regression Trees).