Chapter 3 — Decision Tree Learning — Part 1

Pralhad Teggi
13 min read · Feb 15, 2020


Here I am writing notes for the third chapter, "Decision Tree Learning", from the book "Machine Learning" by Tom Mitchell. I hope this story will help those who are studying machine learning at university or who want to understand the theoretical fundamentals of machine learning.

Chapter 1–3-steps-introduction-to-machine-learning-and-design-of-a-learning-system
Chapter 2 — Concept Learning — Part 1
Chapter 2 — Concept Learning — Part 2
Chapter 2 — Inductive bias — Part 3
Chapter 3 — Decision Tree Learning — Part 1
Chapter 3 — Decision Tree Learning — Part 2 — Issues in decision tree learning

Decision trees are one of the simplest and yet most useful machine learning methods. Decision trees, as the name implies, are trees of decisions. A decision tree can be used to visually and explicitly represent decisions and decision making.
Take, for example, the decision about what activity you should do this weekend. It might depend on whether or not you feel like going out with your friends or spending the weekend alone; in both cases, your decision also depends on the weather. If it's sunny and your friends are available, you may want to play soccer. If it ends up raining, you'll go to a movie. And if your friends don't show up at all, well then you like playing video games no matter what the weather is like! These decisions can be represented by a tree as below.

In my earlier story, Introduction to Machine Learning, I described the learning process. The learning process starts with a task T, a performance measure P, and training experience E, and the objective is to find an unknown target function. The target function is the exact knowledge to be learned from the training experience, and it is unknown.

A target function can be represented as f : X → y. This function represents the exact knowledge defining the relationship between the input variables X and the output variable y. The expectation is that a learning algorithm learns the target function from the training examples, but in practice it learns only an approximation of the target function.

In general, the target function could be real valued or discrete valued.

  • If the task is to approximate a mapping function (f) from input variables (X) to discrete output variables (y), then such learning problems are called classification. The output variables are often called labels or categories. The mapping function predicts the class or category for a given observation. For example, an email can be classified as belonging to one of two classes: "spam" and "not spam".
  • If the task is to approximate a mapping function (f) from input variables (X) to real-valued output variables (y), then such learning problems are called regression. A continuous output variable is a real value, such as an integer or floating-point value. These are often quantities, such as amounts and sizes. For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree. Learned trees can also be re-represented as sets of if-then rules to improve human readability. These learning methods are among the most popular inductive inference algorithms and have been successfully applied to a broad range of tasks, from learning to diagnose medical cases to learning to assess the credit risk of loan applicants.

1. Decision tree representation

  • A decision tree is drawn upside down with its root at the top. In the figure below, the bold text in black represents a condition/internal node, based on which the tree splits into branches/edges.
  • The end of a branch that doesn't split anymore is the decision/leaf; in this case, whether the passenger died or survived, represented as red and green text respectively.
  • Decision trees thus classify instances by sorting them down the tree from the root to some leaf node, which provides the classification of the instance.
  • Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values for this attribute.
  • An instance is classified by starting at the root node of the tree, testing the attribute specified by this node, then moving down the tree branch corresponding to the value of the attribute in the given example. This process is then repeated for the subtree rooted at the new node.

Take one more example. The below figure is a learned decision tree for the concept PlayTennis. An example is classified by sorting it through the tree to the appropriate leaf node, then returning the classification associated with this leaf.

What is the classification for the instance below? Does the day represented by this instance suit playing tennis?

(Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong)

Based on the learned decision tree, the instance is classified as a negative instance (i.e., the tree predicts that PlayTennis = No).

In general, decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. For example, the decision tree shown above corresponds to the below expression.

(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)
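
To make the traversal concrete, here is a minimal Python sketch (my own illustration, not from the book) that encodes the PlayTennis tree above as nested dictionaries and classifies the sample instance by walking from the root down to a leaf.

# A decision tree as nested dicts: {attribute: {value: subtree-or-leaf}}.
# Leaves are plain class labels ("Yes"/"No").
play_tennis_tree = {
    "Outlook": {
        "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(tree, instance):
    """Sort an instance down the tree: test the node's attribute, follow the
    branch matching the instance's value, and repeat until a leaf is reached."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))                  # attribute tested at this node
        tree = tree[attribute][instance[attribute]]   # descend along the matching branch
    return tree                                       # a leaf holds the classification

instance = {"Outlook": "Sunny", "Temperature": "Hot",
            "Humidity": "High", "Wind": "Strong"}
print(classify(play_tennis_tree, instance))           # -> "No" (PlayTennis = No)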

Appropriate problems for decision tree learning

A variety of decision tree learning methods have been developed, and they have differing capabilities and requirements. Decision tree learning is generally best suited to problems with the following characteristics:

  • Instances are represented by attribute-value pairs. Instances are described by a fixed set of attributes (e.g., Temperature) and their values (e.g., Hot). The easiest situation for decision tree learning is when each attribute takes on a small number of disjoint possible values (e.g., Hot, Mild, Cold). However, extensions to the basic algorithm allow handling real-valued attributes as well (e.g., representing Temperature numerically).
  • The target function has discrete output values. The decision tree assigns a boolean classification (e.g., yes or no) to each example. Decision tree methods easily extend to learning functions with more than two possible output values. A more substantial extension allows learning target functions with real-valued outputs, though the application of decision trees in this setting is less common.
  • Disjunctive descriptions may be required. As noted above, decision trees naturally represent disjunctive expressions.
  • The training data may contain errors. Decision tree learning methods are robust to errors, both errors in classifications of the training examples and errors in the attribute values that describe these examples.
  • The training data may contain missing attribute values. Decision tree methods can be used even when some training examples have unknown values (e.g., if the Humidity of the day is known for only some of the training examples).

Many practical problems have been found to fit these characteristics. Decision tree learning has therefore been applied to problems such as learning to classify medical patients by their disease, equipment malfunctions by their cause, and loan applicants by their likelihood of defaulting on payments. Such problems, in which the task is to classify examples into one of a discrete set of possible categories, are often referred to as classification problems.

2. Basic decision tree learning algorithm

Most algorithms that have been developed so far for learning decision trees are variations on a core algorithm that employs a top-down, greedy search through the space of possible decision trees.

This approach is demonstrated by the ID3 algorithm.

The basic ID3 algorithm learns decision trees by constructing them top-down, beginning with the question "which attribute should be tested at the root of the tree?"

  • To answer this question, each instance attribute is evaluated using a statistical test to determine how well it alone classifies the training examples. The best attribute is selected and used as the test at the root node of the tree.
  • A descendant of the root node is then created for each possible value of this attribute, and the training examples are sorted to the appropriate descendant node (i.e., down the branch corresponding to the example’s value for this attribute).
  • The entire process is then repeated using the training examples associated with each descendant node to select the best attribute to test at that point in the tree. This forms a greedy search for an acceptable decision tree, in which the algorithm never backtracks to reconsider earlier choices.
  • A simplified version of the algorithm, specialized to learning boolean-valued functions (i.e., concept learning), is described below.
Algorithm ID3(Examples, TargetAttribute, Attributes)

Examples: the training examples.
TargetAttribute: the attribute whose value is to be predicted by the tree.
Attributes: a list of other attributes that may be tested by the learned decision tree.
Returns: a decision tree that correctly classifies the given Examples.

1. Create a Root node for the tree.
2. If all Examples are positive, return the single-node tree Root, with label = +.
3. If all Examples are negative, return the single-node tree Root, with label = -.
4. If Attributes is empty, return the single-node tree Root, with
   label = most common value of TargetAttribute in Examples.
Otherwise begin:
   - A ← the attribute from Attributes that best classifies Examples
   - The decision attribute for Root ← A
   - For each possible value vi of A:
     • Add a new tree branch below Root, corresponding to the test A = vi
     • Let Examples_vi be the subset of Examples that have value vi for A
     • If Examples_vi is empty, then below this new branch add a leaf node with
       label = most common value of TargetAttribute in Examples
     • Else, below this new branch add the subtree
       ID3(Examples_vi, TargetAttribute, Attributes - {A})
End
5. Return Root
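
The pseudocode maps almost line for line onto a recursive function. Below is a hedged Python sketch of that recursion (my own, not Mitchell's code): the attribute-selection step is passed in as a best_attribute callable, since the information gain measure that ID3 actually uses is only defined in the next section, and each example is assumed to be a dict of attribute values that includes the target attribute.

from collections import Counter

def id3(examples, target_attribute, attributes, best_attribute):
    """Recursive ID3 skeleton following the pseudocode above.
    best_attribute(examples, target_attribute, attributes) picks the attribute
    to test at this node (e.g., by information gain, defined in the next section)."""
    labels = [ex[target_attribute] for ex in examples]

    # All examples share one label: return a single-node tree with that label.
    if len(set(labels)) == 1:
        return labels[0]

    # No attributes left to test: label with the most common target value.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Otherwise: pick the best attribute and grow one branch per observed value.
    # (The book iterates over all possible values of A; an empty branch then
    # gets the majority label. Here values are taken from the current examples.)
    a = best_attribute(examples, target_attribute, attributes)
    tree = {a: {}}
    for v in sorted({ex[a] for ex in examples}):
        examples_v = [ex for ex in examples if ex[a] == v]
        remaining = [attr for attr in attributes if attr != a]
        tree[a][v] = id3(examples_v, target_attribute, remaining, best_attribute)
    return tree

# Tiny demo with a stand-in selector that just takes the first attribute:
toy = [
    {"Outlook": "Overcast", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Outlook": "Rain",     "Wind": "Strong", "PlayTennis": "No"},
    {"Outlook": "Rain",     "Wind": "Weak",   "PlayTennis": "Yes"},
]
print(id3(toy, "PlayTennis", ["Outlook", "Wind"], lambda ex, t, attrs: attrs[0]))
# {'Outlook': {'Overcast': 'Yes', 'Rain': {'Wind': {'Strong': 'No', 'Weak': 'Yes'}}}}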

Which Attribute Is the Best Classifier?

  • The central choice in the ID3 algorithm is selecting the attribute that is most useful for classifying the examples.
  • We will define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification.
  • ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.

Definition: Entropy is a measure of the impurity, disorder, or uncertainty in a collection of examples.
Given a collection S, containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification is

Entropy(S) = -p+ log_2(p+) - p- log_2(p-)

where
p+ is the proportion of positive examples in S, and
p- is the proportion of negative examples in S.

Suppose S is a collection of 14 examples of some boolean concept, including 9 positive and 5 negative examples (we adopt the notation [9+, 5-] to summarize such a sample of data). Then the entropy of S relative to this boolean classification is

Entropy(S) = -(9/14) log_2(9/14) - (5/14) log_2(5/14)
Entropy(S) = 0.940

Notice —
1. The entropy is 0 if all members of S belong to the same class. For example, if all members are positive (p+ = 1), then p- is 0, and
Entropy(S) = -(1) log_2(1) - (0) log_2(0) = 0 (where 0 log_2(0) is defined to be 0)
2. The entropy is 1 when the collection contains an equal number of positive and negative examples.
3. If the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1.
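
As a quick sanity check of these three cases, here is a small Python helper (a sketch of the same formula, not code from the book) that computes the entropy of a collection from its per-class example counts.

import math

def entropy(counts):
    """Entropy of a collection given its per-class example counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 5]), 3))   # 0.94  (the [9+, 5-] collection above)
print(entropy([14, 0]))            # 0.0   (all members in one class)
print(entropy([7, 7]))             # 1.0   (equal positive and negative examples)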

The figure below shows the form of the entropy function relative to a boolean classification, as p+ varies between 0 and 1.

Interpretation of entropy —
One interpretation of entropy from information theory is that it specifies the minimum number of bits of information needed to encode the classification of an arbitrary member of S.

NOTE - 
What are the possible outcomes of tossing 5 coins ?
The following possibilities exist:
1. 5 Head, 0 Tail
2. 4 Head, 1 Tail
3. 3 Head, 2 Tail
4. 2 Head, 3 Tail
5. 1 Head, 4 Tail
6. 0 Head, 5 Tail
Let X be a random variable representing the number of heads. The range of X is {0, 1, 2, 3, 4, 5}.
So the probability distribution of X looks as below:
P(X=0) = p1 = 1/32
P(X=1) = p2 = 5/32
P(X=2) = p3 = 10/32
P(X=3) = p4 = 10/32
P(X=4) = p5 = 5/32
P(X=5) = p6 = 1/32
So in this case, what is the smallest possible number of bits, on average, per symbol, needed to transmit a stream of symbols drawn from X's distribution?
H(X) = Entropy of X = -p1 log2(p1) - p2 log2(p2) - p3 log2(p3) - p4 log2(p4) - p5 log2(p5) - p6 log2(p6)
H(X) = Entropy of X = 2[-1/32 log2(1/32)] + 2[-5/32 log2(5/32)] + 2[-10/32 log2(10/32)]
H(X) = Entropy of X = 2.198
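
For reference, the coin-toss arithmetic above can be checked in a few lines of Python (again my own illustration, not book code):

import math

# P(X = k) for k heads in 5 fair coin tosses: C(5, k) / 32.
probs = [math.comb(5, k) / 32 for k in range(6)]   # [1, 5, 10, 10, 5, 1] / 32
H = -sum(p * math.log2(p) for p in probs)
print(round(H, 3))   # 2.198 bits per symbol on average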

For example, if p+ = 1, the receiver knows the drawn example will be positive, so no message need be sent, and the entropy is zero.
On the other hand, if p+ = 0.5, one bit is required to indicate whether the drawn example is positive or negative.
If p+ = 0.8, then a collection of messages can be encoded using on average less than 1 bit per message by assigning shorter codes to collections of positive examples and longer codes to less likely negative examples.

3. Information Gain

Given entropy as a measure of the impurity in a collection of training examples, we can now define a measure of the effectiveness of an attribute in classifying the training data. The measure we will use, called information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute.

The information gain, Gain(S, A), of an attribute A relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where S is a collection of examples, A is an attribute, Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.

For example, suppose S is a collection of training-example days described by attributes including Wind, which can have the values Weak or Strong. As before, assume S is a collection containing 14 examples, [9+, 5-]. Of these 14 examples, suppose 6 of the positive and 2 of the negative examples have Wind = Weak, and the remainder have Wind = Strong. The information gain due to sorting the original 14 examples by the attribute Wind may then be calculated as

Values(Wind) = {Weak, Strong}, S_Weak = [6+, 2-], S_Strong = [3+, 3-]
Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_Weak) - (6/14) Entropy(S_Strong)
Gain(S, Wind) = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048

Information gain is precisely the measure used by ID3 to select the best attribute at each step in growing the tree. The use of information gain to evaluate the relevance of attributes is summarized in the below figure. The information gain of two different attributes, Humidity and Wind, is computed in order to determine which is the better attribute for classifying the training examples.

In the above figure, Humidity provides greater information gain than Wind, relative to the target classification. Here, E stands for entropy and S for the original collection of examples. Given an initial collection S of 9 positive and 5 negative examples, [9+, 5-], sorting these by their Humidity produces collections of [3+, 4-] (Humidity = High) and [6+, 1-] (Humidity = Normal). The information gained by this partitioning is 0.151, compared to a gain of only 0.048 for the attribute Wind.
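
The two gains quoted above can be reproduced with a short extension of the entropy helper. This is my own sketch of the Gain(S, A) formula; the per-value class counts ([3+, 4-] and [6+, 1-] for Humidity, [6+, 2-] and [3+, 3-] for Wind) are the ones given above.

import math

def entropy(counts):
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(total_counts, counts_per_value):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    with every collection given as its [positive, negative] class counts."""
    n = sum(total_counts)
    return entropy(total_counts) - sum(
        (sum(sv) / n) * entropy(sv) for sv in counts_per_value
    )

S = [9, 5]                                               # [9+, 5-]
print(round(information_gain(S, [[3, 4], [6, 1]]), 3))   # Humidity: 0.152 (0.151 in the text, which rounds the entropies)
print(round(information_gain(S, [[6, 2], [3, 3]]), 3))   # Wind: 0.048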

4. An Illustrative Example

To illustrate the operation of ID3, consider the learning task represented by the training examples in the table below. Here the target attribute is PlayTennis, which can have the values Yes or No for different days. Consider the first step through the algorithm, in which the topmost node of the decision tree is created.

ID3 determines the information gain for each candidate attribute (i.e., Outlook, Temperature, Humidity, and Wind), then selects the one with highest information gain.

We have already computed the information gain for the attributes Humidity and Wind. Now let's compute the information gain for the Outlook and Temperature attributes.

Attribute Outlook —
Outlook_Sunny = [2+, 3-]
Outlook_Rain = [3+, 2-]
Outlook_Overcast = [4+, 0-]

Gain(S, Outlook) = Entropy(S) - (5/14) Entropy(S_Outlook_Sunny) - (5/14) Entropy(S_Outlook_Rain) - (4/14) Entropy(S_Outlook_Overcast)
Gain(S, Outlook) = 0.246

Attribute Temperature —
Temperature_Hot = [2+, 2-]
Temperature_Mild = [4+, 2-]
Temperature_Cool = [3+, 1-]

Gain(S, Temperature) = Entropy(S) - (4/14) Entropy(S_Temperature_Hot) - (6/14) Entropy(S_Temperature_Mild) - (4/14) Entropy(S_Temperature_Cool)
Gain(S, Temperature) = 0.029

So the information gains for all the attributes are as follows:
1. Gain(S,Outlook) = 0.246
2. Gain(S,Temperature) = 0.029
3. Gain(S,Wind) = 0.048
4. Gain(S,Humidity) = 0.151

According to the information gain measure, the Outlook attribute provides the best prediction of the target attribute, PlayTennis, over the training examples. Therefore, Outlook is selected as the decision attribute for the root node, and branches are created below the root for each of its possible values (i.e., Sunny, Overcast, and Rain). The resulting partial decision tree is shown in Figure, along with the training examples sorted to each new descendant node.

Note that every example for which Outlook = Overcast is also a positive example of PlayTennis. Therefore, this node of the tree becomes a leaf node with the classification PlayTennis = Yes. In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these nodes.

The process of selecting a new attribute and partitioning the training examples is now repeated for each non terminal descendant node, this time using only the training examples associated with that node. Attributes that have been incorporated higher in the tree are excluded, so that any given attribute can appear at most once along any path through the tree. This process continues for each new leaf node until either of two conditions is met: (1) every attribute has already been included along this path through the tree, or
(2) the training examples associated with this leaf node all have the same target attribute value (i.e., their entropy is zero).

Now let's compute the information gain for the Temperature attribute on S_Sunny = {D1, D2, D8, D9, D11}.

Gain(S_Sunny, Temperature) = Entropy(S_Sunny) - (2/5) Entropy(S_Sunny_Temperature_Hot) - (2/5) Entropy(S_Sunny_Temperature_Mild) - (1/5) Entropy(S_Sunny_Temperature_Cool)
Gain(S_Sunny, Temperature) = 0.571
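
Putting the pieces together, the sketch below is my own illustration. It assumes the standard 14-day PlayTennis training set from Mitchell's Table 3.2 (the table referred to above) and reuses the entropy and gain helpers to confirm the root-level gains and the Gain(S_Sunny, ·) values, so that Outlook is chosen at the root and Humidity under the Sunny branch.

import math
from collections import Counter

# Assumed: the standard 14-example PlayTennis training set (Mitchell, Table 3.2).
# Columns: Day, Outlook, Temperature, Humidity, Wind, PlayTennis.
DATA = [
    ("D1",  "Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("D2",  "Sunny",    "Hot",  "High",   "Strong", "No"),
    ("D3",  "Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("D4",  "Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("D5",  "Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("D6",  "Rain",     "Cool", "Normal", "Strong", "No"),
    ("D7",  "Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("D8",  "Sunny",    "Mild", "High",   "Weak",   "No"),
    ("D9",  "Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("D10", "Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("D11", "Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("D12", "Overcast", "Mild", "High",   "Strong", "Yes"),
    ("D13", "Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("D14", "Rain",     "Mild", "High",   "Strong", "No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
examples = [dict(zip(["Day"] + ATTRS + ["PlayTennis"], row)) for row in DATA]

def entropy(examples):
    counts = Counter(ex["PlayTennis"] for ex in examples)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    total = len(examples)
    g = entropy(examples)
    for v in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == v]
        g -= (len(subset) / total) * entropy(subset)
    return g

# Gains at the root: Outlook is the largest (about 0.247 exactly; the text's
# 0.246 uses rounded entropies), so Outlook becomes the root test.
for a in ATTRS:
    print(f"Gain(S, {a}) = {gain(examples, a):.3f}")

# Under the Sunny branch (D1, D2, D8, D9, D11): Humidity (about 0.971) beats
# Temperature (about 0.571) and Wind (about 0.020), so Humidity is tested next.
s_sunny = [ex for ex in examples if ex["Outlook"] == "Sunny"]
for a in ["Temperature", "Humidity", "Wind"]:
    print(f"Gain(S_Sunny, {a}) = {gain(s_sunny, a):.3f}")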

Thanks for Reading ………
