Decision Tree

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky; here the data analysis task is classification, where a model or classifier is constructed to predict class (categorical) labels such as "safe" or "risky." Such analysis can give us a better understanding of the data at large. Regression analysis, by contrast, is a statistical methodology most often used for numeric prediction. Classification and numeric prediction are the two major types of prediction problems.

General Approach to Classification

Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data).

Because the class label of each training tuple is provided, this step is also known as supervised learning (i.e., the learning of the classifier is “supervised” in that it is told to which class each training tuple belongs). It contrasts with unsupervised learning (or clustering), in which the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance.

Decision Tree Classifier

A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast, and Rainy), while a leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

How are decision trees used for classification?

Given a tuple (a row in the data set table), X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.
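As a concrete sketch, a small tree can be represented with nested dicts and a tuple classified by walking from the root to a leaf. The attribute names and values below (the classic Outlook/Play weather example) are illustrative only:

```python
# A toy decision tree as nested dicts: an internal node tests an
# attribute, a leaf holds a class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny":    {"attribute": "humidity",
                     "branches": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"attribute": "windy",
                     "branches": {"true": "no", "false": "yes"}},
    },
}

def classify(node, x):
    """Trace a path from the root to a leaf and return its label."""
    while isinstance(node, dict):            # stop when we reach a leaf
        node = node["branches"][x[node["attribute"]]]
    return node

print(classify(tree, {"outlook": "sunny", "humidity": "normal"}))  # yes
```

Each step simply reads the attribute named at the current node, follows the branch matching the tuple's value, and repeats until a leaf is reached.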

Why are decision tree classifiers so popular?

Decision tree construction does not involve any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional data.

During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data.

The tree-induction algorithm is called with three parameters: D, attribute_list, and Attribute_selection_method. We refer to D as a data partition; initially, it is the complete set of training tuples and their associated class labels. The parameter attribute_list is a list of attributes describing the tuples. Attribute_selection_method specifies a heuristic procedure, such as information gain or the Gini index, for selecting the attribute that "best" discriminates the given tuples. Attribute selection determines the splitting criterion, which tells us which attribute to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes.

Information Gain

The attribute with the highest information gain is chosen as the splitting attribute for node N. This attribute minimizes the information needed to classify the tuples in the resulting partitions. The expected information needed to classify a tuple in D (also known as the entropy of D) is given by

Info(D) = −Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the nonzero probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|. A log function to the base 2 is used because the information is encoded in bits. Ideally, we would like the partitioning to produce an exact classification of the tuples; that is, we would like each partition to be pure. However, it is quite likely that the partitions will be impure (e.g., a partition may contain tuples from different classes rather than from a single class). How much more information would we still need (after the partitioning) to arrive at an exact classification? This amount is measured by

Info_A(D) = Σ_{j=1}^{v} (|D_j|/|D|) × Info(D_j)
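The entropy formula translates directly into a few lines of Python. A minimal sketch that operates on per-class tuple counts:

```python
import math

def info(counts):
    """Info(D): expected information (entropy), given the number of
    tuples in each class. Zero-count classes contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

# A pure partition needs 0 bits; a 50/50 split needs a full bit.
print(round(info([7, 7]), 3))   # 1.0
# The 9-yes / 5-no distribution used in the worked example below:
print(round(info([9, 5]), 3))   # 0.94
```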

Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by attribute A. The smaller the expected information (still) required, the greater the purity of the partitions. Finally, the gain is

Gain(A) = Info(D) − Info_A(D)

In fact, Gain(A) tells us how much would be gained by branching on A. The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at node N.

Let's work through an example

Let class C1 correspond to yes and class C2 correspond to no. There are nine tuples of class yes and five tuples of class no. A (root) node N is created for the tuples in D. The expected information needed to classify a tuple in D is

Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits

Next, we need to compute the expected information requirement for each attribute. Let's start with the attribute age. For the age category "youth," there are two yes tuples and three no tuples. For the category "middle aged," there are four yes tuples and zero no tuples. For the category "senior," there are three yes tuples and two no tuples. Using the formula for Info_A(D), we have

Info_age(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × 0 + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits

Next, let's calculate Gain(age):

Gain(age) = Info(D) − Info_age(D) = 0.940 − 0.694 = 0.246 bits

The gains for the remaining attributes can be calculated similarly: Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit_rating) = 0.048 bits.

Because age has the highest information gain among the attributes, it is selected as the splitting attribute. Node N is labeled with age, and branches are grown for each of the attribute’s values.
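The selection of age can be reproduced numerically. A small sketch, assuming only the per-branch class counts quoted above (youth 2/3, middle-aged 4/0, senior 3/2):

```python
import math

def info(counts):
    """Entropy of a class distribution, given per-class tuple counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total)
                for c in counts if c > 0)

def info_after_split(partitions):
    """Info_A(D): size-weighted entropy of the partitions produced by A."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)

info_D = info([9, 5])                                  # 0.940 bits
info_age = info_after_split([[2, 3], [4, 0], [3, 2]])  # 0.694 bits
print(round(info_D - info_age, 3))  # 0.247; the text's 0.246 comes from
                                    # rounding the intermediate values
```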

Gain Ratio

The information gain measure is biased toward tests with many outcomes; that is, it prefers to select attributes having a large number of values. An extension to information gain, known as gain ratio, attempts to overcome this bias. It applies a kind of normalization to information gain using a "split information" value, defined analogously to Info(D) as

SplitInfo_A(D) = −Σ_{j=1}^{v} (|D_j|/|D|) log2(|D_j|/|D|)

This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = Gain(A) / SplitInfo_A(D)

The attribute with the maximum gain ratio is selected as the splitting attribute.

For example, a test on income splits the 14 training tuples into three partitions (low, medium, and high, containing 4, 6, and 4 tuples, respectively), giving SplitInfo_income(D) = 1.557. With Gain(income) = 0.029:

GainRatio(income) = 0.029/1.557 = 0.019
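The split-information computation can be checked directly. This sketch assumes income divides the 14 tuples into partitions of sizes 4, 6, and 4, the sizes that reproduce the 1.557 figure:

```python
import math

def split_info(sizes):
    """SplitInfo_A(D): potential information generated by splitting D
    into partitions of the given sizes, ignoring class labels."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes)

si = split_info([4, 6, 4])
print(round(si, 3))          # 1.557
print(round(0.029 / si, 3))  # 0.019 = GainRatio(income)
```

Note that split information depends only on how many tuples land in each partition, not on their class labels, which is exactly why it penalizes attributes with many small outcomes.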

Gini Index

The Gini index measures the impurity of D, a data partition or set of training tuples, as

Gini(D) = 1 − Σ_{i=1}^{m} p_i²

where p_i is the probability that an arbitrary tuple in D belongs to class C_i.

If a binary split on A partitions D into two subsets D1 and D2, the Gini index of D given that partitioning is defined as

Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

The reduction in impurity that would be incurred by a binary split on attribute A is

ΔGini(A) = Gini(D) − Gini_A(D)

The attribute that provides the smallest Gini_A(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node; this requires enumerating all possible splitting points for each attribute. In our example, D has 9 tuples with buys_computer = "yes" and 5 with "no," so

Gini(D) = 1 − (9/14)² − (5/14)² = 0.459

Suppose the attribute income partitions D into D1 = {low, medium}, containing 10 tuples, and D2 = {high}, containing the remaining 4.

This split yields a Gini index of 0.443, while Gini for {low, high} is 0.458 and Gini for {medium, high} is 0.450. Thus, we split on {low, medium} since it has the lowest Gini index.
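The Gini figures can be verified with a few lines. This sketch assumes within-subset class counts of 7 yes / 3 no for {low, medium} and 2 / 2 for {high}, which are consistent with the totals above:

```python
def gini(counts):
    """Gini(D) = 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def gini_split(partitions):
    """Gini_A(D): size-weighted Gini index of a binary split."""
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * gini(p) for p in partitions)

print(round(gini([9, 5]), 3))                  # 0.459 = Gini(D)
print(round(gini_split([[7, 3], [2, 2]]), 3))  # 0.443 for {low, medium}
```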

Comparison

The three measures, in general, return good results, but:

Information gain: biased toward multivalued attributes.

Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others.

Gini index: biased toward multivalued attributes; has difficulty when the number of classes is large; tends to favor tests that result in equal-sized partitions with purity in both partitions.
