Decision Tree Classification

Rishabh Jain
6 min read · Jun 12, 2020


What is a decision tree?

A decision tree is a supervised machine learning technique that is widely used because it gives good accuracy and makes no assumption that the attributes are independent of each other. As the name suggests, it breaks the data into smaller and smaller subsets and forms a tree-like structure.

Decision Tree

Decision trees are of two types: classification and regression. Here we will discuss decision tree classification; decision tree regression will be covered separately.

Decision tree classification compares each attribute against the outcome, picks the best-fitting attribute as the root node, and makes the remaining attributes its child nodes.

It can do this using two methods:

  1. Iterative Dichotomiser 3(ID3)
  2. Using the Gini Index

To understand how the ID3 method does this, you need to understand two concepts: entropy and information gain.

Let’s understand the Entropy first.

Entropy is a measure of the randomness in the information being processed. The higher the entropy, the more difficult it is to draw any conclusions from that information. Flipping a coin is an example of an action that presents information that is random.

Entropy Formula: E(S) = − Σ p(i) · log2 p(i), where p(i) is the proportion of examples in S that belong to class i.
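As a rough illustration (the function and the toy examples below are my own sketch, not from the original post), entropy can be computed in a few lines of Python:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

# A fair coin flip is maximally random: entropy = 1 bit.
print(entropy(["heads", "tails"]))       # 1.0
# A set where every example has the same label has no randomness at all.
print(entropy(["yes", "yes", "yes"]))    # -0.0, i.e. zero
```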

Information Gain

Information gain is the reduction in entropy or surprise achieved by transforming a dataset, and it is often used in training decision trees. It is measured by comparing the entropy of the dataset before and after the transformation.

Information Gain Formula: Gain(S, A) = E(S) − Σ (|Sv| / |S|) · E(Sv), where Sv is the subset of examples for which attribute A takes the value v.
Now let’s see how it works using an example.

Let’s take a popular dataset, the weather dataset (play a game, yes or no, based on the weather conditions).

Dataset

We have four categorical X values (outlook, temp, humidity, and windy) and one categorical y value (play: Y or N). This is a binary classification problem.
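For concreteness, the code sketches below assume the classic 14-example play-tennis table; if the screenshot of the dataset differs, adjust the rows accordingly. In plain Python it can be written as a list of dictionaries:

```python
# Classic play-tennis weather data (assumed here; verify against the dataset screenshot).
weather = [
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temp": "hot",  "humidity": "high",   "windy": True,  "play": "no"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "cool", "humidity": "normal", "windy": True,  "play": "no"},
    {"outlook": "overcast", "temp": "cool", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temp": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "sunny",    "temp": "mild", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temp": "mild", "humidity": "high",   "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temp": "hot",  "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temp": "mild", "humidity": "high",   "windy": True,  "play": "no"},
]
```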

To create a tree, we need to have a root node first and we know that nodes are features/attributes.

Determine the attribute that best classifies the training data and use this attribute at the root of the tree. Repeat the process for each branch.

This means we are performing a top-down, greedy search through the space of possible decision trees.

How do we choose the best attribute?

In ID3, use the attribute with the highest information gain.

To define information gain precisely, we begin with a measure commonly used in information theory, called entropy, which characterizes the impurity of an arbitrary collection of examples.

For a binary classification problem

  • If all examples are positive or all are negative, then entropy will be zero, i.e., low.
  • If half of the examples are of the positive class and half are of the negative class, then entropy is one, i.e., high.

Okay, let’s apply these metrics to our dataset to split the data (i.e., find the root node).

Steps:

1. Compute the entropy for the dataset.
2. For every attribute/feature:
   1. Calculate the entropy for all of its categorical values.
   2. Take the average information entropy for the current attribute.
   3. Calculate the information gain for the current attribute.
3. Pick the attribute with the highest gain (see the sketch after this list).
4. Repeat until we get the desired tree.
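Here is a minimal Python sketch of steps 1–3, assuming rows are dictionaries like the weather list shown earlier (the helper names are illustrative; a complete ID3 implementation would also recurse into each branch):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy of the subsets made by `attribute`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    weighted = sum(len(lbls) / len(rows) * entropy(lbls) for lbls in subsets.values())
    return entropy([row[target] for row in rows]) - weighted

def best_attribute(rows, attributes, target):
    """Step 3: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Usage with the `weather` list shown earlier (commented out so this block runs on its own):
# best_attribute(weather, ["outlook", "temp", "humidity", "windy"], "play")  # -> "outlook"
```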

Compute the entropy for the weather data set:

For every feature calculate the entropy and information gain.
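As a rough worked example, assuming the classic table with 9 “yes” and 5 “no” examples (5 sunny, 4 overcast, and 5 rainy days):

E(S) = − (9/14) · log2(9/14) − (5/14) · log2(5/14) ≈ 0.94

For Outlook:

E(sunny) = − (2/5) · log2(2/5) − (3/5) · log2(3/5) ≈ 0.971
E(overcast) = 0 (all four examples are “yes”)
E(rainy) = − (3/5) · log2(3/5) − (2/5) · log2(2/5) ≈ 0.971

Gain(S, Outlook) ≈ 0.94 − (5/14) · 0.971 − (4/14) · 0 − (5/14) · 0.971 ≈ 0.247

Doing the same for Windy gives Gain(S, Windy) ≈ 0.048.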

Similarly, we can calculate the gain for the other two attributes (Humidity and Temp).

Pick the highest gain attribute.

So our root node is Outlook.

Repeat the same thing for sub-trees till we get the tree.

Finally, we get a tree something like this.

Final Decision Tree

Classification using the Gini Index

We use the Gini Index as the cost function to evaluate splits in the dataset. Our target variable is binary, meaning it takes two values (Yes and No), so there are four possible combinations of actual and predicted values.

(Actual = 1, Predicted = 1), (1, 0), (0, 1), (0, 0)

P(Target=1)·P(Target=1) + P(Target=1)·P(Target=0) + P(Target=0)·P(Target=1) + P(Target=0)·P(Target=0) = 1

P(Target=1)·P(Target=0) + P(Target=0)·P(Target=1) = 1 − P²(Target=0) − P²(Target=1)

Gini Index for Binary Target variable is

= 1 — P²(Target=0) — P²(Target=1)


What does the Gini Index do?

A Gini score gives an idea of how good a split is by how mixed the classes are in the two groups created by the split. A perfect separation results in a Gini score of 0, whereas the worst-case split, with a 50/50 mix of classes in each group, results in a score of 0.5 (for a binary problem).

We calculate it for every candidate split, partition the data accordingly in our binary tree, and repeat this process recursively.
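As an illustrative sketch (the function and variable names are my own, not from the post), the score of a candidate split can be computed as the size-weighted Gini index of the groups it creates:

```python
def gini_for_split(groups, classes):
    """Size-weighted Gini index of a split.

    groups:  list of groups, each group a list of class labels
    classes: the possible class labels, e.g. ["yes", "no"]
    """
    n_total = sum(len(group) for group in groups)
    weighted = 0.0
    for group in groups:
        if not group:
            continue
        # 1 minus the sum of squared class proportions within this group
        score = 1.0 - sum((group.count(c) / len(group)) ** 2 for c in classes)
        weighted += score * (len(group) / n_total)
    return weighted

# A pure split scores 0.0; a 50/50 split scores 0.5.
print(gini_for_split([["yes", "yes"], ["no", "no"]], ["yes", "no"]))   # 0.0
print(gini_for_split([["yes", "no"], ["yes", "no"]], ["yes", "no"]))   # 0.5
```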

For a binary target variable, the maximum Gini Index value is

= 1 − (1/2)² − (1/2)²
= 1 − 2 · (1/2)²
= 1 − 2 · (1/4)
= 1 − 0.5
= 0.5

Similarly, if the target variable is a categorical variable with multiple levels, the Gini Index works the same way. If the target variable takes k different values, the Gini Index is

= 1 − P²(Target=1) − P²(Target=2) − … − P²(Target=k)

The maximum value of the Gini Index occurs when all target values are equally distributed.

Similarly, for a nominal variable with k levels, the maximum value of the Gini Index is

= 1 − k · (1/k)² = 1 − 1/k

The minimum value of the Gini Index will be 0 when all observations belong to one label.
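A small hedged sketch (again illustrative only, not from the post) of the Gini Index of a set of labels, showing the 0 and 1 − 1/k bounds described above:

```python
from collections import Counter

def gini(labels):
    """Gini Index of a list of class labels: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

print(gini(["a"] * 10))                  # 0.0  -> all observations have one label
print(gini(["a", "b"] * 5))              # 0.5  -> binary 50/50, the 1 - 1/k bound for k = 2
print(gini(["a", "b", "c", "d"] * 3))    # 0.75 -> equal split over k = 4 labels, 1 - 1/4
```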

Steps:

1. Compute the Gini Index for the dataset.
2. For every attribute/feature:
   1. Calculate the Gini Index for all of its categorical values.
   2. Take the weighted average Gini Index for the current attribute.
   3. Calculate the Gini gain for the current attribute.
3. Pick the attribute with the best Gini gain.
4. Repeat until we get the desired tree.

The calculations are comparable to ID3, except for the formula modifications. For example: compute the Gini index for the dataset.
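As a rough worked example, again assuming the classic table with 9 “yes” and 5 “no” examples:

Gini(S) = 1 − (9/14)² − (5/14)² ≈ 1 − 0.413 − 0.128 ≈ 0.459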

Similarly, we can work through the remaining steps to build the tree.

Final Decision Tree

This is how we use the decision tree. There is an extension of decision trees called Random Forest, which we will discuss later.

THANK YOU

