Decision Trees in Machine Learning

Anushka garg
Published in DataX Journal
6 min read · Jul 24, 2020

“The possible solutions to a given problem emerge as the leaves of a tree, each node representing a point of deliberation and decision.”

  • Niklaus Wirth (1934 — ), Programming language designer

A tree has many analogies in real life, and it turns out that it has influenced a wide area of machine learning, covering both classification and regression. A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label (the decision taken after evaluating all attributes along the path). The paths from root to leaf represent classification rules. Though decision trees are a commonly used tool in data mining for deriving a strategy to reach a particular goal, they are also widely used in machine learning.
In short, a decision tree tests attributes and branches the cases based on the results of those tests.

The below diagram illustrates the basic flow of a decision tree for decision making with labels (Rain(Yes), No Rain(No)).

Decision Tree for Rain Forecasting

Decision tree algorithms fall under the category of supervised learning. They can be used to solve both regression and classification problems, but they are most commonly used for classification.

Why use Decision Trees?

There are many algorithms in machine learning, so choosing one that suits the given dataset is a key step in building a model. The basic idea behind a decision tree is to map out all the possible decision paths in the form of a tree. Below are two reasons for using the decision tree:

  • Decision trees mimic the way humans reason when making a decision, so they are easy to understand.
  • The logic behind a decision tree is easy to follow because it is laid out as a tree-like structure.

Common terms used with Decision trees:

  1. Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
  2. Splitting: It is a process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, then it is called a decision node.
  4. Leaf/ Terminal Node: Nodes that do not split are called leaf or terminal nodes.
  5. Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
  6. Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  7. Parent and Child Node: A node that is divided into sub-nodes is called the parent node of those sub-nodes, and the sub-nodes are its child nodes.
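
To connect these terms to something concrete, here is a hypothetical, hand-written nested-dictionary version of the rain-forecasting tree from the diagram above, annotated with the terminology (the attribute names and values are illustrative assumptions, not taken from a real dataset):

# The outer dictionary is the root node (it tests the 'outlook' attribute).
# The inner 'humidity' dictionary is a decision node created by splitting,
# and the "Yes"/"No" strings are leaf (terminal) nodes. Replacing the whole
# 'humidity' sub-tree with a single label would be an example of pruning.
rain_tree = {
    "outlook": {
        "cloudy": "Yes",              # leaf node: Rain
        "sunny": {"humidity": {       # decision node (child of the root)
            "high": "Yes",            # leaf node: Rain
            "low": "No",              # leaf node: No Rain
        }},
    }
}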

Decision Tree Learning Algorithm

A decision tree is constructed by considering the attributes one by one:
1. Choose an attribute from your dataset.
2. Calculate the significance of the attribute in splitting the data.
3. Split the data based on the value of the best attribute.
4. Go to Step 1 (go to each branch and repeat the process for the remaining attributes).
After building the tree, you can use it to predict the class of unknown cases. A minimal sketch of this loop is shown below.
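
As a rough illustration of these steps (not a production implementation), the sketch below builds a tiny tree recursively, at each step choosing the attribute whose split yields the highest information gain; the helper names, the dictionary-based tree representation, and the toy data are all illustrative assumptions:

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # Entropy of the parent minus the weighted entropy of the children
    # produced by splitting on `attr` (Step 2: the attribute's significance).
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def build_tree(rows, labels, attrs):
    # Steps 1-4: pick the most significant attribute, split on it,
    # and recurse on each branch until the node is pure.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority class
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[best][value] = build_tree(list(sub_rows), list(sub_labels),
                                       [a for a in attrs if a != best])
    return tree

# Toy rain-forecasting data: each row is a dict of attribute values.
rows = [{"outlook": "cloudy", "humidity": "high"},
        {"outlook": "sunny",  "humidity": "low"},
        {"outlook": "cloudy", "humidity": "low"},
        {"outlook": "sunny",  "humidity": "high"}]
labels = ["Yes", "No", "Yes", "No"]
print(build_tree(rows, labels, ["outlook", "humidity"]))
# e.g. {'outlook': {'cloudy': 'Yes', 'sunny': 'No'}}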

Let us understand this with the help of an example:

Example: Suppose a candidate has a job offer and wants to decide whether to accept it. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by an attribute selection measure, or ASM). The root node splits further into the next decision node (distance from the office) and one leaf node, based on the corresponding labels. The next decision node splits into another decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the below diagram:

Attribute Selection Measures

In a decision tree, the major challenge is identifying which attribute to place at the root node and at each subsequent level. This process is known as attribute selection. We have two popular attribute selection measures:
1. Information Gain
2. Gini Index

Information Gain

Information gain is used to decide which feature to split on at each step of building the tree. Simplicity is best, so we want to keep our tree small. To do so, at each step we should choose the split that results in the purest child nodes. A commonly used measure of purity is called information. For each node of the tree, the information value measures how much information a feature gives us about the class. The split with the highest information gain is taken first, and the process continues until all child nodes are pure or until the information gain is 0.
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy
Entropy is a measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples. The higher the entropy, the higher the information content. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,
S = the set of samples
P(yes) = probability of yes
P(no) = probability of no
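
As a quick sanity check of the two formulas above, the snippet below works them through for a hypothetical node containing 9 “yes” and 5 “no” samples that a feature splits into subsets of (6 yes, 2 no) and (3 yes, 3 no); the counts are invented purely for illustration:

import math

def entropy(p_yes, p_no):
    # Entropy(S) = -P(yes)*log2(P(yes)) - P(no)*log2(P(no)); 0*log2(0) is taken as 0.
    return -sum(p * math.log2(p) for p in (p_yes, p_no) if p > 0)

parent = entropy(9 / 14, 5 / 14)    # entropy of the node: ~0.940

# The feature splits the node into children of size 8 and 6.
child_1 = entropy(6 / 8, 2 / 8)     # ~0.811
child_2 = entropy(3 / 6, 3 / 6)     # exactly 1.0 (maximally impure)

# Information Gain = Entropy(S) - weighted average of the child entropies.
gain = parent - (8 / 14 * child_1 + 6 / 14 * child_2)
print(round(parent, 3), round(gain, 3))   # 0.94 0.048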

Gini Impurity

First, let’s understand the meaning of Pure and Impure.

Pure

Pure means that, in a selected sample of the dataset, all of the data belongs to the same class.

Impure

Impure means data is a mixture of different classes.

Definition of Gini Impurity

Gini index is a measure of impurity or purity used while building a decision tree in the CART (Classification and Regression Tree) algorithm.
If our dataset is pure, the likelihood of incorrect classification is 0. If our sample is a mixture of different classes, the likelihood of incorrect classification is high.

Gini impurity is often preferred to information gain because it avoids logarithms, which are relatively expensive to compute.

  • Gini index can be calculated using the below formula:
Gini Index = 1 − Σⱼ (Pⱼ)²
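
To make the formula concrete, here is a small sketch (with made-up class counts) showing how the Gini index behaves for a pure node versus increasingly mixed ones; Pⱼ is the fraction of samples belonging to class j:

def gini_index(counts):
    # Gini Index = 1 - sum_j (P_j)^2, where P_j is the fraction of class j.
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini_index([10, 0]), 2))   # 0.0  -> pure node: no chance of misclassification
print(round(gini_index([7, 3]), 2))    # 0.42 -> impure node
print(round(gini_index([5, 5]), 2))    # 0.5  -> maximally impure for two classes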

Pruning: Getting an Optimal Decision tree

Pruning is the process of removing unnecessary nodes from a tree in order to obtain an optimal decision tree.
A large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. Pruning is therefore a technique that decreases the size of the tree without reducing its accuracy. Two pruning techniques are mainly used (a short scikit-learn sketch of the first follows the list below):

  • Cost Complexity Pruning
  • Reduced Error Pruning.
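
scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter of DecisionTreeClassifier. The sketch below, using the bundled iris dataset purely for illustration, computes the candidate pruning strengths for a fully grown tree and refits with one of the stronger ones:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Grow a full tree, then compute the candidate alphas along its pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# Refit with a non-zero ccp_alpha; larger values prune more aggressively
# (the largest alpha in the path collapses the tree to a single node).
pruned = DecisionTreeClassifier(random_state=0,
                                ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(full.get_n_leaves(), "->", pruned.get_n_leaves())   # the pruned tree has far fewer leaves

In practice the value of ccp_alpha is usually chosen by cross-validation rather than picked from the path by hand.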

Working with Decision Trees in Python:

# Import the library
# (import other necessary libraries like pandas and NumPy as needed)

from sklearn import tree

# Assumes you have X (predictors) and y (target) for the training set
# and x_test (predictors) for the test set

# Create a tree object for classification; the splitting criterion can be
# 'gini' (the default) or 'entropy' (information gain)

model = tree.DecisionTreeClassifier(criterion='gini')

# model = tree.DecisionTreeRegressor() for regression

# Train the model using the training set and check the score

model.fit(X, y)

model.score(X, y)

# Predict the output

predicted = model.predict(x_test)
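
For a runnable end-to-end version of the snippet above, one option (the dataset and split ratio are just illustrative choices) is to plug in scikit-learn's bundled iris data with a train/test split:

from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out 30% of it as a test set.
X, y = load_iris(return_X_y=True)
X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

model = tree.DecisionTreeClassifier(criterion='gini')
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(x_test, y_test))
predicted = model.predict(x_test)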

Advantages of the Decision Tree

  • It is simple to understand, as it follows the same process a human follows when making a decision in real life.
  • Can handle both categorical and numerical data.
  • Relatively resistant to outliers, so little data preprocessing is required.

Disadvantages of the Decision Tree

  • A decision tree can contain many layers, which makes it complex.
  • Prone to overfitting.
  • Require some measure of how well they are doing.
  • Need to be careful with parameter tuning.
  • Can create biased learned trees if some classes dominate.

Summary :

Not all problems can be solved with linear methods; the world is non-linear, and tree-based models have proven effective at capturing this non-linearity. Methods like decision trees, random forests, and gradient boosting are widely used across all kinds of data science problems.

“When your values are clear to you, making decisions becomes easier.” — Roy E. Disney
