# A Gentle Introduction to Decision Tree Learning

Trees play a great role in our lives; it is hard to imagine a healthy life without them. Likewise, many Classification and Regression problems would be far harder to solve without Decision Tree Learning.

## Introduction

Decision tree learning is a supervised learning method for approximating discrete-valued target functions. We learn a mapping from inputs to outputs by analyzing examples whose true values are provided by a supervisor or teacher; these examples form the train data set, and the model trained on it is used to classify unseen data. The learned function is represented by a decision tree.

## Decision Trees

Decision tree learning builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a decision tree with a flowchart-like structure consisting of **decision nodes** and **leaf nodes**. Each decision node (internal, non-leaf node) denotes a test on an attribute; it has two or more branches, and each branch represents an outcome of the test. Each leaf node (terminal node) holds a class label and represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the **root node**. Decision trees can handle both categorical and numerical data.

## Types of Decision Trees

The type of decision tree depends on the kind of target variable we have. There are two types:

1. **Categorical Variable Decision Tree:** A decision tree with a categorical target variable (for example, YES or NO) is called a categorical variable decision tree.

2. **Continuous Variable Decision Tree:** If the decision tree has a continuous target variable, it is called a continuous variable decision tree.

Decision trees represent nested if-else conditions, based upon which we traverse the tree. A typical decision tree is shown below.

Geometrically, a decision tree is a set of axis-parallel hyperplanes that divide the feature space into cuboids or hypercuboids. Suppose we divide a dataset into 3 classes (class 1, 2, and 3) using two features, height and weight: we split the region into 3 parts using two threshold values a and b. These thresholds can take any value; they are purely problem-specific.

If height<a THEN y_i=1

If height≥a AND weight<b THEN y_i=2

If height≥a AND weight≥b THEN y_i=3
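These rules translate directly into code. A minimal sketch, with hypothetical threshold values a = 150 and b = 60 chosen purely for illustration:

```python
# A decision tree is equivalent to nested if-else rules.
# The thresholds a and b below are hypothetical values for illustration.
def classify(height, weight, a=150, b=60):
    if height < a:
        return 1  # class 1: height below threshold a
    elif weight < b:
        return 2  # class 2: height >= a, weight below threshold b
    else:
        return 3  # class 3: height >= a and weight >= b

print(classify(140, 50))  # falls in the region below the height threshold -> 1
print(classify(170, 50))  # tall but light -> 2
print(classify(170, 80))  # tall and heavy -> 3
```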

## Building a Decision Tree

The central choice in building a decision tree is selecting which attribute to test at each node. Constructing a decision tree is all about finding the attribute that returns the highest information gain; to define information gain precisely, we first need a measure called entropy.

## Entropy

Entropy is a measure of disorder, also described as the unpredictability or randomness of an attribute/feature. Entropy tells us how messy our data is: all the classes are mixed together in the initial data set, like a **messy kids' room**, and using entropy and information gain, attributes are placed at different levels of the tree.

The decision tree's goal is to arrange the data points according to their class labels. We try to separate our data and group the samples into the classes they belong to; we know the labels because we construct the tree from the training set. Whenever we create a new node of the tree (cutting the set in two), we try to maximize the purity of the resulting groups as much as possible.

Entropy(S) = −Σᵢ pᵢ log₂(pᵢ), where pᵢ is the proportion of examples of class i in sample S, and the sum runs over the c available classes. For a binary classification problem, entropy lies between 0 and 1.

- If all the data points belong to the same class, then Entropy = 0.
- If 50% of the points belong to one class and the other 50% to another class, then Entropy = 1.
- If 99% of the points belong to one class and only 1% to another class, then Entropy ≈ 0.0808.
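The three cases above follow directly from the entropy formula; a short sketch verifying them:

```python
import math

def entropy(proportions):
    """Shannon entropy: -sum(p * log2(p)) over the class proportions."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))                   # all points in one class -> 0.0
print(entropy([0.5, 0.5]))              # 50/50 split -> 1.0
print(round(entropy([0.99, 0.01]), 4))  # nearly pure split -> 0.0808
```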

## Information Gain

Information gain measures how well a given attribute separates the training examples according to the target classification:

Gain(S, A) = Entropy(S) − Σ (|Sᵥ| / |S|) · Entropy(Sᵥ),

where Values(A) is the set of all possible values v of attribute A, and Sᵥ is the subset of S for which attribute A has value v. The core algorithm for building decision trees, **ID3** by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses *Entropy* and *Information Gain* to construct a decision tree. In a greedy search, we always make the choice that seems best at that moment, i.e., the one that optimizes our *objective function* (either maximizing or minimizing it). In decision trees, the objective function is information gain, which should be as high as possible; the algorithm will always try to maximize it, so the attribute with the highest information gain is tested/split first.
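A minimal sketch of the gain computation, using the classic 14-example play-tennis data (the label and Outlook values below are the standard Quinlan example, assumed here since the article's data table is not reproduced):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)) over values v of A."""
    n = len(labels)
    subsets = {}
    for value, label in zip(attribute_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Classic 14-sample play-tennis labels and their Outlook values
play = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
        'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
outlook = ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
           'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy']
print(round(entropy(play), 3))                     # entropy of S -> 0.940
print(round(information_gain(play, outlook), 3))   # Gain(S, Outlook) -> 0.247
```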

## Gini Impurity

We can also use Gini impurity rather than entropy for calculating information gain. This is done in the same way as for entropy, except that instead of taking a weighted sum of the entropies of each branch of a split, we take a weighted sum of the Gini impurities:

Gini(S) = 1 − Σᵢ pᵢ².

Computing Gini impurity is cheaper than computing entropy, since squares are easier to compute than logarithms; moreover, the maximum possible Gini impurity (for two classes) is 0.5, whereas the maximum possible entropy is 1.
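A quick sketch of the Gini computation and its two-class maximum:

```python
def gini(proportions):
    """Gini impurity: 1 - sum(p^2) over the class proportions."""
    return 1 - sum(p * p for p in proportions)

print(gini([1.0]))       # pure node -> 0.0
print(gini([0.5, 0.5]))  # two-class maximum -> 0.5 (entropy would give 1.0)
```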

## Constructing Decision Tree

Let’s construct a decision tree for “play tennis” with a simple data set. The target class label is play_tennis, and the attributes are Outlook, Temp, Humidity, and Windy, all with categorical values.

**Step 1:** Calculate the entropy of the target class.

**Step 2:** Split the dataset based upon the information gain of the attributes.

The attribute “Outlook” has the highest information gain, so Outlook becomes the root node. Next, we compute Entropy(sunny) and apply the same procedure to decide the root of the leftmost subtree and of the rightmost subtree. However, observe that all the samples of the middle subtree belong to the same class (Yes), so we can create a leaf node there and label it Yes. The final decision tree for the given data set S is shown below.

Since we have already made Outlook the root node, the Outlook attribute is removed, and the dataset size is reduced accordingly. Now we find the gain of the “sunny” subset with respect to the remaining attributes: Temp, Humidity, and Windy.

The second node is Humidity, which has the highest information gain within the sunny subset. When Outlook = Sunny and Humidity = High, the only possible value is play_tennis = No; when Outlook = Sunny and Humidity = Normal, the only possible value is play_tennis = Yes.

Likewise, all the attributes will be placed in the decision tree based upon their entropy and Information gain.

IF outlook = sunny AND humidity = high THEN play_tennis = no

IF outlook = sunny AND humidity = normal THEN play_tennis = yes

IF outlook = overcast THEN play_tennis = yes

IF outlook = rainy AND windy = strong THEN play_tennis = no

IF outlook = rainy AND windy = weak THEN play_tennis = yes
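The learned rules above can be encoded directly as a function (a sketch using the categorical string values from the rules):

```python
def play_tennis(outlook, humidity, windy):
    # Direct translation of the rules learned by the tree
    if outlook == 'sunny':
        return 'no' if humidity == 'high' else 'yes'
    if outlook == 'overcast':
        return 'yes'
    if outlook == 'rainy':
        return 'no' if windy == 'strong' else 'yes'

print(play_tennis('sunny', 'high', 'weak'))       # no
print(play_tennis('overcast', 'high', 'strong'))  # yes
print(play_tennis('rainy', 'normal', 'weak'))     # yes
```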

## Decision Tree Implementation

For implementing decision trees in Python we have **DecisionTreeClassifier** in sklearn. Here we load the **iris** dataset, use 80% of the data as train data, and keep the remaining 20% as test data. In the following code we use 'entropy' as the criterion value; we could also use 'gini'.

```python
# Importing required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Loading the iris data
data = load_iris()
print('Classes in the dataset: ', data.target_names)

# Extracting data attributes and target/class labels
X = data.data
y = data.target
print('Number of datapoints in the dataset:', X.shape[0])

# Splitting the first 80% of rows into train and the last 20% into test
X_train = X[:int(len(X) * .80)]; y_train = y[:int(len(y) * .80)]
X_test = X[int(len(X) * .80):]; y_test = y[int(len(y) * .80):]

# Training the decision tree classifier
dt = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                            min_samples_split=3, min_samples_leaf=2)
print(dt.fit(X_train, y_train))
```

```python
# Calculate feature importances
importances = dt.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [data.feature_names[i] for i in indices]

plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), names, rotation=90)
plt.show()
```

From the above figure, we can conclude that petal width and petal length contribute much more than sepal width and sepal length, and hence are the more important features.

```python
from wordcloud import WordCloud

# Pair each importance with its feature name and sort, highest first
pos_probs_and_words = sorted(zip(dt.feature_importances_, data.feature_names),
                             reverse=True)

print("------------------Feature Importance:---------------------")
for i in range(4):
    print("Feature importance of the word='%s' is %f%%"
          % (pos_probs_and_words[i][1], pos_probs_and_words[i][0] * 100))

# Build a string of the feature names for the word cloud
st = ''
for i in range(4):
    st += pos_probs_and_words[i][1] + ' '
text = st.replace('cm', '')

# Create and generate a word cloud image, then display it
wordcloud = WordCloud(background_color='white', width=1200,
                      height=1000).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
```

The word cloud represents all the feature names available in the dataset.

```python
import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(dt, out_file=None, max_depth=4, label='all',
                                class_names=data.target_names,
                                feature_names=data.feature_names,
                                node_ids=True, impurity=True,
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)
# graph.render("3", view=True)
graph
```

In decision trees, depth is the major hyperparameter. Deeply grown trees overfit the data: when the depth is too high, the number of points at each leaf node decreases, and with fewer points per leaf the chance of fitting noise increases, which also decreases the interpretability of our model. If the tree is too shallow, it underfits. We can find the optimal depth, which gives the minimum misclassification error, using cross-validation with cv=10. Other parameters like min_samples_split and min_samples_leaf can also be treated as hyperparameters. We can also find the best depth value using GridSearchCV and RandomizedSearchCV.

```python
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

depth_val = [1, 2, 3, 4, 5, 6, 9, 12]
ac_scores = []
for d in depth_val:
    dt = DecisionTreeClassifier(max_depth=d, min_samples_split=3,
                                min_samples_leaf=2)
    scores4 = cross_val_score(dt, X_train, y_train, cv=10, scoring='accuracy')
    ac_scores.append(scores4.mean())

# Misclassification error = 1 - accuracy
MSE3 = [1 - x for x in ac_scores]
optimal_depth_re = depth_val[MSE3.index(min(MSE3))]
print('\nThe optimal depth value is %d.' % optimal_depth_re)

# Plotting misclassification error against depth
plt.plot(depth_val, MSE3)
for xy in zip(depth_val, np.round(MSE3, 3)):
    plt.annotate('(%s, %s)' % xy, xy=xy, textcoords='data')
plt.xlabel('depth')
plt.ylabel('Misclassification Error')
plt.show()
```

```python
print("The misclassification error for each depth is:", np.round(MSE3, 3))

# Refit with the optimal depth and evaluate
optimal_dt = DecisionTreeClassifier(max_depth=optimal_depth_re,
                                    min_samples_split=3, min_samples_leaf=2)
optimal_dt.fit(X_train, y_train)

# Predict the response on train data
pred = optimal_dt.predict(X_train)
re = accuracy_score(y_train, pred) * 100
print('\nAccuracy of the DecisionTree Classifier with depth=%d is %f%% on '
      'train data' % (optimal_depth_re, re))

# Predict the response on test data
pred = optimal_dt.predict(X_test)
re = accuracy_score(y_test, pred) * 100
print('\nAccuracy of the DecisionTree Classifier with depth=%d is %f%% on '
      'test data' % (optimal_depth_re, re))
```

Cross-validation selects an optimal depth of 3; accuracy on train data is 97.5%, and accuracy on test data is 70%. Extensions of decision trees include Random Forests and Gradient Boosted Decision Trees (GBDT).

## Cross-Validation using GridSearchCV

```python
# Using GridSearchCV
from sklearn.model_selection import GridSearchCV

tuned_parameters = {'max_depth': [1, 2, 3, 4, 5, 10, 13, 15],
                    'min_samples_split': [2, 3, 4, 5],
                    'min_samples_leaf': [1, 2, 3, 4, 5]}
model3 = GridSearchCV(DecisionTreeClassifier(criterion='entropy'),
                      tuned_parameters, scoring='accuracy', cv=10)
model3.fit(X_train, y_train)

est = model3.best_estimator_
print("The best estimator found is %s" % est)
mean_re = model3.score(X_test, y_test) * 100
print("Accuracy of the model is %f%%" % mean_re)
```

## Cross-Validation using RandomizedSearchCV

```python
# Using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

# max_depth must be an integer, so sample integer depth candidates
tuned_parameters = {'max_depth': np.random.randint(1, 20, 10)}
model3 = RandomizedSearchCV(DecisionTreeClassifier(criterion='entropy',
                                                   min_samples_split=3,
                                                   min_samples_leaf=2),
                            tuned_parameters, scoring='accuracy', cv=10)
model3.fit(X_train, y_train)

est = model3.best_estimator_
print("The best estimator found is %s" % est)
mean_re = model3.score(X_test, y_test) * 100
print("Accuracy of the model is %f%%" % mean_re)
```

With RandomizedSearchCV we got max_depth=3 and accuracy=73%, and with GridSearchCV we got max_depth=10, min_samples_leaf=1, min_samples_split=2 and accuracy=80%. In this way, we can tune any number of hyperparameters using GridSearchCV and RandomizedSearchCV.

## Conclusion

This explanation of decision tree learning aimed at a better understanding of the whole idea behind it. As we can see, a decision tree is a kind of probability tree that helps us make personal or business decisions, and it shows a balanced picture of the risks and opportunities related to each possible choice. Entropy captures the randomness in an attribute, and information gain drives the whole construction. We should build decision trees of appropriate depth in order to avoid overfitting and underfitting the data.