A Gentle Introduction to Decision Tree Learning

Trees play a great role in our lives; life would be very difficult, perhaps impossible, without them, because trees are essential to a healthy life. Likewise, many classification and regression problems would be far harder to solve without decision tree learning.

Introduction

Decision tree learning is a supervised method for approximating discrete-valued target functions. We learn a mapping from inputs to outputs by analyzing examples whose true values are supplied by a supervisor or teacher, typically in the form of a training data set; the model trained on that data is then used to classify unseen data. The learned function is represented by a decision tree.

Decision Trees

A decision tree builds classification or regression models in the form of a tree structure. It breaks a data set down into smaller and smaller subsets while, at the same time, an associated tree is incrementally developed. The final result is a decision tree with a flowchart-like structure of decision nodes and leaf nodes. Each decision node (internal, non-leaf node) denotes a test on an attribute and has two or more branches, each branch representing an outcome of the test. Each leaf node (terminal node) holds a class label and represents a classification or decision. The topmost decision node, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

Types of Decision Trees

The type of decision tree depends on the kind of target variable we have. There are two types:

1. Categorical Variable Decision Tree: a decision tree with a categorical target variable, for example YES or NO.

2. Continuous Variable Decision Tree: a decision tree with a continuous target variable.

Decision trees are a representation of nested if-else conditions that determine how we traverse the tree. A typical decision tree is shown below.

Geometrically, a decision tree is a set of axis-parallel hyperplanes that divide the feature space into cubes, cuboids, or hypercubes and hypercuboids. Here we divide our data set into 3 classes, i.e., classes 1, 2, and 3, splitting the region into 3 parts using 2 threshold values a and b. These thresholds can take any value; they are purely problem-specific.

IF height < a THEN y_i = 1

IF height ≥ a AND weight < b THEN y_i = 2

IF height ≥ a AND weight ≥ b THEN y_i = 3
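These three rules translate directly into nested if-else code. The sketch below uses hypothetical threshold values for a and b (the source leaves them unspecified), purely for illustration:

```python
# A decision tree is just nested if-else rules over axis-parallel splits.
# The thresholds a and b below are hypothetical values for illustration.
def classify(height, weight, a=150.0, b=60.0):
    """Assign one of three classes using two axis-parallel splits."""
    if height < a:
        return 1          # region below the height threshold
    elif weight < b:
        return 2          # height >= a, weight below the weight threshold
    else:
        return 3          # height >= a and weight >= b

print(classify(140, 70))  # height < a              -> 1
print(classify(160, 50))  # height >= a, weight < b -> 2
print(classify(160, 70))  # height >= a, weight >= b -> 3
```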

Building a Decision Tree

The central choice in building a decision tree is selecting which attribute to test at each node. Constructing a decision tree is all about finding, at each step, the attribute that returns the highest information gain; to define information gain precisely, we first need a measure called entropy.

Entropy

Entropy is a measure of disorder, also described as the unpredictability or randomness of an attribute/feature; it tells us how messy our data is. In the initial data set all the classes are mixed together, like a messy kid's room, and entropy and information gain determine which attributes are placed at which levels of the tree.

A decision tree's goal is to arrange the data points according to their class labels: we try to separate the data and group each sample with the class it belongs to. We know the labels because we construct the tree from the training set. Whenever we create a new node (i.e., cut the current set into parts), we try to make the resulting groups as pure as possible.

Entropy(S) = −Σᵢ pᵢ log₂ pᵢ, where pᵢ is the proportion of examples in sample S that belong to class i, and the sum runs over the c available classes. For two classes, entropy lies between 0 and 1.

• If all the data points belong to the same class, then Entropy = 0.
• If 50% of the points belong to one class and the other 50% to another class, then Entropy = 1.
• If 99% of the points belong to one class and only 1% to another class, then Entropy ≈ 0.081.
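The three bullet values above can be checked with a few lines of Python (a minimal sketch of the entropy formula, not part of the original article):

```python
import math

def entropy(proportions):
    """Shannon entropy of a class distribution, in bits.
    Terms with p = 0 are skipped, since lim p*log2(p) = 0."""
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0]))                    # all points in one class -> 0.0
print(entropy([0.5, 0.5]))               # even 50/50 split        -> 1.0
print(round(entropy([0.99, 0.01]), 3))   # 99/1 split              -> 0.081
```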

Information Gain

Information gain measures how well a given attribute separates the training examples according to the target classification function.

Formally, Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v|/|S|) · Entropy(S_v), where Values(A) is the set of all possible values for attribute A and S_v is the subset of S for which attribute A has value v. The core algorithm for building decision trees, ID3 by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking, and uses entropy and information gain to construct the tree. In a greedy search we always make the choice that seems best at the moment, i.e., the one that optimizes our objective function; in decision trees the objective function is information gain. The algorithm always tries to maximize information gain, so the attribute with the highest information gain is tested/split first.
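The gain formula can be sketched as a short function (an illustrative implementation on toy inputs, not code from the article):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# An attribute that separates the labels perfectly recovers the full
# entropy of the sample; an attribute independent of the labels gains nothing.
labels = ['yes', 'yes', 'no', 'no']
print(information_gain(['a', 'a', 'b', 'b'], labels))  # perfect split -> 1.0
print(information_gain(['a', 'b', 'a', 'b'], labels))  # useless split -> 0.0
```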

Gini Impurity

We can also use Gini impurity rather than entropy for calculating information gain: Gini(S) = 1 − Σᵢ pᵢ². This is done in a similar way to how information gain was calculated with entropy, except that instead of taking a weighted sum of the entropies of each branch of a split, we take a weighted sum of the Gini impurities.

Computing Gini impurity is cheap compared to entropy, since squares are easier to compute than logarithms; moreover, the maximum possible Gini impurity (for two classes) is 0.5, whereas the maximum possible entropy is 1.
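A minimal sketch of the Gini formula, confirming the 0.5 maximum for two classes:

```python
def gini_impurity(proportions):
    """Gini impurity: 1 - sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

print(gini_impurity([1.0]))       # pure node                   -> 0.0
print(gini_impurity([0.5, 0.5]))  # maximum for two classes     -> 0.5
```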

Constructing Decision Tree

Let’s start constructing a decision tree for “play tennis” with a simple data set. The target class label is play_tennis, and the attributes are Outlook, Temp, Humidity, and Windy, all with categorical values.

Step 1: Calculate Entropy on Target class

Step 2: Split the data set based on the information gain of each attribute.

The attribute “outlook” has the highest information gain, so outlook becomes the root node. Next we have to compute Entropy(sunny), and apply the same procedure to choose the root of the leftmost subtree and of the rightmost subtree. Observe, however, that all the samples of the middle branch belong to the same class (Yes), so we can create a leaf node there and label it Yes. The final decision tree for the given data set S is given below.
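The gains behind this choice can be reproduced with a short script. The 14-row data set below is the classic play-tennis sample from Quinlan (1986), which matches the attribute names used here; the entropy and gain helpers are redefined so the snippet stands alone:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# The classic 14-example play-tennis data set (Quinlan, 1986).
outlook  = ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast',
            'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy']
humidity = ['high', 'high', 'high', 'high', 'normal', 'normal', 'normal',
            'high', 'normal', 'normal', 'normal', 'high', 'normal', 'high']
windy    = ['weak', 'strong', 'weak', 'weak', 'weak', 'strong', 'strong',
            'weak', 'weak', 'weak', 'strong', 'strong', 'weak', 'strong']
play     = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
            'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']

print(round(entropy(play), 3))                    # Step 1: Entropy(S) -> 0.94
print(round(information_gain(outlook, play), 3))  # highest gain       -> 0.247
print(round(information_gain(humidity, play), 3)) #                    -> 0.152
print(round(information_gain(windy, play), 3))    #                    -> 0.048
```

Outlook wins, which is why it is chosen as the root.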

Since outlook has already been made the root node, the outlook attribute is removed, so the data set shrinks accordingly. Now we find the gain of the sunny subset with the remaining attributes: Temp, Humidity, and Windy.

The second node is Humidity, which has the highest information gain there. When outlook = sunny and humidity = high, the only possible value is play_tennis = no, and when outlook = sunny and humidity = normal, the only possible value is play_tennis = yes.

Likewise, all the attributes will be placed in the decision tree based upon their entropy and Information gain.

IF outlook = sunny AND humidity = high THEN play_tennis = no

IF outlook = sunny AND humidity = normal THEN play_tennis = yes

IF outlook = overcast THEN play_tennis = yes

IF outlook = rainy AND windy = strong THEN play_tennis = no

IF outlook = rainy AND windy = weak THEN play_tennis = yes
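These five rules are the finished tree, so they translate directly into a predictor (a minimal sketch, not code from the article):

```python
def play_tennis(outlook, humidity, windy):
    """The if-then rules read off the final play-tennis decision tree."""
    if outlook == 'sunny':
        return 'no' if humidity == 'high' else 'yes'
    if outlook == 'overcast':
        return 'yes'
    # remaining case: outlook == 'rainy'
    return 'no' if windy == 'strong' else 'yes'

print(play_tennis('sunny', 'high', 'weak'))       # no
print(play_tennis('overcast', 'high', 'strong'))  # yes
print(play_tennis('rainy', 'normal', 'weak'))     # yes
```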

Decision Tree Implementation

For implementing decision trees in Python we have DecisionTreeClassifier in sklearn. Here we load the iris data set and take the first 80% of the data as train data and the remaining 20% as test data. In the following code we use ‘entropy’ as the criterion value; we could also use ‘gini’.

```python
# Importing required libraries
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Loading the iris data
data = load_iris()
print('Classes in the dataset: ', data.target_names)

# Extracting data attributes and target/class labels
X = data.data
y = data.target
print('Number of datapoints in the dataset:', X.shape[0])

# Creating train and test sets (first 80% train, last 20% test)
X_train = X[:int(len(X) * .80)]; y_train = y[:int(len(y) * .80)]
X_test = X[int(len(X) * .80):]; y_test = y[int(len(y) * .80):]

# Training the decision tree classifier
dt = DecisionTreeClassifier(criterion='entropy', max_depth=5,
                            min_samples_split=3, min_samples_leaf=2)
print(dt.fit(X_train, y_train))
```
```python
import matplotlib.pyplot as plt

# Calculate feature importances
importances = dt.feature_importances_

# Sort feature importances in descending order
indices = np.argsort(importances)[::-1]

# Rearrange feature names so they match the sorted feature importances
names = [data.feature_names[i] for i in indices]

plt.figure()
plt.title("Feature Importance")
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), names, rotation=90)
plt.show()
```

From the above figure we can conclude that petal width and petal length contribute much more than sepal width and sepal length, and hence petal width and petal length are the more important features.

```python
from wordcloud import WordCloud

# Pair each importance with its feature name, highest first
pos_probs_and_words = sorted(zip(dt.feature_importances_, data.feature_names),
                             reverse=True)

print("------------------Feature Importance:---------------------")
for i in range(4):
    print("Feature importance of the word='%s' is %f%%"
          % (pos_probs_and_words[i][1], pos_probs_and_words[i][0] * 100))
```
```python
# Join the feature names into one string, dropping the unit 'cm'
st = ''
for i in range(4):
    st += pos_probs_and_words[i][1] + ' '
text = st.replace('cm', '')

# Create, generate, and display a word cloud image
wordcloud = WordCloud(background_color='white', width=1200,
                      height=1000).generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
```

The word cloud represents all the features available in the data set.

```python
import graphviz
from sklearn import tree

dot_data = tree.export_graphviz(dt, out_file=None, max_depth=4, label='all',
                                class_names=data.target_names,
                                feature_names=data.feature_names,
                                node_ids=True, impurity=True,
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph
```

In a decision tree, depth is the major hyperparameter. Deeply grown trees overfit the data: when the depth is too high, the number of points at each leaf node decreases, and with fewer points per leaf the influence of noisy data increases, which reduces the interpretability of our model. If the tree is too shallow, it underfits. We can find the optimal depth value, the one that gives the minimum misclassification error, using cross-validation with cv=10. Other parameters such as min_samples_split and min_samples_leaf can also be treated as hyperparameters. We can also find the best depth value using GridSearchCV and RandomizedSearchCV.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

depth_val = [1, 2, 3, 4, 5, 6, 9, 12]
ac_scores = []
for d in depth_val:
    dt = DecisionTreeClassifier(max_depth=d, min_samples_split=3,
                                min_samples_leaf=2)
    scores = cross_val_score(dt, X_train, y_train, cv=10, scoring='accuracy')
    ac_scores.append(scores.mean())

# Misclassification error = 1 - accuracy
MSE = [1 - x for x in ac_scores]
optimal_depth = depth_val[MSE.index(min(MSE))]
print('\nThe optimal depth value is %d.' % optimal_depth)

# Plotting misclassification error against depth
plt.plot(depth_val, MSE)
for xy in zip(depth_val, np.round(MSE, 3)):
    plt.annotate('(%s, %s)' % xy, xy=xy, textcoords='data')
plt.xlabel('depth')
plt.ylabel('Misclassification Error')
plt.show()
print("The misclassification error for each depth is:", np.round(MSE, 3))

# Refit with the optimal depth and evaluate on train and test data
optimal_dt = DecisionTreeClassifier(max_depth=optimal_depth,
                                    min_samples_split=3, min_samples_leaf=2)
optimal_dt.fit(X_train, y_train)

pred = optimal_dt.predict(X_train)
acc = accuracy_score(y_train, pred) * 100
print('\nAccuracy of the DecisionTree classifier with depth=%d is %f%% on train data'
      % (optimal_depth, acc))

pred = optimal_dt.predict(X_test)
acc = accuracy_score(y_test, pred) * 100
print('\nAccuracy of the DecisionTree classifier with depth=%d is %f%% on test data'
      % (optimal_depth, acc))
```

Optimal depth = 3 is selected by cross-validation; accuracy on train data is 97.5% and on test data 70%. Extensions of decision trees include Random Forests and Gradient Boosted Decision Trees (GBDT).

Cross-Validation using GridSearchCV

```python
from sklearn.model_selection import GridSearchCV

# Using GridSearchCV
tuned_parameters = [{'max_depth': [1, 2, 3, 4, 5, 10, 13, 15]},
                    {'min_samples_split': [2, 3, 4, 5]},
                    {'min_samples_leaf': [1, 2, 3, 4, 5]}]
model3 = GridSearchCV(DecisionTreeClassifier(criterion='entropy'),
                      tuned_parameters, scoring='accuracy', cv=10)
model3.fit(X_train, y_train)
est = model3.best_estimator_
print("An estimation of the model is %s" % est)
mean_re = model3.score(X_test, y_test) * 100
print("Accuracy of the model is %f%%" % mean_re)
```

Cross-Validation using RandomizedSearchCV

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Using RandomizedSearchCV (max_depth sampled as an integer from [1, 20))
tuned_parameters = {'max_depth': randint(1, 20)}
model3 = RandomizedSearchCV(DecisionTreeClassifier(criterion='entropy',
                                                   min_samples_split=3,
                                                   min_samples_leaf=2),
                            tuned_parameters, scoring='accuracy', cv=10)
model3.fit(X_train, y_train)
est = model3.best_estimator_
print("An estimation of the model is %s" % est)
mean_re = model3.score(X_test, y_test) * 100
print("Accuracy of the model is %f%%" % mean_re)
```

With RandomizedSearchCV we got max_depth = 3 and accuracy = 73%, and with GridSearchCV we got max_depth = 10, min_samples_leaf = 1, min_samples_split = 2, and accuracy = 80%. In this way we can tune any number of hyperparameters using GridSearchCV and RandomizedSearchCV.

Conclusion

The explanation above aims at a better understanding of the whole idea behind decision tree learning. As we have seen, a decision tree is a kind of probability tree that helps you make a personal or business decision, and it shows a balanced picture of the risks and opportunities attached to each possible choice. Entropy measures the randomness in an attribute, and information gain is the treasure house of the whole task. We should construct decision trees of appropriate depth in order to avoid overfitting and underfitting of the data.