How to Code and Evaluate Decision Trees
In my most recent blog post, I discussed the two most common splitting metrics for decision trees: entropy/information gain and the Gini index. In this post, I will show how to code a decision tree in Python and discuss the dangers that can occur when using one.
Coding Decision Trees
To begin coding our tree, let's assume that we have a Pandas data frame called df with a categorical target variable. In addition to Pandas, you should also import the following to create the decision tree.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
Note that in this particular case I am using a DecisionTreeClassifier because we are predicting a target class. If you have a continuous target variable, you may instead import a DecisionTreeRegressor.
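For reference, a minimal sketch of the regression setup (assuming a separate data frame with a continuous target) would be:
from sklearn.tree import DecisionTreeRegressor

# Assumes a continuous target column; the interface mirrors the classifier
reg_tree = DecisionTreeRegressor()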
As always, the first thing you should do with your data frame is perform a train-test split prior to any cleaning of the data. Remember that you don't want to borrow information from your testing set when you are training the model.
X = df.drop(['target'], axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y)
Then we can fit our decision tree and collect the y_hat predictions for the training and testing data.
dec_tree = DecisionTreeClassifier()
dec_tree.fit(X_train, y_train)
y_train_hat = dec_tree.predict(X_train)
y_test_hat = dec_tree.predict(X_test)
Inside the DecisionTreeClassifier we can set several parameters to prevent the decision tree from overfitting. These parameters are important because decision trees are very prone to overfitting: each split greedily takes whatever cut yields the most information gain, so with an unlimited number of splits the tree WILL fit the training data perfectly and will not generalize to the testing data.
Here are some of the parameters you can tune to prevent overfitting to the training data (see the sketch after this list).
- criterion: (default = “gini”) the metric used to create splits. Use “entropy” for information gain.
- max_depth: (default = None) the maximum number of layers your tree will have. When None, layers will be added until every split is pure or another min/max parameter is reached.
- min_samples_split: (default = 2) the minimum number of samples in an internal node that allows a split to occur. If a node has fewer than this number of samples, it becomes a leaf (terminal node).
- min_samples_leaf: (default = 1) the minimum number of samples required for a leaf node. A split will only occur if the nodes that result from the split meet this minimum. This can be especially useful in regression trees.
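As a minimal sketch, here is how those parameters might be passed when creating the tree (the specific values are arbitrary choices for illustration, not recommendations):
dec_tree = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=5,
                                  min_samples_split=10,
                                  min_samples_leaf=5)
dec_tree.fit(X_train, y_train)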
Evaluating Decision Trees
Now that we have created our decision tree and collected our y_hat values, we can evaluate the tree using the testing data. For a binary classifier, two great tools to use are the ROC curve with its AUC score and the confusion matrix.
These metrics will require the following imports.
from sklearn.metrics import (roc_curve, auc, roc_auc_score,
                             confusion_matrix)
import matplotlib.pyplot as plt
import numpy as np
import itertools
We then want to define some functions to get the AUC scores and plot the ROC curve. In the functions below, clf stands for the classifier model.
def get_auc_scores(clf, X_train, X_test, y_train, y_test):
    """Prints the AUC scores for training and testing data
    and returns testing score"""
    y_train_score = clf.predict_proba(X_train)[:, 1]
    y_test_score = clf.predict_proba(X_test)[:, 1]
    auc_train = roc_auc_score(y_train, y_train_score)
    auc_test = roc_auc_score(y_test, y_test_score)
    print(f"""
    Training AUC: {auc_train}
    Testing AUC: {auc_test}""")
    return y_test_score
Once you have the y_test_score from the above function, we can use it to plot the ROC curve. In the code below, fpr is the false positive rate and tpr is the true positive rate.
def plot_roc_curve(y_test, y_test_score):
    """Plot ROC curve for testing data"""
    fpr, tpr, _ = roc_curve(y_test, y_test_score)
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, label="ROC curve (area = %0.2f)" % roc_auc)
    plt.plot([0, 1], [0, 1], "k--")
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("Receiver operating characteristic")
    plt.legend(loc="lower right")
    plt.show()
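Putting these together, a quick usage sketch with the dec_tree we fit earlier:
y_test_score = get_auc_scores(dec_tree, X_train, X_test, y_train, y_test)
plot_roc_curve(y_test, y_test_score)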
To read the resulting ROC curve, we are looking for the bend of the curve to push into the upper-left corner. The closer the curve is to the 45-degree dashed line (the performance of random guessing), the worse our model performs.
We can also plot a confusion matrix, which gives us the numeric breakdown of all true/false positives and negatives in the testing data.
def show_cm(y_true, y_pred, class_names=None, model_name=None):
    """Show confusion matrix"""
    cf = confusion_matrix(y_true, y_pred)
    plt.imshow(cf, cmap=plt.cm.Blues)
    if model_name:
        plt.title("Confusion Matrix: {}".format(model_name))
    else:
        plt.title("Confusion Matrix")
    plt.ylabel("True Label")
    plt.xlabel("Predicted Label")
    if class_names is None:
        # Fall back to the labels present in y_true (sorted for a stable order)
        class_names = sorted(set(y_true))
    tick_marks = np.arange(len(class_names))
    plt.xticks(tick_marks, class_names)
    plt.yticks(tick_marks, class_names)
    # Print each count, in white on dark cells and black on light cells
    thresh = cf.max() / 2.0
    for i, j in itertools.product(range(cf.shape[0]),
                                  range(cf.shape[1])):
        plt.text(j, i, cf[i, j],
                 horizontalalignment="center",
                 color="white" if cf[i, j] > thresh else "black")
    plt.colorbar()
    plt.show()
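As a usage sketch, we can pass the testing labels and the y_test_hat predictions we collected earlier:
show_cm(y_test, y_test_hat, model_name='Decision Tree')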
In the confusion matrix we can see how well the decision tree made its decisions. With the colormap, the dark blue diagonal makes it easy to identify that most of the items have been correctly classified. We can also derive additional metrics from the confusion matrix that may be better attuned to the specific business problem, as sketched below.
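For example, a minimal sketch using scikit-learn's precision and recall functions (which metric matters more depends on the relative cost of false positives and false negatives in your problem):
from sklearn.metrics import precision_score, recall_score

# Precision: of the predicted positives, how many were truly positive?
print(precision_score(y_test, y_test_hat))
# Recall: of the actual positives, how many did the model catch?
print(recall_score(y_test, y_test_hat))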
Decision trees themselves are powerful tools, but we want to be cautious about overfitting, which they are very prone to do. We can play around with the parameters above and evaluate how each configuration performs using an ROC curve or a confusion matrix.