Supervised Machine Learning Algorithm Demonstration: Decision Tree

Sasani Perera
5 min read · Jul 8, 2023


Decision trees are powerful and widely used algorithms for solving classification problems in Supervised Machine Learning, as they can precisely organize and separate different classes. Think of a decision tree as a flow chart that guides us through a series of decisions to classify data points accurately. The tree structure starts with a “trunk” (the root node), splits into “branches,” and further extends into “leaves.”

As we traverse this tree, the data points are systematically divided into increasingly similar categories. The “trunk” represents the initial set of all data points, and as we move towards the “branches,” the data points become more refined and separated based on their features. Finally, at the “leaves,” we arrive at finely defined categories where the data points share significant similarities.

This hierarchical structure of decision trees allows for the creation of categories within categories, providing an organic and granular approach to classification. The beauty of decision trees lies in their ability to achieve this level of classification with limited human intervention. The tree learns and discerns patterns in the data on its own, guided by the given features and their relationships.

By leveraging decision trees, we can classify and categorize data points effectively, making informed decisions based on their unique characteristics. This algorithm empowers us to extract valuable insights and predictions from complex datasets, promoting better understanding and decision-making.

Let us start training a model with an example data set, iris.csv.

In this demonstration, we will train a model to detect the species of the Iris flower in Jupyter Notebook.

1. Reading and understanding the data

First, we import pandas and read the .csv file using pd.read_csv().

Then we preview the DataFrame with pandas.DataFrame.head and visualize the data using matplotlib’s pyplot.
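A minimal sketch of the loading step, assuming iris.csv sits in the working directory:

import pandas as pd

#reading the data set into a DataFrame
iris = pd.read_csv('iris.csv')

#previewing the first five rows
iris.head()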

from matplotlib import pyplot as plt
#plotting a histogram for each independent variable in a 2x2 grid

plt.figure(figsize=(15,10))

plt.subplot(2,2,1) #subplot 1
plt.hist(iris['SepalLengthCm'],color='g') #plotting a histogram
plt.title('Distribution of sepal length') #setting title
plt.xlabel('Sepal length') #setting xlabel

plt.subplot(2,2,2)
plt.hist(iris['PetalLengthCm'],color='g')
plt.title('Distribution of petal length')
plt.xlabel('Petal length')

plt.subplot(2,2,3)
plt.hist(iris['SepalWidthCm'],color='g')
plt.title('Distribution of sepal width')
plt.xlabel('Sepal width')

plt.subplot(2,2,4)
plt.hist(iris['PetalWidthCm'],color='g')
plt.title('Distribution of petal width')
plt.xlabel('Petal width')

plt.tight_layout() #avoiding overlapping titles and labels
plt.show()
Histograms representing each independent variable in the data set

To get a better understanding of the data, we can use the following commands:
pandas.DataFrame.shape, pandas.DataFrame.columns, pandas.DataFrame.describe

iris.shape #(number of rows, number of columns)
iris.columns #the column names
iris.describe() #summary statistics for the numeric columns

2. Detect and treat any possible missing values

We know that it is crucial to ensure that the dataset is complete and does not contain any NULL values. If there are missing values in the dataset, it can negatively impact the accuracy and reliability of the model’s predictions.

In this dataset, we do not have any missing values. But if we did, we would need to treat them as we did previously (Logistic Regression, Naive Bayes); both the check and the treatment are shown below.
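A quick check with pandas confirms that the data is complete:

iris.isnull().sum() #count of missing values per column; all zeros means no NULLs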

import numpy as np
iris = iris.replace('',np.nan) #converting empty strings (if any) to NaN

#replacing missing values with the column median
iris.fillna(iris.median(numeric_only=True),inplace=True)

#or deleting the rows and columns with NULL values
iris = iris.dropna(axis=0) #drop row
iris = iris.dropna(axis=1) #drop column

3. Model training data preparation

Using train_test_split, we split the data set:

70% of the data set -> training data
30% of the data set -> test data

#setting our dependent and independent variables
y = iris['Species'] #a 1-D target avoids a shape warning from sklearn
x = iris.drop(['Species'],axis=1)

#setting our test and training data
from sklearn.model_selection import train_test_split
xTrain,xTest,yTrain,yTest = train_test_split(x,y,test_size=0.3)

4. Model training

Here we use DecisionTreeClassifier() from sklearn.tree.

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
#training the model with our data
dtc.fit(xTrain,yTrain)
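As an optional aside (not part of the original walkthrough), the fitted tree can be drawn with sklearn’s plot_tree to inspect the learned splits; this sketch assumes matplotlib is still imported as plt:

from sklearn import tree

plt.figure(figsize=(12,8))
#drawing the fitted tree with feature and class names
tree.plot_tree(dtc, feature_names=list(x.columns), class_names=list(dtc.classes_), filled=True)
plt.show()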

5. Predicting

Let us predict the species for our x-Test data using predict(x). We get back an array of predicted labels in ‘predicted’.

predicted = dtc.predict(xTest)
predicted

6. Accuracy

To calculate the accuracy of the model, we take the confusion matrix of the predicted values against our y-Test data.

from sklearn.metrics import confusion_matrix

confusion_matrix(yTest,predicted)

Here we get a 3×3 confusion matrix, because there are 3 classes in our output: ‘Iris-setosa’, ‘Iris-versicolor’, and ‘Iris-virginica’.

How do we find the TP, TN, FP, and FN values in a case like this?

In a multi-class classification problem, we do not get the TP, TN, FP, and FN values directly as in a binary classification problem; we have to calculate them for each class. Numbering the 3×3 matrix row by row, so that cell 1 is the top-left entry and cell 9 the bottom-right (a code sketch follows the list):

Setosa: TP = cell 1, FN = (cell 2 + cell 3), FP = (cell 4 + cell 7), TN = (cell 5 + cell 6 + cell 8 + cell 9)
Versicolor: TP = cell 5, FN = (cell 4 + cell 6), FP = (cell 2 + cell 8), TN = (cell 1 + cell 3 + cell 7 + cell 9)
Virginica: TP = cell 9, FN = (cell 7 + cell 8), FP = (cell 3 + cell 6), TN = (cell 1 + cell 2 + cell 4 + cell 5)
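In code, these values can be read straight off the matrix; a sketch reusing yTest, predicted, and dtc from the steps above:

import numpy as np

cm = confusion_matrix(yTest,predicted)

#deriving TP, FN, FP, and TN for each class from the 3x3 matrix
for i, label in enumerate(dtc.classes_):
    TP = cm[i, i]                 #diagonal cell for this class
    FN = cm[i, :].sum() - TP      #rest of the class's row
    FP = cm[:, i].sum() - TP      #rest of the class's column
    TN = cm.sum() - TP - FN - FP  #everything else
    print(label, 'TP:', TP, 'FN:', FN, 'FP:', FP, 'TN:', TN)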

The accuracy of the model is the sum of the TPs of all classes divided by the sum of all cell values:

accuracy = (Setosa TP + Versicolor TP + Virginica TP) / (sum of all cell values)
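Reusing cm from the sketch above, this becomes a one-line calculation; sklearn’s accuracy_score returns the same value directly:

import numpy as np
from sklearn.metrics import accuracy_score

accuracy = np.trace(cm) / cm.sum() #sum of the diagonal TPs / sum of all cells
print(accuracy)

print(accuracy_score(yTest,predicted)) #same value, computed by sklearn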

Complete Code: Decision_Tree_Demo

In the next article, we will demonstrate the Random Forest algorithm.

Thank you and Happy Reading!

Follow For More.
