How to start using Decision Tree Classification in R

Dima Diachkov · Published in Data And Beyond · 5 min read · Jul 12, 2023

Hello again, my fellow reader! We are on our way to mastering the basics of machine learning (ML), using dummy datasets. Last time we understood the principle of unsupervised k-means clustering and explored the concept of “accuracy” (very superficially for now).

But today, in this part #30 of the “R for Applied Economics” guide, where we collectively explore various depths of R, data science, and financial/economic analysis, we will learn how to apply a basic supervised machine learning algorithm, the decision tree, to classify data.

Please keep in mind that it is not complicated at all! Basic implementations of the most popular algorithms, such as decision trees, are relatively easy to understand and can be put to use in practice right away.

Credits: Unsplash | Mike Holford

What are decision trees?

Decision trees, in their essence, are simple yet powerful. They are a type of supervised learning algorithm that is mostly used in classification problems. It is important to note that they work for both categorical and continuous input and output variables.

In this technique, we split the population or sample into two or more homogeneous sets based on the most significant splitter/differentiator in input variables. It is a super simple thing to do, even though it may sound complicated at first. So, let’s embark on this journey by loading the Iris dataset and splitting it into a training set and a test set.
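To make the idea of “splitting on a differentiator” concrete before we grow a real tree, here is one split made by hand on the iris data. The threshold of 2.5 on Petal.Length is picked by eye purely for illustration (it happens to match the rule the tree will discover later), and a single such rule already isolates an entire class:

# One hand-made split: a single rule on Petal.Length already separates all "setosa" flowers
data(iris)
table(iris$Species, iris$Petal.Length < 2.5)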

Preparation of train/test data

We already decided to use the iris dataset, but there are a few issues that need to be tackled before we can grow our decision trees, namely: we need train and test datasets. Why is this so important? Imagine this: you’re a data scientist working on an ML model. You’ve spent countless hours fine-tuning it, and it performs exceptionally well on your dataset. Then you deploy the model to production, only to find that it performs poorly on new, unseen data. This is a classic case of overfitting, where the model has become too complex and has essentially “memorized” the training data, failing to generalize to new data.

This is exactly where the concept of a train/test split comes into play. By splitting our dataset into a training set and a test set, we create a mechanism to assess how well our model is likely to perform on unseen data. The training set is used to train the model, while the test set is used to evaluate its performance.

To do that in R, we will split the data into 70% for training and 30% for testing.

# data preparation
# Load the iris dataset
data(iris)

# Split the data into training and test sets
set.seed(20)
train_index <- sample(1:nrow(iris), nrow(iris)*0.7)

# train dataset formation
train_set <- iris[train_index, ]
str(train_set)

# test dataset formation
test_set <- iris[-train_index, ]
str(test_set)
Output for the code above

Here we have 105 objects in train data and 45 in test data. Purely random split! Now we can proceed with trees.
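Note that sample() gives a purely random split, so the three species are not guaranteed to be perfectly balanced between train and test. If you want to preserve the class proportions, a stratified split is one option; here is a minimal sketch using createDataPartition() from the caret package (which we will load later for the confusion matrix anyway):

# Optional alternative: a stratified 70/30 split that keeps species proportions equal
library(caret)
set.seed(20)
strat_index <- createDataPartition(iris$Species, p = 0.7, list = FALSE)
strat_train <- iris[strat_index, ]
strat_test  <- iris[-strat_index, ]
table(strat_train$Species)  # 35 observations per species, i.e. 70% of each class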

Time to grow your decision tree

Now, let’s construct our decision tree using the rpart function from the rpart package.

# Build the decision tree model
library(rpart)
iris_tree <- rpart(Species ~ ., data = train_set, method = "class")
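Before plotting, you can already peek at the fitted splits in text form: printing an rpart object lists every node with its split rule, the number of observations in it, and the predicted class.

# Text view of the fitted tree: one line per node with its split rule, size, and predicted class
print(iris_tree)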

Okay, we have something. You can explore the structure of the iris_tree object in your environment panel. But if you are a visual person like I am, we can also visualize our tree using the rpart.plot package.

# Plot the decision tree
library(rpart.plot)
rpart.plot(iris_tree, main = "Decision Tree for Iris Dataset")
Output for the code above

This plot provides a visual representation of the decision-making process of the decision tree for classifying a given iris flower based on its features.

A decision tree plot consists of nodes (the boxes) and edges. Each internal node represents a decision based on one of the input variables, and each edge represents the outcome of that decision. The terminal nodes, also known as leaves, represent the final predictions of the model.

Here we see the criteria the tree used to dissect the data into subsets of different classes (Petal.Length < 2.5, Petal.Width < 1.8). With these criteria, our training set is split into 32% “setosa”, 44% “versicolor”, and 24% “virginica”. In fact, we make a small error, which you can see in the terminal nodes: for example, the “versicolor” terminal leaf contains a small fraction of observations classified as “versicolor” that actually belong to another class (in the “virginica” leaf the situation is the opposite).
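If you prefer to read those criteria as plain if-then rules instead of tracing the plot, recent versions of the rpart.plot package also offer a helper, rpart.rules(), which prints one rule per leaf. A minimal sketch:

# One readable rule per leaf, built from the Petal.Length / Petal.Width thresholds above
library(rpart.plot)
rpart.rules(iris_tree)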

Time to make predictions

Next, let’s use our decision tree to make predictions on our test set. To do that, we pass the fitted tree and the test data to predict().

# Predict on the test set
predictions <- predict(iris_tree, test_set, type = "class")
predictions
Output for the code above

Here we have assigned a predicted class to every test object. Finally, we can evaluate the performance of our model by creating a confusion matrix, which provides a summary of the prediction results on the test set.

# Evaluate the model
library(caret)
cm <- confusionMatrix(predictions, test_set$Species)
cm
Output for the code above

What does this output give us? We see that with an accuracy of 97.78%, we have successfully determined the class of the objects in the test data. Pretty impressive, right? Let me just remind you that we did not provide any criteria, principles, or rules for classifying the data. We just said what we have and what we would like to predict. Everything else was done automatically, with only one error for a test set of 45 objects.
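If you want to double-check that 97.78% yourself, accuracy is simply the share of test observations whose predicted class matches the true one, and caret stores the same number inside the confusionMatrix() result:

# Accuracy by hand: proportion of correct predictions on the test set
mean(predictions == test_set$Species)  # 44 correct out of 45, roughly 0.9778
# The same value as reported by caret
cm$overall["Accuracy"]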

Wrap-up

Such simple decision trees offer a flexible, easy-to-understand approach to classification and data exploration. They are a valuable tool in any data scientist’s arsenal, providing a solid foundation for more complex algorithms and techniques. Today we got familiar with decision trees, the rpart package, and the train/test split concept, and touched on the topic of accuracy once again.

Stay tuned for more such explorations, as we continue our quest for knowledge in the fascinating world of data science 🚀

Please clap 👏 and subscribe if you want to support me. Thanks! ❤️‍🔥
