Decision Tree-Based Diabetes Classification in R

Decision Tree Training, Pruning, and Hyperparameter Tuning.

Rahul Raoniar
The Researchers’ Guide
11 min read · Aug 4, 2020



Article Outline

  • What is a decision tree?
  • Why use them?
  • Data Background
  • Descriptive Statistics
  • Decision Tree Training and Evaluation
  • Decision Tree Pruning
  • Hyperparameter Tuning

What is a decision tree?

A decision tree is a flowchart-like representation. The classification and regression tree (a.k.a. decision tree) algorithm is usually attributed to Breiman et al. (1984), but it was certainly not the earliest. Wei-Yin Loh of the University of Wisconsin has written about the history of decision trees; you can read it in “Fifty Years of Classification and Regression Trees”.

In a decision tree, the top node is called the “root node” and the bottom nodes are called “terminal nodes” (or leaves). The nodes in between are “internal nodes”; each internal node contains a binary split condition, while each leaf node carries an associated class label.

Photo by Saed Sayad on saedsayad.com

A classification tree uses split conditions to predict a class label based on the provided input variables. The splitting process starts at the top node (root node), and at each node the tree checks the supplied input values against a splitting condition (Gini impurity or information gain) to decide whether to continue down the left or right branch, recursively. The process terminates when a leaf (terminal) node is reached.

Why use them?

A single decision tree-based model is easy to build, plot, and interpret, which makes this algorithm so popular. You can use it for classification as well as regression tasks.

Data Background

In this example, we are going to use the Pima Indian Diabetes 2 data set obtained from the UCI Repository of machine learning databases (Newman et al. 1998).

This data set is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the data set is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the data set. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The Pima Indian Diabetes 2 data set is a refined version of the Pima Indian diabetes data in which all missing values are coded as NA. The data set contains the following independent and dependent variables.

Independent variables (symbol: I)

  • I1: pregnant: Number of times pregnant
  • I2: glucose: Plasma glucose concentration (glucose tolerance test)
  • I3: pressure: Diastolic blood pressure (mm Hg)
  • I4: triceps: Triceps skinfold thickness (mm)
  • I5: insulin: 2-Hour serum insulin (mu U/ml)
  • I6: mass: Body mass index (weight in kg / (height in m)²)
  • I7: pedigree: Diabetes pedigree function
  • I8: age: Age (years)

Dependent Variable (symbol: D)

  • D1: diabetes: diabetes case (pos/neg)

Aim of the Modelling

  • Fitting a decision tree classification machine learning model that accurately predicts whether or not the patients in the data set have diabetes
  • Decision tree pruning to reduce overfitting
  • Decision tree hyperparameter tuning

Loading relevant libraries

The first step of data analysis starts with loading relevant libraries.
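
The original code listing is not reproduced in this extract; a minimal sketch of the packages implied by the functions used later in the article (the evaluation helpers from caret and Metrics, and rpart.plot for plotting, are assumptions) might look like this:

# Packages used throughout this walkthrough
library(mlbench)     # provides the PimaIndiansDiabetes2 data set
library(dplyr)       # glimpse() for inspecting data types
library(rpart)       # decision tree training and pruning
library(rpart.plot)  # plotting the fitted tree (assumed plotting package)
library(caret)       # confusionMatrix() for evaluation
library(Metrics)     # accuracy() (assumed source of the accuracy helper)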

Loading dataset

The very next step is to load the data into the R environment. Since the data set ships with the mlbench package, it can be loaded by calling data( ).
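
A sketch of that call (the working-copy name diabetes_data is chosen here for illustration):

# Load the Pima Indian Diabetes 2 data from mlbench
data("PimaIndiansDiabetes2")
diabetes_data <- PimaIndiansDiabetes2   # illustrative working copy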

Data Preprocessing

The next step is to perform exploratory analysis. First, we remove the missing values using the na.omit( ) function. Then we print the data types using the glimpse( ) function from the dplyr library. You can see that all variables except the dependent variable (diabetes: categorical/factor) are of type double.
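
A minimal sketch of this preprocessing step, reusing the illustrative diabetes_data object:

# Drop rows with missing values, then inspect the column types
diabetes_data <- na.omit(diabetes_data)
glimpse(diabetes_data)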

Data Types

Train and Test Split

The next step is to split the dataset into 80% train and 20% test. Here, we use the sample( ) function to randomly assign a group index (train or test) to each observation, sampling with replacement, and then split the data based on that index.
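
A sketch of this split, assuming the seed and the object names train/test (the exact row counts below depend on the seed, which is not shown in this extract):

# Randomly assign each row to train (~80%) or test (~20%)
set.seed(123)  # illustrative seed
ind   <- sample(2, nrow(diabetes_data), replace = TRUE, prob = c(0.8, 0.2))
train <- diabetes_data[ind == 1, ]
test  <- diabetes_data[ind == 2, ]
dim(train); dim(test)   # reported in the article as 318 x 9 and 74 x 9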

The train data includes 318 observations and the test data includes 74 observations. Both contain 9 variables.

Train and Test Dimension

Model Training

The next step is model training and evaluation of model performance.

Training a Decision Tree

For decision tree training, we will use the rpart( ) function from the rpart library. The arguments include the model formula, the data, and the method.

formula = diabetes ~ ., i.e., diabetes is predicted by all independent variables (every column except diabetes itself)

Here, the method should be specified as "class" for a classification task.
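
A minimal sketch of the training call, assuming the train object from the split above:

# Fit a classification tree: diabetes predicted by all other variables
Diabetes_model <- rpart(
  formula = diabetes ~ .,
  data    = train,
  method  = "class"
)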

Model Plotting

The main advantage of a tree-based model is that you can plot the tree structure and figure out the decision mechanism.
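
The plotting call is not shown in this extract; one common option is rpart.plot( ) from the rpart.plot package:

# Visualise the fitted tree structure
rpart.plot(Diabetes_model)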

Diabetes_model Tree Structure

Model Performance Evaluation

The next step is to see how our trained model performs on the test/unseen dataset. To predict the test data classes, we need to supply the model object, the test dataset, and type = "class" to the predict( ) function.
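
A sketch of the prediction step (the object name Diabetes_pred is illustrative):

# Predict class labels for the unseen test data
Diabetes_pred <- predict(Diabetes_model, newdata = test, type = "class")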

(a) Confusion matrix

To evaluate the test performance, we are going to use the confusionMatrix( ) function from the caret library. We can observe that out of 74 observations it wrongly predicts 17 observations. The model has achieved about 77.03% accuracy using a single decision tree.
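
A sketch of that evaluation call:

# Confusion matrix and associated statistics on the test set
confusionMatrix(data = Diabetes_pred, reference = test$diabetes)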

Diabetes_model Test Evaluation Statistics

(b) Test accuracy

We can also supply the predicted class labels and original test dataset labels to the accuracy( ) function for estimating the model accuracy.
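
The accuracy( ) helper is presumably the one from the Metrics package (an assumption, since the import is not shown in this extract):

# Proportion of test labels predicted correctly
accuracy(actual = test$diabetes, predicted = Diabetes_pred)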

Diabetes_model Test Accuracy

Splitting Criteria Based Model Comparison

While building the model, the decision tree algorithm uses a splitting criterion. There are two popular splitting criteria used in decision trees: one is called "gini" (the Gini index) and the other "information" (information gain). Here, we compare model performance on the test set after training with each splitting criterion. The splitting criterion is supplied through the parms argument as a list.
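
A sketch of the two training calls, reusing the model names reported later in the article:

# Gini-based splitting (also the rpart default)
Diabetes_model1 <- rpart(diabetes ~ ., data = train, method = "class",
                         parms = list(split = "gini"))

# Information-gain-based splitting
Diabetes_model2 <- rpart(diabetes ~ ., data = train, method = "class",
                         parms = list(split = "information"))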

Model Evaluation on Test Data

After model training, the next step is to predict the class labels of the test dataset.
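
A sketch of the prediction step for both models (the pred_ object names are illustrative):

# Predict test-set classes with each model
pred_model1 <- predict(Diabetes_model1, newdata = test, type = "class")
pred_model2 <- predict(Diabetes_model2, newdata = test, type = "class")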

Prediction Accuracy Comparison

Next, we compare the accuracy of the models. Here, we can observe that the "gini"-based splitting criterion provides a more accurate model than "information"-based splitting.
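
A sketch of that comparison:

# Compare test accuracies of the two splitting criteria
accuracy(actual = test$diabetes, predicted = pred_model1)  # gini
accuracy(actual = test$diabetes, predicted = pred_model2)  # information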

Diabetes_model1 Test Accuracy
Diabetes_model2 Test Accuracy

The initial model (Diabetes_model) and the "gini"-based model (Diabetes_model1) provide the same accuracy, as rpart uses "gini" as its default splitting criterion.

Decision Tree Pruning

The initial model (Diabetes_model) plot shows that the tree structure is deep and fragile, which makes the decision-making process harder to interpret. Here we will explore a way to make the tree more interpretable without losing performance: pruning away the fragile part of the tree (the part that contributes to overfitting).

(a) Plotting the error vs complexity parameter

The decision tree has a parameter called the complexity parameter (cp), which controls the size of the decision tree. If the cost of adding another variable to the tree from the current node is above the value of cp, then tree building does not continue. We can generate the cp vs. error plot using the plotcp( ) function.
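
A sketch of that call:

# Cross-validated relative error versus the complexity parameter
plotcp(Diabetes_model)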

Error vs CP Plot

(b) Generating complexity parameter table

We can also generate the cp table by calling model$cptable. Here, you can observe that xerror reaches its minimum at a CP value of 0.025.
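
A sketch:

# Complexity parameter table: CP, number of splits, and cross-validated error
Diabetes_model$cptable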

(c) Obtaining an optimal pruned model

We can extract the optimal CP value by identifying the index of the minimum xerror and using that index on the CP table.

The next step is to prune the tree using the prune( ) function, supplying the optimal CP value. If we plot the pruned tree, we can observe that it is now very simple and easy to interpret.
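
A sketch of the pruning step (the names min_index, optimal_cp, and Diabetes_model_pruned are illustrative):

# Locate the CP value with the lowest cross-validated error
min_index  <- which.min(Diabetes_model$cptable[, "xerror"])
optimal_cp <- Diabetes_model$cptable[min_index, "CP"]

# Prune the tree at that CP value and plot the simplified tree
Diabetes_model_pruned <- prune(Diabetes_model, cp = optimal_cp)
rpart.plot(Diabetes_model_pruned)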

A person with a glucose level above 128 and an age greater than 25 will be designated as diabetes positive; otherwise, negative.

(d) Pruned tree performance

The next step is to check whether the pruned tree has similar performance or whether performance has been compromised. After the performance check, we can see that the pruned tree is as capable as the earlier fragile tree, but now it is simple and easy to interpret.
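
A sketch of that check:

# Test accuracy of the pruned tree
pred_pruned <- predict(Diabetes_model_pruned, newdata = test, type = "class")
accuracy(actual = test$diabetes, predicted = pred_pruned)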

Decision Tree Hyperparameter Tuning

Next, we will try to increase the performance of the decision tree model by tuning its hyperparameters. The rpart( ) function offers several hyperparameters, but here we will tune two important ones: minsplit and maxdepth.

  • minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.
  • maxdepth: The maximum depth of any node of the final tree.

(a) Generating hyperparameter grid

First, we generate a sequence from 1 to 20 for both minsplit and maxdepth. Then we build a grid of parameter combinations using the expand.grid( ) function.
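
A sketch of the grid construction (the name hyper_grid is illustrative):

# All combinations of minsplit and maxdepth from 1 to 20 (400 settings)
hyper_grid <- expand.grid(
  minsplit = seq(1, 20, 1),
  maxdepth = seq(1, 20, 1)
)
nrow(hyper_grid)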

(b) Training grid-based models

The next step is to train a model for each hyperparameter combination in the grid. This can be done through the following steps (see the sketch after the list):

  • using a for loop to iterate over each hyperparameter combination in the grid and supplying it to the rpart( ) function for model training
  • storing each model in an empty list (diabetes_models)
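
A minimal sketch of the loop (rpart passes minsplit and maxdepth through to rpart.control):

# Train one tree per grid row and store it in a list
diabetes_models <- list()

for (i in 1:nrow(hyper_grid)) {
  diabetes_models[[i]] <- rpart(
    diabetes ~ ., data = train, method = "class",
    minsplit = hyper_grid$minsplit[i],
    maxdepth = hyper_grid$maxdepth[i]
  )
}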

(c) Computing test accuracy

The next step is to check the performance of each model on the test data and retrieve the best one. This can be done through the following steps (see the sketch after the list):

  • using a for loop to iterate over each model in the list, predicting the test data and computing accuracy
  • storing each model's accuracy in an empty vector (accuracy_values)
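
A sketch of the evaluation loop:

# Compute test accuracy for every stored model
accuracy_values <- numeric(length(diabetes_models))

for (i in seq_along(diabetes_models)) {
  pred <- predict(diabetes_models[[i]], newdata = test, type = "class")
  accuracy_values[i] <- accuracy(actual = test$diabetes, predicted = pred)
}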

(d) Identifying the best model

The next step is to retrieve the best-performing model (maximum accuracy) and print its hyperparameters using model$control. We can observe that with a minsplit of 17 and a maxdepth of 6, the model provides the most accurate results when evaluated on the unseen/test dataset.
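
A sketch (the names best_index and best_model are illustrative):

# Pick the model with the highest test accuracy and inspect its settings
best_index <- which.max(accuracy_values)
best_model <- diabetes_models[[best_index]]
best_model$control   # reports minsplit, maxdepth and other control parameters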

(e) Best model evaluation on test data

After identifying the best-performing model, the next step is to see how accurate it is. With the best hyperparameters, the model achieves an accuracy of 81.08%, which is really great.
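
A sketch of that final check:

# Confusion matrix for the best model's test predictions
best_pred <- predict(best_model, newdata = test, type = "class")
confusionMatrix(data = best_pred, reference = test$diabetes)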

(f) Best model plot

Now it is time to plot the best model.
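
A sketch, again assuming rpart.plot for the visualisation:

# Plot the best model's tree structure
rpart.plot(best_model)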

Best Model’s Layout

Even though the above plot is for the best-performing model, it still looks a little bit fragile. So your next task would be to prune it and see whether you get a more interpretable decision tree.

I hope you learned something new. See you next time!

References

[1] Breiman, L., Friedman, J., Stone, C.J. and Olshen, R.A., 1984. Classification and regression trees. CRC press.

[2] Loh, W.-Y. (2014). Fifty Years of Classification and Regression Trees. International Statistical Review, 82(3), 329–348.

[3] Newman, C. B. D. & Merz, C. (1998). UCI Repository of machine learning databases, Technical report, University of California, Irvine, Dept. of Information and Computer Sciences.
