Data visualization and diagnosis of diabetes using logistic regression

Jacky Lim · Published in Analytics Vidhya · 6 min read · May 28, 2021

Let’s talk about the data first. The data can be downloaded from Kaggle, provided that you have a Kaggle account. The goal is to predict potential diabetes cases from certain diagnostic measures. All the subjects surveyed are females of Pima Indian heritage, at least 21 years old.

Logistic regression (LR) is one of the most important predictive models for classification. To put it simply, logistic regression can be used to model the probability of diabetes. The key concept of logistic regression is the logit, the natural logarithm of the odds:

logit(Pᵢ) = ln(Pᵢ / (1 − Pᵢ)) = β₀ + β₁xᵢ₁ + β₂xᵢ₂ + … + βₖxᵢₖ

In the equation above, Pᵢ is the probability of diabetes for patient i, the β terms are the coefficients of the LR model, and xᵢ₁, …, xᵢₖ are the features of data sample i.
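
As a quick illustration, base R’s qlogis() and plogis() implement the logit and its inverse, so you can check the mapping between probabilities and log odds directly:

p=0.25
qlogis(p)         # log odds: log(0.25/0.75) = -1.0986
plogis(qlogis(p)) # back to the probability: 0.25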

For this dichotomous classification task, I will use R to load the data, split it into training and test datasets, perform data visualization and model training on the training dataset, and finally evaluate the model on the hold-out dataset.

First, load important library packages.

library(cdata)    # data wrangling
library(ggplot2) # elegant plots
library(knitr) # display table
library(reshape2) # data wrangling
library(WVPlots) # ROC plot, double density plot, etc.

Next, load the data into the working environment (remember to set the working directory to where the file is located, either with setwd() or by pressing ‘Ctrl + Shift + H’). View the statistical summary of each feature and its variable type (e.g. continuous, Boolean, categorical). There are a total of 768 samples (subjects) and 9 variables (8 attributes and a target).

data=read.csv('diabetes.csv')
summary(data)
str(data)
Output of summary() and str().

From the results, we can get a rough idea of the distribution of each variable. For instance, the age of subjects ranges from 21 to 81 years, and half of them are 29 or younger; Insulin is positively skewed, as shown by the large difference between its mean and median.
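
To verify the skew numerically, compare the mean and median of Insulin directly (approximate values read off the summary output above):

mean(data$Insulin)   # ~79.8, pulled up by the long right tail
median(data$Insulin) # 30.5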

The code snippet below shows the number of positive cases (diabetes) versus negative cases in the data. The data consists of 268 positive cases, while the remaining 500 are negative cases.

data$Outcome=as.factor(data$Outcome)
summary(data$Outcome)

Next, partition the data into a training set and a hold-out set. The training set is used for data visualization and training of the LR model, whereas the hold-out set (test data) is used only for assessing model performance.

# for reproducibility
set.seed(123)
randn=runif(nrow(data))
train_idx=randn<=0.8 # Roughly 80% training data, 20% test data
train=data[train_idx,]
test=data[!train_idx,]
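
As a quick sanity check, verify that the split came out at roughly 80/20:

# confirm the split proportions
nrow(train)            # number of training samples
nrow(test)             # number of test samples
nrow(train)/nrow(data) # should be close to 0.8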

Data Visualization

Data visualization is the presentation of data in a graphical format. It is not only intuitive, but also helpful in exploring data structure and detecting outliers. It is worth noting that the test data should be used only during the model evaluation phase, so the plots below are built from the training set.

target='Outcome'
vars=setdiff(colnames(train),target)
# move data from wide to tall (long) form
data_long=unpivot_to_blocks(train,nameForNewKeyColumn = "variables",
                            nameForNewValueColumn = "values",columnsToTakeFrom = vars)
# plot the histograms
ggplot(data_long,aes(x=values)) +
  geom_histogram(bins=10,fill="gray") +
  facet_wrap(~variables,ncol = 3,scales="free")

A grouped boxplot is another excellent visualization tool for displaying the distribution of each variable per class.

# boxplot of each variable by group
ggplot(data_long,aes(x=Outcome,y=values)) +
  geom_boxplot(color="blue",fill="blue",alpha=0.2,notch=TRUE,
               outlier.color="red",outlier.fill = "red",outlier.size = 2) +
  facet_wrap(~variables,ncol = 3,scales = "free")

From the boxplots above, it is clear that ‘Glucose’ is generally higher among those diagnosed with diabetes, as the notches of the two classes do not overlap. However, it is important to note the presence of outliers. The same pattern is evident for the Age, Pregnancies, and BMI predictors.
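
The gap suggested by the notches can be confirmed by comparing class medians directly, for example for Glucose:

# median Glucose for each class
aggregate(Glucose~Outcome,data=train,FUN=median)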

A correlation matrix is also helpful in revealing (linear) relationships between pairs of variables, as shown in the heatmap below. Age is positively correlated with the number of pregnancies, with a Pearson correlation coefficient of 0.57.

cormat=cor(train[,vars]) # correlation matrix
cormat[upper.tri(cormat)]=NA # keep the lower triangle only
melted_cormat=melt(cormat,na.rm = TRUE)
# heat-map
ggplot(data=melted_cormat,aes(Var1,Var2,fill=value)) +
  geom_tile(color='white') +
  scale_fill_gradient2(low = "blue", high="red",mid="white",midpoint = 0,
                       limit=c(-1,1),space = "Lab",name="Correlation") +
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45,vjust = 1,size = 12,hjust = 1)) +
  coord_fixed() +
  geom_text(aes(Var1,Var2,label=round(value,2)),color="black",size=3)
# grouped scatter plot matrix
pairs(train[,vars], pch=20,cex=0.3,col=train$Outcome,
      lower.panel = NULL)

The grouped scatter plot matrix above shows no discernible decision boundary in any panel that can separate the two classes.

Logistic regression

The LR model can be trained with the glm() function by setting the argument family=binomial("logit"). Logistic regression assumes linearity between the independent variables and the log odds. Despite this theoretical restriction, the coefficients of this parametric model can provide clues about the relationship between individual attributes and the odds/probability of a positive outcome (diabetes).

model=glm(Outcome~.,data=train,family = binomial("logit"))
summary(model)
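
Because the coefficients live on the log-odds scale, exponentiating them yields odds ratios, which are often easier to read:

# odds ratio for a one-unit increase in each predictor
exp(coef(model)) # e.g. exp(0.095) = 1.10 for BMI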

Most of the coefficients are statistically significant, except ‘SkinThickness’ and ‘Insulin’. How can we interpret the model’s coefficients? Let’s take ‘BMI’ as an example. The coefficient estimate of 0.095 indicates that the log odds increase by 0.095 (equivalently, the odds are multiplied by exp(0.095) ≈ 1.10) when BMI increases by 1 unit. What if we want to know the probability instead? Assume that a person has a probability of 0.25 of being diagnosed as diabetic. What happens to that probability if BMI increases by 1 while the other variables remain unchanged? Let’s work out the math.
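
Here is the calculation in R, using the BMI coefficient of 0.095 from the model summary:

# probability -> odds, apply the coefficient, odds -> probability
p0=0.25                # current probability of diabetes
odds0=p0/(1-p0)        # current odds = 1/3
odds1=odds0*exp(0.095) # odds after BMI increases by 1 unit
(p1=odds1/(1+odds1))   # new probability: ~0.2682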

So, an increase in BMI of one unit would raise the probability of diabetes onset from 0.25 to about 0.2682. The same calculation applies to the other variables as well. Even though the interpretation is not as straightforward as in linear regression, it enables us to gain insight into the contribution of each predictor to the response.

Model evaluation

Model generalization performance on previously unseen data (the hold-out dataset) is assessed by accuracy, precision, and recall (sensitivity).

# prediction
train$pred=predict(model,newdata = train,type = "response")
test$pred=predict(model,newdata=test,type="response")
# confusion matrix
(confmat_test=table(truth=test$Outcome,predict=test$pred>0.5))
(acc_test=sum(diag(confmat_test))/sum(confmat_test))
(precision=confmat_test[2,2]/sum(confmat_test[,2]))
(recall=confmat_test[2,2]/sum(confmat_test[2,]))

The performance of the model evaluated on the test data: accuracy 0.7355, precision 0.7083, and recall 0.5574. Furthermore, model performance can be visualized with a receiver operating characteristic (ROC) curve and a double density plot, as shown in the code snippet and graphical outputs below.

plt=DoubleDensityPlot(test,xvar = "pred",truthVar = "Outcome",
                      title = "Distribution of scores of LR model in test data")
plt+geom_vline(xintercept = 0.5, color="red",linetype=2)
test$Outcome_numeric=as.numeric(test$Outcome)-1
ROCPlot(test,xvar='pred',truthVar = "Outcome_numeric",
        truthTarget = TRUE,
        title = "Logistic regression test performance")
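
If you also want the area under the ROC curve as a single number, it can be estimated without extra packages via the Mann-Whitney rank formula (a minimal sketch; auc_est is a helper defined here, not a library function):

# AUC = P(score of a random positive > score of a random negative)
auc_est=function(scores,labels){
  r=rank(scores)
  n_pos=sum(labels==1)
  n_neg=sum(labels==0)
  (sum(r[labels==1])-n_pos*(n_pos+1)/2)/(n_pos*n_neg)
}
auc_est(test$pred,test$Outcome_numeric)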

Fine-tuning

A specific threshold for the scores (model outputs) has to be chosen to binarize the outcomes (either diabetes or non-diabetes). Although the default is normally set at 0.5, one could change the threshold to optimize the tradeoff between precision and recall. A graph that shows both enrichment and recall as functions of the threshold is helpful in selecting an appropriate threshold. For more details on how to select a threshold, please refer to this e-book. Empirical observation shows that 0.375 could be a good threshold. Note that we should always use the training data instead of the test data for this purpose, to prevent data leakage.

train$Outcome_numeric=as.numeric(train$Outcome)-1
# pick the threshold on training data to avoid leakage
plt=PRTPlot(train,'pred','Outcome_numeric',TRUE,
            plotvars = c('enrichment','recall'),
            thresholdrange = c(0,1),
            title = 'Enrichment/recall vs threshold for LR model')
plt+geom_vline(xintercept = 0.375,color="red",linetype=2)
# Confusion matrix of test data at the new threshold
(confmat_test=table(truth=test$Outcome,predict=test$pred>0.375))
(acc_test=sum(diag(confmat_test))/sum(confmat_test))
(precision=confmat_test[2,2]/sum(confmat_test[,2]))
(recall=confmat_test[2,2]/sum(confmat_test[2,]))

This new threshold gives better generalization performance in terms of accuracy and recall, with a slight decline in precision: accuracy 0.7548, precision 0.6825, and recall 0.7049. However, be mindful that if the model is applied to future data with a very different distribution from its training data, it is likely to underperform.

The LR model is an interpretable and powerful predictive model when it comes to binary classification problems. Machine learning practitioners should always try simple models like logistic regression before applying more advanced predictive models. The above code can be found on GitHub and RPubs.
