Binary Logistic Regression


Have you ever come across a situation where you want to predict a binary outcome like:

  • Whether a person is satisfied with a product or not?
  • Whether a candidate will secure admission to a graduate school or not?
  • Which of the two presidential candidates will win the election?
  • Whether a plane will arrive at its destination on time?

A very simple Machine Learning algorithm that can come to your rescue is Logistic Regression.

Logistic Regression is a classification algorithm used when we want to predict a categorical variable (Yes/No, Pass/Fail) based on a set of independent variables.

In the Logistic Regression model, the log odds of the dependent variable are modeled as a linear combination of the independent variables.
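Concretely, if p is the probability of the outcome of interest (say, "Yes") and x₁, …, xₖ are the independent variables, the model is:

log(p / (1 − p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

The left-hand side is the log odds (the logit), and the β's are the coefficients estimated from the data.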

Let’s get more clarity on Binary Logistic Regression using a practical example in R.

Consider a situation where you are interested in classifying an individual as diabetic or non-diabetic based on features like glucose concentration, blood pressure, age, etc.

Description of the data

For our analysis, we’ll use the Pima Indians Diabetes dataset from the ‘mlbench’ package in R.

# Install the required packages (only needed once)
install.packages('mlbench')
install.packages('MASS')
install.packages('pROC')

library(mlbench)   # provides the PimaIndiansDiabetes2 dataset
library(MASS)      # provides stepAIC() for variable selection
library(pROC)      # provides roc() and auc() for model evaluation

data(PimaIndiansDiabetes2)
head(PimaIndiansDiabetes2)

Diabetes is the binary dependent variable in this dataset, with categories pos/neg. We have the following eight independent variables:

  • Pregnant: Number of times pregnant
  • Glucose: Plasma glucose concentration (glucose tolerance test)
  • Pressure: Diastolic blood pressure (mm Hg)
  • Triceps: Skin fold thickness (mm)
  • Insulin: 2-Hr serum insulin (mu U/ml)
  • Mass: Body mass index (weight in kg / (height in m)²)
  • Pedigree: Diabetes pedigree function
  • Age: Age (years)

Let’s now analyze the descriptive statistics for this dataset:

summary(PimaIndiansDiabetes2)

It is evident from the summary statistics that the dataset contains missing values; they are highlighted as NA’s.
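One quick way to see which variables are affected (an optional check, not part of the original output) is to count the NA’s per column:

colSums(is.na(PimaIndiansDiabetes2))   # number of missing values in each column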

As a conservative measure, we can remove all observations with missing values:

newdata <- na.omit(PimaIndiansDiabetes2)
summary(newdata)

Let’s analyze the distribution of each independent variable:

# Plot histograms of the eight numeric predictors in a 4 x 2 grid
par(mfrow = c(4, 2))
for (i in 1:8) {
  hist(newdata[, i], main = colnames(newdata)[i],
       xlab = colnames(newdata)[i], col = 'yellow')
}
Histogram of independent variables

From the above histograms, it is evident that the variables Pregnant and Age are highly skewed, so we can analyze them in buckets.

For Age, we can create the following four buckets: 20–30, 31–40, 41–50 and 50+.

newdata$age_bucket <- as.factor(ifelse(newdata$age <= 30, "20-30",
                                ifelse(newdata$age <= 40, "31-40",
                                ifelse(newdata$age <= 50, "41-50", "50+"))))

For Pregnant, we can create the following three buckets: 0–5, 6–10 and 10+.

newdata$preg_bucket <- as.factor(ifelse(newdata$pregnant <= 5, "0-5",
                                 ifelse(newdata$pregnant <= 10, "6-10", "10+")))

For continuous independent variables, we can get more clarity on the distribution by analyzing it with respect to the dependent variable.

par(mfrow = c(3, 2))
boxplot(glucose ~ diabetes, ylab = "Glucose", xlab = "Diabetes", col = "light blue", data = newdata)
boxplot(pressure ~ diabetes, ylab = "Pressure", xlab = "Diabetes", col = "light blue", data = newdata)
boxplot(triceps ~ diabetes, ylab = "Triceps", xlab = "Diabetes", col = "light blue", data = newdata)
boxplot(insulin ~ diabetes, ylab = "Insulin", xlab = "Diabetes", col = "light blue", data = newdata)
boxplot(mass ~ diabetes, ylab = "Mass", xlab = "Diabetes", col = "light blue", data = newdata)
boxplot(pedigree ~ diabetes, ylab = "Pedigree", xlab = "Diabetes", col = "light blue", data = newdata)
Box Plot for Continuous Independent Variables

From the above plots, we can infer that the median glucose level is higher for patients who have diabetes. Similar inferences can be drawn for the rest of the variables.
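As a quick numeric check of this reading (a small addition beyond the plots), the group-wise medians can be computed directly; glucose is used here as an example:

# Median glucose for the 'neg' and 'pos' diabetes groups
tapply(newdata$glucose, newdata$diabetes, median)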

For categorical independent variables, we can analyze the frequency of each category with respect to the dependent variable:

xtabs(~diabetes + age_bucket, data = newdata)
xtabs(~diabetes + preg_bucket, data = newdata)

We’ll now create a new data frame of relevant modelling variables.

newdata2 <- newdata[,c("diabetes","glucose","pressure","triceps","insulin","mass","pedigree","age_bucket","preg_bucket")]

We now implement Logistic Regression to predict the binary outcome, diabetes, using the dataset “newdata2”. Setting ‘family = binomial’ in ‘glm’ fits a logistic model (the logit link is the default):

logit_1 <- glm(diabetes ~ ., family = binomial, data = newdata2)

Analysis of Model Summary

summary(logit_1)

The summary statistics help us understand the model better by providing the following information:

  1. Distribution of the deviance residuals
  2. Intercept and slope estimates along with standard error, z-value and p-value
  3. AIC value
  4. Residual deviance and Null deviance
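Each of these quantities can also be pulled out of the fitted model object directly; a small sketch using standard ‘glm’ accessors:

coef(summary(logit_1))    # estimates, standard errors, z-values and p-values
logit_1$aic               # AIC
logit_1$deviance          # residual deviance
logit_1$null.deviance     # null deviance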

Interpretation of Results

For continuous variables, the interpretation is as follows:

For every one-unit increase in glucose, the log odds of being diabetic ‘pos’ (versus ‘neg’) increase by 0.039.
Similarly, for a one-unit increase in pressure, the log odds of being diabetic ‘pos’ (versus ‘neg’) decrease by 0.0045.

For categorical variables, each category’s effect is estimated relative to a base category. The base category for the variable ‘age_bucket’ is 20–30, and for ‘preg_bucket’ it is 0–5. The interpretation of such variables is as follows:

Being in the age bucket of 31–40, versus the age bucket of 20–30, changes the log odds of being diabetic ‘pos’ (versus ‘neg’) by 0.854.

Being in the pregnancy bucket of 6–10, versus the pregnancy bucket of 0–5, changes the log odds of being diabetic ‘pos’ (versus ‘neg’) by -0.24.
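Because log odds are not easy to reason about directly, a common companion step (not part of the original analysis) is to exponentiate the coefficients and read them as odds ratios:

# Odds ratios: exponentiated log-odds coefficients
exp(coef(logit_1))
# Approximate 95% confidence intervals on the odds-ratio scale
exp(confint(logit_1))

An odds ratio above 1 means the odds of being ‘pos’ increase with that variable; below 1, they decrease.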

Variable Selection

The model ‘logit_1’ might not be the best model with the given set of independent variables.

There are multiple methodologies for variable selection. In this article, we’ll explore only the ‘stepAIC’ function.

The ‘stepAIC’ function in R performs stepwise model selection with the objective of minimizing the AIC value.

logit_2 <- stepAIC(logit_1)

Let’s analyze the model summary for the newly created model with the minimum AIC:

summary(logit_2)

After applying the ‘stepAIC’ function, we are left with four independent variables: glucose, mass, pedigree, and age_bucket. Of the models explored by the stepwise search, this model (logit_2) has the minimum AIC value.

Moreover, the shortlisted variables are highly significant.
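To confirm exactly which terms survived the stepwise selection, we can inspect the reduced model’s formula:

formula(logit_2)   # the terms retained by stepAIC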

Analysis of the outcome

To analyze the predicted probability of “diabetes” being “pos”, we can use the summary function on the fitted values:

summary(logit_2$fitted.values)

We can also analyze the distribution of the predicted probability of ‘pos’ diabetes.

hist(logit_2$fitted.values, main = "Histogram", xlab = "Probability of 'pos' diabetes", col = 'light green')

Let’s now classify the prediction as “pos” if the fitted value exceeds 0.5, and “neg” otherwise.

newdata2$Predict <- ifelse(logit_2$fitted.values > 0.5, "pos", "neg")

Model Performance Evaluation

We can now evaluate the performance of the model using the following parameters:

  1. AIC

AIC stands for Akaike Information Criterion. It is analogous to adjusted R² and is a measure of fit that penalizes a model for its number of independent variables. Among candidate models, we prefer the one with the minimum AIC value.

We can compare the AIC of the original model, logit_1, with that of the model derived by the stepAIC function, logit_2.

logit_1$aic
logit_2$aic

As expected, the model derived by the stepAIC function has a lower AIC value.

2. Confusion Matrix

It is a tabular representation of observed vs. predicted values, and it helps quantify the accuracy of the model.

Let’s now compare the observed values of “diabetes” with the predicted values:

# Observed classes in rows, predicted classes in columns
mytable <- table(newdata2$diabetes, newdata2$Predict)
rownames(mytable) <- c("Obs. neg", "Obs. pos")
colnames(mytable) <- c("Pred. neg", "Pred. pos")
mytable

# Accuracy: correctly classified observations (the diagonal) over all observations
efficiency <- sum(diag(mytable)) / sum(mytable)
efficiency

From the confusion matrix, the accuracy of our model is 81.4%. (Note that this accuracy is computed on the same data used to fit the model, so it may be optimistic.)
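Beyond overall accuracy, the same table yields sensitivity and specificity; a short sketch, treating ‘pos’ as the event of interest:

# Sensitivity: proportion of observed 'pos' correctly predicted as 'pos'
sensitivity <- mytable["Obs. pos", "Pred. pos"] / sum(mytable["Obs. pos", ])
# Specificity: proportion of observed 'neg' correctly predicted as 'neg'
specificity <- mytable["Obs. neg", "Pred. neg"] / sum(mytable["Obs. neg", ])
sensitivity
specificity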

3. ROC Curve

ROC stands for Receiver Operating Characteristic. It summarizes the model’s performance by evaluating the trade-off between Sensitivity and Specificity across all classification thresholds.

roc(diabetes ~ logit_2$fitted.values, data = newdata2, plot = TRUE, main = "ROC CURVE", col = "blue")

The area under the ROC curve is an index of accuracy: the higher the area under the curve, the better the predictive power of the model.

The AUC of a perfect predictive model equals 1, while an AUC of 0.5 corresponds to random guessing.

auc(diabetes ~ logit_2$fitted.values, data = newdata2)

The area under the curve of model ‘logit_2’ is 0.863.

In the next article, we’ll learn about another widely used logistic regression technique: Ordinal Logistic Regression.

Thanks!