R Applications — Part 3: Logistic Regression

Burak Dilber
Data Science Earth
Feb 23, 2021

In linear regression, the dependent variable we tried to explain consisted of continuous data. In logistic regression analysis, the dependent variable should instead consist of categorical data (patient-healthy, good-bad, etc.). Numerical values (such as 0 and 1) can be assigned to these categories, and data sets of this kind can then be modeled with logistic regression.

So how does logistic regression make this prediction? A probability value is calculated for each observation, and a threshold is then applied to classify it. The threshold is usually set at 0.5, but this may differ in some studies. Probabilities below 0.5 are rounded to 0 and those above to 1. The resulting predictions for each observation are then compared with the actual values, and finally an accuracy percentage is computed.
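
As a minimal illustration with made-up probability values, this thresholding looks as follows in R:

p<-c(0.23,0.61,0.48,0.95)  # hypothetical predicted probabilities
ifelse(p>0.5,1,0)          # returns 0 1 0 1 with a 0.5 threshold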

Let’s examine how logistic regression makes this prediction.

Logistic Function

Let’s first consider a data set with a single independent variable. The function used to estimate the dependent variable for an observation of the independent variable is p = 1 / [1 + exp(-y)]. Here,

y = b0 + b1 * x and p is the probability of the event occurring. We can express this function as p = 1 / [1 + exp(-(b0 + b1 * x))], and rearranging it gives p / (1 - p) = exp(b0 + b1 * x). Then, to write this as a linear combination of the independent variable, a logarithmic transformation is applied to both sides of the equation: log[p / (1 - p)] = b0 + b1 * x.

When there are multiple independent variables, the function is as follows: log[p / (1 - p)] = b0 + b1 * x1 + b2 * x2 + … + bn * xn.
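
As a quick sketch of this function in R (the coefficient values below are made up purely for illustration):

logistic<-function(b0,b1,x) 1/(1+exp(-(b0+b1*x)))
logistic(b0=-6,b1=0.04,x=c(100,150,200))  # probabilities increase with x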

Now we can start the analysis on a data set.

Data Set

For the analysis, the “PimaIndiansDiabetes2” data set from the “mlbench” package in R will be used. Let’s load the data set first:

install.packages("mlbench")
library(mlbench)
data("PimaIndiansDiabetes2")
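
A quick structural check with standard base-R calls confirms what was loaded:

str(PimaIndiansDiabetes2)             # 768 observations, 9 variables
table(PimaIndiansDiabetes2$diabetes)  # counts of "neg" and "pos" cases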

In this data set, 768 people were tested for diabetes, and the diabetes variable takes two categorical values, positive and negative. Our aim here is to predict this diabetes class variable. Let’s look at the descriptive statistics of the data set:

Descriptive Statistics

summary(PimaIndiansDiabetes2)

When we look at the descriptive statistics, we see that there are missing values. Let’s continue the analysis by removing them from the data set. The “na.omit” function can be used in R for this:

PimaIndiansDiabetes2<-na.omit(PimaIndiansDiabetes2)
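
As a quick sanity check (“na.omit” drops every row that contains at least one missing value):

nrow(PimaIndiansDiabetes2)  # 392 complete observations remain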

Thus, missing values are removed from the dataset. Let’s look at the descriptive statistics again.

summary(PimaIndiansDiabetes2)

Thus, the analysis of whether a person has diabetes will continue with a total of 392 complete observations. Before we start modeling, let’s create the matrix plot.

Matrix Plot

In R, a matrix plot can be created using the plot function.

plot(PimaIndiansDiabetes2)

Here we can see the relationships between the variables. Now let’s start building the model, beginning with simple logistic regression.

Simple Logistic Regression

Let’s try to explain the diabetes variable with the glucose variable using simple logistic regression. The “glm” function is used in R for logistic regression. The code is shown below:

model<-glm(diabetes~glucose,data=PimaIndiansDiabetes2,family = binomial)

In order to fit a logistic regression in R, the “family” argument should be set to “binomial”. Now let’s look at the summary of the model using the summary function:

summary(model)

When we look at the summary output, we can say that the glucose variable is statistically significant. In the Estimate column, we see the estimates of the regression beta coefficients. Now we can write the logistic regression model based on these values:

p = exp(-6.09 + 0.042 * glucose) / [1 + exp(-6.09 + 0.042 * glucose)]

When glucose values are substituted in the function, predicted probability values for diabetes can be obtained.
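
For example, the fitted coefficients can be pulled from the model object and plugged into this formula; the glucose value of 150 below is just a hypothetical input:

b<-coef(model)               # intercept and glucose coefficient
1/(1+exp(-(b[1]+b[2]*150)))  # predicted probability at glucose=150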

Multiple Logistic Regression

Let’s create a multiple logistic regression model using all variables in the data set and interpret the summary output. The R code is shown below:

model<-glm(diabetes~.,data=PimaIndiansDiabetes2,family = binomial)
summary(model)

When we look at the summary output, we see that there are both significant and non-significant variables. The glucose, mass, pedigree, and age variables are significant. Keeping variables that are not significant in the model can lead to poor results, so the analysis should continue with the variables that best explain the diabetes variable. To make more precise judgments about which variables to keep, stepwise regression can be used. Let’s identify the variables that need to be removed from the model using stepwise regression.

Stepwise Logistic Regression

In order to apply stepwise logistic regression in R, the “stepAIC” function from the “MASS” package is used:

install.packages("MASS")
library(MASS)
stepAIC(model,trace=FALSE)
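
The selected model can also be stored directly instead of being read off the printed output; “step.model” below is just a convenience name:

step.model<-stepAIC(model,trace=FALSE)
formula(step.model)  # shows the variables retained by the selection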

In addition to the glucose, mass, pedigree, and age variables, we see that the pregnant variable remains in the model. These are the variables that best describe the diabetes variable according to stepwise regression. Let’s rebuild the model using them:

model<-glm(diabetes~pregnant+glucose+mass+pedigree+age,data=PimaIndiansDiabetes2,family=binomial)

Now let’s check the assumptions of logistic regression:

Assumptions

For logistic regression analysis, there should be no influential observations (extreme outliers) and no multicollinearity problems. First, we can look at influential observations. The R code is shown below:

plot(model,which=4,id.n=3)

Here we see the Cook’s distance values. However, based on this plot alone, it is not possible to make definitive judgments about the outliers. For this, the standardized residuals are checked: when all standardized residuals fall between -3 and 3, we can conclude that there are no influential observations. Let’s examine the code:

std.resid<-rstandard(model)  # standardized residuals of the model
z<-abs(std.resid)>3          # flag residuals outside [-3, 3]
table(z)["TRUE"]             # count of flagged observations (NA if none)

When we interpret the output, we can say that there are no influential observations. Now let’s examine the multicollinearity problem between the variables. For this, the “vif” function from the “car” package is used in R.

car::vif(model)

We see that the VIF values are small (values above 5 or 10 are commonly taken to indicate a problem), so there is no multicollinearity between the variables. Now let’s finally compare the actual values with the predicted values and calculate the accuracy rate. Let’s set the threshold value to 0.5: observations with a predicted probability above 0.5 are classified as positive and the others as negative. The R code is shown below:

library(magrittr)  # provides the %>% pipe used below
probabilities<-model %>% predict(PimaIndiansDiabetes2,type="response")
predicted.classes<-ifelse(probabilities>0.5,"pos","neg")
observed.classes<-PimaIndiansDiabetes2$diabetes
mean(predicted.classes==observed.classes)
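
The overall accuracy can also be broken down with a confusion matrix; this is an extra check, not part of the original output:

table(predicted.classes,observed.classes)  # rows: predictions, columns: actual classes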

We can say that the accuracy is 78%: when we predict the diabetes test results with the logistic regression model, 78% of the observations are classified correctly. Finally, let’s set the threshold value to 0.4 and then 0.6 and examine the results:
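
A minimal sketch for trying the other thresholds, reusing the probabilities computed above:

for (t in c(0.4,0.6)) {
  pred<-ifelse(probabilities>t,"pos","neg")
  cat("threshold =",t,"accuracy =",mean(pred==observed.classes),"\n")
}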

As can be seen, when the threshold value is 0.4, an accuracy of 80% is achieved. For this study, the threshold value should therefore be set to 0.4.

Thus, we learned how to model a categorical dependent variable with logistic regression. See you in my next article.

Have a nice day :)

