Supervised Learning in R: Logistic Regression

Fatih Emre Ozturk, MSc
5 min read · Oct 2, 2023

--

In the previous posts of the supervised learning in R series, we covered linear regression problems. We saw that linear regression models are great when we want to predict continuous variables. However, they cannot classify discrete variables.

Assume that we have a data set containing the weights of cats and whether or not they love salmon. If we visualized this data set, it would look something like the following.

To classify data like this, we can use logistic regression, an algorithm used to model the relationship between a categorical response variable, which can be binary, ordinal, or multicategorical, and one or more explanatory variables.

If we applied linear regression to this data (even though it is not appropriate), it would look like the following:

In logistic regression, however, the line looks like the following:

In this image, the y-axis represents a probability that goes from zero to one. The line indicates the probability that a cat will love salmon, so when the line is close to the top of the graph, there is a high probability that the cat will.

At this point, one of the most important features is the classification threshold. Depending on the classification problem we are dealing with, the threshold might change. For the sake of this post, assume that our threshold is 0.5, so the graph looks like the following:

Now, if we have a new cat that we know weighs 4.5 kg, by looking at the graph we can tell that she loves salmon:

How do we fit a sigmoid line to data?

As we covered in the “How to fit a line?” post, in linear regression we fit a line to the data by minimizing the sum of squared residuals (SSR). In contrast, logistic regression swaps out residuals for likelihoods and fits a sigmoid curve that has the maximum likelihood. To calculate this, we compute a likelihood for each observation: one for cats that love salmon and one for cats that do not. The likelihood of the whole data set is the product of the individual likelihoods, and the goal is to find the curve that maximizes it.
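To make the idea concrete, here is a minimal sketch (not part of the original post) of the quantity being maximized. The names loglik, b0, b1, x and y are hypothetical; the function computes the log of the product of the individual likelihoods for candidate coefficients, and fitting amounts to searching for the coefficients that make this value as large as possible.

# Minimal sketch: log-likelihood of a candidate sigmoid for binary data.
# b0, b1 are hypothetical candidate coefficients; x is weight, y is 0/1.
loglik <- function(b0, b1, x, y) {
  p <- 1 / (1 + exp(-(b0 + b1 * x)))      # sigmoid: P(cat loves salmon)
  sum(y * log(p) + (1 - y) * log(1 - p))  # log of the product of likelihoods
}

As we will see below, glm() performs this maximization for us.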

Logistic Regression in R

In R, the glm() function in the stats package can be used to build logistic regression models. First, however, we will create some data for this example:

catgrams <- sample(500:8000, 100, replace = TRUE)   # cat weights in grams
salmon <- ifelse(catgrams > 5000, 1,
                 ifelse(catgrams < 3000, 0, sample(0:1, 100, replace = TRUE)))

df <- data.frame(catgrams, salmon)   # glm() expects a data frame, not a matrix

Now, we can build the logistic regression model:

logisticreg <- glm(salmon ~ catgrams,  # formula
                   data = df,          # data
                   family = binomial)  # binomial logistic regression

And if we want to interpret the model’s output, we can use, as always, the summary() function:

summary(logisticreg)
Call:
glm(formula = salmon ~ catgrams, family = binomial, data = df)

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.2477694 1.3020248 -4.799 1.60e-06 ***
catgrams 0.0015772 0.0003112 5.068 4.03e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 137.628 on 99 degrees of freedom
Residual deviance: 43.254 on 98 degrees of freedom
AIC: 47.254

Number of Fisher Scoring iterations: 6

In the logistic regression model, the coefficients are estimated by maximizing the logarithm of the joint probability (likelihood) function. There is no closed-form solution for the beta coefficients; the estimates are calculated iteratively. For example, when we look at the output, you can see that the number of Fisher scoring iterations is 6.
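As a quick illustrative check (assuming the logisticreg object built above), the fitted glm object records both the iterative estimation and the maximized log-likelihood:

logisticreg$iter      # number of Fisher scoring (IWLS) iterations, 6 here
logLik(logisticreg)   # log-likelihood at the maximum likelihood estimates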

The interpretation of the coefficients is different from the linear regression model. This is because in logistic regression the output is a probability between 0 and 1, and the coefficients do not affect that probability linearly. However, log(odds) is a linear function of the coefficients, and this is what lets us interpret them. To determine how the prediction changes when we increase an independent variable by one unit, we first apply the exp() function to both sides of the log(odds) formula and then compare the estimates before and after the one-unit increase. Instead of looking at their difference, we look at the ratio of the two estimates: the odds ratio.
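For example, a minimal way to obtain odds ratios from the fitted model (again assuming the logisticreg object built above) is to exponentiate the coefficients; Wald confidence intervals can be exponentiated in the same way:

exp(coef(logisticreg))              # odds ratios for the intercept and catgrams
exp(confint.default(logisticreg))   # Wald confidence intervals on the odds-ratio scale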

Characteristics of the odds ratio:

By construction, the odds ratio (OR) cannot be negative; it takes values between 0 and infinity.

Interpretation of OR according to the value it takes can be done as follows:

When OR = 1, it can be said that the factor of interest has no effect on increasing or decreasing the probability of the situation under investigation.

When OR < 1, the factor of interest has a decreasing effect on the probability of the situation under investigation.

When OR>1, the factor of interest has an increasing effect on the probability of the situation under investigation.

Interpretation of Coefficients

A one-unit increase in an independent variable multiplies the odds by exp(Beta). Equivalently, a one-unit increase in an independent variable changes the log(odds) by the corresponding coefficient Beta. Since the log(odds) scale is not very intuitive, coefficients are usually interpreted in terms of the odds ratio. To interpret a coefficient using the cat data set:

For each additional gram of weight, the odds of a cat liking salmon are multiplied by exp(0.0015772) = 1.001578; that is, a cat weighing one gram more has odds 1.001578 times those of an otherwise identical cat.

The intercept can be interpreted as follows: the estimated odds are exp(beta zero) when all numeric independent variables are zero and all categorical independent variables are at their reference category. Here, however, a cat weighing 0 grams is meaningless, so the intercept has no useful interpretation.
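To tie this back to the 4.5 kg cat from the beginning of the post, here is a hedged sketch of how such a prediction could be made with the fitted model; newcat and p_hat are illustrative names, not part of the original code:

newcat <- data.frame(catgrams = 4500)                                # 4.5 kg in grams
p_hat <- predict(logisticreg, newdata = newcat, type = "response")   # predicted probability
ifelse(p_hat > 0.5, "loves salmon", "does not love salmon")          # apply the 0.5 threshold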

More details about this will be examined in the Logistic Regression project post.

Assumptions

  • In binary logistic regression, the dependent variable should have two levels.
  • Observations should be independent of each other.
  • There should be no multicollinearity between independent variables. In other words, there should not be high correlation between independent variables.
  • A linear relationship between the dependent variable and the independent variable is not required, but a linear relationship between the independent variable and the log(odds) values of the dependent variable is required.
  • The sample size should be quite large. Proposed equation:
    n = 10k / p
    where,
    k: number of independent variables,
    p: expected probability of success.

So if there are 5 independent variables and p is 0.10:
n = (10 × 5) / 0.10 = 500
so the required sample size is 500.
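For convenience, the rule of thumb can be written as a small helper function (sample_size is a hypothetical name, not from the post):

sample_size <- function(k, p) 10 * k / p   # k predictors, expected success probability p
sample_size(5, 0.10)                       # 500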

Just like always:

“In case I don’t see ya, good afternoon, good evening, and good night!”
