A Guide to Machine Learning in R for Beginners: Logistic Regression
This is part 5 of my beginner’s series on Machine Learning in R
Imagine you are working as a data scientist for an e-commerce company. One of the company’s task is to send out e-mail offers to customers with a proposal to buy certain products. Your job as a data scientist is to determine whether the contacted person will buy the product or not. All you have is a sample of customers that were contacted recently, their age and a variable whether or not they took action.
So how do we do that? The only way that appears is to contact every person on the list and ask them whether they will buy the product or not. Although this appears to be the only solution, it isn’t the best one.
So as a Data Scientist, you apply your knowledge of Machine Learning to the problem. Clearly, the Linear Regression algorithm will not work here since it only works for problems with a continuous outcome variable. On the other hand, the problem at hand is categorical i.e whether customers will buy a product( =1) or not( =0).
So, Instead of trying to predict exactly whether the people will buy a product or not, you calculate the probability or a likelihood of the person saying yes. Basically you try to fit in probabilities between 0 and 1, which are the two possible outcomes. You also decide a cut off value/threshold and then conclude that people with a probability higher than the threshold will buy the product and vice versa.
And how does it make the work of the company, easier?
Since it gives the probability of people who are more likely to buy a product, it enables the company, to focus only on the customers who are most likely to say Yes.
This the basic intuition behind Logistic Regression. Now let us get to know the math behind it.
What is Logistic Regression
It’s an extension of linear regression where the dependent variable is categorical and not continuous. It predicts the probability of the outcome variable.
Logistic regression can be binomial or multinomial. In the binomial or binary logistic regression, the outcome can have only two possible types of values (e.g. “Yes” or “No”, “Success” or “Failure”). Multinomial logistic refers to cases where the outcome can have three or more possible types of values (e.g., “good” vs. “very good” vs. “best” ). Generally, the outcome is coded as “0″ and “1″ in binary logistic regression.
Representation of Logistic regression
- Logistic Response Function
- Positive values are predictive of class 1
- Negative values are predictive of class 0
The coefficients, or β values, are selected to maximize the likelihood of predicting a high probability for observations actually belonging to class 1 and predicting a low probability for observations actually belonging to class 0. The output of this function is always between 0 and 1
2. Odds Ratio
Odds : P(y=1)/P(y=0)
The odds ratio for a variable in logistic regression represents how the odds change with a 1 unit increase in that variable holding all other variables constant.
Odds > 1 if y = 1 is more likely
Odds < 1 if y = 0 is more likely
3. The Logit
This is called the “Logit” and looks like linear regression
The bigger the Logit is, the bigger is P(y = 1).
The baseline model in case of Logistic Regression is to predict the most frequent outcome as the outcome for all data points. (In the case of Linear regression, the baseline model predicts the average of all data points as the outcome)
Logistic Regression tries to –
- Model the probability of an event occurring depending on the values of the independent variables
- Estimate the probability that an event occurs vs the probability that the event does not occur
- Predict the effect of a series of variables on a binary response variable
- Classify observations by estimating the probability that an observation is in a particular category or not.
The output of a Logistic regression model is a probability. We can select a threshold value. If the probability is greater than this threshold value, the event is predicted to happen otherwise it is predicted not to happen.
A confusion or classification matrix compares the actual outcomes to the predicted outcomes. The rows are labelled with actual outcomes while the columns are labelled with predicted outcomes.
Describing the Performance of a Logistic model
A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known.Let us look at some of the important terms of confusion matrix.
The Confusion Matrix tells us the following:
- There are two possible predicted classes: “yes” and “no”. If we were predicting that employees would leave an organisation, for example, “yes” would mean they will, and “no” would mean they won’t leave the organisation.
- The classifier made a total of 165 predictions (e.g., 165 employees were being studied).
- Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55 times.
- In reality, 105 employees in the sample leave the organisation, and 60 do not.
Basic terms related to Confusion matrix
- True positives (TP): These are cases in which we predicted yes (employees will leave the organisation), and employees actually leave i.e 100
- True negatives (TN): We predicted no(employees will not leave the organisation) and they don’t leave i.e 50
- False positives (FP): We predicted yes they will leave, but they don’t leave. (Also known as a “Type I error.”) i.e 10
- False negatives (FN): We predicted no they will not leave, but they actually leave (Also known as a “Type II error.”) i.e 5
Accuracy : (TP+TN)/Total . Describes overall, how often the classifier correct. i.e 100+50/165
Measures of Accuracy
Sensitivity and specificity are statistical measures of the performance of a binary classification test:
Sensitivity/Recall = TP/(TP + FN). When it’s actually yes, how often does it predict yes? i.e 100/(100+5)
Specificity = TN/(TN + FP) .When it’s actually no, how often does it predict no?? i.e 50/(50+10)
Precision = TP/predicted yes. When it predicts yes, how often is it correct?100/(10+100)
Evaluation metrics for a Classification model’s performance.
ROC(Receiver Operator Characteristic Curve) can help in deciding the best threshold value. It is generated by plotting the True Positive Rate (y-axis) against the False Positive Rate (x-axis) as you vary the threshold for assigning observations to a given class.ROC curve will always end at (1,1). The threshold at this point will be 0. This means that we will always classify these observations falling into the class 1(Specificity will be 0. False positive rate is 1)
One should select the best threshold for the trade off you want to make. According to the criticality of the business, we need to compare the cost of failing to detect positives vs cost of raising false alarms.
High Threshold :
- High specificity
- Low sensitivity
- Low specificity
- High sensitivity
The area under ROC is called Area Under the Curve(AUC). AUC gives the rate of successful classification by the logistic model. To get a more in-depth idea of what a ROC-AUC curve is and how is it calculated, here is a link to the article I wrote on the same topic.
Logistic Regression in R
We’ll be using the dataset
quality.csv to build a logistic regression model in R to predict the quality of care in a hospital. PoorCare is the outcome or dependent variable, and is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.
The objective of the dataset is to assess health care quality. Hence, 131 diabetic patients were randomly selected between the ages of 35 and 55.
- Reading data into R console:
> quality = read.csv('quality.csv')
- Analysing the quality dataset
'data.frame': 131 obs. of 14 variables:
$ MemberID : int 1 2 3 4 5 6 7 8 9 10 ...
$ InpatientDays : int 0 1 0 0 8 2 16 2 2 4 ...
$ ERVisits : int 0 1 0 1 2 0 1 0 1 2 ...
$ OfficeVisits : int 18 6 5 19 19 9 8 8 4 0 ...
$ Narcotics : int 1 1 3 0 3 2 1 0 3 2 ...
$ DaysSinceLastERVisit: num 731 411 731 158 449 ...
$ Pain : int 10 0 10 34 10 6 4 5 5 2 ...
$ TotalVisits : int 18 8 5 20 29 11 25 10 7 6 ...
$ ProviderCount : int 21 27 16 14 24 40 19 11 28 21 ...
$ MedicalClaims : int 93 19 27 59 51 53 40 28 20 17 ...
$ ClaimLines : int 222 115 148 242 204 156 261 87 98 66 ...
$ StartedOnCombination: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ AcuteDrugGapSmall : int 0 1 5 0 0 4 0 0 0 0 ...
$ PoorCare : int 0 0 0 0 0 1 0 0 1 0 ...
We have 131 observations, one for each of the patients in our data set, and 14 different variables. The 12 variables from
AcuteDrugGapSmall are the independent variables while
PoorCare is the
dependent/outcome variable .
Creating a Baseline Model
We know good care is more common than poor care. Hence in this case, we would predict that all patients are receiving good care. By doing this . we would get 98/131 observations correct and an accuracy of 75%. So our baseline model has an accuracy of 75%.This is what we’ll try to beat with our logistic regression model.
Splitting Training & Testing Data
Since we have only one data set, we want to randomly split our data set into a training set and testing set. Testing set is essential to validate our results. For splitting the data we will use the
caTools Package. To make sure that we all get the same split, we’ll set our seed.This initializes the random number generator.The package contains
sample.split command to split the data with a split ratio of 0.75.This means we’ll put 75% of the data in the training set, which we’ll use to build the model, and 25% of the data in the testing
set to test our model.
# Install and load caTools package
# Randomly split data
> split = sample.split(quality$PoorCare, SplitRatio = 0.75)
 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE
 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
 TRUE FALSE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE
 TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE
 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
 TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE
 TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
 FALSE TRUE TRUE TRUE FALSE
There is a TRUE or FALSE value for each of our observations.TRUE means that we should put that observation in the training set, and FALSE means that we should put that observation in the testing set.So now let’s create our training and testing sets using the subset function.
Creating Training & Testing Sets
We will use the
subset function to create the sets.Training set will be called
qualityTrain and testing set
# Create training and testing sets
> qualityTrain = subset(quality, split == TRUE)
> qualityTest = subset(quality, split == FALSE)
There are 99 training samples and 32 testing samples.
Logistic Regression Model
Now, we are ready to build a logistic regression model using
Narcotics as independent variables.We’ll call our model QualityLog and use the “glm” function or “generalized linear model” to build
our logistic regression model. Since we are building the model on training data, we use
qualityTrain .The family argument tells the glm function to build a logistic regression model.
# Logistic Regression Model
> QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics,data=qualityTrain, family=binomial)
glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
data = qualityTrain)
Min 1Q Median 3Q Max
-2.06303 -0.63155 -0.50503 -0.09689 2.16686
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
OfficeVisits 0.08212 0.03055 2.688 0.00718 **
Narcotics 0.07630 0.03205 2.381 0.01728 *
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 111.888 on 98 degrees of freedom
Residual deviance: 89.127 on 96 degrees of freedom
Number of Fisher Scoring iterations: 4
- The coefficients table gives the estimate values for the coefficients,or the betas, for our logistic regression model.We see here that the coefficients for
OfficeVisitsand Narcotics are both positive, which means that higher values in these two variables are indicative of poor care as we suspected from looking at the data.
- We also see that both of these variables have at least one star , meaning that they’re significant in our model.
- AIC value : This is a measure of the quality of the model and is like Adjusted R-squared in that it accounts for the number of variables used compared to the number of observations.A low AIC is desirable.
Making predictions on Training set
Let us call it predictTrain and use the predict function to make predictions using the model QualityLog. We will also use an argument called type=”response” which gives us the probabilities. We should always predict on unseen observations but here we wan’t to get the value of threshold ,hence the predictions on train set.
> predictTrain = predict(QualityLog, type="response")
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.06623 0.11912 0.15967 0.25253 0.26765 0.98456
> tapply(predictTrain, qualityTrain$PoorCare, mean)
tapply function computes the average prediction for each of the true outcomes.
We find that for all of the true poor care cases, we predict an average probability of about 0.44. And for all of the true good care cases, we predict an average probability of about 0.19.
This is good because it looks like we’re predicting a higher probability for the actual poor care cases.
We can convert the probabilities to predictions using what’s called a threshold value, t. If the probability of poor care is greater than this threshold value, t, we predict poor quality care. But if the probability of poor care is less than the threshold value, t, then we predict good quality care.
How to select the value for t:
The threshold value, t, is often selected based on which errors are better.This would imply that t would be best for no errors but it’s rare to have a model that predicts perfectly,
There are two types of errors that this model can make:
1. where model predicts 1, or poor care, but the actual outcome is 0,
2. Where model predicts 0,or good care, but the actual outcome is 1.
To make this discussion a little more quantitative,we use what’s called a confusion matrix or classification matrix.
# Confusion matrix for threshold of 0.5
> table(qualityTrain$PoorCare, predictTrain > 0.5)
0 70 4
1 15 10
# Confusion matrix for threshold of 0.7
> table(qualityTrain$PoorCare, predictTrain > 0.7)
0 73 1
1 17 8
# Confusion matrix for threshold of 0.2
> table(qualityTrain$PoorCare, predictTrain > 0.2)
0 54 20
1 9 16
We see that by increasing the threshold value, the model’s sensitivity decreases and specificity increases while the reverse happens if the threshold value is decreased. So how to choose the optimum threshold value.Picking a good threshold value is often challenging.A Receiver Operator Characteristic curve, or ROC curve, can help us decide which value of the threshold is best.
Install and load ROCR package
# Install and load ROCR package
Recall that we made predictions on our training set and called them predictTrain. We’ll use these predictions to create our ROC curve.
> ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
The first is the predictions we made with our model, which we called predictTrain.The second argument is the true outcomes of our data points,
which in our case, is qualityTrain$PoorCare.
We now use the performance function which defines what we’d like to plot
on the x and y-axes of our ROC curve.
# Performance function
> ROCRperf = performance(ROCRpred, "tpr", "fpr")
This function, takes as arguments the output of the prediction function,
and then what we want on the x and y-axes.
Now, we just need to plot the output of the performance function.
# Plot ROC curve
# Add colors
> plot(ROCRperf, colorize=TRUE)
# Add threshold labels
> plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
The sensitivity, or true positive rate of the model, is shown on the y-axis.while the false positive rate, or 1 minus the specificity, is given on the x-axis. The line shows how these two outcome measures vary with different threshold values.
The ROC curve always starts at the point (0, 0) i.e threshold of value 1.This means at this threshold we will not catch any poor care cases(sensitivity of 0) but will correctly label of all the good care cases(FP = 0)
The ROC curve always ends at the point (1,1) i.e threshold of value 0. This means at this threshold we will catch all the poor care cases(sensitivity of 1) but will incorrectly label of all the good care case as poor cases(FP = 1)
The threshold decreases as you move from (0,0) to (1,1). At the point (0, 0.4), we’re correctly labeling about 40% of the poor care cases with a very small false positive rate. On the other hand, at the point (0.6, 0.9), we’re correctly labeling about 90% of the poor care cases, but have a false positive rate of 60%. In the middle, around (0.3, 0.8), we’re correctly labeling about 80% of the poor care cases, with a 30% false positive rate.
The ROC curve captures all thresholds simultaneously.The higher the threshold, or closer to (0, 0),the higher the specificity and the lower the sensitivity.The lower the threshold, or closer to (1,1),the higher the sensitivity and lower the specificity.
So which threshold value one should pick?
One should select the best threshold for the trade-off one wants to make.If you’re more concerned with having a high specificity or low false positive rate, pick the threshold that maximizes the true positive rate while keeping the false positive rate really low.A threshold around (0.1, 0.5) on this ROC curve looks like a good choice in this case.On the other hand, if one is more concerned with having a high sensitivity or high true positive rate, one should pick a threshold that minimizes the false positive rate
Prediction on Test Set
In this particular example, we used a threshold value of 0.3 and we obtain the following confusion matrix.
> predictTest = predict(QualityLog, type = "response", newdata = qualityTest)
> table(qualityTest$PoorCare,predictTest >= 0.3)
0 19 5
1 2 6
There are total 32 cases in test Set, out of which 24 of them are actually good care, and 8 of them are actually poor care.
The model can accurately identify patients receiving low quality care with test set accuracy being equal to 78% which is greater than our baseline model.
In practice, the probabilities returned by the logistic regression model can be used to prioritize patients for intervention.
This was all about Logistic Regression in R. We studied the intuition and math behind it and also how Logistic regression makes it very easy to solve a problem with categorical outcome variable.