Classification — ISLR Series: Chapter 4 — Part I

In the last blog we talked about linear regression: given some data, predict a numerical response. Chapter 4 and this blog go over the scenario where the response variable is not a numerical value but a class. This type of machine learning is called classification.

The example that ISLR uses is: given people’s loan data, predict whether they will default or not. Visually, the data looks like the orange lines in Figure 1. If we fit a linear function to this type of data, it does not fit well (graph on the left). The model (blue line) only barely touches the data around a Balance of 500. We need a function that fits the data better than the linear function, like the one on the right of Figure 1. There, the prediction meets the data where default is 0 and then quickly rises to 1 to meet the data. That function is the logistic function, and it is the basis of Logistic Regression.

Logistic Regression is a model that can be applied to targets that are represented as a class. A logistic model predicts the probability that an observation belongs to a particular class. Based on these probabilities, we choose the class with the highest probability. In the loan example, a logistic model might predict a 20% probability that a person will default and an 80% probability that he/she will not. Since “not default” has the highest probability, we classify that person as not defaulting.

Logistic Function

To implement the logistic regression model, we implement the logistic function (surprise!):

p(X) = e^(B0 + B1*X) / (1 + e^(B0 + B1*X))

This looks a little intimidating, but it is something we have seen before. If we look closely at the exponent of e, we see the linear regression function. If we write p(X) for the output of the logistic function and rearrange the equation to get rid of the e, we get

log(p(X) / (1 - p(X))) = B0 + B1*X

The right side looks like the linear regression function. The left-hand side of the equation is called the log odds, or “logit”. This shows us that increasing X by one unit changes the log odds by B1.
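Here is a minimal Python sketch of the two forms described above, assuming a single predictor. The function names and the coefficient values are made up for illustration and are not taken from ISLR. It checks that the log odds of the logistic output give back the linear part, and that a one-unit increase in X shifts the log odds by B1.

```python
import math

def logistic(x, b0, b1):
    """Probability p(X) from the logistic function with one predictor."""
    z = b0 + b1 * x
    return math.exp(z) / (1 + math.exp(z))

def log_odds(p):
    """Log odds (logit) of a probability p strictly between 0 and 1."""
    return math.log(p / (1 - p))

# Illustrative coefficients (assumed values, chosen only for this demo)
b0, b1 = -1.0, 0.5

p_at_2 = logistic(2.0, b0, b1)
p_at_3 = logistic(3.0, b0, b1)

# The logit of p(X) recovers the linear part b0 + b1 * X
linear_at_2 = log_odds(p_at_2)

# Raising X by one unit changes the log odds by b1
shift = log_odds(p_at_3) - log_odds(p_at_2)

print(linear_at_2, shift)
```

Note that B1 describes a change in log odds, not a change in probability: the probability itself moves by a different amount depending on where X starts.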

For multiple logistic regression the equations are the same, except that there is an additional term for every predictor (just like for linear regression).

Making Predictions

Making predictions is simple. We train the model to estimate the coefficients (B0, B1, etc.) and plug in the value for X. For example, if B0 = -10.6512, B1 = 0.0055, and our X (the observed balance) is 1000, then the logistic function tells us that the probability of default for a balance of 1000 is 0.00576.
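The arithmetic in this example can be reproduced in a few lines of Python. This is only a sketch: the function name is hypothetical, the coefficients are the ones quoted above, and 1 / (1 + e^(-z)) is an algebraically equivalent way of writing the logistic function.

```python
import math

def predict_default_probability(balance, b0=-10.6512, b1=0.0055):
    """Plug a balance into the logistic function with the quoted coefficients."""
    z = b0 + b1 * balance          # the linear part
    return 1 / (1 + math.exp(-z))  # same value as e^z / (1 + e^z)

p = predict_default_probability(1000)
print(round(p, 5))  # about 0.00576

# Pick the class with the highest probability (for two classes: compare to 0.5)
predicted_class = "default" if p > 0.5 else "no default"
print(predicted_class)
```

With a probability this small, the class with the highest probability is "no default".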

Estimating Coefficients using Maximum Likelihood

For linear regression we use least squares to estimate the regression coefficients. For logistic regression, we use maximum likelihood instead.

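Maximum likelihood can be broken down into two parts: the likelihood function, and maximizing it. The likelihood function measures how probable the observed classes are under a given set of coefficients: for each observation it takes the predicted probability of the class that actually occurred, and it multiplies these probabilities together. The estimates B0 and B1 are chosen to maximize this likelihood function, so that the actual classes are as probable as possible under the model.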
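To make the idea concrete, here is a rough Python sketch under a few assumptions: the data is a made-up toy set, the code works with the log of the likelihood (which has the same maximizer), and plain gradient ascent is used as a simple stand-in optimizer. Real libraries typically use faster methods, so treat this as an illustration and not as how any particular package does it.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of log probabilities the model assigns to the observed classes."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += math.log(p) if y == 1 else math.log(1 - p)
    return total

def fit(xs, ys, learning_rate=0.1, steps=5000):
    """Gradient ascent on the log-likelihood (a simple stand-in optimizer)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log-likelihood with respect to b0 and b1
        grad0 = sum(y - sigmoid(b0 + b1 * x) for x, y in zip(xs, ys))
        grad1 = sum((y - sigmoid(b0 + b1 * x)) * x for x, y in zip(xs, ys))
        b0 += learning_rate * grad0 / n
        b1 += learning_rate * grad1 / n
    return b0, b1

# Made-up toy data: the classes overlap, so the estimates stay finite
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit(xs, ys)
start_ll = log_likelihood(0.0, 0.0, xs, ys)
fitted_ll = log_likelihood(b0_hat, b1_hat, xs, ys)
print(b0_hat, b1_hat, start_ll, fitted_ll)
```

The fitted coefficients give a higher log-likelihood than the starting point of (0, 0), which is exactly what "maximizing the likelihood" means.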


When it comes to evaluating a classification model, a common metric is the error rate: the ratio of incorrect predictions to total observations (check out What is Statistical Learning for more details).

Most of the time the error rate is not enough. What if the classes in the target are imbalanced? Suppose there are two classes, A and B, with 98 A’s and 2 B’s in the target. A model can always predict A and get an error rate of 2%, which looks good, but it is a terrible model because it never identifies a B. In this scenario we need different metrics: sensitivity and specificity.
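The 98-versus-2 scenario is easy to check in Python. This is a small sketch using the class counts from the paragraph above; the variable names are just for illustration.

```python
# 98 observations of class A and 2 of class B
actual = ["A"] * 98 + ["B"] * 2

# A "model" that ignores its input and always predicts A
predicted = ["A"] * len(actual)

errors = sum(a != p for a, p in zip(actual, predicted))
error_rate = errors / len(actual)

# Fraction of the actual B's that the model managed to find
b_total = actual.count("B")
b_found = sum(a == "B" and p == "B" for a, p in zip(actual, predicted))
b_found_rate = b_found / b_total

print(error_rate, b_found_rate)
```

The error rate is only 2%, yet the model finds none of the B observations.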

To explain what they are, we need to use a confusion matrix. A confusion matrix just shows the count of correctly and incorrectly classified observations in a table.

In this confusion matrix, the predicted classes are represented by the columns and the true classes are represented by the rows. The terms “True Negative” and “True Positive” represent when the model predicts the classes accurately. The terms “False Positive” and “False Negative” represent when the model misclassifies the observations.

In the example shown in ISLR, the task was to detect if someone is going to default. The rows show the model prediction (Yes for default and No for no default) and the columns show the actual label. If the true label and the predicted label match (No and No or Yes and Yes) that means the model predicted the label correctly. If they do not match (No and Yes or Yes and No) then that means the model didn’t predict correctly.
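As a rough illustration of how those four cells get counted, here is a small Python sketch. The helper name and the six toy labels are made up for the demo; they are not the ISLR data.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive="Yes"):
    """Count true/false positives and negatives from paired labels."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct only if the true label is positive
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted negative: correct only if the true label is negative
            counts["FN" if a == positive else "TN"] += 1
    return counts

# Toy labels (hypothetical), "Yes" meaning default
actual    = ["No", "No",  "Yes", "Yes", "No", "Yes"]
predicted = ["No", "Yes", "Yes", "No",  "No", "Yes"]

counts = confusion_counts(actual, predicted)
print(counts["TP"], counts["FP"], counts["FN"], counts["TN"])
```

Matching pairs land in the "True" cells and mismatched pairs land in the "False" cells.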

With a better understanding of the confusion matrix, we can tackle sensitivity and specificity. Sensitivity is the percentage of actual positives that the model correctly predicted as positive. In the Default Loan example, 81 observations were predicted as default out of 333 actual positives, giving a sensitivity of ~24.3%. Based on this model, if an observation is truly positive, we will predict it correctly only 24.3% of the time.

Specificity is the percentage of actual negatives that the model correctly predicted as negative. In the Default Loan example, the model predicted 9,644 observations as no default out of a total of 9,667 actual negatives, giving a specificity of ~99.8%.

If we only used the error rate, we would see 275 misclassified observations (252 missed defaults plus 23 false alarms) out of 10,000, an error rate of 2.75%, which sounds great. But the sensitivity and specificity metrics tell a different story. We are not predicting the true positives accurately!
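The three metrics can be recomputed from the counts quoted in the Default Loan example. This is a short Python sketch; the variable names are illustrative, and the false negative and false positive counts are derived from the quoted totals.

```python
# Counts quoted in the Default Loan example
tp, actual_pos = 81, 333        # correctly predicted defaults, all actual defaults
tn, actual_neg = 9644, 9667     # correctly predicted non-defaults, all actual non-defaults

fn = actual_pos - tp            # defaults the model missed
fp = actual_neg - tn            # non-defaults flagged as default

sensitivity = tp / actual_pos
specificity = tn / actual_neg
error_rate = (fn + fp) / (actual_pos + actual_neg)

print(f"sensitivity: {sensitivity:.1%}")
print(f"specificity: {specificity:.1%}")
print(f"error rate:  {error_rate:.2%}")
```

Almost all of the errors come from missed defaults, which the overall error rate hides because defaults are such a small share of the data.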

Collaborators: Michael Mellinger


Chapter 4 has a lot more detail and goes further by covering Linear Discriminant Analysis, Quadratic Discriminant Analysis and K-Nearest Neighbors for classification. These topics will be discussed in Part II of Chapter 4.

This is a learning journey for us, so if something is incorrect or unclear, let us know in the comments and we can clear it up.


