Machine Learning for Humans, Part 2.2: Supervised Learning II
Classification with logistic regression and support vector machines (SVMs).
Classification: predicting a label
Is this email spam or not? Is that borrower going to repay their loan? Will those users click on the ad or not? Who is that person in your Facebook picture?
Classification predicts a discrete target label Y. Classification is the problem of assigning new observations to the class to which they most likely belong, based on a classification model built from labeled training data.
The accuracy of your classifications will depend on the effectiveness of the algorithm you choose, how you apply it, and how much useful training data you have.
Logistic regression: 0 or 1?
Logistic regression is a method of classification: the model outputs the probability of a categorical target variable Y belonging to a certain class.
A good example of classification is determining whether a loan application is fraudulent.
Ultimately, the lender wants to know whether they should give the borrower a loan or not, and they have some tolerance for risk that the application is in fact fraudulent. In this case, the goal of logistic regression is to calculate the probability (between 0% and 100%) that the application is fraudulent. With these probabilities, we can set some threshold below which we’re willing to lend to the borrower, and above which we deny their loan application or flag it for further review.
Though logistic regression is often used for binary classification where there are two classes, keep in mind that classification can be performed with any number of categories (e.g. when assigning handwritten digits a label between 0 and 9, or using facial recognition to detect which friends are in a Facebook picture).
Can I just use ordinary least squares?
Nope. If you trained a linear regression model on a bunch of examples where Y = 0 or 1, you might end up predicting some probabilities that are less than 0 or greater than 1, which doesn’t make sense. Instead, we’ll use a logistic regression model (or logit model) which was designed for assigning a probability between 0% and 100% that Y belongs to a certain class.
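To see the problem concretely, here is a minimal sketch that fits a plain least-squares line to 0/1 labels. The toy data and the helper name fit_line are made up for illustration; the point is only that the fitted line happily predicts values below 0 and above 1.

```python
# Toy data (made up for illustration): one feature x, and a 0/1 label y.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    b1 = cov_xy / var_x
    b0 = mean_y - b1 * mean_x
    return b0, b1


b0, b1 = fit_line(xs, ys)

# "Probabilities" predicted by the straight line at the edges of the data.
low = b0 + b1 * 0.0    # comes out negative
high = b0 + b1 * 10.0  # comes out greater than 1
print(f"prediction at x=0: {low:.3f}, prediction at x=10: {high:.3f}")
```

A straight line has no ceiling or floor, so it will always leave the [0,1] range eventually as x moves far enough in either direction.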
How does the math work?
Note: the math in this section is interesting but might be on the more technical side. Feel free to skim through it if you’re more interested in the high-level concepts.
The logit model is a modification of linear regression that makes sure to output a probability between 0 and 1 by applying the sigmoid function, which, when graphed, looks like the characteristic S-shaped curve that you’ll see a bit later.
Recall the original form of our simple linear regression model, which we’ll now call g(x) since we’re going to use it within a compound function:
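Written out for a single feature x, and leaving off the noise term ϵ as a simplifying assumption (only the deterministic part of the line gets fed into the next step), that is:

```latex
g(x) = \beta_0 + \beta_1 x
```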
Now, to solve this issue of getting model outputs less than 0 or greater than 1, we’re going to define a new function F(g(x)) that transforms g(x) by squashing the output of linear regression to a value in the [0,1] range. Can you think of a function that does this?
Are you thinking of the sigmoid function? Bam! Presto! You’re correct.
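For reference, the standard definition of the sigmoid, written here with the name F from the paragraph above and an input called z (a placeholder symbol introduced for this sketch), is:

```latex
F(z) = \frac{1}{1 + e^{-z}}
```

Very negative values of z push F(z) toward 0, very positive values push it toward 1, and F(0) is exactly 0.5.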
So we plug g(x) into the sigmoid function above, resulting in a function of our original function (yes, things are getting meta) that outputs a probability between 0 and 1:
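Assuming the single-feature form g(x) = β0 + β1x and the standard sigmoid 1 / (1 + e^(-z)), the compound function can be written as:

```latex
p = P(Y = 1) = F(g(x)) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}
```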
Here we’ve isolated p, the probability that Y=1, on the left side of the equation. If we want to solve for a nice clean β0 + β1x on the right side so we can straightforwardly interpret the beta coefficients we’re going to learn, we’d instead end up with the log-odds, or logit, on the left side (hence the name “logit model”):
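A short derivation, assuming the single-feature sigmoid form p = 1 / (1 + e^(-(β0 + β1x))), goes like this: flip both sides, subtract 1, and take the natural log.

```latex
\begin{aligned}
\frac{1}{p} &= 1 + e^{-(\beta_0 + \beta_1 x)} \\
\frac{1 - p}{p} &= e^{-(\beta_0 + \beta_1 x)} \\
\ln\!\left(\frac{p}{1 - p}\right) &= \beta_0 + \beta_1 x
\end{aligned}
```

Read this way, β1 is the change in log-odds for a one-unit increase in x.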
The log-odds is simply the natural log of the odds, p/(1-p), which crop up in everyday conversations:
“Yo, what do you think are the odds that Tyrion Lannister dies in this season of Game of Thrones?”
“Hmm. It’s definitely 2x more likely to happen than not. 2-to-1 odds. Sure, he might seem too important to be killed, but we all saw what they did to Ned Stark…”
Log-odds might be slightly unintuitive but it’s worth understanding since it will come up again when you’re interpreting the output of neural networks performing classification tasks.
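To make the conversion between odds, probability, and log-odds less abstract, here is a small sketch using the 2-to-1 odds from the conversation above. The function names are just illustrative.

```python
import math


def odds_to_probability(odds):
    """Odds of 2 (i.e. 2-to-1) means p / (1 - p) = 2, so p = 2 / 3."""
    return odds / (1.0 + odds)


def logit(p):
    """Log-odds: the natural log of p / (1 - p)."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of the logit: turns log-odds back into a probability."""
    return 1.0 / (1.0 + math.exp(-z))


p = odds_to_probability(2.0)   # probability implied by 2-to-1 odds
log_odds = logit(p)            # equals ln(2)
recovered = sigmoid(log_odds)  # round trip back to p

print(f"p = {p:.4f}, log-odds = {log_odds:.4f}, back to p = {recovered:.4f}")
```

The round trip is the useful part: the sigmoid and the logit undo each other, which is why a model that is linear in log-odds can still report probabilities.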
Using the output of a logistic regression model to make decisions
The output of the logistic regression model from above looks like an S-curve showing P(Y=1) based on the value of X:
To predict the Y label — spam/not spam, cancer/not cancer, fraud/not fraud, etc. — you have to set a probability cutoff, or threshold, for a positive result. For example: “If our model thinks the probability of this email being spam is higher than 70%, label it spam. Otherwise, don’t.”
The threshold depends on your tolerance for false positives vs. false negatives. If you’re diagnosing cancer, you’d have a very low tolerance for false negatives, because even if there’s a very small chance the patient has cancer, you’d want to run further tests to make sure. So you’d set a very low threshold for a positive result.
In the case of fraudulent loan applications, on the other hand, the tolerance for false negatives might be higher, particularly for smaller loans, since further vetting is costly and a small loan may not be worth the additional operational costs and friction for non-fraudulent applicants who are flagged for further processing (the false positives).
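The trade-off is easy to see by counting errors at a few different thresholds. This is a sketch on made-up scores: each pair is a hypothetical predicted probability of the positive class plus the true label, and confusion_counts is an illustrative helper, not part of any library.

```python
# Hypothetical model outputs: (predicted probability of positive class, true label).
scored = [
    (0.05, 0), (0.20, 0), (0.35, 1), (0.40, 0),
    (0.60, 1), (0.65, 0), (0.80, 1), (0.95, 1),
]


def confusion_counts(scored, threshold):
    """Count false positives and false negatives at a given cutoff."""
    false_positives = sum(1 for p, y in scored if p >= threshold and y == 0)
    false_negatives = sum(1 for p, y in scored if p < threshold and y == 1)
    return false_positives, false_negatives


for threshold in (0.3, 0.5, 0.7):
    fp, fn = confusion_counts(scored, threshold)
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

On this toy data, lowering the threshold trades false negatives for false positives, and raising it does the opposite. Which direction you lean depends on which mistake is more expensive for your problem.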
Minimizing loss with logistic regression
As in the case of linear regression, we use gradient descent to learn the beta parameters that minimize loss.
In logistic regression, the cost function is basically a measure of how badly your predicted probabilities miss the true labels: confidently predicting 1 when the true answer was 0, or vice versa, is penalized the most. Below is a regularized cost function just like the one we went over for linear regression.
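One common way to write such a cost is the cross-entropy (log loss) with an L2 penalty. The notation here is an assumption made for this sketch: n training examples, k features, p_i = F(g(x_i)) as the predicted probability for example i, and λ as the regularization strength.

```latex
J(\beta) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \Big]
           + \frac{\lambda}{2n} \sum_{j=1}^{k} \beta_j^2
```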
Don’t panic when you see a long equation like this! Break it into chunks and think about what’s going on in each part conceptually. Then the specifics will start to make sense.
The first chunk is the data loss, i.e. how much discrepancy there is between the model’s predictions and reality. The second chunk is the regularization loss, i.e. how much we penalize the model for having large parameters that heavily weight certain features (remember, this prevents overfitting).
We’ll minimize this cost function with gradient descent, as above, and voilà! we’ve built a logistic regression model to make class predictions as accurately as possible.
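Here is a minimal sketch of that training loop for a single feature, assuming a cross-entropy loss with an L2 penalty on the slope only. The toy data, learning rate, penalty strength, and step count are all arbitrary choices for illustration, not tuned values.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.1, lam=0.01, steps=5000):
    """Batch gradient descent on regularized cross-entropy for p = sigmoid(b0 + b1 * x)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad0 = 0.0
        grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # prediction minus truth
            grad0 += error / n
            grad1 += error * x / n
        grad1 += lam * b1 / n  # gradient of the L2 penalty (intercept left unpenalized)
        b0 -= lr * grad0
        b1 -= lr * grad1
    return b0, b1


# Toy data (made up): larger x tends to mean label 1, with one noisy point.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

b0, b1 = train_logistic(xs, ys)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"P(Y=1 | x=0.5) = {sigmoid(b0 + b1 * 0.5):.3f}")
print(f"P(Y=1 | x=4.0) = {sigmoid(b0 + b1 * 4.0):.3f}")
```

In practice you would reach for a library implementation rather than hand-rolling the loop, but the moving parts are the same: predict, measure the error, nudge the parameters downhill, repeat.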
Support vector machines (SVMs)
“We’re in a room full of marbles again. Why are we always in a room full of marbles? I could’ve sworn we already lost them.”
SVM is the last parametric model we’ll cover. It typically solves the same problem as logistic regression — classification with two classes — and yields similar performance. It’s worth understanding because the algorithm is geometrically motivated in nature, rather than being driven by probabilistic thinking.
A few examples of the problems SVMs can solve:
- Is this an image of a cat or a dog?
- Is this review positive or negative?
- Are the dots in the 2D plane red or blue?
We’ll use the third example to illustrate how SVMs work. Problems like these are called toy problems because they’re not real — but nothing is real, so it’s fine.
In this example, we have points in a 2D space that are either red or blue, and we’d like to cleanly separate the two.
The training set is plotted in the graph above. We would like to classify new, unclassified points in this plane. To do this, SVMs use a separating line (or, in more than two dimensions, a multi-dimensional hyperplane) to split the space into a red zone and a blue zone. You can already imagine how a separating line might look in the graph above.
How, specifically, do we choose where to draw the line?
Below are two examples of such a line:
Hopefully, you share the intuition that the first line is superior. The distance to the nearest point on either side of the line is called the margin, and SVM tries to maximize the margin. You can think about it like a safety space: the bigger that space, the less likely that noisy points get misclassified.
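To make "margin" concrete, here is a sketch that measures it for two candidate lines on a made-up set of points. A line is written as w·x + b = 0, labels are +1 for blue and -1 for red, and the specific points and lines are invented for illustration.

```python
import math

# Toy 2D training points (made up): label +1 for blue, -1 for red.
points = [
    ((1.0, 1.0), -1), ((2.0, 1.5), -1), ((1.5, 2.5), -1),
    ((4.0, 4.0), +1), ((5.0, 3.5), +1), ((4.5, 5.0), +1),
]


def margin(w, b, points):
    """Smallest signed distance from any point to the line w.x + b = 0.

    A negative result means at least one point is on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(
        label * (w[0] * x + w[1] * y + b) / norm
        for (x, y), label in points
    )


# Two candidate separating lines (both split the toy data correctly).
margin_a = margin((1.0, 1.0), -6.0, points)  # the line x + y = 6
margin_b = margin((1.0, 0.0), -2.2, points)  # the vertical line x = 2.2

print(f"margin of line A: {margin_a:.3f}")
print(f"margin of line B: {margin_b:.3f}")
```

Both lines classify every training point correctly, but line B squeaks past one of the red points, so a little noise could flip it. An SVM would prefer line A.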
Based on this short explanation, a few big questions come up.
1. How does the math behind this work?
We want to find the optimal hyperplane (a line, in our 2D example). This hyperplane needs to (1) separate the data cleanly, with blue points on one side of the line and red points on the other side, and (2) maximize the margin. This is an optimization problem. The solution has to respect constraint (1) while maximizing the margin as is required in (2).
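In the usual textbook notation (an assumption here, since this section doesn't fix one), the hyperplane is w·x + b = 0, labels are y_i = +1 or -1, and the problem is commonly stated as:

```latex
\min_{w,\, b} \ \frac{1}{2} \lVert w \rVert^2
\quad \text{subject to} \quad
y_i \, (w \cdot x_i + b) \ge 1 \quad \text{for all } i
```

The constraint is requirement (1): every point sits on its own side. With that scaling, the distance from the hyperplane to the nearest point is 1/‖w‖, so making ‖w‖ small is the same as making the margin big, which is requirement (2).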
The human version of solving this problem would be to take a ruler and keep trying different lines separating all the points until you get the one that maximizes the margin.
It turns out there’s a clean mathematical way to do this maximization, but the specifics are beyond our scope. To explore it further, here’s a video lecture that shows how it works using Lagrangian Optimization.
The solution hyperplane you end up with is determined by a subset of the training points x_i, which are called the support vectors. These are typically the points closest to the hyperplane, the ones sitting on the edge of the margin.
2. What happens if you can’t separate the data cleanly?
There are two methods for dealing with this problem.
2.1. Soften the definition of “separate”.
We allow a few mistakes, meaning we allow some blue points in the red zone or some red points in the blue zone. We do that by adding a cost C for misclassified examples in our loss function. Basically, we say it’s acceptable but costly to misclassify a point.
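One standard way to express "acceptable but costly" is the soft-margin objective with a hinge penalty, sketched below. The toy points and the fixed line are made up, and soft_margin_cost is an illustrative helper rather than a library function. Note that the hinge term also charges a little for points that are on the right side but inside the margin.

```python
def soft_margin_cost(w, b, points, C):
    """0.5 * ||w||^2 plus C times the total hinge loss.

    Hinge loss is zero for points safely on the correct side
    and grows linearly the further a point strays into the wrong zone.
    """
    hinge_total = sum(
        max(0.0, 1.0 - label * (w[0] * x + w[1] * y + b))
        for (x, y), label in points
    )
    return 0.5 * (w[0] ** 2 + w[1] ** 2) + C * hinge_total


# Toy points (made up): label +1 for blue, -1 for red.
# The last red point sits deep in the blue zone of the line x + y = 6.
points = [((1.0, 1.0), -1), ((4.0, 4.0), +1), ((4.5, 4.5), -1)]
w, b = (1.0, 1.0), -6.0

cost_lenient = soft_margin_cost(w, b, points, C=1.0)
cost_strict = soft_margin_cost(w, b, points, C=10.0)
print(f"C=1:  cost = {cost_lenient}")
print(f"C=10: cost = {cost_strict}")
```

A small C shrugs off the stray point and favors a wide margin; a large C makes each mistake expensive, pushing the optimizer to bend the boundary around it.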
2.2. Throw the data into higher dimensions.
We can create nonlinear classifiers by increasing the number of dimensions, i.e. by adding new features such as x², x³, or even cos(x). Suddenly, you have boundaries that can look more squiggly when we bring them back to the lower dimensional representation.
Intuitively, this is like having red and blue marbles lying on the ground such that they can’t be cleanly separated by a line — but if you could make all the red marbles levitate off the ground in just the right way, you could draw a plane separating them. Then you let them fall back to the ground knowing where the blues stop and reds begin.
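Here is the levitation trick in miniature, as a sketch on made-up 1D data. The reds sit in the middle with blues on both sides, so no single cutoff on x works, but adding x² as a second coordinate lets a flat cutoff on the new "height" separate them. The helper names and the cutoff value are chosen by hand for illustration.

```python
# Toy 1D data (made up): reds in the middle, blues on both sides.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = ["blue", "blue", "red", "red", "red", "blue", "blue"]


def separable_in_1d(xs, labels):
    """True if a single cutoff on x splits the two colors."""
    ordered = [label for _, label in sorted(zip(xs, labels))]
    color_changes = sum(1 for a, b in zip(ordered, ordered[1:]) if a != b)
    return color_changes <= 1


def lift(x):
    """Map a 1D point into 2D by adding x squared as a 'height'."""
    return (x, x * x)


def classify(x, cutoff=2.5):
    """A horizontal line in the lifted space: blue above, red below."""
    _, height = lift(x)
    return "blue" if height > cutoff else "red"


print("separable with one cutoff on x:", separable_in_1d(xs, labels))
print("lifted predictions:", [classify(x) for x in xs])
```

Back on the original number line, that single flat cutoff in the lifted space shows up as two boundaries (around x = -1.58 and x = 1.58), which is exactly the kind of nonlinear boundary a straight line in 1D could never give you.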
In summary, SVMs are used for classification with two classes. They attempt to find a plane that separates the two classes cleanly. When this isn’t possible, we either soften the definition of “separate,” or we throw the data into higher dimensions so that we can cleanly separate the data.
In this section we covered:
- The classification task of supervised learning
- Two foundational classification methods: logistic regression and support vector machines (SVMs)
- Recurring concepts: the sigmoid function, log-odds (“logit”), and false positives vs. false negatives
In Part 2.3: Supervised Learning III, we’ll go into non-parametric supervised learning, where the ideas behind the algorithms are very intuitive and performance is excellent for certain kinds of problems, but the models can be harder to interpret.
Practice materials & further reading
2.2a — Logistic regression
Data School has an excellent in-depth guide to logistic regression. We’ll also continue to refer you to An Introduction to Statistical Learning. See Chapter 4 on logistic regression, and Chapter 9 on support vector machines.
To implement logistic regression, we recommend working on this problem set. You have to register on the site to work through it, unfortunately. C’est la vie.
2.2b—Down the SVM rabbit hole