In-Depth Machine Learning for Teens: Logistic Regression

Endothermic Dragon
8 min read · Aug 21, 2022


This article is one of the shorter ones in the series, but it discusses a few important concepts that will be used extensively in neural networks.

Logistic regression is a lot like regular linear regression, but it serves a different purpose. Instead of predicting a continuous value, it makes a classification decision, most often differentiating between two groups. It builds directly on linear regression, with modifications to the prediction function and the cost function (and, consequently, the cost function's gradients). As usual, the required steps and formulas for the gradients will be given in the lab, but it's recommended that you derive them yourself if you can.

What logistic regression looks like in 2D. Image source

Note From Author

As of now, this article is part of a series with 5 others. It is recommended that you read them in order, but feel free to skip any if you already know the material. Keep in mind that material from the previous articles may (and most probably will) be referenced at multiple points in this article — especially gradient descent, the keystone of all machine learning.


Survey

If you wouldn’t mind, please fill out this short survey before reading this article! It is optionally anonymous and would help me a lot, especially when improving the quality of these articles.

Logistic Regression Survey

Sigmoid Function

The sigmoid function maps all values to a range between 0 and 1. Specifically, the more negative the input, the closer to 0 its output. In contrast, the more positive the input, the closer to 1 its output. For an input of 0, it will return exactly 0.5.
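Concretely, the standard definition of the sigmoid (logistic) function is:

```latex
\sigma(x) = \frac{1}{1 + e^{-x}}
```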

As the graph shows, for large values of x, this function gets closer and closer to 1. Similarly, for very negative values of x, it gets closer and closer to 0. It's important to note that this function never actually "touches" 0 or 1, but gets closer and closer in either direction as it continues towards infinity or negative infinity. When a function approaches a value like this without ever reaching it, the line it approaches is called an "asymptote".
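A quick sketch of this behavior, using NumPy:

```python
import numpy as np

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1 / (1 + np.exp(-x))

print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
```

Notice that even for inputs like 10 or -10, the output only approaches 1 or 0 without reaching them, mirroring the asymptotes on the graph.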

Using this, we simply redefine our prediction function by tacking this onto the linear regression equation, as such:
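With σ denoting the sigmoid, the prediction function becomes the sigmoid wrapped around the familiar linear model:

```latex
h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
```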

Now, our prediction function is capable of classifying "positive" and "negative" values and drawing boundaries. As our inner linear model outputs larger positive values, we get a "positive" prediction close to 1 with a high level of confidence, such as 0.87 or 0.94. Conversely, as our linear model outputs larger negative values, we get a "negative" prediction close to 0 with high confidence, such as 0.13 or 0.07 (note that "negative" refers to the 0-or-1 classification, not the sign of the prediction itself).
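Here's a minimal sketch of how such a prediction might be made, assuming a hypothetical weight vector `theta` whose first entry multiplies a constant bias feature of 1 (the values are made up for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def predict(theta, x):
    """Prediction = sigmoid applied to the inner linear model theta . x."""
    return sigmoid(np.dot(theta, x))

# Hypothetical, made-up parameters for illustration
theta = np.array([0.5, -1.2, 2.0])
x = np.array([1.0, 0.3, 1.5])  # first entry is the constant bias feature

confidence = predict(theta, x)
label = 1 if confidence >= 0.5 else 0
```

Thresholding the confidence at 0.5 converts the soft output into a hard 0-or-1 classification.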

Loss Function

Unlike before, using a regular MSE will not work effectively, as our outputs are restricted specifically between 0 and 1. First of all, we want to "punish" our model much more than we would regularly if it predicts close to 1 when the actual output is 0 (or vice versa). This is because if we backtrack by "unwrapping" our sigmoid function, then an opposite prediction essentially means that our inner linear model is outputting a large positive value when in reality it's supposed to be outputting a large negative one (or vice versa). Thus, we must "exaggerate" the loss accordingly. To do this, we use logs:
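Writing σ(θᵀx) for the model's output, the two log losses take the standard form:

```latex
\text{loss} = -\log\big(1 - \sigma(\theta^T x)\big) \quad \text{(when the true label is 0)}
\qquad
\text{loss} = -\log\big(\sigma(\theta^T x)\big) \quad \text{(when the true label is 1)}
```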

Keep reading to see why these are used!

The first function is used if the output is supposed to be 0. As the graph shows, the model is punished a lot for values closer to 1, and less for values closer to 0. The second function has the same pattern except flipped, to punish the model for being closer to 0 when it’s supposed to be 1.

As a side note, it's important to realize that σ will never output 0 or 1 exactly: due to its asymptotes, an output of 1 would require an infinitely large positive input, and an output of 0 an infinitely large negative input. Consequently, while our log function rises to infinity for the exactly opposite outputs, the model's output never actually reaches those extremes, so in theory we shouldn't encounter mathematical or programming errors involving infinity. (In practice, floating-point rounding can still produce exact 0s and 1s, so implementations often clip predictions slightly away from the extremes.)

We can combine these two forms and take the average, to get our cost function (with a caveat, that comes up in the next section):
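In standard notation, with m training examples and predictions σ(θᵀxᵢ), the combined average cost is:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y_i \log\big(\sigma(\theta^T x_i)\big) + (1 - y_i) \log\big(1 - \sigma(\theta^T x_i)\big) \Big]
```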

Note that the negative has been moved from the logs to the outside — the two forms are equivalent.

Here, the value of yᵢ acts like a “switch” — when it’s 1, the first part is activated. If it’s 0, then the second part is activated.
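A small sketch of this cost, on a tiny made-up dataset, shows the "switch" in action: a confidently correct model scores a near-zero loss, while a confidently wrong one is punished heavily.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(theta, X, y):
    """Average loss; each y_i switches between the two log terms."""
    preds = sigmoid(X @ theta)
    return -np.mean(y * np.log(preds) + (1 - y) * np.log(1 - preds))

# Tiny illustrative dataset (first column is the constant bias feature)
X = np.array([[1.0,  2.0],
              [1.0, -2.0]])
y = np.array([1.0, 0.0])

good_theta = np.array([0.0,  3.0])  # confident and correct
bad_theta  = np.array([0.0, -3.0])  # confident and wrong

print(cost(good_theta, X, y))  # small
print(cost(bad_theta, X, y))   # large
```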

Note that we will have to also modify the gradients in accordance with our cost function and modified prediction function. Once again, if you don’t have a background in calculus, these will be provided to you. If you do have a background in calculus, then you’ll probably find it helpful to know that the derivative of σ(x) is σ(x) ⋅ (1-σ(x)).
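If you want to convince yourself of that derivative without doing the calculus, a quick numerical check (comparing a central finite difference against the claimed formula) works:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Check numerically that d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
x = 0.7
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
analytic = sigmoid(x) * (1 - sigmoid(x))

print(abs(numeric - analytic))  # tiny difference, limited only by precision
```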

Bias Term

Once again, we have a bias term in our inner linear equation, but this time it represents something slightly different. You can think of the other parameters as pushing the inner output away from a central value: if the output is pushed positive, the prediction should be close to 1; if it's pushed negative, close to 0. The bias term shifts where that "center" sits before the sigmoid is applied. Without it, the decision boundary would be forced to pass through the origin. The model would technically still be able to train, but it wouldn't be as accurate or robust as it could be.

Regularization

So far, we're missing one thing in the cost function: a regularization term! A regularization term is added to prevent the values of θ from getting too big. Practically, it discourages overfitting and keeps the model's confidence from being extreme on every example. It also ensures that training converges to a minimum, as without it, the values of θ could continue to increase for as long as the training process runs.

In the lab, we will use something called L2 regularization. Essentially, you simply square all the θ’s except the bias θ, add them up, and multiply by a constant lambda (λ).

There is also L1 regularization, which is where you do the same thing, except instead of squaring, you take the absolute value.
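Both penalties are short to write down. A sketch, using index 0 for the bias θ (which is conventionally left unregularized, as described above):

```python
import numpy as np

def l2_penalty(theta, lam):
    """Square every theta except the bias (index 0), sum, multiply by lambda."""
    return lam * np.sum(theta[1:] ** 2)

def l1_penalty(theta, lam):
    """Same idea, but with absolute values instead of squares."""
    return lam * np.sum(np.abs(theta[1:]))

theta = np.array([5.0, 2.0, -3.0])   # theta[0] is the bias term
print(l2_penalty(theta, 0.1))        # 0.1 * (2^2 + 3^2)
print(l1_penalty(theta, 0.1))        # 0.1 * (|2| + |-3|)
```

Either penalty is simply added onto the cost function before taking gradients.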

If you’re wondering how we can choose a good value for λ, then let me tell you right away — there isn’t any hard and fast rule. It’s a process involving intuition and trial and error. Oftentimes, it’s useful to run a few iterations and cross-check the progress using the validation dataset.

Preprocessing Data

Because the classifier's output is squashed by the sigmoid, a skewed dataset typically doesn't require heavy preprocessing. However, standardizing each feature with a Z-score is still a good idea: it speeds up training by making each θ comparably sensitive to its input feature.
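Taking a Z-score is just subtracting each column's mean and dividing by its standard deviation. A sketch, with made-up features on very different scales:

```python
import numpy as np

def z_score(X):
    """Standardize each feature column to mean 0 and standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Hypothetical features on wildly different scales (e.g. age vs. income)
X = np.array([[25.0, 40_000.0],
              [35.0, 60_000.0],
              [45.0, 80_000.0]])

X_scaled = z_score(X)
```

After scaling, both columns contribute on a comparable scale, so no single θ dominates the gradient updates.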

Measuring Accuracy

There are many ways to measure the accuracy of a logistic regression model, and no single one is necessarily "better" than the others. The simplest method is to perform predictions on the validation dataset and see what percent of predictions match the true values.
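That simplest measure is a one-liner (with made-up labels for illustration):

```python
import numpy as np

# Simplest accuracy measure: the fraction of predictions matching true labels
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0])

accuracy = np.mean(y_true == y_pred)  # 4 of 5 match
print(accuracy)
```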

However, sometimes you'll be facing a skewed dataset. Say you have a dataset about cancer tumors with 500 data points, where only 5% of the data is diagnosed as cancer-positive, and you've built a logistic classifier whose accuracy you want to judge. You can do this using something called the F1 score, but first you have to draw something called the confusion matrix.

Confusion Matrix

A confusion matrix provides a half-quantitative, half-qualitative measure of whether a model is a good fit. Essentially, it lists the number of true positives, false positives, true negatives, and false negatives.
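Counting those four quantities is straightforward. A sketch, on a tiny made-up set of labels:

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for a binary classifier."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp, fp, tn, fn

y_true = np.array([1, 1, 0, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 0])

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
```

These four counts, arranged in a 2x2 grid, are exactly the confusion matrix.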

[Confusion matrix for the cancer-tumor example]

Precision and Recall

Precision measures how often your model's positive predictions are correct. It is given by (true positives) / (true positives + false positives).

Recall measures how often your model successfully identifies the actual positives. It is given by (true positives) / (true positives + false negatives).

F1 Score

This is given by the formula 2 * (precision * recall) / (precision + recall).
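Putting precision, recall, and F1 together in code (the counts below are hypothetical, just to illustrate a skewed case where few positives are caught):

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * (precision * recall) / (precision + recall)

# Hypothetical counts: model catches few positives and raises false alarms
f1 = f1_score(tp=5, fp=10, fn=20)
print(f1)
```

Because F1 is a harmonic mean, it is dragged down sharply by whichever of precision or recall is worse, which is exactly why it exposes a model that ignores the rare class.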

From the confusion matrix above, we can see that our model is clearly not very good at identifying cancerous tumors. If we simply measure what percent of its predictions are correct, we get an accuracy of 60%. That number sounds deceptively reasonable and does not match our intuition. Calculating the F1 score instead gives 4.55%, which matches our intuition much better: the model is really bad at identifying positives.

As seen in the example above, measurements of accuracy can greatly vary depending on what formula you’re using. In addition, the formula can also change based on the context of classification and the dataset.

Hands-On Lab

Parting Notes

Yay, now you can make your own classifier! However, sometimes you want to classify between more than two groups. For such an application, you can implement multiple logistic regression models, one per group, and combine their outputs (an approach often called one-vs-rest). Or, you can use a neural network! Neural networks also have other uses, which we will cover in the next article.

Done reading? Ready to learn more? Check out my other articles in this series!
Neural Networks

Or, feel free to refer back to any previous articles in this series:
Gradient Descent
Linear Regression
Training Faster and Better


Endothermic Dragon

My name is Eshaan Debnath, and I love computer science and mathematics!