Learning Machine Learning — Part 3: Logistic Regression

Ryan Gotesman
6 min read · Apr 4, 2018


This is a continuation of my Learning Machine Learning series. You can find Part 2 here.

Week 3 of Andrew Ng’s ML course covered logistic regression, a classification method. Logistic regression is useful when you have 2 or more distinct groups or classes and you want to determine which group/class a new data point falls into.

An initial question might be why we can’t use the linear regression technique we learned in the previous week to achieve this. And there are at least 2 good reasons.

Let’s imagine we have the situation on the left below, with 2 classes, 0 and 1, and a set of training examples that fall into each category. We could fit our line (black) to the data and say something like: if for any new input our model predicts a value greater than 0.5, the input belongs to class 1, and if it predicts less than 0.5, the input belongs to class 0. Fair enough. However, if we add just one more training example, as we do on the right below, we see our model changes drastically. Even though, to our eyes, it’s clear the extra blue star shouldn’t change the model, it does. This sensitivity to outliers is one reason linear regression is a poor choice for classification problems.

Another reason is that linear regression models output values that are continuous and can be far greater than 1 or far less than 0. Since our classes are discrete, consisting only of 0 and 1, it does not seem appropriate to use linear regression to solve this problem. Logistic regression to the rescue.

The true utility of logistic regression stems from the curve we use to model the data: the logistic function, also called the sigmoid.
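
In its simplest form it is written as:

g(z) = \frac{1}{1 + e^{-z}}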

If we plot this function we’ll find that as z approaches infinity, the logistic function approaches 1 and as z approaches negative infinity the function approaches 0. We can view the output of this function as the probability that the input belongs to class 1. If the probability exceeds 0.5 then we say it belongs to class 1. If it is below 0.5 we say the input belongs to class 0.

But when will the logistic function be greater than 0.5? From the graph it’s immediately obvious this occurs when the input z is greater than 0. So if we consider a parameterized version of logistic regression, where our input vector x has n features and a corresponding parameter θj for each feature xj, we can rewrite the logistic function in terms of the parameters.
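
Writing hθ(x) for the hypothesis (the predicted probability that x belongs to class 1) and g for the sigmoid above, this becomes:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}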

where we’ll predict class 1 when θᵀx exceeds 0.

If we consider the simple case where the input vector x has 2 features, x1 and x2, we can write θᵀx out in full.
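
With the usual convention that x0 = 1 for the intercept term, it expands to:

\theta^T x = \theta_0 + \theta_1 x_1 + \theta_2 x_2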

When this expression exceeds 0, h(x) > 0.5, so we say x belongs to class 1, and when it’s less than 0, h(x) < 0.5, so we say x belongs to class 0. In this particular case the expression defines a line, with one side of the line belonging to class 1 and the other to class 0. We call this line the decision boundary, though it need not be a line: it can be a curve or, in higher dimensions, a hyperplane. The point is that it demarcates the regions belonging to class 1 and class 0.
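
For instance, with illustrative (made-up) parameter values θ0 = −3, θ1 = 1, θ2 = 1, we predict class 1 whenever

-3 + x_1 + x_2 \ge 0

so the decision boundary is the straight line x_1 + x_2 = 3, with class 1 on one side and class 0 on the other.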

[Figure: logistic regression decision boundary]

Now that we have the general form of our hypothesis we need to figure out a way to optimize the parameters based on our training data. Once again, we shall achieve this using the idea of a cost function. The cost function for logistic regression is more complicated than for linear regression so we’ll work up to it in two parts.

We’ll begin by defining an initial cost for a single training example.
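
In the course’s notation, this per-example cost is:

\mathrm{Cost}\big(h_\theta(x), y\big) = -\log\big(h_\theta(x)\big) \quad \text{if } y = 1
\mathrm{Cost}\big(h_\theta(x), y\big) = -\log\big(1 - h_\theta(x)\big) \quad \text{if } y = 0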

We can see that the cost varies based on the class the training input belongs to. If it belongs to class 1, the cost takes the form of -log(h(x)) (blue curve below). We can see why this makes sense since if our hypothesis outputs 1 when y=1, this is correct and we want the cost to be 0. As well, if the hypothesis outputs 0 when y=1, this is incorrect and we want the cost to be very high. If the input belongs to class 0 the cost takes the form of -log(1-h(x)) (red curve) and we can see why this is appropriate by similar logic.

We can take this cost and use it to write the full cost function J in a single line.
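
With m training examples (x^{(i)}, y^{(i)}), that single line is:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big]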

This is really the same thing since if y=0 the first term of the sum disappears and if y=1 the second term of the sum vanishes.

We want to find the parameters that minimize the cost function, and once again we can do so through gradient descent. When we take the partial derivative of the cost function and plug it into our gradient descent formula, we get a familiar-looking algorithm.
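
For each parameter θj (with every θj updated simultaneously), the update is:

\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m} \Big( h_\theta\big(x^{(i)}\big) - y^{(i)} \Big) x_j^{(i)}

where α is the learning rate and the superscript (i) indexes the training examples.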

You’ll note the summation term has exactly the same form as the one you get when deriving gradient descent for linear regression. The algorithm is nonetheless different, because here the hypothesis is a logistic function, not a linear one.
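
To make this concrete, here is a minimal NumPy sketch of batch gradient descent for logistic regression. It assumes X is an m × (n+1) matrix whose first column is all ones (the intercept feature) and y is an array of 0/1 labels; the function names, toy data and hyperparameter values are my own choices for illustration, not anything prescribed by the course.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    # X: (m, n+1) feature matrix with a leading column of ones
    # y: (m,) array of 0/1 labels
    m, num_params = X.shape
    theta = np.zeros(num_params)
    for _ in range(iterations):
        h = sigmoid(X @ theta)            # predicted probabilities, shape (m,)
        gradient = (X.T @ (h - y)) / m    # same summation form as linear regression
        theta -= alpha * gradient         # simultaneous update of every theta_j
    return theta

# Tiny made-up example: one feature plus an intercept column of ones
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
predictions = (sigmoid(X @ theta) >= 0.5).astype(int)  # classifies this toy set as [0, 0, 1, 1]
```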

The tools we have developed for solving binary classification problems can also be applied to problems where we have 3 or more groups. We consider each group in the training set in turn, give it a “value” of 1, give all the other classes a “value” of 0, and then run our regular logistic regression to get a decision boundary. If we have k different classes, this one-vs-all method will yield k different logistic regression models. When we plug a new input x into each model, it outputs the probability that x belongs to its class, and by taking the class with the maximum probability we achieve classification.
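
As a rough sketch of that one-vs-all idea, reusing the hypothetical sigmoid and gradient_descent helpers from the snippet above (class labels are assumed to be the integers 0 … k−1):

```python
import numpy as np

def one_vs_all(X, y, num_classes, alpha=0.1, iterations=5000):
    # Train one binary logistic regression model per class:
    # examples of class c get label 1, all other examples get label 0.
    models = []
    for c in range(num_classes):
        binary_y = (y == c).astype(float)
        models.append(gradient_descent(X, binary_y, alpha, iterations))
    return models  # one theta vector per class

def predict_class(X, models):
    # Each model outputs the probability of its own class; pick the largest.
    probs = np.column_stack([sigmoid(X @ theta) for theta in models])
    return np.argmax(probs, axis=1)
```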

The course then discusses the problem of overfitting, which is when our hypothesis tries so hard to model every detail of the training set that it generalizes poorly to other data. One method to reduce overfitting is called regularization, which involves adding a penalty to the cost function for large parameter values, shrinking those parameters so that they contribute less to the model. This stops the model from becoming too “complex” and hopefully avoids overfitting.
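
For logistic regression the regularized cost function adds a penalty term of roughly this form, where λ is the regularization parameter controlling how heavily large parameter values are punished (by convention the intercept θ0 is left out of the penalty):

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\Big(1 - h_\theta\big(x^{(i)}\big)\Big) \Big] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2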

I have seen regularization pop up several times before in ML discussions and didn’t find this week’s regularization lectures that enlightening. Hopefully the topic will be covered in greater detail later on.

Overall I learned a great deal this week and am looking forward to week 4 on Neural Network representation.
