Basics and Beyond: Logistic Regression

Taking apart the logistic regression algorithm piece by piece

Kumud Lakara
Analytics Vidhya
10 min read · Dec 31, 2020


Photo by Dan Meyers

This post will walk you through logistic regression from the very basics. In order to master machine learning, it is imperative to have the basics absolutely clear. It may seem exhausting at first, but once you have the basics down, writing the code will be a cakewalk. Most posts try to cover everything in one go, which honestly becomes overwhelming at times: trying to understand complex equations, switching over to code, and then jumping back to the mathematics makes it difficult to keep up. This series aims to guide you through machine learning with a slightly different approach. We will first take apart the algorithm and understand it in detail, and then move on to implementation in a following post.

This post follows the post in the same series on linear regression, so I would recommend going through that as well, because some concepts introduced and explained there are used directly here.

Alright, let's get started!

Logistic Regression comes under the category of supervised machine learning algorithms. Supervised learning broadly covers two types of problems:

  1. Regression problems
  2. Classification problems (our focus in this post)

Logistic regression comes under the category of classification problems. But what exactly are these “classification” problems? In simple words, classification problems try to predict results in a discrete output: they try to map variables into discrete categories. It also helps to remember that when the target variable we are trying to predict is discrete (e.g., in the mathematical sense, {4, 11} is a discrete set whereas [4, 11] is a continuous interval), the problem is a classification problem.

Some examples of classification problems are :

  • Predicting whether a user will buy a product or not, given the product details
  • Predicting whether a passenger will survive or not, given the Titanic dataset
  • Classifying emails as spam or not
  • Classifying a tumor as malignant or benign

The variable we try to predict in these problems is y:

y = 0: absence of something (e.g. the user won't buy, the passenger won't survive, the tumor is benign, etc.)

y = 1: presence of something (e.g. the user will buy, the passenger will survive, the tumor is malignant, etc.)

In all the above examples we can see that one thing is common: the output space. The output is not a continuous value; it is usually a discrete set (think of it like a yes/no question, and take a look at the examples above once again). Hence, our output space is discrete.

Example output for logistic regression

The other type of supervised machine learning problem, i.e. regression problems, is solved using linear regression.

In any supervised learning problem, our goal is simple:

“Given a training set, we want to learn a function h: X → Y so that h(x) is a good prediction for the corresponding value of y.”

Here h(x) is called the hypothesis function; it is the function our learning algorithm (logistic regression in this case) tries to learn in order to make predictions.

Machine learning flow diagram: the hypothesis h maps x to y

The major difference between linear regression and logistic regression is the hypothesis function h(x). Let's start off with binary classification; we can easily extend this view to multi-class classification later.

The hypothesis function

We want our classifier to output values between 0 and 1, so we need a special hypothesis function that maps its input to this range. There are many such functions available, but the sigmoid function works particularly well for logistic regression. So let's take a look at the hypothesis equation for logistic regression:

h(x) = g(θ'x), where g(z) = 1 / (1 + e^(−z))

Equation for the hypothesis and the sigmoid function

g(z) is the sigmoid function and in our case z = θ’x where θ’ is the transpose of θ. The hypothesis function now looks like this:

h(x) = 1 / (1 + e^(−θ'x))

Hypothesis function for logistic regression
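To make this concrete, here is a minimal NumPy sketch of the sigmoid and the hypothesis (the function names and example numbers are my own, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)): squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h(x) = g(theta' x)
    return sigmoid(np.dot(theta, x))

theta = np.array([-1.0, 0.5])  # example parameter vector (made-up values)
x = np.array([1.0, 4.0])       # feature vector, with x0 = 1 as the bias term
print(hypothesis(theta, x))    # ~0.73, later interpreted as P(y = 1 | x)
```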

The sigmoid function is also referred to as the logistic function. The graph of the sigmoid function looks like this:

Graph of the sigmoid function

From the graph we can see that the sigmoid function does exactly what we want for our classifier: it takes all real values along the x-axis and maps them to values between 0 and 1 on the y-axis.

But wasn’t our output supposed to be discrete?

Well yes, the output of our “classifier” is still going to be discrete. Here we are talking about the output of our hypothesis function h(x). You have probably figured out by now that this means that the exact raw output of our hypothesis function is not really the output of our classifier. So now let’s interpret this hypothesis function and see how it can be of help.

The hypothesis function h(x) is not the final output of our classifier; in fact, it is the probability that y = 1 given input x. For example, if we consider an email being classified as spam or not, then h(x) = 0.6 means there is a 60% chance of the email being spam.

The hypothesis can therefore also be represented as

h(x) = P(y=1 | x ; θ)

In the above equation, the right-hand side is the probability that y = 1 given x, parameterized by θ. Remember that θ here is the parameter vector in the hypothesis. Since we are dealing with binary classification, either y = 0 or y = 1, hence:

P(y=0 | x ; θ) = 1 − P(y=1 | x ; θ)

The hypothesis function actually creates a line or curve that separates the region where y = 0 from the region where y = 1. This line or curve is what we call the decision boundary. Now let's finally get to the part where we find our output from the hypothesis.

One way of doing so is by predicting y = 1 when the output of h(x) is greater than or equal to 0.5, and predicting y = 0 when h(x) is less than 0.5. Now, let's take a look at the sigmoid function once again:

Sigmoid function

So far we have arrived at this conclusion:

y = 1 if h(x) ≥ 0.5 → the area marked yellow in the above graph

y = 0 if h(x) < 0.5 → the area marked pink in the above graph

In the above equations we can switch out h(x) with g(z), keeping in mind our initial discussion regarding the hypothesis and the sigmoid function. From the graph it can also be seen that:

g(z) ≥ 0.5 when z ≥ 0 (both marked in yellow), and

g(z) < 0.5 when z < 0 (both marked in pink).

Alright, so that’s all about the hypothesis function. Now we have a hypothesis to which we can pass the input x and get the binary output of our classifier.
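Here is a small, illustrative sketch of that thresholding rule in NumPy (the predict helper and the example values are made up for this post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    # y = 1 when h(x) >= 0.5, y = 0 when h(x) < 0.5
    # (equivalently, y = 1 exactly when z = theta'x >= 0)
    probabilities = sigmoid(X @ theta)
    return (probabilities >= threshold).astype(int)

theta = np.array([-1.0, 0.5])
X = np.array([[1.0, 4.0],   # z =  1.0 -> h(x) ~ 0.73 -> predict 1
              [1.0, 1.0]])  # z = -0.5 -> h(x) ~ 0.38 -> predict 0
print(predict(theta, X))    # [1 0]
```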

Now that we have a hypothesis, we need to evaluate how well it works. For this we calculate the "cost" of our predictions, which is a measure of how close our predictions are to the real values. This is where the cost function comes into the picture.

Cost Function

The cost function essentially finds the “cost” we want our model to incur if our output is h(x) and the actual output was supposed to be y. Therefore, it becomes quite intuitive that the cost should in fact be proportional to the difference between h(x) and y.

Well, we can't simply reuse the cost function from linear regression, because with the logistic hypothesis plugged in, that cost function becomes "wavy" and ends up with many local optima.

The cost function for linear regression is as shown:

J(θ) = (1/2m) Σᵢ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ )²

Squared-error cost function used for linear regression (summed over the m training examples)

We can't use this cost function for logistic regression, the reason being the difference in the hypotheses of linear regression and logistic regression. The hypothesis for logistic regression involves a sigmoid function and is hence a non-linear function. If we were to take this non-linear h(x) and plug it into the above equation for J(θ), we would get a non-convex function. This is a problem because a complicated non-convex function has many local optima, which makes gradient descent much harder. For this reason we want a convex cost function, so that gradient descent can find the global minimum.

3D plot of a non-convex cost surface for gradient descent
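To see this for yourself, here is a rough, optional sketch (my own toy example, assuming NumPy and matplotlib are available) that plots both costs over a single parameter θ. The squared-error curve flattens into plateaus and is not convex, while the logistic cost we define in the next section stays bowl-shaped:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Toy 1-D binary data: larger x tends to mean y = 1
x = rng.normal(size=200)
y = (x + 0.5 * rng.normal(size=200) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

thetas = np.linspace(-10, 10, 400)
squared_cost, logistic_cost = [], []
eps = 1e-12  # avoid log(0)
for t in thetas:
    h = sigmoid(t * x)
    squared_cost.append(np.mean((h - y) ** 2) / 2)
    logistic_cost.append(-np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)))

plt.plot(thetas, squared_cost, label="squared error with sigmoid (non-convex)")
plt.plot(thetas, logistic_cost, label="logistic cost (convex)")
plt.xlabel("theta")
plt.ylabel("J(theta)")
plt.legend()
plt.show()
```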

Alright, now let's see how we can achieve a convex cost function with our h(x) for logistic regression. Let's start by replacing the squared error term in the previous equation with:

Cost(h(x), y)

So now our cost function looks something like this:

J(θ) = (1/m) Σᵢ Cost( h(x⁽ⁱ⁾), y⁽ⁱ⁾ )

Now let's define this Cost:

Cost(h(x), y) = −log(h(x)) if y = 1
Cost(h(x), y) = −log(1 − h(x)) if y = 0

This is actually our logistic regression cost function. Don’t worry if this doesn’t make intuitive sense yet. A graphical representation should be of help here:

Plot for Cost(h(x), y) when y = 0
Plot for Cost(h(x), y) when y = 1

From the above graphs we can develop a mathematical understanding of the cost function as follows:

Cost(h(x), y) = 0 if h(x) = y

Cost(h(x), y) → ∞ if y = 0 and h(x) → 1

Cost(h(x), y) → ∞ if y = 1 and h(x) → 0

So this means that if the correct answer is supposed to be y = 0, then our cost will be 0 if our hypothesis also outputs 0, but if our hypothesis approaches 1 then our cost approaches infinity (∞). Similarly, if the correct answer is y = 1, then our cost will be 0 if our hypothesis outputs 1, and it will approach infinity (∞) as h(x) approaches 0. This is exactly what we want our cost function to do: we want the model to incur 0 cost if it predicts correctly and maximum cost if it predicts the exact opposite of the correct label (y). Defining the cost this way also guarantees that J(θ) is convex for logistic regression.

Now let's simplify our Cost function using the logic defined above, like so:

Cost(h(x), y) = −y · log(h(x)) − (1 − y) · log(1 − h(x))

You can verify this expression by plugging in y = 0 and evaluating it, then plugging in y = 1 and evaluating it again. You will find you get the same equations for the cost as we defined earlier.
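If you'd rather check this numerically, here is a tiny sketch (the helper names are mine, purely for illustration) that plugs a few values of h(x) into both forms for y = 0 and y = 1 and confirms they agree:

```python
import numpy as np

def cost_piecewise(h, y):
    # Cost defined case by case: -log(h(x)) if y = 1, -log(1 - h(x)) if y = 0
    return -np.log(h) if y == 1 else -np.log(1 - h)

def cost_combined(h, y):
    # The single-line form: -y*log(h(x)) - (1 - y)*log(1 - h(x))
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

for y in (0, 1):
    for h in (0.1, 0.5, 0.9):
        assert np.isclose(cost_piecewise(h, y), cost_combined(h, y))
        print(f"y={y}, h(x)={h}: cost={cost_combined(h, y):.3f}")
# Note how the cost is small when h(x) is close to y and blows up as h(x)
# approaches the opposite label.
```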

Now the entire cost function becomes:

J(θ) = −(1/m) Σᵢ [ y⁽ⁱ⁾ · log(h(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) · log(1 − h(x⁽ⁱ⁾)) ]

Alright, so now we have our cost function. The next step of course is optimization of our parameters and for this we make use of gradient descent.

Gradient Descent

Gradient descent is a pretty vast topic in itself. For a detailed insight into gradient descent, be sure to check out the post in the same series that walks you through it from the very basics.

For our purpose here we will skip right to the algorithm:

Repeat until convergence: θⱼ := θⱼ − α · (1/m) · Σᵢ ( h(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾ (updating every θⱼ simultaneously)

Gradient descent for logistic regression

Here we can see that this algorithm looks exactly the same as the one we saw for linear regression. The only difference is the definition of the hypothesis function h(x), which is the sigmoid-based hypothesis in the case of logistic regression. The above equation is the main "update" step of gradient descent, where we repeatedly adjust our parameters in the direction that reduces the cost. α here is the learning rate.
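Putting the pieces together, here is a compact, vectorized sketch of this update rule in NumPy (the toy data, learning rate, and iteration count are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, num_iters=1000):
    # Repeat: theta_j := theta_j - alpha * (1/m) * sum((h(x) - y) * x_j)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)           # predictions for all m examples
        gradient = (X.T @ (h - y)) / m   # vectorized partial derivatives
        theta -= alpha * gradient        # simultaneous update of every theta_j
    return theta

# Tiny toy problem: y = 1 roughly when x1 > 2 (first column is the bias feature x0 = 1)
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(theta)                              # learned parameters
print(sigmoid(X @ theta) >= 0.5)          # should print [False False  True  True]
```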

Well, that's about it. That's all there is to logistic regression. So far we have talked about binary classification for the sake of easy understanding. However, our approach can easily be extended to multi-class classification problems as well. Let's take a quick look.

Multi-Class Classification: One-vs-All

We use this approach when we have… well, “multiple” classes (more than two). So our y is no longer just 0 or 1; y can take other discrete values as well. Remember that although y = {0, 1, …, n}, y is still discrete.

The approach here is actually quite simple. We divide the problem into multiple (n + 1, to be precise) binary classification problems. In each one we predict the probability that y is a member of one particular class. So basically, we choose one class and lump all the other classes together into a single second class. This is now a binary classification problem, and we know how to work our way through those. The output of this mini binary classification problem gives us the probability of the example belonging to the class we singled out, the one that wasn't lumped together with the others. We repeat this for each class and then use the hypothesis that returned the highest value as our prediction, because that hypothesis shows the maximum probability of our prediction being right. Ultimately, we just pick the class that maximizes h(x), i.e. the probability that y = 1 for that class. A short code sketch follows after the figure below.

Binary classification vs multi-class classification
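Here is a rough one-vs-all sketch built on top of the binary classifier from before (all function names, data, and defaults here are my own, just to illustrate the idea):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, num_iters=2000):
    # Plain binary logistic regression trained with the update rule from above
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        theta -= alpha * (X.T @ (h - y)) / X.shape[0]
    return theta

def train_one_vs_all(X, y, num_classes):
    # One classifier per class: class c becomes "1", every other class becomes "0"
    return np.array([train_binary(X, (y == c).astype(float))
                     for c in range(num_classes)])

def predict_one_vs_all(all_theta, X):
    # For each example, pick the class whose hypothesis h(x) is highest
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Toy usage: three well-separated classes in two features (x0 = 1 is the bias)
X = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 1.0],   # class 0
              [1.0, 6.0, 0.0], [1.0, 7.0, 1.0],   # class 1
              [1.0, 0.0, 6.0], [1.0, 1.0, 7.0]])  # class 2
y = np.array([0, 0, 1, 1, 2, 2])
all_theta = train_one_vs_all(X, y, num_classes=3)
print(predict_one_vs_all(all_theta, X))  # should recover [0 0 1 1 2 2]
```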

That’s it!

Great work! You now know what logistic regression is and how it works. Don't worry if you didn't understand all the mathematics the first time. It can seem a bit daunting at first, but the secret to mastering it is to keep going over the basics until they are absolutely clear. You should now be able to make sense of any code that implements logistic regression. Until the implementation post in this series, I would recommend you look at some implementations of logistic regression and try to take the code apart to understand how everything works.
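As a starting point, here is a minimal scikit-learn example on its built-in breast-cancer dataset (the malignant/benign case from earlier); the dataset choice and the max_iter value are just illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary classification: malignant vs benign tumours
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)  # higher max_iter so the solver converges
clf.fit(X_train, y_train)

print(clf.score(X_test, y_test))         # accuracy on held-out data
print(clf.predict_proba(X_test[:3]))     # h(x): P(y = 0 | x) and P(y = 1 | x) per example
```

Try comparing its predictions with the from-scratch gradient descent sketch above; under the hood it fits essentially the same hypothesis and cost we just derived (scikit-learn adds some regularization by default).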
