Logistic Regression (now with the scary maths behind it!)

Ishan Shishodiya
Published in ml-concepts.com · 8 min read · May 9, 2022

Logistic Regression is a type of linear model that’s mostly used for binary classification but can also be used for multi-class classification. If the term linear model sounds familiar, that might be because Linear Regression is also a type of linear model.

To proceed with this article, you first have to make sure that you understand ML concepts like Linear Regression, Cost Functions, and Gradient Descent, and mathematical concepts like logarithms and matrices. If you don’t, then the links below can help you out.

  1. Linear Regression
  2. Gradient Descent

If you would like to follow the topic with interactive code, I have made a Kaggle notebook for this exact purpose. Click here to try it out for yourself!

To understand Logistic Regression properly, you have to first understand a few other concepts. Think of Logistic Regression like an onion. In the same way that you have to peel through multiple layers to reach the sweet, juicy middle of an onion, you have to work through a few concepts before you can understand Logistic Regression from scratch!

(When did onions have a sweet juicy middle part?)

(I don’t know…he probably meant…some fruit?)

Onions aside, let’s first learn about the Decision Boundary.

Decision Boundary

In the simplest terms, a decision boundary is just a line that helps us identify which point belongs to which class. The image below can help you understand a decision boundary much more clearly.

Decision Boundary (Image 1)

Here the blue line separates the two classes which are represented as green and red dots. Any point to the left of the decision boundary belongs to the class represented with the red dots. Any point to the right belongs to the class represented with the green dots. That’s all that a decision boundary does.

It can be calculated using the equation of a straight line. The general form of the equation of a straight line is -

Equation of a straight line (Image 2)

Where

  • a is the coefficient of x,
  • b is the coefficient of y, and
  • c is some arbitrary constant.
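
Based on those definitions, the general form shown in Image 2 is presumably the standard one:

$$ax + by + c = 0$$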

Using this equation, we can write the equation of the decision boundary as -

Equation of a decision boundary (Image 3)
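
With the βs as the coefficients and x1, x2 as the features defined just below, the equation in Image 3 is presumably

$$\beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0$$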

Where,

  • x1 is the 1st feature variable.
  • x2 is the 2nd feature variable.

If we can calculate x2 values for given x1 values, then we can plot our decision boundary. This can be done by rearranging the equation.

Equation of a decision boundary (Image 4)
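
Rearranging the boundary equation for x2, and assuming β2 ≠ 0, Image 4 presumably shows

$$x_2 = -\frac{\beta_0 + \beta_1 x_1}{\beta_2}$$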

Now that we have a way to plot the decision boundary, you might think “Why don’t we use Linear Regression for this? It can help us plot a line based on β values.”

It is true that Linear Regression can help us plot a line based on some β values, but the Cost Function of Linear Regression minimizes the distance between the line of best fit and the actual points. This isn’t helpful for classifying points. For ideal classification, we would need the probability of something belonging to a certain class, and we would assign that item a class only if the probability is above a certain threshold. From this, you can infer two things,

  1. For Logistic Regression, we’ll need a way to get the values in terms of probabilities.
  2. For Logistic Regression, we would need a new Cost Function.

Sigmoid Function

The sigmoid function looks something like this

Sigmoid Function (Image 5)
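
For reference, the standard sigmoid (logistic) function, which is presumably what Image 5 plots, is

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$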

It takes in any series of values and maps each one into the range 0 to 1, which lets us read the outputs as probabilities. Let’s take an example of this.

Suppose I have a list of numbers from -100 to 100, {num | num ∈ [-100, 100]}. If I pass this list through the sigmoid function, it would be turned into something like this.

Sigmoidal value of the series (Image 6)

The graph above gives us the probability of a number being greater than zero. If we say that each number whose sigmoidal value is greater than 0.5 is greater than 0, and each number whose sigmoidal value is less than 0.5 is less than 0, then we would have the lists of all positive and negative numbers present in our input list.
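
Here is a minimal sketch of that example in Python, assuming NumPy; the variable names are my own.

```python
import numpy as np

def sigmoid(z):
    # Standard logistic function: maps any real number into (0, 1).
    return 1 / (1 + np.exp(-z))

nums = np.arange(-100, 101)        # the series {-100, ..., 100}
probs = sigmoid(nums)              # sigmoidal value of each number

positives = nums[probs > 0.5]      # numbers whose sigmoid value exceeds 0.5
negatives = nums[probs < 0.5]      # numbers whose sigmoid value is below 0.5

print(positives[:5], negatives[-5:])  # [1 2 3 4 5] [-5 -4 -3 -2 -1]
```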

We can try to predict the class of an item using β0+β1x1+β2x2. If we plot this line on a graph it would look something like this.

Prediction line on the classes of the data (Image 7)

This line has a problem. No matter what your class names are, one of them is treated as class 1 while the other is treated as class 0, meaning our predictions should always lie in the range 0 to 1. That is something this line doesn’t do, since its output is unbounded. So to fix this, we pass it through the sigmoid function. This makes the equation look something like this,

Sigmoid Function on the equation of the prediction line (Image 8)
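
Calling the output p (the predicted probability, my notation), the equation in Image 8 is presumably

$$p = \sigma(\beta_0 + \beta_1 x_1 + \beta_2 x_2) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}}$$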

This equation can be written in terms of matrices.

Sigmoid Function on the matrix of the prediction line (Image 9)

Where -

  • B is the matrix with all the regression coefficients.
Matrix of betas (Image 10)
  • X is the matrix with all the feature values, with an added column of 1s.
Matrix of features (Image 11)
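
Under those definitions, and with x1i denoting feature 1 of item i (my indexing), the matrix form in Images 9-11 is presumably

$$p = \sigma(XB), \qquad B = \begin{bmatrix}\beta_0\\\beta_1\\\beta_2\end{bmatrix}, \qquad X = \begin{bmatrix}1 & x_{11} & x_{21}\\ \vdots & \vdots & \vdots \\ 1 & x_{1n} & x_{2n}\end{bmatrix}$$

where each row of X is one item and the sigmoid is applied element-wise.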

The sigmoid function can help us differentiate between two classes, but only when we have the equation of the ideal line to pass into the function. And how can we get the equation of the ideal line? It’s simple: by minimizing the cost function for Logistic Regression.

Cost Function

Just like Linear Regression has MSE as its cost function, Logistic Regression has a cost function of its own. So let’s derive it.

Likelihood Function

So…we know that Logistic Regression is used for binary classification, meaning the predictions can only be 0 or 1 (either the item belongs to a class, or it doesn’t). So suppose the probability of something belonging to class 1 is p; then the probability of it belonging to class 0 would be 1−p.

Probabilities of something belonging to a class (Image 12)

We can combine these two equations into something like this.

Single equation of probabilities of something belonging to a class (Image 13)
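
Since y can only be 0 or 1, the combined expression in Image 13 is presumably

$$P(y) = p^{y}(1 - p)^{1 - y}$$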

If we substitute y with 1 we get the following.

Probability of something belonging to class 1 (Image 14)

If we substitute y with 0 we get the following.

Probability of something belonging to class 0 (Image 15)

This equation is called the likelihood function, and it gives us the likelihood of one item belonging to a class. To get the likelihood of all the items in a series, we can just multiply the likelihoods of all the items.

Likelihood function for all items (Image 16)
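
For n items with true labels yi and predicted probabilities pi, the product in Image 16 is presumably

$$L = \prod_{i=1}^{n} p_i^{\,y_i}(1 - p_i)^{1 - y_i}$$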

Log-Likelihood Function

When we start applying it to a whole series, the likelihood function multiplies many probabilities together, which produces vanishingly small numbers and makes our calculations messy. So to tackle this problem we can take the log of this function, which turns the product into a sum.

Log-Likelihood function (Image 17)
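
Written out, the log-likelihood in Image 17 is presumably

$$\log L = \sum_{i=1}^{n} \big( y_i \log p_i + (1 - y_i)\log(1 - p_i) \big)$$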

This function takes in the values of pi and 1−pi, which range from 0 to 1 (it takes in probabilities).

Let’s plot a log of numbers that fall between 0 and 1.

Log of numbers between 0 and 1 (Image 18)

As you can see, the log of numbers between 0 and 1 is negative, meaning the whole log-likelihood would be negative for all inputs. So we multiply it by −1 to fix this.

And one more thing. $\sum_{i=1}^{n}\big(y_i \log p_i+(1-y_i)\log(1-p_i)\big)$ gives us the sum of all errors and not the mean. So to fix this we can divide the whole equation by n to get the mean of all errors.

Cost Function for Logistic Regression (Image 19)
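
Putting both fixes together, the cost function in Image 19 is presumably

$$J = -\frac{1}{n}\sum_{i=1}^{n} \big( y_i \log p_i + (1 - y_i)\log(1 - p_i) \big)$$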

And to avoid overfitting, let’s add penalization to the equation just the way we added it to the cost function for Ridge Regression.

Regularized Cost Function (Image 20)

The function we have here is also called the Regularized Cost Function, and it gives us the error value for a given set of βs.
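
Assuming the same kind of L2 penalty used in Ridge Regression, with a regularization parameter λ (my notation), the regularized cost in Image 20 is presumably something like

$$J = -\frac{1}{n}\sum_{i=1}^{n} \big( y_i \log p_i + (1 - y_i)\log(1 - p_i) \big) + \frac{\lambda}{2n}\sum_{j}\beta_j^{2}$$

where the intercept β0 is usually left out of the penalty.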

Gradient Descent

Now that we have our Cost Function, all we need to do is find its minimum to get the best predictions. And we can do this by applying partial differentiation to the function.

According to the Convergence Theorem, the ideal β value can be calculated using the equation below.

Convergence Theorem (Image 21)
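
The update rule in Image 21 is presumably the usual gradient descent step, repeated until convergence, with a learning rate α (my notation):

$$\beta_n := \beta_n - \alpha \frac{\partial J}{\partial \beta_n}$$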

All we need to do is find the value of ∂J/∂βn for each β and we are good to go.

We know the Cost Function so we can get the value of ∂J/∂β0 by applying partial differentiation to it.

Derivation of ∂J/∂β0 (Image 22)

We know that

Derivation of ∂J/∂β0 (Image 23)

On adding −1 and 1 to the above equation, we get

Derivation of ∂J/∂β0 (Image 24)
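
Writing zi = β0 + β1x1i + β2x2i (my shorthand), these two steps presumably amount to

$$\frac{\partial p_i}{\partial \beta_0} = \frac{e^{-z_i}}{(1 + e^{-z_i})^2} = \frac{(1 + e^{-z_i}) - 1}{(1 + e^{-z_i})^2} = p_i - p_i^2 = p_i(1 - p_i)$$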

On substituting ∂pi/∂β0 into the derivative of the cost function with respect to β0, we get

Derivation of ∂J/∂β0 (Image 25)
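
Substituting pi(1 − pi) back in and simplifying, the result in Image 25 is presumably

$$\frac{\partial J}{\partial \beta_0} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)$$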

Similarly, if you differentiate J with respect to β1, you will get

Value of ∂J/∂β1 (Image 26)

In general, for βn you will get

Value of ∂J/∂βn (Image 27)
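
With x0i = 1 for the intercept and xni denoting feature n of item i (my indexing), the general form in Image 27 is presumably

$$\frac{\partial J}{\partial \beta_n} = \frac{1}{n}\sum_{i=1}^{n}(p_i - y_i)\,x_{ni}$$

plus a λβn/n term for whichever coefficients are regularized.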

Now that we have the Cost Function and a way to implement Gradient Descent on it, all we need to do is run a loop for some number of iterations to get the best values of all the βs for our classification problems.
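
Here is a minimal NumPy sketch of that loop. The function name, learning rate, iteration count, and small λ penalty are my own choices and follow the regularized cost and gradients above, not code from the original notebook.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000, lam=0.0):
    # X: (n, k) feature matrix, y: (n,) array of 0/1 labels.
    n, k = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # add the column of 1s for the intercept
    betas = np.zeros(k + 1)

    for _ in range(n_iters):
        p = sigmoid(Xb @ betas)             # predicted probabilities
        grad = Xb.T @ (p - y) / n           # (1/n) * sum((p_i - y_i) * x_ni)
        grad[1:] += lam * betas[1:] / n     # L2 penalty, intercept excluded
        betas -= lr * grad                  # gradient descent update
    return betas

# Tiny usage example: two features, classes separated by x1 + x2 > 1.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(float)
betas = fit_logistic_regression(X, y, lr=0.5, n_iters=5000)
print(betas)  # roughly proportional to [-1, 1, 1], up to a positive scale
```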

Once we have them, we can use them to create a line and pass it through the sigmoid function, which would look something like this,

Sigmoidal Line (Image 28)

If you compare this to the line in Image 7 you can see that it overcomes the shortcoming the previous line had. The values predicted by this line are between 0 and 1.

Once we have the ideal β values we can pass them into the equation in Image 4 to get the Decision Boundary.

Decision Boundary using βs (Image 29)

