
MACHINE LEARNING 101

Logistic Regression — A Complete Guide

The only guide you need to learn Logistic Regression

Prathamesh Gadekar
11 min read · Mar 23, 2023


Introduction

This blog focuses on a supervised machine learning algorithm called Logistic Regression. We will explore the different types of logistic regression, its mathematical derivation, regularization (L1 and L2), and its best and worst use cases. Finally, we will look at the time and space complexity of logistic regression.

Note: The equations in the mathematical derivation are written out as LaTeX. If any of them is hard to read, paste it into a LaTeX equation editor to render it.

What is Logistic Regression?

Logistic regression is a supervised machine learning algorithm that predicts which class an observation belongs to, out of a given finite set of classes.
For example, suppose you are given a set of features describing a person and you have to predict whether that person’s gender is Male or Female. Here the classes are {Male, Female}, and the set of classes is finite.

The intuition of Logistic Regression

Let’s say you are given a 2-D graph with some blue and red points (each color represents a different class). You need to find the simplest way to separate the two classes.

Dot separation example: try to find a way to separate the red and blue points.

You guessed it right! We can just draw a line to separate the two groups of points. We can clearly observe in the diagram below that the black line helps us distinguish between the red and blue points.

The black line separating the red and blue points.

Logistic regression tries to do the same and finds a line that separates the points most effectively. It is important to note that if the points were three-dimensional, we would classify them using a plane, and if they were n-dimensional, we would use a hyperplane. This hyperplane that separates both classes is also called a Decision Surface or Decision Boundary.
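To make this concrete, here is a minimal sketch (using scikit-learn and a made-up toy dataset, purely for illustration) of fitting such a separating line:

```python
# A minimal sketch of fitting a linear decision boundary to 2-D points
# with scikit-learn. The toy dataset below is invented for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two clusters of points: class 0 ("red") and class 1 ("blue")
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=-2, scale=1, size=(50, 2)),
               rng.normal(loc=+2, scale=1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression()
clf.fit(X, y)

# The learned decision boundary is the line w1*x1 + w2*x2 + b = 0
w1, w2 = clf.coef_[0]
b = clf.intercept_[0]
print(f"Decision boundary: {w1:.2f}*x1 + {w2:.2f}*x2 + {b:.2f} = 0")
```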

Assumption of Logistic Regression

Logistic regression makes some assumptions about the dataset. The assumptions are as follows:

  • In the above example, you saw that a straight line could separate the two classes. But suppose you have data like this:
Inseparable data points

You can see that there is no way to separate these points with a single line. Logistic regression assumes the data is linearly separable (i.e. the classes can be separated by a hyperplane).

  • Logistic regression assumes that the data provided is sufficiently large.
  • It assumes that the observations are independent of each other, i.e. each data point comes from a different individual rather than from repeated measurements of the same one.
  • It also assumes binary classification, i.e. there are only two classes. For example, {Male, Female}, {Blue, Red}, etc.

Types Of Logistic Regression

Logistic regression is broadly classified into three types:

  • Binary Logistic Regression:
    In this type, we only have two classes into which we can classify our variable. For Example, {Male, Female},{Blue, Red}, etc.
  • Multinomial Logistic Regression:
    In this type, we have more than two classes to classify our variable into. For Example, {Red, Blue, and Green}.
  • Ordinal Logistic Regression:
    It is similar to multinomial logistic regression, but here the classes have a natural order. As an example, consider the following movie ratings: {Bad, Neutral, Good}. The ratings have an order: Bad < Neutral < Good. If, however, we have the classes {Red, Green, Blue}, there is no order between them since no color is superior to another. (A short scikit-learn sketch of the first two types follows this list.)
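As a quick illustration of the first two types, here is a small sketch assuming scikit-learn is available; its LogisticRegression handles binary and multinomial problems out of the box, while ordinal regression needs a dedicated implementation and is not shown here:

```python
# A brief sketch: scikit-learn's LogisticRegression covers binary and
# multi-class (multinomial) classification. Ordinal regression requires
# a dedicated tool and is not covered by this sketch.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # 3 classes of iris flowers
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:5]))                 # predicted class labels
print(clf.predict_proba(X[:5]).round(2))  # per-class probabilities
```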

Mathematical Equations and Derivations

Let us say we have a binary classification problem with class labels {1, 0}. Let’s define a plane (π), the decision surface of logistic regression, whose equation is W.X + b = 0.

Here, W is the normal vector of the plane π and X is the feature vector of a point. Assuming for simplicity that the plane passes through the origin, we have b = 0. Hence the equation of π becomes W.X = 0. Our task is to find the most optimal W such that the plane classifies as many points as possible correctly.

Let’s say ‘d’ is the signed distance of a point from the plane. If d > 0, the point lies above the plane; if d < 0, it lies below it.

Introduction to Sigmoid Function
A sigmoid function ( σ(d) ) is defined as:

\sigma\left ( d \right ) = \frac{1}{1+ e^{-d}}
Sigmoid Function

The range of the sigmoid function is (0, 1). If we directly used the distance of a point from the plane as our metric for finding the optimal plane, outliers (points very far from the plane) would dominate the calculation. We therefore need a function that treats outliers fairly by squashing their values, and the sigmoid does exactly that.

Graph of Sigmoid Function and Derivative

Why use Sigmoid Function?
Many functions could do the job of the sigmoid, yet it is preferred for the following three reasons:

  • It provides a strong Probabilistic Interpretation:
    We can interpret σ(d) as the probability that a point belongs to class 1. The farther a point lies on the positive side of the plane, the closer this probability gets to 1; the farther it lies on the negative side, the closer it gets to 0. If d = 0, the point lies on the plane, and it has an equal probability of belonging to either class.
  • Faster Calculation of Derivative :
    We can write the derivative of the sigmoid function in terms of the sigmoid itself. Because the derivative reuses the already-computed value of σ(d), it can be evaluated very cheaply, which makes gradient-based training of the algorithm more efficient.
\sigma'\left ( d \right ) = \sigma\left ( d \right )(1-\sigma\left ( d \right ))
Derivative of sigmoid in terms of sigmoid
  • Squashing:
    The sigmoid function treats outliers fairly by squashing their impact on the calculation, as shown in the graph above. For large positive values of d, σ(d) converges to 1, and for large negative values of d, σ(d) converges to 0. Hence, instead of using d directly, we use σ(d), as the small sketch below illustrates.
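The squashing and derivative properties can be checked with a few lines of NumPy (illustrative values only):

```python
# A small sketch showing the two sigmoid properties used above:
# squashing of large distances and the derivative written via the sigmoid.
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

for d in [-100, -5, 0, 5, 100]:
    print(f"d = {d:>4}: sigma(d) = {sigmoid(d):.6f}")  # stays inside (0, 1)

# Derivative check: sigma'(d) = sigma(d) * (1 - sigma(d))
d = 1.3
numeric = (sigmoid(d + 1e-6) - sigmoid(d - 1e-6)) / 2e-6
analytic = sigmoid(d) * (1 - sigmoid(d))
print(numeric, analytic)  # the two values agree closely
```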

How to find the most optimal plane (π) in Logistic Regression:
The derivation is as follows:
Let’s say we have a binary classification of classes {0,1}. Any point can belong to either one of the two classes. Let’s say the equation of optimal plane (π) is π: W.X = 0 (assume the plane passes through the origin for simplicity).
Let's define P(y = 1|X) as the probability that point X belongs to class 1.
Then we can say that:

P(y=0|X) = 1 - P(y=1|X)
Since a point can only belong to one of the two classes, the two probabilities sum to 1.

Let's refer to P(y=1|X) as p(X) for further derivation.

The signed distance (d) of point X from the plane is proportional to W.X (and equals W.X when the normal vector W has unit length). Hence, we take d = W.X.
We know that the sigmoid function σ(d) gives us the probability that a point belongs to class 1.

The Odds Ratio for binary classification is defined as the ratio of the probability of a point X belonging to class 1 to the probability of a point X belonging to class 0.

Hence, we can say that:

Odds = \frac{p(X)}{1-p(X)}
Odds Ratio formula

Since σ(d) gives us the probability of a point belonging to class 1, we can replace p(X) with σ(d), and since d = W.X, we can replace σ(d) with σ(W.X).

Odds = \frac{\sigma (d)}{1-\sigma (d)} = \frac{\sigma (W.X)}{1-\sigma (W.X)} = \frac{\frac{1}{1 + e^{-W.X}}}{1 - \frac{1}{1 + e^{-W.X}}} = e^{W.X}

Taking natural logarithms on both sides we get:

log(Odds) = W.X
log-odds for logistic regression

Hence, we get:

log(\frac{p(X)}{1-p(X)}) = W.X
log-odds equation

The above equation is popularly known as the log-odds equation.
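As a quick sanity check, the log-odds identity can be verified numerically with arbitrary made-up numbers:

```python
# A quick numerical check of the log-odds identity derived above,
# using an arbitrary (made-up) weight vector and point.
import numpy as np

W = np.array([0.7, -1.2])
X = np.array([2.0, 0.5])

d = W @ X                          # signed distance (plane through origin)
p = 1.0 / (1.0 + np.exp(-d))       # p(X) = sigma(W.X)
print(np.log(p / (1 - p)), d)      # both print the same value: W.X
```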

Our task is to find the most optimal W such that the maximum number of points are correctly classified. To do this, we use Maximum Likelihood Estimation, a method that chooses the parameter values under which the observed data is most likely.
Suppose we have n points in our dataset. Since we want to maximize the number of correctly classified points, we can say that:

We have to find a plane W such that it satisfies the two conditions:

  • If a point Xᵢ belongs to class 1 then p(Xᵢ) is as close to 1 as possible.
  • If a point Xᵢ belongs to class 0 then p(Xᵢ) is as close to 0 as possible or we can also say that 1-p(Xᵢ) is as close to 1 as possible.

Hence, if we take the product of p(Xᵢ) over all points belonging to class 1 and of (1 - p(Xᵢ)) over all points belonging to class 0, this product should be as large as possible.

L(W) = \prod_{i:\,Y_{i} = 1} p(X_{i}) \prod_{i:\,Y_{i} = 0} (1-p(X_{i}))
Maximum Likelihood function

Here, L(W) is the likelihood function that we want to maximize.

L(W) = \prod_{i = 1}^{i = n} (p(X_{i})^{Y_{i}}*(1-p(X_{i}))^{(1-Y_{i})})
Taking product common for both parts

Taking natural log on both sides we get:

l(W) = \sum_{i = 1}^{i = n} (Y_{i}*log(p(X_{i})) + (1-Y_{i})*log(1-p(X_{i})))
Log converts product to summation

Here, l(W) is the log-likelihood.
Since p(Xᵢ) = σ(W.Xᵢ) and σ(W.Xᵢ) = 1/(1 + e^(-W.Xᵢ)), we can rewrite the equation as:

l(W) = \sum_{i = 1}^{i = n} (Y_{i}*log(\frac{1}{1 + e^{-W.X_{i}}}) + (1-Y_{i})*log(1-\frac{1}{1 + e^{-W.X_{i}}}))
Substituting the value of p(Xᵢ) as the sigmoid function

Rearranging the equation we get:

l(W) = \sum_{i = 1}^{i = n} (Y_{i}W.X_{i} - log(1 + e^{W.X_{i}}))
Final Equation
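As a quick numerical check (with made-up values) that this rearranged form agrees with the earlier sum of log-probabilities:

```python
# Verify numerically that sum(Y*log(p) + (1-Y)*log(1-p)) equals
# sum(Y*W.X - log(1 + exp(W.X))) for arbitrary made-up values.
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=3)
X = rng.normal(size=(5, 3))
Y = np.array([1, 0, 1, 1, 0])

z = X @ W
p = 1.0 / (1.0 + np.exp(-z))

original   = np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))
rearranged = np.sum(Y * z - np.log(1 + np.exp(z)))
print(original, rearranged)   # the two values agree
```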

To find the most optimal W, we have to maximize the right-hand side of the above equation. Hence we can say that:

W = argmax_{W} (l(W))
Find W such that l(W) is Maximized

In many books, you will also see another form of the above equation. This form arises when we take the class labels as {+1, -1} instead of {1, 0}.

W = argmin_{W} \sum_{i = 1}^{i = n} log(1 + e^{-Y_{i}W.X_{i}})
When we take class labels {+1,-1}

Now, we can find the optimal W using the gradient descent algorithm (applied to the negative log-likelihood, or equivalently gradient ascent on l(W)).
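Below is a minimal from-scratch sketch of that optimization on toy data (no intercept, no regularization; written for clarity rather than efficiency):

```python
# A minimal sketch of finding W by gradient ascent on the log-likelihood
# l(W) = sum_i [Y_i * W.X_i - log(1 + exp(W.X_i))], whose gradient is
# sum_i (Y_i - sigma(W.X_i)) * X_i. Toy data, purely illustrative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2, 1, size=(100, 2)),
               rng.normal(+2, 1, size=(100, 2))])
Y = np.array([0] * 100 + [1] * 100)

W = np.zeros(X.shape[1])
lr = 0.1
for _ in range(500):
    p = sigmoid(X @ W)          # predicted P(y=1|X) for every point
    grad = X.T @ (Y - p)        # gradient of the log-likelihood
    W += lr * grad / len(Y)     # ascent step (maximizing l(W))

accuracy = np.mean((sigmoid(X @ W) >= 0.5) == Y)
print("learned W:", W, "train accuracy:", accuracy)
```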

Hyperparameters and Regularization

Hyperparameters are special parameters specified by the programmer for configuring the model. These are mostly used to tune the algorithm for the dataset that we are working on.

Tuning of hyperparameters means finding the particular value of the hyperparameter for which the model gives the desired output on the dataset. Sometimes, we encounter problems with overfitting and underfitting while applying machine learning algorithms. This problem is also solved by tuning the hyperparameters.

Regularization means adding a penalty term, controlled by a hyperparameter, to the loss so that the model neither overfits nor underfits and we find the most optimal solution.

Let’s say we have a dataset with features (columns) <f₁, f₂, …, f_d> and a corresponding weight vector W = <W₁, W₂, …, W_d>. The loss term in the case of logistic regression is:

LossTerm = \sum_{i = 1}^{i = n}log(1 + e^{-Y_{i}W.X_{i}})
Loss Term

In logistic regression, we have a hyperparameter (λ), and regularization is classified on the basis of the penalty used. The types of regularization are:

  • L1 Regularization
    L1 regularization is also called Lasso regression or the L1-norm penalty. It adds a penalty equal to the sum of the absolute values of the weights in W. L1 regularization is preferred when we have a huge number of features because it creates sparsity in the weight vector: the weight Wᵢ corresponding to feature fᵢ becomes 0 if fᵢ is of little or no importance.
W = argmin_{W} (LossTerm + \lambda \left \| W \right \|^{}_{1})
L1 Regularization
  • L2 Regularization
    L2 regularization is also called Ridge regression or the L2-norm penalty. It adds a penalty equal to the sum of the squared weights in W. In contrast to L1 regularization, which encourages sparsity, L2 regularization shrinks the weight Wᵢ of a less significant feature fᵢ to a very small value, but not necessarily zero.
W = argmin_{W} (LossTerm + \lambda \left \| W \right \|^{2}_{2})
L2 Regularization

The hyperparameter (λ) is usually tuned using random search or grid search (typically with cross-validation).
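A brief sketch of this tuning with scikit-learn follows; note that scikit-learn's LogisticRegression exposes C, which is the inverse of the regularization strength (roughly C ≈ 1/λ):

```python
# A brief sketch of tuning the regularization strength with grid search.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),  # liblinear supports both l1 and l2
    param_grid,
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```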

When to use Logistic Regression?

Here are some advantages and disadvantages of using Logistic Regression:

The advantages are as follows:

  • It works very well for low-latency systems.
  • Since the sigmoid function squashes the impact of outliers, they have a less adverse impact on the algorithm.
  • It is an interpretable model that can provide you with reasons for a decision made.
  • It is simple to understand and apply.
  • If the data is linearly separable or can be made linearly separable by doing some feature engineering, logistic regression can perform very well.

The disadvantages are as follows:

  • It can have low performance on datasets of small size.
  • If the data is not/can’t be made linearly separable, logistic regression may not be the most suitable model to apply to the dataset.
  • If the features in the dataset are multicollinear, then the model may provide wrong probabilistic interpretations.
  • Complex relationships among the variables are very difficult to find using logistic regression.

Space and Time Complexity Analysis:

Let’s say we have n data points and a d-dimensional dataset, i.e. there are d features in our dataset.

Train Time Complexity:
While training the model, we have to store the dataset in memory, which has n rows and d columns. Hence, the space complexity is O(n*d).

The training time complexity is also O(n*d) (per pass of gradient descent over the data).

Test Time Complexity:
After training the model, we only need to store the weight vector W, which has length d. Hence, the space complexity is O(d).

The test time complexity is also O(d): if a query point (Xq) is given to us, we just have to compute W.Xq, which requires only d multiply-add operations.
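A tiny sketch of why prediction is O(d): classifying a query point only needs one dot product with the stored weight vector (made-up dimensions below):

```python
# Prediction cost sketch: one length-d dot product plus a sigmoid.
import numpy as np

d = 10
W = np.random.randn(d)          # stored after training: O(d) space
x_q = np.random.randn(d)        # query point

p = 1.0 / (1.0 + np.exp(-(W @ x_q)))   # O(d) multiply-adds, then the sigmoid
print("P(y=1|x_q) =", p, "-> class", int(p >= 0.5))
```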

👋 Greetings!

Thanks for sticking around for the rest of the blog! I hope you had a great time!

I cover all kinds of Data Science & AI stuff…. and sometimes Programming.

To have stories sent directly to you, subscribe to my newsletter.

