How Does Logistic Regression Work?

Joel Varma Dirisam
Feb 24, 2020 · 4 min read

Logistic Regression is a basic machine learning algorithm that often gives results competitive with more complicated ML algorithms. In this article I’m excited to write about how it works.

Starting off

The working of any machine learning algorithm can be represented geometrically, because it rests on basic math. Think of it like this: your data points sit in an n-dimensional space, where n is the number of features in your data. The example below is in a 2D space where blue represents one class and orange represents the other.

The yellow line represents a plane that we have to fit so that it separates the two classes of data points as accurately as possible. Let’s call this plane W^T (as in, we are transposing W).

W^T is normal to the plane, and finding it is the optimisation problem: we need to search over candidate values (think of a brute-force analysis) to see which W^T classifies the data correctly. The plane has to classify as many points as possible correctly while staying as far as it can from the data points. When we multiply W^T with Xi, the sign of the result tells us which side of the plane Xi falls on, and we compare that with the actual label Yi (+1 or -1). The best, or optimised, W^T is the one that makes the sum of f(X) = Yi*W^T*Xi over i = 1..n maximum.
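To make the sum concrete, here is a minimal sketch in plain Python. The data points, labels, and the two candidate normals `w_a` and `w_b` are made up for illustration, and the plane is assumed to pass through the origin (no bias term), which the article does not state either way.

```python
def dot(w, x):
    """Return W^T * x for two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def total_signed_score(w, points, labels):
    """Sum of Yi * W^T * Xi over all points (labels are +1 or -1)."""
    return sum(y * dot(w, x) for x, y in zip(points, labels))


# Hypothetical 2D data: two positive points and two negative points.
points = [(2.0, 1.0), (1.0, 3.0), (-1.0, -2.0), (-3.0, -1.0)]
labels = [1, 1, -1, -1]

# Two hypothetical candidate normals; the larger total would be preferred.
w_a = (1.0, 1.0)
w_b = (1.0, -1.0)

score_a = total_signed_score(w_a, points, labels)
score_b = total_signed_score(w_b, points, labels)
print(score_a, score_b)
```

Under this criterion `w_a` wins, because every point contributes a positive term to its sum.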

Below are the 4 possible cases, covering both positive and negative values of Yi. In cases 1 and 2 the point is classified correctly; in cases 3 and 4 it is misclassified.

Case 1: Yi*W^T*Xi > 0 when W^T*Xi > 0 and Yi > 0

Case 2: Yi*W^T*Xi > 0 when W^T*Xi < 0 and Yi < 0

Case 3: Yi*W^T*Xi < 0 when W^T*Xi < 0 and Yi > 0

Case 4: Yi*W^T*Xi < 0 when W^T*Xi > 0 and Yi < 0
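The four cases can be sketched as a small helper. The function names `classify_case` and `is_correct` are hypothetical, and returning `None` for a value of exactly zero is my own choice, since the cases above do not cover it.

```python
def classify_case(y, wx):
    """Return the case number (1-4) for label y and value wx = W^T * Xi.

    Returns None when either value is exactly zero, which the four
    cases in the text do not cover.
    """
    if y == 0 or wx == 0:
        return None
    if wx > 0 and y > 0:
        return 1
    if wx < 0 and y < 0:
        return 2
    if wx < 0 and y > 0:
        return 3
    return 4  # wx > 0 and y < 0


def is_correct(y, wx):
    """A point is correctly classified when Yi * W^T * Xi is positive."""
    return y * wx > 0


for y, wx in [(1, 2.5), (-1, -0.5), (1, -3.0), (-1, 4.0)]:
    print(classify_case(y, wx), is_correct(y, wx))
```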

If you think maximising f(X) = Yi*W^T*Xi is not the right approach, you are right. Why?

Allow me to walk you through an example :)

Say we have four data points x1, x2, x3, x4 in a 2D space, where x1 belongs to the positive class and x2, x3, x4 belong to the negative class.

Some candidate planes are susceptible to outliers. How?

Suppose that for some poor W^T the four data points x1, x2, x3, x4 give the values 50, -1, -2, -3 for Yi*W^T*Xi. Summing them up we get 50 - 1 - 2 - 3 = 44, even though only one point is classified correctly. If every other W^T gives a sum less than 44, this plane is the one that gets picked.

Now take another W^T for which the four values are 28, 1, 2, 3. Here all four points are classified correctly, but the sum is only 34. Since 34 is lower than 44, the previous plane is still chosen. The single large value 50, coming from one outlying point, dominates the sum.
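The arithmetic of the two cases can be checked with a few lines of Python. The variable names are mine; the numbers are the ones from the example.

```python
# Values of Yi * W^T * Xi for the four points under each candidate plane.
plane_a = [50, -1, -2, -3]
plane_b = [28, 1, 2, 3]


def summarize(values):
    """Return (sum of values, number of correctly classified points)."""
    correct = sum(1 for v in values if v > 0)
    return sum(values), correct


sum_a, correct_a = summarize(plane_a)
sum_b, correct_b = summarize(plane_b)

print(sum_a, correct_a)  # 44 1
print(sum_b, correct_b)  # 34 4
```

The plane with the larger sum is the one that classifies fewer points correctly, which is exactly the problem.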

For this reason we use a squashing function that reduces the influence of very large values (outliers) by restricting every value to the range (0, 1).

Graph of 1/(1+e^-x):

The sigmoid function is chosen for squashing the values because it has two properties:
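As a rough check, here is a sketch that applies 1/(1+e^-x) to the two sets of values from the example (50, -1, -2, -3 and 28, 1, 2, 3) before summing. The `sigmoid` helper is written in a numerically stable form, which is my own choice and not something the article prescribes.

```python
import math


def sigmoid(z):
    """Compute 1 / (1 + e^-z) without overflowing for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


plane_a = [50, -1, -2, -3]  # only one point classified correctly
plane_b = [28, 1, 2, 3]     # all four points classified correctly

squashed_a = sum(sigmoid(v) for v in plane_a)
squashed_b = sum(sigmoid(v) for v in plane_b)

print(round(squashed_a, 4), round(squashed_b, 4))
```

After squashing, the value 50 contributes at most 1 to the sum, so the plane that classifies all four points correctly now gets the higher score.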

1. Probabilistic interpretation

2. Easy to differentiate
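On the second property: the derivative of the sigmoid can be written in terms of the sigmoid itself, s(x) * (1 - s(x)). The sketch below compares that closed form with a finite-difference estimate; the helper names and the step size `h` are my own choices.

```python
import math


def sigmoid(z):
    """Compute 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Closed-form derivative: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_grad(f, z, h=1e-6):
    """Central finite-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)


for z in (-2.0, 0.0, 1.5):
    print(z, sigmoid_grad(z), numeric_grad(sigmoid, z))
```

The two columns agree to several decimal places, and at z = 0 the derivative is 0.25, the steepest point of the curve.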

So the equation we are solving to find the best W^T after applying the squashing function is:

W* = argmax over W of the sum, for i = 1..n, of 1/(1 + e^(-Yi*W^T*Xi))

Here argmax means we pick the W that maximises the sum.

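Geometrically we can show that if g(x) is monotonically increasing, then g(f(x)) reaches its minimum and maximum at the same x as f(x). So applying g does not change where the optimum is.

f(x) = x² has its minimum value at x = 0

g(f(x)) = log(x²) is also lowest around x = 0 (strictly, log is undefined exactly at 0, and the value tends to negative infinity as x approaches 0)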
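A quick numeric check of this idea, as a sketch: I use f(x) = x² + 1 instead of x², an assumption made only so that the log is defined at x = 0, and search a small grid of x values.

```python
import math


def f(x):
    """Shifted parabola; the +1 keeps log(f(x)) defined at x = 0."""
    return x * x + 1.0


def g_of_f(x):
    """Apply the monotonically increasing function log to f(x)."""
    return math.log(f(x))


# Grid from -2.0 to 2.0 in steps of 0.1.
xs = [i / 10 for i in range(-20, 21)]

argmin_f = min(xs, key=f)
argmin_gf = min(xs, key=g_of_f)

print(argmin_f, argmin_gf)
```

Both searches land on the same x, even though the minimum values themselves differ (1 for f, 0 for log of f).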

After simplifying the original equation we get:
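The simplified equation itself is not reproduced here, so what follows is only a sketch of the standard form this step usually leads to. It assumes the objective is the sum of sigmoid terms described above and that log, being monotonically increasing, is applied to each term:

```latex
\begin{aligned}
W^{*} &= \arg\max_{W} \sum_{i=1}^{n} \log \frac{1}{1 + e^{-Y_i W^{T} X_i}} \\
      &= \arg\max_{W} \sum_{i=1}^{n} -\log\left(1 + e^{-Y_i W^{T} X_i}\right) \\
      &= \arg\min_{W} \sum_{i=1}^{n} \log\left(1 + e^{-Y_i W^{T} X_i}\right)
\end{aligned}
```

Strictly speaking, taking the log of each term is not the same as taking the log of the whole sum, so this is best read as the conventional logistic loss rather than an exact rewrite of the earlier objective.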

Analytics Vidhya
