THE STORY OF LOGISTIC REGRESSION…

Nishesh Gogia

Published in Analytics Vidhya · Jun 8, 2020

Logistic Regression is one of the simplest and most elegant classification techniques. It is generally used to find the hyperplane (in higher dimensions) or line (in 2D) that best separates positive points from negative points. In other words, Logistic Regression is one of the best techniques for BINARY CLASSIFICATION.

I will be very straightforward in explaining all the details of LR (short for LOGISTIC REGRESSION from now on).

It's a wonderful algorithm. Just treat the algorithm as a story and its equations as poetry, and you will automatically fall in love with Machine Learning.

Let's get started then.

So the aim of LR is to find the hyperplane which best separates positive points from negative points, or more generally, which separates two different classes.

In this article series, I will try to explain the LR algorithm through geometry and through its LOSS FUNCTION.

Let's begin with geometry.

In the picture above we can see two different classes. Let's say orange is positive and blue is negative. Our aim is to find the best possible line (in 2D) or the best possible hyperplane (in higher dimensions). So what is the equation of a line?

y=mx+c, where c is the intercept and m is the slope of the line.

What is the equation of a hyperplane?

w^T*x + b = 0, where T means transpose (read it as "w transpose x plus b"), w is a vector of dimensionality d, x is also a vector of dimensionality d, and b is a scalar quantity (the intercept).

“w” here is always perpendicular to the hyperplane.

So if I generalize to all dimensions, I can say that in LR we need to find the equation of the hyperplane, which is

w^T*x + b = 0

So basically the task is very simple: I just need to find w and b.
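Just to make this concrete (this is my own tiny sketch, not part of the derivation), here is what evaluating w^T*x + b looks like in NumPy for a made-up w, b, and point x:

```python
import numpy as np

# Hypothetical hyperplane parameters: w has dimensionality d = 2, b is a scalar intercept.
w = np.array([2.0, -1.0])
b = 0.5

# A query point x with the same dimensionality d.
x = np.array([1.0, 3.0])

# w^T*x + b: positive on one side of the hyperplane, negative on the other, zero on it.
value = np.dot(w, x) + b
print(value)  # 2*1 + (-1)*3 + 0.5 = -0.5, so x lies on the negative side
```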

Here we are making a very big assumption. If you notice, we are saying that we just need to find a hyperplane which separates two classes, but what if those two classes are not linearly separable,

like in the picture above?

ASSUMPTION OF LOGISTIC REGRESSION: WE ARE ASSUMING THAT THE CLASSES ARE LINEARLY SEPARABLE

If you have read my other articles, you will have seen assumptions in NAIVE BAYES and KNN as well:

ASSUMPTION OF NAIVE BAYES: FEATURES ARE CONDITIONALLY INDEPENDENT

ASSUMPTION OF KNN: POINTS IN THE NEIGHBOURHOOD OF A POINT BELONG TO THE SAME CLASS AS THAT POINT

Let's get back to Logistic Regression. We now know the assumption, and for now we know that we can't use Logistic Regression if the classes are not linearly separable.

So let's take an example where we have two classes, positive and negative. Let's say y = +1 for a positive point and y = -1 for a negative point. Now someone may ask why we are taking y = -1 here when in all the other algorithms we take y = 0. I will come to this in a while.

So now we have yi that belongs to {-1, +1}.

Now we know that the distance of a point xi from the hyperplane (taking the hyperplane to pass through the origin, so b = 0) is di = w^T*xi / ||w||. Since w is perpendicular to the hyperplane and we take it to be a unit vector, ||w|| = 1, so

di = w^T*xi (distance from the hyperplane to the positive points)

Let's say xi are the positive points and xj are the negative points, as shown in the image.

Then dj = w^T*xj (distance from the hyperplane to the negative points),

and we can say that

di = w^T*xi > 0 (w and xi are on the same side of the hyperplane)

dj = w^T*xj < 0 (w and xj are on opposite sides of the hyperplane)
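Here is a minimal NumPy sketch of these signed distances for a couple of made-up points. Note that when w is not a unit vector we have to divide by ||w|| to get the geometric distance; the sign tells us which side of the hyperplane the point is on:

```python
import numpy as np

w = np.array([3.0, 4.0])      # hypothetical normal vector, ||w|| = 5
xi = np.array([2.0, 1.0])     # a point on the positive side (w^T*xi > 0)
xj = np.array([-1.0, -2.0])   # a point on the negative side (w^T*xj < 0)

w_norm = np.linalg.norm(w)    # ||w||

di = np.dot(w, xi) / w_norm   # (3*2 + 4*1)/5 = +2.0
dj = np.dot(w, xj) / w_norm   # (3*(-1) + 4*(-2))/5 = -2.2

print(di, dj)                 # 2.0 -2.2
```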

NOW MY CLASSIFIER WILL LOOK LIKE THIS:

if w^T*xi > 0, then yi = +1

if w^T*xi < 0, then yi = -1

Now I want to mention one small detail: this classifier is not perfect, and it will also make mistakes. For example, for the point xq in the picture, w^T*xq > 0 but yq = -1, so the classifier obviously makes a mistake there.
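A quick sketch of this decision rule, assuming w is already known and b = 0 as in the text (the helper name predict is just something I made up for illustration):

```python
import numpy as np

def predict(w, X):
    """Return +1 where w^T*x > 0 and -1 where w^T*x < 0."""
    scores = X @ w                      # one w^T*xi per row of X
    return np.where(scores > 0, 1, -1)  # sign-based class labels

w = np.array([1.0, -2.0])               # assumed, already-learned normal vector
X = np.array([[3.0, 1.0],               # w^T*x = 1   -> +1
              [1.0, 2.0],               # w^T*x = -3  -> -1
              [0.5, 0.1]])              # w^T*x = 0.3 -> +1
print(predict(w, X))                    # [ 1 -1  1]
```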

Let's see some cases:

CASE 1

If yi = +1 and w^T*xi > 0,

then yi*w^T*xi > 0.

CASE 2

If yi = -1 and w^T*xi < 0,

then yi*w^T*xi > 0.

It simply means that whenever the classifier classifies a point correctly, the condition yi*w^T*xi > 0 is satisfied.

CASE 3

If yi = +1 and w^T*xi < 0,

then yi*w^T*xi < 0.

CASE 4

If yi = -1 and w^T*xi > 0,

then yi*w^T*xi < 0.

It simply means that whenever the classifier classifies a point incorrectly, the condition yi*w^T*xi < 0 is satisfied.

This is the reason we made y=-1 for negative points.

So again, what is our main aim? To get the minimum number of misclassified points and the maximum number of correctly classified points. Basically, all we want is as many points as possible satisfying

yi*w^T*xi > 0.

If we find a hyperplane whose w satisfies the above condition for as many points as possible, we are done.
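To make the condition concrete, here is a small sketch with made-up points and labels that counts how many points satisfy yi*w^T*xi > 0 for a given w; a better w is simply one that makes this count larger:

```python
import numpy as np

w = np.array([1.0, 1.0])          # hypothetical candidate hyperplane (b = 0)
X = np.array([[2.0, 1.0],         # positive point, w^T*x = 3
              [-1.0, -2.0],       # negative point, w^T*x = -3
              [-0.5, 1.0]])       # positive point, w^T*x = 0.5
y = np.array([1, -1, 1])

margins = y * (X @ w)             # yi * w^T*xi for every point
correct = np.sum(margins > 0)     # number of correctly classified points
print(margins, correct)           # [3.  3.  0.5] 3
```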

We have an optimization problem: we need to find the optimal w* which maximizes the above condition over all the points,

w* = argmax_w Σi yi*w^T*xi,

as shown in the picture below.

Now, as I told you, for me algorithms are stories and equations are poetry, so let's begin with the second part of this story.

Up to now we have a basic mathematical problem in which we need to find the "w", or hyperplane, that maximizes the sum of signed distances (yi*w^T*xi).

Now let's see what the problem is with this statement and how we can improve our solution. Let's take a simple example in which we have an outlier. For all the "x" points, yi = +1 (we are assuming that all the "x" markers in the picture are positive and all the "0" markers are negative), and for all the "0" points in the picture, yi = -1.

LET'S SEE SOME CASES…

CASE 1: The task is to find the best "w" or hyperplane which maximizes the sum of signed distances. Let's say the algorithm found hyperplane1 (shown in the picture below); let's compute the sum of signed distances for this picture.

The distance from the hyperplane is 1 for all the points except the outlier; the distance of the outlier from hyperplane1 is 100.

In the picture we can see the sum comes out to be -90.

CASE 2: Again, the task is to find the best "w" or hyperplane which maximizes the sum of signed distances. Let's say the algorithm found hyperplane2 (shown in the picture below); let's compute the sum of signed distances for this picture.

The distance from the hyperplane increases by 1 for each point except the outlier; the distance of the outlier from hyperplane2 is 1.

In this picture we can see the sum comes out to be 1.

Now if we go by the statement "we want the hyperplane which maximizes the sum of signed distances", we will pick hyperplane2 (sum is 1) rather than hyperplane1 (sum is -90). But if we think intuitively, hyperplane1 is the better classifier, with just one mistake.
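Here is a small numerical sketch of the same argument with made-up signed distances (chosen to be consistent with the numbers above, but not taken from the article's figure): ten correctly classified points at +1 plus one outlier at -100 sum to -90 for hyperplane1, while hyperplane2 misclassifies five points yet its signed distances still sum to +1, so maximizing the raw sum picks the worse hyperplane.

```python
import numpy as np

# Hypothetical signed distances yi * w^T*xi for 11 points (10 regular + 1 outlier).
# Hyperplane 1: every regular point correct at +1, the outlier badly wrong at -100.
h1 = np.array([1.0] * 10 + [-100.0])

# Hyperplane 2: five regular points misclassified, but no huge negative term.
h2 = np.array([-1.0, -2.0, -3.0, -4.0, -5.0, 1.0, 2.0, 3.0, 4.0, 5.0, 1.0])

print(h1.sum(), np.sum(h1 < 0))   # -90.0, 1 mistake
print(h2.sum(), np.sum(h2 < 0))   # 1.0, 5 mistakes

# Maximizing the raw sum prefers hyperplane 2 even though it makes far more mistakes.
```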

NOW WHAT TO DO?

Can we think of something which can taper off the extra distance, since the problem comes from outliers?

if signed distance is small — use signed distance as it is

if signed distance is large — use a function which can taper off extra distance

As in the picture below, I have plotted signed distance vs f(signed distance); this f(signed distance) must be able to taper off the extra distance, as it does in the picture.

This tapering off of distance is called "SQUASHING".

So we can replace our optimization problem with

w* = argmax_w Σi f(yi*w^T*xi),

where f is the squashing function.

Now the question is: which function should we use?

There are lots of functions which can help with tapering off, but we prefer the sigmoid function (shown in the picture) for two main reasons (a small numerical sketch follows the list below):

  1. It has a very nice probabilistic interpretation.
  2. It is easy to differentiate, and we want our function to be easily differentiable, otherwise we won't be able to solve our optimization problem.
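As a rough sketch of the squashing idea (reusing the made-up signed distances from the earlier sketch), applying the sigmoid before summing bounds each point's contribution between 0 and 1, so the single outlier can no longer drag the objective down and hyperplane1 now scores higher than hyperplane2:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes any real value into (0, 1)

h1 = np.array([1.0] * 10 + [-100.0])                  # hyperplane 1 signed distances
h2 = np.array([-1.0, -2.0, -3.0, -4.0, -5.0,
               1.0, 2.0, 3.0, 4.0, 5.0, 1.0])         # hyperplane 2 signed distances

print(sigmoid(h1).sum())   # ~7.31: the outlier now contributes almost 0 instead of -100
print(sigmoid(h2).sum())   # ~5.73: the five misclassified points pull it down
```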

That's enough for one article. I will continue "The Story of Logistic Regression" in the next articles.

THANKS FOR READING, AND PLEASE SHARE YOUR THOUGHTS IN THE COMMENTS.
