How Does Logistic Regression Work?
Logistic Regression is a basic machine learning algorithm that often holds its own against more complicated ML algorithms. In this article I’m excited to write about how it works.
The working of any machine learning algorithm can be represented in geometric space, because it rests on basic math. Think of it like this: your data points sit in an n-dimensional space, where n is the number of features in your data. The example below is in a 2D space, where blue represents one class and orange represents the other.
The yellow line represents the plane we have to fit so that it separates the two classes of data points accurately. Let’s call this plane W^T (as in, we are transposing W).
W is normal to the plane, and finding it is the optimisation problem: we need to search over candidate values of W^T (by brute force, in the simplest case) to see which plane classifies the data correctly. It should classify as many points as possible correctly while staying far enough away from the data points. Multiplying W^T with Xi gives a signed value whose sign is the predicted class, and Yi is the actual class label (positive or negative). The best, or optimised, W^T makes the sum of f(Xi) = Yi*W^T*Xi over i = 1..n as large as possible.
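As a rough illustration of this search, here is a minimal sketch in plain Python. The four data points, the labels, and the helper names `dot` and `objective` are all made up for the example, and trying one unit-length normal per degree is just one simple way to do a brute-force search.

```python
import math

# Hypothetical toy data: two features per point, labels are +1 or -1.
X = [(2.0, 1.0), (1.5, 2.0), (-1.0, -1.5), (-2.0, -0.5)]
y = [1, 1, -1, -1]

def dot(w, x):
    """Compute W^T * x for two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))

def objective(w, X, y):
    """Sum of Yi * W^T * Xi over all points (the quantity being maximised)."""
    return sum(yi * dot(w, xi) for xi, yi in zip(X, y))

# Brute force: try one unit-length normal vector per degree of angle.
candidates = [
    (math.cos(math.radians(d)), math.sin(math.radians(d))) for d in range(360)
]
best_w = max(candidates, key=lambda w: objective(w, X, y))

print("best W:", best_w)
print("objective:", objective(best_w, X, y))
```

The candidates are restricted to unit length because otherwise the sum could be made arbitrarily large just by scaling W.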
Below are the 4 possible cases, covering both positive and negative values of Yi. In cases 1 and 2 the point is classified correctly; in cases 3 and 4 it is misclassified.
Case 1: Yi*W^T*Xi > 0, when W^T*Xi > 0 and Yi > 0
Case 2: Yi*W^T*Xi > 0, when W^T*Xi < 0 and Yi < 0
Case 3: Yi*W^T*Xi < 0, when W^T*Xi < 0 and Yi > 0
Case 4: Yi*W^T*Xi < 0, when W^T*Xi > 0 and Yi < 0
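The four cases boil down to checking the sign of the product of the label and the score. A small sketch, where `classify_case` and the example scores are hypothetical names and values chosen for illustration:

```python
def classify_case(y_i, score):
    """Return whether a point is classified correctly.

    y_i is the true label (+1 or -1) and score is W^T * Xi.
    """
    margin = y_i * score
    if margin > 0:
        return "correct"
    if margin < 0:
        return "misclassified"
    return "on the plane"

# One illustrative (label, score) pair for each of the four cases.
examples = [(1, 2.5), (-1, -2.5), (1, -2.5), (-1, 2.5)]
for case_number, (y_i, score) in enumerate(examples, start=1):
    print(f"Case {case_number}: {classify_case(y_i, score)}")
```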
If you think that maximising f(X) = Yi*W^T*Xi is not the right approach, you are right. Why?
Allow me to walk you through an example :)
Say we have four data points x1, x2, x3, x4 in a 2D space, where x1 belongs to the positive class and x2, x3, x4 belong to the negative class.
Some of the normal planes are susceptible to outliers. How?
Take the 4 data points x1, x2, x3, x4, and suppose that for some poor W^T the function f(Xi) = Yi*W^T*Xi gives the values 50, -1, -2, -3. Summing them up we get 50 - 1 - 2 - 3 = 44, even though only 1 point is classified correctly. Suppose every other W^T gives a sum less than 44.
Take another W^T for which the 4 data points give the values 28, 1, 2, 3. Here all four points are classified correctly, but the sum is only 34. Because 34 is lower than 44, maximising the sum picks the previous plane, whose score is dominated by a single outlier.
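The arithmetic of this example can be checked with a few lines of Python. The names `plane_a`, `plane_b`, and `summarize` are just labels for this sketch; the numbers are the ones from the example.

```python
# Values of Yi * W^T * Xi for the four points under two candidate planes.
plane_a = [50, -1, -2, -3]  # one outlier dominates the sum
plane_b = [28, 1, 2, 3]     # every point has a positive value

def summarize(values):
    """Return (sum of values, number of correctly classified points)."""
    correct = sum(1 for v in values if v > 0)
    return sum(values), correct

for name, values in [("plane A", plane_a), ("plane B", plane_b)]:
    total, correct = summarize(values)
    print(f"{name}: sum = {total}, correctly classified = {correct} of 4")
```

Plane A wins on the raw sum even though it gets only one point right, which is exactly the weakness of maximising the plain sum.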
For this reason we use a squashing function to limit the influence of the larger numbers (outliers) and restrict the range to (0, 1).
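To see the effect of squashing, the sketch below applies 1/(1+e^-x) to the same two sets of values used in the outlier example (50, -1, -2, -3 and 28, 1, 2, 3). The function name `sigmoid` and the two-branch formulation, which avoids overflow for large negative inputs, are choices made for this illustration.

```python
import math

def sigmoid(x):
    """Squash x into the open interval (0, 1) without overflowing."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

plane_a = [50, -1, -2, -3]  # only one point classified correctly
plane_b = [28, 1, 2, 3]     # all four points classified correctly

squashed_a = sum(sigmoid(v) for v in plane_a)
squashed_b = sum(sigmoid(v) for v in plane_b)

print(f"plane A: raw sum = {sum(plane_a)}, squashed sum = {squashed_a:.4f}")
print(f"plane B: raw sum = {sum(plane_b)}, squashed sum = {squashed_b:.4f}")
```

After squashing, the value 50 contributes at most 1 to the sum, so the plane that classifies all four points correctly now gets the higher score.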
[Figure: graph of the sigmoid function 1/(1+e^-x)]
The sigmoid function is chosen for squashing the values because it has two properties:
1. Probabilistic interpretation
2. Easy to differentiate
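On the second property: the derivative of the sigmoid can be written in terms of the sigmoid itself, as s(x) * (1 - s(x)). The sketch below compares that closed form with a numerical central-difference estimate; the function names and the step size `h` are choices made for this illustration.

```python
import math

def sigmoid(x):
    """Squash x into the open interval (0, 1) without overflowing."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Closed-form derivative: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_grad(x, h=1e-6):
    """Central-difference estimate of the derivative, for comparison."""
    return (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)

for x in (-2.0, 0.0, 3.0):
    print(f"x = {x}: closed form = {sigmoid_grad(x):.6f}, "
          f"numeric = {numeric_grad(x):.6f}")
```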
So, after applying the squashing function, the equation we are solving to find the best W^T is:
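A plausible written-out form of this objective, assuming the squashing function 1/(1+e^-x) is applied to each term Yi*W^T*Xi before summing, is sketched below. The symbols W* (the best W) and σ (the sigmoid) are notation introduced here for the sketch.

```latex
W^{*} = \arg\max_{W} \sum_{i=1}^{n} \sigma\left(y_i \, W^{T} x_i\right),
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```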
Here argmax means we pick the W^T that maximises the function f(X).
We can argue geometrically that if g(x) is monotonically increasing, then g(f(x)) reaches its minimum or maximum at the same x as f(x).
f(x) = x² has its minimum at x = 0
g(f(x)) = log(x²) is also smallest around x = 0 (it tends to -∞ there, since log(0) is undefined)
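The idea behind this example, that applying an increasing function such as log does not move the location of the optimum, can be checked numerically. The sketch below uses x² + 1 instead of x², a small variation assumed here so that the log is defined at x = 0; the grid of x values is also just an illustrative choice.

```python
import math

def f(x):
    # Hypothetical variation on x^2: the +1 keeps log defined at x = 0.
    return x * x + 1.0

def g(v):
    # log is monotonically increasing on positive inputs.
    return math.log(v)

# Evaluate both functions on a grid from -3.0 to 3.0 in steps of 0.1.
xs = [i / 10 for i in range(-30, 31)]

argmin_f = min(xs, key=f)
argmin_gf = min(xs, key=lambda x: g(f(x)))

print("x minimising f(x):   ", argmin_f)
print("x minimising g(f(x)):", argmin_gf)
```

Both searches land on the same x, even though the minimum values themselves (1 for f and 0 for log of f) are different.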
After simplifying the original equation we get: