Layman Definition:
Logistic Regression is a supervised learning classification algorithm used to classify observations into a discrete set of classes.
It is a predictive analysis algorithm used to estimate the probability score of an event.
Logistic Regression is used when the dependent variable (target) is categorical, for example:
- Spam Detection: Predicting if an email is Spam or not
- Credit Card Fraud: Predicting if a given credit card transaction is fraud or not
- Health: Predicting if a given mass of tissue is benign or malignant
- Marketing: Predicting if a given user will buy an insurance product or not
- Banking: Predicting if a customer will default on a loan.
The logistic regression algorithm also uses a linear equation with independent predictors to predict a value. The predicted value can be anywhere between negative infinity and positive infinity. We need the output of the algorithm to be a class variable, i.e. 0 (no) or 1 (yes). Therefore, we squash the output of the linear equation into the range [0, 1]. To squash the predicted value between 0 and 1, we use the sigmoid function.
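As a quick sketch of that squashing step (a minimal example using numpy; the function name `sigmoid` is just illustrative):

```python
import numpy as np

def sigmoid(z):
    """Squash any real-valued input into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative inputs map toward 0, large positive inputs toward 1.
print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
```

No matter how extreme the linear output z is, the squashed value always stays strictly between 0 and 1.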
Why not linear regression?
When the response variable has only 2 possible values, it is desirable to have a model that predicts the value either as 0 or 1 or as a probability score that ranges between 0 and 1.
Linear regression does not have this capability, because if you use linear regression to model a binary response variable, the resulting model may not restrict the predicted Y values within 0 and 1.
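To see this concretely, here is a small sketch (toy data, assuming numpy) where ordinary least squares fitted to 0/1 labels produces predictions outside [0, 1]:

```python
import numpy as np

# Toy binary data: feature x, label y in {0, 1}.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])

# Fit ordinary least squares: y ~ b0 + b1 * x.
A = np.column_stack([np.ones_like(x), x])
b0, b1 = np.linalg.lstsq(A, y, rcond=None)[0]

# Predictions for inputs outside the training range escape [0, 1].
print(b0 + b1 * 0.0)   # below 0
print(b0 + b1 * 10.0)  # above 1
```

These out-of-range values cannot be interpreted as probabilities, which is exactly the problem logistic regression fixes.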
Linear vs Logistic Regression
This is where logistic regression comes into play. In logistic regression, you get a probability score that reflects the probability of the occurrence of the event.
An event, in this case, is each row of the training dataset. It could be something like classifying whether a given email is spam, whether a mass of cells is malignant, whether a user will buy a product, and so on.
Model Representation:
So suppose the output of our linear equation is z:
z = θT x = θ0 + θ1x1 + θ2x2 + ... + θnxn
This z is then passed to the sigmoid function g(z) = 1/(1 + e^-z), which returns a squashed value h; h lies in the range of 0 to 1.
hθ(x) = P(y=1|x; θ) ==> Probability that y = 1, given x, parameterized by θ ==> Hypothesis
- Decision Boundary:
In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
hθ(x) ≥ 0.5 ==> y = 1
hθ(x) < 0.5 ==> y = 0
where hθ(x) = g(θT x)
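Putting the hypothesis and the 0.5 threshold together, a minimal sketch (assuming numpy; the parameter values and the name `predict` are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X, threshold=0.5):
    """Threshold the hypothesis h(x) = g(theta^T x) to get class labels."""
    probs = sigmoid(X @ theta)          # probability that y = 1
    return (probs >= threshold).astype(int)

# Hypothetical parameters; each row of X has a leading 1 for the bias term.
theta = np.array([-4.0, 1.0])
X = np.array([[1.0, 2.0],   # theta^T x = -2 -> probability < 0.5 -> class 0
              [1.0, 6.0]])  # theta^T x =  2 -> probability > 0.5 -> class 1
print(predict(theta, X))    # [0 1]
```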
We create a cost function J(θ) to minimise the error and make our model more accurate with its predictions.
If we try to use the cost function of linear regression in logistic regression, it would be of no use: it would end up being a non-convex function with many local minima, making it very difficult to minimize the cost value and find the global minimum.
For logistic regression, the Cost function is defined as:
Cost(hθ(x), y) = −log(hθ(x))      if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))  if y = 0
Cost = 0 if y = 1 and hθ(x) = 1.
But as hθ(x) → 0, Cost → infinity. That is, if we predict P(y=1|x; θ) = 0 when in fact y = 1, we penalize the learning algorithm with a very large cost.
Ex: we say there is a 0% chance of a malignant tumor, but it turns out to be malignant.
Cost = 0 if y = 0 and hθ(x) = 0.
But as hθ(x) → 1, Cost → infinity. That is, if we predict P(y=0|x; θ) = 0 when in fact y = 0, we penalize the learning algorithm with a very large cost.
Ex: we say there is a 100% chance of a malignant tumor, but it turns out to be benign.
The above two functions can be compressed into a single function:
Cost(hθ(x), y) = −y log(hθ(x)) − (1 − y) log(1 − hθ(x))
Averaging this cost over all m training examples gives the overall cost function:
J(θ) = −(1/m) Σ [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ]
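A sketch of the compressed cost, averaged over all training examples (assuming numpy; this is the standard cross-entropy form of the logistic cost):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Cross-entropy cost: -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# A confident correct prediction costs little; a wrong one costs much more.
X = np.array([[1.0, 5.0]])            # one example, with a bias column
c_good = cost(np.array([-4.0, 1.0]), X, np.array([1]))  # predicts y=1 likely
c_bad = cost(np.array([4.0, -1.0]), X, np.array([1]))   # predicts y=1 unlikely
print(c_good, c_bad)
```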
Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.
The main goal of Gradient descent is to minimize the cost value. i.e. min J(θ).
Now, to minimize our cost function, we run the gradient descent update on each parameter simultaneously:
θj := θj − α (1/m) Σ ( hθ(x(i)) − y(i) ) xj(i)    (for every j)
This algorithm looks identical to linear regression, but the definition of the hypothesis is different:
Linear Reg → hθ(x) = θT x
Logistic Reg → hθ(x) = 1/(1 + e^- (θT x))
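The gradient descent loop above can be sketched on toy data as follows (assuming numpy; the learning rate alpha and iteration count are arbitrary illustrative choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=5000):
    """Update every theta_j simultaneously by
    theta_j := theta_j - alpha * (1/m) * sum((h(x_i) - y_i) * x_ij)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        gradient = (X.T @ (h - y)) / m  # vectorized form of the sum
        theta -= alpha * gradient
    return theta

# Toy separable data; first column of X is the bias term.
X = np.array([[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1], dtype=float)
theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds)  # should recover the training labels
```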
One-vs-All Multiclass Classification:
Say we have a classification problem with N distinct classes. In this case, we have to train a multi-class classifier instead of a binary one.
One-vs-all classification is a method that involves training N distinct binary classifiers, each designed to recognize a particular class. Those N classifiers are then used together for multi-class classification.
So the only thing we have to do is train N binary classifiers instead of just one, and that's pretty much it.
- We basically choose one class and lump all the others into a single second class. We do this repeatedly, applying binary logistic regression to each class, and then use the hypothesis that returns the highest value as our prediction.
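A sketch of one-vs-all on a small toy problem (assuming numpy; all the names, hyperparameters, and the toy data are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, alpha=0.1, iterations=3000):
    """Fit one binary logistic regression with gradient descent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iterations):
        theta -= alpha * (X.T @ (sigmoid(X @ theta) - y)) / len(y)
    return theta

def one_vs_all(X, y, num_classes):
    """Train one classifier per class: class k vs. everything else."""
    return np.array([train_binary(X, (y == k).astype(float))
                     for k in range(num_classes)])

def predict(all_theta, X):
    """Pick the class whose classifier returns the highest probability."""
    return np.argmax(sigmoid(X @ all_theta.T), axis=1)

# Toy 3-class data in two features (bias column first).
X = np.array([[1, 0, 0], [1, 0, 1],   # class 0: near the origin
              [1, 4, 0], [1, 4, 1],   # class 1: large first feature
              [1, 0, 4], [1, 1, 4]],  # class 2: large second feature
             dtype=float)
y = np.array([0, 0, 1, 1, 2, 2])
all_theta = one_vs_all(X, y, 3)
print(predict(all_theta, X))  # should recover [0 0 1 1 2 2]
```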
I hope this article was helpful. In a future post, I will cover the coding part and more details.