Using Mean Squared Error (MSE) Loss in Logistic Regression?

Pranav Kushare
3 min read · Nov 13, 2022


I have been revisiting some algorithms/concepts that were on the verge of vanishing from my memory :) While revising logistic regression, I realized that I had never focused on some minute details. Why don't we use MSE loss while training a logistic regression model? After googling across different sources, here's what I understood.

Before diving into the answer, let's revise some concepts.

As you might know, LogLoss, also known as cross-entropy loss, is normally used to train a logistic regression classification model. The formula is as follows:

LogLoss = −(1/N) · Σᵢ [ yᵢ · log(pᵢ) + (1 − yᵢ) · log(1 − pᵢ) ]

Here, y is the actual value (the label), and p is the probability predicted by the model.
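To make this concrete, here is a minimal sketch of the formula in plain NumPy (the function name and sample values are my own, not from any library):

```python
# Minimal sketch of the log loss formula above, per sample.
import numpy as np

def log_loss(y, p, eps=1e-15):
    # Clip p away from exactly 0 and 1 so the logs stay finite.
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.1, 0.6, 0.4])
print(log_loss(y_true, y_prob))         # per-sample losses
print(log_loss(y_true, y_prob).mean())  # mean, as minimized during training
```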

This log loss is a convex function, which means any local minimum it has is also the global minimum. As with many other ML algorithms, gradient descent is used for optimization, i.e., finding the coefficient values for which the cost function is lowest. If you remember, one of the main conditions for gradient descent to reliably find the optimal values is that the function it is iterating over should have a single (global) minimum.

Let's see this by plotting the log loss function. Breaking complicated things into smaller parts always makes them easier and quicker to understand, so let's break the log loss into parts. Depending on the value of yᵢ, the per-sample loss is −log(pᵢ) when yᵢ = 1, and −log(1 − pᵢ) when yᵢ = 0:

[Plot of cross-entropy loss: −log(p) for y = 1 and −log(1 − p) for y = 0]

Each branch is clearly convex: −log(p) keeps decreasing as p approaches 1, and −log(1 − p) keeps decreasing as p approaches 0, so minimizing the loss pushes the prediction toward the correct label. That's why convex functions are important in gradient descent; on a non-convex function, which contains more than one local minimum, the algorithm can get stuck and never find the global minimum.
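If you want to see these two branches yourself, here is a quick sketch (assuming matplotlib is available; the styling choices are my own):

```python
# Plot the two branches of log loss:
#   -log(p)     when y = 1
#   -log(1 - p) when y = 0
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 0.99, 200)
plt.plot(p, -np.log(p), label="y = 1: -log(p)")
plt.plot(p, -np.log(1 - p), label="y = 0: -log(1 - p)")
plt.xlabel("predicted probability p")
plt.ylabel("loss")
plt.legend()
plt.show()
```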

Now, let's move on to our main topic: why MSE loss is not used in logistic regression. The context above is sufficient to understand the reason.

1. Non-Convex Nature

[Plot: convex vs. non-convex functions]

The graph of the mean squared error is non-convex for logistic regression. Because the prediction passes through the non-linear sigmoid function, p = σ(wᵀx), the squared error (y − σ(wᵀx))² is non-convex in the weights. As discussed above, gradient descent does not work reliably on non-convex functions, so the logistic regression model might never converge to the optimal coefficient values. (A quick numerical check of this follows below.)
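Here is one way to check this numerically (a sketch of my own construction, with made-up toy data): sweep a single weight w, compute the MSE of σ(w · x) against the labels, and inspect the curve's second differences. Any negative value means the curve bends downward somewhere, i.e., it is not convex in w:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy 1-D data with deliberately noisy labels.
x = np.array([-4.0, -2.0, 1.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 1.0])

# Sweep the single weight w and record MSE(w).
ws = np.linspace(-6, 6, 400)
mse = np.array([np.mean((y - sigmoid(w * x)) ** 2) for w in ws])

# Second differences approximate the second derivative;
# negative values reveal concave (non-convex) regions.
second_diff = np.diff(mse, n=2)
print("non-convex:", bool((second_diff < -1e-12).any()))  # True
```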

2. Penalization for wrong predictions.

Okay, suppose you don't care about optimization and finding the global minimum, and you still want to use MSE loss. It would still not work well. Why?

If I ask you: what is the job of a cost/loss function?

Basically, a cost function should penalize the model for wrong predictions by outputting a higher loss value, which eventually results in a larger gradient.

But the mean squared error does not penalize the model strongly enough in this setting. Ideally, the loss should be high for a confidently wrong prediction, but that does not happen with MSE: logistic regression is used for classification, so all predicted probabilities lie between 0 and 1, and the squared error (y − p)² can therefore never exceed 1.

Let's say in a binary classification problem your model predicts a probability of 0.1 for class 1 (a clear misclassification).

Here, Y_Actual = 1 and Y_Pred = 0.1.

Ideally, for a good cost function, the penalization/error value should be high. Let's compare the mean squared error and the log loss for this example: MSE = (1 − 0.1)² = 0.81, while LogLoss = −log(0.1) ≈ 2.30.
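Sketching that comparison in NumPy (same numbers as above):

```python
import numpy as np

y_actual, y_pred = 1.0, 0.1  # confident but wrong prediction

mse = (y_actual - y_pred) ** 2
logloss = -(y_actual * np.log(y_pred) + (1 - y_actual) * np.log(1 - y_pred))

print(f"MSE loss: {mse:.3f}")      # 0.810 -- can never exceed 1
print(f"Log loss: {logloss:.3f}")  # 2.303 -- grows without bound as p -> 0
```

Log loss penalizes this misclassification almost three times as heavily, and the gap only widens as the wrong prediction gets more confident.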

I hope it is now clear why MSE loss is not a good choice for logistic regression. Even for a clear misclassification, MSE penalizes the model with a much lower value than log loss does.

Thank you!

Keep an eye out for more such outstanding blogs :)

Linkedin

Buy me a coffee

Github
