Why Log Loss is Crucial in Logistic Regression: A Brief Look

Discover the Importance of Log Loss Over MSE for Accurate Classification in Machine Learning

Rahul Gite
3 min read · Jun 2, 2024

Some Context…

Logistic Regression is an algorithm that predicts the probability of a particular outcome given a set of inputs.

Why is a loss function important?

Every machine learning model has a loss function that it tries to minimise while training on a given set of examples.

These loss functions, such as RMSE and Huber Loss, are chosen based on the model’s nature and the desired outputs.

In neural networks, when the model learns its parameters through gradient descent, it is the loss function that plays the crucial role: its gradients guide the parameters in the direction of steepest descent to reduce the loss.
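As a minimal sketch of that idea (generic, not tied to any particular model; the quadratic toy loss below is just an example I chose), one gradient descent step simply moves the parameters against the gradient of the loss:

```python
import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-6):
    """Estimate dLoss/dw for each parameter with central differences."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (loss_fn(w + step) - loss_fn(w - step)) / (2 * eps)
    return grad

def gradient_descent_step(loss_fn, w, lr=0.1):
    """Move the parameters one small step in the direction of steepest descent."""
    return w - lr * numerical_gradient(loss_fn, w)

# Toy quadratic loss (w - 3)^2: repeated steps walk w towards its minimum at 3.
w = np.array([0.0])
for _ in range(100):
    w = gradient_descent_step(lambda p: (p[0] - 3.0) ** 2, w)
print(w)  # close to [3.0]
```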

To discover the minima or maxima of a function, we seek parameters where the gradient (slope) is zero.

Typical quadratic loss function with a single global minimum. Image by the Author

As depicted, the function reaches its minimum at a point where its slope vanishes.

Some loss functions present multiple local minima.

Gradient descent might lead to any of these minima instead of the global minimum, influenced by weight initialisation, learning rate, and other factors.

This brings us to the effectiveness of log loss in classification tasks.

What is Log Loss and why is it better?

It all starts with Maximum Likelihood Estimation, which boils down to:

Find the parameters under which the probability of producing the correct outputs is maximised.

In probabilistic terms, we strive to maximise P(Y|X), the probability that the output is Y given the input X.

For N independent examples, the combined likelihood of correct predictions is the product of the individual probabilities for each example.

By taking the log of the likelihood function, we transform the product into a sum, which simplifies the gradient calculations:
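Sketching this for binary labels y_i ∈ {0, 1}, with ŷ_i denoting the model's predicted probability of the positive class (my notation, written out as the standard Bernoulli likelihood):

```latex
L(W, b) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{\,1 - y_i}
\quad\Longrightarrow\quad
\log L(W, b) = \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \Big]
```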

To turn this into a quantity we can minimise, we take the negative of the log-likelihood; minimising this negative log-likelihood yields the optimal parameters.

Log Loss Equation
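Written out for N examples (using the common 1/N averaging convention; the sign flip turns the maximisation into a minimisation), the log loss is:

```latex
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \Big]
```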

Why is it better?

It all comes down to how gradient descent behaves with different loss functions.

In logistic regression, we usually use the sigmoid function to produce the final prediction.

As its value ranges from 0 to 1, we can interpret its output as the probability of a particular outcome. So, taking binary classification as an example,
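the predicted probability of the positive class (sketched here in the standard formulation, my notation) is the sigmoid of a linear combination of the inputs:

```latex
\hat{y} = P(y = 1 \mid x) = \sigma(Wx + b) = \frac{1}{1 + e^{-(Wx + b)}}
```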

where W and b are the weights and bias we are trying to learn from the examples so as to predict y accurately.

If we use a loss function such as MSE, the loss surface can come out looking something like this:

Loss function with many local minima; gradient descent could get stuck in any of them. Image by the Author
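A curve of this kind is easy to reproduce: the sketch below (with a toy one-feature dataset I made up, not the author's data) sweeps a single weight and computes both losses for a sigmoid model. The MSE curve is bounded and flattens out, so it cannot be convex, while the log loss curve stays convex:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data, made up purely for illustration.
x = np.array([-4.0, -2.0, -1.0, 1.0, 2.0, 5.0])
y = np.array([ 1.0,  0.0,  1.0, 0.0, 1.0, 1.0])

weights = np.linspace(-10.0, 10.0, 500)   # sweep a single weight, bias fixed at 0
preds = sigmoid(np.outer(weights, x))     # predictions for every (w, x_i) pair
eps = 1e-12                               # avoid log(0)
mse = np.mean((y - preds) ** 2, axis=1)
log_loss = -np.mean(y * np.log(preds + eps) + (1 - y) * np.log(1 - preds + eps), axis=1)

plt.plot(weights, mse, label="MSE of sigmoid model (not convex in w)")
plt.plot(weights, log_loss, label="Log loss (convex in w)")
plt.xlabel("weight w")
plt.ylabel("loss")
plt.legend()
plt.show()
```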

As we can see, this loss function has multiple minima, and training could end up in any one of them. But look at the graph of log loss:

Plots of the two parts of the loss function drawn separately. Image by the author

I plotted the two components of the log loss function separately. As its input approaches 0, -log(y) grows towards infinity.
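Per example, only one of the two components is active, depending on the true label (writing ŷ for the predicted probability):

```latex
\ell(y, \hat{y}) =
\begin{cases}
-\log(\hat{y}) & \text{if } y = 1 \\
-\log(1 - \hat{y}) & \text{if } y = 0
\end{cases}
```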

Combining these two functions, we can see that there is only one global minimum, so gradient descent is guaranteed to reach it.
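To tie it together, here is a minimal sketch (my own toy implementation, not production code) of logistic regression trained by gradient descent on the log loss; because the loss is convex, gradient descent steadily approaches the single global minimum for a suitable learning rate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, y_hat, eps=1e-12):
    # Clip predictions so log(0) never occurs.
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def train(X, y, lr=0.1, epochs=2000):
    n_samples, n_features = X.shape
    W = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        y_hat = sigmoid(X @ W + b)
        error = y_hat - y                    # gradient of log loss w.r.t. the logits
        W -= lr * (X.T @ error) / n_samples  # gradient step for the weights
        b -= lr * np.mean(error)             # gradient step for the bias
    return W, b

# Toy 1-D data, made up for illustration.
X = np.array([[0.5], [1.5], [2.5], [3.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
W, b = train(X, y)
print(log_loss(y, sigmoid(X @ W + b)))  # the loss decreases steadily during training
```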

If you like this content, please give a clap.

Stay tuned for more explorations and insights. Share your thoughts on what you’d like to see in future posts. Happy coding!


Rahul Gite

I love to write about anything new that I learn from my work and in general. Let's connect on LinkedIn: https://www.linkedin.com/in/rahul-gite-connect/