Loss and Cost Functions for Logistic Regression

Ashmi Banerjee
Oct 20, 2022


Understanding the difference between Loss Functions and Cost Functions in the context of Logistic Regression.


In my previous article, we learnt about Logistic Regression as a popular algorithm used by the Machine Learning community for binary classification.

In this article, we will define Loss functions and Cost functions for Logistic Regression and understand the difference between them.

If you have followed my previous blog on Logistic Regression, you can skip the recap and jump straight to the definition of the Loss Function ⬇️ below. 😉

Recap

(Figure: image classification using Logistic Regression)

So far we covered the following:

The goal of Logistic Regression is to learn the parameters w and b so that ŷ = σ(wᵀx + b) becomes a good estimate of the probability of y being equal to 1.
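As a quick illustration, here is a minimal NumPy sketch of that prediction step (the function names `sigmoid` and `predict_proba` are my own illustrative choices, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z)), squashing z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Estimate P(y = 1 | x) as y_hat = sigma(w.T x + b)."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.5, -0.25])     # learned weights
b = 0.1                        # learned bias
x = np.array([2.0, 4.0])       # one input sample
print(predict_proba(w, b, x))  # ~0.525, a probability in (0, 1)
```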

Loss Function

The loss function is a method of evaluating how well our machine learning algorithm models the given dataset.

The intuition behind defining a loss function is that we try to measure how good our model is in terms of predicting the expected outcome.
In order to achieve that, we need to optimise (either maximise or minimise) our defined loss function to arrive at the best possible result.

In other words, the Loss function measures how good the model's output ŷ is when the true (ground-truth) label is y.

In practice, loss functions are minimised by optimisation algorithms (e.g. Gradient Descent, GD) until they reach the global minimum.

Different use cases call for different loss functions.

For example, a very common one is the Mean Squared Error (MSE) function, often used for regression problems. It can be defined as follows:

MSE = (1/m) ∑_{i=1}^{m} (ŷ_i − y_i)²

Here, ŷ_i is the prediction for input x_i, y_i is the corresponding output label, and m is the number of training samples.
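As a hedged sketch (the helper name `mse_loss` is mine, not a library function), this is what MSE looks like in NumPy:

```python
import numpy as np

def mse_loss(y_hat, y):
    """Mean Squared Error over m samples: (1/m) * sum((y_hat_i - y_i)^2)."""
    y_hat, y = np.asarray(y_hat, dtype=float), np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

# Three predictions against their ground-truth labels:
print(mse_loss([0.9, 0.2, 0.6], [1, 0, 1]))  # (0.01 + 0.04 + 0.16) / 3 = 0.07
```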

However, MSE is rarely used as a loss function in Logistic Regression: because ŷ is a sigmoid of the inputs, minimising the MSE becomes a non-convex optimisation problem with multiple local optima, and our optimisation algorithm (here, Gradient Descent) may mistake a local minimum for the global one.

Ideally, the loss function should be convex, so that gradient-based optimisation can reliably find the global minimum.

We will learn about optimisation algorithms in the next blog.

Therefore, in Logistic Regression, we introduce another loss function called the Cross-Entropy Loss.

L(ŷ, y) = −( y·log(ŷ) + (1 − y)·log(1 − ŷ) )

(The Cross-Entropy Loss used for Logistic Regression.)

Here, if y = 1, then L(ŷ, y) = −log(ŷ). Minimising the loss therefore means making log(ŷ) as large as possible, i.e. making ŷ as large as possible; but ŷ is the output of a σ (sigmoid) function, so it cannot be greater than 1.

Similarly, if y = 0, then L(ŷ, y) = −log(1 − ŷ), so we want log(1 − ŷ) to be as large as possible.
Therefore, (1 − ŷ) should be large, which means ŷ should be small but non-negative.

In other words, the negative log-likelihood (cross-entropy) loss for Logistic Regression is L(ŷ, y) = −log(ŷ) when y = 1, and L(ŷ, y) = −log(1 − ŷ) when y = 0.
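A small NumPy sketch of this per-sample loss (the `eps` clipping is my own safeguard against log(0), not part of the formula itself):

```python
import numpy as np

def cross_entropy_loss(y_hat, y, eps=1e-12):
    """Cross-entropy loss for a single sample:
    L(y_hat, y) = -(y * log(y_hat) + (1 - y) * log(1 - y_hat))."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # keep y_hat away from exactly 0 or 1
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# The loss rewards confident, correct predictions and punishes confident, wrong ones:
print(cross_entropy_loss(0.9, 1))  # ~0.105 (good prediction, small loss)
print(cross_entropy_loss(0.1, 1))  # ~2.303 (bad prediction, large loss)
```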

However, this loss function is defined for a single training sample.

We need to extend it to all the m training samples in our dataset.

To that end, we define a Cost function (J) over all m training samples.

Cost Function

If the Loss function (L) measures how well the model estimates the relationship between the input features (X) and the output (Y) for a single training sample, then the Cost function (J) measures the same thing over all m training samples in the dataset.

We define the Cost function (J) as the mean (average) of the Cross-Entropy loss defined above.

Therefore, we can mathematically formulate J as the following:

J(w, b) = (1/m) ∑_{i=1}^{m} L(ŷ^(i), y^(i)) = −(1/m) ∑_{i=1}^{m} [ y^(i)·log(ŷ^(i)) + (1 − y^(i))·log(1 − ŷ^(i)) ]

(The Cost function for Logistic Regression.)
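Putting the pieces together, here is a minimal vectorised sketch of J, assuming X stores one training sample per row (again, the function name and the `eps` guard are my own choices):

```python
import numpy as np

def cost(w, b, X, y, eps=1e-12):
    """Cost J(w, b): the average cross-entropy loss over all m samples.
    X has shape (m, n_features); y has shape (m,) with labels in {0, 1}."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigma(w.T x + b), one value per sample
    y_hat = np.clip(y_hat, eps, 1.0 - eps)      # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```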

Our goal now is to minimise the aforementioned Cost function (J).

We can use different optimisation algorithms to achieve this. However, in this case, we will use the Gradient Descent algorithm.
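Gradient Descent itself is the topic of the next blog, but as a hedged preview, a single update step could look like the sketch below. It uses the textbook gradients for this cost, dJ/dw = (1/m)·Xᵀ(ŷ − y) and dJ/db = mean(ŷ − y); the learning rate `lr` is an illustrative choice:

```python
import numpy as np

def gradient_descent_step(w, b, X, y, lr=0.1):
    """One Gradient Descent update on J(w, b) for Logistic Regression."""
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # current predictions
    dw = X.T @ (y_hat - y) / m                  # dJ/dw
    db = np.mean(y_hat - y)                     # dJ/db
    return w - lr * dw, b - lr * db             # move against the gradient
```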

Conclusion

In this blog, we introduced Loss and Cost functions for Logistic Regression and illustrated their differences.

In the next blog, we will see how Gradient Descent can be used to minimise this cost function for Logistic Regression.

This blog series is inspired by the notations and theory covered by the Coursera Deep Learning specialisation by DeepLearning.ai.

If you like the article, please subscribe to get my latest ones.
To get in touch, contact me on LinkedIn or via ashmibanerjee.com.
