Negative Log Likelihood Loss: Why Do We Use It For Binary Classification?

Deriving a cost function for Binary Classification.

Prakarsh Bhardwaj
3 min read · Jun 3, 2020

In this post I want to cover how we come up with cost functions for probabilistic models like Logistic Regression.

Through this post I intend to give beginners a better understanding of cost functions: what they actually measure and how you can come up with a cost function for any given data distribution.

**Note** - Though I will only be focusing on the Negative Log Likelihood Loss, the concepts used in this post can be applied to derive a cost function for any data distribution.

Notations Used

  1. (X, Y) - Data-set
  2. θ - Model Parameters
  3. h(X) - Hypothesis Function

Prerequisite

Familiarity with the Bernoulli Distribution.

Assumptions of Logistic Regression

Before we can even begin judging our model parameters as good or bad, we must know the assumptions we made while designing our model. These are the assumptions we make while designing any Logistic Regression model:

  1. y(i) | x(i); θ ~ Bernoulli(Φ), where Φ = h(x(i))
  2. The training examples are i.i.d. (independently and identically distributed), i.e. one training example doesn't affect the others (see the sketch after this list).
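
To make assumption 1 concrete, here is a minimal sketch of the data-generating process, assuming the usual sigmoid hypothesis h(x) = 1 / (1 + e^(-θᵀx)) (the standard choice for Logistic Regression); the parameter values and data-set size below are arbitrary illustrations, not anything prescribed by the model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" parameters and inputs; the values are arbitrary.
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(100, 2))  # 100 i.i.d. training examples (assumption 2)

def h(X, theta):
    """Sigmoid hypothesis function: h(x) = 1 / (1 + exp(-x . theta))."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

phi = h(X, theta_true)    # Phi = h(x(i)) for each example
y = rng.binomial(1, phi)  # y(i) | x(i); theta ~ Bernoulli(Phi)  (assumption 1)
```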

What Are We Trying To Approximate?

We assume that there is some real-world stochastic process which led to the generation of our given data. With our Logistic Regression model we are trying to closely approximate this real-world process, so we need to find the value of θ that maximizes the probability of our data-set.

Likelihood Of θ

Suppose we have a particular value of θ. The likelihood of θ is a measure of how well the given data supports that particular value of θ. In simple words, the likelihood of a particular value of θ is the probability that our model outputs the true values of Y when given X as input.

L(θ) = P(Y | X; θ) = Π(i = 1 to m) p(y(i) | x(i); θ)

where the product over the m training examples is justified by assumption 2 (independence).

For Logistic Regression, each term p(y(i) | x(i); θ) is given by the Bernoulli Distribution with parameter Φ = h(x(i)) (assumption 1).

The Probability Mass Function of the Bernoulli Distribution is

p(y; Φ) = Φ^y (1 - Φ)^(1 - y), for y ∈ {0, 1}
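
Using these two results, here is a sketch of the likelihood as a product of Bernoulli probabilities (reusing h, X and y from the snippet above). A product of many numbers below 1 underflows quickly, which is one more reason to prefer the log:

```python
def bernoulli_pmf(y, phi):
    """p(y; Phi) = Phi**y * (1 - Phi)**(1 - y) for y in {0, 1}."""
    return phi**y * (1 - phi)**(1 - y)

def likelihood(theta, X, y):
    """L(theta): product over all examples, valid by the i.i.d. assumption."""
    return np.prod(bernoulli_pmf(y, h(X, theta)))
```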

Using the above results we can calculate the log likelihood of θ (we use the log because it turns the product into a sum, which makes the optimization problem easier):

l(θ) = log L(θ) = Σ(i = 1 to m) [ y(i) log h(x(i)) + (1 - y(i)) log(1 - h(x(i))) ]
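
A direct translation of this formula into code; the small epsilon is my addition to guard against log(0), not part of the derivation:

```python
def log_likelihood(theta, X, y):
    """l(theta) = sum of y(i)*log h(x(i)) + (1 - y(i))*log(1 - h(x(i)))."""
    phi = h(X, theta)
    eps = 1e-12  # numerical guard against log(0)
    return np.sum(y * np.log(phi + eps) + (1 - y) * np.log(1 - phi + eps))

# The parameters that generated the data should be better supported
# (less negative log likelihood) than an arbitrary guess such as zeros.
print(log_likelihood(theta_true, X, y))
print(log_likelihood(np.zeros(2), X, y))
```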

Maximum Likelihood Estimation

In statistics, Maximum Likelihood Estimation is a way of finding the parameters which make the observed data most probable. This is done by finding the parameters θ that maximize the likelihood function, or equivalently its logarithm.

Since we want our loss function to be a measure of how bad our model is, we define the loss function as -l(θ). Maximizing the log likelihood is therefore the same as minimizing -l(θ).
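
Putting it together, here is a minimal sketch of fitting the model by minimizing -l(θ) with plain gradient descent. The gradient Xᵀ(h(X, θ) - y) is the standard result for Logistic Regression; the learning rate and iteration count are arbitrary choices for this sketch:

```python
def nll_loss(theta, X, y):
    """Negative Log Likelihood Loss: -l(theta)."""
    return -log_likelihood(theta, X, y)

def nll_grad(theta, X, y):
    """Gradient of the NLL for Logistic Regression: X^T (h(X, theta) - y)."""
    return X.T @ (h(X, theta) - y)

theta = np.zeros(2)
lr = 0.01  # arbitrary learning rate
for _ in range(1000):
    theta -= lr * nll_grad(theta, X, y)
```

After training, theta should land close to theta_true from the first sketch.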

Negative Log Likelihood Loss

Now you can see how we end up minimizing the Negative Log Likelihood Loss when trying to find the best parameters for our Logistic Regression model.
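
This loss is exactly the binary cross-entropy implemented in common libraries. Assuming scikit-learn is available, its log_loss (which returns the mean negative log likelihood) should agree with our loss divided by the number of examples:

```python
from sklearn.metrics import log_loss

phi = h(X, theta)
print(log_loss(y, phi))                # mean NLL from scikit-learn
print(nll_loss(theta, X, y) / len(y))  # our loss, averaged per example
```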
