The Loss Function Diaries: Ch 2

Divakar Kapil
Escapades in Machine Learning
7 min read · Aug 30, 2018
Cross entropy figure [3]

In the previous chapter I covered three concepts namely:

  1. Definition and purpose of loss functions
  2. Probability vs Likelihood
  3. Maximum Likelihood Estimation

If you aren’t familiar with the above-mentioned concepts, I highly encourage you to go through the first chapter. In this post I will go through the procedure of obtaining the mean squared error (MSE) and binary cross entropy (BCE) loss functions, followed by an explanation of why each is better suited than the other for certain problems. So, let’s dive in :)

Mean Squared Error

To demonstrate the derivation, let’s consider a linear regression model in two-dimensional space where Y is the label to be modelled as a linear function of the input X.

Fig1 : Linear regression in two dimensions[2]

Let η be Gaussian noise with mean 0 and variance 1 added to the regression. This means that Y is a random variable with a Gaussian distribution. As discussed in chapter 1 of this series, the Gaussian distribution is characterized by two parameters, namely the mean and the variance. So, the likelihood problem is to compute the values of the mean and variance that give the optimal distribution for the random variable Y given the input (condition) X. Hence, the mean and variance of Y are as follows:

Fig2 : Mean of the gaussian distribution of Y[2]

The mean (expected value) of Y is θ₀ + θ₁x

Fig2 : Variance of gaussian distribution of Y[2]

The variance of Y is 1. This means that all we need to find are the optimal values of θ₀ and θ₁. If you think about the problem of linear regression, it is indeed the problem of computing these values to obtain the best-fitting curve with some noise added to it for generalization. So, all I have done so far is rephrase the regression modelling problem as a likelihood problem. We have completed steps 1 and 2 of solving the problem (refer to chapter 1).

Now that we know that Y has a Gaussian distribution, we can use its pre-defined formula to get the values of the parameters we want. This can be written as the probability density function, which yields the probability of observing a single example (xi, yi) as shown:

Fig3 : Probability distribution function of Y (gaussian) [2]
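
The figure above is an image, so here is the formula it depicts, reconstructed under the stated assumptions (unit-variance Gaussian noise with mean θ₀ + θ₁xᵢ):

$$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\left(y_i - (\theta_0 + \theta_1 x_i)\right)^2}{2}\right)$$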

Note that the above probability density function represents only one example. In reality our dataset will have many such examples, that is, many values of Y and X. Say our dataset has N examples. We need the joint probability of observing all N examples. Since the examples are independent of each other, this joint probability can be written as the product of the individual probabilities. This concludes step 3 (refer to chapter 1).

Fig4 : Likelihood of all N examples[2]
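
Written out (a reconstruction of the missing figure, using the single-example density above):

$$L(\theta) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{\left(y_i - (\theta_0 + \theta_1 x_i)\right)^2}{2}\right)$$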

As mentioned earlier, likelihood and probability have the same mathematical formula, so the product of all N probabilities is essentially the likelihood of all N examples (as we are solving for parameters). Now that we have the likelihood function, we want to maximize it with respect to the parameters θ₀ and θ₁. This step is called maximum likelihood estimation. To get the maxima of the function we would differentiate it with respect to θ₀ and θ₁. However, that would be one ugly process due to the product and the exponents present in the function.

The process of differentiating will be much easier if we can convert the product to a summation and bring the exponents down. The logarithm allows us to accomplish both. Also, the logarithm is a monotonic, smooth function, which means that a) it can be easily differentiated and b) the maximum of the log-likelihood is attained at the same parameters as the maximum of the likelihood function. Thus, by taking the log on both sides we get:

Fig5 : Log likelihood of Y[2]
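
In symbols (reconstructed from the product above), the log-likelihood is:

$$\log L(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}\sum_{i=1}^{N}\left(y_i - (\theta_0 + \theta_1 x_i)\right)^2$$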

Voilà! If you look closely at the above formula you will notice that it is the negative of the MSE loss function (up to an additive constant and a constant scaling factor). Let θ be the vector representation of the parameters θ₀ and θ₁. Then the MSE loss function is:

Fig6 : MSE loss function[2]
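
One common form of it is shown below; the 1/N scaling is a convention (other constant factors such as 1/2N are also used) and does not change the minimizing parameters:

$$J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (\theta_0 + \theta_1 x_i)\right)^2$$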

Hence, maximizing the likelihood is equivalent to minimizing the MSE loss function. So, the loss function wasn’t an arbitrarily chosen function; rather, it is the proper mathematical solution of the likelihood problem under certain assumptions.

So, the MSE loss function is best suited to the case with the following assumptions:

  1. Outputs are real valued
  2. A certain amount of Gaussian noise is added to the regression model with a constant mean and variance
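
To make the equivalence concrete, here is a minimal sketch (my own illustration on made-up data, not code from the referenced posts) that evaluates both the MSE and the unit-variance Gaussian negative log-likelihood; the two differ only by constants, so they are minimized by the same θ₀ and θ₁:

```python
import numpy as np

# Made-up data for illustration: y = 2 + 3x + unit-variance Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=100)

def mse(theta0, theta1):
    return np.mean((y - (theta0 + theta1 * x)) ** 2)

def gaussian_nll(theta0, theta1):
    # Negative log-likelihood with unit-variance Gaussian noise:
    # N/2 * log(2*pi) + 1/2 * sum of squared residuals
    resid = y - (theta0 + theta1 * x)
    return 0.5 * len(x) * np.log(2 * np.pi) + 0.5 * np.sum(resid ** 2)

# NLL = N/2 * log(2*pi) + N/2 * MSE, so both are minimized by the same parameters
for t0, t1 in [(0.0, 0.0), (1.0, 4.0), (2.0, 3.0)]:
    print(f"theta=({t0}, {t1})  MSE={mse(t0, t1):.4f}  NLL={gaussian_nll(t0, t1):.4f}")
```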

Binary Cross Entropy

This is one of the most famous loss functions used in classification problems with two classes. It can be extended to multi-class classification where all the classes are mutually exclusive. In the binary case the activation function used is the sigmoid function, which produces values in [0, 1]; for the multi-class case the activation function used is the softmax function. Now, I will go through the same steps as above to demonstrate that the binary cross entropy loss function is the result of the maximum likelihood estimation problem and not some arbitrary function.

For the binary classification problem, let the predictions made by our model be hθ(xi), where xi is one example of the input X. Since it is a binary classification problem, the predictions are computed using the sigmoid function. Note that the parameters we need to find in order to obtain the optimal distribution are W and b.

Fig7 : Sigmoid function used to predict labels[2]
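
In symbols (a reconstruction of the figure; W and b are the weights and bias mentioned above):

$$h_\theta(x_i) = \sigma(W x_i + b) = \frac{1}{1 + e^{-(W x_i + b)}}$$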

We know that the sigmoid function produces values between 0 and 1, which means that these values can be treated as the probability of the example xi belonging to the positive class. If this probability is less than 0.5, we classify it as a negative example. Thus, the probability of observing a positive and a negative example can be written as follows:

Fig8 : Probability of observing a positive and negative example[2]
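
That is (reconstructed from the figure):

$$P(y_i = 1 \mid x_i) = h_\theta(x_i), \qquad P(y_i = 0 \mid x_i) = 1 - h_\theta(x_i)$$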

The combination of the above two cases can be expressed as :

Fig9 : Bernoulli distribution[2]
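
With the label yᵢ ∈ {0, 1}, the combined form shown in the figure is:

$$P(y_i \mid x_i) = h_\theta(x_i)^{\,y_i}\left(1 - h_\theta(x_i)\right)^{\,1 - y_i}$$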

This is in fact the Bernoulli distribution function!

Now that we have the probability of one example, we can combine the probabilities of all N examples in our dataset (again using independence to take their product):

Fig10 : Likelihood function[2]
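
Written out (a reconstruction of the figure), the likelihood over all N examples is:

$$L(\theta) = \prod_{i=1}^{N} h_\theta(x_i)^{\,y_i}\left(1 - h_\theta(x_i)\right)^{\,1 - y_i}$$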

Again, to solve the maximum likelihood estimation problem we need to differentiate the likelihood with respect to the parameters. However, we face the same challenges as in the regression problem. Thus, we take the log-likelihood instead.

Fig11 : Log likelihood[2]
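
Taking the log turns the product into a sum (reconstructed from the figure):

$$\log L(\theta) = \sum_{i=1}^{N}\left[\, y_i \log h_\theta(x_i) + (1 - y_i)\log\left(1 - h_\theta(x_i)\right)\right]$$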

On close inspection of the above formula, we notice that it is the negative of the binary cross entropy loss function. So, maximizing the log-likelihood is equivalent to minimizing the BCE loss function.

The BCE loss function is best suited to the case with the following assumptions:

  1. Output is discrete and binary
  2. The probability distribution function of the output random variable is Bernoulli
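
As a quick sanity check, here is a minimal sketch (again my own illustration on made-up data, not code from the referenced posts) showing that the summed BCE and the negative log of the product of per-example Bernoulli likelihoods are the same number:

```python
import numpy as np

# Made-up logits and binary labels for illustration
rng = np.random.default_rng(0)
logits = rng.normal(size=50)                   # stands in for W.x + b
y = rng.integers(0, 2, size=50).astype(float)  # binary labels in {0, 1}

p = 1.0 / (1.0 + np.exp(-logits))              # sigmoid predictions h_theta(x)

# Negative log of the product of per-example Bernoulli likelihoods
likelihoods = p ** y * (1 - p) ** (1 - y)
neg_log_likelihood = -np.sum(np.log(likelihoods))

# Binary cross entropy summed over the examples, written directly
bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

print(neg_log_likelihood, bce)  # the two values match
```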

Note

It was pointed out to me that the term ‘cross entropy’ does not refer specifically to the negative log-likelihood of the Bernoulli or softmax distribution. Though people almost always use cross entropy in the context of classification (Bernoulli and softmax), it applies to any loss consisting of a negative log-likelihood.

The training dataset defines a data distribution called the empirical distribution, and the model that we create to make predictions defines a probability distribution. Any loss consisting of a negative log-likelihood is a cross entropy between these two distributions. The aim is to minimize the dissimilarity between them, and maximizing the likelihood is one way of achieving this.

So, the mean squared error (MSE) is the cross entropy between the empirical distribution and a Gaussian distribution! [1]

I will conclude this part here. In this chapter we saw how a method called maximum likelihood estimation is used to derive loss functions for regression and classification problems. In the next chapter I will cover some more loss functions like absolute mean and smooth absolute mean (regression) and margin classifiers (SVMs). So stay tuned :)

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.

References

[1] http://www.deeplearningbook.org/contents/ml.html

[2] http://rohanvarma.me/Loss-Functions/

[3] http://ml4dummies.blogspot.com/2017/08/cross-entropy-loss-and-maximum.html
