The Loss Function Diaries: Ch 1

Divakar Kapil
Escapades in Machine Learning
7 min read · Aug 24, 2018

Loss functions are indispensable tools for training today’s machine learning models. Each sort of problem uses the specific type of loss function that suits it best. I have been teaching myself machine learning for some time and have always been perplexed by the origin of loss functions. I always wondered: why do regression problems use MSE (mean squared error), and why do classification problems use cross-entropy functions? Where do these come from? Can different loss functions be used with different problems?

So, in this series I will attempt to answer all of the above questions based on what I have learned. I will try to give intuitive and mathematical reasons for the origin and use of loss functions. Kindly note that this series doesn’t have a definite number of parts; it is an ongoing series to which I will keep adding more information as my understanding of the concept progresses.

In this part I will cover important concepts like the difference between probability and likelihood, and the procedure for obtaining loss functions using the maximum likelihood estimation method. These are important to understand before we can learn where some of the loss functions come from and why they are used with certain types of problems.

What is a Loss Function?

Most supervised learning models are optimization problems. In such problems the aim is to maximize or minimize a function that captures the final goal of the problem being solved. Such functions are called objective functions [1]. For example, the problem of generating the highest revenue will aim at maximizing a profit function computed from the available inputs. So, “objective function” is the broader term, covering any function that needs to be maximized or minimized based on the problem.

Loss functions are a subset of objective functions. They are defined on data points and compute the penalty of the prediction (output) generated by the model on each data point. The aim is to minimize the loss function so as to incur the smallest penalty. Most models, like neural networks, minimize their loss functions via gradient descent. So, in essence, all we are doing is:

  1. Defining the problem
  2. Defining an objective
  3. Assessing input variables
  4. Minimizing or maximizing the objective function
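
To make this concrete, here is a minimal sketch in Python of what a loss function defined on data points looks like, using mean squared error for a toy linear model; the data and parameter values are made up for illustration.

```python
import numpy as np

# Toy dataset: inputs and targets (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def mse_loss(w, b):
    """Penalty for the predictions of a hypothetical linear model w*x + b."""
    y_hat = w * x + b
    return np.mean((y_hat - y) ** 2)

# Step 4 is then to search for the (w, b) that minimize this penalty.
print(mse_loss(w=2.0, b=0.0))  # penalty for one candidate parameter setting
```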

Hence, in theory and in practice any loss function can be used with any type of problem. This means that we could use MSE with classification problems instead of cross-entropy; mathematically, we would still end up achieving the four points above. The reason certain loss functions are paired with certain problems is the effectiveness and speed of learning (training) of the model.

The speed of training depends on the learning rate and the size of the derivative, since these two quantities are used to update the weights in a neural network model. The type of loss function used greatly affects the size of the derivative computed for the specified problem, which in turn affects the speed and possibly the effectiveness of training. This will be demonstrated mathematically in upcoming posts.
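
As a rough illustration of that update rule, here is a minimal Python sketch of gradient descent on a single weight; the loss function and learning rate are invented for the example.

```python
# Gradient descent: each update moves the weight by the product of the
# learning rate and the derivative of the loss at the current weight.
def update(w, grad_loss, learning_rate=0.1):
    return w - learning_rate * grad_loss(w)

# Illustrative loss L(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
grad = lambda w: 2.0 * (w - 3.0)

w = 0.0
for _ in range(50):
    w = update(w, grad)
print(w)  # converges toward 3; a smaller derivative would slow this down
```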

There is one more very important concept that needs to be understood before proceeding to the explanation of different types of loss functions. So, the next section will focus on the concept of likelihood.

Probability vs Likelihood

In everyday use the terms probability and likelihood are interchangeable; there is little to no distinction between the two. However, in the world of statistics likelihood is significantly different from probability. The difference lies in the interpretation of the problem: the mathematical expression is identical for both.

Probability is the area under a fixed distribution curve. In problems concerning probability, events are described by a set of parameters, and we are required to compute the probability of an observation occurring given those parameters. For example, given a normal distribution with a fixed mean and standard deviation, we might be asked to compute the probability that a random sample drawn from the distribution lands in a specific range. Note that in this case the parameters are the mean and standard deviation, and they are fixed. So, all we do is compute the area under the given distribution.

For example, suppose a normal distribution has a fixed mean of 32 grams. The probability of obtaining a value between 32 and 34 grams is computed by evaluating the area:

Fig1 : Computation of Probability[2]
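
Assuming a standard deviation of 2.5 grams (the example fixes only the mean, so this value is my own for illustration), the area can be computed with SciPy’s normal CDF:

```python
from scipy.stats import norm

mean, sd = 32.0, 2.5  # sd is an assumed value; the text only fixes the mean

# P(32 <= X <= 34) is the area under the fixed normal curve between 32 and 34.
p = norm.cdf(34, loc=mean, scale=sd) - norm.cdf(32, loc=mean, scale=sd)
print(p)  # ~0.288 for these parameters
```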

So, probability is defined as:

P(data | distribution)

Likelihood is the problem of finding the distribution that best fits a given set of data. In statistics the parameters of the distribution are often unknown; the only known things are the data and the observed values. So, the aim of likelihood is to measure the extent to which a sample provides support for particular values of a parameter.

It is computed by taking a fixed data point and reading off the corresponding y-value from the proposed distribution function. This is a method for finding the best parameter values to describe a set of data points. For example, the likelihood of the proposed normal distribution given an observed value of, say, 34 grams is computed as shown:

Fig2 : Computation of Likelihood[2]
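
Continuing with the same assumed parameters (mean 32, standard deviation 2.5), the likelihood is just the height of the pdf at the fixed observation:

```python
from scipy.stats import norm

# Likelihood of the proposed distribution N(mean=32, sd=2.5) given the
# fixed observation 34 grams: the y-value of the pdf at that point.
# (sd = 2.5 is an assumed value, carried over from the earlier sketch.)
likelihood = norm.pdf(34, loc=32, scale=2.5)
print(likelihood)  # ~0.116
```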

Hence, likelihood can be expressed as:

L(distribution | data)

On close inspection we can see that both probability and likelihood have the same mathematical representation. The only difference is the perspective from which the problem is solved. The image below summarizes both concepts.

Fig3 : Summary of differences[2]

Now we will proceed to the concept of maximum likelihood, which is needed to explain the origin of MSE and cross-entropy and the reasons for their use.

Maximum Likelihood Estimation

This is a technique for finding the distribution that best fits the data we have, which is exactly what a machine learning model aims to achieve: a distribution curve that best describes the data provided to it (the training step) so that it can explain similar data (the inference step). We will build the intuition behind this concept by considering an example.

For example, consider the following data distribution.

Fig4 : Example demonstration[3]

The aim is to find an optimal distribution for this data. We have pre-defined distribution curves for various sorts of data. Looking at the data above, we can say that a normal distribution can be used to model it: the data is roughly symmetric about the center, and the center has the most points.

However, there are multiple normal distribution curves that could fit the data. Remember, a normal distribution is defined by two parameters, (mean, standard deviation), each of which can take many values, and our aim is to find the most suitable values of the two. So, we are solving:

L(distribution | data)

We try various normal distribution curves and for each we compute the likelihood of observing the measured data points. After plotting all the likelihoods, we choose the mean whose curve gives the maximum likelihood. This is one of the two parameters whose value we need to find; the same can be done for the standard deviation. The image below depicts the process for the mean.

Fig5 : Maximum Likelihood demonstration[3]

Now all we do is choose the distribution curve that provides the maximum likelihood of observing the data points measured in the dataset. The parameter values of this chosen distribution are the maximum likelihood estimates.

The procedure can be divided into the following steps:

  1. Observe the data and make a guess as to which pre-defined distribution might describe it best, e.g. normal, exponential, etc.
  2. Use the proposed distribution curve to compute the likelihood of each measured data point
  3. Combine the likelihoods of all the data points into a single product, assuming the observations are independent of one another
  4. Maximize the product with respect to the parameters whose values we are trying to obtain (a short sketch of this procedure follows)
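
Here is a minimal sketch of these four steps for normally distributed data, searching over candidate means with the standard deviation held fixed for simplicity; the data are made up for the example.

```python
import numpy as np
from scipy.stats import norm

# Step 1: data we guess is normally distributed (made up for illustration).
data = np.array([29.5, 31.2, 32.0, 32.8, 33.1, 34.4])

# Steps 2-3: for each candidate mean, the likelihood of the whole dataset
# is the product of the pdf values at each point (independence assumed).
# We use the log-likelihood (a sum) to avoid numerical underflow.
candidate_means = np.linspace(28, 36, 801)
log_likelihoods = [norm.logpdf(data, loc=m, scale=2.5).sum()
                   for m in candidate_means]

# Step 4: pick the mean that maximizes the (log-)likelihood.
mle_mean = candidate_means[np.argmax(log_likelihoods)]
print(mle_mean)  # ≈ the sample mean, the analytic MLE for a normal mean
```

For a normal distribution this search is not strictly necessary, since the sample mean is the closed-form maximum likelihood estimate, but the same pattern carries over to distributions with no closed-form solution.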

To see a fully worked-out example, check out the video linked in reference [3].

I will conclude this part here. We are now equipped with the mathematical background needed for the next part, which will cover the derivation and explanation of the MSE and cross-entropy loss functions. So stay tuned :)

If you like this post or found it useful please leave a clap!

If you see any errors or issues in this post, please contact me at divakar239@icloud.com and I will rectify them.

References

[1] https://stats.stackexchange.com/questions/179026/objective-function-cost-function-loss-function-are-they-the-same-thing

[2] https://www.youtube.com/watch?v=pYxNSUDSFH4

[3] https://www.youtube.com/watch?v=XepXtl9YKwc
