A Quick Guide to Loss Functions

Soumo Chatterjee · Analytics Vidhya · Oct 11, 2019

What I want to share through this blog is that, in machine learning, during each iteration of our training process we compare our predicted output against the actual output. This comparison produces an error value, and that error is what we minimize during the learning process using the optimization strategy gradient descent.

The way we actually calculate this error value is with a loss function. It quantifies how wrong we would be if we used our current model to make predictions on X (the independent variable) when the correct output is Y (the dependent variable). Our main aim is to minimize it, driving it as close to zero as possible.

In the context of machine learning we classify the loss functions broadly into two types:

  1. Classification Loss
  2. Regression Loss

1 — Classification Loss:

It is of two types: Cross Entropy and Hinge Loss.

  • Cross Entropy Loss — It is also known as Log Loss. It is computed by taking the negative of the sum, over as many classes as there are, of the actual output multiplied by the log of the predicted probability; averaged over the training examples, that gives us our error.
Cross Entropy Loss Function:

$CE = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p(y_i) + (1 - y_i)\log\left(1 - p(y_i)\right)\right]$

Here, yi is the actual output, p(yi) is the predicted output probability at the ith position, and N is the number of training examples.
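To make this concrete, here is a minimal NumPy sketch of the binary form of this loss. The function name cross_entropy_loss and the eps clipping (to avoid taking log(0)) are my own additions for illustration, not from the original formula:

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy (log loss) averaged over N examples."""
    # Clip predictions away from 0 and 1 so log() never blows up.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Confident correct predictions give a small loss; wrong or
# unconfident ones increase it.
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(cross_entropy_loss(y_true, y_pred))  # ~0.198
```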

  • Hinge Loss — Another type of loss function used in classification, usually with SVMs (Support Vector Machines). It penalizes predictions not only when they are incorrect but also when they are correct yet not confident. Predictions that are confidently wrong are penalized in a big way, but confidently correct predictions are not penalized at all.
Hinge Loss function:

$\ell_i = \max\left(0,\ 1 - y_i \cdot h_\theta(x_i)\right)$

where yi is the actual label and hθ(xi) is our hypothesis/predicted score for the ith training example.

So from this we can formalize that our labels can be 1 or -1, and the loss is zero when the sign of hθ(xi) matches the sign of yi and |hθ(xi)| is greater than or equal to 1.
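Here is a minimal NumPy sketch of this formula, assuming labels encoded as -1/+1 and raw model scores; the function name hinge_loss and the sample values are illustrative:

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Average hinge loss; labels must be -1 or +1, scores are raw model outputs."""
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

# Confidently correct (y * score >= 1) contributes zero loss;
# a confidently wrong prediction is penalized heavily.
y_true = np.array([1, -1, 1])
scores = np.array([2.0, -0.5, -1.5])
print(hinge_loss(y_true, scores))  # (0 + 0.5 + 2.5) / 3 = 1.0
```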

Hinge loss is easier to compute than cross-entropy loss, and it is faster to train with gradient descent because much of the time the gradient is zero, so no weight update is needed for those examples. If we want to make real-time predictions and can accept slightly lower accuracy, we should rely on hinge loss over cross-entropy loss. But if accuracy matters more than speed, then we should always go with cross-entropy loss.

2 — Regression Loss:

It is of three types: Mean Squared, Mean Absolute, and Huber Loss.

  • Mean Squared Loss or L2 Loss — It measures the average of the squared amounts by which the model's predictions vary from the correct values.
Mean Squared Error loss function:

$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - p(y_i)\right)^2$

here yi is the actual output, p(yi) is the predicted output at the ith position, and n is the number of training examples.

So, in MSE we calculate the difference between the predicted output and the actual output, square it, do that for every training example or data point, add them all up, and divide by the total number of training examples. The reason for squaring is to make the result quadratic, so that when we plot it there is only one global minimum. With an optimization strategy like gradient descent we then don't get stuck in a local minimum, which ultimately helps us find the ideal parameter values that optimize the objective function.
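A minimal NumPy sketch of exactly this computation (the function name mse_loss and the sample values are illustrative):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean squared error: average of squared differences over n examples."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mse_loss(y_true, y_pred))  # (0.25 + 0.25 + 0 + 1) / 4 = 0.375
```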

  • Mean Absolute Error or L1 Loss — It calculates the average magnitude of the errors in a set of predictions without considering their direction. We take the average, over the sample, of the absolute differences between the predicted output and the actual output, where all the individual differences have equal weight. Squared error penalizes large deviations by squaring them, which turns an outlier into a much larger number; as a result, mean absolute error is more robust to outliers than mean squared error. MAE (Mean Absolute Error) assigns equal weight to every data point, whereas MSE (Mean Squared Error) emphasizes the extremes.
Mean Absolute Error loss function:

$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - p(y_i)\right|$
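A minimal NumPy sketch of MAE, on the same illustrative data as the MSE example above:

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean absolute error: every deviation gets equal weight."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 8.0])
print(mae_loss(y_true, y_pred))  # (0.5 + 0.5 + 0 + 1) / 4 = 0.5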
  • Huber Loss — It's a loss very similar to MAE, and it is also much less sensitive to outliers than MSE. It is quadratic for small errors and linear for large ones.
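Since the post does not include the Huber formula, here is a minimal NumPy sketch of the standard definition, assuming the usual threshold parameter delta that marks where the loss switches from quadratic to linear (delta=1.0 is an assumed default):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for errors |e| <= delta, linear beyond delta."""
    error = y_true - y_pred
    small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                      # MSE-like region (small errors)
    linear = delta * (np.abs(error) - 0.5 * delta)  # MAE-like region (large errors)
    return np.mean(np.where(small, squared, linear))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5,  0.0, 2.0, 12.0])  # last point is an outlier
print(huber_loss(y_true, y_pred))  # outlier is penalized linearly, not squared
```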

So, depending on whether your problem is a regression or a classification problem, we can use several loss functions, and then pick the one that optimizes for either accuracy or speed. I hope you liked reading my blog. If you have any queries, please comment down below. Until then, LEARN, UNDERSTAND, IMPLEMENT & REPEAT.

Credits — YouTube video (Siraj Raval, Loss Function Explained)
