Loss Functions Unraveled

om pramod
5 min read · Aug 15, 2023


Part 2: Regression Loss Functions.

Regression loss functions are used for regression tasks, where the goal is to predict a continuous value.

Mean Squared Error (MSE):

It measures the average of the squared difference between the actual and predicted values.

In other words, it is the mean of the squared residuals over all the data points in the dataset. A residual is the difference between the actual value and the value predicted by the model.

Mean Squared Error (MSE) is also called L2 Loss.
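As a minimal illustrative sketch (assuming NumPy; the helper name mse and the example values are made up for illustration), MSE can be computed like this:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: the mean of the squared residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(residuals ** 2)

# Toy example
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y_true, y_pred))  # 0.375
```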

The differences in Mean Squared Error (MSE) are squared for two main reasons:

1. Squaring the residuals converts negative values to positive ones. Raw errors can be both negative and positive, so when they are summed they can cancel out to 0. That would suggest the net error is 0 and the model is performing well, when in reality it may still be performing badly.

2. Squaring also gives more weight to larger errors. For example, a prediction error of 3 units is penalized 9 times more than a prediction error of 1 unit under MSE. When predictions are far from their targets, squaring the error penalizes the model more heavily and thus helps the optimizer reach the minimum faster.

Pros:

  1. Convexity: MSE is a convex function, meaning that it has a single global minimum. This makes it easier to optimize compared to other loss functions that may have multiple local minima.
  2. Continuous and Differentiable: MSE is a continuous and differentiable function, which means that it is well-suited for use with gradient-based optimization algorithms such as gradient descent.

Cons:

1. Outlier Sensitivity: MSE is sensitive to outliers in the data, meaning that a single large error can greatly impact the overall loss, as the residuals are squared.

2. Not Robust to Non-Normal Distributions: MSE assumes that the errors are normally distributed, which may not be the case in some real-world applications. In such cases, MSE may not be the best loss function to use.

The ideal value of MSE is zero; the closer the value is to zero, the better the model is performing.

Mean Absolute Error (MAE):

It calculates the average of the absolute difference between the actual and predicted values.

Mean Absolute Error (MAE) is also called L1 Loss.
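A matching sketch (same assumptions as the MSE example above; the helper name mae is hypothetical):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: the mean of the absolute residuals."""
    residuals = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.abs(residuals))

# Toy example
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
print(mae(y_true, y_pred))  # 0.5
```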

In Mean Absolute Error (MAE), the absolute value of the difference between the true and predicted values is taken instead of the squared difference as in Mean Squared Error (MSE) because:

1. The absolute value of the differences is a more robust measure of the errors: it weights every error in proportion to its magnitude, whereas squared differences give disproportionately more weight to larger errors.

2. The use of the absolute value of the differences in Mean Absolute Error (MAE) makes it more robust to outliers compared to Mean Squared Error (MSE).

MAE is generally less preferred than MSE because the absolute value function is not differentiable at its minimum (zero error). As the error approaches 0, plain gradient descent runs into trouble, since the derivative at exactly 0 is undefined; in practice, optimizers work around this by using a subgradient (for example, treating the gradient at 0 as 0).

The ideal value of MAE is zero; the closer the value is to zero, the better the model is performing.
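To see the robustness difference concretely, here is a small self-contained comparison with made-up values, where a single badly mispredicted point inflates MSE far more than MAE:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0, 5.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 25.0])  # last prediction is a large outlier error

residuals = y_true - y_pred
print(np.mean(residuals ** 2))     # MSE: 80.3 -- dominated by the single outlier
print(np.mean(np.abs(residuals)))  # MAE: 4.4  -- grows only linearly with it
```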


Huber Loss:

If the absolute difference between the actual and predicted value is less than or equal to a threshold value, 𝛿 (delta), a squared (MSE-style) term is applied. Otherwise, if the error is sufficiently large, an absolute (MAE-style) term is applied. In simple terms: for errors smaller than delta, Huber loss behaves like MSE; for errors greater than delta, it behaves like MAE. This way Huber loss provides the best of both MAE and MSE.

Here, delta is a hyperparameter that determines the transition point between the Mean Squared Error (MSE) and Mean Absolute Error (MAE) regimes.

A smaller value of delta makes the loss function more robust to outliers, because more of the large errors fall into the linear (MAE-like) region; this is useful when there is a high degree of measurement error or noise in the data. Conversely, a larger value of delta makes the loss function behave more like MSE and therefore more sensitive to large errors, which can be useful when large deviations genuinely matter and should be penalized heavily.
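A minimal NumPy sketch of the Huber loss as described above (the function name and example values are illustrative; delta is the transition hyperparameter):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Huber loss: quadratic for small errors, linear for large ones."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    is_small = np.abs(error) <= delta
    squared = 0.5 * error ** 2                       # MSE-style region
    linear = delta * (np.abs(error) - 0.5 * delta)   # MAE-style region
    return np.mean(np.where(is_small, squared, linear))

y_true = np.array([3.0, -0.5, 2.0, 7.0, 5.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0, 25.0])
print(huber_loss(y_true, y_pred, delta=1.0))  # 4.05
```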


The graph shows the loss value on the vertical axis, and the error value on the horizontal axis. For small errors, the loss value increases quadratically, which represents the MSE. For large errors, the loss value increases linearly, which represents the MAE. Huber Loss switches between the two regimes at delta.
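A short matplotlib sketch (illustrative only) that reproduces this shape:

```python
import numpy as np
import matplotlib.pyplot as plt

delta = 1.0
errors = np.linspace(-3, 3, 200)
mse_curve = errors ** 2
mae_curve = np.abs(errors)
huber_curve = np.where(np.abs(errors) <= delta,
                       0.5 * errors ** 2,
                       delta * (np.abs(errors) - 0.5 * delta))

plt.plot(errors, mse_curve, label="MSE (quadratic)")
plt.plot(errors, mae_curve, label="MAE (linear)")
plt.plot(errors, huber_curve, label=f"Huber (delta={delta})")
plt.xlabel("error")
plt.ylabel("loss")
plt.legend()
plt.show()
```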

The ideal value of Huber loss is zero; the closer the value is to zero, the better the model is performing.

Final Note: Thanks for reading! I hope you find this article informative.

As you prepare to turn the page to the next chapter, “Loss Functions Unraveled | Part 3: Classification Loss Functions,” I encourage you to remain steadfast in your pursuit of knowledge. The world of deep learning is a realm of endless discovery, and together, we’re peeling back its layers one insight at a time.

Stay tuned, and let the exploration continue!
