Probabilistic Justification for using specific Loss function in different types of Machine Learning Algorithms (Part 1)

Syed Muhammad Hamza
Mar 4 · 8 min read

Mathematical Justification for using a specific kind of loss function in different types of supervised Machine learning algorithms through Probability

Essentially, all models are wrong, but some are useful.

George E.P. Box

by Tamas Nepusz, co-creator of igraph

When I started with machine learning, one of the most troublesome things, and one that didn’t make much sense to me, was the choice of loss function for different machine learning algorithms. The justifications for using a specific loss function with a given kind of algorithm were, most of the time, based on intuition rather than on any formal mathematical proof.

Thanks to probability theory, today we will answer that elusive question in a mathematically sound and consistent way, through a probabilistic interpretation.

I’m going to use material from CS229: Machine Learning from Stanford. This course is more advanced and requires more mathematical background than the Machine Learning course on Coursera (Andrew Ng); its prerequisites include familiarity with multivariable calculus, probability theory, and linear algebra.

But first, a little background


For almost all beginners, the first flavor of machine learning you get to experience is a regression problem, and that makes absolute sense: grasping the idea of regression only requires rusty high-school linear algebra and calculus. All you need to understand are the following fancy terms:



Cost function

Optimization algorithm

At this point, I’m assuming that you are familiar with regression and with the machinery of optimization algorithms, or at least with gradient descent or gradient ascent.

Be mindful of one thing: the term cost function is usually more general. It is the sum of the loss function over your training set, plus some model complexity penalty (regularization).

The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
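Written out, assuming m training examples (x(i), y(i)) and a model h_θ (notation borrowed from the CS229 material, with the optional regularization term left out), that definition can be sketched as:

$$
J(\theta) = \sum_{i=1}^{m} L\big(h_\theta(x^{(i)}),\, y^{(i)}\big)
$$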

If the loss is convex in the parameters, then the cost function, being a sum of convex terms, is also convex. Similarly, if the loss is concave, the cost function is also concave.

In the case of gradient descent, when the function is convex (figure below), we are searching for the global minimum, whereas in the case of gradient ascent, when the function is concave, we are searching for the global maximum. Both the global maximum and the global minimum are critical points of your cost function.

A cost function with a global minimum or maximum is desirable for machine learning algorithms, because we can find the parameter values that attain that minimum or maximum using either analytical or numerical optimization methods.

Here’s a quick review of gradient descent, written from scratch in Python, on the MSE cost function for regression.
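A minimal sketch of what such a loop can look like, assuming NumPy, a linear hypothesis with a bias term, and synthetic data. The function name `gradient_descent_mse`, the learning rate, and the iteration count are illustrative choices, not taken from the original post:

```python
import numpy as np

def gradient_descent_mse(X, y, alpha=0.1, n_iters=2000):
    """Batch gradient descent on the MSE cost of a linear model (illustrative)."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # prepend a column of ones for the bias
    theta = np.zeros(n + 1)
    history = []
    for _ in range(n_iters):
        residual = Xb @ theta - y             # h_theta(x) - y for every example
        cost = (residual @ residual) / (2 * m)  # J(theta) = 1/(2m) * sum of squares
        grad = Xb.T @ residual / m            # gradient of J with respect to theta
        history.append(cost)
        theta = theta - alpha * grad          # step against the gradient
    return theta, history

# Synthetic data (an assumption for the demo): y = 3 + 2x + Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 0.1, size=200)

theta, history = gradient_descent_mse(X, y)
print("estimated theta:", theta)
print("first and last cost:", history[0], history[-1])
```

With this setup the estimated parameters should land close to the values used to generate the data, and the recorded cost should decrease over the iterations.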

Why MSE as loss function for Regression?

Now, in the case of a regression problem (univariate or multivariate, linear or polynomial), you might have already noticed that we almost always use Mean Square Error (MSE) as the cost function.

by Rohan on Medium

The question remains: why do we use Mean Square Error (MSE) as the cost function when there is a plethora of other convex functions (figure above) that could be used as a cost function for regression? Moreover, MSE is more sensitive to outliers than Mean Absolute Error (MAE), and since many outliers are just noise, shouldn’t the robustness of MAE to outliers be something we are looking for?
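The outlier sensitivity is easy to see on a toy example. The sketch below, assuming NumPy and a made-up five-point dataset, searches over constant predictions: the constant that minimizes MSE is pulled towards the outlier (it sits at the mean), while the constant that minimizes MAE stays with the bulk of the data (it sits at the median). The helper names `mse` and `mae` are illustrative.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # the last value plays the outlier

def mse(c, y):
    """Mean squared error of predicting the constant c for every point."""
    return np.mean((y - c) ** 2)

def mae(c, y):
    """Mean absolute error of predicting the constant c for every point."""
    return np.mean(np.abs(y - c))

# Brute-force search over a grid of constant predictions (step 0.01)
candidates = np.linspace(0.0, 100.0, 10001)
best_mse = candidates[np.argmin([mse(c, y) for c in candidates])]
best_mae = candidates[np.argmin([mae(c, y) for c in candidates])]

print("MSE-optimal constant:", best_mse, "(mean is", np.mean(y), ")")
print("MAE-optimal constant:", best_mae, "(median is", np.median(y), ")")
```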

Usually, the rationale you will see for choosing Mean Square Error (MSE) over Mean Absolute Error (MAE) as a cost function comes from a calculus perspective:

“The derivative of a function at a sharp turn is undefined, meaning the graph of the derivative will be discontinuous at the sharp turn”

The shape of Mean Absolute Error (MAE) is shown on the left-hand side, and the shape of Mean Square Error (MSE) on the right-hand side.

Gradient of x², which has the same shape as Mean Square Error (MSE), on the right; gradient of |x|, which has the same shape as Mean Absolute Error (MAE), on the left. Source of tangent animations: Alice Ryhll


In the MAE plot there is clearly a sharp turn at x=0, which makes it inefficient in the quest for the global extremum. But there is something else: notice how the gradient of MAE stays constant on each side of x=0 and then flips abruptly at x=0. Now compare it with the gradient of MSE, which shrinks smoothly as we approach x=0 instead of jumping from one line to another. Okay, but how does this relate to our problem?

Recall that gradient descent has a hyperparameter α, also known as the learning rate, that we usually assign only once. Its value determines the size of each step towards the global extremum: if α is small, we take small steps and almost always converge; if α is big, we might diverge. So we usually settle for a balanced value of α. With a cost function like Mean Absolute Error (MAE), however, the gradient does not shrink near the minimum, so a fixed α keeps overshooting and we have to adjust its value as we go, based on what we observe of the gradient.
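To make this concrete, here is a small sketch (assuming NumPy; the helper `run_gd`, the starting point, and the value of α are arbitrary illustrative choices) that runs gradient descent with the same fixed learning rate on x² and on |x|:

```python
import numpy as np

def run_gd(grad, x0, alpha, n_steps):
    """Plain gradient descent on a one-dimensional function, given its gradient."""
    x = x0
    for _ in range(n_steps):
        x = x - alpha * grad(x)
    return x

alpha = 0.1

# f(x) = x^2 has gradient 2x, which shrinks as x approaches 0
x_sq = run_gd(lambda x: 2 * x, x0=1.05, alpha=alpha, n_steps=100)

# f(x) = |x| has gradient sign(x), which keeps magnitude 1 right up to x = 0
x_abs = run_gd(lambda x: np.sign(x), x0=1.05, alpha=alpha, n_steps=100)

print("x^2 after 100 steps:", x_sq)
print("|x| after 100 steps:", x_abs)
```

On x² the iterate shrinks geometrically towards 0, while on |x| it ends up bouncing back and forth around 0 at a distance set by α, which is why the step size has to be reduced over time for MAE-like costs.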

BUT BUT BUT… what if we use Huber loss, which solves the problems related to the sharp turn? Moreover, none of this is a mathematical proof, and we want a mathematical proof! Well, that’s exactly what I’m going to show you next.

Probabilistic Interpretation of Regression

Let us assume there is some relationship between y (the response variable) and x (the independent variable), with an additive random error term ϵ that is independent of x and has mean zero. Keep in mind that ϵ is an irreducible error that captures either unmodeled effects or random noise, and that y is the true value.
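In symbols, and assuming the linear hypothesis θᵀx used in the CS229 notes (other model families are possible), this relationship can be written as:

$$
y = \theta^{T} x + \epsilon
$$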

For a single example (x(i), y(i)):
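Assuming the linear model θᵀx of the CS229 notes, the relationship for the i-th example, together with the error term obtained by rearranging it, reads:

$$
y^{(i)} = \theta^{T} x^{(i)} + \epsilon^{(i)}
\quad\Longleftrightarrow\quad
\epsilon^{(i)} = y^{(i)} - \theta^{T} x^{(i)}
$$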

Thus we can see that the error term ϵ(i) is a function of y(i) and x(i), parameterized by θ.

Now I want you to take a leap of faith and assume that these error terms ϵ(i) are IID (independent and identically distributed) according to a Gaussian (normal) distribution with mean zero and some variance σ².

Thus the probability density function (PDF) of ϵ(i) is given by
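For a Gaussian with mean zero and variance σ², this is the standard density:

$$
p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^{2}}{2\sigma^{2}}\right)
$$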

Substituting the value of ϵ(i) from the equation above gives
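With ϵ(i) = y(i) − θᵀx(i) (the linear model of the CS229 notes is assumed here), the density of y(i) given x(i) becomes:

$$
p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^{T} x^{(i)})^{2}}{2\sigma^{2}}\right)
$$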

The probability of a single example is given by p(y(i)∣x(i);θ), and the probability of the data is given by p(y∣X;θ). The way you read p(y(i)∣x(i);θ) is: the probability of y(i) given x(i), parameterized by θ.

This quantity p(y∣X;θ) is typically viewed as a function of y (and perhaps X), for a fixed value of θ. When we wish to explicitly view it as a function of θ, we instead call it the likelihood function.

Likelihood function
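In symbols, with X standing for all the inputs x(1), …, x(m) and y for all the responses y(1), …, y(m) (notation borrowed from the CS229 notes):

$$
L(\theta) = L(\theta; X, y) = p(y \mid X; \theta)
$$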

This L(θ) is the probability of our data, that is, the probability of all the values of y from y(1) up to y(m) (where m is the number of data instances) given all the x’s from x(1) up to x(m), parameterized by θ. It should be equal to this:
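Written as a formula, under the assumption that the error terms ϵ(i) are independent:

$$
L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta)
$$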

This L(θ) is equal to the product of the probabilities of the individual y(i)’s because, remember, we assumed that the ϵ(i) are IID (independent and identically distributed), and we know what happens when events are independent:

If events are independent, then the probability of them both occurring is the product of the probabilities of each occurring.

Substituting p(y(i)∣x(i);θ) from above, we get this:
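Assuming the Gaussian density for each example and the linear model θᵀx of the CS229 notes, the likelihood takes the form:

$$
L(\theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^{T} x^{(i)})^{2}}{2\sigma^{2}}\right)
$$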

Now I’m going to take the log of our likelihood function L(θ) and call it the log likelihood 𝔏(θ).
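Step by step, and again assuming Gaussian errors around the linear model θᵀx, the log turns the product into a sum:

$$
\begin{aligned}
\mathfrak{L}(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^{T} x^{(i)})^{2}}{2\sigma^{2}}\right) \\
&= \sum_{i=1}^{m} \log \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^{T} x^{(i)})^{2}}{2\sigma^{2}}\right) \right] \\
&= m \log \frac{1}{\sqrt{2\pi}\,\sigma} \;-\; \frac{1}{\sigma^{2}} \cdot \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}
\end{aligned}
$$

The first term does not depend on θ at all, and σ² only scales the second term.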

Now, given this probabilistic model relating the y(i)’s and the x(i)’s, what is a reasonable way of choosing our best guess of the parameters θ?

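The principle of maximum likelihood says that we should choose θ so as to make the data as high probability as possible, i.e., we should choose θ to maximize the likelihood L(θ). Because the log is a strictly increasing function, this is the same as maximizing the log likelihood 𝔏(θ).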

Hence, maximizing the log likelihood 𝔏(θ), ignoring all the terms and factors that do not depend on θ, is the same as minimizing this:
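In symbols, under the Gaussian noise and linear model assumptions made above (the hat on θ is notation introduced here for the estimate):

$$
\hat{\theta} = \arg\max_{\theta} \mathfrak{L}(\theta) = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^{T} x^{(i)}\right)^{2}
$$

Note that the answer does not depend on the value of σ².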

“Look familiar?” Yes, it does: up to a constant factor, that is Mean Square Error (MSE), our familiar loss function for regression. So this proof shows that choosing the value of θ that minimizes MSE is the same thing as finding the maximum likelihood estimate of θ under the set of assumptions we made.
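The equivalence can also be checked numerically. The sketch below, assuming NumPy and synthetic data generated from a linear model with Gaussian noise (the names `log_likelihood`, `theta_true`, and `theta_ls` are illustrative), computes the least-squares estimate and confirms that moving away from it only lowers the Gaussian log likelihood:

```python
import numpy as np

rng = np.random.default_rng(42)
m = 500
X = np.column_stack([np.ones(m), rng.normal(size=m)])  # bias column + one feature
theta_true = np.array([1.5, -0.7])
sigma = 0.5
y = X @ theta_true + rng.normal(0.0, sigma, size=m)    # y = theta^T x + Gaussian noise

def log_likelihood(theta, X, y, sigma):
    """Gaussian log likelihood of y given X under y = X @ theta + eps."""
    resid = y - X @ theta
    return (-len(y) * np.log(np.sqrt(2 * np.pi) * sigma)
            - (resid @ resid) / (2 * sigma ** 2))

# Least-squares estimate: the minimizer of the sum of squared errors
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# Randomly perturbed parameter vectors around the least-squares estimate
ll_at_ls = log_likelihood(theta_ls, X, y, sigma)
ll_perturbed = [log_likelihood(theta_ls + d, X, y, sigma)
                for d in rng.normal(scale=0.05, size=(20, 2))]

print("least-squares estimate:", theta_ls)
print("every perturbation has lower log likelihood:",
      all(ll < ll_at_ls for ll in ll_perturbed))
```

This is a sanity check rather than a proof, but it shows the same parameter vector winning under both criteria.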

‘Huzzah!’ Now we have a mathematically sound, consistent, and clean proof of why we use Mean Square Error (MSE) as the cost function in regression problems.


Today we’ve seen a mathematical justification for using Mean Square Error (MSE) as a cost function for regression through a probabilistic interpretation. This post is the first of three: the next one is going to be about the cross-entropy loss function as a cost function for logistic regression, and the final one will be about the Support Vector Machine (SVM) and its cost function.

Analytics Vidhya
