ML notes: why the Least Square Error?
Disclaimer:
This blog post is in my “ML notes” category. Its goal is to help me make sure I understand the tools and theories used in ML. I believe that pedagogically explaining what I learn, removing step by step any unknowns, is the best way to achieve this goal.
What do you need? A little bit of mathematical background in calculus, algebra, probability and machine learning (mainly definitions).
If you find any mistakes, please get in touch. I would be very sad to keep wrong beliefs in my head and discover them too late. Thanks!
Making the log-likelihood explicit
In my last note, I wrote about the many practical and theoretical reasons why the log-likelihood is so often used in ML algorithms. But I stopped short of reaching an explicit expression that could be implemented.
Let’s dig a little further by exploring how this log-likelihood idea produces what we call the Least Square Error (LSE) under two main ML concepts: Supervised Learning (SL) and the Additive White Gaussian Noise model (AWGN).
I will only explore the LSE from the probabilistic viewpoint (I won’t explore how the LSE can arise or be justified from calculus or linear algebra). At the end of the article (remark 5), you will find a link to many interesting properties of its cousin, the Mean Squared Error (MSE).
Supervised learning (SL)
In my last note, we ended up showing that “maximising the likelihood of the model parameters given a dataset” is equivalent to “maximising the sum of the data log probabilities” (taking into account the I.I.D. assumption, see previous note for more information):
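In symbols (writing θ for the parameters, m for the chosen family of models and d_i for the i-th datum of our dataset of size N), this reads roughly as:

\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \log p(d_i \mid \theta, m) \qquad (1)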
Now, I want to focus on what we call supervised learning, in which:
- A datum “d” is a pair “{x, y}” consisting of an input object “x” and the desired output value “y”
- The output “y” is dependent on “x”
- The goal is to build a model which predicts “y” given “x”
Let’s apply this setting to the probability of a datum:
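With the same notations, and writing a datum as d = {x, y}, this gives something like:

\begin{aligned}
p(d \mid \theta, m) &= p(x, y \mid \theta, m) && (2) \\
&= p(y \mid x, \theta, m)\; p(x \mid \theta, m) && (3) \\
&= p(y \mid x, \theta, m)\; p(x) && (4)
\end{aligned}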
- (2) is coming from the definition of a datum in supervised learning
- (3) is the definition of the joint probability in terms of the conditional probability (the product rule)
- (4) results from the fact that the intrinsic probability of the input does not depend on the model or its parameters.
Since the probability of “x” does not depend on “theta”, we can apply (4) in (1) and obtain:
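Roughly (the log p(x_i) terms do not depend on θ, so they drop out of the argmax):

\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta, m)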
So, in order to maximise the likelihood of the parameter in supervised learning, we can focus on maximising the sum of the output log-probabilities given the corresponding input. And that’s great because programmatically this aligns with the definition of a function.
Note: To complete the optimisation procedure, the joint probability notation is perfectly fine from a theoretical perspective, so why focus on the conditional notation?
This is again due to engineering constraints. To compute the joint, you need to build a function which takes both “x” and “y” as inputs and outputs a single probability: the joint one. To compute the whole joint distribution, you have to batch all the possible “y” for a given “x”, and if you then want to do mini-batches, you have to batch multiple of those batches… This requires a lot of GPU power.
An optimised way to achieve the same result is to approximate the conditional distribution directly with our model: we can build a function that takes a single input “x” and computes all the “y” at once. Our model then approximates the conditional distribution itself. This heavily reduces the number of operations needed and thus increases the speed of our algorithm 🔥
Back to real life! We know we can build a parametrised function in TF representing our family of models. It should take our “x” as input and output some values that can be interpreted as the conditional distribution of “y” given “x”. We have our model; the last thing we need is a way to optimise (train) the function’s parameters so that it approximates the categorical distribution defined by the labels in our dataset.
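As a purely illustrative sketch, a tiny Keras model doing exactly that could look like the following (the input dimension, layer sizes and number of classes are made up for the example):

```python
import tensorflow as tf

NUM_CLASSES = 10   # made-up number of possible labels "y"
INPUT_DIM = 32     # made-up dimension of the input "x"

# A parametrised function: takes "x" and outputs, in a single pass, one value
# per possible "y" — interpreted as the conditional distribution p(y | x).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(INPUT_DIM,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),  # sums to 1: a distribution over "y"
])
```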
So, we need an objective. This objective can be to compare the one-hot label (a categorical distribution) with the outputs of our function, and to reduce any difference between them for each datum. But how do we justify a particular comparison? How should we compare those values to be efficient, robust and actually helpful for generalisation? Should we use the absolute difference? The squared error? Some other exotic norm?
There is an infinite number of possible ways to do this and designing objective functions is still an ongoing active area of research. So, how can we justify the use of the LSE from the probabilistic viewpoint?
The additive white Gaussian noise model (AWGN)
A dataset is nothing more than a large number of samples coming from a stochastic process. Whatever the cause of the stochasticity reflected in our dataset (missing information, measurement errors…), we can model it as noise.
Assuming that our dataset is coherent (see remark 1), there are two common ways of introducing noise in a problem:
- The noise can be independent of our data; we call this noise additive. It is typically introduced by human errors when labelling and/or by sensor inaccuracy.
- The noise can be dependent on our data; we call this noise multiplicative. Multiplicative noise can model missing information, such as missing input features that would be useful to predict the output.
So, what noise should we add? We will go with the simplest possible noise model (a choice sponsored by Occam’s razor): the Additive White Gaussian Noise model, which states that:
- There exists a function relating the input to the output, but our measured output is subject to additive noise.
- All the noise is generated by a white Gaussian distribution: the mean is 0 and the noise is identically distributed
- All the noise random variables are independent of each other and independent of the data in our dataset
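Concretely, writing h for the underlying function, Z for the noise random variable and σ for its standard deviation, the model states something like the following (I will refer to the Gaussian density as (6) in the derivation below):

Y = h(X) + Z, \qquad Z \sim \mathcal{N}(0, \sigma^2), \qquad Z \perp X

p_Z(z) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{z^2}{2\sigma^2}\right) \qquad (6)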
Remember that we would like to model the function “h” in the equations above. We will assume that the class of models we chose (all the functions our model can represent) can approximate this function “h” as well as we want.
Given this model of the relation between our data, we can roll some math and write down explicitly the probability of “y” given “x”:
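Writing h_θ for the function computed by our parametrised model (our stand-in for “h”), the chain of equalities goes roughly like this:

\begin{aligned}
p(y \mid x, \theta, m) &= p\big(h_\theta(x) + Z = y \mid x, \theta, m\big) && (8) \\
&= p\big(Z = y - h_\theta(x)\big) && (9) \\
\log p(y \mid x, \theta, m) &= -\log\big(\sigma \sqrt{2\pi}\big) - \frac{\big(y - h_\theta(x)\big)^2}{2\sigma^2} && (10, 11) \\
\arg\max_{\theta} \sum_{i=1}^{N} \log p(y_i \mid x_i, \theta, m) &= \arg\min_{\theta} \sum_{i=1}^{N} \tfrac{1}{2} \big(y_i - h_\theta(x_i)\big)^2 && (12)
\end{aligned}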
- (8) We are replacing the value of the random variable Y
- (9) We use the fact that “x” is given and the random variable Z is independent of X, ”theta” and “m”
- (10 & 11) We apply (6) and take the log
- (12) We remove any constant from the “argmin”. We only keep the 0.5 for mathematical convenience when we later compute the derivatives.
A wild squared error appeared! Notice that we can divide by the number of elements to retrieve the Mean Squared Error (MSE) for free.
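Indeed, dividing by the (constant) number of data points N does not change the argmin:

\arg\min_{\theta} \sum_{i=1}^{N} \tfrac{1}{2} \big(y_i - h_\theta(x_i)\big)^2 = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} \big(y_i - h_\theta(x_i)\big)^2 = \arg\min_{\theta} \mathrm{MSE}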
Remark 1
It has been shown that if you generate a random dataset for supervised learning, you can always build a big enough model (in terms of capacity) that will overfit (memorise) the data and reach a training loss of 0 😨
This is important because it means that with any dataset you can fool yourself into believing you solved a task where you were actually only “hardcoding” your dataset relationships.
This is even more important for dataset creators, who must be rigorous: when one gathers a dataset for supervised learning, one is already making the assumption that the gathered input and output data share mutual information (are correlated).
Just to give you an example, I could gather a dataset with the position of the moon as input and national lottery results as output. Even though they are clearly not correlated, you can find a model that will overfit this dataset.
Remark 2
When using the AWGN model, we introduce an irreducible stochastic error per sample. The predicted values should not be exactly equal to the noisy output values, as that would mean you are overfitting your training set.
On average (meaning over a big enough number of points), each prediction should be off by roughly the standard deviation of the additive noise.
In particular, if you compute the Mean Squared Error on a big enough number of data points, it should be close to the variance of the noise (the square of the standard deviation). A concrete example of this can be found in the code section below.
Remark 3
One should never forget what it costs us to use the maximum likelihood estimation procedure with supervised learning and the AWGN model. Here is the list of assumptions we make to justify the LSE training phase:
- We assume that our data in the training set are i.i.d.
- We assume that there exists a mutual information relationship between our input and output
- We assume that we can approximate this relationship using the family of models we chose.
- We assume that there is only one source of noise
- We assume that the noise is additive on the output values
- We assume that this additive noise is from a white Gaussian distribution
- We assume that this additive noise is i.i.d. and independent of our data
And now the list of assumptions to justify why our model should be able to actually have a high accuracy on the test set and be able to predict new data:
- We assume that our data in the test set (and future data) are i.i.d. too
- We assume that our data in the test set (and future data) are identically distributed with the training data. For example, probability distributions in real life are usually not time-independent: they evolve over time, effectively breaking this assumption.
- We assume the test set (and future data) is subject to the exact same source of noise
And assuming all of this, we develop some slick learning algorithm and pretend it’s going to work… and more often than not, it actually does! 🍻
More seriously, I believe keeping those assumptions in mind helps one to be very careful when crafting a new dataset and, even better, to test them afterwards.
Remark 4
A note on a subtle difference in notations that you might find in the wild:
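The two notations in question look roughly like this:

p(y \mid x, \theta, m) \qquad \text{versus} \qquad p(y \mid x ; \theta, m)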
The first notation states that “theta” and “m” are random variables; the second notation (notice the semicolon) states that “theta” and “m” are just parameters (nothing to do with a probabilistic point of view). Both are fine; only the interpretation differs, and that affects the theory you allow yourself to apply.
In my case, I want to state that we chose a family of models and its corresponding parameters based on some prior knowledge, which one could model with a probability distribution: the Bayesian point of view.
Notations are very complex in the ML world because it mixes multiple mathematical fields (probability theory, linear algebra and calculus, at least). So please, strive for precision!
Remark 5, more properties of squared errors
I can’t write an article on the LSE without mentioning this awesome question on stack exchange about the Mean Squared Error (MSE).
Go read it, it contains a lot of insight!
Time to code
Let’s confront the theory with reality, following the scientific method! To do so, we will build a toy example: we will try to approximate the sine function on a given interval. In SL, this is called regression.
Let’s dive into the code; everything is in the comments.
Here is the training phase:
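A minimal sketch of that training phase in TF (the network architecture, optimiser and hyperparameters below are just one possible set of choices, and the noise standard deviation is assumed to be 1 so that the irreducible loss is 0.5):

```python
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

NOISE_STD = 1.0  # assumed: additive white Gaussian noise with sigma = 1

# Build the dataset: y = sin(x) + z, with z ~ N(0, NOISE_STD^2)
rng = np.random.default_rng(42)
x_train = rng.uniform(-np.pi, np.pi, size=(10_000, 1)).astype(np.float32)
noise = rng.normal(0.0, NOISE_STD, size=x_train.shape).astype(np.float32)
y_train = np.sin(x_train) + noise

# Our family of models: a small MLP taking "x" and outputting a single value h_theta(x)
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])

# The objective derived in (12): 0.5 * (y - h_theta(x))^2, averaged over the mini-batch
def half_squared_error(y_true, y_pred):
    return 0.5 * tf.reduce_mean(tf.square(y_true - y_pred))

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=half_squared_error)
model.fit(x_train, y_train, batch_size=128, epochs=30)

# Remark 2 in action: the training loss should oscillate around the
# irreducible error 0.5 * NOISE_STD^2 = 0.5 instead of going to 0.

# Visual check: the predictions should be close to sin(x) despite the noisy labels
x_test = np.linspace(-np.pi, np.pi, 200, dtype=np.float32).reshape(-1, 1)
y_pred = model.predict(x_test)
plt.scatter(x_train[:500], y_train[:500], s=5, alpha=0.3, label="noisy training samples")
plt.plot(x_test, np.sin(x_test), label="true sin(x)")
plt.plot(x_test, y_pred, label="model prediction")
plt.legend()
plt.show()
```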
And here are the resulting graphs:
Some remarks on the experiment:
- We succeeded pretty well in reconstructing the sine function, even with strong additive noise
- Our optimisation algorithm’s loss is indeed oscillating around the irreducible error: 0.5
- We get finer-grained improvements when we lower the learning rate
- If the testing set is not I.I.D. with the training set, it does not work! This example might seem obvious, but if you work with images as inputs, it’s a lot harder to tell when you are too far from the training dataset manifold.
Anyway if you read this far, thank you!
Cheers! 🍺
Thanks to Alex Orange for all the feedback!
ML notes series
- Why the log likelihood?
- Why the mean squared error? (this one ✍︎)