How to make your neural network predict its own error

A typical neural network that predicts one or several continuous values, such as tomorrow’s temperature or a person’s alcohol consumption in the next week, uses the mean squared error (MSE) as its loss function. So if we denote the model prediction for sample i with z_i and the true value with y_i, the loss function is something like

L = ∑_i (z_i − y_i)²

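In code, this standard setup is just a regression head with one output unit per predicted variable (a minimal PyTorch sketch added here for contrast with what follows; the layer sizes are arbitrary):

```python
import torch.nn as nn

# one output unit per predicted variable, trained with plain MSE
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 3))
loss_fn = nn.MSELoss()  # mean of (z_i - y_i)^2 over the batch
```
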
This seems nice and intuitive, but it has two serious drawbacks. Firstly, fitting a model like that gives you no estimate of the model’s prediction error for any given point.

Secondly, if your prediction error has varying levels of noise (whether because you’re simultaneously predicting several variables, some noisier than others, or because some samples in your dataset are harder to predict than others), the noisier parts will dominate the gradient updates. So if one of the predicted variables is very noisy, it will impair training for the others.

Log-likelihood to the rescue

Here is a simple trick that addresses both of these problems: instead of having your model generate point predictions, let it generate distributions! And change your loss from MSE to the negative log-likelihood, −∑_i log p(y_i | z_i, other_params). Here p is the probability density of your assumed error distribution, parametrized by your model’s outputs (z_i and other_params).
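
In PyTorch terms, the recipe looks roughly like this (my own sketch, not code from the post): the network emits the parameters of the assumed error distribution, and the loss is the negative log-probability of the targets under that distribution.

```python
import torch
from torch.distributions import Normal

def nll_loss(y, dist):
    # negative log-likelihood of the targets under the predicted distribution;
    # any torch.distributions object built from the model's outputs works here
    return -dist.log_prob(y).mean()

# e.g. a normal error model: the network outputs a mean z and a log-variance v per target
z, v = torch.randn(8, 3), torch.randn(8, 3)
y = torch.randn(8, 3)
loss = nll_loss(y, Normal(loc=z, scale=torch.exp(0.5 * v)))
```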

For example, if you want to treat your error as normally distributed (the simplest case), then in addition to the actual prediction z_i, the very same model should also generate its expected log variance v_i (just double the number of output units in the final fully-connected layer that your model most likely has), and your loss function becomes

L = ∑_i [ (y_i − z_i)² / (2 exp(v_i)) + v_i / 2 ]

(up to an additive constant that doesn’t affect training).

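A sketch of how this could look end to end in PyTorch (my own illustration, not the author’s code; the class and function names RegressorWithVariance and gaussian_nll are hypothetical):

```python
import torch
import torch.nn as nn

class RegressorWithVariance(nn.Module):
    """Plain regression net whose final fully-connected layer is doubled:
    for each target it returns a prediction z and a log-variance v."""

    def __init__(self, n_features, n_outputs, n_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.head = nn.Linear(n_hidden, 2 * n_outputs)  # doubled output units
        self.n_outputs = n_outputs

    def forward(self, x):
        out = self.head(self.body(x))
        return out[:, : self.n_outputs], out[:, self.n_outputs :]  # z, v


def gaussian_nll(y, z, v):
    # (y - z)^2 / (2 exp(v)) + v / 2, summed over outputs, averaged over the batch
    return (0.5 * (y - z) ** 2 * torch.exp(-v) + 0.5 * v).sum(dim=-1).mean()


# one training step on dummy data
model = RegressorWithVariance(n_features=10, n_outputs=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 10), torch.randn(32, 3)

z, v = model(x)
loss = gaussian_nll(y, z, v)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

At prediction time, exp(v) is the model’s own estimate of its error variance for that particular input (take exp(v/2) if you want a standard deviation).
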
Look at the structure of this. Firstly, gradient descent tries to decrease v_i to shrink the second term; but if it pushes v_i too low, the first term starts to dominate, so v_i converges to something pretty close to the actual log variance of the model’s error (indeed, for a fixed squared error e², the per-sample loss e²·exp(−v)/2 + v/2 is minimized exactly at v = log e²). Secondly, the squared-error terms are divided by their predicted variances, so the noisier ones are automatically down-weighted and no longer mess up the improvement of the less noisy ones.
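
You can check that first claim numerically (an illustrative snippet added here, not part of the original argument): holding the squared error fixed and running gradient descent on v alone lands it at the log of that squared error.

```python
import torch

e2 = torch.tensor(0.25)                 # fixed squared error (y - z)^2
v = torch.zeros(1, requires_grad=True)  # predicted log-variance
opt = torch.optim.SGD([v], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    loss = 0.5 * e2 * torch.exp(-v) + 0.5 * v  # per-sample loss from above
    loss.backward()
    opt.step()

print(v.item(), torch.log(e2).item())  # both ≈ -1.386, i.e. v converges to log(e2)
```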

If z_i and y_i are vectors, v_i has to be a vector of the same size, and the log-likelihood is computed element-wise (corresponding to the assumption of independent errors). You could also fit or pre-specify some covariance between the errors, but that’s beyond the scope of this post (for the fun games one can play with covariance matrices, check out this notebook).

If your prediction error is very non-normal (if you’re not sure, fit a model and look at a histogram of its errors to find out), you can use the log-likelihood of a distribution with a more appropriate shape, for example the skew-normal distribution (meaning the model will need to generate a skew parameter in addition to the log-variance).
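
For concreteness, here is one possible sketch of such a loss (my addition; PyTorch has no built-in skew-normal, so the log-density is written out from the standard skew-normal formula, with the model now emitting a location, a log-scale and a skew parameter alpha per output):

```python
import math
import torch
from torch.distributions import Normal

def skew_normal_nll(y, loc, log_scale, alpha):
    # skew-normal log-density: log 2 - log(scale) + log phi(t) + log Phi(alpha * t),
    # where t = (y - loc) / scale and phi / Phi are the standard normal pdf / cdf
    std_normal = Normal(0.0, 1.0)
    t = (y - loc) / torch.exp(log_scale)
    log_pdf = (
        math.log(2.0)
        - log_scale
        + std_normal.log_prob(t)
        + torch.log(std_normal.cdf(alpha * t).clamp_min(1e-12))  # clamp for numerical safety
    )
    return -log_pdf.mean()
```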

So by making a one-line change to your loss function, you can get your model to predict its own prediction error, and get better fit quality into the bargain — isn’t this nice?