Uncertainty in machine learning models and Gaussian processes

Carlos Villacampa Calvo
EcoVadis Engineering
5 min read · Sep 13, 2022


Image by Julien Tromeur on Pixabay.

In this article, we’ll try to explain what model uncertainty is and why we should care about it in machine learning models, along with an example of how to deal with uncertainty using Gaussian process models.

Introduction

Let’s start with an example to illustrate the problem we are facing. Imagine we train a neural network to predict the probability that an image contains a car, and the model achieves very good accuracy when we use these probabilities to classify images. At some point, we give the model an image of something it has never seen before, for example, a penguin. The model says the image contains a car with a probability of 97%. Based on this example, would you trust the model’s predictions? It turns out this is typical behavior for neural networks, which tend to give highly confident predictions far away from the training data [1]. The ability to know what we don’t know is what we are looking for, as it can be critical in decision-making applications like autonomous cars [2]. For example, if the model fails to classify a pedestrian as such, our AI agent (the car) could make the wrong decision, with a catastrophic outcome.

What is model uncertainty?

It is worth clarifying that when we talk about machine learning models, there are two types of uncertainty [3]:

  • Aleatoric uncertainty: It comes from randomness in the data-generating process, for example, measurement noise in an experiment. This uncertainty cannot be reduced, no matter how good our model is.
  • Epistemic uncertainty: It comes from lack of knowledge, which in machine learning means that either we don’t have enough data or our model is not expressive enough. This uncertainty can be reduced as we gather new information.
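To make the distinction concrete, here is a small NumPy sketch (the toy data-generating function and every parameter value are invented for illustration): the disagreement between models trained on independent datasets shrinks as we add data, a rough proxy for epistemic uncertainty, while the measurement noise, the aleatoric part, stays the same.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = 0.3  # aleatoric: irreducible measurement noise


def make_data(n):
    # Noisy observations of a true function that is unknown to the model.
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, NOISE_STD, n)
    return x, y


def prediction_spread(n_points, n_models=50):
    # Fit many models on independent datasets and measure how much their
    # predictions at x = 0 disagree: a proxy for epistemic uncertainty.
    preds = []
    for _ in range(n_models):
        x, y = make_data(n_points)
        coeffs = np.polyfit(x, y, deg=5)
        preds.append(np.polyval(coeffs, 0.0))
    return np.std(preds)


small_data = prediction_spread(20)
big_data = prediction_spread(2000)
# The spread shrinks with more data, but NOISE_STD stays the same:
# no amount of extra data removes the aleatoric noise.
```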

Having this in mind, let’s look at the specific example of uncertainty estimates given by Gaussian processes.

Gaussian process regression

Gaussian processes (GPs) are Bayesian machine learning models. This means that for a given input point we obtain a predictive distribution, instead of the point estimate we would get from a standard neural network. The predictive variance can be interpreted as an estimate of the total uncertainty.

But how do these models work? Let’s imagine we have the following regression problem:

Training data for a regression problem

The problem to solve is to find the function that best fits the data. For that, we assume that the observations y are noisy versions of the function that generated the data. Namely,

y = f(x) + ϵ,

where ϵ ∼ 𝒩(0, σ²) is additive Gaussian noise and the variance σ² corresponds to the aleatoric uncertainty.

With GP models, one assumes that the function that generated the data is drawn from a Gaussian process, which can be seen as a distribution over functions [4]. A GP is characterized by a mean function m(x) and a covariance function k(x, x′), i.e.,

f(x) ∼ 𝒢𝒫(m(x), k(x, x′)).

In practice, we place a GP prior on the function, f ∼ 𝒢𝒫(0, k(x, x′)), where it is common practice to set the mean function to zero to simplify the required computations, without loss of generality. As a result, the GP is fully characterized by its covariance function or kernel. The choice of kernel then determines the main properties of the learnt function (smoothness, stationarity, etc.). A common choice is the radial basis function or RBF kernel, which assumes smooth and stationary functions.
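As a sketch of what the RBF kernel and the zero-mean prior look like in code (the length-scale, variance, and jitter values below are arbitrary choices for the example):

```python
import numpy as np


def rbf_kernel(x1, x2, variance=1.0, lengthscale=0.3):
    # k(x, x') = variance * exp(-(x - x')^2 / (2 * lengthscale^2))
    sqdist = (x1[:, None] - x2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)


x = np.linspace(-1, 1, 100)
K = rbf_kernel(x, x)

# Draw three functions from the zero-mean GP prior N(0, K); the small
# jitter on the diagonal keeps the covariance numerically positive definite.
rng = np.random.default_rng(1)
samples = rng.multivariate_normal(
    np.zeros(len(x)), K + 1e-8 * np.eye(len(x)), size=3, method="cholesky"
)
# Each row of `samples` is one smooth function evaluated on the grid `x`.
```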

If we combine this prior knowledge with the observed data using Bayes’ rule, we obtain a posterior distribution over the function values:

p(f | y) = p(y | f) p(f) / p(y).

An interesting term is the one in the denominator, p(y), called the marginal likelihood or model evidence: a normalization constant that can be maximized to train the model and find the best parameters for the kernel. When we draw samples from this posterior, we get functions that have the properties encoded by the kernel and are compatible with the data.

Samples from the posterior distribution of a GP

Note that in the regions where we observe no data, the variability of the functions is higher than in regions where we have data points. The interesting part is that we can use this posterior to obtain a predictive distribution that gives predictions together with their associated uncertainty estimates, and, most importantly, this distribution can be computed in closed form. We won’t go into the details, but it is a Gaussian distribution with mean and variance

μ(x*) = k*ᵀ(K + σ²I)⁻¹ y,  σ²(x*) = k(x*, x*) − k*ᵀ(K + σ²I)⁻¹ k*,

where K is the kernel matrix over the training inputs and k* is the vector of covariances between the test point x* and the training inputs [4]. The predictive distribution for this problem looks as follows:

GP predictive distribution

The blue line in the plot represents the mean value of the predictions, and the shaded blue area represents the predictive variance, i.e., the total uncertainty. As expected, the model is more confident about its predictions near the observed data, and the uncertainty grows as we move away from it.
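The closed-form predictive distribution can be written out in a few lines of NumPy. This is a minimal sketch with toy data and hand-picked hyperparameters rather than trained ones:

```python
import numpy as np


def rbf(a, b, variance=1.0, lengthscale=0.3):
    return variance * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale**2)


rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 30)                  # training inputs
y = np.sin(3 * X) + rng.normal(0, 0.2, 30)  # noisy observations
noise_var = 0.04                            # sigma^2, the aleatoric noise

Xs = np.linspace(-1.5, 1.5, 200)            # test inputs
K = rbf(X, X) + noise_var * np.eye(len(X))  # K + sigma^2 I
Ks = rbf(X, Xs)                             # covariances: train vs. test
Kss = rbf(Xs, Xs)                           # covariances: test vs. test

# Predictive mean and covariance of the GP posterior.
mean = Ks.T @ np.linalg.solve(K, y)
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
std = np.sqrt(np.diag(cov) + noise_var)     # total uncertainty, incl. noise
# `std` is small near the training inputs and grows away from them.
```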

If you are interested in the implementation, here’s the code to generate the last figure using Python and GPflow [5]:

Conclusions

In this article, we have seen that some machine learning models can give overconfident predictions outside the training data, which can become an issue in real applications. We have also shown how Gaussian process models can give us the accurate uncertainty estimates that we need to build trustworthy decision-making applications.

References

[1] Hein, M., Andriushchenko, M., & Bitterwolf, J. (2019). Why relu networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 41–50).

[2] Gal, Y. (2016). Uncertainty in deep learning. PhD thesis. University of Cambridge.

[3] Hüllermeier, E., & Waegeman, W. (2021). Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3), 457–506.

[4] Williams, C. K. I., & Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

[5] Matthews, A. G. de G., Van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., & Hensman, J. (2017). GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research.
