Monte Carlo Dropout: a practical guide

Ciaran Bench
Nov 12, 2023


A digestible tutorial on using Monte Carlo Dropout and Concrete Dropout to quantify the uncertainty of neural network outputs. This article includes links to useful GitHub repos… all references are given as hyperlinks.

Introduction

If thou wilt it, inscribe my deeds in your book - that I may soar with the wings of Ramses unto a golden, pseudo-Bayesian land.

The usefulness of a model is in part determined by whether the degree of doubt we have about its outputs is smaller than some tolerance threshold. Therefore, quantifying this doubt - usually referred to as the uncertainty of the model’s outputs - is an essential part of deploying models ‘in the wild’.

Despite their versatility, effectively implementing neural network-based models into production pipelines is not a simple process. One reason is that quantifying their uncertainty is non-trivial. The uncertainty quantification methods used for non-network based mathematical models (e.g. generic Bayesian inference/optimisation) are not tractable with deep networks given the sheer number of parameters involved. Consequently, a lot of work has gone into developing more efficient techniques.

One approach, known as Monte Carlo Dropout (MC Dropout), is of particular interest given i) it is reasoned from Bayesian principles (the typical framework we use to think about quantifying model uncertainty — though the technique itself may not be Bayesian), ii) the simplicity and efficiency with which it can be implemented, and iii) its scalability. Though as will be discussed, it may only provide performance benefits over other uncertainty quantification techniques in some narrow cases.

This post is an attempt to make a digestible guide to Monte Carlo Dropout and a variant called Concrete Dropout. I aim to provide a theoretical minimum, only describing the concepts needed to understand its implementation and some design choices (e.g. there is no discussion about Gaussian processes, and only superficial comments about the nature of Bayesian neural networks).

Uncertainties are expressed as distributions

When we make a measurement or prediction with uncertainties in mind, the output is usually given as a distribution of possible values. The properties of this distribution encode the doubt we have about the outcome. E.g. if this distribution has a high variance, then we have a lot of doubt about the outcome.

A typical measurement scenario: each input variable has its own uncertainty described by the properties of their distributions. These are propagated through some measurement model, resulting in a distribution of possible output values.

- Problem: generic neural networks do not output distributions

Most neural networks are deterministic — in the generic case, we optimise a single set of model parameters that stay fixed after training. If we repeatedly evaluate the same input over and over with such a network, we will get the same output each time.

Typical networks do not provide a distribution of possible output values. If we wish to compute uncertainties, we’d rather have a network like the one on the right...

When thinking about how to quantify the uncertainties of a deep network, we need to answer the question: how can we get neural networks to output distributions?

Bayesian optimisation: getting non-network based models to output distributions

A common framework used for the non-neural network case is known as Bayesian optimisation (more precisely, Bayesian inference over the model's parameters).

How can we apply this framework to deep neural networks?

Well, a neural network is in essence some function f(θ), where θ are now the weights/biases (shown in the image below). We aim to find the parameters θ that fit the data D.

Therefore in principle, we can also use Bayes’ theorem to derive a probability distribution of different weight configurations that describe the data well (i.e. the posterior p(θ|D) ), and use this to get a predictive distribution from which we can compute an uncertainty— these are known as Bayesian Neural Networks.
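For concreteness, the posterior over the weights follows from Bayes' theorem, written here in the notation used above:

p(\theta | D) = \frac{p(D | \theta)\, p(\theta)}{p(D)}

where p(D|θ) is the likelihood, p(θ) is the prior over the weights, and the denominator p(D) is the evidence referred to below.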

However, computing the posterior analytically in this way is not tractable, the main reason being that the denominator (known as the evidence) cannot be evaluated analytically. It is worth noting that this problem is not limited to the neural network case: it also affects implementations of Bayesian optimisation on more generic functions.

Variational Inference

Instead, one can fit an analytically well-behaved function (here given by q, parameterised by Φ) to the posterior distribution (here given by p), with a technique known as variational inference (i.e. we minimise the KL-divergence between the two via an iterative optimisation):

https://towardsdatascience.com/variational-inference-the-basics-f70ac511bcea

In practice, this minimisation of the KL-divergence is performed via the maximisation of a surrogate quantity called the evidence lower bound (ELBO). We use this as we typically don’t have an expression for the posterior, but the ELBO is something we can compute as it only contains the variational distribution (that we determine), and the joint probability (the prior times the likelihood).
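Concretely, since the log evidence can be decomposed as

\log p(D) = \mathrm{ELBO}(\Phi) + \mathrm{KL}\big(q_\Phi(\theta) \,\|\, p(\theta|D)\big), \qquad \mathrm{ELBO}(\Phi) = \mathbb{E}_{q_\Phi(\theta)}\big[\log p(D, \theta) - \log q_\Phi(\theta)\big],

and log p(D) does not depend on Φ, maximising the ELBO over Φ is equivalent to minimising the KL-divergence to the true posterior.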

However, variational inference cannot be used efficiently for deep neural networks given the large number of parameters. This spurs a desire for even more efficient ways to perform variational inference.

Monte Carlo Dropout

Yarin Gal and Zoubin Ghahramani showed that one could skip these calculations and acquire the mean and variance of a network's predictive distribution for a given input by following this recipe (for a regression task):

  • Train a network with dropout
  • Evaluate the given input several times using the trained network, each time applying dropout to the network’s layers
  • Compute the mean and variance of the resultant outputs…
Vanilla MC Dropout at the evaluation stage (regression) — each network shown is acquired by applying dropout to a pretrained model. The same input, x, is fed to each, and the resulting outputs form a distribution we can use to estimate the uncertainty.
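As a minimal sketch of this evaluation-stage recipe (written here in PyTorch; the architecture, layer sizes, and number of samples are purely illustrative, and the repos linked later contain fuller implementations):

```python
import torch
import torch.nn as nn

# Illustrative regression network with dropout (layer sizes are arbitrary).
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)
# ...assume `model` has already been trained with dropout enabled...

def mc_dropout_predict(model, x, n_samples=50):
    """Run several stochastic forward passes with dropout active at test time."""
    model.eval()
    # Re-enable only the dropout layers so any other layers stay in eval mode.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)  # predictive mean and variance

x = torch.randn(1, 10)  # a single illustrative input
mean, variance = mc_dropout_predict(model, x)
```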

Yes, it really is that simple! They showed that variational inference can be approximated with dropout regularisation. I.e. the variational lower bound can resemble the dropout objective in a generic training scenario with some specific choices for the prior and approximate posterior. Each ‘version’ of the model at the evaluation stage can be thought of as a sample from the posterior distribution p(θ|D), and each resulting output can be thought of as a sample from the predictive distribution p(y|x,D).

So Gal et al. made it computationally practical to sample from a network’s posterior and predictive distributions. Given how effective Bayesian optimisation can be for non-network based models, one might expect that MC Dropout (which is reasoned from Bayesian principles) would provide superior performance to other non-Bayesian uncertainty quantification approaches. However, subsequent works have shown this is not the case.

Weaknesses of MC Dropout

  1. It poorly approximates complex posterior distributions…

It has been hypothesised that, because of the way their optimisation is posed, scalable techniques that approximate variational inference (e.g. MC Dropout) are prone to sampling from only limited regions of ‘function space’/the posterior. In particular, the simple, mean-field-style structures assumed for the approximate posterior often underestimate the variance of the true posterior, which can lead to poor uncertainties (also discussed here).

Intuitively, this makes some sense — we might expect the diversity of the models produced using MC-dropout to be low, as each is a random dropout version of some parent model.

This figure illustrates the hypothesis that variational methods like MC Dropout may only provide a good approximation of a limited region (red) of the true posterior in contrast to other techniques such as model ensembling (blue) which can cover more regions. https://arxiv.org/abs/1912.02757

2. Sensitive to choice of hyperparameters

Several works have shown that the choice of dropout rate and weight regularisation can have a considerable impact on the values of the estimated uncertainty (Gal et al. also discuss this…). To acquire reasonable uncertainties both parameters have to be grid-searched, where the optimal values are those that maximise some kind of validation metric (the specifics of this will be discussed later). However, in practice, performing these grid searches is too expensive for large models. Some variants of MC Dropout have been formulated to reduce the need to grid search over hyperparameters. I will discuss one of these (known as Concrete Dropout) a bit later in this post.

Result of using MC Dropout on a network trained to fit data generated by a known mathematical function. The uncertainty of the predictions, given by the grey regions, is shown to increase with a larger dropout rate in this case. https://arxiv.org/abs/2008.02627

Any benefits?…

With that said, MC Dropout is more computationally efficient than other uncertainty quantification techniques (e.g. ensemble-based techniques), as the network only has to be trained once. So despite the uncertainties being somewhat weakly justified/poorly calibrated, the technique could be an optimal choice in cases where training is very expensive (e.g. very large models), or when the true posterior is likely to be very simple.

Implementing MC Dropout

The set of bullet points I provided above describe the basic setup, but when implementing this technique in practice there are several things to consider. Chief among these is the type of uncertainty you are interested in quantifying.

There are several sources of uncertainty that should be considered when quantifying the uncertainty of a measurement/model. The two most commonly encountered are known as aleatoric and epistemic uncertainty. MC Dropout is performed slightly differently depending on which uncertainty you are interested in quantifying (the implementation will also change depending on whether it is a regression or classification task, but we will get to this later). Aleatoric uncertainty refers to random uncertainty intrinsic to the data (e.g. random measurement noise); consequently, this kind of uncertainty is not reducible. Epistemic uncertainty can be thought of as model uncertainty emerging from a lack of information (e.g. not enough training data, or the model's inherent inability to capture the behaviour of the data). This kind of uncertainty can be reduced with additional information.

- Acquiring epistemic uncertainty

The recipe provided in the bullet points above is the procedure for calculating the epistemic uncertainty for regression models. For classification, one instead averages the probability distributions predicted by each dropout model, and then computes a quantity called the entropy from this averaged distribution.
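A minimal sketch of this classification case (assuming a PyTorch classifier that outputs logits; the function name and number of samples are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mc_dropout_entropy(model, x, n_samples=50, eps=1e-12):
    """Average the softmax outputs of several dropout passes, then compute
    the entropy of the averaged distribution."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()  # keep dropout stochastic at evaluation time
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)  # averaged predictive distribution
    entropy = -(mean_probs * torch.log(mean_probs + eps)).sum(dim=-1)
    return mean_probs, entropy  # entropy serves as the per-input uncertainty
```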

Examples of epistemic uncertainty quantification can be found in this repo.

- Including aleatoric uncertainty

Note: Please look forward to a preprint/article in the coming months that will discuss this formulation/expression of uncertainties with more clarity : )

To incorporate aleatoric uncertainty, the model architecture is changed so that it now outputs two quantities: the original predicted quantity, as well as a new term for the data uncertainty. The use of a special loss function allows this latter term to be learned implicitly without a ground truth:

Where \hat{\sigma} is the predicted data uncertainty, y is the ground truth, and \hat{y} is the predicted output https://arxiv.org/abs/1703.04977
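For reference, the heteroscedastic regression loss from that paper takes roughly the following form (in practice the network is often trained to predict \log\hat{\sigma}^2 for numerical stability):

\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \Big[ \frac{1}{2\hat{\sigma}_i^2} \lVert y_i - \hat{y}_i \rVert^2 + \frac{1}{2}\log \hat{\sigma}_i^2 \Big]

The first term is a residual weighted by the predicted data uncertainty, while the second term penalises the model for simply predicting a huge uncertainty for every input.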

For regression, the uncertainties are computed with the following equation:

One takes the mean of the squares of the predictions \hat{y} produced by all T dropout models, subtracts the square of the mean of all the predictions, and adds the mean of the squared data uncertainties output by the network. https://arxiv.org/abs/1703.04977
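Written out (again following arXiv:1703.04977), the predictive variance for a given input is approximately

\mathrm{Var}(y) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t^2 - \Big(\frac{1}{T}\sum_{t=1}^{T} \hat{y}_t\Big)^2 + \frac{1}{T}\sum_{t=1}^{T} \hat{\sigma}_t^2

where the first two terms capture the epistemic contribution and the final term the aleatoric contribution.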

Whereas for classification, the data variance term is used to parameterise Gaussian noise, which is then used to corrupt the model's corresponding predicted logits (e.g. via addition). Each corrupted set of logits is passed through a softmax activation, and the resulting distributions are averaged. The same process can be used at the evaluation stage, with additional averaging over the Monte Carlo dropout samples. The entropy can then be computed from this averaged distribution to get the uncertainty for a given sample. A modified CE loss is also implemented in this case (see Equation 12 here). A version of this scheme has been implemented in the following repo.
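A rough sketch of the logit-corruption step for a single dropout pass (this assumes the network outputs both logits and a per-logit log-variance; the names and sample count are illustrative, and this is not the linked repo's implementation):

```python
import torch
import torch.nn.functional as F

def aleatoric_softmax(logits, log_var, n_noise=25):
    """Corrupt the predicted logits with Gaussian noise scaled by the predicted
    data uncertainty, then average the resulting softmax distributions."""
    std = torch.exp(0.5 * log_var)               # predicted standard deviation
    noise = torch.randn(n_noise, *logits.shape)  # n_noise Gaussian samples
    corrupted = logits.unsqueeze(0) + noise * std.unsqueeze(0)
    return F.softmax(corrupted, dim=-1).mean(dim=0)  # averaged class probabilities
```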

- Revisiting the role of the dropout rate and weight regularisation

As mentioned above, one big criticism of MC Dropout is that the dropout rate and weight regularisation have a considerable impact on the values of the estimated uncertainties. For example, it has been observed that large weight magnitudes correspond to larger uncertainties.

A grid-search can be implemented to find the values of each parameter that maximises some validation metric (e.g. the validation log-likelihood when searching over values of the dropout rate, allowing the model to choose smaller rates that decrease epistemic uncertainty while still retaining a good/generalisable fit to the data). In practice, a quantity known as the prior length scale is used to compute the weight regularisation parameter (along with a constant for the inverse of the observation noise — these will be discussed more in the subsection on Concrete Dropout). This is the quantity that is grid searched. However this grid search is computationally intractable for larger models, spurring a desire for more efficient methods.
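As a rough illustration of such a search (the helper train_and_validate, the candidate values, and the exact validation metric are all hypothetical placeholders):

```python
import itertools

# Candidate values are illustrative; sensible ranges are problem specific.
dropout_rates = [0.05, 0.1, 0.2, 0.5]
prior_length_scales = [1e-2, 1e-1, 1.0]

best_config, best_val_ll = None, float("-inf")
for p, length_scale in itertools.product(dropout_rates, prior_length_scales):
    # train_and_validate is a hypothetical helper: it trains a model with the
    # given dropout rate and the weight regularisation implied by the length
    # scale, then returns the validation log-likelihood.
    val_ll = train_and_validate(dropout_rate=p, prior_length_scale=length_scale)
    if val_ll > best_val_ll:
        best_config, best_val_ll = (p, length_scale), val_ll
```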

Concrete Dropout: avoiding a grid search over the dropout rate

Concrete Dropout was proposed to eliminate the need to grid search over the dropout rate (though, not the weight regularisation), making it more efficient to derive well-calibrated uncertainties with MC Dropout. This is achieved by making the dropout rate a trainable parameter.

Gal et al. describe a fundamental trade-off between keeping epistemic uncertainty as small as possible, while still allowing the model to describe the data well. This latter constraint is important to consider, as otherwise the model could just learn weights of zero to minimise the epistemic uncertainty in the case where the dropout rate is fixed. Allowing the dropout rate to vary provides a way to maintain this balance.

In `vanilla’ MC Dropout, discrete Bernoulli distributions (parameterised by a dropout rate) are sampled to implement the dropout regularisation. However, the properties of these distributions make it challenging to learn their optimal dropout parameterisation in a differentiable framework. So Gal et al. imposed a continuous relaxation of this distribution: a Concrete distribution parameterised by a quantity known as the temperature (which determines how ‘hard’ the distribution is, i.e. how closely the sampled values lie to 0 and 1), a random number, and the dropout rate. The authors suggest default values for the temperature, the implicit suggestion being that, although this parameter is not optimised, its effect is not substantial enough to warrant explicit tuning.
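A minimal sketch of how such a relaxed dropout mask can be sampled (following the general form of the relaxation used in the Concrete Dropout paper; the temperature value and tensor shape are illustrative):

```python
import torch

def concrete_dropout_mask(p, shape, temperature=0.1, eps=1e-7):
    """Sample a continuous ('relaxed') dropout mask that is differentiable
    with respect to the dropout rate p."""
    u = torch.rand(shape)  # uniform random numbers
    drop_prob = torch.sigmoid(
        (torch.log(p + eps) - torch.log(1.0 - p + eps)
         + torch.log(u + eps) - torch.log(1.0 - u + eps)) / temperature
    )
    return 1.0 - drop_prob  # values near 1 keep a unit, values near 0 drop it

p = torch.tensor(0.1)                   # in Concrete Dropout, p is trainable
mask = concrete_dropout_mask(p, (64,))  # applied elementwise to activations
```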

The ideal value for the weight regularisation is acquired by grid searching over the prior length scale, which is used to compute the weight regularisation parameter. This prior length scale is discussed in the appendix (section 4.2) of the `original’ MC Dropout paper:

“The length-scale is a user specified value that captures our belief over the function frequency. A short length-scale l (corresponding to high frequency data) with high precision τ (equivalently, small observation noise) results in a small weight-decay λ — encouraging the model to fit the data well. A long length-scale with low precision results in a large weight-decay — and stronger regularisation over the weights. This trade-off between the length-scale and model precision results in different weight-decay values.”

For example, in the Concrete Dropout paper, they apply the technique to the MNIST dataset and aim to find the value of the prior length scale that achieves the best balance between the accuracy of the model (how often the model is correct) and its predictive log-likelihood (how well the model predicts the probability of an outcome occurring, i.e. how well it fits the data) when applied to the test set.

Excerpt from the Concrete Dropout paper describing the grid search over the prior length scale… https://arxiv.org/abs/1705.07832

The output precision parameter (the inverse of the observation noise) is also used in the calculation of the weight regularisation. Its value may be derived from assumptions about how the data is distributed, or how noise is introduced into the measurement of your data. If the observation noise is not known a priori, one could make some kind of estimate, take the inverse of this, and then grid search around that value.

Some implementations of Concrete Dropout can be found here.

Summary

MC Dropout provides a means to quantify the uncertainty of outputs produced by a neural network. `Under the hood’ it approximates variational inference, and can be thought of as allowing us to sample from the posterior and predictive distributions of a given model. However, the uncertainties are often not well calibrated: their values depend strongly on the hyperparameters chosen for training (e.g. the dropout rate). Various grid searches can be used to find the optimal values of these hyperparameters, but in practice this approach is intractable for large networks.

Concrete Dropout removes the need to grid search over the dropout rate. Instead, an optimal value is derived as part of the training objective. However, there remains a need to grid search over the weight regularisation and choose other hyperparameters. In general MC Dropout may underestimate the variance of the true posterior, leading to poor uncertainties even when these hyperparameters are optimised.

Despite these drawbacks, the technique scales well to larger models given only a single training run is needed. It is also straightforward to implement, only requiring several forward passes for each input. Therefore, it remains a valuable tool for quantifying the uncertainties of deep networks.

Thanks for reading.

Follow me on X: @ciaranbench.

My personal website is: ciaranbench.github.io
