[Bayesian DL] 4. Types of Uncertainty


There are two major types of uncertainty we can capture in Bayesian modeling: aleatoric uncertainty and epistemic uncertainty. In this article, we will look at how each of them is defined and how they are formalized as probability distributions.

1. Aleatoric Uncertainty

Aleatoric uncertainty, which is also called data-inherent uncertainty, denotes the intrinsic noise in the observations (data). Noise from sensors is one example of such uncertainty, and clearly aleatoric uncertainty cannot be resolved by collecting more data, since it is randomness inherent in the data itself.

Further, aleatoric uncertainty can be decomposed into homoscedastic uncertainty and heteroscedastic uncertainty. Homoscedastic uncertainty is invariant to the input: it remains constant regardless of which input is given. Heteroscedastic uncertainty, on the other hand, changes across inputs. In other words, for some inputs the observations are noisier than for others.

  • Heteroscedastic Aleatoric Uncertainty

In order to capture aleatoric uncertainty, [1] explains that we have to tune the observation noise parameter 𝝈, which can be interpreted as a standard deviation. As mentioned before, homoscedastic regression outputs a constant observation noise 𝝈 for all input data points 𝒙, whereas 𝝈 from heteroscedastic regression varies with the input, by definition. Heteroscedastic models are helpful when some parts of the observation space have higher noise levels than others.

Figure 1. Image with vanishing point, from [2]

One example is shown in Figure 1. Think of an image with a vanishing point in a computer vision task. As lines in the image converge around the vanishing point, the pixels near the vanishing point are likely to be noisier than pixels far away from it.

Now we know that different pixels (or input data points) can have different noise levels 𝝈. The remaining question is how we measure this 𝝈.

From now on, for the sake of simplicity, let’s assume our given task is a 1D regression problem.

[1] explains that a neural network can be trained to learn 𝝈 as a function of the input data 𝒙 with the help of the following loss function.

Figure 2. Loss function for Heteroscedastic models from [1]
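
For reference, the loss shown in Figure 2, as derived in [1], can be written as:

```latex
\mathcal{L}(\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N}
\frac{\lVert y_i - f(x_i) \rVert^{2}}{2\,\sigma(x_i)^{2}}
\;+\; \frac{1}{2} \log \sigma(x_i)^{2}
```

Here the network predicts both the mean f(x) and the noise level 𝝈(x). The first term down-weights the residual where the predicted noise is high, and the second term penalizes the model for predicting large noise everywhere.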

N denotes the size of the dataset and 𝑓 is the model, represented as a function. This loss is derived by assuming the data follow a Gaussian distribution and then taking the negative log-likelihood. We should keep in mind that MAP inference is used here instead of variational inference. This means a model trained with this loss function finds a single value for each model weight 𝚹 and always outputs the same 𝝈 for the same input 𝒙. In other words, the model is deterministic, but 𝝈 represents the level of noise.
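
As a minimal sketch of this loss (assuming PyTorch; the two output heads `mean` and `log_var` are illustrative names introduced here), note that [1] suggests training the network to predict s = log 𝝈² rather than 𝝈, which keeps the loss numerically stable:

```python
import torch

def heteroscedastic_loss(y_true, mean, log_var):
    # Negative Gaussian log-likelihood with input-dependent noise.
    # Predicting s = log(sigma^2) instead of sigma avoids division by
    # zero and exponentiating back gives the precision 1 / sigma^2.
    precision = torch.exp(-log_var)
    return torch.mean(0.5 * precision * (y_true - mean) ** 2 + 0.5 * log_var)
```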

However, the loss above does not capture epistemic uncertainty, because epistemic uncertainty is a property of the model and not of the data.

2. Epistemic Uncertainty

Epistemic uncertainty is related to the model parameters 𝚹, and it arises when the model is not trained adequately due to a lack of training data. In other words, it can be reduced and explained away once enough data is given, and it is therefore also referred to as model uncertainty.

In order to capture epistemic uncertainty, as explained in previous articles, a prior distribution is assigned to each weight in the neural network; a Gaussian prior is one common example.

Figure 3. Common choice of a prior distribution of network parameter W
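
The common choice referred to in Figure 3 is typically a zero-mean isotropic Gaussian (𝝈ₚ below denotes a prior scale hyperparameter, a name introduced here for illustration):

```latex
p(W) = \mathcal{N}\!\left(W;\; 0,\; \sigma_p^{2} I\right)
```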

As the network weights are now random variables drawn from their corresponding distributions, the network output is stochastic, and this random output is what we later use to measure the epistemic uncertainty. However, simply assigning a Gaussian prior may not be enough in some cases, and more sophisticated steps like the ones below may be needed to build a proper Bayesian Neural Network (BNN).

Figure 4. Variational inference for BNN

Please refer to this link if it is not clear why the posterior predictive distribution p(y | x, X, Y) involves the intractable posterior distribution over weights p(W | X, Y).
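
In short, the chain of quantities looks like this (a sketch following the standard BNN formulation in [3]; q(W) denotes the tractable approximating distribution):

```latex
p(y \mid x, X, Y) = \int p(y \mid x, W)\, p(W \mid X, Y)\, dW
\approx \int p(y \mid x, W)\, q(W)\, dW
```

where q(W) is fitted by minimizing KL(q(W) ‖ p(W | X, Y)), precisely because the true posterior p(W | X, Y) cannot be computed directly.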

Fortunately, instead of following the complex mathematical steps explained above, we can simply use dropout variational inference (MC-Dropout) [3] to achieve the same effect, as proved in that paper. This inference is conducted by training a standard neural network with dropout for every weight layer and keeping dropout active during the evaluation phase as well.
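
A minimal PyTorch sketch of this recipe (the architecture, dropout rate, and the helper name `mc_forward` are illustrative choices, not taken from [3]):

```python
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    """A standard regressor with dropout between the weight layers."""
    def __init__(self, in_dim=1, hidden=64, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x)

@torch.no_grad()
def mc_forward(model, x, T=50):
    """T stochastic forward passes with dropout kept active at test time."""
    model.train()  # .train() keeps the Dropout layers sampling
    return torch.stack([model(x) for _ in range(T)])  # shape (T, batch, 1)
```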

Figure 5. Variational inference to which MC-Dropout approximates

So much for the background; now let’s get back to our original goal: capturing the epistemic uncertainty.

In regression tasks, the epistemic uncertainty can be measured by the predictive variance, which can be approximated as:

Figure 6. Epistemic uncertainty computation
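
Concretely, with T stochastic forward passes ŷₜ under dropout, the epistemic part of the predictive variance reduces to the sample variance of the outputs:

```latex
\widehat{\mathrm{Var}}_{\text{epistemic}}(y) \;\approx\;
\frac{1}{T} \sum_{t=1}^{T} \hat{y}_t^{\,2}
\;-\; \Big( \frac{1}{T} \sum_{t=1}^{T} \hat{y}_t \Big)^{\!2}
```

Using the `mc_forward` sketch from above:

```python
preds = mc_forward(model, x, T=50)            # (T, batch, 1) stochastic predictions
mean_pred = preds.mean(dim=0)                 # predictive mean
epistemic = preds.var(dim=0, unbiased=False)  # sample variance over the T passes
```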

  • How to compute the epistemic uncertainty for classification tasks?

Figure 7. Posterior distribution of MC-Dropout for a classification task, from [1]
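
A short sketch for the classification case (assuming a classifier with logit outputs; averaging the softmax over T dropout samples follows Figure 7, while predictive entropy is one common, but not the only, scalar summary of the uncertainty):

```python
import torch

@torch.no_grad()
def mc_classify(model, x, T=50):
    model.train()  # keep dropout sampling active at evaluation time
    probs = torch.stack(
        [torch.softmax(model(x), dim=-1) for _ in range(T)]
    ).mean(dim=0)                        # MC-averaged softmax, shape (batch, C)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    return probs, entropy
```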

3. References

[1] A. Kendall and Y. Gal, “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?”, NeurIPS 2017.

[2] https://www.artistsnetwork.com/art-terms/vanishing-point-perspective/

[3] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning”, ICML 2016.

Any corrections, suggestions, and comments are welcome.
