[Bayesian DL] 3. Introduction to Bayesian Deep Learning

jun94 · Published in jun-devpBlog · Apr 21, 2020

1. What is a Bayesian Neural Network?

A Bayesian neural network (BNN) extends a standard neural network (SNN) by assigning distributions to its weights instead of single values.

While each weight of a standard neural network has a particular value that is multiplied with its input, each weight of a BNN has a distribution. This means that in a forward pass the weight values are stochastic, drawn from their distributions, rather than deterministic as in an SNN. Naturally, this also makes the network outputs stochastic, even when the same input is given repeatedly.

More specifically, suppose we feed the same input to the network several times. Since each weight in an SNN has a fixed, trained value, the SNN will always return the same output (deterministic). In a BNN, however, the value of each weight is drawn at random from its distribution, so the outputs of several forward passes on the same input will differ (stochastic).
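To make this concrete, here is a minimal NumPy sketch (the weight distribution and input values are illustrative, not taken from the article): a single linear unit whose weight is re-sampled from a Gaussian on every forward pass, so repeated calls on the same input return different outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "BNN" weight: a Gaussian distribution instead of a fixed value (illustrative numbers)
w_mu, w_sigma = 0.8, 0.1
x = np.array([1.0, 2.0, 3.0])  # the same input, fed several times

def bnn_forward(x):
    # Draw a fresh weight sample on every forward pass
    w = rng.normal(w_mu, w_sigma)
    return w * x

for _ in range(3):
    print(bnn_forward(x))  # three different outputs for the same input

# A standard NN would always return w_mu * x (deterministic)
```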

Figure 1 clearly illustrates the difference between an SNN and a BNN.

Figure 1. (from here) Example of a CNN as an SNN (left) and as a BNN (right)

2. Uncertainty: Why do we assign distributions to network weights?

Standard deep learning architectures do not allow uncertainty to be represented in regression settings. In classification, the softmax outputs are often interpreted as the model’s confidence, but they still do not capture the model uncertainty.

Then one question arises: why do we need to capture the model uncertainty?

In classification tasks, the model outputs the label whose element in the softmax-output vector has the highest probability (confidence).
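As a quick reminder of this rule, a minimal sketch (the logits are made-up values): the predicted label is the argmax of the softmax vector, and the winning probability is what is usually read as the ‘confidence’.

```python
import numpy as np

logits = np.array([2.0, 0.5, 0.1])             # raw network outputs (illustrative)
probs = np.exp(logits) / np.exp(logits).sum()  # softmax probabilities

pred_label = int(np.argmax(probs))    # label with the highest probability
confidence = float(probs[pred_label]) # often read as "confidence", but not uncertainty
print(pred_label, confidence)
```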

Let’s take a look at the figure below.

Figure 2. The softmax outputs for the input images shown below each distribution

This visualizes the softmax output for a set of MNIST images. In this example, there is only an image of ‘1’ and its rotated versions, as shown at the bottom of Figure 2. For the images of ‘1’ in the bottom-left corner, which are not rotated or only slightly rotated, the model correctly classifies them as ‘1’ with high probability (confidence).

However, as the ‘1’ gets rotated further, the model becomes less sure about the given images and starts classifying them (the bottom-middle part) as ‘5’.

Ideally, since no such digit exists, it would be desirable for the model to tell us “I don’t know” instead of returning an uncertain prediction.

This is where the concept of ‘uncertainty’ comes in. Let the model say “I am not sure about the given input”, so that a human can take over such uncertain inputs and classify them.

Therefore, in order to capture this uncertainty, we assign distributions to the model weights, since the variance of a distribution reflects the uncertainty. (This is what makes our model capable of saying “I don’t know”.)

Figure 3. SNN (left) vs BNN (right)

Figure 3 shows another example of a BNN (right-hand side). Since its output unit Y is computed as the sum of products of the activations H and the corresponding weights drawn from distributions, Y also has a distribution. Thus, the variance of the distribution of Y can be used to measure how uncertain the model is about its output.
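One common way to estimate this variance in practice is Monte Carlo sampling: run several stochastic forward passes on the same input, take the sample mean as the prediction and the sample variance as the uncertainty. A minimal NumPy sketch, reusing the toy Gaussian-weight unit from the first sketch (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

w_mu, w_sigma = 0.8, 0.1       # weight distribution (illustrative)
x = np.array([1.0, 2.0, 3.0])  # a single input

def bnn_forward(x):
    w = rng.normal(w_mu, w_sigma)  # sample a weight per pass
    return w * x

# Monte Carlo estimate of the output distribution
samples = np.stack([bnn_forward(x) for _ in range(1000)])
y_mean = samples.mean(axis=0)  # use as the prediction
y_var = samples.var(axis=0)    # use as the model uncertainty
print(y_mean, y_var)

# One possible "I don't know" rule: abstain when the variance is too large
threshold = 0.5  # illustrative value
abstain = y_var.max() > threshold
```

A simple rule for saying “I don’t know” is then to abstain, and hand the input to a human, whenever this variance exceeds a chosen threshold.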

3. How to train the model and obtain its predictions

What we ultimately want from machine (deep) learning is the predictive distribution p(y*|x*, X, Y), where X and Y are the given training set (events that have already happened) and y* is what we want to predict for a new input x*.

Figure 4. (from here) The distribution p(y*|x*, X, Y), where X and Y are represented as red dots

However, directly calculating p(y*|x*, X, Y) is difficult. So what has usually been done is to use techniques that yield reasonable predictions and are easy to compute, such as Maximum Likelihood Estimation (MLE), which finds the weights 𝚯 that maximize the likelihood p(Y|X, 𝚯), and Maximum A Posteriori (MAP) estimation, which finds the 𝚯 that maximizes the posterior p(𝚯|X, Y).
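Written out explicitly (using the same notation, with X and Y denoting the training set), the two point estimates are:

```latex
\hat{\Theta}_{\mathrm{MLE}} = \arg\max_{\Theta}\, p(Y \mid X, \Theta),
\qquad
\hat{\Theta}_{\mathrm{MAP}} = \arg\max_{\Theta}\, p(\Theta \mid X, Y)
                            = \arg\max_{\Theta}\, p(Y \mid X, \Theta)\, p(\Theta)
```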

However, since those techniques only compute a particular value of 𝚯 rather than a distribution, and we want a distribution, we need a different approach called the predictive distribution (Bayesian prediction).

Figure 5. The equation for Predictive Distribution

The derivation of this formula simply begins with the marginal distribution, as shown below.

Figure 6. Derivation of the predictive distribution
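In symbols, the derivation in Figures 5 and 6 is the standard marginalization over the weights (assuming y* depends only on x* and 𝚯, and 𝚯 does not depend on x*):

```latex
p(y^{*} \mid x^{*}, X, Y)
  = \int p(y^{*}, \Theta \mid x^{*}, X, Y)\, d\Theta
  = \int \underbrace{p(y^{*} \mid x^{*}, \Theta)}_{\text{likelihood}}\;
         \underbrace{p(\Theta \mid X, Y)}_{\text{posterior}}\, d\Theta
```

The second factor is the posterior over the weights.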

As this posterior is intractable, we use another technique, called variational inference, to approximate it.

We will deal with this in detail in the next chapters. Briefly, we assume a distribution q(𝚯) and try to make q(𝚯) approximate the posterior p(𝚯|X, Y) by reducing the distance between them, measured by the KL divergence. In this way, we update the weight distribution q(𝚯) during training, and at test time we substitute q(𝚯) for p(𝚯|X, Y) to compute an approximation of the distribution p(y*|x*, X, Y).
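In symbols (this is the standard variational-inference setup, sketched here rather than derived): minimizing the KL divergence between q(𝚯) and the posterior is equivalent to maximizing the evidence lower bound (ELBO), and at test time the posterior in the predictive integral is replaced by q(𝚯) and estimated with Monte Carlo samples.

```latex
q^{*} = \arg\min_{q}\; \mathrm{KL}\!\left(q(\Theta)\,\|\,p(\Theta \mid X, Y)\right)
      = \arg\max_{q}\; \mathbb{E}_{q(\Theta)}\!\left[\log p(Y \mid X, \Theta)\right]
        - \mathrm{KL}\!\left(q(\Theta)\,\|\,p(\Theta)\right)

p(y^{*} \mid x^{*}, X, Y) \approx \int p(y^{*} \mid x^{*}, \Theta)\, q(\Theta)\, d\Theta
  \approx \frac{1}{T} \sum_{t=1}^{T} p(y^{*} \mid x^{*}, \Theta_{t}),
  \qquad \Theta_{t} \sim q(\Theta)
```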

Note that we are now dealing with distributions, not with a single value that maximizes a likelihood as in MLE or MAP.

4. Conclusion

So far, we have looked at the basic concepts of BNNs and the general picture of how to train them and run inference.

In the next articles, we will take a closer look at the types of uncertainty and at how we approximate the intractable posterior p(𝚯|X, Y).

5. Reference

[1] Uncertainty Estimations by Softplus normalization in Bayesian Convolutional Neural Networks with Variational Inference

[2] https://blog.wolfram.com/2018/05/31/how-optimistic-do-you-want-to-be-bayesian-neural-network-regression-with-prediction-errors/

[3] Bayesian deep learning

[4] https://newsight.tistory.com/309

[5] https://medium.com/neuralspace/bayesian-neural-network-series-post-1-need-for-bayesian-networks-e209e66b70b2

[6] https://towardsdatascience.com/making-your-neural-network-say-i-dont-know-bayesian-nns-using-pyro-and-pytorch-b1c24e6ab8cd

Any corrections, suggestions, and comments are welcome.
