# Bayesian Convolutional Neural Networks with Bayes by Backprop

So far, we have covered how *Bayes by Backprop* works on a simple feedforward neural network. In this post, I will explain how you can apply exactly this framework to any convolutional neural network (CNN) architecture you like.

You might have seen Gal & Ghahramani’s (2015) publication of a Bayesian CNN, but that is an entirely different approach and, in my opinion, not comparable with Bayes by Backprop. Personally, I would not even call their method a Bayesian CNN, but we won’t cover the differences in detail here. Feel free to read Shridhar et al. (2018), who discuss the discrepancies.

If you are a proponent and user of TensorFlow, Dustin Tran and colleagues just implemented *Bayes by Backprop* in the TensorFlow Probability library. They presented it at NeurIPS 2018, see library here and paper here. Nonetheless, I would like to stress that Shridhar et al. (2018) were the first to implement *Bayes by Backprop* in a CNN. They wrote their code in PyTorch, which you can view here. But enough about who, what, and where, let’s get started.

# Backpropagation Recap

We can see any deep learning model as an amalgamation of some inputs *x*, weights *W*, non-linear transformations *g(·)*, and outputs *ŷ*. First, inputs *x* are multiplied by weights *W*, and their product is then non-linearly transformed, e.g. with the *sigmoid* function as symbolised in the graph below. The outputs *ŷ* of the entire model are simply the result of the last non-linear transformation *g(·)*.

And how does the model actually *learn* anything? It updates its weights *W* so that as many outputs *ŷ* as possible match their labels *y*. To do so, we first calculate the cost, error, or loss *C* of a given batch of *N* data examples. You can take any cost, error, or loss function, but for illustrative purposes we choose the rather simple Euclidean loss here.

We’ll speak of *loss* from here on, but use whatever phrase you prefer. Loss, error, and cost are all the same.

Second, we calculate how much “responsibility” each of the weights has for the loss of a given batch. This is shown here by way of example for the weight *w11*. The first *1* stands for the layer and the second *1* for the position in this layer. Hence, *w11* is the weight that is multiplied with the inputs *x*, and this product is non-linearly transformed in node *1* of the first layer of activation functions.

So, we first take the value of *w11* as it was initialised (or from the previous training iteration if we are in the midst of training). We then subtract from it the partial derivative of the total loss with respect to that weight, here *w11*, which is our aforementioned “responsibility”, multiplied by a learning rate *α*.

This will update each weight value *w* such that the overall loss is minimised, i.e. more inputs *x* are classified correctly as outputs *ŷ = y*. This is precisely what happens in backpropagation.
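To make the recap concrete, here is a minimal NumPy sketch of one layer with a sigmoid non-linearity, a Euclidean loss, and the update "weight minus learning rate times partial derivative" described above. The function names (`forward`, `gradient`, `sgd_step`), the toy data, and the 1/(2N) scaling of the loss are my own assumptions for illustration and are not taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W):
    # Inputs times weights, then the non-linear transformation g(.)
    return sigmoid(x @ W)

def euclidean_loss(y_hat, y):
    # C = 1/(2N) * sum over the batch of ||y_hat - y||^2 (scaling is an assumption)
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def gradient(x, W, y):
    # Chain rule: dC/dW = x^T [(y_hat - y) * g'(z)] / N, with g'(z) = y_hat * (1 - y_hat)
    y_hat = forward(x, W)
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)
    return x.T @ delta / x.shape[0]

def sgd_step(x, W, y, lr):
    # w <- w - lr * dC/dw, applied to every weight at once (lr plays the role of alpha)
    return W - lr * gradient(x, W, y)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))              # toy batch: N = 8 examples, 3 features
y = (x[:, :1] > 0).astype(float)         # toy labels in {0, 1}
W = rng.normal(scale=0.1, size=(3, 1))   # initial weights

loss_before = euclidean_loss(forward(x, W), y)
for _ in range(200):
    W = sgd_step(x, W, y, lr=0.5)
loss_after = euclidean_loss(forward(x, W), y)
print(loss_before, loss_after)
```

Each entry of `gradient(x, W, y)` is the "responsibility" of one weight for the batch loss, so repeated calls to `sgd_step` should lower the loss on this toy batch.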

# Probabilistic Backpropagation

I explained in my *Bayes by Backprop* post how backpropagation is applied in probabilistic deep learning models, but will briefly recap it here. As you know, the main difference is that we do not deal with a single point-estimate as the value of a weight *w*, but with a probability distribution parameterised by *θ*. In the case of a Gaussian distribution we have two values to learn, mean *μ* and standard deviation *σ*, and not only one. We denote this by *θ = {μ, σ}* and simply assume a Gaussian distribution whenever we speak of probability distributions. This makes it much easier to understand in the first instance. The issue, put in rather non-technical language, is that we cannot simply take a derivative with respect to two values at once.

But we have found a way to circumvent this issue: we define a variational distribution *q* that approximates the true posterior distribution *p* (which is intractable in Bayes’ rule), sample from it, and apply Kingma & Welling’s (2015) reparameterisation trick to retain the information about the distribution parameters *θ*. The goal is to make this variational distribution *q* as similar as possible to the true posterior distribution *p* around a local minimum of the true posterior probability distribution *p*, which eventually gives us a small loss (see previous post for details). The maximum “similarity” is constrained by the evidence lower bound (ELBO).

This entire procedure is known as *Bayes by Backprop*, which is a form of variational inference. Please refer to my *Bayes by Backprop* post if this was all a bit too fast for you.
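As a small illustration of the reparameterisation trick mentioned above, the sketch below writes a Gaussian weight as *w = μ + σ·ε* with *ε ∼ N(0, 1)*, so that gradients with respect to both *μ* and *σ* follow from the chain rule. The toy loss *(w − target)²*, the function names, and all numbers are hypothetical. The sketch also deliberately omits the KL term of the ELBO, so it shows only the gradient flow and not the full *Bayes by Backprop* objective.

```python
import numpy as np

def sample_weight(mu, sigma, eps):
    # Reparameterisation: the randomness lives in eps, not in mu or sigma
    return mu + sigma * eps

def reparam_gradients(dloss_dw, eps):
    # dw/dmu = 1 and dw/dsigma = eps, so the chain rule gives:
    return dloss_dw, dloss_dw * eps

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0   # variational parameters theta = {mu, sigma}
target = 3.0           # hypothetical optimum of the toy loss
lr = 0.05

for _ in range(2000):
    eps = rng.standard_normal()
    w = sample_weight(mu, sigma, eps)
    dloss_dw = 2.0 * (w - target)            # derivative of (w - target)^2
    g_mu, g_sigma = reparam_gradients(dloss_dw, eps)
    mu -= lr * g_mu
    sigma -= lr * g_sigma

print(mu, sigma)
```

Because the KL term is left out here, nothing stops *σ* from shrinking towards zero, and the distribution collapses to a point estimate at the target. In the full method, the prior term in the ELBO is what keeps the variance from vanishing entirely.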

# Bayesian Convolutional Neural Networks with Variational Inference

As you might guess, this could become a bit tricky in CNNs, because we do not only deal with stand-alone weights as we do in feedforward neural networks. Here we deal with filters, which can be seen as collections of weights forming a new entity. But we still place probability distributions over the weights in these filters. Below, I plot on the **left** a CNN with single point-estimates as weights, and on the **right** a CNN with probability distributions over weights, to give you a sense of comparison.

As you might know, many CNNs consist of filter layers which build feature maps, pooling layers, and fully-connected layers to do a final classification. See the graph below for a basic illustration.

## Local reparameterisation trick for convolutional layers

Recall everything that was necessary to apply backpropagation to a feedforward neural network with probability distributions over weights: defining a variational distribution *q*, sampling from it, and applying the local reparameterisation trick. We do this in a slightly modified manner for CNNs: we do not sample the weights *w*, but instead sample the layer activations *b*, because this is computationally faster. The variational posterior probability distribution

$$q_\theta(w_{ijhw} \mid \mathcal{D}) = \mathcal{N}\left(\mu_{ijhw},\ \alpha_{ijhw}\,\mu_{ijhw}^2\right)$$

(where *i* and *j* are the input and output layers, respectively, and *h* and *w* the height and width of any given filter) allows us to implement the local reparameterisation trick in convolutional layers. Note our new definition *αμ²* of the variance of a Gaussian distribution: we multiply a scaling factor *α* with the squared mean *μ²*. This results in the following equation for the convolutional layer activations *b*:

$$b_j = A_i \ast \mu_i + \epsilon_j \odot \sqrt{A_i^2 \ast \left(\alpha_i \odot \mu_i^2\right)}$$

where *ϵj ∼ N(0, 1)*, *Ai* is the receptive field, ∗ denotes the convolutional operation, and ⊙ the component-wise multiplication.
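The sampling of the activations *b* described above can be sketched in plain NumPy for a single input channel and a single filter. This is only an illustration under my own assumptions: the helper `conv2d_valid` is a naive "valid" cross-correlation (which is what deep learning libraries usually call convolution), and the names, shapes, and values are hypothetical and not taken from the code of Shridhar et al.

```python
import numpy as np

def conv2d_valid(a, k):
    # Naive "valid" 2-D cross-correlation of input a with kernel k
    kh, kw = k.shape
    oh, ow = a.shape[0] - kh + 1, a.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(a[r:r + kh, c:c + kw] * k)
    return out

def bayes_conv_activation(a, mu, alpha, eps):
    # Mean of the activations: A convolved with mu
    mean = conv2d_valid(a, mu)
    # Variance of the activations: A^2 convolved with (alpha * mu^2), elementwise
    var = conv2d_valid(a ** 2, alpha * mu ** 2)
    # Local reparameterisation: sample the activations, not the weights
    return mean + eps * np.sqrt(var), mean, var

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 6))               # toy receptive field / input
mu = rng.normal(scale=0.5, size=(3, 3))   # filter means
alpha = np.full((3, 3), 0.1)              # scaling factors of the variance
eps = rng.standard_normal(size=(4, 4))    # one noise value per output position

b, mean, var = bayes_conv_activation(a, mu, alpha, eps)
print(b.shape)
```

Note that the variance is non-negative by construction, because both the squared input and *αμ²* are non-negative, so the square root is always defined as long as *α ≥ 0*.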

## Applying two convolutional operations for mean and variance

The crux of equipping a CNN with probability distributions over weights instead of single point-estimates, and of being able to update the variational posterior probability distribution *q* by backpropagation, lies in applying two convolutional operations where filters with single point-estimates apply one. Since the output *b* is a function of, among other things, the mean *μ* and the variance *αμ²*, we are then able to compute these two variables, which determine a Gaussian probability distribution, separately.

We do this in two convolutional operations. In the first, we treat the output *b* as the output of a CNN updated by frequentist inference. We optimise with Adam towards a single point-estimate that increases the validation accuracy of the classifications. We interpret this single point-estimate as the mean *μ* of the variational posterior probability distribution *q*. In the second convolutional operation, we learn the variance *αμ²*. As this formulation of the variance includes the mean *μ*, only *α* needs to be learned here. In this way, we ensure that only one parameter is updated per convolutional operation, exactly as it would be in a CNN updated by frequentist inference.

In other words, while we learn in the first convolutional operation the maximum-a-posteriori (MAP) estimate of the variational posterior probability distribution *q*, we observe in the second convolutional operation how much the values of the weights *w* deviate from this MAP estimate. This procedure is repeated in the fully-connected layers.
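Since the same procedure is repeated in the fully-connected layers, the two-operation pattern is easiest to check there, where each "convolution" reduces to a matrix product. The sketch below is a hypothetical toy example (names, shapes, and values are my own assumptions). It compares sampling the weights directly, using a standard deviation of *√α·|μ|* which follows from the variance *αμ²*, against sampling the activations from one product for the mean and one for the variance.

```python
import numpy as np

def bayes_linear_activation(a, mu, alpha, eps):
    mean = a @ mu                          # first operation: mean of the activations
    var = (a ** 2) @ (alpha * mu ** 2)     # second operation: variance of the activations
    return mean + eps * np.sqrt(var)

rng = np.random.default_rng(0)
a = rng.normal(size=(1, 5))       # one toy input with 5 features
mu = rng.normal(size=(5, 2))      # weight means
alpha = np.full((5, 2), 0.2)      # variance scaling factors
n = 50_000                        # number of Monte Carlo samples

# (1) Sample the weights themselves: w ~ N(mu, alpha * mu^2)
w = mu + np.sqrt(alpha) * np.abs(mu) * rng.standard_normal(size=(n, 5, 2))
b_from_weights = (a @ w)[:, 0, :]

# (2) Sample the activations directly (local reparameterisation)
eps = rng.standard_normal(size=(n, 2))
b_from_activations = bayes_linear_activation(a, mu, alpha, eps)

print(b_from_weights.mean(axis=0), b_from_activations.mean(axis=0))
print(b_from_weights.std(axis=0), b_from_activations.std(axis=0))
```

Both routes should give activations with matching mean and standard deviation up to Monte Carlo error, but the second needs only one noise value per output instead of one per weight, which is where the computational saving comes from.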

# Experiments with Bayesian CNNs

This time, let’s not only have theoretical explanations, but also look at some examples. As I mentioned earlier, Gal & Ghahramani (2015) used Dropout to approximate the intractable posterior probability distribution and then spoke of a Bayesian CNN. Despite the methodological deficiencies, their results are comparable to ours and, for CIFAR-10, even better. For the results in the table below, we used LeNet-5 and AlexNet and compared the results achieved by frequentist and Bayesian inference.

Next, we show how Bayesian CNNs naturally incorporate a regularisation effect. This phenomenon might be called model averaging in other literature. While an AlexNet trained on CIFAR-100 by frequentist inference overfits heavily, we do not see any signs of overfitting with Bayesian inference. Furthermore, if we only consider the regularisation effect, Bayesian inference is comparable to using three layers of Dropout. Still, this doesn’t allow us to speak of Bayesian methods when implementing Dropout.

And lastly, let’s see how a variational probability distribution *q* of a random weight *wij* actually changes over epochs. Here, plots for epochs *1*, *5*, *20*, *50*, and *100* are given.

And the decrease of the standard deviation is even fairly smooth over the entire training duration of *100* epochs.

This is pretty neat, because we can say that our model becomes more certain, or more confident, in making decisions, while it still incorporates an aspect of uncertainty, namely the variance. Adopting this formulation, we can say that frequentist inference models are overly confident in their decisions, because they do not incorporate any aspect of uncertainty.

That is all you need to know about Bayesian CNNs with *Bayes by Backprop*. I hope you enjoyed reading. Finally, let me stress that *Bayes by Backprop* is probably the most sophisticated method for Bayesian inference in deep learning models. But others exist: having an L2 regularisation plus Dropout (or the lately invented Flipout for CNNs) gives us a Gaussian prior *p(w)* and an approximation of the intractable posterior probability distribution *p(w|D)*. Read this derivation for clarifications.