Reflections on Bayesian Inference in Probabilistic Deep Learning
In my last post, we explored the options we have for inferring an intractable posterior probability distribution. That discussion was rather generic, because it dealt with probabilistic models in general, which might be neural networks, but not necessarily.
This post covers how Bayesian inference is applied to neural networks: here, we reflect on Bayesian inference in deep learning, i.e. on Bayes by Backprop.
In its quintessence, interpreting neural networks from a Bayesian perspective means introducing uncertainty into the neural network. This uncertainty does not necessarily have to be placed on the model parameters, i.e. the weights, as is done in Bayes by Backprop. It can also be placed on the number of hidden layers and hidden units, or on the activation function. These factors of potential uncertainty may be summarised as structure parameters, and learning their posterior probability distributions may be construed as structure learning. There are also many more potentially uncertain factors, like the number of epochs, weight initialisation, loss function, batch size, etc., which are most likely not as influential as the structure parameters, but could still be treated as uncertain, hence be learnt.
This might sound familiar to some of you. Indeed, learning the aforementioned factors is also known as meta-learning. Learning them in a probabilistic manner might be referred to as probabilistic meta-learning (although I haven’t seen anyone doing it yet).
Much of this is beyond the scope of this post. Basically, I started to reflect critically on Bayes by Backprop when I dug into Monte Carlo methods for probabilistic models (see my last post for details). I asked myself:
Why can we not sample directly from the intractable, and thus unknown, posterior probability distribution to arrive at some approximate posterior probability distribution? Why must we first define a variational posterior probability distribution and then sample from that?
Sampling from an unknown probability distribution is possible, of course; that is exactly what Monte Carlo methods were developed for. Doing so would give us values for the weights of the neural network, we could apply backpropagation, and we would be happy.
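To make this concrete, here is a minimal Metropolis sketch for a single weight. Note that `unnormalised_posterior` is a completely hypothetical stand-in: in a real network, evaluating it would mean computing prior times likelihood over the training data for every single proposal.

```python
import numpy as np

def unnormalised_posterior(w):
    # Hypothetical stand-in for prior x likelihood of one weight.
    # Strictly positive, so the acceptance ratio below is well defined.
    return np.exp(-0.5 * w**2) * (2 + np.sin(5 * w))

def metropolis_sample(n_samples, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    w = 0.0  # current weight value
    samples = []
    for _ in range(n_samples):
        proposal = w + step * rng.standard_normal()
        # Accept with probability min(1, p(proposal) / p(current));
        # the unknown normalising constant cancels in this ratio.
        if rng.random() < unnormalised_posterior(proposal) / unnormalised_posterior(w):
            w = proposal
        samples.append(w)
    return np.array(samples)

samples = metropolis_sample(10_000)
print(samples.mean(), samples.std())
```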
But how successful and efficient is this sampling from a distribution that might look rather complicated?
Let me illustrate this. You may already know that I like to compare distributions with landscapes. Look at the following two distributions in a picture of the Swiss Alps. Each distribution represents one weight’s probability density (y-axis) as a function of its value (x-axis). Do you really think we would draw values that eventually lead us to a local optimum giving us high prediction accuracy?
Probably not. Both distributions look complicated, and we might end up in a local optimum that does not give us high prediction accuracy. We would need a huge number of samples, i.e. a huge number of training iterations, to finally achieve high prediction accuracy. Since we all know how much time training neural networks already takes, we want to avoid this at all costs.
So, what can we do now?
This step is what makes Bayes by Backprop stand out: we define another, much simpler approximate distribution, sample from it, and learn the parameters of this distribution by backpropagation. This approach is known as variational inference (not the Laplace approximation, which instead fits a Gaussian at a mode of the true posterior), and the simpler distribution is called the variational posterior. Its parameters are very often called θ. Because the Gaussian distribution has so many well-known properties that make it easy to work with, we simply take a Gaussian distribution as this approximate distribution. The parameters of a Gaussian are simply θ = {μ, σ}, with mean μ and standard deviation σ.
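As a minimal sketch of what this looks like in code, assuming PyTorch and the σ = softplus(ρ) parameterisation used by Blundell et al. (2015) to keep σ positive:

```python
import torch
import torch.nn.functional as F

# Variational parameters theta = {mu, rho} for a single weight;
# sigma = softplus(rho) = log(1 + exp(rho)) keeps sigma positive.
mu = torch.zeros(1, requires_grad=True)
rho = torch.full((1,), -3.0, requires_grad=True)

def sample_weight():
    sigma = F.softplus(rho)
    eps = torch.randn_like(sigma)  # parameter-free noise
    # Reparameterisation: the sample is a deterministic function of
    # (mu, rho, eps), so gradients can flow back into mu and rho.
    return mu + sigma * eps
```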
Let’s take a step back and look at an example:
We have a supervised classification data set, e.g. MNIST. The true but unknown posterior probability distributions over all weights would give us the highest achievable prediction accuracy. But as we just said, these posterior distributions are very complicated and might never be learnt, i.e. they are intractable. When we sample from a simpler approximate distribution instead, we still compare our predictions, which are based on the samples from this simpler distribution over the weights, with the class of a given input image. As you probably know, a value proportional to the number of wrongly classified images in a batch is calculated by the loss function. With backpropagation, we can then adjust, i.e. learn, the parameters that produced these weight samples, which makes the loss smaller over the epochs. In our case, we do not learn a single point estimate for any given weight; we learn the two parameters μ and σ of a Gaussian distribution. So, when we learn these distributions, the mean μ will naturally converge towards a local optimum.
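Putting the pieces together, here is a simplified sketch of one training step, assuming a single Bayesian layer and dummy data in place of MNIST. The full Bayes by Backprop objective also contains a KL term between the variational posterior and the prior, which is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

class BayesianLinear(torch.nn.Module):
    """Linear layer with a factorised Gaussian over each weight (biases omitted)."""
    def __init__(self, n_in, n_out):
        super().__init__()
        self.w_mu = torch.nn.Parameter(torch.zeros(n_out, n_in))
        self.w_rho = torch.nn.Parameter(torch.full((n_out, n_in), -3.0))

    def forward(self, x):
        sigma = F.softplus(self.w_rho)
        # Draw one weight sample per forward pass (reparameterisation trick).
        w = self.w_mu + sigma * torch.randn_like(sigma)
        return x @ w.t()

layer = BayesianLinear(784, 10)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)

x = torch.randn(32, 784)         # dummy stand-in for a flattened MNIST batch
y = torch.randint(0, 10, (32,))  # dummy class labels

opt.zero_grad()
logits = layer(x)
# Data term only: the full objective would add the KL complexity term here.
loss = F.cross_entropy(logits, y)
loss.backward()   # gradients flow into w_mu and w_rho, not into a point weight
opt.step()
```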
But, do you really think a Gaussian distribution could approximate any of the distributions in the Swiss Alps well?
They seem fairly complicated, so approximating them in their entirety with a single Gaussian distribution is generally not feasible.
So, why do we still use Gaussian distributions then?
Are we really interested in approximating the entire distribution? Or are we only interested in approximating the most probable values, because those are what we optimise for anyway? The latter is the case, and that is precisely why an approximate Gaussian distribution is entirely sufficient.
In the end, the values we sample for prediction or classification should come from regions very close to a local optimum of the weight distribution, one that gives us high prediction or classification accuracy. Hence, a Gaussian distribution that closely matches the true posterior probability distribution around such a local optimum is all we need.
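As a toy numerical illustration of this point, with a completely made-up bimodal "true" posterior for one weight:

```python
import numpy as np

# Made-up bimodal "true" posterior for a single weight (illustration only).
def true_posterior(w):
    return 0.6 * np.exp(-0.5 * ((w - 2.0) / 0.3) ** 2) \
         + 0.4 * np.exp(-0.5 * ((w + 1.0) / 0.8) ** 2)

# Gaussian matched to the mode at w = 2: same location, height, and width.
def gaussian_approx(w, mu=2.0, sigma=0.3):
    return 0.6 * np.exp(-0.5 * ((w - mu) / sigma) ** 2)

near_mode = np.linspace(1.4, 2.6, 7)   # region around the optimum we sample from
far_away = np.linspace(-2.0, 0.0, 7)   # second mode, which we ignore

print(np.abs(true_posterior(near_mode) - gaussian_approx(near_mode)).max())  # tiny: good fit locally
print(np.abs(true_posterior(far_away) - gaussian_approx(far_away)).max())    # large: bad fit globally
```

Near the mode the Gaussian matches the true posterior almost perfectly, while far away it does not, and that is fine, because optimisation only ever takes us near the mode.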
Let me know if you have any questions; I’m happy to answer them. I hope this post deepens your knowledge of Bayes by Backprop and makes the case for its use. Some readings I can suggest are listed below.
NeuralSpace uses probabilistic deep learning models in its products and does fascinating things with them. Check out its latest news or try its demos yourself.
Further reading:
MacKay, D. J. (1992). Bayesian interpolation. Neural computation, 4(3), 415–447.
Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory (pp. 5–13). ACM.
Graves, A. (2011). Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (pp. 2348–2356).
Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015). Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.