[Paper Readthrough] — Variational Bayesian Monte Carlo

Nicola Bernini
Discussing Deep Learning
3 min read · May 16, 2019

Overview

The challenge of this summary is to explain a complex paper in a simple way, with

  • super-minimal math required (pretty hard in this case)
  • a layered structure, so the reader can be exposed to more and more details at their discretion

Original Paper: https://arxiv.org/abs/1810.05558

TL;DR

  • Performing Bayesian Inference has a lot of very important practical applications
  • It means computing the Posterior and the Marginal Likelihood (also called the Evidence)
  • Unfortunately this is intractable in general, hence it is necessary to compute approximations for both
  • In general there are 2 approaches, Variational Methods and Monte Carlo Methods, trading off knowledge of the function (access to derivatives) against sample efficiency
  • Sample efficiency is a key feature for being able to do this in practice
  • MCMC is a standard tool for Bayesian Inference but it is not very sample efficient
  • This paper introduces VBMC as a new sample-efficient Bayesian Inference tool

Some Details (Level 1)

The Problem

Let’s say we have a parametric model and, given a Dataset, we want to estimate the Probability Density Function (PDF) of each of its parameters

Problem Definition

In this case Θ denotes the full parameter space
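In standard notation (not taken verbatim from the paper), the quantities involved are

p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int_{\Theta} p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta

where p(\theta \mid \mathcal{D}) is the Posterior, p(\mathcal{D} \mid \theta) the Likelihood, p(\theta) the Prior and p(\mathcal{D}) the Marginal Likelihood (Evidence); the integral over \Theta is what makes the problem intractable in general.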

This problem is theoretically modeled in the Bayesian Framework and its solution consists of performing Bayesian Inference, for which some tools exist, like MCMC. However, there are some problems limiting its straightforward use in practice, essentially related to the Likelihood.

Likelihood

In general, let’s consider the Model Likelihood as a complex “black box” function (for example, imagine it’s a big Neural Network)

So the only way to know something about it is to sample it (like a black box, provide input and observe output)

Ideally, with an unlimited evaluation budget, we could reconstruct this function with arbitrary precision. In practice, however, the evaluation budget is limited compared to the Likelihood’s complexity, hence the function needs to be approximated using sample-efficient approximation techniques.
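As a minimal sketch of this black-box setting (the class and names below are illustrative assumptions, not part of the paper), the only allowed operation is evaluating the function, under a fixed budget:

import numpy as np

class BudgetedBlackBox:
    def __init__(self, fn, budget):
        self.fn, self.budget, self.used = fn, budget, 0

    def __call__(self, theta):
        # Like a black box: input in, output out, no gradients, no internals
        if self.used >= self.budget:
            raise RuntimeError("evaluation budget exhausted")
        self.used += 1
        return self.fn(theta)

# e.g. wrap a hypothetical expensive log likelihood with a budget of 50 evaluations
log_lik = BudgetedBlackBox(lambda theta: -0.5 * np.sum(theta ** 2), budget=50)
print(log_lik(np.zeros(3)), log_lik.used)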

Realistic Likelihoods are hard:

  • High-Dimensional
  • Multi-Modal (many modes, i.e. many local maxima)
  • Heavy-Tailed (extreme events)
  • Correlated Parameters

Tools

  • Variational Approach
  • Active Sampling
  • Gaussian Process

Variational Approach — Core Idea

Substitute the complex True Posterior with a simpler PDF, called the Variational Posterior, fitting its parameters in a dissimilarity-minimization framework, using a suitable measure of discrepancy between PDFs (e.g. the KL Divergence)

Variational Inference is performed via Optimization, and solving this optimization problem gives 2 results (a minimal code sketch follows the list below)

  • the Posterior Approximation
  • the Evidence Approximation (ELBO)
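To make the idea concrete, here is a minimal, self-contained sketch (not the paper’s VBMC algorithm; the target function and all constants are illustrative assumptions): it fits a 1-D Gaussian Variational Posterior by gradient ascent on a Monte Carlo estimate of the ELBO.

import numpy as np

def log_joint(theta):
    # Hypothetical unnormalized log p(D, theta): a two-component mixture used
    # as a stand-in for "log Likelihood + log Prior" of a real model
    return np.logaddexp(-0.5 * (theta - 1.0) ** 2,
                        np.log(0.3) - 0.5 * ((theta + 2.0) / 0.5) ** 2)

def elbo_hat(mu, log_sigma, eps):
    # Monte Carlo ELBO estimate: E_q[log p(D, theta)] + entropy of q
    theta = mu + np.exp(log_sigma) * eps                      # reparameterization trick
    entropy = 0.5 * np.log(2.0 * np.pi * np.e) + log_sigma    # Gaussian entropy
    return log_joint(theta).mean() + entropy

rng = np.random.default_rng(0)
mu, log_sigma = 0.0, 0.0      # variational parameters of q = N(mu, sigma^2)
lr, h = 0.01, 1e-4

for step in range(3000):
    eps = rng.standard_normal(64)   # shared samples for all finite differences
    # Central-difference gradient ascent on the ELBO estimate
    g_mu = (elbo_hat(mu + h, log_sigma, eps) - elbo_hat(mu - h, log_sigma, eps)) / (2 * h)
    g_ls = (elbo_hat(mu, log_sigma + h, eps) - elbo_hat(mu, log_sigma - h, eps)) / (2 * h)
    mu, log_sigma = mu + lr * g_mu, log_sigma + lr * g_ls

print(f"Posterior approximation: N(mu={mu:.3f}, sigma={np.exp(log_sigma):.3f})")
print(f"Evidence approximation (ELBO): {elbo_hat(mu, log_sigma, rng.standard_normal(10000)):.3f}")

The two printed values mirror the list above: the fitted Gaussian is the Posterior Approximation and the final ELBO value is the (lower bound on the log) Evidence Approximation.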

Active Sampling

It consists of having an algorithm make a smart choice about where to sample the unknown function (the Likelihood), in order to optimize some criterion while staying within the samples budget

In mathematical terms, the algorithm essentially consists of solving a specific optimization problem on the acquisition function: a function which connects the selection of the next sample with the chosen criterion (e.g. posterior variance minimization)
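As an illustration (not the paper’s exact acquisition function; names and constants below are assumptions), here is a minimal sketch of active sampling with a Gaussian Process surrogate of a black-box log likelihood, where each new evaluation is placed where the surrogate is most uncertain:

import numpy as np

def log_lik(theta):
    # Stand-in for an expensive black-box log likelihood
    return -0.5 * theta ** 2 + np.sin(3.0 * theta)

def rbf(a, b, ls=0.5):
    # Squared-exponential kernel with unit variance
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=3)        # small initial design
y = log_lik(X)
grid = np.linspace(-3, 3, 200)        # candidate locations

for step in range(10):                # limited evaluation budget
    K = rbf(X, X) + 1e-6 * np.eye(len(X))
    Ks = rbf(grid, X)
    mean = Ks @ np.linalg.solve(K, y)                            # surrogate mean (not used by this acquisition)
    var = 1.0 - np.einsum('ij,ij->i', Ks @ np.linalg.inv(K), Ks) # surrogate variance
    # Acquisition: evaluate where the surrogate's posterior variance is largest
    x_next = grid[np.argmax(var)]
    X = np.append(X, x_next)
    y = np.append(y, log_lik(x_next))

print(f"{len(X)} evaluations used; best observed value at theta={X[np.argmax(y)]:.2f}")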

More Details (Level 2)

ELBO

The Evidence Lower Bound (ELBO) has a pretty self-explanatory name: it is a lower bound for the true Evidence (in Bayesian jargon, the denominator of Bayes’ formula)

The log of the True Evidence decomposes as the sum of the ELBO and the KL divergence between the Variational and the True Posterior (written out below), so, as a consequence of the fact that KL is always non-negative by construction, we have that

  • the ELBO is a lower bound for the (log) Evidence
  • the more similar the Variational Posterior is to the True one, the better the ELBO approximates the Evidence
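In formulas (standard notation, with q the Variational Posterior and p(\theta \mid \mathcal{D}) the True one):

\log p(\mathcal{D}) = \mathrm{ELBO}(q) + \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid \mathcal{D})\big), \qquad \mathrm{ELBO}(q) = \mathbb{E}_{q(\theta)}\big[\log p(\mathcal{D}, \theta) - \log q(\theta)\big]

Since \mathrm{KL} \ge 0, it follows that \log p(\mathcal{D}) \ge \mathrm{ELBO}(q), with equality exactly when q matches the True Posterior.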

Variational Approach in Physics

The Variational Approach has been used in various branches of Physics, like Statistical Mechanics, to perform analytic computations of posteriors

To make the approximation problem analytically tractable, the trick lies in the choice of the Variational Posterior: e.g. choosing a factorized function (see below), applying the logarithm turns the product into a sum, and this can possibly lead to closed-form solutions
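For example, a fully factorized (mean-field) choice, in standard notation:

q(\theta) = \prod_{i=1}^{d} q_i(\theta_i) \quad \Rightarrow \quad \log q(\theta) = \sum_{i=1}^{d} \log q_i(\theta_i)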

This of course comes at a cost in terms of KL divergence: the more this easy approximation differs from the True Posterior, the higher its value (and the looser the bound)

Work in progress (new updates coming soon)
