Scalable Gaussian Processes: Posterior Approximation Methods

ObjectPetitU · Published in The Startup · Oct 21, 2019 · 4 min read

This article builds on the previous ones, which covered techniques for scaling Gaussian Processes. In this post we discuss posterior approximation methods, which keep the exact prior but approximate the posterior. In contrast to prior approximation methods, posterior approximation methods allow the joint optimisation of parameters and hyper-parameters.

Posterior approximation methods are a subset of Sparse Approximation methods within Global Approximation methods.

The aim of posterior approximation methods is to approximate the posterior while retaining the full prior. Exact posterior inference in a GP still requires inverting the n × n kernel matrix, and thus still has O(n³) complexity, which makes posterior approximation methods just as useful as their prior-approximation counterparts.
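To see where the O(n³) cost comes from, here is a minimal NumPy sketch of exact GP regression; the kernel, data and all names are illustrative. The Cholesky factorisation of the n × n kernel matrix is the step that posterior approximation methods aim to avoid.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.2, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between two input sets."""
    sqdist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def exact_gp_posterior(X, y, Xstar, noise=0.1):
    """Exact GP posterior mean and covariance. The Cholesky factorisation
    of the n x n kernel matrix is the O(n^3) bottleneck."""
    n = X.shape[0]
    K = rbf_kernel(X, X) + noise**2 * np.eye(n)
    L = np.linalg.cholesky(K)                      # O(n^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf_kernel(X, Xstar)
    mean = Ks.T @ alpha                            # posterior mean
    v = np.linalg.solve(L, Ks)
    cov = rbf_kernel(Xstar, Xstar) - v.T @ v       # posterior covariance
    return mean, cov

X = np.linspace(0, 1, 50)[:, None]
y = np.sin(6 * X[:, 0])
mean, cov = exact_gp_posterior(X, y, X)
```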

Approximation Methods

To reduce the complexity of GPs, posterior approximation methods approximate the marginal likelihood, again using m inducing points, which in turn permits approximate inference. This allows the joint optimisation of parameters and hyper-parameters, and, much like the methods in the previous post on prior approximation, they rely on the Nyström approximation of the kernel matrix.
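The Nyström approximation can be sketched in a few lines of NumPy; the kernel and names here are illustrative. Given m inducing inputs Z, it replaces the full kernel matrix K_ff with the rank-m matrix Q_ff = K_fm K_mm⁻¹ K_mf.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=0.2, variance=1.0):
    """Squared-exponential (RBF) kernel matrix between two input sets."""
    sqdist = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * sqdist / lengthscale**2)

def nystrom(X, Z, kernel=rbf_kernel, jitter=1e-8):
    """Nystrom approximation Q_ff = K_fm K_mm^{-1} K_mf built from
    m inducing inputs Z (m << n)."""
    Kmm = kernel(Z, Z) + jitter * np.eye(Z.shape[0])   # jitter for stability
    Kfm = kernel(X, Z)
    return Kfm @ np.linalg.solve(Kmm, Kfm.T)           # rank-m approximation

X = np.linspace(0, 1, 40)[:, None]
Kff = rbf_kernel(X, X)
Qff = nystrom(X, X[::8])   # m = 5 inducing inputs
```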

Variational Free Energy Method — VFE

Developed by Titsias [1], the Variational Free Energy (VFE) method provides a guaranteed lower bound on the log marginal likelihood, so the degree of error in the approximation can be controlled. By introducing a variational distribution, we can take the KL divergence between this variational distribution and the true posterior, which yields the lower bound. Minimising the KL divergence allows us to jointly optimise parameters and hyper-parameters.

What needs to be understood at this point, and what makes VFE impressive, is that we do not have the posterior: it is exactly what we are trying to work out. The ELBO construction lets us sidestep it.

The quantity we minimise is the KL divergence KL[q(f, f_m) || p(f, f_m | y)], where q is our variational distribution over the latent values and p(f, f_m | y) is the true, intractable posterior. Rewriting this KL divergence in terms of the evidence lower bound (ELBO) removes the need to evaluate p(f, f_m | y), the very quantity we are trying to find.

** For the more interested reader: the manipulation of the KL divergence into the bound known as the ELBO, which involves only the marginal distribution rather than the unknown posterior, is quite impressive **
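The manipulation can be sketched as follows, writing Titsias's variational family as q(f, f_m) = p(f | f_m) q(f_m) (a sketch following [1]; notation matches the objective discussed below):

```latex
% Non-negativity of the KL divergence yields the ELBO:
\log p(y) = \mathcal{F}(q) + \mathrm{KL}\!\left[\,q(f, f_m)\;\|\;p(f, f_m \mid y)\,\right]
          \;\ge\; \mathcal{F}(q),
\qquad
\mathcal{F}(q) = \mathbb{E}_{q(f, f_m)}\!\left[\log \frac{p(y \mid f)\, p(f \mid f_m)\, p(f_m)}{q(f, f_m)}\right].

% Choosing q(f, f_m) = p(f \mid f_m)\, q(f_m) cancels the conditional,
% and optimising over q(f_m) in closed form gives the collapsed bound:
\mathcal{F} = \log \mathcal{N}\!\left(y \,\middle|\, 0,\; Q_{ff} + \sigma_n^2 I\right)
            - \frac{1}{2\sigma_n^2} \operatorname{tr}\!\left(K_{ff} - Q_{ff}\right).
```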

NOTE: A review of KL divergence may be helpful before proceeding.

The aim is then to minimise the (approximate) negative log marginal likelihood.

The objective function can be written in a unified form for both VFE and FITC, with method-specific G and T terms [2]:

F = (n/2) log(2π) + (1/2) log|Q_ff + G| + (1/2) yᵀ (Q_ff + G)⁻¹ y + (1/(2σ²ₙ)) tr(T)

Q_ff = K_fm K_mm⁻¹ K_mf is the Nyström approximation of the kernel matrix K_ff.

As we can see, the objective function comprises three terms: the complexity penalty (the log-determinant), the data fit (the quadratic term) and the trace term.

In VFE, the G and T terms in the objective function are given by [2]:

G = σ²ₙ I,  T = K_ff − Q_ff

The trace term ensures that the objective remains a lower bound on the marginal likelihood of the full GP. This essentially means that we are trying to develop a model which can fit our current data but also approximates the covariance structure of the full GP.

FITC (Fully Independent Training Conditional)

For FITC we use the following G and T terms in the objective function [2]:

G = diag[K_ff − Q_ff] + σ²ₙ I,  T = 0

The diagonal correction restores the exact prior variance at each training point, while the trace term vanishes.
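To make the two choices of G and T concrete, here is a minimal NumPy sketch of the unified objective from [2] on toy data; the kernel, data and names are illustrative, and the naive O(n³) linear algebra would in practice be replaced via the matrix inversion lemma to reach O(nm²).

```python
import numpy as np

def rbf(X1, X2, ls=0.2, var=1.0):
    """Squared-exponential kernel matrix."""
    d = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return var * np.exp(-0.5 * d / ls**2)

def objective(y, Kff, Qff, noise, method="vfe"):
    """Unified VFE/FITC negative approximate log marginal likelihood,
    written naively in O(n^3) form for clarity."""
    n = y.shape[0]
    if method == "vfe":
        G = noise**2 * np.eye(n)                                 # noise only
        T = Kff - Qff                                            # trace term
    else:
        G = np.diag(np.diag(Kff - Qff)) + noise**2 * np.eye(n)   # FITC
        T = np.zeros_like(Kff)                                   # no trace term
    S = Qff + G
    L = np.linalg.cholesky(S)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    complexity = np.sum(np.log(np.diag(L)))                      # 0.5 * log|S|
    data_fit = 0.5 * y @ alpha
    trace_term = np.trace(T) / (2 * noise**2)
    return 0.5 * n * np.log(2 * np.pi) + complexity + data_fit + trace_term

# Toy data with m = 5 inducing inputs drawn from the n = 30 training inputs.
X = np.linspace(0, 1, 30)[:, None]
y = np.sin(6 * X[:, 0])
Z = X[::6]
Kmm = rbf(Z, Z) + 1e-8 * np.eye(Z.shape[0])
Kfm = rbf(X, Z)
Qff = Kfm @ np.linalg.solve(Kmm, Kfm.T)
Kff = rbf(X, X)
vfe = objective(y, Kff, Qff, 0.1, "vfe")
fitc = objective(y, Kff, Qff, 0.1, "fitc")
```

Note that with Q_ff = K_ff both objectives reduce to the exact GP negative log marginal likelihood, and the sparse VFE objective can never fall below it, since the bound sits under log p(y).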

VFE vs FITC: broadly speaking, FITC underestimates the noise variance while VFE overestimates it. The more interested reader can refer to [2].

For both models, the complexity penalty decreases as the noise variance decreases. The diagonal term in FITC's G is zero at the inducing points and reverts to the prior variance away from them. In plots of the FITC predictive variance this appears as 'pinch points' at the inducing inputs.

VFE's posterior and marginal likelihood approximations become more accurate (or remain unchanged) wherever a new inducing point is placed. VFE, however, is more susceptible to local minima and harder to optimise than FITC.

Utilising Batch Updates

One issue that arises at this point is that evaluating the bound requires all the data, costing O(nm²) per optimisation step, which can be time-consuming. If we develop looser bounds than the one above, we can use stochastic gradient descent with unbiased estimates computed on mini-batches of data [3].

We can use Stochastic Variational Inference (SVI), which uses natural gradients and achieves a per-step complexity of O(m³). SVI can also be utilised for online learning [4].
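The idea that makes mini-batching valid is that the looser (uncollapsed) bound decomposes as a sum of per-data-point terms, so a mini-batch rescaled by n/b is an unbiased estimate of the full sum. A minimal sketch, with arbitrary stand-in values for the per-point terms (all names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 1000, 32

# Stand-ins for the n per-point likelihood terms E_q[log p(y_i | f_i)]
# that the uncollapsed bound sums over (values here are arbitrary).
per_point_terms = rng.normal(size=n)
full_sum = per_point_terms.sum()

def minibatch_estimate(rng):
    """Unbiased estimate of the full-data sum from a batch of size b:
    sample a mini-batch and rescale by n / b."""
    idx = rng.choice(n, size=b, replace=False)
    return (n / b) * per_point_terms[idx].sum()

# Averaging many mini-batch estimates converges to the full-data sum,
# which is what lets stochastic gradient descent work on the loosened bound.
estimates = np.array([minibatch_estimate(rng) for _ in range(20000)])
```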

Bibliography

[1] M. Titsias, Variational Learning of Inducing Variables in Sparse Gaussian Processes

[2] M. Bauer, M. van der Wilk, C. E. Rasmussen, Understanding Probabilistic Sparse Gaussian Process Approximations

[3] D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization

[4] M. D. Hoffman, D. M. Blei, C. Wang, J. Paisley, Stochastic Variational Inference
