Interpreting Posterior of Gaussian Process for Regression

Edward Elson Kosasih · Published in Analytics Vidhya · Feb 2, 2020
Gaussian Process Illustration by scikit-learn

I recently learned about Gaussian Processes (GPs) and how they can be used for regression. However, I have to admit that I had a hard time grasping the concept. It was only after I derived the equations and worked through a few examples that I began to decipher what this whole idea is about.

a Gaussian process is a collection of random variables, any finite number of which have (consistent) Gaussian distributions. (Rasmussen)

Let’s start by looking at a simple problem. Suppose we observe two data points, (x1, y1) and (x2, y2). Now, given a new point x3, we want to find y3. One standard way to solve this is to perform linear regression, i.e. fit a line through (x1, y1) and (x2, y2), extrapolate it to x3, and voila! we get y3. This method, however, tells us nothing about the uncertainty associated with that predicted value of y3.

A Gaussian Process looks at the same problem through a different lens, though it retains some of the intuitions underlying linear regression. For instance, without seeing (x1, y1) and (x2, y2), we could have guessed y3 to be any value between -∞ and ∞ with uniform probability (if we hold a non-informative prior belief). After seeing (x1, y1) and (x2, y2), however, we become more certain that some values of y3 are more plausible than others. Linear regression takes this to the extreme and decides that only one value of y3 is possible: the one lying on the line fitted through y1 and y2 and extrapolated to x3.

GP accepts this intuition, yet reframes it in a different paradigm: Bayesian inference. The idea is to model each data point as a normal distribution, e.g. P(y1|x1) and P(y2|x2). Regression can now be seen as posterior inference, where we calculate the conditional probability P(y3 | (y1, y2), (x1, x2, x3)).
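Concretely, and assuming the usual zero-mean prior, the three outputs are modelled jointly as one multivariate normal, and regression amounts to conditioning this joint distribution on the outputs we have already observed:

\[
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0}, \Sigma \right),
\qquad
p(y_3 \mid y_1, y_2) = \frac{p(y_1, y_2, y_3)}{p(y_1, y_2)},
\]

where the covariance matrix \(\Sigma\) is built from the inputs x1, x2, x3 in a way we will make precise below.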

Recall the Multivariate Normal Formulation

Let’s recall how the multivariate normal distribution is formulated, and in particular what happens when we condition one part of it on another.
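If a Gaussian vector is partitioned into two blocks A and B, the conditional distribution of A given B is itself Gaussian:

\[
\begin{pmatrix} A \\ B \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_A \\ \mu_B \end{pmatrix}, \begin{pmatrix} \Sigma_{AA} & \Sigma_{AB} \\ \Sigma_{BA} & \Sigma_{BB} \end{pmatrix} \right)
\quad\Longrightarrow\quad
A \mid B \sim \mathcal{N}\left( \mu_A + \Sigma_{AB} \Sigma_{BB}^{-1} (B - \mu_B),\; \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right).
\]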

We can now let A be (y3) and B be (y1, y2).
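Here A is the single unobserved output y3 and B collects the observed outputs (y1, y2). With the zero-mean prior assumed above, the posterior of y3 becomes

\[
y_3 \mid y_1, y_2 \sim \mathcal{N}\left( \Sigma_{AB} \Sigma_{BB}^{-1} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix},\; \Sigma_{AA} - \Sigma_{AB} \Sigma_{BB}^{-1} \Sigma_{BA} \right).
\]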

We can see that the posterior distribution provides a new mean and variance for P(y3), based on its covariance with the two observed data points. In a Gaussian Process, this covariance is defined by a kernel.

The kernel is meant to represent the similarity between two data points. In our case, we have selected a squared exponential kernel. If two data points are similar (in the Euclidean sense), the kernel value approaches 1, with equality only when the two points are exactly the same. Otherwise, the kernel value asymptotically approaches 0. Since Euclidean distance is symmetric, so is our squared exponential kernel.
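Here is a minimal sketch of such a kernel (the function name and the unit length scale are illustrative assumptions; the length scale is a free hyperparameter rather than something fixed by the derivation):

import numpy as np

def squared_exponential_kernel(xa, xb, length_scale=1.0):
    # k(xa, xb) = exp(-||xa - xb||^2 / (2 * length_scale^2))
    sq_dist = np.sum((np.asarray(xa, dtype=float) - np.asarray(xb, dtype=float)) ** 2)
    return np.exp(-sq_dist / (2.0 * length_scale ** 2))

# Identical inputs give exactly 1; distant inputs give a value close to 0.
print(squared_exponential_kernel(1.0, 1.0))  # 1.0
print(squared_exponential_kernel(1.0, 5.0))  # ~0.000335

The length scale controls how quickly similarity decays with Euclidean distance.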

We’ll now look at each component of the posterior separately, i.e. the mean and the variance.

Mean

We can break this down further. First, substitute the covariance terms with kernel values.
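Written out with each covariance entry replaced by the corresponding kernel value, the posterior mean reads

\[
\mu_3 = \begin{pmatrix} k(x_3, x_1) & k(x_3, x_2) \end{pmatrix}
\begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) \\ k(x_2, x_1) & k(x_2, x_2) \end{pmatrix}^{-1}
\begin{pmatrix} y_1 \\ y_2 \end{pmatrix}.
\]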

Then perform the matrix multiplication.
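Using the closed form of the 2×2 matrix inverse, this expands to

\[
\mu_3 = \frac{\left( k(x_3, x_1)\, k(x_2, x_2) - k(x_3, x_2)\, k(x_1, x_2) \right) y_1 + \left( k(x_3, x_2)\, k(x_1, x_1) - k(x_3, x_1)\, k(x_1, x_2) \right) y_2}{k(x_1, x_1)\, k(x_2, x_2) - k(x_1, x_2)^2}.
\]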

This expression might seem complicated, but several of its terms are constant/given once the data is observed, e.g. the kernel between x1 and x2.

We can see that the new mean is essentially a weighted sum of the observed y1 and y2. Let’s now analyse the weights.
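Since the squared exponential kernel gives k(x1, x1) = k(x2, x2) = 1, the weights on y1 and y2 simplify to

\[
w_1 = \frac{k(x_3, x_1) - k(x_3, x_2)\, k(x_1, x_2)}{1 - k(x_1, x_2)^2},
\qquad
w_2 = \frac{k(x_3, x_2) - k(x_3, x_1)\, k(x_1, x_2)}{1 - k(x_1, x_2)^2}.
\]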

The more similar x3 is to x1, the larger k(x3, x1) and hence the greater the contribution from y1. Conversely, the more similar x3 is to x2, the larger k(x3, x2), which reduces the contribution from y1 (through the k(x3, x2) k(x1, x2) term in w1 above). Think of this as a pair of attracting and opposing forces exerted by y1 and y2.

The new mean is a weighted sum of the past observed values. The weights act like opposing/attracting forces whose magnitudes depend on the kernel function.

Variance

We can break this down in the same way.
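Substituting kernel values and using k(x3, x3) = 1 again, the posterior variance becomes

\[
\sigma_3^2 = 1 - \frac{k(x_3, x_1)^2 + k(x_3, x_2)^2 - 2\, k(x_3, x_1)\, k(x_3, x_2)\, k(x_1, x_2)}{1 - k(x_1, x_2)^2}.
\]

The subtracted term is a quadratic form in the inverse of the (positive definite) kernel matrix, so it is never negative: conditioning on data can only shrink the variance.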

Notice that upon observing the data (x1, y1) and (x2, y2), the variance of y3 decreases. This is why, in a typical GP chart, the confidence interval tightens around observed data points (forming “knots”): it is a direct result of this reduced variance.

In the end we obtain the posterior form of p(y3), which gives the mean and variance of the range of values that y3 will likely assume. We can then build a confidence interval from this mean and variance. If we would like a single-point estimate, like the one linear regression provides, we can sample from the posterior p(y3), or take the Maximum A Posteriori (MAP) estimate, which for a Gaussian posterior is simply the posterior mean.
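As a quick sanity check, here is a minimal sketch of the same two-point setup using scikit-learn’s GaussianProcessRegressor with a fixed squared exponential (RBF) kernel. The data values below are made up purely for illustration:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Two observed points (x1, y1) and (x2, y2).
X_train = np.array([[1.0], [3.0]])
y_train = np.array([0.5, -0.2])

# Fixed RBF kernel; optimizer=None keeps the length scale at 1.0 instead of fitting it.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), optimizer=None)
gp.fit(X_train, y_train)

# Posterior mean and standard deviation at the new point x3.
X_new = np.array([[2.0]])
mean, std = gp.predict(X_new, return_std=True)
print(f"posterior mean: {mean[0]:.3f}, 95% interval: +/- {1.96 * std[0]:.3f}")

# Alternatively, draw samples from the posterior instead of taking a point estimate.
samples = gp.sample_y(X_new, n_samples=5, random_state=0)

The reported mean and standard deviation match the closed-form expressions derived above, up to a tiny jitter term scikit-learn adds for numerical stability.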

Conclusion

We have seen how to interpret the posterior form of a Gaussian Process as a method for performing regression. There are many variations of GP, each with a different kernel function and prior; the inference method, however, stays the same.
