Spatial Reasoning in Random Field Models: Part I

Rachel Prudden
Met Office Informatics Lab
May 19, 2021

Gaussian processes (GPs) are a probabilistic model used across a wide range of machine learning and statistics. They appear in many different contexts, including model emulation, spatial interpolation, Bayesian optimisation, and generative modelling. In a spatial context, they are often referred to as Gaussian random fields.

Most of the time, as with any other ML method, the input data used to train¹ these models consists of a bunch of isolated examples. Given the function values at these input points, the challenge is then to predict the values at unobserved locations. So far, so familiar.

But these models are more flexible than they first appear. A part of my PhD research has been looking into what happens when these models are used with fundamentally different kinds of observations, such as integrals and spatial averages. This turns out to be a pleasingly natural extension, and what’s more it gives rise to some very beautiful (and useful!) emergent effects.

I’m going to try and illustrate a few of these effects in this short two-part series of blog posts. In this first post, I’ll give some background on the mathematics behind these models and show how they can be combined with spatial averages for certain kinds of downscaling / super-resolution. In Part II, we’ll see how combining point observations with spatial averages can make things even more interesting.

Gaussian processes

The intuition behind Gaussian processes is this: they give you a distribution over the space of functions. Every sample you draw from the distribution is a function. The properties of these functions, such as their smoothness, are encoded in the parameters of the distribution, through its covariance.

Samples from Gaussian processes and Gaussian random fields with different lengthscale parameters (left) and example covariance matrices (right).

There are many excellent resources which explain the basics of Gaussian processes, so I won’t go into much detail here. If the ideas are new to you, I would definitely recommend watching an introductory talk like this one. Another great place to start is this article on distill, which walks through the mathematics with interactive examples.

As their name suggests, Gaussian processes are built on Gaussian distributions. In fact, if you take a finite² Gaussian process defined over a grid, it is just a multivariate Gaussian distribution, defined by a mean vector and a covariance matrix. What makes them special is the form of the covariance matrix, which reflects the spatial properties of the functions, including their dimension (see the image above). This is easiest to see for the one-dimensional case: the covariance peaks along the diagonal and then drops away, as each point is most closely correlated with its near neighbours and less so with those further away.³
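To make this concrete, here is a minimal numpy sketch of building such a covariance matrix on a one-dimensional grid. The squared-exponential kernel and the particular lengthscale are illustrative choices of mine, not the exact ones behind the figures.

```python
import numpy as np

def rbf_covariance(xs, lengthscale=0.1, variance=1.0):
    """Squared-exponential kernel: K[i, j] = variance * exp(-(x_i - x_j)^2 / (2 * lengthscale^2))."""
    diffs = xs[:, None] - xs[None, :]
    return variance * np.exp(-0.5 * (diffs / lengthscale) ** 2)

# Covariance over a regular 1D grid: it peaks along the diagonal and decays with distance.
xs = np.linspace(0.0, 1.0, 100)
K = rbf_covariance(xs)
```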

The useful properties of Gaussian processes are also inherited from Gaussian distributions. The most important are:

Conditioning

Knowing the function values at certain points, we would like to get the conditional distribution over functions that match these observations. We can do this by conditioning the Gaussian process on our observations, which conveniently comes down to some matrix algebra:

\mu_{t|o} = \mu_t + K_{to} K_{oo}^{-1} (\mathrm{obs} - \mu_o),

\Sigma_{t|o} = K_{tt} - K_{to} K_{oo}^{-1} K_{ot},

for the conditional mean and covariance respectively, where t and o denote the target and observed locations, K is the covariance matrix partitioned into the corresponding blocks, and obs is a vector containing the observed values.
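In code, the conditioning step might look something like the sketch below: a plain numpy version, with hypothetical index arrays o and t picking out the observed and target grid points. The small jitter term is a standard numerical-stability trick, not part of the formulas above.

```python
import numpy as np

def condition_on_points(mu, K, o, t, obs, jitter=1e-9):
    """Conditional mean and covariance at target indices t, given values obs at observed indices o."""
    K_oo = K[np.ix_(o, o)] + jitter * np.eye(len(o))   # observed-observed block (regularised)
    K_to = K[np.ix_(t, o)]                             # target-observed block
    K_tt = K[np.ix_(t, t)]                             # target-target block
    mean = mu[t] + K_to @ np.linalg.solve(K_oo, obs - mu[o])
    cov = K_tt - K_to @ np.linalg.solve(K_oo, K_to.T)
    return mean, cov
```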

Inference

Given a Gaussian process and an observed function, it's useful to be able to calculate the likelihood of that function as a function of the GP parameters. Taking gradients through this likelihood then allows us to infer the model parameters based on the observations. The (log) likelihood also reduces to matrix algebra:

\log p(x) = -\tfrac{1}{2} x^{\top} K^{-1} x - \tfrac{1}{2} \log |K| - \tfrac{n}{2} \log 2\pi,

where K is a parameterised covariance matrix, x is the observed data (taking the mean to be zero, as in the examples below), and n is the length of x.
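A minimal numpy sketch of this log likelihood, assuming a zero mean and using the Cholesky factor rather than an explicit inverse:

```python
import numpy as np

def gp_log_likelihood(x, K):
    """log N(x | 0, K), computed via the Cholesky factor of K."""
    n = len(x)
    L = np.linalg.cholesky(K)                            # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, x))  # K^{-1} x
    log_det = 2.0 * np.sum(np.log(np.diag(L)))           # log |K|
    return -0.5 * x @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)
```

In practice this would sit inside an autodiff framework, so that gradients with respect to the kernel parameters come for free.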

Sampling

Finally, if we have a Gaussian process defined by its mean and covariance, we'd like to be able to treat it as a generative model and draw samples. This reduces to… you'll never guess…

x = \mu + K^{1/2} w

yes, more matrix algebra! Here w is a vector of white noise, and the half power denotes a matrix square root (in practice usually the Cholesky factor, for efficiency).
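A matching sketch of the sampling step, using the Cholesky factor as the matrix square root (the jitter again just keeps the factorisation numerically stable):

```python
import numpy as np

def sample_gp(mu, K, n_samples=5, jitter=1e-9):
    """Draw samples x = mu + L w, where L is the Cholesky factor of K and w is white noise."""
    L = np.linalg.cholesky(K + jitter * np.eye(len(mu)))
    w = np.random.randn(len(mu), n_samples)
    return mu[:, None] + L @ w                 # one sample per column
```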

That’s about it for Gaussian processes. Before we move on, a bit of nomenclature. Gaussian processes which have an interpretation as spatial fields (instead of being defined over time or some more abstract space) are often called Gaussian random fields or GRFs. Mathematically, it’s really all the same.

Spatial conditioning

I mentioned above that GPs can be used with different kinds of observations, such as spatial averages. So, how does this work? Once again, it comes down to the nice properties of the Gaussian distribution, specifically the way covariances behave under linear maps. A spatial average is a linear function of the underlying field, so its covariance with the field (and with other averages) can be derived directly from the covariance of the original field. What we end up with is a joint distribution relating the spatial averages to the original variables. Once we have this, it's easy to condition on spatial averages using exactly the same techniques described in the previous section.
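As a sketch of how this works, suppose the spatial averages are given by a block-averaging matrix A applied to the field. Then the averages have covariance A K Aᵀ and cross-covariance K Aᵀ with the field, and conditioning proceeds exactly as before. The helpers below are a hypothetical numpy version of this idea, not the code behind the post's figures.

```python
import numpy as np

def block_average_operator(n, block_size):
    """Linear operator A whose rows average contiguous blocks of a length-n grid."""
    n_blocks = n // block_size
    A = np.zeros((n_blocks, n))
    for b in range(n_blocks):
        A[b, b * block_size:(b + 1) * block_size] = 1.0 / block_size
    return A

def condition_on_averages(mu, K, A, avg_obs, jitter=1e-9):
    """Condition the field on observed block averages avg_obs = A x."""
    K_aa = A @ K @ A.T + jitter * np.eye(A.shape[0])   # covariance of the averages
    K_xa = K @ A.T                                     # cross-covariance of field and averages
    mean = mu + K_xa @ np.linalg.solve(K_aa, avg_obs - A @ mu)
    cov = K - K_xa @ np.linalg.solve(K_aa, K_xa.T)
    return mean, cov
```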

This is still sounding quite abstract, so let's look at an example. We'll stick to a single dimension for now, to make the interpretation easier. Our data just consists of a series of values, shown below. We'll use this data as input to two different models, to see the difference between conditioning on point values and conditioning on spatial averages. The first model treats the data as point observations, while the second treats them as observations of spatial averages. In all other respects the models are identical; both use the same prior distribution with zero mean and unit variance, and the same lengthscale parameter.

Input data for our example. Dots indicate an interpretation as point observations, horizontal lines indicate an interpretation as spatial averages.

When we condition the first model on the input data, treating the observations as point values, we get something like this:

Distribution for model using point observations.

The distribution is constrained to pass exactly through each of the observed points, so the variance drops to zero at these points. Midway between observations the variance is much higher, and the distribution is more influenced by its prior. A similar picture arises from plotting samples from the distribution: each passes exactly through the marked points, but elsewhere they can be quite divergent.

Samples from model using point observations.

How about our other model? In this case, there is no sharp decrease in variance since the observations are no longer located at a point. Instead, the variance stays fairly close to uniform throughout.

Distribution for model using spatial observations.
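This difference is easy to check numerically using the helpers sketched earlier. The grid, lengthscale, observation locations, and values below are illustrative stand-ins, not the actual data behind the figures: the point model's posterior variance collapses at the observed indices, while the spatial model's stays well away from zero.

```python
import numpy as np

xs = np.linspace(0.0, 1.0, 100)
K = rbf_covariance(xs, lengthscale=0.1)
mu = np.zeros(len(xs))
vals = np.array([0.5, -0.2, 0.1, 0.4, -0.3])   # made-up observed values

# Point model: the values read as point observations at five grid indices.
o = [10, 30, 50, 70, 90]
t = np.arange(len(xs))
mu_pt, cov_pt = condition_on_points(mu, K, o, t, vals)

# Spatial model: the same values read as averages over five blocks of 20 cells.
A = block_average_operator(len(xs), block_size=20)
mu_av, cov_av = condition_on_averages(mu, K, A, vals)

print(np.diag(cov_pt)[o])      # ~0: variance collapses at the observed points
print(np.diag(cov_av).min())   # stays well above zero everywhere
```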

Similarly, if we look at samples from this model it’s clear that they are no longer constrained to go through the points marked with red circles, as in the previous model. Some samples go through them by chance, while others miss entirely.

Samples from model using spatial observations.

Since this model does not constrain the samples to pass through some specified points, does it constrain some other property? The answer is yes, but it's less easy to see from these plots. To see it more easily, let's take a look at the spatial averages of the samples from each model. The samples from the model using point observations are shown in light green, and the samples from the model using spatial observations are shown in dark green:

Spatial averages of the samples from the model using point observations (light green) and spatial observations (dark green).

Now it’s clear that the spatial averages of samples from the second model agree exactly with each other, and with the input observations. By contrast, the samples from the first model have quite divergent spatial averages.
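The same toy setup makes this claim easy to verify: drawing samples from each posterior above and taking their block averages, only the spatial model reproduces the observed averages (up to the numerical jitter).

```python
samples_pt = sample_gp(mu_pt, cov_pt, n_samples=10)
samples_av = sample_gp(mu_av, cov_av, n_samples=10)

print(np.abs(A @ samples_av - vals[:, None]).max())   # tiny: the block averages are pinned
print(np.abs(A @ samples_pt - vals[:, None]).max())   # clearly nonzero: averages vary freely
```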

It’s clear that how we choose to interpret this input data can have a significant impact on our resulting distribution. But why? We saw above that conditioning a Gaussian process comes down to just some linear algebra. How does this linear algebra force our models to satisfy these different constraints?

One way to gain some insight into this question is to examine the covariance matrices. An advantage of working in one dimension is that these matrices are quite easy to interpret: each axis of the covariance matrix corresponds to the single spatial dimension, with the variance for each point along the diagonal. By examining the posterior covariance for each model, we can see how they encode the constraints we have imposed.

Posterior covariances for model using point observations (left) and spatial observations (right).

Consider the arrangement of positive, negative, and zero covariance values in each case. One notable feature is the presence of zero values on the diagonal of the point-based model, corresponding to the observed points. By contrast, the spatial model has positive values everywhere on its diagonal. This matches up neatly with our previous discussion.

Other features show up in the covariance matrices that would be difficult or impossible to see in our earlier visualisations. One example is the appearance of a positive-negative “dipole” around each observation point in the covariance of the point-based model. This feature can be seen as imposing smoothness of the function through the fixed point: if the function is higher on one side, it must be lower on the other to maintain continuity.

For the spatial model, the negative values in the covariance follow a noticeably different pattern. In fact, they follow closely the boundaries of the blocks over which we have taken the spatial averages. In this case the meaning is not to impose continuity, but to constrain the spatial average over these areas to match the observation. If there is a high value in one part of a block, the function elsewhere in that block must take on lower values to compensate, and vice versa. This is a theme we’ll pick up again in Part II of this series.

Wrapping up

That’s it for this post. We’ve gone through some background on Gaussian processes, and seen how conditioning them on spatial averages instead of point observations can change the properties of our model. In the next post, I’ll discuss combining spatial averages with point observations and how this leads to some nice emergent properties.

If you’d like to learn more about the ideas in this post, there is now a preprint available which goes into greater detail and includes an application to super-resolution of weather fields. You can also watch my recent talk.

Notes

[1] In fact, the normal distinction between training a model and zero-shot learning gets a bit tricky here. If the input data to a GP represents house prices, it looks a lot like training a regression model; if the data represents pixel values, it has more of a zero-shot learning flavour.

[2] A nice feature of Gaussian processes is that you can also do the same things with continuous functions and ungridded observations, but we don’t need that here.

[3] The covariance matrix is usually taken to follow a simple parameterised form known as a kernel — this simplifies model inference, as only a small number of parameters need to be estimated.


Rachel is a researcher in the Informatics Lab. Her current focus is on probabilistic super-resolution of weather models at convective scales.