A Kernel for Mapping with Gaussian Processes

Alice Adenis
Published in dataswati-garage
Dec 14, 2018 · 9 min read

I love maps, and the challenge of getting the perfect representation of the Earth's distances, angles and surfaces on the flat surface of a screen. At Dataswati, I experiment with mapping some feature given discrete measurements. To do so, I started using kriging, and eventually gaussian processes. Sadly, the gaussian process algorithm from sklearn misses some features specific to mapping on the Earth, and this void must be filled.

Here is the description of the problem: I have measurements of some data (such as temperature, ore concentration, or radioactivity level) at coordinates s=(x,y) or s=(latitude, longitude). From this sampling, I want to map an estimation of this data over a selected area. I also want to assess the level of confidence of this estimate, as the uncertainty of the estimate sheds light on where to take the next measurement.

Kriging and gaussian processes both make it possible to perform such interpolations using a Bayesian formalism, from which the uncertainty of the estimate follows.

Kriging and gaussian processes

Kriging was created by Krige (1951) specifically for mapping ore concentration from a discrete sampling. It was popularized in the 60s by the French geophysicist Matheron (1962). His solution is in fact a 2D/3D special case of gaussian processes, which generalize the modeling to other dimensions. Rasmussen and Williams (2006) present gaussian processes as a generalization of the gaussian probability distribution to distributions over functions.

Both gaussian processes and kriging consider the measured data as the result of a stochastic process: y = f(x) + ε, where f is the unknown real function we want to model, and ε ~ N(0, σ²) is the noise. The simplest example is of course a 1D function that is irregularly sampled and has to be interpolated. The main idea behind these methods is that the co-variation of the features is smooth: the predicted value near sampled coordinates is supposed to be close to the value measured at these coordinates.

A kriging model acts as an exact interpolator: the prediction equals the measurement at any observed point, ignoring the measurement uncertainty. With gaussian processes, the noise level in the observations is taken into account via a parameter a that allows the predicted model to pull away from the exact measured values. The optimal value for this parameter is a = σ², matching the noise variance of the measurements.
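In sklearn's GaussianProcessRegressor this parameter is called alpha, and it is the noise variance added to the diagonal of the kernel matrix. A minimal 1D sketch with toy data (not the post's dataset):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy 1D example: irregular samples of an unknown function with noise sigma = 0.1.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(30, 1))               # sampled coordinates
y = np.sin(X).ravel() + rng.normal(0, 0.1, 30)     # noisy measurements

# alpha plays the role of the parameter a: the assumed noise variance sigma^2.
gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1 ** 2,
                               n_restarts_optimizer=5, random_state=0)
gpr.fit(X, y)

X_new = np.linspace(0, 10, 200).reshape(-1, 1)
y_pred, y_std = gpr.predict(X_new, return_std=True)  # prediction and its uncertainty
```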

The kernel (also called covariance function) gives the "spatial" dependency of two data points. In kriging, the kernel is represented by a semi-variogram, defined as half the average squared difference of the measured values yi and yj for pairs of points separated by a distance h:

γ(h) = 1/(2N(h)) · Σ_(|si−sj|=h) (yi − yj)²

where N(h) is the number of pairs of points that are separated by the distance h. The semi-variogram extracts real information about the structure of the data. This kernel is then used to interpolate the model over the desired area.
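A minimal sketch of how such an empirical semi-variogram could be computed, binning pairs of points by distance (function and variable names are mine, for illustration only):

```python
import numpy as np

def empirical_semivariogram(coords, values, bins):
    """Empirical semi-variogram: for each distance bin, half the mean squared
    difference of the values over all pairs of points falling in that bin.

    coords: (n, 2) array of coordinates, values: (n,) measurements,
    bins: 1D array of distance bin edges.
    """
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))            # pairwise distances
    sq_diffs = (values[:, None] - values[None, :]) ** 2   # pairwise (yi - yj)^2

    # Keep each pair only once (upper triangle, excluding the diagonal).
    iu = np.triu_indices(len(values), k=1)
    dists, sq_diffs = dists[iu], sq_diffs[iu]

    gamma = np.full(len(bins) - 1, np.nan)
    for k in range(len(bins) - 1):
        mask = (dists >= bins[k]) & (dists < bins[k + 1])
        if mask.any():
            gamma[k] = 0.5 * sq_diffs[mask].mean()
    return gamma
```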

With gaussian processes, assumptions are made on the shape of the kernel, and the parameters of this kernel are learned from the data during the training stage. There are different kinds of kernels and I am currently experimenting with kernels optimized for mapping at a large scale.

Kernels

As described above, the kernel function returns the covariance between the outputs given the inputs. Sklearn implements many different kernels, well described in its documentation. Here is the most basic one, the Radial-basis function (RBF) kernel, also known as the squared exponential kernel:

k(xi, xj) = exp( −d(xi, xj)² / (2l²) )

where d(xi, xj) is the Euclidean distance between xi and xj, and l is the length scale of the kernel, which controls the smoothness of the final model. This formula is correct when your space is Euclidean. If the area you are modeling is small enough, you can consider that your coordinates s=(x,y) define such a Euclidean space. However, this approximation is incorrect when you are mapping, let's say, Europe.
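For reference, this is the kernel matrix sklearn evaluates; the coordinates below are arbitrary (x, y) points in a plane:

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 0.0]])  # three points in a Euclidean plane
kernel = RBF(length_scale=2.0)
K = kernel(X)  # 3x3 matrix with K[i, j] = exp(-d(xi, xj)**2 / (2 * 2.0**2))
```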

Comparison of maps of Europe in the World Geodetic System and in the Lambert 93 coordinate system.

You can see that the left map looks weirdly flattened: this is what you get by plotting the World Geodetic System coordinates (latitude and longitude) directly. In the Lambert 93 projection, distances and shapes are also altered, only in a different way.

Depending on the coordinate system, mapping with gaussian processes therefore results in very different structures. Even worse, it misses large structures hiding behind the data. To bypass this problem, we need to replace the distance d(xi, xj) with a distance that is independent of the projection, such as the great-circle distance (haversine formula), or a more accurate distance evaluated on the surface of a spheroid (Vincenty's formula).
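To get a feeling for the size of the distortion, here is a quick comparison between a naive "Euclidean distance on degrees" and the great-circle distance for two points at about 60°N (the coordinates are approximate and purely illustrative):

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(p1, p2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (*p1, *p2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

oslo = (59.9, 10.8)       # approximate (latitude, longitude)
helsinki = (60.2, 24.9)

# Treating degrees as Euclidean coordinates (1 degree ~ 111 km) vs. the spherical distance:
naive_km = 111.0 * np.hypot(helsinki[0] - oslo[0], helsinki[1] - oslo[1])   # ~1570 km
true_km = haversine_km(oslo, helsinki)                                       # ~780 km
```

At this latitude, the naive distance is roughly twice the true great-circle distance.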

Using one of these two distances, we also need to change the kernel to

k(xi, xj) = exp( −d(xi, xj)² / (2l²) ), with d(xi, xj) now the distance on the sphere,

as scaling the axes by l before calculating the distances doesn't make sense anymore: sklearn's RBF implementation rescales each coordinate by the length scale before computing the Euclidean distance, whereas here l must divide the spherical distance directly.
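Sklearn does not ship such a kernel, but the idea can be sketched by subclassing its Kernel base class. The class below is my own minimal illustration of a great-circle RBF kernel (the names, bounds and Earth radius are assumptions, and it is not the exact implementation behind the maps shown in this post):

```python
import numpy as np
from sklearn.gaussian_process.kernels import (Hyperparameter, Kernel,
                                              NormalizedKernelMixin,
                                              StationaryKernelMixin)

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)

def great_circle_distances(X, Y=None):
    """Pairwise great-circle distances (km) between (latitude, longitude) points in degrees."""
    if Y is None:
        Y = X
    lat1, lon1 = np.radians(X[:, 0])[:, None], np.radians(X[:, 1])[:, None]
    lat2, lon2 = np.radians(Y[:, 0])[None, :], np.radians(Y[:, 1])[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

class GreatCircleRBF(StationaryKernelMixin, NormalizedKernelMixin, Kernel):
    """RBF-like kernel k(xi, xj) = exp(-d(xi, xj)^2 / (2 l^2)), d = great-circle distance in km."""

    def __init__(self, length_scale=100.0, length_scale_bounds=(1e1, 1e4)):
        self.length_scale = length_scale
        self.length_scale_bounds = length_scale_bounds

    @property
    def hyperparameter_length_scale(self):
        return Hyperparameter("length_scale", "numeric", self.length_scale_bounds)

    def __call__(self, X, Y=None, eval_gradient=False):
        X = np.asarray(X)
        d = great_circle_distances(X, None if Y is None else np.asarray(Y))
        K = np.exp(-0.5 * (d / self.length_scale) ** 2)
        if eval_gradient:
            # Gradient of K with respect to log(length_scale), as sklearn expects.
            return K, (K * d ** 2 / self.length_scale ** 2)[:, :, np.newaxis]
        return K
```

It can then be passed to GaussianProcessRegressor like any built-in kernel, with the training features given as (latitude, longitude) pairs in degrees.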

Mapping with gaussian processes

To test this modified kernel, I've used a model to generate synthetic training data over the map of Europe. From these "measurements", I build a predictive model of the synthetic dataset using gaussian processes. This way, the predicted map can be compared with the "real" map.

The synthetic model is generated randomly using pyshtools, a tool that builds random models from spherical harmonics (a set of orthogonal functions on the sphere). Our model needs to be defined on a sphere to be consistent with our new kernel, which is optimized for the sphere. The synthetic model, represented below, is actually the random model built in the first pyshtools tutorial.

A random model built using spherical harmonics up to degree 600. At degree 600, the map has a characteristic length of 60 km (0.6°).
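A sketch of how such a random model can be generated, following the spirit of the first pyshtools tutorial (the power spectrum and the seed here are assumptions, not necessarily those used for the figure):

```python
import numpy as np
import pyshtools as pysh

lmax = 600
degrees = np.arange(lmax + 1, dtype=float)
degrees[0] = np.inf                      # avoid a division by zero at degree 0
power = degrees ** (-2)                  # an assumed "red" power spectrum

clm = pysh.SHCoeffs.from_random(power, seed=12345)   # random spherical-harmonic coefficients
grid = clm.expand()                                   # evaluate the model on a global grid
lats, lons, values = grid.lats(), grid.lons(), grid.to_array()
```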

Random points are sampled from the model, only on the continents, as is often the case in real datasets. I did not include noise in this dataset, so as to examine the best-case scenario. This sampling implies that the output model is going to be smoother than the initial one: for every sampled point, the distance to the closest other point is ~120 km on average, so we can't expect to retrieve structures smaller than this characteristic length. We therefore attempt to map the broader variations.

Random sample points drawn from the previous model.
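As a sanity check of that ~120 km characteristic length, the mean nearest-neighbour distance can be estimated with a BallTree and the haversine metric (the coordinate array is a placeholder for the sampled points):

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

def mean_nearest_neighbour_km(latlon_deg):
    """Mean distance (km) from each point to its closest neighbour; input is (n, 2) in degrees."""
    points_rad = np.radians(latlon_deg)        # the haversine metric expects radians
    tree = BallTree(points_rad, metric="haversine")
    dist, _ = tree.query(points_rad, k=2)      # k=2: the nearest "neighbour" is the point itself
    return EARTH_RADIUS_KM * dist[:, 1].mean()
```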

The advantage of gaussian processes is that they come with a ready-to-use formalism for machine learning: the algorithm of Rasmussen and Williams (2006) is fully implemented in sklearn. In our case, we are interested in regression.

Let's take a look at what happens if we use the RBF kernel without any modification. The figure below shows the map obtained with the optimized RBF kernel.

Predicted map generated using gaussian processes and the RBF kernel.

The optimization always ends up with a really small length scale, l = 170 km. It makes sense that it can't find the largest structures reliably: for a given difference in longitude, points standing at a higher latitude are physically closer together, so the Euclidean distance on (latitude, longitude) is not consistent across the map.

Map of the a posteriori error and map of the residuals for the model built with the RBF kernel.

Here are the maps of the a posteriori error returned by the algorithm and of the residuals (i.e. the difference between the "real" map and the predicted map) over the same area. With this small length scale, it is impossible to recover the broad structure of the map where the density of points is really low. In these regions, the a posteriori error is high, and so are the residuals.

Some other kernels, still based on the Euclidean distance, give better results than RBF. For example, the rational quadratic kernel can find a broader structure, with a length scale of l = 427 km.

I propose a modified version of the RBF kernel, in which a custom distance metric can be plugged in. With this kernel, I tried two distances:

  • The distance metrics from geopy, which return the distance between two points given their latitude and longitude.
  • The great-circle distance computed with the basic haversine formula in this function. I ended up using this one because computing the geopy distance takes too long for gaussian process regression (see the snippet below).
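For reference, this is roughly how those geopy distances are obtained (the coordinates are purely illustrative; in recent geopy versions the spheroid distance is exposed as geodesic, which replaces the older vincenty function):

```python
from geopy.distance import geodesic, great_circle

paris = (48.85, 2.35)      # (latitude, longitude) in degrees
oslo = (59.91, 10.75)

d_sphere_km = great_circle(paris, oslo).km   # great-circle distance on a sphere
d_spheroid_km = geodesic(paris, oslo).km     # more accurate distance on the WGS-84 spheroid
```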

Here are the results obtained with one of the best models built using this kernel.

Map generated using gaussian processes and the modified kernel. The metric used with this kernel is the great-circle distance, which returns the distance between two points given their latitude and longitude.

Did you notice how well the model is retrieved over the North Sea, the Norwegian Sea and the Mediterranean Sea? The kernel finds consistency in the broad structure of the map. This consistency lowers the a posteriori error, and the residuals decrease drastically.

Map of the a posteriori error and map of the residuals for the model built with the corrected kernel and the great-circle distance.

To finish, I calculated a score on the residuals for different setups. The score is the mean squared error of the residuals calculated over the whole map:

MSE = 1/N_Ω · Σ_(s∈Ω) (y(s) − ŷ(s))²

where N_Ω is the number of points of coordinates s on the map Ω, y is the real model and ŷ is the model obtained with gaussian processes. I ran the optimization using four different kernels:

  • The RBF kernel from sklearn with the default (Euclidean) distance,
  • The rational quadratic kernel from sklearn with the default distance (still Euclidean),
  • The modified RBF kernel with the geopy distance,
  • The modified RBF kernel with the great-circle distance.

I also tested different values for a, which sets the expected noise level in the observations.
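The comparison can be sketched as a simple loop over kernels and noise levels; the arrays below are small stand-ins for the training samples, the evaluation grid and the "real" model, and the custom great-circle kernel from above could be added to the dictionary in the same way:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic

# Stand-in data: replace with the sampled points, the map grid and the synthetic model.
rng = np.random.RandomState(0)
X_train = rng.uniform(-10.0, 40.0, size=(50, 2))
y_train = np.sin(X_train[:, 0] / 10.0)
X_grid = rng.uniform(-10.0, 40.0, size=(200, 2))
y_true = np.sin(X_grid[:, 0] / 10.0)

kernels = {
    "RBF (Euclidean)": RBF(length_scale=1.0),
    "Rational quadratic (Euclidean)": RationalQuadratic(length_scale=1.0),
    # "Modified RBF (great-circle)": GreatCircleRBF(length_scale=100.0),
}

scores = {}
for noise in [0.0, 1e-3, 1e-1]:   # tested values of a (sklearn's alpha)
    for name, kernel in kernels.items():
        gpr = GaussianProcessRegressor(kernel=kernel, alpha=max(noise, 1e-10),
                                       n_restarts_optimizer=3, random_state=0)
        gpr.fit(X_train, y_train)
        y_pred = gpr.predict(X_grid)
        scores[(name, noise)] = np.mean((y_true - y_pred) ** 2)   # MSE of the residuals
```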

Mean squared error of the residuals for the models built with the RBF kernel, with the rational quadratic kernel from sklearn, with the modified RBF kernel using the geopy distance, and with the modified RBF kernel using the great-circle distance. Colors of the dots correspond to the noise-level setup used.

The best scores are obtained when the distances on the sphere are taken into account. When the noise level is set too high, the predicted model comes out flat and the score degrades. When the level of noise is 0, the sampled data are exactly retrieved using either of the distances calculated on the sphere, whereas the rational quadratic kernel gives worse results as the noise is decreased.

It makes perfect sense to use this adapted distance when mapping a large area of the Earth. Of course, for smaller regions, the rational quadratic kernel will give neat results. But isn't it amazing how easy it is to obtain accurate maps of the Earth with this simple setup?

To be continued

Here are some leads on how I want to continue this work:

  • First, I need to speed up the distance function to decrease the computing time. For now, it may still be preferable to use the rational quadratic kernel, because the modified kernel is too time-consuming.
  • Next, the two metrics I presented only allow modeling from two features, latitude and longitude. I am looking for a way to include other dimensions in this kernel (such as time, for example).
  • Finally, the kernel does not include any option to deal with anisotropy (while this option exists for sklearn kernels). On the sphere, I think it is possible to use the formalism of seismic anisotropy to construct a more general kernel for gaussian processes.

Stay tuned!

References

Krige, D. G. (1951). A statistical approach to some basic mine valuation problems on the Witwatersrand. Journal of the Southern African Institute of Mining and Metallurgy, 52(6), 119‑139.

Matheron, G. (1962). Traité de géostatistique appliquée. Mémoires du Bureau de Recherches Géologiques et Minières (Editions Technip., Vol. 1). Paris.

Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. Adaptive computation and machine learning. Cambridge, Mass: MIT Press.
