Gaussian Processes for Little Data

Yu Chen · Published in Panoramic · Nov 2, 2018

We’ve all heard about Big Data, but there are often times when data scientists must fit models with extremely limited numbers of data points (Little Data) and unknown assumptions regarding the span or distribution of the feature space. In this post, we discuss the use of non-parametric versus parametric models, and delve into the Gaussian Process Regressor for inference. Finally, we’ll explore a possible use case in hyperparameter optimization with the Expected Improvement algorithm. This post is primarily intended for data scientists, although practical implementation details (including runtime analysis, parallel computing using multiprocessing, and code refactoring design) are also addressed for data and software engineers.

Exploring Multi-Dimensional Feature Spaces with Gaussian Processes: A Basketball Use Case

Though it’s not a common background for a data scientist, I was a high school basketball coach for several years, and a common mantra we would tell our players was to

Limit turnovers, attack the basket, and get to the line.

In layman’s terms, we want to maximize our free throw attempts (first dimension), and minimize our turnovers (second dimension), because intuitively we believed it would lead to increased points scored per game (our target output).

Let’s set up a hypothetical problem where we are attempting to model the points scored by one of my favorite athletes, Steph Curry, as a function of the # of free throws attempted and # of turnovers per game. We’ll standardize (μ=0, σ=1) all values and begin incorporating observed data points, updating our belief regarding the relationship between free throws, turnovers, and scoring (I made these values up, so please don’t read too much into them!). A value of 1, for instance, means one standard deviation away from the NBA average. The green dots represent actual observed data points, while the more blue the map is, the higher the predicted scoring output in that part of the feature space:

In his first game, Steph had an average number of free throws and turnovers, but scored a relatively high number of points. In his second game, he had lots of free throws, very few turnovers, and scored a ridiculous number of points. Notice how for most of the “unexplored” feature space regions where we have no data points, we default back to our prior belief (scoring = 0, or the NBA average).
The blue regions (areas with predicted high scoring output) are well defined as a result of Steph’s two strong games. Likewise, the areas of extremely low scoring output (the top left and right green dots, dark orange) have received posterior updates from two of Steph’s poor games. tl;dr: A relatively logical decision boundary can be formulated after four games (data points).

As Steph’s season progresses, we begin to see a more defined contour for free throws and turnovers. We see, for instance, that even when Steph attempts many free throws, if his turnovers are high, he’ll likely score below-average points per game. However, after only 4 data points, the majority of the contour map features an output (scoring) value > 0. Our prior belief is that scoring output for any random player will be ~0 (average), but our model is beginning to become fairly certain that Steph generates above-average scoring: this information is already encoded into our model after just four games, an incredibly small data sample to generalize from. One of the beautiful aspects of GPs is that we can generalize to data of any dimension, as we’ve just seen. The code to generate these contour maps is available here.

The Foundations

So how do we start making logical inferences with datasets as limited as 4 games? One of the most basic parametric models used in statistics and data science is the OLS (Ordinary Least Squares) regression. It’s simple to implement and easily interpretable, particularly when its assumptions are fulfilled. However, the linear model shows its limitations when attempting to fit limited amounts of data, or particularly expressive data. For instance, let’s say you are attempting to model the relationship between the amount of money we spend advertising on a social media platform and how many times our content is shared. At Operam, we might use this insight to help our clients reallocate paid social media marketing budgets, or target specific audience demographics that provide the highest return on investment (ROI):

If our data looks like this, an OLS model fits extremely well. The parameters are linear, the errors are normally distributed, and the data points are independently generated and uncorrelated with the errors (endogeneity does not exist).

We can easily generate the above plot in Python using the Numpy library:
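A minimal sketch of that data generation step (the linear coefficients are assumptions for illustration; only the sample size, spend range, and noise parameters come from the caption below):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# 500 advertising spend values between $200 and $500
spend = np.random.uniform(200, 500, size=500)

# assumed linear relationship, plus Gaussian noise with mean 0 and std 1
noise = np.random.normal(loc=0.0, scale=1.0, size=500)
shares = 0.5 * spend + 20 + noise

plt.scatter(spend, shares, s=5)
plt.xlabel("$ Spent")
plt.ylabel("Shares")
plt.show()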

Here we generate 500 data points, with $ Spent between 200 and 500. We add noise that is normally distributed with μ = 0, σ = 1.

This data will follow a form many of us are familiar with. The key variables here are the βs- we “fit” our linear model by finding the best combinations of β to minimize the difference between our prediction and the actual output:
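Reconstructed from the caption that follows, this is the familiar simple linear regression form:

y = \beta_0 + \beta_1 x + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)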

Your typical regression function, with two parameters (the bias term and coefficient) and a Gaussian-distributed error term.

Notice that we only have two β variables in the above form. What if instead, our data was a bit more expressive? Let’s say we receive data regarding brand lift metrics over the winter holiday season, and are trying to model consumer behavior over that critical time period:

Generated from a sin(x) function with 100 samples, noise ~ N(μ=0, σ=0.2). The orange dots are observed values. We’ll never actually see the blue line in practice, so we might consider the underlying sin(x) function to be a latent node (variable) in a graphical model.

In this case, we’ll likely need to use some sort of polynomial terms to fit the data well, and the overall model will likely take the form
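Reconstructed (assuming a single quadratic term, per the caption that follows):

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon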

Our polynomial term now has its own β parameter. What if we increase this model to have higher degree polynomials?

Instead of two β coefficients, we’ll often have many, many β parameters to account for the additional complexity of the model needed to appropriately fit this more expressive data. This “parameter sprawl” is often undesirable, since the number of parameters within the model effectively becomes a parameter itself, depending upon the dataset at hand.

Moreover, in the above equations, we implicitly must run through a checklist of assumptions regarding the data. For instance, we typically must check for homoskedasticity and unbiased errors, where the ε (the error term) in the above equations is Gaussian distributed, with mean (μ) of 0 and a finite, constant standard deviation σ.

Non-Parametric Models & Methods

It is for these reasons that non-parametric models and methods are often valuable. Contrary to first impressions, a non-parametric model is not simply one with no parameters. It is defined by the following attributes:

  • Non-parametric models do not make any assumptions regarding the underlying population distribution of the data. For instance, the K-Nearest Neighbors model does not assume that the data points conform to any sort of statistical distribution. The K-Means model, by contrast, assumes that each k-th cluster is Gaussian and independent, with each dimension featuring equal variance. Note that I am not asserting that one type of model (parametric vs. non-parametric) is better: the Naive Bayes model, for instance, makes a variety of assumptions regarding the independence of data points, and typically assumes Gaussian conjugate distributions in real-valued regression problems.
  • Non-parametric model complexity grows in relation to the size of the data. If that’s a bit counterintuitive, revisit the form of a linear regression, and ask yourself whether the number of β terms changes when you have 10 data points versus when you have 1,000,000 data points. The answer is no. Regardless of how many data points you have, you’ll still have the same parameter set θ, where θ = {β¹, β², β³}. However, consider a Gaussian kernel regression, which is a common example of a non-parametric regressor. When asked to make a new prediction (y*) for a given datapoint x*, the regressor takes a weighted average of the data points, with y* defined as
Think of this as the weighted average of all the existing data points, with points closer to x* given significantly more weight than points farther away in determining the final output y*.

K here is the kernel function (in this case, a Gaussian):

Notice that since the numerator is the squared Euclidean distance metric, the value of K is highest when x* equals the i-th point of x. As the distance between x* and the i-th point of x increases, the output of K approaches 0. The σ² parameter in the denominator is a free (hyper)parameter of the RBF kernel. Frequently, the entire denominator is written as γ, where γ = 1 / 2σ², so that we can rewrite K as exp(-γ||x* − x(i)||²).
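For reference, the kernel regression estimate y* and the Gaussian kernel K described above can be reconstructed in their standard (Nadaraya–Watson) form as:

y^* = \frac{\sum_{i=1}^{N} K(x^*, x^{(i)})\, y^{(i)}}{\sum_{i=1}^{N} K(x^*, x^{(i)})},
\qquad
K(x^*, x^{(i)}) = \exp\left(-\frac{\lVert x^* - x^{(i)} \rVert^2}{2\sigma^2}\right) = \exp\left(-\gamma \lVert x^* - x^{(i)} \rVert^2\right)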

Clearly, as the number of data points (N) increases, more computations are necessary within the kernel function K, and more summations of each data point’s weighted contribution are needed to generate the prediction y*. If K (we’ll begin calling this the covariance matrix) is the collection of all parameters, then its rank N is equal to the number of training data points.

What might lead a data scientist to choose a non-parametric model over a parametric model?

  • Unknown or undetermined feature spaces. Kernel regressors (including Gaussian Processes) contain the machinery to handle an infinite number of dimensions (see Yaser Abu-Mostafa’s “Kernel Methods” lecture on YouTube for a primer).
  • The underlying data violates many of the core assumptions of parametric models (homoscedasticity, independently distributed data, uniform mean and variance of Gaussian-distributed residuals for the OLS model).
  • There is a high prevalence of outliers in the dataset. Parametric models often depend upon an objective (cost) function: two of the most popular for regression are Mean Absolute Error (MAE) and Mean Squared Error (MSE), which are often utilized within optimization methods like Gradient Descent to perform stepwise updates of the model parameters (weights). These functions can be severely affected by outliers, and if model complexity is high, the model may “chase” outliers and overfit the data.

Gaussian Processes

I like to think of the Gaussian Process model as the Swiss Army knife in my data science toolkit: sure, you can probably accomplish a task far better with more specialized tools (i.e., using deep learning methods and libraries), but if you’re not sure how stable the data you’re going to be working with is (is its distribution changing over time, or do we even know what type of distribution it is going to be?), GPs are a great first step. Moreover, they can be used for a wide variety of machine learning tasks: classification, regression, hyperparameter selection, even unsupervised learning.

So what exactly is a Gaussian Process? In Gaussian Processes for Machine Learning, Rasmussen and Williams define it as

a collection of random variables, any finite number of which have a joint Gaussian distribution… completely specified by its mean function and covariance function.

Notice that Rasmussen and Williams refer to a mean function and covariance function. Why? A real-valued mathematical function is, in essence, an infinite-length vector of outputs. Take, for example, the sin(x) function. If we plot this function with an extremely high number of datapoints, we’ll essentially see the smooth contours of the function itself:

(Left) The function we are attempting to approximate, drawn from one million samples with no noise. (Right) The samples, including noise, that we actually observe in practice. It’s our job to unearth the underlying model behind this sample set of data points, although we know that inherently, these samples are drawn from some infinite-valued, continuous function.

The core principle behind Gaussian Processes is that we can marginalize over (sum over the probabilities associated with the possible instances and state configurations of) all the unseen data points from the infinite vector (function). This leaves only a subset (of observed values) from this infinite vector. As luck would have it, both the marginal and conditional distributions of a subset of a multivariate Gaussian distribution are normally distributed themselves:

Source: Yunus Saatçi, “Scalable Inference for Structured Gaussian Process Models”.

That’s a lot of covariance matrices in one equation! Typically, when performing inference, x is assumed to be a latent variable, and y an observed variable. We can rewrite the above conditional probabilities (and marginalize out y) to shorten the expressions into a bit more manageable format to obtain the marginal probability p(x) and the conditional p(y|x):

The marginal distribution p(x) and conditional distribution p(y|x).

We can also rewrite the multivariate joint distribution p(x,y) as

Source: Yunus Saatçi, “Scalable Inference for Structured Gaussian Process Models”

The next step is to map this joint distribution over to a Gaussian Process. Within Gaussian Processes, the infinite set of random variables is assumed to be drawn from a mean function and covariance function. We typically assume a 0 mean function as an expression of our prior belief- note that we have a mean function, as opposed to simply μ, a point estimate. We have a prior set of observed variables (X) and their corresponding outputs (y). We are interested in generating f* (our prediction for our test points) given X*, our test features. We can represent these relationships using the multivariate joint distribution format:
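Reconstructed (see the caption below for the details of each term), that joint prior is:

\begin{bmatrix} \mathbf{y} \\ \mathbf{f}_* \end{bmatrix}
\sim \mathcal{N}\left(\mathbf{0},\;
\begin{bmatrix}
K(X, X) + \sigma_n^2 I & K(X, X_*) \\
K(X_*, X) & K(X_*, X_*)
\end{bmatrix}\right)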

Equation 2.21 from Rasmussen & Williams. X and y are the input features and output values of the training set, and X* and f* are the input features and output distribution of the target (what we are trying to predict). The σ term incorporates the noise inherent within the distribution. Also notice the 0 mean. When working with Gaussian Processes, the vast majority of the information is encoded within the K covariance matrices. The selection of a mean function is somewhat trivial and 0 is selected for simplicity.

Each of the K functions deserves more explanation. Given n training points and n* test points, K(X, X*) is an n x n* matrix of covariances between each training point and each test point. Similarly, K(X*, X*) is an n* x n* matrix of covariances between test points, and K(X, X) is an n x n matrix of covariances between training points, frequently represented as Σ. For the sake of simplicity, we’ll use the Radial Basis Function kernel, which is defined below:

The radial basis function (RBF) kernel. I’ve removed several of the hyperparameters commonly used to tune the kernel for simplicity’s sake, for now.

In Python, we can implement this using NumPy. Note: In this next section, I’ll be utilizing a heavy dose of Rasmussen & Williams’ derivations, along with a simplified and modified version of Chris Fonnesbeck’s “Fitting Gaussian Process Models in Python” toy examples in the following walkthrough.
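The original gist isn’t reproduced in this export, but a minimal sketch of that kernel class might look like the following (the class name and compute() method come from the example output further down; the unit length-scale matches the simplified kernel above):

import numpy as np

class RadialBasisKernel:
    """Simplified RBF kernel: k(a, b) = exp(-||a - b||^2 / 2)."""

    @staticmethod
    def compute(x_1, x_2):
        # np.subtract.outer produces every pairwise difference, giving an
        # n x n matrix rather than an element-wise vector of differences.
        squared_distances = np.subtract.outer(x_1, x_2) ** 2
        return np.exp(-0.5 * squared_distances)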

Note the use of np.subtract.outer() instead of simply np.subtract. If both x_1 and x_2 are arrays of length n, we want a matrix of shape n x n.

Since the 3rd and 4th elements are identical, we should expect to see the third and fourth row/column pairings to be equal to 1 when executing RadialBasisKernel.compute(x,x):
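For instance, with an input whose third and fourth entries are identical (the exact values below are an assumption, chosen to be consistent with the output shown):

x = np.array([1., 2., 3., 3.])
RadialBasisKernel.compute(x, x)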

array([[ 1.        ,  0.60653066,  0.13533528,  0.13533528],
       [ 0.60653066,  1.        ,  0.60653066,  0.60653066],
       [ 0.13533528,  0.60653066,  1.        ,  1.        ],
       [ 0.13533528,  0.60653066,  1.        ,  1.        ]])

Now, we can start with a completely standard unit Gaussian (μ=0, σ=1) as our prior, and begin incorporating data to make updates. Let’s say we pick any random point x and find its corresponding target value y:

x = 2.3
X = np.array([x]) # initial training feature space value
y = .98
Y = np.array([y]) # initial training output value

Note the difference between x and X, and y and Y: x and y are the individual data point and output, respectively, while X and Y represent the entire training feature set and training output set. The function to generate predictions can be taken directly from Rasmussen & Williams’ equations:
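Reconstructed, the predictive mean and covariance conditioned on the training data are:

\bar{\mathbf{f}}_* = K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1}\,\mathbf{y}

\mathrm{cov}(\mathbf{f}_*) = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma_n^2 I\big]^{-1}\,K(X, X_*)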

The output for the Gaussian Process is not a direct point estimate. It is a tuple- an expected value of all function outputs given training data (X), training outputs (y), and the new test point (X*), and a new covariance matrix (or simply a scalar variance value in the case of a univariate Gaussian output) called cov(f*).

In Python, we can implement this conditional distribution f*. I followed Chris Fonnesbeck’s method of computing each individual covariance matrix (k_x_new_x, k_x_x, k_x_new_x_new) independently, and then used these to find the prediction mean and updated Σ (covariance matrix):
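The original snippet isn’t reproduced here, but a sketch of that computation, using the covariance matrix names mentioned above, might look like this (the function name and default noise term are assumptions):

def gp_predict(x_new, X, y, kernel, noise_variance=1e-2):
    """Posterior mean and (co)variance at a single test point x_new."""
    k_x_new_x = kernel.compute(np.array([x_new]), X)                 # K(X*, X), shape 1 x n
    k_x_x = kernel.compute(X, X) + noise_variance * np.eye(len(X))   # K(X, X) + sigma^2 * I
    k_x_new_x_new = kernel.compute(np.array([x_new]), np.array([x_new]))  # K(X*, X*)

    k_x_x_inv = np.linalg.inv(k_x_x)
    y_pred = (k_x_new_x @ k_x_x_inv @ y).item()                      # mean prediction
    updated_sigma = (k_x_new_x_new
                     - k_x_new_x @ k_x_x_inv @ k_x_new_x.T).item()   # predictive variance
    return y_pred, updated_sigma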

Python code to generate a new prediction mean and covariance for a particular test point (x_new), given existing training data (X) and training outputs (y).

The core computation in generating both y_pred (the mean prediction) and updated_sigma (the covariance matrix) is inverting np.linalg.inv(k_x_x). You might have noticed that the K(X,X) matrix (Σ) contains an extra conditioning term (σ²I). There are times when Σ by itself is a singular matrix (is not invertible). A simple, somewhat crude method of fixing this is to add a small constant σ² (often < 1e-2) multiple of the identity matrix to Σ.

Now, let’s use this function to generate (100) predictions on the interval between -1 and 5:
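One way to do this, reusing the pieces above (again a sketch; the plotting details are assumptions):

import matplotlib.pyplot as plt

x_test = np.linspace(-1, 5, 100)
results = [gp_predict(x_star, X, Y, RadialBasisKernel) for x_star in x_test]
means = np.array([mean for mean, _ in results])
stds = np.sqrt(np.maximum([var for _, var in results], 0.0))  # clip tiny negative variances

plt.plot(x_test, means, label="posterior mean")
plt.fill_between(x_test, means - 2 * stds, means + 2 * stds, alpha=0.3, label="±2 std. dev.")
plt.scatter(X, Y, color="green", label="observed")
plt.legend()
plt.show()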

Notice how the uncertainty of the model increases the further away it gets from the known training data point (2.3, 0.922). Intuitively, in relatively unexplored regions of the feature space, the model is less confident in its mean prediction. Like other kernel regressors, the model is able to generate a prediction for distant x values, but the kernel covariance matrix Σ will tend to maximize the “uncertainty” in this prediction. If we represent this Gaussian Process as a graphical model, we see that most nodes are “missing values”:

Figure 2.3 from Rasmussen and Williams. Squares represent observed variables (note that y* is not a square). Circles represent unknown (latent) variables. Note that in this graphical model, y(1) is conditionally independent of all other outputs given the latent variable f(1)- adding other variables (observations) simply adds more latent variable f(x)s- it does not change the underlying distribution of y(1).

This is probably a good time to refactor our code and encapsulate its logic as a class, allowing it to handle multiple data points and iterations. We’ll put all of our code into a class called GaussianProcess. We’ll also include an update() method to add additional observations and update the covariance matrix Σ (update_sigma). I compute Σ (k_x_x) and its inverse immediately upon update, since it’s wasteful to recompute this matrix and invert it for every new data point passed to new_predict().
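A condensed sketch of that class (attribute names and defaults here are illustrative; the real implementation also handles multi-dimensional inputs, as discussed below):

class GaussianProcess:
    """Accumulates observations and caches the inverted covariance matrix."""

    def __init__(self, kernel=RadialBasisKernel, noise_variance=1e-2):
        # Assumes the RadialBasisKernel sketched earlier.
        self.kernel = kernel
        self.noise_variance = noise_variance
        self.X = np.array([])
        self.y = np.array([])
        self.k_x_x_inv = None

    def update(self, x_new, y_new):
        # Append the new observation, then recompute and invert K(X, X) once,
        # so that subsequent predictions can reuse the cached inverse.
        self.X = np.append(self.X, x_new)
        self.y = np.append(self.y, y_new)
        k_x_x = (self.kernel.compute(self.X, self.X)
                 + self.noise_variance * np.eye(len(self.X)))
        self.k_x_x_inv = np.linalg.inv(k_x_x)

    def new_predict(self, x_new):
        k_x_new_x = self.kernel.compute(np.array([x_new]), self.X)
        k_x_new_x_new = self.kernel.compute(np.array([x_new]), np.array([x_new]))
        y_pred = (k_x_new_x @ self.k_x_x_inv @ self.y).item()
        updated_sigma = (k_x_new_x_new
                         - k_x_new_x @ self.k_x_x_inv @ k_x_new_x.T).item()
        return y_pred, updated_sigma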

An important design consideration when building your machine learning classes here is to expose a consistent interface for your users. For instance, while creating GaussianProcess, a large part of my time was spent handling both one-dimensional and multi-dimensional feature spaces. This led me to refactor the Kernel.get() static method to take in only 2D NumPy arrays:

As a gentle reminder, when working with any sort of kernel computation, you will absolutely want to make sure you vectorize your operations, instead of using for loops. When computing the Euclidean distance numerator of the RBF kernel, for instance, make sure to use the identity
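The identity in question is the expansion of the squared Euclidean distance:

\lVert x - y \rVert^2 = \lVert x \rVert^2 - 2\, x^\top y + \lVert y \rVert^2

Applied to every pair of rows in two data matrices at once, this turns the full pairwise distance computation into two vectors of squared norms and a single matrix product, rather than a nested loop.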

Source: StackOverflow, Vectorizing radial basis function’s euclidean distance calculation for multidimensional features (see accepted answer).

This identity is essentially what scikit-learn uses under the hood when computing the outputs of its various kernels. Here’s a quick reminder of how long you’ll be waiting if you attempt a naive implementation with for loops:

This is for 1 (one!) computation of a 100 x 100 covariance matrix. Imagine doing this computation across the entire feature space!

Now, let’s return to the sin(x) function we began with, and see if we can model it well, even with relatively few data points. We’ll randomly choose points from this function, and update our Gaussian Process model to see how well it fits the underlying function (represented with orange dots) with limited samples:

I wish to explore the feature space between 1 and 10, so I create a prediction_interval iterable. Then, I declare a start indexer variable and randomly sample from my prediction_interval to simulate receiving a random sample of data points. I call the GaussianProcess class’ update() method to ingest new data points.
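A sketch of that loop, using the GaussianProcess class sketched above (the seed and noise draw are illustrative):

prediction_interval = np.linspace(1, 10, 1000)  # feature space to explore
gp = GaussianProcess()

np.random.seed(0)
for _ in range(10):
    x_sample = np.random.choice(prediction_interval)         # simulate receiving a random data point
    y_sample = np.sin(x_sample) + np.random.normal(0, 0.2)   # noisy observation of the latent sin(x)
    gp.update(x_sample, y_sample)

mean_at_5, var_at_5 = gp.new_predict(5.0)  # posterior mean and variance anywhere in the interval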

In 10 observations, the Gaussian Process model was able to approximate relatively well the curvatures of the sin(x) function. The areas of greatest uncertainty were the specific intervals where the model had the least information (x = [5,8]).

Coupling Gaussian Process with Expected Improvement (EI) Algorithm

What else can we use GPs for? Let’s say we are attempting to model the performance of a very large neural network. It takes hours to train this neural network (perhaps it is an extremely compute-heavy CNN, or is especially deep, requiring in-memory storage of millions and millions of weight matrices and gradients during backpropagation). Long story short, we have only a few shots at tuning this model prior to pushing it out to deployment, and we need to know exactly how many hidden layers to use. In other words, the number of hidden layers is a hyperparameter that we are tuning to extract improved performance from our model. What are some common techniques to tune hyperparameters?

We could perform grid search (scikit-learn has a wonderful module for implementation), a brute force process that iterates through all possible hyperparameter values to test the best combinations. This works well if the search space is well-defined and compact, but what happens if you have multiple hyperparameters to tune? If you have just 3 hyperparameters to tune, each with approximately N possible configurations, you’ll quickly have an O(N³) runtime on your hands. There’s random search, but this is still extremely computationally demanding, despite being shown to yield better, more efficient results than grid search.

Even without distributed computing infrastructure like MapReduce or Apache Spark, you could parallelize this search process, of course, taking advantage of multi-core processing commonly available on most personal computing machines today:
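A sketch of what that parallel search might look like with Python’s multiprocessing module (train_and_score_model() is a hypothetical placeholder for the expensive training step, and the chunking scheme is an assumption):

import multiprocessing as mp

def train_and_score_model(n_hidden_layers):
    # Hypothetical placeholder: train the network with this many hidden
    # layers and return its validation score.
    return 0.0

def evaluate_hyperparameters(candidates, return_dict):
    # Each worker evaluates its share of the search space and writes
    # results into the shared return_dict, keyed by hyperparameter value.
    for value in candidates:
        return_dict[value] = train_and_score_model(value)

if __name__ == "__main__":
    search_space = list(range(1, 41))        # hyperparameter values 1 through 40
    manager = mp.Manager()
    return_dict = manager.dict()             # dictionary shared across processes

    chunks = [search_space[i::4] for i in range(4)]   # split the space across 4 processes
    workers = [mp.Process(target=evaluate_hyperparameters, args=(chunk, return_dict))
               for chunk in chunks]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()

    best_value, best_score = max(return_dict.items(), key=lambda kv: kv[1])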

This code presents a sample implementation of a 4-process grid search. The hyperparameter space to search is from 1 to 40. The results are saved to a shared return_dict dictionary, which can be easily flattened and reduced to find the argmax(hyperparameter).

However, this is clearly not a scalable solution: even assuming perfect efficiency between processes, the reduction in runtime is only linear (split between n processes), while the search space grows polynomially with the number of hyperparameters.

To solve this problem, we can couple the Gaussian Process that we have just built with the Expected Improvement (EI) algorithm, which can be used to find the next proposed discovery point that will result in the highest expected improvement to our target output (in this case, model performance):
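In symbols (the standard formulation, reconstructed), the expected improvement of a candidate point x over the current best point x̂ is:

\mathrm{EI}(x) = \mathbb{E}\big[\max\big(f(x) - f(\hat{x}),\, 0\big)\big]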

Source: Thomas Huijskens, “Bayesian Optimization with scikit-learn”.

This is intuitively stating that the expected improvement of a particular proposed data point (x) over the current best value is the difference between the proposed distribution f(x) and the current best distribution f(x_hat). It represents an inherent tradeoff between exploring unknown regions and exploiting the best known results (a classical machine learning concept illustrated through the Multi-armed Bandit construct).

We can calculate this as
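The standard closed form under a Gaussian posterior (omitting the optional exploration parameter some formulations add) is:

\mathrm{EI}(x) =
\begin{cases}
\big(\mu(x) - f(\hat{x})\big)\,\Phi(Z) + \sigma(x)\,\varphi(Z) & \text{if } \sigma(x) > 0 \\
0 & \text{if } \sigma(x) = 0
\end{cases}
\qquad
Z = \frac{\mu(x) - f(\hat{x})}{\sigma(x)}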

Source: Thomas Huijskens, “Bayesian Optimization with scikit-learn”. The Z value is the standard normal Z score, specifying the distance from the Gaussian center in terms of standard deviations. Uppercase Φ(Z) is the Gaussian cumulative distribution function (CDF), and lowercase φ(Z) is the Gaussian probability density function (PDF). μ(x) is the expected target output (predicted mean) for a particular test point, and f(x̂) is the best known output so far.

In our Python implementation, we will define get_z() and get_expected_improvement() functions:
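A sketch of those two functions (the argument names are assumptions; mean and std come from the GP’s prediction at a candidate point, and best_y is the best observed output so far):

from scipy.stats import norm

def get_z(mean, std, best_y):
    return (mean - best_y) / std

def get_expected_improvement(mean, std, best_y):
    """Closed-form expected improvement for one candidate point."""
    if std == 0:
        return 0.0
    z = get_z(mean, std, best_y)
    return (mean - best_y) * norm.cdf(z) + std * norm.pdf(z)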

We can now plot the expected improvement of a particular point x within our feature space (orange line), taking the maximum EI as the next hyperparameter for exploration:

The highest regions of Expected Improvement tend to lie in feature space regions with greatest uncertainty.

Taking a quick ballpark glance, it looks like the regions between 1.5 and 2, and around 5, are likely to yield the greatest expected improvement. If we’re looking to be efficient, those are the feature space regions we’ll want to explore next.

Limitations, Runtime Complexity, & Memory Usage

It’s important to remember that GP models are simply another tool in your data science toolkit. They come with their own limitations and drawbacks:

  • Gaussian Process models are computationally quite expensive, both in terms of runtime and memory resources. Off the shelf, without taking steps to approximate the covariance matrices, they work best (and arguably only) with a relatively limited number of data points. Let’s revisit the conditional distribution used for making predictions:
The bottleneck (both in terms of computation and memory usage) can be found in the K(X,X) matrix and its inversion.

Both the mean prediction and the covariance of the test output require inversions of K(X,X). Since this is an N x N matrix, runtime is O(N³) (more specifically, about N³/6 operations when using Cholesky decomposition instead of directly inverting the matrix, as outlined by Rasmussen and Williams). Memory requirements are O(N²), with the bulk of the demand coming again from the K(X,X) matrix. There are a variety of algorithms designed to improve the scalability of Gaussian Processes, usually by approximating the K(X,X) matrix, of rank N, with a smaller matrix of rank P, where P is significantly smaller than N.

  • Despite being theoretically non-parametric, GPs parameterize the uncertainty of the output values y and y* as Gaussian. Kernel methods like GPs require data scientists to specify a prior distribution, encoded within the selected kernel and its hyperparameters. This allows us to embed domain/expert knowledge into our initial belief distributions, but it also often “locks in” our posterior update to be of the same form. For example, if your output values are bounded to be positive, or between a certain interval, this Gaussian model will serve only as a stopgap approximation of the true distribution (see this StackExchange post for a broader explanation).
  • The possibility of numerical instability increases relative to the number of data points. For instance, you might observe GP behavior similar to this as you increase the number of data points you feed in:
At the edges of the feature prediction space, the GP quite commonly extends its predictions toward +∞ or -∞. This situation can be avoided by proper hyperparameter tuning of the kernel, low-rank decomposition (see below), or adding an additional noise/conditioning term to the covariance matrix.

One technique for addressing this instability is to perform a low-rank decomposition of the covariance kernel.

I’ve only barely scratched the surface of the applications for Gaussian Processes. They’re used for a variety of machine learning tasks, and my hope is I’ve provided a small sample of their potential applications!
