Gaussian Process
Introduction
Let’s start with some interesting properties of the Normal distribution; these properties will later help us derive certain Gaussian Process equations. Assume X and Y are two random variables that are jointly normally distributed. Then the following are also normally distributed:
1. The sum X + Y
2. The joint distribution P(X=x, Y=y)
3. The conditional distribution P(X=x | Y=y)
4. The marginal distributions of X and of Y
5. The product of their densities (up to a normalizing constant)
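As a quick sanity check (not a proof), we can simulate a jointly Gaussian pair and verify properties 1 and 4 numerically; the mean vector and covariance matrix below are arbitrary illustrative values.
import numpy as np
rng = np.random.default_rng(0)
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
samples = rng.multivariate_normal(mean, cov, size=100000)
X, Y = samples[:, 0], samples[:, 1]
# property 1: X + Y should look Normal with mean 1.0 and variance 1 + 2 + 2*0.6 = 4.2
print(np.mean(X + Y), np.var(X + Y))
# property 4: the marginal of X should look Normal with mean 0.0 and variance 1.0
print(np.mean(X), np.var(X))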
These properties are used extensively in many supervised algorithms for parameter estimation via MLE (Maximum Likelihood Estimation) and MAP (Maximum A Posteriori) estimation. The difference between them: in MLE we find the parameters that are most likely to have generated the observed data (maximizing P(D|w)), whereas in MAP we also account for the prior distribution over the parameters (maximizing P(w|D)). MAP can be seen as more reasonable than MLE since we give some weight to our prior belief.
MLE: P(D| w)
MAP: P(w|D) ∝ P(D|w) x P(w)
where D represents the training data, i.e. the (x, y) pairs, and w denotes the parameters
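In estimator form, the two criteria can be written as (a standard restatement of the expressions above):
w_MLE = argmax_w P(D|w)
w_MAP = argmax_w P(D|w) x P(w)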
Intuition
A Gaussian Process is a Bayesian, non-parametric supervised algorithm that gives us predictions along with a confidence interval. In a GP we try to learn a distribution over functions, P(f(x)), where x is any unobserved (test) data point.
Let us assume x is a test point, y the corresponding unknown prediction, and D the training dataset. We are trying to get an estimate of y, i.e. P(y | x, D), which can be rewritten as the marginalization of P(y, w | x, D) over the weights w, as shown below.
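Written out explicitly (standard marginalization over the weights):
P(y | x, D) = ∫ P(y, w | x, D) dw = ∫ P(y | x, w) P(w | D) dw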
Since P(y | x, D) is the marginalization of P(y, w | x, D) with respect to w, the prediction at x is a weighted prediction over all possible weight vectors w. Let us try to understand this equation intuitively.
The first part, P(y | x, w), represents the distribution of y given the weights (w) and features (x). The second part, P(w | D), tells us how probable each weight vector is given the training dataset, i.e. how much weight its prediction receives.
In summary, we go over all possible weights, getting each one's influence given the training data (P(w | D)) and its prediction (P(y | x, w)). P(y | x, D) follows a normal distribution; this can be proved using the product and marginalization properties mentioned in the Introduction section.
P(y| x, D) follows Normal Distribution
This can be seen from the marginalization equation above: P(y | x, w) follows a normal distribution (see fig 2 for reference; for simplicity we consider a linear model with Gaussian noise).
P(w | D) also follows a Normal distribution: P(D | w) is normally distributed and, if we assume a normal prior P(w), the product property makes P(w | D) normal as well. Putting the two together, P(y | x, D) follows a Normal distribution.
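To make this concrete, here is a minimal sketch of the linear-model case, assuming a Gaussian prior w ~ N(0, alpha*I) and Gaussian observation noise with variance noise_var; the data values and hyperparameters below are made up for illustration. The resulting predictive P(y | x, D) comes out Gaussian, with mean and variance computed in closed form.
import numpy as np
X = np.array([[1.0], [2.0], [3.0]])   # hypothetical training inputs
y = np.array([1.2, 1.9, 3.2])         # hypothetical training targets
alpha, noise_var = 1.0, 0.1           # assumed prior variance and noise variance
# posterior over w is Gaussian: N(mu_w, S_w)
S_w = np.linalg.inv(np.eye(1) / alpha + X.T @ X / noise_var)
mu_w = S_w @ X.T @ y / noise_var
# predictive distribution at a test point x* is also Gaussian
x_star = np.array([[4.0]])
pred_mean = (x_star @ mu_w).item()
pred_var = (noise_var + x_star @ S_w @ x_star.T).item()
print(pred_mean, pred_var)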
Look at figure 1 below: each individual curve represents a different function, based on some set of weights, and each passes through the observed training data (black dots).
Mathematical Representation
Let us assume y2 is the prediction at a test point x2, and (x1, y1) are the observed training points.
Under the GP prior, the joint distribution P(y1, y2) is given by the equation below (written here in the standard form, with m denoting the prior mean function and K the kernel/covariance function):
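(y1, y2) ~ N( (m(x1), m(x2)), [[K(x1, x1), K(x1, x2)], [K(x2, x1), K(x2, x2)]] )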
Conditional Distribution of P(y2 | y1, x1, x2)
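Applying the Gaussian conditioning property from the Introduction (the same expression the code below implements):
P(y2 | y1, x1, x2) = N( m(x2) + K(x2, x1) K(x1, x1)^(-1) (y1 - m(x1)),  K(x2, x2) - K(x2, x1) K(x1, x1)^(-1) K(x1, x2) )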
We use this conditional distribution for inference on test data. There are several options for the kernel function used to measure covariances.
Kernel Function
The kernel function defines the covariance between X and X*. Covariance measures the relationship between two random variables: a positive covariance means they tend to move in the same direction, while a negative covariance means they tend to move in opposite directions. Any function that satisfies certain properties can be a kernel function; these properties are listed below.
Suppose k is a kernel function; then it should satisfy the following conditions:
1. k(x, y) = f(x) . f(y) for some feature mapping f
2. k is symmetric, which follows from property 1: k(y, x) = f(y) . f(x) = k(x, y)
3. k is a kernel function if and only if the kernel matrix (K) is positive semi-definite
Some commonly used kernel functions are the Rational Quadratic Kernel and the Radial Basis Function (RBF).
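As a quick numerical illustration of properties 2 and 3 (a sanity check on arbitrary sample points, not a proof), we can build an RBF kernel matrix and confirm it is symmetric with non-negative eigenvalues:
import numpy as np
from scipy.spatial.distance import cdist
points = np.random.rand(10, 1)                      # arbitrary sample points
sq_dist = cdist(points, points, metric='sqeuclidean')
K = np.exp(-sq_dist / 2)                            # RBF kernel with length_scale = 1
print(np.allclose(K, K.T))                          # symmetry
print(np.all(np.linalg.eigvalsh(K) >= -1e-10))      # eigenvalues >= 0, up to numerical error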
Python Implementation from Scratch
# importing relevant python libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from scipy.spatial.distance import cdist
# true underlying function used to generate the training data
def model(x):
    return x * np.sin(x)
# simulating some training data points from the true model
X_train = np.array([1, 3, 5, 6, 8]).reshape(-1, 1)
y_train = model(X_train)
X_test = np.linspace(0, 10, 100).reshape(-1, 1)
# Defining the RBF kernel function
def kernel_function(X1, X2, length_scale=1):
    # squared Euclidean distance between every pair of points
    pairwise_distance = cdist(X1, X2, metric='sqeuclidean')
    return np.exp(-pairwise_distance / (2 * length_scale**2))
# generating kernel matrix(covariance matrix) for training & test data
sigma_11 = kernel_function(X_train, X_train)
sigma_12 = kernel_function(X_train, X_test)
sigma_22 = kernel_function(X_test, X_test)
sigma_21 = kernel_function(X_test, X_train)
# prior mean vectors: the row-wise mean of a single-column array is just the value itself,
# so this effectively uses m(x) = x as the prior mean function
mean_1 = np.mean(X_train, axis=1).reshape(-1, 1)
mean_2 = np.mean(X_test, axis=1).reshape(-1, 1)
# posterior (conditional) mean vector & covariance matrix for the test data
mean = mean_2.flatten() + \
    sigma_21 @ np.linalg.inv(sigma_11) @ (y_train - mean_1).flatten()
sigma = sigma_22 - sigma_21 @ np.linalg.inv(sigma_11) @ sigma_12
# per-point standard deviation: square root of the diagonal of the posterior covariance
sigma = np.sqrt(np.diag(sigma))
# plotting test & train data
plt.plot(X_train, y_train, 'or')
plt.plot(X_test, mean, '-', color='orange')
plt.plot(X_test, model(X_test), '-', color='blue')
plt.fill_between(X_test.flatten(), mean - 2*sigma, mean + 2*sigma,
color='gray', alpha=0.2)
pop_a = mpatches.Patch(color='red', label='training dataset')
pop_b = mpatches.Patch(color='blue', label='true function')
pop_c = mpatches.Patch(color='orange', label='test data prediction')
pop_d = mpatches.Patch(color='gray', label='confidence interval')
plt.legend(handles=[pop_a, pop_b, pop_c, pop_d], loc='upper left')
plt.xlabel("x")
plt.ylabel("y")
plt.xlim(0, 10)
plt.show()
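As an optional cross-check (not part of the derivation above), scikit-learn's GaussianProcessRegressor with an RBF kernel produces a similar posterior; note that it assumes a zero prior mean, while the code above effectively uses m(x) = x, so the two will not match exactly.
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
# optimizer=None keeps the kernel hyperparameters fixed, mirroring the fixed length scale above
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), optimizer=None)
gp.fit(X_train, y_train)
mean_skl, std_skl = gp.predict(X_test, return_std=True)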
Conclusion
This article introduced the mathematical intuition behind the Gaussian Process and the derivation of its core equations, and we also implemented a naive version of a GP from scratch in Python. Gaussian Processes have many applications where confidence intervals are crucial and where we want to replicate the behaviour of the training data.