Shallow Understanding on Bayesian Optimization

As the sample data increases, the curve fit improves.

Hard terms to know in advance :-

Gaussian Process
Noise-Free Global optimization
Noisy Global optimization
Acquisition functions
Surrogate Functions (/Gaussian Process)
Stochastic
Bayesian Regression

What is a Gaussian Process?

- It is a form of Bayesian regression.

- It is an exact interpolation regression method.

- It is a natural generalization of linear regression.

  • It specifies a distribution over functions (sketched below).
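
To make "a distribution over functions" concrete, here is a minimal sketch (plain NumPy, with an assumed RBF/squared-exponential kernel) that draws a few random functions from a zero-mean GP prior:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential (RBF) covariance between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x = np.linspace(0, 10, 100)                    # points at which we evaluate the random functions
K = rbf_kernel(x, x) + 1e-8 * np.eye(len(x))   # small jitter for numerical stability

# Each draw from this multivariate normal is one "function" sampled from the GP prior.
rng = np.random.default_rng(0)
prior_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)
print(prior_samples.shape)                     # (5, 100): five random curves
```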

Bayesian Optimization is a method that relies on approximation. Imagine we don't know a function; what do we usually do? Of course, we try to guess or approximate it using some known prior knowledge. The same idea lies behind the posterior probability. The setting here is that we have observations arriving record by record (online learning), and we need to train a model on them. The trained model will follow some function, and that function, which we don't know, depends entirely on the learnt data. So our task is to find the hyper-parameters that maximize this learning scheme.

We have two kinds of quantities here. One is the observation records (features + labels), and the other is the parameters, which define the model.

e.g.

(say, as in y = mx + c; m and c are the parameters, while y and x are the variables, i.e. labels and features. Our job is to find the m and c that give the best model.)
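
For instance, a quick least-squares sketch in NumPy (with made-up data) that recovers m and c from observed (x, y) pairs:

```python
import numpy as np

# Made-up observations generated from y = 2x + 1 plus a little noise.
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 20)
y = 2.0 * x + 1.0 + 0.1 * rng.standard_normal(x.size)

m, c = np.polyfit(x, y, deg=1)   # fit the parameters of y = mx + c
print(m, c)                      # roughly 2 and 1
```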

Bayesian Optimization helps to find the best model among many. We do have cross-validation in hand, but how many samples from a pre-defined list would we have to try to choose the best model among them? That is why the Bayesian approach speeds up the process: it reduces the computation and does not rely on a person to guess the values. This optimization technique is based on randomness and probability distributions.
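
As a rough sketch of that idea, assuming the scikit-optimize (skopt) package is available, gp_minimize can search a hyper-parameter range with far fewer evaluations than an exhaustive grid; the objective below is a made-up stand-in for an expensive cross-validation score:

```python
from skopt import gp_minimize

# Hypothetical objective: pretend this is the (negated) cross-validation score
# of a model trained with hyper-parameter x. gp_minimize treats it as a black box.
def objective(params):
    x = params[0]
    return (x - 2.0) ** 2 + 0.5      # made-up function with its minimum near x = 2

result = gp_minimize(
    objective,
    dimensions=[(-5.0, 5.0)],        # search range for the single hyper-parameter
    n_calls=20,                      # only 20 evaluations of the expensive objective
    random_state=0,
)
print(result.x, result.fun)          # best hyper-parameter found and its objective value
```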

To put it in my own words: the first observed point comes to the model and we draw a line. Then the next point follows; we join those two points and draw an adjusted line from the previous curve. A third point comes, so we draw a non-linear curve. As the number of observed points increases, the number of possible curves that could fit them increases. It is like sampling theory in Statistics, where we estimate a population parameter from sample statistics.

We plot another curve, a "histogram" curve, along with the non-linear objective-function curve. It tells us where the max value is likely to be, among all past observations, for a given objective-function curve (finding the arg max).

That histogram curve is the acquisition function (and there are several different acquisition functions).

Our goal here is not to determine the objective function (which is unknown) as a complete curve from as many observed points as possible. Instead, we need the arg of the maximum value (the index of the function's maximum). So we turn our attention to the histogram function. As the number of possible objective-function curves increases, the histogram forms a distribution, which describes the acquisition function. That is the idea behind Bayesian Optimization.

REMEMBER, our goal is primarily to determine the arg of the max value, and secondarily to select the next best possible max value, which could lie on the function curve.
Many random ensemble curves that we could draw through the observed black samples.

We see many wiggly curves in the plot. They show the range in which the next point of the function curve could be located, given a sampled point. The variance gives the spread of the distribution around the mean value on the function curve.
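
Here is a minimal NumPy sketch of those posterior curves, assuming a noise-free GP with an RBF kernel conditioned on a handful of made-up observations:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Made-up observed samples (the "black points").
x_obs = np.array([1.0, 3.0, 5.5, 8.0])
y_obs = np.array([0.5, -0.2, 1.1, 0.3])
x_new = np.linspace(0, 10, 200)

# Standard noise-free GP posterior: mean and covariance at the new points.
K = rbf_kernel(x_obs, x_obs) + 1e-8 * np.eye(len(x_obs))
K_s = rbf_kernel(x_new, x_obs)
K_ss = rbf_kernel(x_new, x_new)
K_inv = np.linalg.inv(K)

mu = K_s @ K_inv @ y_obs                       # posterior mean curve
cov = K_ss - K_s @ K_inv @ K_s.T               # posterior covariance
std = np.sqrt(np.clip(np.diag(cov), 0, None))  # per-point spread around the mean

# Each posterior sample is one of the "wiggly curves" passing near the observations.
curves = np.random.default_rng(0).multivariate_normal(
    mu, cov + 1e-8 * np.eye(len(x_new)), size=10)
print(mu.shape, std.shape, curves.shape)
```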

When we estimate by superimposing the observed values directly onto the unknown function curve, this is noise-free optimization. When we need to consider noisy optimization, our unknown function deviates slightly from the observed points by a noisy error value.
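
One way to see that difference, assuming scikit-learn is available: GaussianProcessRegressor adds an `alpha` term to the diagonal of the kernel matrix, so a tiny alpha gives exact interpolation (noise-free), while a larger alpha lets the fitted curve deviate slightly from the observations (noisy):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.array([[1.0], [3.0], [5.5], [8.0]])   # made-up observed points
y = np.array([0.5, -0.2, 1.1, 0.3])

# Noise-free: alpha ~ 0, the posterior mean passes (almost) exactly through the observations.
gp_exact = GaussianProcessRegressor(kernel=RBF(), alpha=1e-10).fit(X, y)

# Noisy: a non-trivial alpha (observation-noise variance) lets the mean deviate slightly.
gp_noisy = GaussianProcessRegressor(kernel=RBF(), alpha=0.1).fit(X, y)

print(gp_exact.predict(X))   # ~ equal to the observed y values
print(gp_noisy.predict(X))   # smoothed, slightly off the observed y values
```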

We can define the acquisition function using several measures; some of them are listed here (a small sketch follows this list),

1. PMAX
2. IEMAX
3. MPI = Maximum Probability of Improvement
4. MEI = Maximum Expected Improvement
5. UCB = Upper Confidence Bound
6. GP-Hedge
7. MM = Maximum Mean
8. MUI = Maximum Upper Interval
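
As a rough sketch (not tied to any particular library), two of these measures, UCB and the Probability of Improvement, can be written directly in terms of the surrogate's posterior mean mu and standard deviation sigma:

```python
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, kappa=2.0):
    """Upper Confidence Bound: favour points with high mean and/or high variance."""
    return mu + kappa * sigma

def probability_of_improvement(mu, sigma, f_best, eps=0.01):
    """Probability that the surrogate exceeds the best observed value plus a small bias eps."""
    sigma = np.maximum(sigma, 1e-12)   # avoid division by zero
    return norm.cdf((mu - f_best - eps) / sigma)

# mu and sigma would come from the GP posterior at candidate points x;
# the next sample is the candidate that maximizes the chosen acquisition function.
```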

Acquisition Criteria :- which sample should be selected next.

Surrogate Function/ Response Surface(model)

  • posterior over unobserved points based on prior.
  • its parameters might be based on prior.

An Acquisition function :-

The measure here is the Probability of Improvement (PI).

μ+ = the best observed value among the past points.

We are looking for the max value of the unknown function. Among our past observations, the next possible max value should be larger than the current max observed value.

ε = a very small arbitrary value that shifts the threshold point in the Gaussian distribution.

Our Goal :-

Find the arg of the max of f(X)

PI(x) = the probability that the next max of the unknown function is found above the max observed value plus some bias.

This is just a 1D integration of the Gaussian distribution, with the max observed value as the threshold boundary.

For each point of the unknown function, we consider a Gaussian distribution placed vertically at that sample point. This distribution has its mean at the value of the unknown function and some certain variance.

We standardize this in Kushner's 1964 equation.
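
Since the equation itself lives in the referenced figure, here is my reconstruction of that Kushner (1964) criterion in the notation used above (μ(x) and σ(x) are the surrogate's posterior mean and standard deviation at x, μ+ is the best observed value, ε the small bias, and Φ the standard normal CDF):

PI(x) = P( f(x) ≥ μ+ + ε ) = Φ( (μ(x) − μ+ − ε) / σ(x) )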

Exploration and exploitation are terms used frequently to explain this phenomenon and the optimization algorithm. There is a trade-off between the two, and we balance it using the acquisition function.

Our next chosen point (x) should have a high mean (exploitation) and a high variance (exploration).

Choosing our next possible max point, trading off between exploration and exploitation.

We search for the next point where the distribution has high variance; that means exploring the point x.

Choosing the next point (x) for its high mean is exploitation; the offset/bias ε discounts that mean a little, so pure exploitation does not completely dominate the choice of the next point.
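
A tiny worked example of that trade-off, reusing the probability_of_improvement sketch from above with two hypothetical candidates, one with a high mean and low variance and one with a lower mean but high variance:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, eps):
    return norm.cdf((mu - f_best - eps) / np.maximum(sigma, 1e-12))

f_best = 1.0
mu = np.array([1.05, 0.90])      # candidate A: high mean;    candidate B: lower mean
sigma = np.array([0.05, 0.50])   # candidate A: low variance; candidate B: high variance

print(probability_of_improvement(mu, sigma, f_best, eps=0.0))   # small eps: A (exploitation) wins
print(probability_of_improvement(mu, sigma, f_best, eps=0.3))   # larger eps: B (exploration) wins
```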

The way we approach Bayesian Optimization.

The picture above gives some intuition about our algorithm. The red curve is the true objective function. Unfortunately, we don't know this function or its equation, so we approximate it using a Gaussian Process. Through our sampled points (4 samples here), we draw an intuitive/confident curve that fits the observed samples. The green region shows the confidence region, where the curve points are most likely to lie. From this prior knowledge, we determine the second point (f+) as the maximum observed value, so the next max point should be above or equal to it. We draw a blue line here; the next max point should fall above this line. From the intersection of f+ with the confidence region, we can discard the candidate curve points below f+, given our goal of finding the arg max. We have therefore narrowed down our area of investigation. The same process continues with the next sampled points.
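
Putting that picture into code: a rough sketch of the full loop, assuming scikit-learn's GaussianProcessRegressor as the surrogate and the Probability of Improvement acquisition from above (the "true" objective here is made up purely for the demo):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def true_objective(x):                 # unknown in practice; made up here for the demo
    return np.sin(3 * x) + 0.5 * x

rng = np.random.default_rng(0)
X = rng.uniform(0, 4, size=4).reshape(-1, 1)     # the 4 initial sampled points
y = true_objective(X).ravel()
candidates = np.linspace(0, 4, 400).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=RBF(), alpha=1e-6).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    f_best = y.max()                                    # the f+ line in the picture
    pi = norm.cdf((mu - f_best - 0.01) / np.maximum(sigma, 1e-12))
    x_next = candidates[np.argmax(pi)].reshape(1, -1)   # most promising point above f+
    X = np.vstack([X, x_next])
    y = np.append(y, true_objective(x_next).ravel())

print(X[np.argmax(y)], y.max())   # arg max found so far and its value
```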

As the data samples increase: a comparison of different acquisition functions with the fitted curve.

Reference :-