A Shallow Understanding of Bayesian Optimization
Key terms to know in advance :-
Noise-Free Global optimization
Noisy Global optimization
Surrogate Functions (e.g., Gaussian Processes)
What is a Gaussian Process ?
- It is a Bayesian regression method.
- It is an exact interpolation regression method.
- It is a natural generalization of linear regression.
- It specifies a distribution over functions.
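A GP posterior is simple enough to write by hand. Below is a minimal sketch (plain numpy, an RBF kernel; the function names are my own, not from any particular library). With only a tiny jitter on the diagonal, the posterior mean reproduces the training labels, which is the "exact interpolation" property listed above.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential (RBF) kernel between two sets of 1-D points.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale=1.0, jitter=1e-8):
    # Exact GP regression: posterior mean and variance at the test points.
    K = rbf_kernel(x_train, x_train, length_scale) + jitter * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale)
    K_ss = rbf_kernel(x_test, x_test, length_scale)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, np.diag(cov)

x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)
mean, var = gp_posterior(x_train, y_train, np.array([1.0]))
# At a training point the mean matches the label and the variance is ~0:
# the GP interpolates exactly in the noise-free case.
```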
Bayesian Optimization is a method built on approximation. If we don’t know a function, what do we usually do? Of course, we try to guess or approximate it using some known prior knowledge. The same idea is behind the posterior probability. The setting here is that we have observations arriving record by record (online learning), and we need to train a model on them. The trained model will obviously follow some function, and that unknown function will depend entirely on the data learnt so far. So our task is to find the hyper-parameters that maximize this learning scheme.
We have some variables here. One is the observation records (features + labels), and the second is the parameters, which define the model.
(say, for y = mx + c: m and c are the parameters, while y and x are the variables, i.e., labels and features. Our obligation is to find the m and c that give our best model.)
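For this toy y = mx + c case, finding the parameters is just a least-squares solve. A quick numpy illustration, with made-up data generated from m = 2, c = 1 so the fit should recover those values:

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix [x, 1] -> least-squares solve for [m, c].
A = np.column_stack([x, np.ones_like(x)])
(m, c), *_ = np.linalg.lstsq(A, y, rcond=None)
# m ≈ 2.0, c ≈ 1.0
```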
Bayesian Optimization helps to find the best model among many. We do have cross-validation in hand, but how many candidates from a pre-defined list would we have to try in order to choose the best model among them? That is why the Bayesian approach speeds up the process: it reduces the computation and does not expect the person to guess values by hand. This optimization technique is based on randomness and probability distributions.
To say it in my own words: the first observed point comes to the model and we draw a line. Then the next point follows; we join those 2 points and draw an adjusted line from the previous curve. A third point comes, so we draw a non-linear curve. As the number of observed points increases, the number of combinations of possible curves increases. It is like sampling theory in statistics, where we estimate a population parameter from sample statistics.
We plot another curve, a histogram, along with the non-linear objective-function curve. It tells us where the max value is likely to be, among all past observations, for a given objective-function curve (finding the arg max).
Our goal here is not to determine the unknown objective function completely, as a curve through as many observed points as possible. Instead, we need the arg of the maximum value (the location of the function's max). So we divert our concentration towards the histogram function. As the possible combinations of objective-function curves increase, the histogram forms a distribution, which describes the acquisition function. That is the idea behind Bayesian Optimization.
REMEMBER: our primary goal is to determine the arg of the max value, and our secondary goal is to select the next best possible max value, which could be anywhere on the function curve.
We see many wiggly curves in the plot. They show the range where the next point of the function curve could be located, given a sampled point. The variance gives the spread of the distribution around the mean value on the function curve.
When we estimate by superimposing the observed values exactly onto the unknown function curve, this is noise-free optimization. When we consider noisy optimization instead, the observed points deviate slightly from our unknown function by a noisy error value.
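In surrogate terms the difference between the two settings is small: in the noisy case a noise variance σ_n² is added to the diagonal of the kernel matrix, so the GP smooths through the observations instead of interpolating them exactly. A sketch (the σ_n value here is an assumed noise level, not anything prescribed):

```python
import numpy as np

def kernel_matrix(x, length_scale=1.0):
    # Squared-exponential kernel over a set of 1-D points.
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

x_train = np.array([0.0, 0.5, 1.0])
K = kernel_matrix(x_train)

# Noise-free setting: tiny jitter for numerical stability only;
# the GP must pass through the observations exactly.
K_noise_free = K + 1e-10 * np.eye(len(x_train))

# Noisy setting: observations deviate from f by Gaussian noise with
# variance sigma_n**2, which is added to the diagonal; the fit then
# smooths through the points instead of interpolating them.
sigma_n = 0.1
K_noisy = K + sigma_n ** 2 * np.eye(len(x_train))
```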
We can define the acquisition function under many measures, and some of them are here:
- MPI = Maximum Probability of Improvement
- MEI = Maximum Expected Improvement
- UCB = Upper Confidence Bound
- MM = Maximum Mean
- MUI = Maximum Upper Interval
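Given the surrogate's posterior mean μ(x) and standard deviation σ(x) at a candidate point, the first three of these measures can each be written in a couple of lines. A stdlib-only sketch (the μ, σ, f_best numbers at the bottom are assumed for illustration):

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function from the stdlib.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    # MPI: chance the value at x beats the best observation by at least xi.
    return norm_cdf((mu - f_best - xi) / sigma)

def expected_improvement(mu, sigma, f_best, xi=0.01):
    # MEI: expected amount by which the value at x beats f_best + xi.
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm_cdf(z) + sigma * norm_pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    # UCB: optimistic score; kappa sets the exploration weight.
    return mu + kappa * sigma

# Illustrative posterior values at one candidate point (assumed numbers).
mu, sigma, f_best = 1.0, 0.5, 0.8
pi_val = probability_of_improvement(mu, sigma, f_best)
ei_val = expected_improvement(mu, sigma, f_best)
ucb_val = upper_confidence_bound(mu, sigma)
```

Because the candidate's mean already exceeds f_best here, its PI comes out above 0.5 and its EI is positive.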
Acquisition Criteria :- which sample should be selected next
Surrogate Function / Response Surface (model)
- posterior over unobserved points, based on the prior.
- its parameters might be based on the prior.
An Acquisition function :-
The measurement is the Probability of Improvement (PI).
μ+ = the best observed value among the past points.
We are looking for the max value of the unknown function, so among our candidates the next possible max should be larger than the current max observed value.
ε = a very small arbitrary value, the bias added above the best observed value in the Gaussian distribution.
Our Goal :-
PI(x) = the probability that the value of the unknown function at x lies above the max observed value plus some bias, i.e., P(f(x) ≥ μ+ + ε).
This is just a 1-D integration of the Gaussian distribution above the threshold set by the max observed value.
For each point x of the unknown function, we consider a Gaussian distribution placed vertically at that sample point. This distribution has its mean at the value of the unknown function there and some certain variance.
Standardizing that Gaussian gives the Kushner (1964) equation: PI(x) = Φ((μ(x) − μ+ − ε) / σ(x)).
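The "1-D integration" claim can be checked numerically: the closed-form PI (a Gaussian CDF after standardizing) matches a direct trapezoid integration of the Gaussian tail above the threshold. The μ, σ, f_best, and ε values below are arbitrary assumed numbers:

```python
import math

mu, sigma = 1.0, 0.5      # GP posterior mean and std at one point (assumed)
f_best, eps = 0.8, 0.01   # best observed value and the small bias epsilon

# Closed form: standardize, then take the Gaussian tail above the threshold.
z = (mu - f_best - eps) / sigma
pi_closed = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Numerical check: trapezoid-integrate the Gaussian density above f_best + eps.
def density(t):
    return math.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

lo, hi, n = f_best + eps, mu + 10.0 * sigma, 20000
h = (hi - lo) / n
pi_numeric = h * (0.5 * density(lo)
                  + sum(density(lo + i * h) for i in range(1, n))
                  + 0.5 * density(hi))
# The two values agree: PI is nothing more than a 1-D Gaussian tail integral.
```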
Exploration and Exploitation are frequently used terms for this phenomenon in optimization algorithms. There is a trade-off between the two, and we balance it using the “acquisition function”.
Our next chosen point (x) should have a high mean (exploitation) and a high variance (exploration).
A high variance means the model is uncertain there, so searching in a high-variance region means exploring the point x.
A high mean means the model already predicts a large value there; leaning on the mean is exploitation, and if we weight it too heavily we keep picking points near the current best instead of exploring.
The above picture gives some intuition about our algorithm. The red curve is our true objective function. Unfortunately, we don’t know this function or its equation, so we are going to approximate it using a Gaussian Process. Through our sampled points (4 samples here), we draw an intuitive/confident curve that fits the observed samples. The green region shows the confidence region, where the curve's points are most probably positioned. From this prior knowledge, we take the second point (f+) as the maximum observed value, so the next max point should be above or equal to it. We draw a blue line at that height; the next max point should fall above this line. From the intersection of f+ with the confidence region, we can discard the curve samples below f+, in line with our goal of finding the arg max. We have therefore narrowed down our area of investigation. The same process continues with the next sampled points.
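Putting the pieces together, the loop just described (fit the surrogate, maximize the acquisition over candidates, sample, repeat) can be sketched end to end. Everything here is an illustrative assumption: the objective, kernel length scale, candidate grid, and iteration count are chosen for the demo, not prescribed by the method.

```python
import math
import numpy as np

def objective(x):
    # The "unknown" true function; the loop only ever sees sampled values.
    return -(x - 0.6) ** 2

def rbf(a, b, ls=0.2):
    # Squared-exponential kernel between two 1-D point sets.
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def gp(xs, ys, xq):
    # GP posterior mean and std at query points xq, given observations (xs, ys).
    K = rbf(xs, xs) + 1e-6 * np.eye(len(xs))
    Ks = rbf(xs, xq)
    mu = Ks.T @ np.linalg.solve(K, ys)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def prob_improvement(mu, sigma, f_best, xi=0.01):
    # PI acquisition: Gaussian tail probability above f_best + xi.
    z = (mu - f_best - xi) / sigma
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))

grid = np.linspace(0.0, 1.0, 201)   # candidate points to score
xs = np.array([0.1, 0.9])           # two initial samples
ys = objective(xs)
for _ in range(8):
    mu, sigma = gp(xs, ys, grid)
    x_next = grid[np.argmax(prob_improvement(mu, sigma, ys.max()))]
    xs = np.append(xs, x_next)
    ys = np.append(ys, objective(x_next))
# The best sampled point drifts toward the true maximizer at x = 0.6,
# without the loop ever evaluating the objective outside its samples.
```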