# Shallow Understanding on Bayesian Optimization

#### Hard terms to know in advance :-

- Gaussian Process

- Noise-free global optimization

- Noisy global optimization

- Acquisition functions

- Surrogate Functions (/Gaussian Process)

- Stochastic

- Bayesian Regression

### What is a Gaussian Process ?

- It is a form of **Bayesian Regression**.

- It is an **exact** interpolation **regression** method.

- It is a **natural generalization** of **linear** regression.

- It specifies a **distribution over functions**.
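To make the "distribution over functions" idea concrete, here is a minimal noise-free GP regression sketch in NumPy. It assumes a squared-exponential (RBF) kernel with unit signal variance; the function names and toy data are my own, not from any particular library.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential (RBF) kernel between two sets of 1-D points."""
    sq_dists = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dists / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale=1.0, jitter=1e-10):
    """Noise-free GP posterior mean and variance at the test points."""
    K = rbf_kernel(x_train, x_train, length_scale) + jitter * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale)
    K_ss = rbf_kernel(x_test, x_test, length_scale)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    cov = K_ss - K_s.T @ K_inv @ K_s
    return mean, np.diag(cov)

# Toy observations (made up for the example).
x_train = np.array([0.0, 1.0, 2.0])
y_train = np.sin(x_train)

# Noise-free GP is an exact interpolator: at the training points the
# posterior mean reproduces the observations and the variance collapses.
mean, var = gp_posterior(x_train, y_train, x_train)
```

Evaluating `gp_posterior` at points away from the data would instead give a mean pulled toward zero and a variance growing toward the kernel's signal variance, which is exactly the "distribution over functions" picture.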

**Bayesian Optimization** is a method built on approximation. If we don't know a function, what do we usually do? Of course, we try to guess or approximate it from some known prior knowledge. The same idea is behind the **posterior probability**. The setting here is that we have **observations** arriving record by record (**online learning**), and we train a model on them. The trained model will follow some function, and that unknown function depends entirely on the **learnt data**. So our task is to find the **hyper-parameters** that **maximize** this learning scheme.

We have some *variables* here. One is the **observation records (features + labels)** and the other is the **parameters**, which define the model.

### eg.

(Say, in y = mx + c: m and c are the parameters, while x and y are the variables, i.e. the labels and features. Our job is to find the m and c that give our best model.)
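As a toy illustration of recovering such parameters (ordinary least squares rather than anything Bayesian yet), we can fit m and c from data generated by a known line; the numbers are made up for this example.

```python
import numpy as np

# Toy data generated from y = 2x + 1, so the "true" parameters are m=2, c=1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Least-squares fit of y = m*x + c: stack [x, 1] as the design matrix
# and solve for the two parameters.
A = np.column_stack([x, np.ones_like(x)])
m, c = np.linalg.lstsq(A, y, rcond=None)[0]
```

With noise-free data the fit recovers m and c exactly; Bayesian regression would instead return a distribution over plausible (m, c) pairs.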

Bayesian Optimization helps to find the **best model** among many. We do have cross-validation in hand, but how many samples from a pre-defined list would we have to try to choose the best model among them? That is why the Bayesian approach **speeds up** the process: it reduces the computation and does not expect the person to guess good values. This optimization technique is based on **randomness** and **probability distributions**.

To say it in my own words: the *first observed point* comes to the model, and we draw a line. Then the next point follows. We join those 2 points and draw an adjusted line from the previous curve. A third point comes, so we draw a non-linear curve. As the number of observed points increases, the number of possible curves increases. *It is like the sampling theorem in Statistics, where we estimate a population parameter from sample statistics.*

We plot another **histogram curve** along with the non-linear objective-function curve. It tells us where the **max** value is likely to be among all past observations for a given objective-function curve (finding the **arg max**).

Our goal here is **not** to trace out the objective function (which is unknown) as a complete curve with as many observed points as possible. Instead, we need the **arg of the maximum value** (the input at which the function is maximal). So we should divert our attention to the histogram function. As the number of possible objective-function curves increases, the histogram forms a distribution, which describes the **acquisition function**. That is the idea behind Bayesian Optimization.

REMEMBER: our primary goal is to determine the arg of the max value, and our secondary goal is to select the next best possible max value, which could lie on the function curve.

We see many **wiggly curves** in the plot. They show the range where the function curve's **next point could be located**, given a sampled point. The variance gives the spread of the distribution around the mean value on the function curve.

When the observed values sit exactly on the unknown function curve, this is **noise-free optimization**. But when we need to consider **noisy optimization**, our unknown function slightly deviates from the observed points by a noisy error value.
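A sketch of the difference, assuming a GP surrogate with an RBF kernel (my own minimal implementation, not a library API): the noise-free posterior mean passes exactly through the observations, while adding an assumed noise variance to the kernel diagonal lets the fitted curve deviate from them.

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

def gp_mean(x_train, y_train, x_test, noise_var=0.0):
    """GP posterior mean; noise_var=0 gives noise-free (exact) interpolation."""
    K = rbf(x_train, x_train) + (noise_var + 1e-10) * np.eye(len(x_train))
    return rbf(x_train, x_test).T @ np.linalg.solve(K, y_train)

x = np.array([0.0, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0])

exact = gp_mean(x, y, x, noise_var=0.0)     # passes through the observations
smoothed = gp_mean(x, y, x, noise_var=0.1)  # shrinks toward a smoother curve
```

The noise term says "trust each observation only up to its error bar", which is why the noisy posterior no longer interpolates exactly.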

We define the **acquisition function** under many measures, and some of them are here:

1. PMAX

2. IEMAX

3. MPI = Maximum Probability of Improvement

4. MEI = Maximum Expected Improvement

5. UCB = Upper Confidence Bound

6. GP-Hedge

7. MM = Maximum Mean

8. MUI = Maximum Upper Interval
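Two of the measures above can be sketched directly from the GP posterior mean and standard deviation. This is a hedged sketch assuming maximization; `xi` and `kappa` are tunable trade-off parameters, and the function names are my own.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """MEI: expected amount by which f(x) exceeds f_best (plus a margin xi)."""
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def upper_confidence_bound(mu, sigma, kappa=2.0):
    """UCB: posterior mean plus kappa posterior standard deviations."""
    return mu + kappa * sigma
```

In use, we evaluate one of these over a set of candidate points and pick the arg max as the next point to sample.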

**Acquisition Criteria** :- *which sample should be selected next*

**Surrogate Function/ Response Surface(model)**

- the posterior over **unobserved** points, **based on the prior**.

- its **parameters** might be based on the prior.

### An Acquisition function :-

The measurement is *Probability of Improvement (PI)*.

**μ+** = the **best observed value** among the past points.

We are looking for the **max** value of the **unknown function**. But among our past observations, the **next possible** max value should be larger than the **current max observed value**.

**ε** = a **very small** arbitrary value that offsets the threshold in the Gaussian distribution.

### Our Goal :-

**PI(x)** = the probability that the **next max** of the **unknown function** is found above the **max observed value** plus **some bias**.

This is just a 1-D integration of the Gaussian distribution, with the max observed value as the threshold boundary.

For each point of the unknown function, we consider a Gaussian distribution placed vertically at that sample point. This distribution has its mean at the value on the unknown function and some certain variance.

We standardize this in **Kushner's 1964** equation: PI(x) = P(f(x) ≥ μ+ + ε) = Φ((μ(x) − μ+ − ε) / σ(x)), where Φ is the standard normal CDF and μ(x), σ(x) are the surrogate's posterior mean and standard deviation at x.
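A minimal sketch of this PI computation, assuming `mu` and `sigma` come from the GP posterior at a candidate point and `f_best` is the best observed value (μ+):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, eps=0.01):
    """Kushner-style PI: P(f(x) >= f_best + eps), with f(x) ~ N(mu, sigma^2)."""
    return norm.cdf((mu - f_best - eps) / sigma)

# A candidate whose posterior mean equals exactly f_best + eps has a
# 50% chance of improvement; higher means or larger variance raise it.
```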

**Exploration & Exploitation** are the terms frequently used to explain this kind of phenomenon and optimization algorithm. A trade-off exists between these 2, so we **balance** it using the "**acquisition function**".

Our **next chosen point (x)** should have a **high mean** (*exploitation*) and a **high variance** (*exploration*).

We search for the next point in a high-variance region of the distribution; that means exploring the point x. A high mean means we choose the next point (x) with a high offset/bias, so we should give some exploitation weight to pull down the mean of the next point.

The above picture gives some intuition about our algorithm. The red curve is our **True Objective** function curve. Unfortunately, we don’t know this function and its equation. So we gonna approximate it using **Gaussian Process. **Throughout our** Sampled Points**(given 4 samples here), we draw a intuitive/ confident curve that fits with our observed samples. So the green region shows the **Confidence Region**, where the most probability of curve points could be positioned. From the above **prior knowledge**, we determine that second point(**f+**) as the **maximum observed value**. So the next max point should be above it or equal. We draw a blue line here. The next max point should fall above this line. So from the **intersecting** points of** f+** and **Confidence Region, **we could assume that the curve samples below the** f+ **should be discarded along with our goal to find **arg max. **Therefore, now we have narrowed down our area of investigation. This same process continues with the next **Sampled Points.**