The Evolution of Gradient Descent Optimization Algorithms

Ramraj Chandradevan
6 min read · Jul 13, 2017


Gradient descent is the most popular optimization technique used in neural networks. The optimizers shipped with libraries such as Lasagne, Caffe, and Keras are often treated as black boxes, because practical explanations of their strengths and weaknesses are hard to find. Gradient Descent (GD) is a technique to minimize an objective function J(θ) with respect to the model parameters θ.

  • It repeatedly updates the parameters in the direction opposite to the gradient of the objective function (see the sketch below).
  • The size of each step is controlled by a parameter called the learning rate, η.
  • Usually we frame the problem as minimization; maximizing a function is equivalent to minimizing its negative, so nothing is lost.
(Figure: minimum and maximum of a function)
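
To make the update rule concrete, here is a minimal NumPy sketch (not from the original post); grad_J is a hypothetical function that returns the gradient ∇θ J(θ):

```python
import numpy as np

def gradient_descent_step(theta, grad_J, lr=0.1):
    """One vanilla gradient descent step: theta <- theta - lr * grad_J(theta)."""
    return theta - lr * grad_J(theta)

# Toy example: minimize J(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda th: 2 * th, lr=0.1)
print(theta)  # very close to the minimizer theta = 0
```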

There are three main GD variants:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent (SGD)
  3. Mini-batch Gradient Descent

They are categorized by the amount of training data used to compute each update. A trade-off exists between the ACCURACY of the parameter update and the TIME it takes to perform an update.

1. Batch Gradient Descent (aka Vanilla Gradient Descent)

  • The update uses the gradient of the cost function over the entire training set:
θ = θ − η · ∇θ J(θ)
  • θ — the model parameters, updated once per pass over the entire training set.
  • For ONE update, we must compute the gradient over the whole data set. This leads to redundant computation, since gradients for similar examples are recomputed before every single update. It also does not allow updating the model online, i.e. with new examples on the fly.
  • For a large data set it is:
  1. Very slow
  2. Intractable (if the data does not fit in memory)

CONVEX error surfaces → guaranteed to converge to the GLOBAL minimum

NON-CONVEX surfaces → converges to a LOCAL minimum
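
A rough sketch of batch gradient descent, assuming a hypothetical grad_J(theta, X, y) that returns the gradient over whatever data it is given:

```python
import numpy as np

def batch_gradient_descent(theta, X, y, grad_J, lr=0.01, epochs=100):
    """Batch (vanilla) GD: ONE parameter update per pass over the WHOLE training set."""
    for _ in range(epochs):
        theta = theta - lr * grad_J(theta, X, y)   # gradient over all examples at once
    return theta
```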

2. Stochastic Gradient Descent (SGD)

  • Contrasts with VGD (Vanilla GD): it performs a parameter update for every single training example.
  • runs much faster
  • can be used to learn online
θ = θ − η · ∇θ J(θ; x(i), y(i)), where x(i) is a training example and y(i) its label
  • performs frequent updates with high variance → the objective fluctuates heavily
  • overshooting is possible.

Even if BGD converges smoothly to a minimum, SGD keeps fluctuating heavily close to it; the fluctuations can carry it out of the current basin (even away from the global minimum) and into a nearby local minimum.

Slowly decreasing the learning rate, however, gives SGD almost the same convergence behavior as VGD.
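
A minimal SGD sketch under the same assumption, i.e. a hypothetical grad_J that computes the gradient for whichever examples it receives:

```python
import numpy as np

def sgd(theta, X, y, grad_J, lr=0.01, epochs=10):
    """SGD: one parameter update per training example, in shuffled order."""
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):      # reshuffle every epoch
            theta = theta - lr * grad_J(theta, X[i:i+1], y[i:i+1])  # single example
    return theta
```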

3. Mini-batch Gradient Descent

  • Takes the BEST of the above two.
  • performs an update for every mini-batch of n training examples:
θ = θ − η · ∇θ J(θ; x(i:i+n), y(i:i+n))
  • reduces the variance of the parameter updates
  • leads to more stable convergence
  • can exploit highly optimized matrix operations to compute the mini-batch gradient very efficiently
  • common mini-batch sizes range from 50 to 256
  • this is the variant people typically mean when they say “SGD” in the context of neural networks (see the sketch below).
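
And the mini-batch version, again written against the hypothetical grad_J helper:

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_J, lr=0.01, batch_size=64, epochs=10):
    """Mini-batch GD: one parameter update per mini-batch of `batch_size` examples."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)             # shuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            theta = theta - lr * grad_J(theta, X[batch], y[batch])
    return theta
```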

Problems encountered with the Gradient Descent variants:

  1. Choosing the learning rate — DIFFICULT
  2. Learning rate schedules — help, but have to be defined in advance and cannot adapt to the data
  3. Training on sparse data with a single learning rate for all parameters — INEFFICIENT
  4. Need to avoid getting trapped in suboptimal local minima (and saddle points)

Gradient Descent Optimization ALGORITHMS (different from the VARIANTS above)

  • These are widely used in Deep Learning.

1) MOMENTUM

  • Helps accelerate SGD in the relevant direction
  • dampens oscillations
  • adds a fraction γ of the previous update vector to the current update: v_t = γ·v_{t−1} + η·∇θ J(θ), then θ = θ − v_t

Here the sign of the update term may change depending on whether we are descending into a valley or climbing back up the error surface.

  • Usually γ is set to around 0.9
  • It is the same phenomenon as a ball rolling downhill against air resistance (γ plays the role of air resistance, keeping the accumulated speed in check).

The momentum term:

increases — if successive gradients point in the same direction (towards the minimum)

reduces — if the gradient changes direction
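
A minimal sketch of one momentum step; the velocity v is assumed to start as a vector of zeros, and grad is the current gradient ∇θ J(θ) computed however you like (e.g. on a mini-batch):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    """Momentum update: v_t = gamma * v_{t-1} + lr * grad;  theta <- theta - v_t."""
    v = gamma * v + lr * grad      # accumulate a decaying sum of past gradients
    theta = theta - v
    return theta, v
```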

2) Nesterov Accelerated Gradient (NAG)

  • In the algorithms above, the update blindly follows the accumulated gradient along its path.
  • We would like a smarter update that is aware, at each step, of where the parameters are about to go.

The parameters “look ahead” before updating:

v_t = γ·v_{t−1} + η·∇θ J(θ − γ·v_{t−1}), then θ = θ − v_t (the gradient is evaluated at the approximate future position θ − γ·v_{t−1}, i.e. that look-ahead value is computed first)

γ is again set to around 0.9

  • This anticipatory update prevents us from moving too fast and overshooting.
  • results in increased responsiveness.
  • increases performance in RNNs (Recurrent Neural Networks).
(Figure: the Nesterov Accelerated Gradient update)
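
A sketch of one NAG step, assuming the same hypothetical grad_J function so the gradient can be re-evaluated at the look-ahead point:

```python
import numpy as np

def nag_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point."""
    grad_ahead = grad_J(theta - gamma * v)   # gradient at the anticipated future position
    v = gamma * v + lr * grad_ahead
    theta = theta - v
    return theta, v
```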

3) Adagrad (Adaptive Gradient)

  • Adapts the learning rate to each PARAMETER.

IN-FREQUENT parameters → larger updates

FREQUENT parameters → smaller updates

  • well suited for sparse data.
  • improves the robustness of SGD.
  • uses a different learning rate for every parameter θi at every time step t.
  • updates each parameter based on its past gradients (an accumulated sum of squared gradients).
  • Gt is a diagonal matrix; each diagonal element is the sum of the squares of the past gradients with respect to the corresponding parameter.
  • ϵ is a smoothing term that avoids division by zero, usually on the order of 1e-8.

Interestingly, if the square root operation is removed, the algorithm performs far worse than it does with it.

In vector form the update is

θ_{t+1} = θ_t − (η / √(G_t + ϵ)) ⊙ g_t, where ⊙ denotes element-wise matrix–vector multiplication.

— Removes the need to manually tune the learning rate.

— The usual default value for the learning rate is 0.01.
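
A minimal per-step sketch of the Adagrad rule above; G is the running sum of squared gradients (the diagonal of Gt), initialized to zeros:

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter step size scaled by accumulated squared gradients."""
    G = G + grad ** 2                               # running sum of squared gradients
    theta = theta - lr * grad / np.sqrt(G + eps)    # infrequent parameters keep larger steps
    return theta, G
```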

Weakness — the denominator (the accumulated sum of squared gradients) keeps growing during training

→ which shrinks the learning rate

→ → until it eventually becomes insignificantly small

→ → → so the parameters stop moving and learning effectively stops before reaching the minimum

The remedy is the Adadelta technique.

4) Adadelta (Adaptive Delta)

  • an extension of Adagrad

— it reduces Adagrad’s aggressive, monotonically decreasing learning rate

  • instead of accumulating all past squared gradients, the window of accumulation is restricted to a fixed size w.

In practice this window is implemented as a decaying average of past squared gradients → E[g²]_t

  • The running average at time step t depends only on

— the previous average

— the current gradient:

E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g²_t

  • γ is set to around 0.9

The parameter update vector is Δθ_t.

In vector notation the update is Δθ_t = −(η / √(E[g²]_t + ϵ)) · g_t.

The denominator is simply the root mean square (RMS) of the gradient, RMS[g]_t.

The units of this update (as in SGD, Momentum, and Adagrad) do not match the units of the parameters, so Adadelta additionally keeps a decaying average of squared parameter updates and uses its RMS in place of the learning rate: Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) · g_t.
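
A sketch of one Adadelta step, with Eg2 and Edx2 the decaying averages of squared gradients and squared parameter updates (both initialized to zeros):

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad, gamma=0.9, eps=1e-8):
    """Adadelta: RMS of recent parameter updates replaces the global learning rate."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2                 # decaying avg of squared grads
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad    # RMS[dx]_{t-1} / RMS[g]_t * g_t
    Edx2 = gamma * Edx2 + (1 - gamma) * delta ** 2              # decaying avg of squared updates
    theta = theta + delta
    return theta, Eg2, Edx2
```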

5) RMSprop

  • an as-yet unpublished algorithm.
  • an adaptive learning rate method proposed by Geoffrey Hinton (University of Toronto) in his Coursera lectures.
  • RMSprop and Adadelta were developed independently around the same time.
  • RMSprop divides the LEARNING RATE by an EXPONENTIALLY DECAYING AVERAGE of squared gradients.
  • γ is set to 0.9

A good default value for the learning rate is 0.001.
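
A minimal RMSprop step following the description above (Eg2 is the decaying average of squared gradients, initialized to zeros):

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: divide the step by an exponentially decaying average of squared gradients."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    theta = theta - lr * grad / np.sqrt(Eg2 + eps)
    return theta, Eg2
```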

6) Adam (Adaptive Moment Estimation)

  • computes adaptive learning rates for each parameter θ.
  • besides keeping an exponentially decaying average of past squared gradients (like Adadelta and RMSprop), it also keeps an exponentially decaying average of past gradients (similar to momentum):
m_t = β1·m_{t−1} + (1 − β1)·g_t,  v_t = β2·v_{t−1} + (1 − β2)·g²_t

m_t → ESTIMATE of the first moment (the mean) of the gradients

v_t → ESTIMATE of the second moment (the uncentered variance) of the gradients

  • Hence the name, Adaptive Moment Estimation.

Remember — MOMENTUM and MOMENT are different concepts here.

  • These moment estimates are initialized as vectors of zeros.
  • As a result they are biased towards zero, especially during the INITIAL TIME STEPS and when the DECAY RATES are small (i.e. β1 and β2 close to 1).
  • The authors counteract these biases with pre-computed correction factors:

m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)  (bias-corrected FIRST and SECOND moment estimates)

The update rule is then:

θ_{t+1} = θ_t − (η / (√v̂_t + ϵ)) · m̂_t  (the Adam update rule)

β1 → 0.9

β2 → 0.999

ϵ → 10^−8

… the above default values are suggested by the Adam authors.

  • In practice this is one of the most favored methods, and it frequently outperforms the algorithms above.
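
A sketch of one Adam step following the update rule above; m and v start as zero vectors and t counts the steps starting from 1:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first- and second-moment estimates of the gradient."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```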

Summary :-

  1. Momentum
  2. NAG(Nesterov Accelerated Gradient)
  3. Adagrad
  4. Adadelta
  5. RMSprop
  6. Adam

Rule of Thumb :-

Adam usually outperforms the rest;

followed by AdaGrad & AdaDelta;

If the data is sparse, the non-adaptive methods (SGD, Momentum, and Nesterov) are inferior to the adaptive ones.

