The Evolution of Gradient Descent Optimization Algorithms

Ramraj Chandradevan
6 min read · Jul 13, 2017


Gradient descent is the most popular optimization technique used in neural networks. The optimizers shipped with libraries such as Lasagne, Caffe, and Keras are often treated as black boxes, because practical explanations of their strengths and weaknesses are hard to find. Gradient Descent (GD) is a technique to minimize an objective function J(θ) with respect to the model parameters θ.

  • It repeatedly updates the parameters in the direction opposite to the gradient of the objective function (see the sketch below).
  • The size of each step is controlled by a parameter called the learning rate, η.
  • Usually we frame the problem as minimization; maximizing a function is equivalent to minimizing its negative, so nothing is lost.
(Figure: minimum and maximum of a function)
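
To make the update rule concrete, here is a minimal NumPy sketch (not from the original post); grad_J is a hypothetical function that returns the gradient ∇θ J(θ):

```python
import numpy as np

def gradient_descent_step(theta, grad_J, lr=0.1):
    """One vanilla gradient descent step: theta <- theta - lr * grad_J(theta)."""
    return theta - lr * grad_J(theta)

# Toy example: minimize J(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([5.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda th: 2 * th, lr=0.1)
print(theta)  # very close to the minimizer theta = 0
```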

There are three main GD variants:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent (SGD)
  3. Mini-batch Gradient Descent

They are categorized by the amount of training data used to compute each update. A trade-off exists between the ACCURACY of the parameter update and the TIME it takes to perform an update.

1. Batch Gradient Descent (aka Vanilla Gradient Descent)

  • The update uses the gradient of the cost function over the entire training set:
θ = θ − η · ∇θ J(θ)
  • θ — the model parameters, updated once per pass over the entire training set.
  • For ONE update, we must compute the gradient over the whole data set. This leads to redundant computation, since gradients for similar examples are recomputed before every single update. It also does not allow updating the model online, i.e. with new examples on the fly.
  • For a large data set it is:
  1. Very slow
  2. Intractable (if the data does not fit in memory)

CONVEX error surfaces → guaranteed to converge to the GLOBAL minimum

NON-CONVEX surfaces → converges to a LOCAL minimum
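
A rough sketch of batch gradient descent, assuming a hypothetical grad_J(theta, X, y) that returns the gradient over whatever data it is given:

```python
import numpy as np

def batch_gradient_descent(theta, X, y, grad_J, lr=0.01, epochs=100):
    """Batch (vanilla) GD: ONE parameter update per pass over the WHOLE training set."""
    for _ in range(epochs):
        theta = theta - lr * grad_J(theta, X, y)   # gradient over all examples at once
    return theta
```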

2. Stochastic Gradient Descent (SGD)

  • Contrasts with VGD (Vanilla GD): it performs a parameter update for every single training example.
  • runs much faster
  • can be used to learn online
θ = θ − η · ∇θ J(θ; x(i), y(i)), where x(i) is a training example and y(i) its label
  • performs frequent updates with high variance → the objective fluctuates heavily
  • overshooting is possible.

Even if BGD converges smoothly to a minimum, SGD keeps fluctuating heavily close to it; the fluctuations can carry it out of the current basin (even away from the global minimum) and into a nearby local minimum.

Slowly decreasing the learning rate, however, gives SGD almost the same convergence behavior as VGD.
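
A minimal SGD sketch under the same assumption, i.e. a hypothetical grad_J that computes the gradient for whichever examples it receives:

```python
import numpy as np

def sgd(theta, X, y, grad_J, lr=0.01, epochs=10):
    """SGD: one parameter update per training example, in shuffled order."""
    for _ in range(epochs):
        for i in np.random.permutation(len(X)):      # reshuffle every epoch
            theta = theta - lr * grad_J(theta, X[i:i+1], y[i:i+1])  # single example
    return theta
```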

3. Mini-batch Gradient Descent

  • Takes the BEST of the above two.
  • performs an update for every mini-batch of n training examples:
θ = θ − η · ∇θ J(θ; x(i:i+n), y(i:i+n))
  • reduces the variance of the parameter updates
  • leads to more stable convergence
  • can exploit highly optimized matrix operations to compute the mini-batch gradient very efficiently
  • common mini-batch sizes range from 50 to 256
  • this is the variant people typically mean when they say “SGD” in the context of neural networks (see the sketch below).
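
And the mini-batch version, again written against the hypothetical grad_J helper:

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_J, lr=0.01, batch_size=64, epochs=10):
    """Mini-batch GD: one parameter update per mini-batch of `batch_size` examples."""
    n = len(X)
    for _ in range(epochs):
        order = np.random.permutation(n)             # shuffle the data every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            theta = theta - lr * grad_J(theta, X[batch], y[batch])
    return theta
```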

Problems encountered with the Gradient Descent variants:

  1. Choosing the learning rate — DIFFICULT
  2. Learning rate schedules — help, but have to be defined in advance and cannot adapt to the data
  3. Training on sparse data with a single learning rate for all parameters — INEFFICIENT
  4. Need to avoid getting trapped in suboptimal local minima (and saddle points)

Gradient Descent Optimization ALGORITHMS (different from the VARIANTS above)

  • These are widely used in Deep Learning.

1) MOMENTUM

  • Helps accelerate SGD in the relevant direction
  • dampens oscillations
  • adds a fraction γ of the previous update vector to the current update: v_t = γ·v_{t−1} + η·∇θ J(θ), then θ = θ − v_t

Here the sign of the update term may change depending on whether we are descending into a valley or climbing back up the error surface.

  • Usually γ is set to around 0.9
  • It is the same phenomenon as a ball rolling downhill against air resistance (γ plays the role of air resistance, keeping the accumulated speed in check).

The momentum term:

increases — if successive gradients point in the same direction (towards the minimum)

reduces — if the gradient changes direction
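
A minimal sketch of one momentum step; the velocity v is assumed to start as a vector of zeros, and grad is the current gradient ∇θ J(θ) computed however you like (e.g. on a mini-batch):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    """Momentum update: v_t = gamma * v_{t-1} + lr * grad;  theta <- theta - v_t."""
    v = gamma * v + lr * grad      # accumulate a decaying sum of past gradients
    theta = theta - v
    return theta, v
```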

2) Nesterov Accelerated Gradient (NAG)

  • In the algorithms above, the update blindly follows the accumulated gradient along its path.
  • We would like a smarter update that is aware, at each step, of where the parameters are about to go.

The parameters “look ahead” before updating:

v_t = γ·v_{t−1} + η·∇θ J(θ − γ·v_{t−1}), then θ = θ − v_t (the gradient is evaluated at the approximate future position θ − γ·v_{t−1}, i.e. that look-ahead value is computed first)

γ is again set to around 0.9

  • This anticipatory update prevents us from moving too fast and overshooting.
  • results in increased responsiveness.
  • increases performance in RNNs (Recurrent Neural Networks).
(Figure: the Nesterov Accelerated Gradient update)
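
A sketch of one NAG step, assuming the same hypothetical grad_J function so the gradient can be re-evaluated at the look-ahead point:

```python
import numpy as np

def nag_step(theta, v, grad_J, lr=0.01, gamma=0.9):
    """Nesterov accelerated gradient: evaluate the gradient at the look-ahead point."""
    grad_ahead = grad_J(theta - gamma * v)   # gradient at the anticipated future position
    v = gamma * v + lr * grad_ahead
    theta = theta - v
    return theta, v
```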

3) Adagrad (Adaptive Gradient)

  • Adapts the learning rate to each PARAMETER.

IN-FREQUENT parameters → larger updates

FREQUENT parameters → smaller updates

  • well suited for sparse data.
  • improves the robustness of SGD.
  • uses a different learning rate for every parameter θi at every time step t.
  • updates each parameter based on its past gradients (an accumulated sum of squared gradients).
  • Gt is a diagonal matrix; each diagonal element is the sum of the squares of the past gradients with respect to the corresponding parameter.
  • ϵ is a smoothing term that avoids division by zero, usually on the order of 1e-8.

Interestingly, if the square root operation is removed, the algorithm performs far worse than it does with it.

In vector form the update is

θ_{t+1} = θ_t − (η / √(G_t + ϵ)) ⊙ g_t, where ⊙ denotes element-wise matrix–vector multiplication.

— Removes the need to manually tune the learning rate.

— The usual default value for the learning rate is 0.01.
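
A minimal per-step sketch of the Adagrad rule above; G is the running sum of squared gradients (the diagonal of Gt), initialized to zeros:

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter step size scaled by accumulated squared gradients."""
    G = G + grad ** 2                               # running sum of squared gradients
    theta = theta - lr * grad / np.sqrt(G + eps)    # infrequent parameters keep larger steps
    return theta, G
```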

Weakness — the denominator (the accumulated sum of squared gradients) keeps growing during training

→ which shrinks the learning rate

→ → until it eventually becomes insignificantly small

→ → → so the parameters stop moving and learning effectively stops before reaching the minimum

The remedy is the Adadelta technique.

4) Adadelta (Adaptive Delta)

  • an extension of Adagrad

— it reduces Adagrad’s aggressive, monotonically decreasing learning rate

  • instead of accumulating all past squared gradients, the window of accumulation is restricted to a fixed size w.

In practice this window is implemented as a decaying average of past squared gradients → E[g²]_t

  • The running average at time step t depends only on

— the previous average

— the current gradient:

E[g²]_t = γ·E[g²]_{t−1} + (1 − γ)·g²_t

  • γ is set to around 0.9

The parameter update vector is Δθ_t.

In vector notation the update is Δθ_t = −(η / √(E[g²]_t + ϵ)) · g_t.

The denominator is simply the root mean square (RMS) of the gradient, RMS[g]_t.

The units of this update (as in SGD, Momentum, and Adagrad) do not match the units of the parameters, so Adadelta additionally keeps a decaying average of squared parameter updates and uses its RMS in place of the learning rate: Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) · g_t.
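
A sketch of one Adadelta step, with Eg2 and Edx2 the decaying averages of squared gradients and squared parameter updates (both initialized to zeros):

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad, gamma=0.9, eps=1e-8):
    """Adadelta: RMS of recent parameter updates replaces the global learning rate."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2                 # decaying avg of squared grads
    delta = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad    # RMS[dx]_{t-1} / RMS[g]_t * g_t
    Edx2 = gamma * Edx2 + (1 - gamma) * delta ** 2              # decaying avg of squared updates
    theta = theta + delta
    return theta, Eg2, Edx2
```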

5) RMSprop

  • an as-yet unpublished algorithm.
  • an adaptive learning rate method proposed by Geoffrey Hinton (University of Toronto) in his Coursera lectures.
  • RMSprop and Adadelta were developed independently around the same time.
  • RMSprop divides the LEARNING RATE by an EXPONENTIALLY DECAYING AVERAGE of squared gradients.
  • γ is set to 0.9

A good default value for the learning rate is 0.001.
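
A minimal RMSprop step following the description above (Eg2 is the decaying average of squared gradients, initialized to zeros):

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: divide the step by an exponentially decaying average of squared gradients."""
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    theta = theta - lr * grad / np.sqrt(Eg2 + eps)
    return theta, Eg2
```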

6) Adam (Adaptive Moment Estimation)

  • computes adaptive learning rates for each parameter θ.
  • besides keeping an exponentially decaying average of past squared gradients (like Adadelta and RMSprop), it also keeps an exponentially decaying average of past gradients (similar to momentum):
m_t = β1·m_{t−1} + (1 − β1)·g_t,  v_t = β2·v_{t−1} + (1 − β2)·g²_t

m_t → ESTIMATE of the first moment (the mean) of the gradients

v_t → ESTIMATE of the second moment (the uncentered variance) of the gradients

  • Hence the name, Adaptive Moment Estimation.

Remember — MOMENTUM and MOMENT are different concepts here.

  • These moment estimates are initialized as vectors of zeros.
  • As a result they are biased towards zero, especially during the INITIAL TIME STEPS and when the DECAY RATES are small (i.e. β1 and β2 close to 1).
  • The authors counteract these biases with pre-computed correction factors:

m̂_t = m_t / (1 − β1^t),  v̂_t = v_t / (1 − β2^t)  (bias-corrected FIRST and SECOND moment estimates)

The update rule is then:

θ_{t+1} = θ_t − (η / (√v̂_t + ϵ)) · m̂_t  (the Adam update rule)

β1 → 0.9

β2 → 0.999

ϵ → 10^−8

… the above default values are suggested by the Adam authors.

  • In practice this is one of the most favored methods, and it frequently outperforms the algorithms above.
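
A sketch of one Adam step following the update rule above; m and v start as zero vectors and t counts the steps starting from 1:

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected first- and second-moment estimates of the gradient."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```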

Summary :-

  1. Momentum
  2. NAG(Nesterov Accelerated Gradient)
  3. Adagrad
  4. Adadelta
  5. RMSprop
  6. Adam

Rule of Thumb :-

Adam usually outperforms the rest;

followed by AdaGrad & AdaDelta;

If the data is sparse, the non-adaptive methods (SGD, Momentum, and Nesterov) are inferior to the adaptive ones.

