The Evolution of Gradient Descent Optimization Algorithms
Gradient descent is the most popular optimization technique used in neural networks. Every modern deep learning library (such as Lasagne, Caffe, and Keras) implements these algorithms, yet they are often treated as black-box optimizers because their practical strengths and weaknesses are hard to explain. Gradient Descent (GD) is a technique to minimize an objective function J(θ) parameterized by the model's parameters θ.
- constantly updates the parameters in the direction opposite to the gradient of the objective function
- has a hyperparameter called the Learning Rate (η), the step size.
- Usually we frame the problem as minimization; a maximization problem can of course be solved by minimizing the negative of the function.
There are 3 main GD variants:
- Batch Gradient Descent
- Stochastic Gradient Descent (SGD)
- Mini-batch Gradient Descent
We categorize them by the amount of training data used to compute each update. A trade-off exists between the ACCURACY of the parameter update & the TIME it takes to perform an update.
1. Batch Gradient Descent (aka Vanilla Gradient Descent)
- update rule: θ = θ − η · ∇θJ(θ)
- θ — the model parameters, updated using the ENTIRE training set.
- For ONE update, we must compute the gradient over the WHOLE data set. It therefore performs redundant computations, recomputing gradients for similar examples before each parameter update. It also doesn't allow updating the model online (with new examples on the fly).
- For large data sets:
- Very slow
- Intractable (when the data doesn't fit in memory)
CONVEX error surfaces → guaranteed to converge to the GLOBAL minimum
NON-CONVEX surfaces → converge to a LOCAL minimum
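As a minimal sketch, assuming a made-up toy 1-D linear model y = w·x with true w = 2.0 (the data, function name, and hyperparameters here are all illustrative), batch gradient descent performs ONE update per pass over the entire training set:

```python
import numpy as np

# Toy data: a 1-D linear model y = w * x with true w = 2.0 (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X

def batch_gradient_descent(X, y, lr=0.1, epochs=200):
    w = 0.0
    for _ in range(epochs):
        # ONE update per epoch: gradient of the MSE over the WHOLE training set.
        grad = np.mean(2 * (w * X - y) * X)
        w -= lr * grad
    return w

w = batch_gradient_descent(X, y)
```

Note that each of the 200 updates touches all 100 examples, which is exactly what makes this variant slow on large data sets.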
2. Stochastic Gradient Descent (SGD)
- Contrasts with VGD (Vanilla GD): it performs ONE update per training example.
- runs much faster
- can be used to learn online
- performs frequent updates with high variance → the objective fluctuates heavily
- overshooting is possible.
While BGD converges smoothly to the minimum of the basin it starts in, SGD keeps fluctuating close to the minimum; that fluctuation lets it jump out of a local minimum into a potentially better one, but it also complicates exact convergence.
By slowly decreasing the learning rate, SGD shows the same convergence behavior as VGD.
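The same toy 1-D linear problem can be solved one example at a time; the `sgd` helper and its hyperparameters below are illustrative, not canonical:

```python
import numpy as np

# Same toy problem: 1-D linear model y = w * x with true w = 2.0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X

def sgd(X, y, lr=0.05, epochs=50):
    w = 0.0
    for _ in range(epochs):
        # One noisy update per TRAINING EXAMPLE, in shuffled order.
        for i in rng.permutation(len(X)):
            grad = 2 * (w * X[i] - y[i]) * X[i]
            w -= lr * grad
    return w

w = sgd(X, y)
```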
3. Mini-batch Gradient Descent
- combines the BEST of the above 2.
- performs an update on every mini-batch of n training examples.
- reduces the variance of the parameter updates
- more stable convergence
- can exploit highly efficient matrix-computation optimizations on each mini-batch.
- common batch sizes range from 50 to 256
- this is the variant typically meant by "SGD" when training neural networks.
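A mini-batch sketch of the same toy 1-D linear problem; the batch size and learning rate below are chosen just for this demo:

```python
import numpy as np

# Toy data: 1-D linear model y = w * x with true w = 2.0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=256)
y = 2.0 * X

def minibatch_gd(X, y, lr=0.1, batch_size=64, epochs=100):
    w = 0.0
    for _ in range(epochs):
        idx = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            # One update per mini-batch of `batch_size` examples.
            grad = np.mean(2 * (w * X[b] - y[b]) * X[b])
            w -= lr * grad
    return w

w = minibatch_gd(X, y)
```

The per-batch `np.mean` over 64 examples is what averages away much of SGD's update variance while still giving frequent updates.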
Problems encountered with these Gradient Descent variants:
- Choosing the learning rate — DIFFICULT
- Learning rate schedules — help, but must be defined in advance and cannot adapt to the data set
- Training on sparse data sets with one rate for all parameters — INEFFICIENT
- Need to avoid getting trapped in sub-optimal local minima
Gradient Descent Optimization ALGORITHMS (different from the VARIANTS above)
- Used in Deep Learning
1) MOMENTUM
- Helps accelerate SGD in the relevant direction
- dampens oscillations
- adds a fraction γ of the previous update vector to the current one:
v_t = γ · v_(t−1) + η · ∇θJ(θ)
θ = θ − v_t
Here the sign of the gradient term flips between the hill-descending and hill-climbing cases.
- Usually γ is set close to 0.9
- The same phenomenon as a ball rolling downhill against air resistance (γ is analogous to the air resistance).
The momentum term:
increases — if gradients keep pointing in the same direction, towards the minimum
reduces — if gradients change direction
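A minimal sketch of the momentum update (v ← γ·v + η·∇J; θ ← θ − v) on a made-up toy quadratic; the learning rate and iteration count are illustrative:

```python
def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    # v_t = gamma * v_{t-1} + lr * grad; theta is moved against v_t.
    v = gamma * v + lr * grad
    return theta - v, v

# Minimize the toy quadratic f(theta) = (theta - 3)^2, gradient 2*(theta - 3).
theta, v = 0.0, 0.0
for _ in range(500):
    theta, v = momentum_step(theta, v, 2 * (theta - 3))
```

While the gradient keeps the same sign, `v` accumulates and the steps grow; once theta overshoots 3, the sign flips and `v` is damped.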
2) Nesterov Accelerated Gradient (NAG)
- In the algorithms above, the path is followed blindly.
- But we want smarter processing: at each step, some awareness of where the path leads.
The parameter looks ahead to its approximate future position, θ − γ · v_(t−1), and updates using the gradient computed there.
- γ is set close to 0.9
- This anticipatory update prevents us from moving too fast past a minimum.
- results in increased responsiveness.
- increases performance in RNNs (Recurrent Neural Networks).
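A minimal NAG sketch on the same kind of toy quadratic; the only change from plain momentum is WHERE the gradient is evaluated (hyperparameters illustrative):

```python
def nag_step(theta, v, grad_fn, lr=0.01, gamma=0.9):
    # Evaluate the gradient at the LOOK-AHEAD position theta - gamma * v.
    grad = grad_fn(theta - gamma * v)
    v = gamma * v + lr * grad
    return theta - v, v

# Minimize the toy quadratic f(theta) = (theta - 3)^2.
theta, v = 0.0, 0.0
for _ in range(500):
    theta, v = nag_step(theta, v, lambda t: 2 * (t - 3))
```

Because the gradient is taken at the look-ahead point, the correction kicks in before the parameter overshoots, rather than after.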
3) Adagrad (Adaptive Gradient)
- Adapts the learning rate to each PARAMETER.
IN-FREQUENT parameters → larger updates
FREQUENT parameters → smaller updates
- well suited for sparse data.
- improves the robustness of SGD.
- uses a different learning rate for every parameter θi at every time step t.
- updates each parameter based on its past gradients (an accumulated sum of squared gradients):
θ_(t+1),i = θ_t,i − η / √(G_t,ii + ϵ) · g_t,i
- G_t — a diagonal matrix; each diagonal element (i, i) is the sum of squares of past gradients w.r.t. parameter θi.
- ϵ — smoothing term (avoids the "divide by zero" case; usually on the order of 1e-8)
Interestingly, if we remove the square root operation, the algorithm performs far worse than with it.
In vector form, the update is
θ_(t+1) = θ_t − η / √(G_t + ϵ) ⊙ g_t
where the notation ⊙ represents element-wise matrix-vector multiplication (not full matrix multiplication).
— Removes the need for manual tuning of the learning rate.
— the usual default value of the learning rate is 0.01
Weakness — the denominator (the accumulated sum of squared gradients) keeps growing during training:
→ the learning rate shrinks
→ eventually becomes insignificantly small
→ resulting in non-convergence to the minimum (the algorithm can no longer learn)
The remedy is the Adadelta technique.
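A minimal Adagrad sketch on a two-parameter toy quadratic; the target values are made up, and a learning rate of 0.1 (larger than the usual 0.01 default) is used just to speed up this toy demo:

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    # Accumulate the sum of squared gradients, per parameter.
    G = G + grad ** 2
    # Each parameter gets its own effective learning rate lr / sqrt(G_ii).
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G

# Minimize f(theta) = ||theta - target||^2 over two parameters.
target = np.array([3.0, -1.0])
theta = np.zeros(2)
G = np.zeros(2)
for _ in range(5000):
    theta, G = adagrad_step(theta, G, 2 * (theta - target), lr=0.1)
```

Storing only the diagonal of G_t (one accumulator per parameter) is what makes this practical; note that `G` only ever grows, which is exactly the weakness described above.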
4) Adadelta (Adaptive Delta)
- an extension of Adagrad
— reduces Adagrad's aggressive, monotonically decreasing learning rate
- instead of accumulating ALL past squared gradients, it restricts the accumulation window to a fixed size w.
decaying average of past squared gradients → E[g²]_t
- The running average at time step t depends only on
— the previous average
— the current gradient:
E[g²]_t = γ · E[g²]_(t−1) + (1 − γ) · g²_t
- γ is set to around 0.9
parameter update vector → Δθ_t.
In vector notation the parameter update is
Δθ_t = −η / √(E[g²]_t + ϵ) · g_t
Here the denominator is just the RMS (root mean square) criterion of the gradient, RMS[g]_t.
Since the units of this update don't match the units of the parameter, Adadelta replaces the learning rate η with the RMS of the previous parameter updates, RMS[Δθ]_(t−1), removing the learning rate from the update rule entirely.
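A minimal Adadelta sketch on a toy quadratic; notice there is no learning rate argument at all (the ρ and ϵ values, toy function, and iteration count are illustrative):

```python
import numpy as np

def adadelta_step(theta, Eg2, Edx2, grad, rho=0.9, eps=1e-6):
    # Decaying average of past squared gradients.
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2
    # The update is scaled by RMS of past UPDATES over RMS of gradients,
    # so no global learning rate is required.
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
    # Decaying average of past squared updates.
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
    return theta + dx, Eg2, Edx2

# Minimize the toy quadratic f(theta) = (theta - 3)^2.
theta, Eg2, Edx2 = 0.0, 0.0, 0.0
for _ in range(3000):
    theta, Eg2, Edx2 = adadelta_step(theta, Eg2, Edx2, 2 * (theta - 3))
```

The RMS[Δθ]/RMS[g] ratio is the unit-matching correction described above: both numerator and denominator carry the gradient's units, so Δθ ends up in the parameter's units.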
5) RMSprop
- an as-yet unpublished algorithm.
- an adaptive learning rate method proposed by Geoffrey Hinton of the University of Toronto in his Coursera class.
- RMSprop and Adadelta were developed independently around the same time.
- RMSprop divides the LEARNING RATE by an EXPONENTIALLY DECAYING AVERAGE of squared gradients.
- γ is set to 0.9
- a good default value for the learning rate is 0.001
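A minimal RMSprop sketch on a toy quadratic, using the suggested γ = 0.9 and learning rate 0.001 (the toy function and iteration count are illustrative):

```python
import numpy as np

def rmsprop_step(theta, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients.
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    # Divide the learning rate by the RMS of recent gradients.
    theta = theta - lr * grad / (np.sqrt(Eg2) + eps)
    return theta, Eg2

# Minimize the toy quadratic f(theta) = (theta - 3)^2.
theta, Eg2 = 0.0, 0.0
for _ in range(20000):
    theta, Eg2 = rmsprop_step(theta, Eg2, 2 * (theta - 3))
```

Compare with Adagrad: `Eg2` here forgets old gradients exponentially instead of summing them forever, so the effective learning rate never vanishes.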
6) Adam (Adaptive Moment Estimation)
- computes adaptive learning rates for each parameter θ.
- Besides keeping an exponentially decaying average of past SQUARED gradients (like Adadelta and RMSprop), it also keeps an exponentially decaying average of past GRADIENTS (similar to momentum).
m_t → ESTIMATE of the first moment (the mean)
v_t → ESTIMATE of the second moment (the uncentered variance)
… of the gradients
- Hence the name Adam (Adaptive Moment Estimation).
Remember — MOMENTUM and MOMENT are different things here.
- These estimates are initialized to zeros at the beginning.
- hence biased towards zero (especially during the INITIAL TIME STEPS, and especially when the DECAY RATES are small, i.e. β1 and β2 close to 1)
- The authors counteract these biases with pre-computed corrections
(bias-corrected FIRST and SECOND moment estimates):
m̂_t = m_t / (1 − β1^t)
v̂_t = v_t / (1 − β2^t)
The update rule is then:
θ_(t+1) = θ_t − η / (√v̂_t + ϵ) · m̂_t
β1 → 0.9
β2 → 0.999
ϵ → 1e-8
… the above default values are proposed by the Adam authors.
- This is the method most favored in applications among those above.
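A minimal Adam sketch on a toy quadratic, using the authors' default hyperparameters (the toy function and iteration count are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Estimates of the first moment (mean) and second moment
    # (uncentered variance) of the gradients.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction counteracts the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize the toy quadratic f(theta) = (theta - 3)^2.
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 20001):
    theta, m, v = adam_step(theta, m, v, 2 * (theta - 3), t)
```

The time step `t` starts at 1 so the bias-correction denominators (1 − β^t) are non-zero; without the correction, the first steps would be far too small.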
Summary :-
- Momentum
- NAG(Nesterov Accelerated Gradient)
- Adagrad
- Adadelta
- RMSprop
- Adam
Rule of Thumb :-
Adam usually outperforms the rest,
followed by AdaGrad & AdaDelta.
If the data is sparse, prefer the adaptive methods; SGD, Momentum, and Nesterov perform worse there.