ADADELTA: An Adaptive Learning Rate Method

Sourav Dash
4 min read · Apr 18, 2020

It’s not the mistake that matters; it’s how you deal with it and learn from it.

The aim of many machine learning methods is to update a set of parameters in order to optimize an objective function.

general update rule
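In symbols, the plain gradient-descent update at step t can be written as:

x_{t+1} = x_t + Δx_t,   where   Δx_t = −η · g_t   and   g_t = ∂f(x_t)/∂x_t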

In the above, η represents the learning rate, which controls how large a step to take in the direction of the negative gradient. Many algorithms require this learning rate hyperparameter to be tuned manually, and choosing a good learning rate speeds up the convergence of the system in terms of the objective function.

Advantages and drawbacks of the learning rate

Methods that estimate the learning rate at each iteration either attempt to speed up learning when it is suitable or to slow it down near a local minimum. Alternatively, we can schedule the learning rate by applying some decay to it, so that it gradually anneals based on how many epochs through the data have been done. A single global learning rate is also often not ideal, since each dimension of the parameter vector can relate to the overall cost in completely different ways.
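As a purely illustrative example of such a schedule (this particular formula is my own illustration, not one used later in this post), an exponential decay shrinks the rate after every epoch by a constant factor 0 < d < 1:

η_epoch = η_0 · d^epoch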

Momentum: efficient learning with confidence

By applying some momentum we can accelerate progress along dimensions in which the gradient consistently points in the same direction, and slow progress along dimensions where the sign of the gradient keeps changing.

inducing momentum by applying the moving average
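In symbols, the moving average keeps a fraction of the previous update, with ρ acting as the momentum (decay) constant:

Δx_t = ρ · Δx_{t−1} − η · g_t,   x_{t+1} = x_t + Δx_t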

The intuition behind the above is

If I am repeatedly being asked to move in the same direction then I should probably gain some confidence and start taking bigger steps in that direction. Just as a ball gains momentum while rolling down a slope.

ADADELTA: an improvement over ADAGRAD

ADAGRAD has shown remarkably good results on large-scale learning with only first-order information.
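Its update divides a global learning rate η, per dimension, by the accumulated magnitude of all past gradients:

Δx_t = − (η / √(Σ_{τ=1..t} g_τ²)) · g_t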

Here the denominator computes the ℓ2 norm of all previous gradients on a per-dimension basis. Since the magnitudes of gradients are factored out in ADAGRAD, this method can be sensitive to the initial conditions of the parameters and the corresponding gradients. If the initial gradients are large, the learning rates will be low for the remainder of training.

Drawbacks of ADAGRAD

1) The continual accumulation of squared gradients in the denominator makes the learning rate decay throughout training, until it becomes so small that learning effectively stops.
2) The need for a manually selected global learning rate.

The intuition behind Adadelta

My learning should depend more on the local conditions than on the whole history, so that I move toward my goal based on recent observations rather than all of them.

Instead of accumulating the sum of squared gradients over all time, we restrict the window of accumulated past gradients to some fixed size w, where w < t (the number of steps so far). Since storing w previous squared gradients is inefficient, the method implements this accumulation as an exponentially decaying average of the squared gradients.
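With a decay constant ρ (playing a role similar to the momentum constant above), the running average at step t is:

E[g²]_t = ρ · E[g²]_{t−1} + (1 − ρ) · g_t²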

Algorithm
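The resulting per-dimension updates at each step t, following the original paper, are:

E[g²]_t = ρ · E[g²]_{t−1} + (1 − ρ) · g_t²
Δx_t = − (RMS[Δx]_{t−1} / RMS[g]_t) · g_t
E[Δx²]_t = ρ · E[Δx²]_{t−1} + (1 − ρ) · Δx_t²
x_{t+1} = x_t + Δx_t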

where RMS denotes the root mean square, RMS[y]_t = √(E[y²]_t + ε), and ε is a small constant used to better condition the denominator. Note that no manually tuned global learning rate η appears anywhere in these updates.

Python Implementation

The complete implementation is available in my GitHub repo.
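Independent of that repo, here is a minimal NumPy sketch of a single ADADELTA step (the function name and signature are mine; ρ = 0.95 and ε = 1e-6 are the values suggested in the paper):

```python
import numpy as np

def adadelta_step(param, grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    """One ADADELTA update for a single parameter array.

    acc_grad  -- running average of squared gradients, E[g^2]
    acc_delta -- running average of squared updates,   E[dx^2]
    """
    # Accumulate the squared gradient with exponential decay
    acc_grad = rho * acc_grad + (1.0 - rho) * grad ** 2
    # Scale the gradient by RMS[dx]_{t-1} / RMS[g]_t
    delta = -np.sqrt(acc_delta + eps) / np.sqrt(acc_grad + eps) * grad
    # Accumulate the squared update with exponential decay
    acc_delta = rho * acc_delta + (1.0 - rho) * delta ** 2
    # Apply the update
    return param + delta, acc_grad, acc_delta

# Toy usage: one step on a small weight vector with a stand-in gradient
w = np.zeros(5)
Eg2, Edx2 = np.zeros_like(w), np.zeros_like(w)
w, Eg2, Edx2 = adadelta_step(w, np.random.randn(5), Eg2, Edx2)
```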

Observations

The following observations were made on the MNIST handwritten digit recognition dataset, using 8 Keras Dense layers with 50 neurons each and the ‘relu’ activation function in the hidden layers, and softmax at the top layer. The loss is calculated through mean squared error.
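As a rough sketch of that setup (the layer count, units, activations, and MSE loss come from the description above; the preprocessing, batch size, and epoch count are assumptions of mine), the model could be built with tf.keras like this:

```python
import tensorflow as tf
from tensorflow.keras import layers

# MNIST: 28x28 grayscale digit images, 10 classes
(x_train, y_train), (x_val, y_val) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_val = x_val.reshape(-1, 784).astype("float32") / 255.0
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_val = tf.keras.utils.to_categorical(y_val, 10)

# 8 hidden Dense layers with 50 neurons each and relu, softmax on top
model = tf.keras.Sequential(
    [tf.keras.Input(shape=(784,))]
    + [layers.Dense(50, activation="relu") for _ in range(8)]
    + [layers.Dense(10, activation="softmax")]
)

# Mean squared error loss with the built-in ADADELTA optimizer
model.compile(optimizer=tf.keras.optimizers.Adadelta(),
              loss="mean_squared_error",
              metrics=["accuracy"])

history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=10, batch_size=128)
```

Swapping the optimizer for tf.keras.optimizers.Adagrad() would give the ADAGRAD side of the comparison.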

validation loss comparison between ADAGRAD and ADADELTA
validation accuracy comparison between ADAGRAD and ADADELTA
