ADADELTA: An Adaptive Learning Rate Method
It’s not the mistake that matters; it’s how you deal with it and learn from it.
The aim of many machine learning methods is to update a set of parameters in order to optimize an objective function.
In the above, η represents the learning rate, which controls how large a step to take in the direction of the negative gradient. Many algorithms require this learning-rate hyperparameter to be tuned manually, and choosing a good learning rate speeds up convergence of the objective function.
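The update being described is the plain gradient-descent step. A minimal sketch, using a toy objective f(x) = x² whose gradient is 2x (the objective and the learning-rate value here are illustrative, not from the original):

```python
def sgd_step(x, grad, lr=0.1):
    """One gradient-descent step: move against the gradient, scaled by lr."""
    return x - lr * grad

# Minimize f(x) = x^2 starting from x = 5.
x = 5.0
for _ in range(100):
    x = sgd_step(x, 2 * x, lr=0.1)
# x is now very close to the minimum at 0
```

With a well-chosen η the iterates shrink geometrically toward the minimum; too large a value would overshoot, too small a value would crawl.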
Advantages and drawbacks of the learning rate
Methods that estimate the learning rate at each iteration either attempt to speed up learning when suitable or to slow it down near a local minimum. Alternatively, we can schedule the learning rate by applying some decay to it, so that it gradually anneals based on how many epochs through the data have been completed. Moreover, a single global learning rate is often not ideal, since each dimension of the parameter vector can relate to the overall cost in a completely different way.
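The decay-based scheduling mentioned above can be sketched as a simple step-decay schedule (the schedule shape and the numbers are assumptions for illustration):

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs through the data."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

lr0 = step_decay(0.1, epoch=0)    # full rate at the start
lr1 = step_decay(0.1, epoch=10)   # halved after ten epochs
lr2 = step_decay(0.1, epoch=25)   # halved twice by epoch 25
```

Note that a schedule like this is still global: every dimension of the parameter vector anneals at the same rate, which is exactly the limitation discussed above.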
Momentum: efficient learning with confidence
By applying some momentum, we can accelerate progress along dimensions in which the gradient consistently points in the same direction, and slow progress along dimensions where the sign of the gradient keeps changing.
The intuition behind the above is:
If I am repeatedly being asked to move in the same direction then I should probably gain some confidence and start taking bigger steps in that direction. Just as a ball gains momentum while rolling down a slope.
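The "confidence" above is a velocity term. A minimal sketch on the same toy objective f(x) = x² (the lr and mu values are illustrative):

```python
def momentum_step(x, v, grad, lr=0.01, mu=0.9):
    """Momentum update: velocity accumulates along persistent gradient directions."""
    v = mu * v - lr * grad  # v gains speed while the gradient keeps the same sign
    x = x + v
    return x, v

# Minimize f(x) = x^2 (gradient 2x) starting from x = 5.
x, v = 5.0, 0.0
for _ in range(500):
    x, v = momentum_step(x, v, 2 * x)
# x oscillates toward the minimum at 0, like a ball rolling into a valley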
ADADELTA: an improvement over ADAGRAD
ADAGRAD has shown remarkably good results on large-scale learning using only first-order information.
Here the denominator computes the L2 norm of all previous gradients. Since the magnitudes of the gradients are factored out in ADAGRAD, the method can be sensitive to the initial conditions of the parameters and the corresponding gradients. If the initial gradients are large, the learning rates will be low for the remainder of training.
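A minimal sketch of the ADAGRAD update on a toy objective f(x) = ‖x‖² (the lr value is illustrative), showing the ever-growing denominator:

```python
import numpy as np

def adagrad_step(x, accum, grad, lr=0.5, eps=1e-8):
    """ADAGRAD: per-dimension step scaled by the root of all past squared gradients."""
    accum = accum + grad ** 2                    # squared gradients accumulate forever
    x = x - lr * grad / (np.sqrt(accum) + eps)   # effective rate only ever shrinks
    return x, accum

x = np.array([5.0, -3.0])       # each dimension adapts independently
accum = np.zeros_like(x)
for _ in range(500):
    x, accum = adagrad_step(x, accum, 2 * x)     # gradient of f(x) = ||x||^2
```

Because `accum` never decays, a burst of large gradients early on permanently depresses the step size, which is exactly the sensitivity described above.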
Drawbacks of ADAGRAD
1) The learning rates decay throughout training due to the continual accumulation of squared gradients in the denominator, which eventually shrinks the updates toward zero.
2) A global learning rate still has to be selected manually.
The intuition behind Adadelta
My learning should depend more on local conditions than on the entire history, so that I move toward my goal based on recent observations rather than all of them.
Instead of accumulating the sum of squared gradients over all time, we restrict the window of past gradients that are accumulated to some fixed size w, where w < t. Since storing w previous squared gradients is inefficient, the method implements this accumulation as an exponentially decaying average of the squared gradients.
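The decaying average is E[g²]_t = ρ·E[g²]_{t−1} + (1 − ρ)·g²_t, which needs only one stored value per parameter. A small sketch (the ρ value and the gradient sequence are illustrative) showing how old gradients fade away:

```python
def update_avg(avg, g, rho=0.95):
    """E[g^2]_t = rho * E[g^2]_{t-1} + (1 - rho) * g_t^2."""
    return rho * avg + (1 - rho) * g ** 2

avg = 0.0
for _ in range(20):              # a burst of large gradients early in training
    avg = update_avg(avg, 10.0)
for _ in range(60):              # then the gradient dies down
    avg = update_avg(avg, 0.0)
# avg has mostly forgotten the early burst, whereas ADAGRAD's running sum
# would still be holding all 2000 units of accumulated squared gradient
```

This is what makes the method depend on "recent observations rather than all": the window is effectively of size ~1/(1 − ρ) without storing any history.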
Algorithm
where RMS denotes the root mean square of the gradient, and ε is a small constant added for better numerical conditioning.
Python Implementation
The complete implementation is available in my GitHub repo.
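As a companion to the full code, here is a minimal sketch of a single ADADELTA update on a toy objective f(x) = x² (the ρ and ε values are illustrative; no global learning rate appears anywhere):

```python
import numpy as np

def adadelta_step(x, eg2, ed2, grad, rho=0.95, eps=1e-6):
    """One ADADELTA update: step = -(RMS of past updates / RMS of past gradients) * g."""
    eg2 = rho * eg2 + (1 - rho) * grad ** 2                    # decaying E[g^2]
    delta = -(np.sqrt(ed2 + eps) / np.sqrt(eg2 + eps)) * grad  # -RMS[dx]/RMS[g] * g
    ed2 = rho * ed2 + (1 - rho) * delta ** 2                   # decaying E[dx^2]
    return x + delta, eg2, ed2

x = np.array([5.0])
eg2, ed2 = np.zeros_like(x), np.zeros_like(x)
for _ in range(2000):
    x, eg2, ed2 = adadelta_step(x, eg2, ed2, 2 * x)   # gradient of f(x) = x^2
```

Note the order of operations: the numerator uses RMS[Δx] from the *previous* step, since the current Δx is not known until after the update is computed.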
Observations
The following observations were made on the MNIST handwritten digit recognition dataset, using 8 Keras Dense layers, each with 50 neurons and the 'relu' activation function in the hidden layers and softmax at the output layer. Loss is calculated through mean squared error.