Strengths and Weaknesses of Optimization Algorithms Used for Machine Learning

Anand Puntambekar · Published in The Startup · Aug 31, 2020

A deep-dive into Gradient Descent and other optimization algorithms

Optimization algorithms for machine learning are often used as a black box. We will study some popular algorithms and try to understand the circumstances under which each of them performs best.

The purpose of this blog is to:

  • Understand Gradient descent and its variants, such as Momentum, Adagrad, RMSProp, NAG, and Adam;
  • Introduce techniques to improve the performance of Gradient descent; and
  • Summarize the advantages and disadvantages of various optimization techniques.

Hopefully, by the end of this read, you will be equipped with an intuition for how the different algorithms behave and for when to use them.

Gradient descent

Gradient descent is a way to minimize an objective function J(w) by updating the parameters w in the direction opposite to the gradient of the objective function, i.e. w ← w − η ∇J(w). The learning rate η determines the size of the steps we take toward a minimum of the objective function.
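As a concrete illustration, here is a minimal NumPy sketch of this update rule; the toy quadratic objective, learning rate, and iteration count are illustrative choices, not part of the original post.

import numpy as np

# Toy objective J(w) = ||w||^2 / 2, whose gradient is simply w.
def grad_J(w):
    return w

eta = 0.1                      # learning rate
w = np.array([4.0, -3.0])      # initial parameters

for step in range(100):
    # Move in the direction opposite to the gradient.
    w = w - eta * grad_J(w)

print(w)  # approaches the minimum at [0, 0]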

Advantages and Disadvantages of Gradient Descent:

Gradient descent is simple to implement, and on smooth convex objectives it converges reliably because every update follows the exact gradient over the full data set. On the other hand, each update is expensive for large data sets, a single learning rate is shared by all parameters, and that learning rate must be tuned by hand.

Overall Drawbacks of Gradient Descent:

On non-convex error surfaces, gradient descent can get trapped in local minima or stall at saddle points, i.e. points where the gradient is close to zero but which are neither minima nor maxima. It is also sensitive to the choice of learning rate: too small and convergence is painfully slow, too large and the updates overshoot or diverge.

Techniques to improve the performance of Gradient Descent:

Common remedies include scaling the input features, using mini-batches (stochastic gradient descent), annealing the learning rate over time, and the momentum-based and adaptive-learning-rate variants discussed below.

Variations of Gradient Descent

  • Momentum
  • Nesterov Accelerated Gradient
  • Adagrad
  • Adadelta/RMSProp
  • Adam

Momentum

SGD (Stochastic Gradient Descent) has trouble navigating ravines, i.e. areas where the surface of the objective function curves much more steeply in one dimension than in another.

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction γ of the previous update vector to the current update. The choice of γ follows the principle of exponentially weighted averages, so the accumulated direction is not drastically impacted by any single data point.

The momentum term increases in dimensions whose gradient continues in the same direction, and reduces updates for dimensions whose gradients change directions. As a result, we get faster convergence and reduced oscillation.
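Below is a minimal sketch of the momentum update, assuming the common formulation v ← γv + η∇J(w), w ← w − v; the elongated toy objective and hyperparameter values are illustrative.

import numpy as np

# Elongated quadratic "ravine": steep along x, shallow along y.
def grad_J(w):
    return np.array([10.0 * w[0], 0.1 * w[1]])

eta, gamma = 0.05, 0.9         # learning rate and momentum coefficient
w = np.array([1.0, 10.0])
v = np.zeros_like(w)           # velocity: exponentially weighted history of updates

for step in range(200):
    v = gamma * v + eta * grad_J(w)   # keep a fraction of the previous update
    w = w - v                         # move with the accumulated velocity

print(w)  # converges toward [0, 0] with damped oscillation along the steep axis

With γ = 0 this reduces to plain gradient descent; larger γ gives the update more memory of past gradients.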

Nesterov accelerated gradient (NAG)

Nesterov accelerated gradient (NAG) is a way to give momentum more precision.

We still use momentum to move the parameters. However, applying the momentum step w − γv first gives us an approximation of the parameters' next position. We can then effectively look ahead by calculating the gradient not w.r.t. our current parameters but w.r.t. this approximate future value of the parameters.

The look-ahead lets the momentum term make appropriate corrections instead of overreaching:

  • If the slope ahead is steeper, the update accelerates.
  • If the slope ahead is shallower, the update slows down, which prevents overshooting.
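A minimal sketch of this look-ahead step, assuming the usual NAG formulation in which the gradient is evaluated at w − γv; the toy objective and hyperparameters are illustrative.

import numpy as np

# Gradient of the toy objective J(w) = ||w||^2 / 2.
def grad_J(w):
    return w

eta, gamma = 0.1, 0.9
w = np.array([4.0, -3.0])
v = np.zeros_like(w)

for step in range(200):
    lookahead = w - gamma * v                 # approximate future position of the parameters
    v = gamma * v + eta * grad_J(lookahead)   # gradient evaluated at the look-ahead point
    w = w - v

print(w)  # approaches the minimum at [0, 0]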

Adagrad

The motivation behind Adagrad is to have a different learning rate for every parameter (every weight of every layer) at every iteration.

But why do we need different learning rates?

Data sets have two types of features:

  • Dense features, e.g. a house-price data set (a large number of non-zero feature values), for which we should perform smaller updates; and
  • Sparse features, e.g. a bag of words (a large number of zero feature values), for which we should perform larger updates.

Adagrad has been found to greatly improve the robustness of SGD, and it has been used for training large-scale neural nets at Google.

  • η : initial learning rate
  • ϵ : smoothing term that avoids division by zero
  • w : parameter (weight) being updated
  • α : per-parameter sum of the squared gradients accumulated so far

Adagrad divides the learning rate for each parameter by √(α + ϵ). As the number of iterations increases, α increases, and as α increases, the effective learning rate at each time step decreases.
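Putting these pieces together, here is a minimal NumPy sketch of the Adagrad update under that formulation; the toy objective and iteration count are illustrative choices.

import numpy as np

# Gradient of the toy objective J(w) = ||w||^2 / 2.
def grad_J(w):
    return w

eta, eps = 0.01, 1e-8          # initial learning rate and smoothing term
w = np.array([4.0, -3.0])
alpha = np.zeros_like(w)       # per-parameter sum of squared gradients

for step in range(1000):
    g = grad_J(w)
    alpha = alpha + g ** 2                     # alpha only ever grows
    w = w - eta / np.sqrt(alpha + eps) * g     # per-parameter effective learning rate

print(w)  # the effective step size shrinks as alpha accumulates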

One of Adagrad’s main benefits is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 and leave it at that.

Adagrad’s weakness is its accumulation of the squared gradients in the denominator. Since every added term is positive, the accumulated sum keeps growing during training. This in turn causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm is no longer able to acquire additional knowledge. The next optimization algorithms aim to resolve this flaw.

Adadelta/RMSProp

Adadelta and RMSProp are very similar in their implementation; both overcome the shortcoming of Adagrad's ever-growing accumulated gradient by using an exponentially weighted moving average of the squared gradients.

Instead of inefficiently storing all previous squared gradients, we recursively define a decaying average of past squared gradients. The running average at each time step then depends (as a fraction γ, similarly to the momentum term) only on the previous average and the current squared gradient.

Because no single new data point can dramatically change this average, the parameters converge toward a minimum along a smoother and shorter path.
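A minimal sketch of the RMSProp version, assuming the common formulation with a decaying average E[g²] ← γE[g²] + (1 − γ)g²; the toy objective and hyperparameter values are illustrative.

import numpy as np

# Gradient of the toy objective J(w) = ||w||^2 / 2.
def grad_J(w):
    return w

eta, gamma, eps = 0.01, 0.9, 1e-8
w = np.array([4.0, -3.0])
avg_sq = np.zeros_like(w)      # decaying average of past squared gradients

for step in range(1000):
    g = grad_J(w)
    avg_sq = gamma * avg_sq + (1 - gamma) * g ** 2   # forget old gradients exponentially
    w = w - eta / np.sqrt(avg_sq + eps) * g          # learning rate no longer decays to zero

print(w)  # approaches the minimum at [0, 0]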

Adam

The Adam optimizer is highly versatile and works in a large number of scenarios. It takes the best of RMSProp and momentum and combines them.

Whereas momentum can be seen as a ball running down a slope, Adam behaves like a heavy ball with friction, which thus prefers flat minima in the error surface.

As one might anticipate, Adam accelerates and decelerates thanks to its momentum component, while keeping a per-parameter adaptive learning rate thanks to its RMSProp component.

The suggested default values are 0.9 for β1, 0.999 for β2, and 10^(-8) for ϵ. Empirically, Adam works well in practice and compares favorably to other adaptive learning-rate methods.
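A minimal sketch of Adam with the default hyperparameters mentioned above, including the usual bias-correction terms; the toy objective and iteration count are illustrative choices.

import numpy as np

# Gradient of the toy objective J(w) = ||w||^2 / 2.
def grad_J(w):
    return w

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w = np.array([4.0, -3.0])
m = np.zeros_like(w)           # first moment (momentum-like term)
v = np.zeros_like(w)           # second moment (RMSProp-like term)

for t in range(1, 5001):
    g = grad_J(w)
    m = beta1 * m + (1 - beta1) * g          # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)             # bias correction for the second moment
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # approaches the minimum at [0, 0]

The bias-correction terms matter early in training, when m and v are still close to their zero initialization and would otherwise underestimate the true moments.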

Summary of Optimization Algorithms discussed

  • Gradient descent: simple and stable on convex objectives, but slow on large data sets, uses one learning rate for all parameters, and can stall at local minima and saddle points.
  • Momentum: accelerates SGD along consistent gradient directions and dampens oscillations, but adds a hyperparameter γ and can overshoot minima.
  • Nesterov Accelerated Gradient: looks ahead before updating, correcting momentum's tendency to overshoot.
  • Adagrad: adapts the learning rate per parameter and removes manual tuning, but its accumulated squared gradients eventually shrink the learning rate to nearly zero.
  • Adadelta/RMSProp: fix Adagrad's vanishing learning rate by using an exponentially weighted average of squared gradients.
  • Adam: combines momentum and RMSProp, and works well in practice with the default hyperparameters.

Happy Coding !!
