Popular Machine Learning Optimization Algorithms

Taffy Das · Published in Nerd For Tech · Aug 26, 2022
Photo by Conny Schneider on Unsplash

Optimization in machine learning is the iterative process of finding model parameters that produce the right predictions for a given set of inputs. On each iteration, the goal is to reduce the error between your predicted values and the actual values, also known as the ground truth. Deep learning offers several implementations of this idea. A good loss optimization method should converge to a global minimum where possible and avoid traps such as plateaus, saddle points and local minima (if any). Optimization algorithms are required to update the model weights after a forward pass, and Gradient Descent is the most popular example of such optimization. Variants of Gradient Descent include Momentum and AdaMax, among others. Some gradient descent methods are:

A saddle point in a function space, which can cause problems for optimization because the gradient vanishes even though the point is not a minimum. Nicoguaro, CC BY 3.0, via Wikimedia Commons

Vanilla Gradient Descent

Vanilla gradient descent works well for simple convex functions, but it tends to get stuck around plateaus or local minima, where the gradient no longer decreases in any direction. The descent is also not always smooth and can move in a zigzag pattern, which slows convergence. Other implementations help circumvent some of these issues.

Formula:

  • delta = -learning_rate * gradient
  • theta += delta
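
To make the update rule concrete, here is a minimal Python sketch of vanilla gradient descent on a made-up one-dimensional loss f(theta) = (theta - 3)²; the loss function and the hyperparameter values are illustrative only, not from the article.

# Vanilla gradient descent on an illustrative quadratic loss f(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). Hyperparameters are example values.
learning_rate = 0.1
theta = 0.0                        # initial parameter value

for step in range(100):
    gradient = 2 * (theta - 3)     # gradient of the example loss
    delta = -learning_rate * gradient
    theta += delta

print(theta)                       # approaches the minimum at theta = 3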

Momentum

Momentum is a modification of vanilla gradient descent in which some momentum is added to each step. This momentum helps the descent get past certain plateaus and local minima. It works by adding the movement of the previous step to the current calculated step. The idea is akin to rolling a ball down a bowl: instead of stopping at the bottom immediately, the ball rolls past the centre on its initial momentum and gradually settles at the lowest point. Momentum also reduces the oscillations of gradient descent and smoothens the movement towards the minimum. A decay rate multiplied by the previous step controls how much of that step is carried into the current one. A decay rate of 0 behaves like vanilla gradient descent with a lot of friction, while a decay rate of 1 always adds the full previous step, giving a never-ending movement with no friction. A value of 0.8 to 0.9 is typically used, which gives the movement some friction yet provides enough momentum to step faster and still gradually settle.

Formula:

  • sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate
  • delta = -learning_rate * sum_of_gradient
  • theta += delta

Another version of the momentum formula scales the current gradient by (1 - decay_rate):

  • sum_of_gradient = (1 - decay_rate) * gradient + previous_sum_of_gradient * decay_rate

Whichever version is chosen, an appropriate learning rate hyperparameter still has to be selected so that the descent moves towards the minimum smoothly.
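
Here is a minimal Python sketch of the momentum update, reusing the same illustrative quadratic loss as in the vanilla example; decay_rate and learning_rate values are example choices.

# Momentum update on the illustrative loss f(theta) = (theta - 3)^2.
learning_rate = 0.1
decay_rate = 0.9                   # how much of the previous step to carry over
theta = 0.0
sum_of_gradient = 0.0              # running "velocity" term

for step in range(100):
    gradient = 2 * (theta - 3)
    sum_of_gradient = gradient + sum_of_gradient * decay_rate
    delta = -learning_rate * sum_of_gradient
    theta += delta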

AdaGrad (Adaptive Gradient)

The issue with momentum is that it may easily overshoot the minimum, or it may follow the steepest path first and neglect other possible directions, even though the steepest path is not necessarily the fastest path to the minimum. AdaGrad penalizes repeated descent in the same direction by keeping a running sum of squared gradients. The more a feature's gradient has been explored, the higher its squared sum and, ultimately, the smaller its step. This is especially useful for sparse features that would otherwise only take small steps. Each parameter effectively gets its own learning rate, so the algorithm does not rely as heavily on the global learning rate hyperparameter. AdaGrad's disadvantage is that descent eventually slows to a negligible pace because the sum of squared gradients grows quickly.

Formula:

  • sum_of_gradient_squared = previous_sum_of_gradient_squared + gradient²
  • delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
  • theta += delta
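
Below is a minimal Python sketch of an AdaGrad step on the same illustrative loss. The small epsilon added to the denominator is a common practical safeguard against division by zero and is an addition to the formula above.

# AdaGrad on the illustrative loss f(theta) = (theta - 3)^2.
learning_rate = 1.0
epsilon = 1e-8                     # numerical-stability term (not in the formula above)
theta = 0.0
sum_of_gradient_squared = 0.0

for step in range(100):
    gradient = 2 * (theta - 3)
    sum_of_gradient_squared += gradient ** 2
    delta = -learning_rate * gradient / (sum_of_gradient_squared ** 0.5 + epsilon)
    theta += delta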

RMSProp (Root Mean Squared Propagation)

RMSProp is an improved version of AdaGrad that adds a decay factor to the running sum of squared gradients. It prioritizes the most recent gradients while exponentially reducing the influence of past gradients. Because the accumulated sum no longer grows without bound, the step size does not shrink to nothing, so convergence does not stall the way it can with AdaGrad. For example, a decay rate of 0.9 weights the current squared gradient by (1 - 0.9) = 0.1 and the accumulated history by 0.9, so old gradients fade away instead of piling up as in AdaGrad.

Formula:

  • sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)
  • delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
  • theta += delta
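
A minimal Python sketch of the RMSProp update on the same illustrative loss, again with an epsilon term added for numerical stability.

# RMSProp on the illustrative loss f(theta) = (theta - 3)^2.
learning_rate = 0.01
decay_rate = 0.9
epsilon = 1e-8
theta = 0.0
sum_of_gradient_squared = 0.0

for step in range(1000):
    gradient = 2 * (theta - 3)
    sum_of_gradient_squared = (sum_of_gradient_squared * decay_rate
                               + gradient ** 2 * (1 - decay_rate))
    delta = -learning_rate * gradient / (sum_of_gradient_squared ** 0.5 + epsilon)
    theta += delta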

Adam (Adaptive Moment Estimation)

The Adam optimizer combines the best parts of the previous methods and is currently the most popular choice. It brings together Momentum (for speed) and RMSProp (per-parameter adaptive steps that help with sparse features). A decay parameter, beta1, is applied to the running sum of gradients (the first moment) and is usually set to about 0.9. A second decay parameter, beta2, is applied to the running sum of squared gradients (the second moment) and is usually set to about 0.999.

Formula:

  • sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1) [Momentum]
  • sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2) [RMSProp]
  • delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
  • theta += delta
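
A minimal Python sketch of the Adam update as written above. Note that the published Adam algorithm also applies bias-correction terms to both moments; those are omitted here to stay close to the formula in this article, and an epsilon term is added for numerical stability.

# Adam on the illustrative loss f(theta) = (theta - 3)^2 (no bias correction).
learning_rate = 0.01
beta1, beta2 = 0.9, 0.999
epsilon = 1e-8
theta = 0.0
sum_of_gradient = 0.0              # first moment (Momentum term)
sum_of_gradient_squared = 0.0      # second moment (RMSProp term)

for step in range(1000):
    gradient = 2 * (theta - 3)
    sum_of_gradient = sum_of_gradient * beta1 + gradient * (1 - beta1)
    sum_of_gradient_squared = sum_of_gradient_squared * beta2 + gradient ** 2 * (1 - beta2)
    delta = -learning_rate * sum_of_gradient / (sum_of_gradient_squared ** 0.5 + epsilon)
    theta += delta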

Conclusion

I hope this article provided some useful explanations to help you find the right optimization technique for your machine learning model. Please leave your thoughts and questions in the comments section. Thank you!

Check out more exciting content on new AI updates and their intersection with our daily lives: https://www.youtube.com/channel/UCsZRCvdmMPES2b-wyFsDMiA