Optimizers in Machine Learning
The optimizer is a crucial element in the learning process of an ML model. PyTorch itself has 13 optimizers, which makes picking the right one for a problem challenging and overwhelming.
In this tutorial, I will go through the five most popular optimizers, explaining their strengths and limits along with the math behind them. So, let’s get into it!
What is optimization?
The ultimate goal of an ML model is to reach the minimum of the loss function. After we pass the input through the model, we calculate the error and update the weights accordingly. This is where the optimizer comes into play. It defines how to tweak the parameters to get closer to the minimum.
So essentially, optimization is the process of finding the model parameters that make the error function as small as possible.
Vanilla Gradient Descent
There are three different variants of Gradient Descent in Machine Learning:
- Stochastic Gradient Descent (SGD): computes the gradient for a single randomly chosen sample
- Mini-Batch Gradient Descent: computes the gradient over a randomly sampled batch
- Batch Gradient Descent: computes the gradient over the entire dataset
As you might expect, updating the weights after every single sample makes the training noisy and unstable. On the other hand, passing the whole dataset at once is slow, or even impossible with larger datasets like ImageNet.
Mini-Batch GD is a compromise between the two and is currently the go-to algorithm for training Deep Learning models, mainly because it makes good use of the GPU and makes the training more stable.
Nowadays, the term SGD usually refers to Mini-Batch Gradient Descent, so we will stick to that convention for the rest of the blog.
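To make the mini-batch idea concrete, here is a minimal sketch in plain Python that fits a line y ≈ w·x + b with mini-batch gradient descent on the mean squared error. The function name `minibatch_sgd`, the toy data, and the hyperparameters (learning rate 0.1, batch size 16, 50 epochs) are my own assumptions for illustration, not values from any library.

```python
import random

def minibatch_sgd(xs, ys, lr=0.1, batch_size=16, epochs=50, seed=0):
    """Fit y ~ w*x + b by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # draw new random batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the MSE, averaged over this batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            # Step against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data generated from the line y = 3x + 1 (hypothetical example).
data_rng = random.Random(42)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(200)]
ys = [3.0 * x + 1.0 for x in xs]

w, b = minibatch_sgd(xs, ys)
print(w, b)  # should end up close to 3 and 1
```

Setting `batch_size=1` turns this into per-sample SGD, and `batch_size=len(xs)` turns it into Batch Gradient Descent, so the three variants differ only in how many samples feed each update.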
Pros:
- converges reliably on convex problems when the learning rate is chosen well
- easy to compute
Cons:
- easily gets stuck in local minima or saddle points
- sensitive to the learning rate
SGD is a basic optimization algorithm that dates back to the 50s. It is straightforward and easy to compute, but it faces significant challenges, especially with more complex models.
Gradient descent stops moving wherever the slope is 0. In a convex function (one minimum), that point is the global minimum, but most deep learning models are non-convex (multiple local minima and saddle points). In this case, we can get stuck at one of those points and might never reach the global minimum.
Surprisingly, in more complex deep learning models, local minima occur less often than saddle points.
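A small experiment shows how plain gradient descent can end up in different minima depending on where it starts. The function f(x) = x⁴ − 3x² + x is a toy example I picked for illustration; it has a shallow local minimum near x ≈ 1.13 and a deeper global minimum near x ≈ −1.30.

```python
def f(x):
    # Toy non-convex function with two minima (hypothetical example).
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=1000):
    """Plain gradient descent from the starting point x."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

from_right = gradient_descent(2.0)   # slides into the local minimum
from_left = gradient_descent(-2.0)   # slides into the global minimum

print(from_right, f(from_right))
print(from_left, f(from_left))
```

Both runs stop at a point where the slope is 0, but only the run started on the left finds the lower of the two valleys.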
Before we move to the fancier techniques to combat this problem, I will introduce the concept that I think is the key to understanding all the other optimizers.
Exponential Moving Average
As you can see, the EMA smooths the graph and reduces the oscillations. The parameter β sets the balance between the new point and the running weighted average.
As you can see in the example above, each older term is weighted by a higher power of β. Since β < 1, the significance of old terms decays exponentially, and the most recent points count the most.
Essentially, the EMA reduces the wiggling and creates an average trajectory. That is what we want for our optimizer!
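Here is a minimal sketch of the EMA, assuming the common convention v_t = β·v_{t−1} + (1 − β)·x_t with v_0 = 0. The function name `ema` and the zig-zag test signal are my own choices for illustration.

```python
def ema(values, beta=0.9):
    """Exponential moving average: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0
    smoothed = []
    for x in values:
        # Keep a fraction beta of the old average, blend in (1 - beta) of the new point.
        v = beta * v + (1 - beta) * x
        smoothed.append(v)
    return smoothed

# A signal that flips between +1 and -1 on every step.
zigzag = [1.0, -1.0] * 10
smoothed = ema(zigzag)

print(max(abs(x) for x in zigzag))    # the raw signal swings by 1.0
print(max(abs(v) for v in smoothed))  # the EMA stays within about 0.1
```

Unrolling the recursion gives v_t = (1 − β)·(x_t + β·x_{t−1} + β²·x_{t−2} + …), which is where the exponentially shrinking weights of the older points come from.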
Momentum
Pros:
- helps to escape saddle points and local minima
- converges faster
- reduces oscillations
Cons:
- the same learning rate for all parameters
A popular way to describe momentum goes like this:
SGD is a man walking downhill, slow but steady. Momentum is a heavy ball rolling downhill, smooth and fast.
Momentum uses the EMA to damp the oscillations in directions where the gradient keeps changing sign, and to build up speed in directions where the gradient points steadily. It helps us “roll over” local minima and plateaus and continue towards the global minimum.
With the gained momentum, the optimizer moves more steadily towards the minimum. Nevertheless, it still overshoots in the directions it is moving. This challenge was addressed by the AdaGrad optimizer.
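The momentum update can be sketched as follows, assuming the EMA form v ← β·v + (1 − β)·g followed by θ ← θ − lr·v. The function names, the elongated bowl f(x, y) = 0.5·(x² + 20·y²), and the hyperparameters are my own assumptions for illustration.

```python
def sgd_momentum(grad_fn, params, lr=0.1, beta=0.9, steps=500):
    """Gradient descent where each step follows an EMA of past gradients."""
    params = list(params)
    velocity = [0.0] * len(params)
    for _ in range(steps):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            # EMA of the gradient: steady directions add up, flipping ones cancel out.
            velocity[i] = beta * velocity[i] + (1 - beta) * g
            params[i] -= lr * velocity[i]
    return params

def ravine_grad(p):
    # Gradient of f(x, y) = 0.5 * (x**2 + 20 * y**2), a bowl that is much steeper in y.
    x, y = p
    return [x, 20.0 * y]

x, y = sgd_momentum(ravine_grad, [5.0, 1.0])
print(x, y)  # both coordinates end up very close to the minimum at (0, 0)
```

Note that some libraries define the velocity without the (1 − β) factor. PyTorch’s `torch.optim.SGD` with `momentum=0.9`, for example, accumulates v ← momentum·v + g by default, so the same β behaves like a larger effective learning rate there.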
AdaGrad
Pros:
- adapts the learning rate to each parameter
- less manual tuning of the learning rate
Cons:
- the learning rate shrinks towards zero over time
AdaGrad was one of the first algorithms to use a separate, adaptive learning rate for each model parameter. Let me show you why it’s important:
If the learning rate is too high for a large gradient, we overshoot and bounce around. If the learning rate is too low, the learning is slow and might never converge.
AdaGrad uses the sum of the squared previous gradients to combat this issue: each parameter’s step is divided by the square root of its own sum. If a parameter has seen large gradients, its learning rate is reduced a lot. If it has seen small or rare gradients, its learning rate stays comparatively high. In this way, the algorithm adapts the size of the steps along all dimensions.
AdaGrad is mainly used with sparse data, where the infrequent features get much larger updates than the frequent ones.
It’s a brilliant solution, but since we keep accumulating the squared gradients, the learning rate decreases with each iteration and eventually shrinks towards zero. Then the learning process might stop before we reach convergence.
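A minimal sketch of the AdaGrad update, assuming the usual form G ← G + g² and θ ← θ − lr·g / (√G + ε). The function name `adagrad_step` and the two synthetic gradient streams (one "frequent" feature that gets a gradient on every step, one "rare" feature that gets one every tenth step) are made up for illustration.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update, in place; accum holds the running sum of squared gradients."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # the sum only ever grows
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

params = [0.0, 0.0]
accum = [0.0, 0.0]
for step in range(100):
    frequent = 1.0                          # gradient on every step
    rare = 1.0 if step % 10 == 0 else 0.0   # gradient on every tenth step
    adagrad_step(params, [frequent, rare], accum)

# Learning rate each parameter would effectively get on its next update.
effective_lr = [0.1 / (math.sqrt(a) + 1e-8) for a in accum]
print(accum)         # [100.0, 10.0]
print(effective_lr)  # roughly [0.01, 0.0316]
```

The rare feature keeps a roughly three times larger effective learning rate, which is why AdaGrad suits sparse data. The same numbers also show the weakness: both rates only ever go down as training continues.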
RMSProp
Pros:
- the learning rate doesn’t vanish
- adaptive learning rate for each parameter
RMSProp is an upgraded version of AdaGrad that leverages the mighty EMA (again). Instead of summing up all the squared gradients, we keep an exponential moving average of them, which controls how much past information is retained. Thus the denominator doesn’t keep growing, and the learning rate doesn’t vanish!
RMSProp is still used in Reinforcement Learning, where in some cases it is actually more stable than Adam.
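The difference from AdaGrad is a single line. Below is a minimal sketch, assuming the usual form s ← β·s + (1 − β)·g² and θ ← θ − lr·g / (√s + ε). The function name `rmsprop_step` and the constant gradient of 1.0 are my own choices, used only to compare the two accumulators side by side.

```python
import math

def rmsprop_step(param, grad, avg_sq, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update; avg_sq is an EMA of the squared gradients."""
    avg_sq = beta * avg_sq + (1 - beta) * grad * grad
    param -= lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq

param, avg_sq, grad_sum_sq = 0.0, 0.0, 0.0
for _ in range(1000):
    grad = 1.0  # constant gradient, to isolate the effect of the accumulator
    param, avg_sq = rmsprop_step(param, grad, avg_sq)
    grad_sum_sq += grad * grad  # what AdaGrad would have accumulated instead

rmsprop_lr = 0.01 / (math.sqrt(avg_sq) + 1e-8)
adagrad_lr = 0.01 / (math.sqrt(grad_sum_sq) + 1e-8)
print(rmsprop_lr)  # stays near the base learning rate of 0.01
print(adagrad_lr)  # has shrunk to roughly 0.0003
```

Because the EMA forgets old gradients, the denominator settles around the recent gradient magnitude instead of growing forever.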
Adam
Pros:
- adaptive learning rate
Essentially, Adam is a combination of Momentum and RMSProp. It has reduced oscillations, a smoother path, and adaptive learning rate capabilities. Combining those abilities makes it the most versatile optimizer of the five and a good fit for many different problems.
A good starting configuration is a learning rate of 0.0001, a momentum coefficient (β1) of 0.9, and a squared-gradient coefficient (β2) of 0.999. For reference, PyTorch’s own default learning rate for Adam is 0.001.
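Putting the two EMAs together gives the following minimal sketch of Adam on a single parameter. It includes the bias-correction terms from the original Adam paper, which this post does not cover. The toy loss f(x) = (x − 3)², the learning rate of 0.01, and the step count are my own assumptions, chosen so that this tiny problem converges quickly.

```python
import math

def adam(grad_fn, x, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=3000):
    """Minimize a one-dimensional function with Adam, starting from x."""
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g      # EMA of gradients (the Momentum part)
        v = beta2 * v + (1 - beta2) * g * g  # EMA of squared gradients (the RMSProp part)
        m_hat = m / (1 - beta1**t)           # bias correction for the zero initialization
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

def grad(x):
    # Gradient of the toy loss f(x) = (x - 3)**2.
    return 2 * (x - 3.0)

first_step = adam(grad, 0.0, steps=1)
x_final = adam(grad, 0.0)
print(first_step)  # about 0.01: the first step has size lr, whatever the gradient scale
print(x_final)     # close to the minimum at x = 3
```

In PyTorch, the starting configuration mentioned above would be written as `torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))`, where `model` stands for whatever network you are training.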
This graphic perfectly sums up the pros and cons of each algorithm.
Pure SGD gets stuck in a local minimum.
Momentum overshoots in both directions but finds its way to the global minimum.
RMSProp moves smoothly along all dimensions.
And finally, Adam overshoots a bit but moves steadily and is the fastest of them all (you have to believe me on this one).
Check out this website if you want to play with these visualizations and try them on different functions.
In this blog, we went through the five most popular optimizers in Deep Learning. Even though most of them are rarely used these days, analyzing the challenges they faced is crucial to deeply understand and appreciate Adam.
After reading this blog, I hope you have the intuition behind the different optimization algorithms and that it helps you explore the topic even further.
 Why sparse features should have bigger learning rates associated? And how Adagrad achieves this?
 An overview of gradient descent optimization algorithms