Optimization Algorithms: From SGD to Adam

Nov 15, 2023

When training or fine-tuning large language models, the choice of optimization algorithm is crucial. Even with the same dataset and model architecture, using different optimization algorithms can lead to completely different training results.

Gradient descent is one of the most widely used optimization algorithms in neural networks. To overcome the limitations of naive gradient descent, researchers have developed a series of variants, evolving from the original SGD (Stochastic Gradient Descent) to Adam and its variants.

This article aims to summarize the development history of deep learning optimization algorithms and provide an analysis and comparison of these typical algorithms.

SGD

Stochastic Gradient Descent (SGD) is the most classic optimization algorithm. It minimizes the loss function by moving each parameter in the direction opposite to its gradient.

The update process is as follows:

θ_{t+1} = θ_t - η · g_t

Where:

  • θ denotes the model parameters and η is the learning rate.
  • g_t = ∇_θ J(θ_t; x^(i), y^(i)) is the gradient of the loss J computed on a single randomly chosen training sample (x^(i), y^(i)).

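To make this concrete, here is a minimal NumPy sketch of a single SGD step; the names sgd_step, params, grads, and lr are illustrative, not taken from any particular library.

```python
import numpy as np

def sgd_step(params: np.ndarray, grads: np.ndarray, lr: float = 0.01) -> np.ndarray:
    """One SGD update: move the parameters against the gradient of a single sample."""
    return params - lr * grads

# Example: one step on a 3-parameter model.
params = np.array([0.5, -0.2, 1.0])
grads = np.array([0.1, -0.3, 0.05])   # gradient of the loss on one sample
params = sgd_step(params, grads, lr=0.1)
```
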
Advantages:

  • Updates the model parameters with only one sample (or a small mini-batch) at a time, so each step is fast.
  • The sampling noise helps the optimization direction jump out of the current local minimum toward a better one, so for non-convex functions it can eventually converge to a better local minimum or even the global minimum.

Disadvantages:

  • SGD is prone to converging to local optima and may get trapped at saddle points in certain cases.
  • It is difficult to choose a suitable learning rate: too low a learning rate leads to slow convergence, while too high a learning rate causes excessive fluctuation around the minimum.
  • All parameters use the same learning rate.

Momentum

SGD tends to oscillate when encountering ravines. To address this, momentum can be introduced to accelerate the descent of SGD in the correct direction and dampen oscillations. This is similar to the concept of momentum in physics: when we roll a ball down a hill without any resistance, its momentum keeps increasing. However, if there is resistance, the speed will decrease.

The idea of SGD with momentum is that, when updating parameters, the previous update direction is partially retained, while the gradients of the current batch fine-tune the final update direction. In other words, momentum accumulated from past steps is used to accelerate the current step: the parameter update direction is determined not only by the current gradient but also by the accumulated descent direction from the past.

This accelerates updates along dimensions where the gradient direction changes little, while damping updates along dimensions where the gradient direction changes significantly, leading to faster convergence and reduced oscillation, as shown in Figure 1:

Figure 1

The update process is as follows:

v_t = γ · v_{t-1} + η · g_t
θ_{t+1} = θ_t - v_t

where:

  • v_t is the accumulated update direction (the “velocity”), with v_0 = 0.
  • γ is the momentum coefficient, typically set to around 0.9.
  • g_t is the gradient of the loss on the current batch.

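A minimal NumPy sketch of the momentum update, assuming a velocity buffer initialized to zero and a momentum coefficient gamma of 0.9 (names are illustrative):

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, gamma=0.9):
    """One SGD-with-momentum update: accumulate past directions in a velocity buffer."""
    velocity = gamma * velocity + lr * grads   # partially retain the previous direction
    return params - velocity, velocity

params = np.array([0.5, -0.2, 1.0])
velocity = np.zeros_like(params)               # v_0 = 0
for grads in [np.array([0.1, -0.3, 0.05])] * 3:  # repeated gradient for illustration
    params, velocity = momentum_step(params, grads, velocity)
```
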
Advantages:

  • Momentum alleviates SGD’s problems of stalling where the gradient becomes 0 (so updates cannot continue at local optima) and of excessive oscillation.

Disadvantages:

  • When the local valley is deep, once the momentum is exhausted, the optimizer can still get trapped in the local optimum and oscillate back and forth.

Adagrad

To address these issues with SGD and Momentum, the AdaGrad (Adaptive Gradient) optimization algorithm was introduced in 2011.

The SGD family does not use second-order momentum. The emergence of second-order momentum marks the arrival of the era of “adaptive learning rate” optimization algorithms. SGD and its variants update every parameter with the same learning rate, yet deep neural networks often contain a huge number of parameters, and not all of them are used equally often (think of the embedding tables in LLMs).

For frequently updated parameters, we have accumulated a lot of knowledge about them and do not want them to be influenced too much by individual samples. Therefore, we hope to have a slower learning rate. For occasionally updated parameters, we have limited information about them and hope to learn more from each occurrence of a random sample. In other words, we desire a larger learning rate.

How do we measure how frequently a parameter has been updated? The method is second-order momentum, which means recording the sum of squares of all gradient values up to the present:

V_t = Σ_{τ=1}^{t} g_τ²   (accumulated element-wise for each parameter)

The update process is as follows:

θ_{t+1} = θ_t - η · g_t / (√V_t + ε)

The small constant ε is used to prevent division by zero.

Compared to the SGD update, Adagrad divides the learning rate η by a denominator derived from the second-order moment. For parameters that have been updated frequently in the past, the corresponding component of the second-order moment is large, so the effective learning rate is small. This is what makes the method adaptive.

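A minimal NumPy sketch of the Adagrad update, assuming an accumulator grad_sq_sum that stores the running sum of squared gradients (names are illustrative):

```python
import numpy as np

def adagrad_step(params, grads, grad_sq_sum, lr=0.01, eps=1e-8):
    """One Adagrad update: scale the learning rate by accumulated squared gradients."""
    grad_sq_sum = grad_sq_sum + grads ** 2                    # V_t grows monotonically
    params = params - lr * grads / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum

params = np.array([0.5, -0.2, 1.0])
grad_sq_sum = np.zeros_like(params)
params, grad_sq_sum = adagrad_step(params, np.array([0.1, -0.3, 0.05]), grad_sq_sum)
```
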
Advantages:

  • In scenarios with sparse data distribution, it can better utilize the information from sparse gradients and converge more effectively than the standard SGD algorithm.

Disadvantages:

  • It still requires manually setting a global learning rate η. If η is set too large, the regularizing denominator becomes overly sensitive and adjusts the gradients too aggressively.
  • In the later stages of training, the accumulated sum of squared gradients in the denominator keeps growing, driving the parameter updates toward zero and terminating learning prematurely.

RMSProp

The problem with AdaGrad is that the learning rate decays monotonically, and in many tasks it can shrink too much before the optimum is reached. RMSProp therefore uses an exponential moving average to gradually discard old gradient history, which prevents the learning rate from decaying too early.

The exponential moving average of the squared gradient is calculated as follows, where γ is the forgetting factor (or exponential decay rate), typically set to 0.9 based on empirical evidence:

E[g²]_t = γ · E[g²]_{t-1} + (1 - γ) · g_t²

The parameter update is similar to Adagrad, but with the moving average in the denominator:

θ_{t+1} = θ_t - η · g_t / (√E[g²]_t + ε)

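A minimal NumPy sketch of the RMSProp update, assuming avg_sq_grad holds the exponential moving average of the squared gradients (names are illustrative):

```python
import numpy as np

def rmsprop_step(params, grads, avg_sq_grad, lr=0.001, gamma=0.9, eps=1e-8):
    """One RMSProp update: decay old squared gradients instead of summing them forever."""
    avg_sq_grad = gamma * avg_sq_grad + (1.0 - gamma) * grads ** 2
    params = params - lr * grads / (np.sqrt(avg_sq_grad) + eps)
    return params, avg_sq_grad

params, avg_sq_grad = np.array([0.5, -0.2, 1.0]), np.zeros(3)
params, avg_sq_grad = rmsprop_step(params, np.array([0.1, -0.3, 0.05]), avg_sq_grad)
```
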
Advantages:

  • Avoids the unbounded accumulation of second-order momentum, overcoming AdaGrad’s problem of a rapidly shrinking learning rate and premature termination of training. It shows excellent adaptive learning rate behavior in many applications.
  • Especially for non-stationary objective functions, it performs better than basic SGD, Momentum, and AdaGrad.

Disadvantages:

  • It still requires manually setting a global learning rate η.

Adam

The Adam optimizer was proposed in December 2014, combining the advantages of the AdaGrad and RMSProp optimization algorithms. It takes into account the first-order moment estimation (mean of the gradients) and the second-order moment estimation (uncentered variance of the gradients) to calculate the update step.

The Adam method combines the aforementioned momentum and adaptive processes, dynamically adjusting both the gradients and the learning rate. If momentum can be seen as adding inertia to the optimization process, the adaptive process can be seen as adding resistance. The faster the speed, the greater the resistance.

The method first calculates the first-order and second-order moment estimates of the gradients, which correspond to the momentum and adaptive parts described above:

m_t = β1 · m_{t-1} + (1 - β1) · g_t
v_t = β2 · v_{t-1} + (1 - β2) · g_t²

β1 and β2 are hyperparameters, commonly set to 0.9 and 0.999 respectively.

If m_t and v_t are initialized as zero vectors, they are biased toward zero, especially during the first steps, so bias-corrected estimates are computed to counteract this:

m̂_t = m_t / (1 - β1^t)
v̂_t = v_t / (1 - β2^t)

The parameter update is then as follows:

θ_{t+1} = θ_t - η · m̂_t / (√v̂_t + ε)

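A minimal NumPy sketch of the Adam update, assuming m and v are the first- and second-moment buffers and t is the 1-based step count (names are illustrative):

```python
import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus a bias-corrected adaptive scale (v)."""
    m = beta1 * m + (1.0 - beta1) * grads         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grads ** 2    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction (t is 1-based)
    v_hat = v / (1.0 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

params = np.array([0.5, -0.2, 1.0])
m, v = np.zeros_like(params), np.zeros_like(params)
for t in range(1, 4):                             # three illustrative steps
    grads = np.array([0.1, -0.3, 0.05])
    params, m, v = adam_step(params, grads, m, v, t)
```
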
Advantages:

  • Simple implementation, efficient computation, and low memory requirements.
  • Hyperparameters have good interpretability and often require little to no adjustment or only minor fine-tuning.
  • Very suitable for scenarios with large-scale data and parameters, unstable objective functions, and problems with sparse gradients or significant gradient noise.

Disadvantages:

  • Adam’s reliance on moving averages of the gradients means its updates can fluctuate significantly as the distribution of the training data changes. In online scenarios such as advertising, this instability means Adam is often not as effective as AdaGrad.

Overall, Adam is considered a highly efficient optimizer in many situations.

Conclusion

This article introduced the most popular and widely used optimizers along with their advantages and disadvantages.

A few suggestions:

  • In practical applications, the choice of optimizer should be based on specific problems. Even with a thorough understanding of the data, it is still necessary to conduct sufficient parameter tuning experiments based on the characteristics of the data and algorithm to find the optimal solution.
  • If the data is sparse, use adaptive methods such as Adagrad, Adadelta, RMSprop, or Adam.
  • We can experiment on a small dataset to find the best optimization algorithm and then search for the optimal training hyperparameters through tuning, as shown in the sketch after this list.
  • When using adaptive learning rate algorithms, it is important to thoroughly shuffle the dataset so that the descent direction is not biased by certain features being concentrated at particular stages of training.
  • Continuously monitor the objective function values, accuracy, AUC, and other metrics on both the training and validation data during the training process. Monitoring the training data ensures that the model has been sufficiently trained with the correct descent direction and a sufficiently high learning rate. Monitoring the validation data helps prevent overfitting.

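To illustrate the small-dataset suggestion above, here is a rough PyTorch sketch that trains the same tiny linear model with several built-in optimizers and compares the final training loss; the toy data, model, and learning rates are made up for illustration and would need tuning on real problems.

```python
import torch
import torch.nn as nn

# Toy regression data, purely illustrative.
torch.manual_seed(0)
X = torch.randn(512, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(512, 1)

def make_optimizer(name: str, params):
    if name == "sgd":
        return torch.optim.SGD(params, lr=0.01)
    if name == "momentum":
        return torch.optim.SGD(params, lr=0.01, momentum=0.9)
    if name == "adagrad":
        return torch.optim.Adagrad(params, lr=0.01)
    if name == "rmsprop":
        return torch.optim.RMSprop(params, lr=0.001)
    return torch.optim.Adam(params, lr=0.001)

def train(name: str, epochs: int = 200) -> float:
    torch.manual_seed(0)                 # same initialization for a fair comparison
    model = nn.Linear(10, 1)
    optimizer = make_optimizer(name, model.parameters())
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        optimizer.step()
    return loss.item()

for name in ["sgd", "momentum", "adagrad", "rmsprop", "adam"]:
    print(f"{name:>8s}  final training loss: {train(name):.4f}")
```
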
Furthermore, the latest AI-related content can be found in my newsletter.

Finally, if there are any errors or omissions in this article, please kindly let me know.

Reference Materials

An overview of gradient descent optimization algorithms

On the momentum term in gradient descent learning algorithms

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

Adam: A method for stochastic optimization
