A 2021 Guide to improving CNNs-Optimizers: Adam vs SGD
This is the third post in my series, A 2021 Guide to improving CNNs.
Optimizers
An optimizer can be viewed as a mathematical function that modifies the weights of the network given the gradients and, depending on its formulation, additional information. Optimizers are built upon the idea of gradient descent, the greedy approach of iteratively decreasing the loss function by following the gradient.
Such functions can be as simple as subtracting the gradients from the weights, or can also be very complex.
Better optimizers mainly aim to be faster and more efficient, but some are also known to generalize better (overfit less) than others. The choice of optimizer can indeed have a dramatic influence on the performance of the model.
We will review the components of the commonly used Adam optimizer. We will also discuss the debate on whether SGD generalizes better than Adam-based optimizers. Finally, we will review some papers that compare the performance of such optimizers and draw a conclusion about optimizer selection. One thing to note is that designing optimizers that improve practical convergence speed and also generalize well across various settings is very challenging.
~Adam
Vanilla GD (SGD)
Strictly speaking, stochastic gradient descent (SGD) refers to the specific case of vanilla GD in which the batch size is 1. For convenience, however, this post refers to mini-batch GD, SGD, and batch GD collectively as SGD.
SGD is the most basic form of GD. SGD subtracts the gradient multiplied by the learning rate from the weights. Despite its simplicity, SGD has strong theoretical foundations and is still used in training edge NNs.
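To make the update rule concrete, here is a minimal sketch in plain Python. The toy quadratic loss, the learning rate, and the function names are assumptions chosen for illustration, not something taken from a particular library.

```python
# Minimal sketch of the vanilla (S)GD update on a toy quadratic loss.
# The loss, learning rate, and step count are illustrative assumptions.

def grad(theta):
    """Gradient of the toy loss L(theta) = sum((t - 3)^2 for t in theta)."""
    return [2.0 * (t - 3.0) for t in theta]

def sgd_step(theta, grads, lr):
    """Subtract the gradient, scaled by the learning rate, from each weight."""
    return [t - lr * g for t, g in zip(theta, grads)]

theta = [0.0, 10.0]
for _ in range(100):
    theta = sgd_step(theta, grad(theta), lr=0.1)

print(theta)  # both weights end up very close to the minimizer 3.0
```

In real training the gradient would come from a mini-batch of data rather than from a closed-form expression, but the weight update itself has the same shape.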
Momentum
Momentum is often likened to a ball rolling downhill, as it is conceptually equivalent to adding velocity. The weights are updated through a momentum term, which is calculated as an exponentially decaying moving average of the gradients. The momentum coefficient γ can be seen as air resistance or friction that decays the accumulated velocity proportionally. Momentum accelerates the training process but adds an additional hyperparameter.
Essentially, the momentum update is equal to subtracting an exponentially decaying sum of past gradients, where d_i is the gradient at step i and α is the learning rate: θ -= α(d_i + γd_(i-1) + γ²d_(i-2) + γ³d_(i-3) + …)
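The decaying sum above is what you get by unrolling a simple recurrence. The sketch below assumes the velocity form v_i = γ·v_(i-1) + d_i with update θ -= α·v_i (one common way to write momentum), and checks numerically that it matches the explicit sum. The gradient values are made up.

```python
# Sketch: the momentum recurrence and its unrolled form give the same update.
# Assumes v_i = gamma * v_(i-1) + d_i and theta -= lr * v_i.

def momentum_updates(grads, lr, gamma):
    """Return the per-step weight updates lr * v_i computed recursively."""
    v = 0.0
    updates = []
    for d in grads:
        v = gamma * v + d          # decay the old velocity, add the new gradient
        updates.append(lr * v)
    return updates

def unrolled_update(grads, lr, gamma):
    """Last update written as the explicit decaying sum over past gradients."""
    n = len(grads)
    return lr * sum(grads[n - 1 - k] * gamma ** k for k in range(n))

grads = [1.0, -0.5, 2.0, 0.25]     # made-up gradients d_1..d_4
recursive = momentum_updates(grads, lr=0.1, gamma=0.9)[-1]
explicit = unrolled_update(grads, lr=0.1, gamma=0.9)
print(recursive, explicit)         # the two values agree up to rounding
```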
RMSProp
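RMSProp is an unpublished method that is similar in spirit to momentum: it also keeps an exponentially decaying average v_i, but of the squared gradients rather than of the gradients themselves. The learning rate of each parameter is divided by the square root of v_i, so if the gradients are consistently large, v_i increases and the effective learning rate decreases. This adaptively adjusts the learning rate for each parameter and enables the use of larger learning rates.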
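A rough sketch of this per-parameter scaling is shown below. It assumes the commonly used form of RMSProp (a decaying average of squared gradients, with the step divided by its square root); the decay of 0.9, the ε value, and the constant test gradients are illustrative assumptions.

```python
import math

# Sketch of an RMSProp-style update for a single parameter.
# decay, eps, and the constant gradients below are illustrative assumptions.

def rmsprop_step(theta, grad, v, lr=0.01, decay=0.9, eps=1e-8):
    """One update: track the average squared gradient, then scale the step by it."""
    v = decay * v + (1.0 - decay) * grad * grad
    theta = theta - lr * grad / (math.sqrt(v) + eps)
    return theta, v

def effective_step(grad_value, steps=200):
    """Size of the last step when the gradient is constant at grad_value."""
    theta, v = 0.0, 0.0
    step = 0.0
    for _ in range(steps):
        new_theta, v = rmsprop_step(theta, grad_value, v)
        step = theta - new_theta
        theta = new_theta
    return step

large = effective_step(100.0)   # consistently large gradient
small = effective_step(0.01)    # consistently small gradient
print(large, small)             # both steps settle near the learning rate
```

Even though the two gradients differ by four orders of magnitude, the resulting step sizes are nearly identical, which is the sense in which the learning rate is adjusted per parameter.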
Adam[10]
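Adam essentially combines RMSProp and momentum by storing both the per-parameter learning-rate scaling of RMSProp and the weighted gradient average of momentum. With decay factors β1 and β2 and gradient d_i, the momentum and RMSProp terms are calculated as m_i = β1·m_(i-1) + (1-β1)·d_i and v_i = β2·v_(i-1) + (1-β2)·d_i².
Both terms are bias-corrected before being applied to the weights in the gradient descent step: each is divided by (1 - β^i), where β is the corresponding decay factor and i is the step count.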
As in the equation above, Adam is based on RMSProp but uses the momentum term as its estimate of the gradient to improve training speed. In the experiments of [10], Adam outperformed the other methods across the training setups considered in the paper. Adam has since become a default optimization algorithm across fields. However, Adam introduces two new hyperparameters and complicates the hyperparameter tuning problem.
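Putting the pieces together, a single Adam step can be sketched as follows. The symbol names and default values (lr=0.001, β1=0.9, β2=0.999, ε=1e-8) follow the suggestions in [10]; the toy loss θ² and the function name `adam_step` are assumptions for illustration only.

```python
import math

# Sketch of one Adam update for a single parameter, following the form in [10].
# The toy loss theta^2 used below is an illustrative assumption.

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad          # momentum-like average
    v = beta2 * v + (1.0 - beta2) * grad * grad   # RMSProp-like average
    m_hat = m / (1.0 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Thanks to bias correction, the very first step has size ~lr
# regardless of how large or small the gradient is.
first_big, _, _ = adam_step(0.0, 10.0, 0.0, 0.0, t=1)
first_small, _, _ = adam_step(0.0, 0.001, 0.0, 0.0, t=1)
print(first_big, first_small)

# Minimizing the toy loss theta^2 (gradient 2 * theta).
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 5001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
print(theta)  # ends up near the minimizer 0
```

Without the two bias-correction lines, m and v would start close to zero and the first updates would be distorted, which is why the division by (1 - β^t) matters most early in training.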
SGD is better?
One interesting and dominant argument about optimizers is that SGD generalizes better than Adam. The papers discussed below argue that although Adam converges faster, SGD generalizes better and thus results in improved final performance.
Better “stability” of SGD[12]
[12] argues that SGD is conceptually stable for convex and continuous optimization. First, it argues that reducing training time has the benefit of decreasing generalization error: the model does not see the same data many times, so it cannot simply memorize the data without gaining generalization ability. This seems like a reasonable argument.
The paper bounds the generalization error, the difference between the training and test error, of models learned through SGD. An algorithm is uniformly stable if the loss of the learned model changes only slightly when any single training data point is changed. The stability of the algorithm is related to the generalization error. The paper gives a mathematical proof that SGD is uniformly stable for strongly convex loss functions, and thus might have optimal generalization error. The paper also shows that the results carry over to non-convex loss functions, provided that the number of iterations is not too large.
Examples of such cases (theoretical + empirical) [9]
[9] points out problems with adaptive optimization methods (e.g., RMSProp, Adam) in simple over-parameterized experiments and provides further empirical evidence of the poor generalization performance of such methods. It also shows theoretically that adaptive and non-adaptive optimization methods can find very different solutions with very different generalization properties.
First, the paper discusses the observation that when a problem has multiple global minima, different algorithms can find entirely different solutions even when initialized from the same point, and it constructs a theoretical example in which adaptive gradient methods find a solution that is worse than the one found by SGD. In short, non-adaptive methods including SGD and momentum converge towards a minimum-norm solution in a binary least-squares classification task, while adaptive methods can converge to a different solution that generalizes poorly.
The paper also presents four empirical experiments using deep learning and reports the following findings:
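The "different solutions from the same starting point" effect can be reproduced on a tiny over-parameterized problem. The sketch below is not the construction from [9]; it is a hypothetical toy case with one training example and two weights, and it uses AdaGrad as a simple stand-in for adaptive methods. Both optimizers start at zero and both fit the example, but they end at different weights.

```python
import math

# Toy over-parameterized least-squares problem: one example, two weights.
# The data and the use of AdaGrad as the adaptive method are assumptions.
x = [1.0, 2.0]   # features of the single training example
y = 1.0          # its label

def predict(w):
    return w[0] * x[0] + w[1] * x[1]

def grad(w):
    """Gradient of the squared loss 0.5 * (w.x - y)^2."""
    r = predict(w) - y
    return [r * x[0], r * x[1]]

def run_gd(steps=500, lr=0.1):
    w = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

def run_adagrad(steps=500, lr=0.1, eps=1e-12):
    w = [0.0, 0.0]
    acc = [0.0, 0.0]   # running sum of squared gradients per coordinate
    for _ in range(steps):
        g = grad(w)
        acc = [a + gi * gi for a, gi in zip(acc, g)]
        w = [wi - lr * gi / (math.sqrt(a) + eps) for wi, gi, a in zip(w, g, acc)]
    return w

def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))

w_gd, w_ada = run_gd(), run_adagrad()
print(w_gd, predict(w_gd), norm(w_gd))      # minimum-norm solution, along x
print(w_ada, predict(w_ada), norm(w_ada))   # equal weights, larger norm
```

Plain gradient descent stays in the direction of the data and lands on the minimum-norm interpolating solution, while the per-coordinate rescaling pushes AdaGrad towards equal weights. Which of the two generalizes better depends on the data, which is the point the paper's theoretical example makes.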
- Adaptive methods find solutions that generalize worse than those found by non-adaptive methods.
- Even when the adaptive methods achieve the same training loss or lower than non-adaptive methods, the test performance is worse.
- Adaptive methods often display faster initial progress on the training set, but their performance quickly plateaus on the validation set.
- Though conventional wisdom suggests that Adam does not require tuning, we find that tuning the initial learning rate and decay scheme for Adam yields significant improvements over its default settings in all cases.
These papers demonstrate that adaptive optimization is fast in the initial stages of training but often fails to generalize to validation data. Interestingly, the relative ordering of the adaptive methods differs from case to case, while SGD outperforms all other methods on the validation set in most cases.
Maybe not? [8]
A more recent paper suggests that hyperparameter tuning could be the reason that adaptive optimization algorithms failed to generalize. The experiments in [8] show different results from the papers above once the hyperparameter search spaces are changed.
This makes intuitive sense: a more general optimizer (e.g., Adam) can approximate its simpler component optimizers (e.g., Momentum, SGD, RMSProp) through particular hyperparameter choices and therefore should not be worse than its components. The paper argues that the hyperparameter search spaces used to provide empirical evidence that SGD is better were too narrow and unfair to adaptive methods. Its experiments were therefore run over a relatively large search space (Appendix D of [8]).
As a result, the fine-tuned adaptive optimizers were faster than standard SGD and did not lag behind in generalization performance. Every optimal hyperparameter value lay away from the search space boundaries for all optimizers, which suggests that the search space was appropriate. Interestingly, the optimal hyperparameters varied widely between datasets.
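One way to see this containment numerically: with β1 = 0 the momentum part of Adam is switched off, and with a very large ε the denominator is dominated by ε, so the update becomes roughly (lr/ε) times the gradient, which is plain SGD. The sketch below checks this on a toy quadratic. The specific values (ε = 1e8, α = 0.1, the loss) are illustrative assumptions and not settings taken from [8].

```python
import math

# Sketch: Adam with beta1 = 0 and a huge epsilon behaves like SGD
# with learning rate lr / eps. All numeric settings are assumptions.

def adam_step(theta, grad, m, v, t, lr, beta1, beta2, eps):
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

def sgd_step(theta, grad, lr):
    return theta - lr * grad

def loss_grad(theta):
    """Gradient of the toy loss (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

alpha = 0.1      # target SGD learning rate
big_eps = 1e8    # deliberately huge epsilon
theta_sgd, theta_adam = 0.0, 0.0
m, v = 0.0, 0.0
max_gap = 0.0
for t in range(1, 51):
    theta_sgd = sgd_step(theta_sgd, loss_grad(theta_sgd), alpha)
    theta_adam, m, v = adam_step(theta_adam, loss_grad(theta_adam), m, v, t,
                                 lr=alpha * big_eps, beta1=0.0,
                                 beta2=0.999, eps=big_eps)
    max_gap = max(max_gap, abs(theta_sgd - theta_adam))

print(theta_sgd, theta_adam, max_gap)  # the two trajectories nearly coincide
```

A search space for Adam that never lets ε grow large, or that fixes β1, rules out such SGD-like configurations, which is one reason the breadth of the search space matters in these comparisons.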
The authors confidently state:
"In particular, we find that the popular adaptive gradient methods never underperform momentum or gradient descent."
The main finding of this paper is that when all available hyperparameters are tuned at the scales used in deep learning, more general optimizers never underperform their special cases. In particular, the authors observe that RMSProp, Adam, and NAdam never underperformed SGD, Nesterov, or Momentum. Although the experiments have some limitations due to potentially confounding settings (e.g., the batch size was not tuned, and a specific tuning protocol was used), the message is striking and interesting.
For now, we could say that fine-tuned Adam performs at least as well as SGD, while a performance gap between Adam and SGD can remain when default hyperparameters are used.
References
[1] Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
[2] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7).
[3] Zhang, M. R., Lucas, J., Hinton, G., & Ba, J. (2019). Lookahead optimizer: k steps forward, 1 step back. arXiv preprint arXiv:1907.08610.
[4] Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
[5] Zhuang, J., Tang, T., Ding, Y., Tatikonda, S., Dvornek, N., Papademetris, X., & Duncan, J. S. (2020). AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv preprint arXiv:2010.07468.
[6] Loshchilov, I., & Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
[7] Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628.
[8] Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J., & Dahl, G. E. (2019). On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446.
[9] Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292.
[10] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[11] You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., … & Hsieh, C. J. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.
[12] Hardt, M., Recht, B., & Singer, Y. (2016, June). Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (pp. 1225–1234). PMLR.