Recap of Stochastic Optimization in Deep Learning (part 1)
1. Introduction
This review is written for AI and Machine Learning practitioners who need a quick refresher on the topic. Anyone with some background can read it, but it is not a tutorial on deep learning optimization. I have tried to keep it concise while paying attention to important details that are encountered in practice and are useful for understanding some of the concepts. I wrote it because, as an Applied Scientist, I have had to refresh these things from time to time, so I summarized them in one document and decided to share it.
It consists of two parts. The first part covers the basics of stochastic gradient optimization, Exponentially Weighted Averages (EWA), and three more advanced algorithms: SGD with momentum, Root Mean Square Propagation (RMSprop), and the Adaptive Moment Estimation algorithm (Adam). In the second part, I would like to review the most recent algorithms: AdamW³, the Lookahead optimizer⁴ (the latest optimizer from Geoffrey Hinton’s group), the L4 optimizer², and a couple of others.
In the following text, vectors are denoted in bold font. Since there is no consistent way to write subscripts on Medium, I sometimes use the “_” (underscore) notation for subscripts, e.g. v_t; hopefully, this is not too annoying.
2. A brief recap of (stochastic) gradient optimization
We consider the problem of optimizing a scalar function ℓ(w, D) which depends on a parameter vector w and a dataset D. The simplest algorithm for optimizing this function is gradient descent.
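In my notation, with w_t denoting the parameters at iteration t, the update can be written as:

```latex
\mathbf{w}_{t+1} = \mathbf{w}_{t} - \alpha\, \frac{d\ell(\mathbf{w}_{t}, D)}{d\mathbf{w}}
```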
The gradient dℓ/dw points in the direction of maximal function growth, so to minimize the function we modify the parameters in the direction opposite to the gradient. α is the learning rate (LR), the most important hyper-parameter in deep learning optimization algorithms, and it must be set by the user.
In the general (batch) optimization literature there are many ways to choose the learning rate automatically and to move in a direction that is better than the gradient. However, in deep learning, those methods are not applicable for one reason or another. Any textbook on optimization covers batch gradient optimization. Gradient optimization is also called first-order optimization because only the first derivative of the cost function is used.
In deep learning, and in machine learning in general, the cost function can often be expressed as a sum of terms, where each term corresponds to one data point. For example, if we have a dataset
then the loss function typically is:
Since datasets in deep learning are typically large, computing the full loss function at every optimization step is too costly. Hence, the dataset is usually shuffled and split into mini-batches of size B. Then, for each mini-batch with index b, we can compute the cost ℓ_b which corresponds to this mini-batch, i.e. the loss computed only over the set of indices of the original dataset which belong to this mini-batch. The mini-batch size B is a hyper-parameter of the optimization algorithm.
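For concreteness, one way to write these quantities (my notation: N data points (x_n, y_n), per-example losses ℓ_n, and I_b the index set of mini-batch b) is:

```latex
D = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}, \qquad
\ell(\mathbf{w}, D) = \frac{1}{N} \sum_{n=1}^{N} \ell_n(\mathbf{w}, \mathbf{x}_n, y_n), \qquad
\ell_b(\mathbf{w}) = \frac{1}{|I_b|} \sum_{n \in I_b} \ell_n(\mathbf{w}, \mathbf{x}_n, y_n)
```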
When, instead of the complete loss function ℓ, we use only one mini-batch loss ℓ_b in the gradient descent update, the algorithm is called Stochastic Gradient Descent (SGD):
It is called stochastic precisely because the gradient is not computed exactly but only from a part of the data, which makes the gradient estimate stochastic. Every deep learning textbook has a chapter about this algorithm, and the internet is full of information. One interesting reference can be this one¹.
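As a concrete illustration, here is a minimal NumPy sketch of mini-batch SGD on a simple least-squares loss; the synthetic data, the quadratic per-example loss, and the hyper-parameter values are my own illustrative choices, not something prescribed by the text.

```python
# Mini-batch SGD on a least-squares loss l(w, D) = (1/N) * sum_n (x_n·w - y_n)^2
import numpy as np

rng = np.random.default_rng(0)
N, d, B, alpha = 1024, 5, 32, 0.1            # dataset size, dimension, batch size, learning rate
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.1 * rng.normal(size=N)    # noisy linear targets

w = np.zeros(d)
for epoch in range(20):
    idx = rng.permutation(N)                 # shuffle the dataset
    for b in range(0, N, B):                 # loop over mini-batches
        batch = idx[b:b + B]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 / len(batch) * Xb.T @ (Xb @ w - yb)   # gradient of the mini-batch loss
        w -= alpha * grad                    # SGD step
print("distance to w_true:", np.linalg.norm(w - w_true))
```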
In the following, we drop the subscript b from a mini-batch loss function and denote it as ℓ, but we assume that it is the loss of one mini-batch.
Now, let’s recap an important concept which is relevant for all the optimization algorithms that follow.
3. Exponentially Weighted Averages
Assume that we have sequential gradient estimates g₁, g₂, g₃, … and we want to average them in a way that gives more weight to the latest values. Also, we want to do it online, i.e. to update the average right after a new value g_t arrives. One way to do this is the Exponentially Weighted Average (EWA). It is a general procedure and can be applied to any sequence, not only gradients. Denote by v_t the weighted average of g₁, g₂, …, g_t. We set v₀ = 0 and use the following recursive update, with parameter β < 1.
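A standard way to write this update, and the form I will refer to in the rest of the text, is:

```latex
v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t, \qquad v_0 = 0
```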
Let’s see how v₄ is computed.
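Unrolling the recursion with v₀ = 0 makes the weights on the past values explicit:

```latex
v_4 = (1-\beta)\left(g_4 + \beta\, g_3 + \beta^{2} g_2 + \beta^{3} g_1\right)
```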
We see that the earlier a value is, the lower the weight it receives, because β < 1.
This averaging makes the time series smoother (see the figure below):
4. Bias correction
There is a better, bias-corrected, form of the Exponentially Weighted Average (EWA). First, let us consider what bias is. Suppose that the time series is a random process and every value of it has the same mean, E[gᵢ] = a. Then, consider the EWA expansion in Eq. (5):
So, in general form it is:
Now we see that if we redefine v_t such that:
Then, v_t on each step is unbiased, i.e. E[v_t] = a, so its mean equals the mean of the time series itself.
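Sketching the algebra behind this claim, using the expansion above and E[gᵢ] = a (I write the corrected value as v̂_t):

```latex
\mathbb{E}[v_t] = (1-\beta) \sum_{k=0}^{t-1} \beta^{k}\, a = (1-\beta^{t})\, a
\quad\Longrightarrow\quad
\hat{v}_t = \frac{v_t}{1-\beta^{t}}, \qquad \mathbb{E}[\hat{v}_t] = a
```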
Figure 1 (red curve) shows the bias-corrected EWA. The difference from the green curve appears only at the beginning because the bias decreases with time. Both versions are used in practical optimization algorithms; however, the bias-corrected version is generally preferable.
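A few lines of NumPy show the same effect on a synthetic noisy series (the series, its mean of 1.0, and β = 0.9 are illustrative choices):

```python
# EWA of a noisy series, with and without bias correction
import numpy as np

rng = np.random.default_rng(0)
g = 1.0 + 0.3 * rng.normal(size=200)     # noisy series with mean 1.0
beta = 0.9

v = 0.0
plain, corrected = [], []
for t, g_t in enumerate(g, start=1):
    v = beta * v + (1 - beta) * g_t      # recursive EWA update
    plain.append(v)                      # biased toward 0 for small t
    corrected.append(v / (1 - beta**t))  # bias-corrected estimate

print(plain[:3])      # starts far below the true mean of 1.0
print(corrected[:3])  # close to 1.0 from the very first steps
```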
5. Effective averaging length
Equations (4) and (8) define the formulae for the Exponentially Weighted Average. EWA is a weighted average of all the time steps before the current value v_t, and the weights of past values depend on the parameter β. For practical purposes, it is often necessary to estimate how many past values actually influence the current value. So, the question is: how many past values effectively influence v_t? The values far in the past have very small coefficients in front of them, so their influence is negligibly small. In physics and other domains it is frequently assumed that the influence becomes negligible when the weight has decreased by a factor of e, so in our case:
Let’s simplify this, using the properties of logarithms:
To obtain an even simpler expression, one which allows counting in our head, we can expand logₑβ into a Taylor series around the value 1:
Note that we had to expand only the denominator, since the whole function t(β) cannot be expanded (its derivatives exist neither at 0 nor at 1).
Finally, we have a simple formula: the effective number of past time steps which influence the current value is approximately 1/(1−β).
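Putting the steps of the derivation together:

```latex
\beta^{t} = e^{-1}
\;\Longrightarrow\;
t = -\frac{1}{\log_e \beta}
\;\approx\; \frac{1}{1-\beta},
\qquad \text{using } \log_e \beta \approx \beta - 1 \ \text{near } \beta = 1
```

For example, β = 0.9 corresponds to averaging over roughly the last 10 values, and β = 0.999 over roughly the last 1000.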
6. SGD with momentum
Stochastic Gradient Descent (SGD) with momentum is an improved version of the plain SGD from Section 2. The main idea is to keep an Exponentially Weighted Average (EWA) of the gradient vector and to use this EWA to update the weights on every iteration. Mathematically, this can be written as follows.
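In the notation of the previous sections, with g_t the mini-batch gradient at step t:

```latex
\mathbf{v}_t = \beta\, \mathbf{v}_{t-1} + (1-\beta)\, \mathbf{g}_t, \qquad
\mathbf{w}_t = \mathbf{w}_{t-1} - \alpha\, \mathbf{v}_t
```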
In the equations above, β is the momentum parameter and α is the learning rate. A typical default value of the momentum is 0.9.
No bias correction (Sec. 4) is typically used for momentum SGD because it is acceptable to have smaller gradient estimates during the first several iterations.
Typically, SGD with momentum works better than plain SGD because in plain SGD the gradient estimate may vary significantly from one mini-batch to another. The Exponentially Weighted Average of the gradient smooths out this variability, and the gradient estimates become more consistent.
Also, in some implementations the factor (1−β) is missing in front of the gradient. However, in this case v_t is larger by a factor of 1/(1−β), and the learning rate has to be adjusted every time the momentum changes. Let’s explore this in more detail in the section below.
A different form of SGD with momentum
If we expand the recursion of the standard SGD-with-momentum form (the EWA recursion, Eq. (4)), we get an expression similar to the one in the bias-correction derivation in Eq. (6):
The term with v₀ vanishes because v₀ = 0 by convention.
Furthermore, if we assume that we are optimizing a linear function, then all gradients are the same, g_k = g. This assumption is only illustrative, but the conclusions also apply to nonlinear functions. Applying the geometric series formula, we obtain:
Now, suppose that the momentum implementation has no (1−β) term. Then the recursive expansion of the gradients goes exactly as in Eq. (12), except that there is no (1−β) in front of the gradient term. So we obtain:
If we now insert the last expression into the SGD update, we get the following.
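Putting the pieces together (treating one parameter component with a constant gradient g and v₀ = 0, as assumed above):

```latex
\begin{align*}
\text{with } (1-\beta)\text{:}\quad
  & v_t = (1-\beta) \sum_{k=0}^{t-1} \beta^{k} g = (1-\beta^{t})\, g \;\longrightarrow\; g, \\
\text{without } (1-\beta)\text{:}\quad
  & v_t = \sum_{k=0}^{t-1} \beta^{k} g = \frac{1-\beta^{t}}{1-\beta}\, g \;\longrightarrow\; \frac{g}{1-\beta}, \\
\text{so that, without } (1-\beta)\text{:}\quad
  & w_t = w_{t-1} - \alpha\, v_t \;\approx\; w_{t-1} - \frac{\alpha}{1-\beta}\, g \quad \text{for large } t.
\end{align*}
```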
So we see that, because v_t converges to g/(1−β) rather than to g, the effective learning rate changes. Therefore, if this form of momentum is used, then every time the momentum parameter is changed the learning rate should be modified as well: they become coupled. This does not happen if the momentum form of Eq. (11) is used.
This coupled form of momentum SGD is implemented, e.g., in the PyTorch library: https://github.com/pytorch/pytorch/blob/master/torch/optim/sgd.py. It must be taken into account in practice.
7. Root-mean square propagation (RMSprop)
The idea of Root Mean Square Propagation (RMSprop) is to adapt the learning rate so that it is smaller for parameters with large gradients and larger for parameters with small gradients. To achieve that, an EWA of the element-wise square of the gradients is computed. The full expression of the RMSprop update is below:
Note that when the gradient vector g_t is squared, the square is element-wise; the square root and the division of vectors are element-wise operations as well. s_t is then used in the denominator, so the effective learning rate is smaller for larger s_t. The EWA parameter is denoted β₂ to distinguish it from the regular momentum parameter β.
The constant ε prevents division by zero and is set to a small number, e.g. 10⁻⁸.
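As a sketch, one RMSprop step could look as follows in NumPy; the helper grad(w) (returning the mini-batch gradient) and the default hyper-parameter values are assumptions for illustration only.

```python
import numpy as np

def rmsprop_step(w, s, grad, alpha=1e-3, beta2=0.9, eps=1e-8):
    """One RMSprop update; w, s and the returned gradient are NumPy arrays."""
    g = grad(w)                               # mini-batch gradient at the current weights
    s = beta2 * s + (1 - beta2) * g**2        # EWA of element-wise squared gradients
    w = w - alpha * g / (np.sqrt(s) + eps)    # per-parameter adaptive step
    return w, s
```

Inside a training loop, s starts at np.zeros_like(w) and is carried over from step to step, exactly like the EWA from Section 3.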
8. Adaptive moment estimation algorithm (ADAM)
The Adaptive Moment Estimation algorithm (Adam) is the most popular stochastic gradient optimization algorithm these days. Adam is a combination of the two previous algorithms: SGD with momentum and RMSprop. Its precise formulation is the following:
Adam is an adaptive learning rate algorithm as well. The reason is that the squared gradients in the denominator can be considered as factors modulating the learning rate: if they are large, the learning rate becomes small, and vice versa. Hence, Adam is able to adjust the learning rate for every parameter.
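A compact NumPy sketch of one Adam step, combining the momentum and RMSprop averages with the bias correction from Section 4; again, grad(w) and the default hyper-parameter values are illustrative assumptions.

```python
import numpy as np

def adam_step(w, v, s, t, grad, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update at iteration t (t starts at 1); all arrays are NumPy arrays."""
    g = grad(w)
    v = beta1 * v + (1 - beta1) * g           # EWA of gradients (momentum part)
    s = beta2 * s + (1 - beta2) * g**2        # EWA of squared gradients (RMSprop part)
    v_hat = v / (1 - beta1**t)                # bias correction (Sec. 4)
    s_hat = s / (1 - beta2**t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```

Both v and s start at np.zeros_like(w), which is exactly why the bias correction matters during the first iterations.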
9. Conclusion
In this part, we have considered the main concepts behind modern stochastic optimizers. The core concept is the EWA, which serves as a basis for SGD with momentum, RMSprop, and Adam. In the second part, we will consider the latest algorithms: AdamW³, Lookahead⁴, and the L4 optimizer². I am looking forward to your comments and questions!
Bibliography:
¹ Leon Bottou, Stochastic Gradient Descent Tricks, in Neural Networks: Tricks of the Trade, Reloaded, 2012.
² Michal Rolinek and Georg Martius, L4: Practical loss-based stepsize adaptation for deep learning, NeurIPS 2018.
³ Ilya Loshchilov and Frank Hutter, Decoupled Weight Decay Regularization, ICLR 2019.
⁴ Michael R. Zhang, James Lucas, Geoffrey E. Hinton and Jimmy Ba, Lookahead Optimizer: k steps forward, 1 step back, http://arxiv.org/abs/1907.08610
⁵ Andrew Ng, Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization, Coursera.com, https://www.coursera.org/learn/deep-neural-network/home/welcome