Deep Learning Book: Chapter 8 — Optimization For Training Deep Models Part II
This is going to be a series of blog posts on the Deep Learning book where we attempt to provide a summary of each chapter, highlighting the concepts that we found most important, so that other people can use it as a starting point for reading the chapters, while adding further explanations on a few areas that we found difficult to grasp. Please refer to this for more clarity on notation.
This is a continuation of our previous post on the first three sections of Chapter 8. The main concepts that we discussed were How learning differs from pure optimization, Challenges in Neural Network Optimization and the Basic algorithms used in optimization. The remaining sections of the chapter deal with:
1. Parameter Initialization Strategies
Training algorithms for deep learning models are iterative in nature and require the specification of an initial point. This choice is crucial, as it often decides whether the algorithm converges at all and, if it does, whether it converges to a point with high or low cost.
We have limited understanding of neural network optimization, but the one property that we know with complete certainty is that the initialization should break symmetry. This means that if two hidden units are connected to the same input units, they should have different initializations, or else the gradient would update both units in exactly the same way and nothing new is learned by having the additional unit. The idea of having each unit learn something different motivates random initialization of the weights, which is also computationally cheaper than explicitly ensuring that every unit computes a different function.
Biases are often chosen heuristically (mostly zero) and only the weights are randomly initialized, almost always from a Gaussian or uniform distribution. The scale of the distribution is of utmost concern. Large weights may have a stronger symmetry-breaking effect but can lead to chaos (extreme sensitivity to small perturbations in the input) and to exploding values during forward and back propagation. As an example of how large weights might lead to chaos, suppose a slight noise ϵ is added to the input. If we apply just a simple linear transformation like W * x, the noise adds a term W * ϵ to the output; when the weights are large, this makes a significant contribution to the output. SGD and its variants also tend to halt in areas near the initial values, thereby expressing a prior that the path from the initial values to the final parameters is discoverable by steepest descent algorithms. A more mathematical explanation for the symmetry breaking can be found in the Appendix.
Various suggestions have been made for appropriate initialization of the parameters. The most commonly used ones include sampling the weights of each fully-connected layer having m inputs and n outputs uniformly from one of the following distributions:
- U(-1 / √m, 1 / √m)
- U(-√(6 / (m+n)), √(6 / (m+n)))
These initializations have already been incorporated into the most commonly used deep learning frameworks, so that you can simply specify which initializer to use and the framework takes care of sampling appropriately. For example Keras, a popular deep learning framework, has a module called initializers, where the second distribution (among the two mentioned above) is implemented as glorot_uniform. Both heuristics are also easy to sample by hand, as sketched below.
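As a rough illustration (the layer sizes below are arbitrary), the two heuristics can be sampled directly with NumPy; the glorot_uniform initializer in Keras corresponds to the second one:

```python
import numpy as np

def uniform_init(m, n):
    """Heuristic 1: sample W from U(-1/sqrt(m), 1/sqrt(m))."""
    limit = 1.0 / np.sqrt(m)
    return np.random.uniform(-limit, limit, size=(m, n))

def glorot_uniform_init(m, n):
    """Heuristic 2 (Glorot): sample W from U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return np.random.uniform(-limit, limit, size=(m, n))

# Example: a fully-connected layer with 784 inputs and 256 outputs.
W = glorot_uniform_init(784, 256)
b = np.zeros(256)  # biases are commonly initialized to zero
```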
One drawback of using 1 / √m as the standard deviation is that the weights end up being very small when a layer has a large number of input units. Motivated by the idea of making the total amount of input to each unit independent of the number of input units m, sparse initialization sets each unit to have exactly k non-zero incoming weights. However, gradient descent takes a long time to correct incorrectly large values, so this initialization scheme can cause problems.
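A minimal sketch of sparse initialization, where the value of k and the Gaussian scale are illustrative choices:

```python
import numpy as np

def sparse_init(m, n, k=15, scale=1.0):
    """Give each of the n units exactly k non-zero incoming weights.

    The non-zero entries are Gaussian; all other entries are zero, so the total
    input to each unit does not depend on the number of input units m."""
    W = np.zeros((m, n))
    for j in range(n):
        idx = np.random.choice(m, size=k, replace=False)  # choose k input connections
        W[idx, j] = scale * np.random.randn(k)
    return W
```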
If the weights are too small, the range of activations across the mini-batch will shrink as the activations propagate forward through the network. By repeatedly identifying the first layer with unacceptably small activations and increasing its weights, it is possible to eventually obtain a network with reasonable initial activations throughout.
The biases are relatively easier to choose. Setting the biases to zero is compatible with most weight initialization schemes, except in a few cases: for example, for an output unit, to prevent saturation at initialization, or when a unit acts as a gate for making a decision. Refer to the chapter for details.
2. Algorithms with Adaptive Learning Rates
- AdaGrad: As mentioned in Part I, it is important to gradually decrease the learning rate for faster convergence. Instead of manually reducing the learning rate after each (or every few) epochs, a better approach is to adapt the learning rate as training progresses. AdaGrad does this by scaling the learning rate of each model parameter individually, inversely proportional to the square root of the sum of historical squared values of the gradient. In the parameter update equation below, r is initialized to 0 and the multiplication in the update step happens element-wise. Since the gradient value differs for each parameter, the learning rate is scaled differently for each parameter too. Parameters with a large gradient have their learning rate decreased sharply: a learning rate that is too high can cause oscillations, or, when approaching a minimum, can cause the update to jump over it, as explained in the figure below, so it should be reduced for better convergence. Parameters with small gradients have their learning rate decreased only slightly, since they may already be close to their respective minima and should not be pushed away from them; even if they are not, reducing the learning rate too much would shrink the updates further and slow down learning. (A minimal NumPy sketch of the update appears after the example below.)
However, accumulating squared gradients from the very beginning of training can lead to an excessive and premature decrease in the learning rate. Consider a model with only 2 parameters (for simplicity) where both initial gradients are 1000. After some iterations, the gradient of one parameter has reduced to 100 while that of the other is still around 750. Because of the accumulation at every update, however, the accumulated totals end up of roughly the same size. For example (using raw gradient magnitudes for simplicity), let the accumulated gradients for Parameter 1 be 1000 + 900 + 700 + 400 + 100 = 3100, giving 1/3100 ≈ 0.0003, and those for Parameter 2 be 1000 + 900 + 850 + 800 + 750 = 4300, giving 1/4300 ≈ 0.0002. This leads to a similar decrease in the learning rate for both parameters, even though the parameter with the smaller gradient may have its learning rate reduced too much, leading to slower learning.
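Here is a minimal NumPy sketch of one AdaGrad step; the learning rate and the small constant delta are typical but arbitrary choices:

```python
import numpy as np

def adagrad_update(theta, grad, r, lr=0.01, delta=1e-7):
    """One AdaGrad step: r accumulates the element-wise squares of all past
    gradients, so each parameter's effective step size shrinks with its own
    gradient history."""
    r = r + grad * grad                                 # accumulate squared gradients
    theta = theta - lr * grad / (delta + np.sqrt(r))    # element-wise scaled update
    return theta, r
```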
- RMSProp: RMSProp addresses the problem caused by accumulated gradients in AdaGrad. It modifies the gradient accumulation step to an exponentially weighted moving average in order to discard history from the extreme past. The RMSProp update is given by:
This allows the algorithm to converge rapidly after finding a convex bowl, as if it were an instance of AdaGrad initialized within that bowl. To see why, consider the figure below. The region marked 1 indicates the usual RMSProp parameter updates as given by the update equation, which are nothing but exponentially averaged AdaGrad updates. Once the optimization process reaches point A, it essentially lands at the top of a convex bowl. At this point, intuitively, all the updates before A have been forgotten due to the exponential averaging, and it is as if (exponentially averaged) AdaGrad updates start from point A onwards. A minimal sketch of the update follows.
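A minimal sketch of one RMSProp step; the decay rate ρ and the other hyperparameter values are illustrative:

```python
import numpy as np

def rmsprop_update(theta, grad, r, lr=0.001, rho=0.9, delta=1e-6):
    """One RMSProp step: like AdaGrad, but r is an exponentially weighted moving
    average of squared gradients, so history from the extreme past is forgotten."""
    r = rho * r + (1.0 - rho) * grad * grad
    theta = theta - lr * grad / np.sqrt(delta + r)
    return theta, r
```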
- Adam: The name is derived from "adaptive moments". It combines RMSProp and momentum: momentum is viewed as an estimate of the first-order moment of the gradient, and the RMSProp-style accumulator as an estimate of the second-order moment. The weight update for Adam is given by:
Secondly, since s and r are initialized as zeros, the authors observed a bias during the initial steps of training and therefore added a correction term for both moments to account for their initialization near the origin. To see the effect of this bias correction, consider the values of s and r for a single parameter (in which case everything is a scalar). Without bias correction, since s (not in bold, as it is a scalar here) is initialized as zero, after the first iteration the value of s would be (1 - ρ1) * g and that of r would be (1 - ρ2) * g². The suggested default values for ρ1 and ρ2 are 0.9 and 0.999 respectively, so the initial values of s and r are very small, and this compounds as training progresses. With bias correction, however, after the first iteration the corrected value of s is just g and that of r is just g², which removes the bias that occurs in the initial phase of training. A major advantage of Adam is that it is fairly robust to the choice of these hyperparameters, i.e. ρ1 and ρ2. (A minimal sketch of the update follows.)
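A minimal sketch of one Adam step with bias correction; the hyperparameter defaults shown are the commonly suggested ones, and t is the time step starting at 1:

```python
import numpy as np

def adam_update(theta, grad, s, r, t, lr=0.001, rho1=0.9, rho2=0.999, delta=1e-8):
    """One Adam step combining a momentum-like first moment (s) and an
    RMSProp-like second moment (r), with bias correction for the zero init."""
    s = rho1 * s + (1.0 - rho1) * grad           # biased first-moment estimate
    r = rho2 * r + (1.0 - rho2) * grad * grad    # biased second-moment estimate
    s_hat = s / (1.0 - rho1 ** t)                # bias-corrected first moment
    r_hat = r / (1.0 - rho2 ** t)                # bias-corrected second moment
    theta = theta - lr * s_hat / (np.sqrt(r_hat) + delta)
    return theta, s, r
```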
The figure below shows the comparison between the various optimization methods discussed above. It can be clearly seen that algorithms with adaptive learning rates provide faster convergence:
3. Approximate Second-Order Methods
The optimization algorithms that we have looked at so far involve computing only the first derivative. But there are many methods which involve higher-order derivatives as well. The main problem with these algorithms is that they are not practically feasible in their vanilla form, so certain approximations are used to make them tractable. We explain three such methods, all of which use the empirical risk as the objective function:
- Newton's Method: This is the most commonly used higher-order method. It makes use of the curvature of the loss function via its second-order derivative to arrive at the optimal point. Using the second-order Taylor series expansion to approximate J(θ) around a point θ₀ and ignoring derivatives of order greater than 2 (this has already been discussed in previous chapters), we get:
We know that we get a critical point for any function f(x) by solving f'(x) = 0. Doing the same for the above approximation gives the following critical point (refer to the Appendix for proof):
For quadratic surfaces (i.e. when the cost function is quadratic), this directly gives the optimal result in one step, whereas gradient descent would still need to iterate. However, for surfaces that are not quadratic, as long as the Hessian remains positive definite, we can obtain the optimal point through a two-step iterative process: 1) compute the inverse of the Hessian and 2) update the parameters. (A small sketch of a single Newton step is given at the end of this discussion.)
Saddle points are problematic for Newton's method. If not all the eigenvalues of the Hessian are positive, Newton's method may cause the updates to move in the wrong direction. A way to avoid this is to add regularization:
However, if there is strong negative curvature, i.e. there are large negative eigenvalues, α needs to be sufficiently large to offset them, in which case the Hessian becomes dominated by the added diagonal term. The update then reduces to the standard gradient divided by α:
Another problem restricting the use of Newton’s method is the computational cost. It takes O(k³) time to calculate the inverse of the Hessian where k is the number of parameters. It’s not uncommon for Deep Neural Networks to have about a million parameters and since the parameters are updated every iteration, this inverse needs to be calculated at every iteration, which is not computationally feasible.
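A minimal sketch of a regularized Newton step on a toy quadratic; the example function, the value of α, and the use of a linear solve (rather than an explicit inverse) are illustrative choices:

```python
import numpy as np

def newton_step(theta, grad_fn, hess_fn, alpha=0.0):
    """Regularized Newton update: theta - (H + alpha*I)^(-1) g."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    H_reg = H + alpha * np.eye(len(theta))
    return theta - np.linalg.solve(H_reg, g)   # solve the system instead of inverting H

# Toy quadratic J(theta) = 0.5 * theta' A theta - b' theta: one step reaches the minimum.
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
theta = newton_step(np.zeros(2), lambda th: A @ th - b, lambda th: A)
```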
- Conjugate Gradients: One weakness of the method of steepest descent (i.e. GD with line search) is that the line searches happen along the direction of the gradient. Suppose the previous search direction is d(t-1). Once the search terminates at the minimum along that direction (which happens when the gradient along the current search direction vanishes), the next search direction d(t) is given by the gradient at that point, which is orthogonal to d(t-1) (if it were not orthogonal, it would have a component along d(t-1), which cannot be the case since the gradient along d(t-1) has vanished at the minimum).
Upon reaching the minimum along the current search direction, the minimum along the previous search direction is not preserved, undoing, in a sense, the progress made along the previous search direction.
In the method of conjugate gradients, we seek a search direction that is conjugate to the previous line search direction:
with d(t) and d(t-1) being conjugate if d(t)' H d(t-1) = 0. βt decides how much of d(t-1) is added back to the current search direction. There are two popular choices for βt: Fletcher-Reeves and Polak-Ribière. This discussion assumed the cost function to be quadratic, in which case the conjugate directions ensure that the gradient along the previous direction does not increase in magnitude. To extend the method to training neural networks, one additional change is needed: since the cost function is no longer quadratic, there is no guarantee that the conjugate direction preserves the minimum along the previous search directions. Thus, the algorithm includes occasional resets, where the method of conjugate gradients is restarted with a line search along the unaltered gradient. (A small sketch of the direction update follows.)
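A minimal sketch of the direction update in nonlinear conjugate gradients, using the Fletcher-Reeves choice of βt; the line search that finds the step size along each direction is assumed to be provided elsewhere:

```python
import numpy as np

def fletcher_reeves_beta(grad_new, grad_old):
    """beta_t = (g_t . g_t) / (g_{t-1} . g_{t-1})."""
    return np.dot(grad_new, grad_new) / np.dot(grad_old, grad_old)

def next_search_direction(grad_new, grad_old, d_old):
    """New direction: the negative gradient plus a beta-weighted fraction of the
    previous direction, keeping successive directions (approximately) conjugate."""
    beta = fletcher_reeves_beta(grad_new, grad_old)
    return -grad_new + beta * d_old
```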
- BFGS: This algorithm tries to bring the advantages of Newton’s method without the additional computational burden by approximating the inverse of H by M(t), which is iteratively refined using low-rank updates. Finally, line search is conducted along the direction M(t)g(t). However, BFGS requires storing the matrix M(t) which takes O(n²) memory making it infeasible. An approach called Limited Memory BFGS (L-BFGS) has been proposed to tackle this infeasibility by computing the matrix M(t) using the same method as BFGS but assuming that M(t−1) is the identity matrix.
4. Optimization Strategies and Meta-Algorithms
- Batch Normalization: Batch normalization (BN) is one of the most exciting innovations in deep learning; it has significantly stabilized the learning process and allowed faster convergence. The intuition behind batch normalization is as follows. Most deep learning networks are compositions of many layers (or functions), and the gradient with respect to one layer is taken considering the other layers to be constant. In practice, however, all the layers are updated simultaneously, and this can lead to unexpected results. For example, let y* = x W¹ W² … W¹⁰. Here, y* is a linear function of x but not a linear function of the weights. Suppose the gradient is given by g and we now intend to reduce y* by 0.1. Using a first-order Taylor series approximation, taking a step of ϵg would reduce y* by ϵ g' g. Thus, ϵ should be 0.1/(g' g) using first-order information alone. However, higher-order effects also creep in, as the updated y* is given by:
An example of a second-order term would be ϵ² g1 g2 ∏ wi, where ∏ wi (a product over the weights of the other layers) can be negligibly small or exponentially large depending on whether the individual weights are less than or greater than 1. Since the updates to one layer are so strongly dependent on the other layers, choosing an appropriate learning rate is hard. Batch normalization takes care of this problem by using an efficient reparameterization of almost any deep network. Given a matrix of activations H, the normalization is given by H' = (H - μ) / σ, where the subtraction and division are broadcast.
Going back to the earlier example of y*, let the activations of layer l-1 be given by h(l-1), so that h(l-1) = x W1 W2 … W(l-1). Now, if x is drawn from a unit Gaussian, then h(l-1) also comes from a Gaussian, though not one with zero mean and unit variance, as it is a linear transformation of x. BN makes it zero mean and unit variance. Then y* = Wl h(l-1), and learning becomes much simpler, as the parameters at the lower layers mostly do not have any effect. In this linear example that simplicity is achieved by rendering the lower layers useless; in a realistic deep network with non-linearities, however, the lower layers remain useful. Finally, the complete reparameterization of BN replaces H with γH' + β. This is done to retain the expressive power of the network, with the mean in the new parameterization being determined solely by β, which is much easier to learn. Also, given the choice of normalizing X or XW + B, the authors recommend the latter, specifically XW, since B becomes redundant because of β. Practically, this means that when we use a Batch Normalization layer, the biases should be turned off. In a deep learning framework like Keras, this can be done by setting the parameter use_bias=False in the Convolutional layer. (A small sketch of the normalization itself follows.)
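A minimal NumPy sketch of the batch-normalization reparameterization for a mini-batch of activations; the small constant eps (added for numerical stability) and its placement are illustrative assumptions:

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """Normalize each unit's activations over the mini-batch (rows of H),
    then rescale and shift: gamma * H' + beta."""
    mu = H.mean(axis=0)                 # per-unit mean over the batch
    sigma = H.std(axis=0)               # per-unit standard deviation over the batch
    H_norm = (H - mu) / (sigma + eps)   # broadcast subtraction and division
    return gamma * H_norm + beta
```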
- Coordinate Descent: Generally, a single weight update is made by taking the gradient with respect to every parameter. However, in cases where some of the parameters might be independent (discussed below) of the remaining, it might be more efficient to take the gradient with respect to those independent sets of parameters separately for making updates. Let me clarify that with an example. Suppose we have the following cost function:
This cost function describes the learning problem called sparse coding. Here, H refers to the sparse representation of X and W is the set of weights used to linearly decode H to retrieve X. An explanation of why this cost function enforces the learning of a sparse representation of X follows. The first term of the cost function penalizes values far from 0 (positive or negative, because of the modulus operator |H|), pushing most of the values towards 0 and thereby making H sparse. The second term penalizes the difference between X and the linear decoding of H through W, pushing the two to match, so that H is learned as a sparse "representation" of X. The cost function generally also contains a regularization term such as weight decay, which is omitted here for simplicity. Here, we can divide the entire list of parameters into two sets, W and H. Minimizing the cost function with respect to either of these sets of parameters is a convex problem. Coordinate descent (CD) refers to minimizing the cost function with respect to only one parameter at a time. It has been shown that by repeatedly cycling through all the parameters, we are guaranteed to arrive at a local minimum. If instead of a single parameter we take a set of parameters at a time, as we did above with W and H, it is called block coordinate descent (the interested reader should explore Alternating Minimization); a small sketch of this alternating scheme is given below. CD makes sense if either the parameters are clearly separable into independent groups, or if optimizing with respect to a certain set of parameters is more efficient than with respect to the others.
Coordinate descent may fail terribly when one variable influences the optimal value of another variable.
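A minimal sketch of block coordinate descent for the sparse-coding objective above, alternating between the codes H and the decoder W. The gradient-based inner updates, step sizes, and the soft-thresholding used for the |H| penalty are illustrative assumptions; each sub-problem could equally be handed to a dedicated convex solver:

```python
import numpy as np

def soft_threshold(A, t):
    """Shrinkage step corresponding to the lam * |H| sparsity penalty."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def block_coordinate_descent(X, n_codes, lam=0.1, lr=0.01, n_outer=50, n_inner=20):
    """Alternately minimize J(H, W) = lam * |H| + ||X - H W||^2,
    holding one block of parameters fixed while updating the other."""
    n, d = X.shape
    H = np.zeros((n, n_codes))
    W = np.random.randn(n_codes, d) * 0.01
    for _ in range(n_outer):
        # Block 1: update the sparse codes H with the decoder W held fixed.
        for _ in range(n_inner):
            grad_H = -2.0 * (X - H @ W) @ W.T
            H = soft_threshold(H - lr * grad_H, lr * lam)
        # Block 2: update the decoder W with the codes H held fixed.
        for _ in range(n_inner):
            grad_W = -2.0 * H.T @ (X - H @ W)
            W = W - lr * grad_W
    return H, W
```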
- Polyak Averaging: Polyak averaging consists of averaging several points in the parameter space that the optimization algorithm traverses through. So, if the algorithm encounters the points θ(1), θ(2), … during optimization, the output of Polyak averaging is:
The figure below explains the intuition behind Polyak averaging:
Most optimization problems in deep learning are non-convex, and the path taken by the optimization algorithm can be quite complicated, so a point visited in the distant past may be quite far from the current point in parameter space. Including such a point in the average might therefore not be useful, which is why an exponentially decaying running average is used instead. This scheme, where recent iterates are weighted more than older ones, is called Polyak-Ruppert averaging (a small sketch of both variants follows):
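A minimal sketch of both variants; the decay rate α is an illustrative choice:

```python
import numpy as np

def polyak_average(thetas):
    """Plain Polyak averaging: the mean of all iterates visited so far."""
    return np.mean(thetas, axis=0)

def polyak_ruppert_update(theta_avg, theta_t, alpha=0.99):
    """Exponentially decaying running average: recent iterates are weighted more."""
    return alpha * theta_avg + (1.0 - alpha) * theta_t
```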
- Supervised Pre-training: Sometimes it’s hard to directly train to solve for a specific task. Instead it might be better to train for solving a simpler task and use that as an initialization point for training to solve the more challenging task. As an intuition for why this seems logical, consider that you didn’t have any background in integration and were asked to learn how to compute the following integral:
If you’re anyone close to a normal person, your first reaction would be:
However, wouldn’t it be better if you were asked to understand the more basic integrations first:
I hope you understand what I meant with this example: learning a simpler task puts you in a better position to tackle the more complex one. This particular strategy of training to solve a simpler task before facing the herculean one is called pretraining. A particular type of pretraining, called greedy supervised pretraining, breaks a given supervised learning problem into simpler supervised learning problems and solves for the optimal version of each component in isolation. Building on the above intuition, the hypothesis for why this works is that it gives better guidance to the intermediate layers of the network and helps in both generalization and optimization. More often than not, the greedy pretraining is followed by a fine-tuning stage where all the parts are jointly optimized to search for the optimal solution to the full problem. As an example, the figure below shows how each hidden layer is trained one at a time, where the input to the hidden layer being learned is the output of the previously trained hidden layer.
Also, FitNets shows an alternative way to guide the training process. Deep networks are hard to train mainly because the deeper the model gets, the more non-linearities are introduced. The authors propose the use of a shallower and wider teacher network that is trained first. Then a second network, which is thinner and deeper and called the student network, is trained to predict not only the final outputs but also the outputs of the intermediate layers of the teacher network. For those who might not be clear on what deep, shallow, wide and thin mean here, refer to the following diagram:
The idea is that predicting the intermediate layers of the teacher network provides some hints as to how the layers of the student network should be used and aids the optimization procedure. It was shown that without the hints to the hidden layers, the student network performs poorly on both the training and test data.
- Designing Models to Aid Optimization: Most of the work in deep learning has gone towards making the models easier to optimize rather than designing a more powerful optimization algorithm. This is evident from the fact that stochastic gradient descent, which is primarily used for training deep models today, has been in use since the 1960s. Many of the current design choices lean towards using linear transformations between layers and activation functions like ReLU [max(0, x)], which are linear for the most part and enjoy large gradients, compared to sigmoidal units which saturate easily. Also, linear functions increase consistently in a particular direction, so if there is an error, there is a clear direction in which the output should move to reduce it.
Residual connections reduce the length of the shortest path from the output to the lower layers, thereby allowing a larger gradient to flow through and hence tackling the vanishing gradient problem. Similarly, GoogLeNet attached multiple auxiliary copies of the output head to intermediate layers so that a larger gradient flows to those layers; once training is complete, the auxiliary heads are removed.
- Continuation Methods and Curriculum Learning: Based on the explanation given in the book (and given that it is fairly short), I would recommend the interested reader to refer to the book directly; the content there is already to the point, with little scope for further summarizing, so repeating it here would be redundant. To give a very brief idea, continuation methods are a family of strategies in which, instead of a single cost function, a series of cost functions is optimized, so that the algorithm spends most of its time in well-behaved regions of the parameter space. The series is designed so that the cost functions become progressively harder to solve: the first cost function is the easiest and the final one (the cost function we actually want to minimize) is the hardest, with the solution to each easier cost function serving as a good initialization point for the next, harder-to-optimize cost function. Curriculum learning can be interpreted as a type of continuation method where the learning process is planned such that simpler concepts are learned first and the harder ones, which depend on those simpler concepts, are learned later. This idea has been shown to accelerate progress in animal training. One simple way to use it in the context of machine learning is to initially show the algorithm more of the easier images and fewer of the harder images; as learning progresses, the proportion of easier images is reduced and that of the harder images is increased so that the model can ultimately learn the harder task.
This concludes the second part of our summary for Chapter 8. This has been one of the most math-intensive chapters I have ever read, and if you've made it this far, I have two things for ya: thank you for taking the time out, and congratulations, as hopefully you now have much better clarity on what goes on under the hood. Since Medium doesn't support math symbols yet, a few of the mathematical notations were not proper; if you're looking for a version with better notation, feel free to have a look at our repository here, which contains Jupyter Notebooks with the same content but with the symbols rendered in LaTeX. Our next post will be about Chapter 11: Practical Methodology, which focuses on practical tips for making your Deep Learning model work. To stay updated with our posts, follow our publication, Inveterate Learner. Finally, I thank my co-author, Ameya Godbole, for thoroughly reviewing the post and suggesting many important changes. A special thanks to Raghav Somani for providing an in-depth review from a theoretical perspective that played an important role in the final shaping of this post.
Patience is a virtue and I’m learning patience. It’s a tough lesson.
- Elon Musk