Deep Learning Book: Chapter 8 — Optimization for Training Deep Models (Part I)

Aman Dalmia · Inveterate Learner · Jun 16, 2018

This is part of a series of blog posts on the Deep Learning book in which we attempt to summarize each chapter, highlighting the concepts we found most important, so that others can use it as a starting point for reading the chapters, while adding further explanations for the areas we found difficult to grasp. Please refer to this for more clarity on notation.

There are many types of optimization problems involved in deep learning, with the toughest one being that of training a neural network. The main theme of the chapter is to focus on one type of optimization — finding the parameters θ that reduce a cost function J(θ).

Since this is a very math-intensive chapter, explanations are required in a lot of places to make the concepts crystal clear. To avoid making this post too long, we are presenting the summary in two parts. We do not want to compromise on the main goal, which is to make Deep Learning accessible to the largest audience possible. The sections of the chapter presented in this part are listed below:

1. How Learning Differs from Pure Optimization

  • In Machine Learning (ML), we care about a certain performance measure P (e.g. accuracy) defined w.r.t. the test set, and we optimize J(θ) (e.g. the cross-entropy loss) with the hope that doing so improves P as well. In pure optimization, optimizing J(θ) is itself the final goal.
  • The expected generalization error (risk) is taken over the true data-generating distribution p_data. If we actually had access to p_data, risk minimization would be a pure optimization problem:

J*(θ) = E_{(x, y) ∼ p_data} [ L(f(x; θ), y) ]

Notice how the expectation is taken over the true data-generating distribution.

When we don’t have p_data but only a finite training set, we have an ML problem. The latter can be converted back into an optimization problem by replacing p_data with the empirical distribution p̂_data obtained from the training set, and minimizing the empirical risk instead. This is called empirical risk minimization (ERM):

E_{(x, y) ∼ p̂_data} [ L(f(x; θ), y) ] = (1/m) Σ_{i=1}^{m} L(f(x⁽ⁱ⁾; θ), y⁽ⁱ⁾)

Notice the change in the distribution over which the expectation is taken.

Although this might look quite similar to optimization, there are two main problems. First, ERM is prone to overfitting: high-capacity models (models with the ability to learn extremely complex functions) can simply memorize the training set. Second, ERM might not be feasible: most optimization algorithms are based on Gradient Descent (GD), which requires computing a derivative, and hence may not work with certain loss functions like the 0–1 loss (as it is not differentiable).

  • It is for the reasons mentioned above that a surrogate loss function (SLF) is used instead, acting as a proxy. For example, the negative log-likelihood of the true class is used as a surrogate for the 0–1 loss. The code snippet below should help you understand why the 0–1 loss won’t work with Gradient Descent while cross-entropy, being a smooth function, will.
It can be seen that the 0–1 loss is a non-differentiable function and hence, not compatible with gradient-based algorithms like Gradient Descent. Cross-entropy is a smooth approximation of the 0–1 loss.
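Since the original snippet was an embedded gist, here is a minimal sketch of the same idea (using numpy and a single sigmoid output; the values are only illustrative): the 0–1 loss is flat almost everywhere, while cross-entropy keeps supplying a gradient signal.

import numpy as np

# A single example with true class y = 1 and a range of raw scores z (logits).
y = 1.0
z = np.linspace(-5, 5, 11)
p = 1.0 / (1.0 + np.exp(-z))        # sigmoid prediction

# 0-1 loss: a step function of the prediction -> piecewise constant,
# so its derivative w.r.t. z is 0 almost everywhere (and undefined at the jump).
zero_one_loss = (np.round(p) != y).astype(float)

# Cross-entropy (negative log-likelihood of the true class): smooth in z,
# with gradient dL/dz = p - y, which is non-zero whenever p != y.
cross_entropy = -(y * np.log(p) + (1 - y) * np.log(1 - p))
grad_ce = p - y

for zi, l01, ce, g in zip(z, zero_one_loss, cross_entropy, grad_ce):
    print(f"z={zi:+.1f}  0-1 loss={l01:.0f}  CE={ce:.3f}  dCE/dz={g:+.3f}")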

Using an SLF might even turn out to be beneficial, as the test error can keep improving: the classifier keeps pushing the classes further apart and becomes more reliable. To see this, suppose we were using a 0–1 loss with a threshold of, say, 0.5 to assign each class. If the true class is 1, our model has no motivation to push the prediction score closer to 1 once it is above 0.5. With cross-entropy, however, since we use the raw prediction scores, the model keeps trying to push the prediction closer to the true class.

  • Another common difference is that training might be halted following a convergence criterion based on Early Stopping to prevent overfitting, even when the derivative of the surrogate loss function is still large. This is different from pure optimization, which is halted only when the derivative becomes very small. If you’re not familiar with Early Stopping, we’d recommend looking at our previous post, where we discuss Early Stopping and other regularization techniques.
  • In ML optimization algorithms, the objective function decomposes as a sum over the examples, and we can perform updates by randomly sampling a batch of examples and taking the average over the examples in that batch. If we consider n random variables, each having true mean μ and standard deviation σ, the Standard Error (S.E.) of the mean estimated from those n random variables (see Appendix for the derivation) is SE(μ̂_n) = σ / √n.

This indicates that as we include more examples when making an update, the return from the additional examples in reducing the error is less than linear. Thus, if we use 100 and 10,000 examples respectively to make an update, the latter takes 100 times more compute but reduces the standard error only by a factor of 10. It is therefore better to compute rapid approximate updates rather than slow exact ones, as the quick numerical check below illustrates.
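As a quick sanity check of the √n behaviour, the following sketch (using numpy; the numbers are only illustrative) estimates the standard error of the sample mean for batches of 100 and 10,000 examples:

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                  # true standard deviation of each sample

for n in (100, 10_000):
    # Empirical S.E.: std of the sample mean across many independent batches of size n.
    means = rng.normal(0.0, sigma, size=(5_000, n)).mean(axis=1)
    print(f"n={n:>6}  empirical SE={means.std():.4f}  sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")

# 100x more samples per batch -> only ~10x smaller standard error.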

  • There are 3 types of sampling-based gradient algorithms: batch gradient descent (BGD), where the entire training set is used to make a single update; stochastic gradient descent (SGD), where a single example is used to make a weight update; and mini-batch gradient descent (MBGD), where a batch (not to be confused with BGD) of examples is randomly sampled from the entire training set and used to make an update. Mini-batch GD is nowadays commonly referred to as stochastic GD.
Taken from Andrew Ng’s Machine Learning course. Here, as mentioned above, mini-batch gradient descent has been presented as stochastic gradient descent.

It is common practice to use batch sizes that are powers of 2, which offer better run-time on certain hardware. Small batches tend to have a regularizing effect because of the noise they inject, since each update is made after seeing only a very small portion of the entire training set, i.e. a batch of samples.

  • The mini-batches should be selected randomly. It is sufficient to shuffle the dataset once and then iterate over it multiple times (a minimal sketch follows below). In the first epoch, the network sees each example for the first time, and hence the estimate of the gradient is an unbiased estimate of the gradient of the true generalization error. From the second epoch onward, however, the estimate becomes biased, as it is re-sampling from data it has already seen. Such a sampling scheme is called Random Reshuffling, and although its analysis, even for strongly convex problems such as generalized linear models, remains an open problem to date, reasonable efforts have been made to show that this biased estimate of the gradient is decent enough (at least as good as i.i.d.-sampled SGD).
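Below is a minimal sketch of the three variants on a toy linear regression problem (using numpy; the data, learning rate and epoch count are only illustrative). It shuffles the dataset once and then iterates over it, and the same loop covers BGD, SGD and mini-batch GD simply by changing the batch size:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)

def minibatch_sgd(batch_size, lr=0.02, epochs=20):
    w = np.zeros(5)
    idx = rng.permutation(len(X))              # shuffle the dataset once
    for _ in range(epochs):                    # then iterate over it multiple times
        for start in range(0, len(X), batch_size):
            b = idx[start:start + batch_size]
            grad = 2 * X[b].T @ (X[b] @ w - y[b]) / len(b)   # MSE gradient on the batch
            w -= lr * grad
    return np.mean((X @ w - y) ** 2)

# batch_size = len(X) -> BGD, batch_size = 1 -> SGD, anything in between -> mini-batch GD
for bs in (len(X), 1, 32):
    print(f"batch_size={bs:>4}  training MSE after 20 epochs = {minibatch_sgd(bs):.4f}")

With the same number of epochs, the full-batch version has made only 20 (exact) updates, while the smaller batches have made many more (approximate) ones, which is exactly the trade-off discussed above.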

2. Challenges in Neural Network Optimization

The optimization problem for training neural networks is generally non-convex. Some of the challenges faced are mentioned below:

  • Ill-conditioning of the Hessian Matrix: The Hessian matrix and the condition number have been covered in our summary of Chapter 4. For the sake of completeness, the Hessian matrix H of a function f with a vector-valued input x is the matrix of second derivatives, H_{i,j} = ∂²f / ∂x_i ∂x_j:
Source: https://en.wikipedia.org/wiki/Hessian_matrix

Ill-conditioning of H (explained below) can cause SGD to get stuck, in the sense that even very small steps may increase the cost function. A more rigorous explanation of this, contributed by Raghav Somani, is given next.

Assuming local convexity, another way of looking at an ill-conditioned Hessian is through its eigenvalues. The condition number of the Hessian is high if the largest positive eigenvalue of the local Hessian is very large, or if its smallest eigenvalue is very close to zero. It can be shown that the optimal step size is less than the inverse of the local smoothness constant of the function (the smoothness constant is, informally, proportional to the largest eigenvalue). So in the former case, if the largest eigenvalue of the Hessian is very large, the required step size is extremely small, leading to slow convergence. In the latter case, if the smallest eigenvalue is extremely small, we again get slow convergence due to the lack of sufficient convexity.

If local convexity is not assumed, it is difficult to tell, using first-order optimization methods alone, whether we are converging to a saddle point or to a local minimum.

In Chapter 4, it was shown that taking a gradient step of −ϵg adds the following term to the cost function (derived using the second-order Taylor series expansion):

(1/2) ϵ² gᵀHg − ϵ gᵀg

Ill-conditioning is said to occur when the first term exceeds the second, as the cost then increases with the step. In many cases the gradient is large, giving a large squared gradient norm gᵀg; however, gᵀHg can be even larger. This leads to slower learning, because we need to reduce the learning rate to make the first term smaller than the second. To spell this out: the first term contains ϵ², and since ϵ < 1 we have ϵ² < ϵ; but when gᵀHg > gᵀg, the only way to keep the first term below the second is to reduce the learning rate further, which slows learning. Thus, although ideally the gradient norm should decrease during training (since our primary aim is to reach a minimum where the gradient is 0), training can still be successful even with the gradient norm increasing.
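To make the two Taylor terms concrete, here is a small numerical check (using numpy; the quadratic cost and step sizes are only illustrative) on an ill-conditioned problem, showing how far ϵ must shrink before a gradient step actually decreases the cost:

import numpy as np

# Ill-conditioned quadratic cost J(x) = 0.5 * x^T H x with condition number 1000.
H = np.diag([1000.0, 1.0])
x = np.array([1.0, 1.0])
g = H @ x                                    # gradient of J at x

for eps in (1e-1, 1e-2, 1e-3, 1e-4):
    first = 0.5 * eps**2 * g @ H @ g         # curvature term added by the step
    second = eps * g @ g                     # decrease predicted by the gradient alone
    verdict = "cost increases" if first > second else "cost decreases"
    print(f"eps={eps:.0e}  0.5*eps^2*g'Hg={first:12.3f}  eps*g'g={second:10.3f}  {verdict}")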

  • Local minima: Nearly any Deep Learning (DL) model is guaranteed to have an extremely large number of local minima (LM), arising from the model identifiability problem.
There might be many local minima where the model may get stuck due to zero gradient. Source: http://www.yaldex.com/game-development/1592730043_ch18lev1sec4.html

A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model parameters. In the case of neural networks, we can obtain equivalent models by swapping the positions of the neurons; thus, they are not identifiable.

Swapping the two hidden nodes leads to equivalent models. Thus, even after having a sufficiently large training set, there is not a unique setting of parameters. This is the model identifiability problem that neural networks suffer from.

However, all the local minima caused by this symmetry have the same value of the cost function, so they are not a problem. Local minima with high cost, if they were common, would be a serious problem, as shown above. Many points other than local minima can also lead to low gradients. Nowadays, it is common to aim for a low, but not necessarily minimal, cost value. The small sketch below demonstrates the weight-swapping symmetry.
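A tiny sketch of the weight-space symmetry (using numpy; the network sizes and values are only illustrative): swapping the two hidden units, together with the corresponding rows and columns of the weight matrices, leaves the output unchanged:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)                     # an arbitrary input
W1 = rng.normal(size=(2, 3))               # input -> 2 hidden units
W2 = rng.normal(size=(1, 2))               # hidden -> output
relu = lambda a: np.maximum(a, 0)

def forward(W1, W2):
    return W2 @ relu(W1 @ x)

# Swap the two hidden units: permute the rows of W1 and the columns of W2.
perm = [1, 0]
W1_swapped, W2_swapped = W1[perm, :], W2[:, perm]

print(forward(W1, W2), forward(W1_swapped, W2_swapped))   # identical outputs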

  • Plateaus, Saddle Points and Other Flat Regions: A saddle point (SP) is another type of point with zero gradient, where some neighbouring points have a higher value and others a lower one. Intuitively, a saddle point acts as a local minimum along some directions and a local maximum along others. Thus, the Hessian at an SP has both positive and negative eigenvalues (a very good explanation of this can be found here. TL;DR: for a function to curve upwards or downwards around a point, as in the case of a local minimum or a local maximum, the eigenvalues must all have the same sign, positive for a local minimum and negative for a local maximum).
It’s called a saddle point as it looks like the saddle of a horse. Source: deeplearning.ai. I know Andrew Ng is awesome ;)

For many classes of random functions, saddle points become more common in high dimensions, with the ratio of the number of SPs to LMs growing exponentially with n for an n-dimensional space. Many random functions also have the amazing property that near points with low cost, the Hessian tends to have mostly positive eigenvalues. Empirically, SGD tends to rapidly escape high-cost saddle points; for this to happen, it needs sufficiently high stochastic variance along all directions. Stochastic Gradient Langevin Dynamics is a family of SGD-based algorithms that add white noise to the updates to escape local minima and saddle points with high probability under mild conditions. There might also be wide, flat regions of constant value, which have zero gradient; these can be problematic if the cost is high in those regions.

It is problematic to get stuck in a plateau where the value of the cost function is high. Source: https://en.wikipedia.org/wiki/Local_search_(constraint_satisfaction)
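As a concrete illustration of the eigenvalue picture at a saddle point described above, consider f(x, y) = x² − y², whose only critical point is the origin (the values below are easy to verify by hand):

import numpy as np

# f(x, y) = x^2 - y^2 has zero gradient at (0, 0), but the origin is neither a
# minimum nor a maximum: the Hessian has eigenvalues of both signs there.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])           # Hessian of f (constant everywhere)
print(np.linalg.eigvalsh(H))          # [-2.  2.] -> mixed signs => saddle point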
  • Cliffs and Exploding Gradients: Neural Networks (NNs) can sometimes have extremely steep regions resembling cliffs, due to the repeated multiplication of weights. Suppose we use a 3-layer (input-hidden-output) neural network with all activation functions linear, with the same number of input, hidden and output neurons, and (for simplicity) the same weight matrix W in each layer. The output is y = W*h, where h = W*x is the hidden layer, giving y = W*W*x. So deep neural networks involve the multiplication of a large number of parameters, leading to sharp non-linearities in the parameter space. These non-linearities give rise to very high gradients in some places, and at the edge of such a cliff, an update step can throw the parameters extremely far.
Image depicting the problem of exploding gradients when approaching a cliff. 1) Usual training going on with the parameters moving towards the lower-cost region. 2) The gradient at the bottom left-most point pointed downwards (the correct direction), but the step size was too large, which caused the parameters to land at a point with a large cost value. 3) The gradient at this new point moved the parameters to a completely different position, undoing most of the training done until that point.

This can be handled using Gradient Clipping (GC). The gradient indicates only the direction in which to make the update; if the GD update proposes a very large step, GC intervenes to reduce the step size, as sketched below.
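A minimal sketch of gradient clipping by norm (using numpy; the threshold is arbitrary here, and deep learning frameworks provide built-in equivalents):

import numpy as np

def clip_gradient(grad, max_norm=1.0):
    # Rescale grad so its L2 norm does not exceed max_norm; the direction is preserved.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])             # a "cliff" gradient with norm 50
print(clip_gradient(g, max_norm=5.0))   # [ 3. -4.] -> same direction, norm 5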

  • Long-Term Dependencies: This problem is encountered when the NN becomes sufficiently deep. For example, if the same weight matrix W is used in each layer, after t steps we effectively multiply by W t times, i.e. by Wᵗ. Using the eigendecomposition of W (here V is an orthonormal matrix, i.e. V Vᵀ = I):

W = V diag(λ) Vᵀ   ⟹   Wᵗ = V diag(λ)ᵗ Vᵀ

Thus, any eigenvalue whose absolute value is not close to one will either explode (|λ| > 1) or vanish (|λ| < 1) when raised to the power t, leading to the Vanishing and Exploding Gradient problem. The use of the same weight matrix at every step is exactly what happens in Recurrent NNs (RNNs), where this is a serious problem.

Values slightly above or below 1 explode or vanish, respectively, upon being compounded. You might have seen this poster in a separate context.
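A small numerical check of the compounding effect (using numpy; the eigenvalues 1.1 and 0.9 are only illustrative): raising a symmetric W to a high power via its eigendecomposition shows one component exploding and the other vanishing:

import numpy as np

rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(2, 2)))     # a random orthonormal matrix
lam = np.array([1.1, 0.9])
W = V @ np.diag(lam) @ V.T                       # symmetric W with eigenvalues 1.1 and 0.9

for t in (1, 10, 100):
    Wt = np.linalg.matrix_power(W, t)
    print(f"t={t:>3}  eigenvalues of W^t:", np.round(np.linalg.eigvalsh(Wt), 5))

# 1.1**100 is about 13781 (explodes) while 0.9**100 is about 0.00003 (vanishes).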
  • Inexact Gradients: Most optimization algorithms use a noisy or biased estimate of the gradient, either because the estimate is based on sampling or because the true gradient is intractable. For example, when training a Restricted Boltzmann Machine (RBM), an approximation of the gradient is used: the contrastive divergence algorithm provides a technique for approximating the gradient of its intractable log-likelihood.
  • Neural Network training might not end up at a critical point at all, and such critical points might not even exist. Many of these problems could be avoided if there were a region connected reasonably directly to the solution by a path that local descent can follow, and if we could initialize learning within that well-behaved region. Thus, choosing good initial points is an important direction of study.

3. Basic Algorithms

  • Stochastic Gradient Descent: This has already been described above, but there are certain things to keep in mind regarding SGD. The learning rate ϵ is a very important hyperparameter for SGD, and it should in general be reduced over the course of training (e.g. after each epoch). This is because the random sampling of batches acts as a source of noise, which can make SGD keep oscillating around the minimum without actually reaching it. This is shown below:
Source: https://goo.gl/tq6Xof

The true gradient of the total cost function (computed over the entire dataset) actually becomes 0 when we reach the minimum; hence, BGD can use a fixed learning rate. For SGD, the following conditions on the learning rate schedule guarantee convergence under convexity assumptions:

Σ_{k=1}^{∞} ϵ_k = ∞   and   Σ_{k=1}^{∞} ϵ_k² < ∞

In practice, a common choice is to decay ϵ linearly until some iteration τ and keep it constant afterwards (a sketch of such a schedule appears at the end of this bullet).

Setting ϵ too low makes training proceed slowly and may leave the algorithm stuck at a high cost value. Setting it too high leads to large oscillations, which can even push the parameters outside the optimal region. A good heuristic is to monitor the first several iterations and set the learning rate somewhat higher than the best-performing one, but not so high as to cause instability.

A big advantage of SGD is that the time taken to compute a weight update does not grow with the number of training examples, as each update is computed after observing a batch of samples whose size is independent of the total training set size. Theoretically, for a convex problem, BGD reduces the excess error to O(1/k) after k iterations, whereas SGD reduces it to O(1/√k). However, SGD compensates for this with its ability to make many rapid updates when the training set is large, so that in practice it makes good progress after relatively few passes over the data.
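Here is a sketch of the linear decay schedule mentioned above (the particular values of ϵ₀, ϵ_τ and τ are only illustrative):

def learning_rate(k, eps0=0.1, eps_tau=0.001, tau=100):
    # Decay linearly from eps0 to eps_tau over the first tau iterations, then stay constant.
    if k >= tau:
        return eps_tau
    alpha = k / tau
    return (1 - alpha) * eps0 + alpha * eps_tau

for k in (0, 50, 100, 1000):
    print(f"iteration {k:>4}: learning rate = {learning_rate(k):.4f}")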

  • Momentum: The momentum algorithm accumulates an exponentially decaying moving average of past gradients (called the velocity) and uses it as the direction in which to take the next step. Momentum is physically mass times velocity, which equals the velocity if we assume unit mass. The momentum weight update is given by:

v ← α v − ϵ ∇_θ J(θ)
θ ← θ + v

The step size (earlier equal to the learning rate times the gradient) now depends on how large and how aligned the sequence of gradients is. If the gradient at each iteration points in the same direction (say g), the step size keeps growing as the gradients accumulate. Once a constant (terminal) velocity is reached, the step size becomes ϵ‖g‖ / (1 − α). Thus, using α = 0.9 corresponds to multiplying the maximum speed by 10 relative to plain gradient descent. Common values of α are 0.5, 0.9 and 0.99. (A small numerical sketch is given at the end of this bullet.)

Viewing it as the Newtonian dynamics of a particle sliding down a hill, the momentum algorithm consists of solving a set of differential equations via numerical simulation. There are two kinds of forces involved as shown below:

Momentum can be seen as two forces operating together. 1) A force proportional to the negative of the gradient, so that whenever the particle descends a steep part of the surface, it gathers speed and continues sliding in that direction until it goes uphill again. 2) A viscous drag force (friction) proportional to −v(t), without which the particle would keep oscillating back and forth, as the negative of the gradient would keep forcing it downhill. A viscous force is suitable because it is weak enough to allow the gradient to cause motion, yet strong enough to resist motion that the gradient does not justify.

Read more about momentum in this excellent Distill article: Why Momentum Really Works.
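Below is a minimal numpy sketch of the momentum update on an ill-conditioned quadratic (the learning rate, α and iteration count are only illustrative), compared against plain gradient descent:

import numpy as np

H = np.diag([100.0, 1.0])                      # quadratic cost J(x) = 0.5 * x^T H x
x_plain = x_mom = np.array([1.0, 1.0])
v = np.zeros(2)
eps, alpha = 0.01, 0.9

for _ in range(200):
    x_plain = x_plain - eps * (H @ x_plain)    # plain gradient descent
    v = alpha * v - eps * (H @ x_mom)          # momentum: accumulate the velocity...
    x_mom = x_mom + v                          # ...and step along it

print("plain GD cost:", 0.5 * x_plain @ H @ x_plain)
print("momentum cost:", 0.5 * x_mom @ H @ x_mom)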

  • Nesterov Momentum: This is a slight modification of the usual momentum update. Here, the gradient is calculated after applying the current velocity to the parameters, which can be viewed as adding a correction factor:

v ← α v − ϵ ∇_θ J(θ + α v)
θ ← θ + v

The intuition behind Nesterov momentum is that when we are at a point θ in parameter space, the momentum update is going to shift the point by approximately αv, so we will soon end up in the vicinity of (θ + αv). Thus, it might be better to compute the gradient from that point instead. The figure below describes this visually:

Source: http://cs231n.github.io/neural-networks-3/#sgd
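The Nesterov variant changes only where the gradient is evaluated; here is a sketch using the same illustrative quadratic as in the momentum example above:

import numpy as np

H = np.diag([100.0, 1.0])                      # quadratic cost J(x) = 0.5 * x^T H x
grad = lambda x: H @ x
x = np.array([1.0, 1.0])
v = np.zeros(2)
eps, alpha = 0.01, 0.9

for _ in range(200):
    lookahead = x + alpha * v                  # where the velocity is about to take us
    v = alpha * v - eps * grad(lookahead)      # evaluate the gradient there (the correction)
    x = x + v

print("Nesterov momentum cost:", 0.5 * x @ H @ x)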

This concludes the first part of our summary for Chapter 8. We hope that this gave you good intuition and understanding of the fundamental concepts behind optimization for Deep Learning. Part II of the summary will cover more advanced concepts, such as algorithms with adaptive learning rates and batch normalization, which are also fundamental to making Deep Learning practically applicable to real-world problems. To stay updated with our posts, follow our publication, Inveterate Learner. If you’re looking for the summaries of previous chapters or are more comfortable reading from a Jupyter Notebook, feel free to have a look at our repository here. Finally, I thank my co-author for the publication, Ameya Godbole, for reviewing the post and suggesting many important areas for further explanation, along with several nitpicks. A special thanks to Raghav Somani for providing deeper mathematical insight and clarity in a lot of places, which gave me (and, I hope, everyone else too) a better understanding of the concepts.

When something is important enough, you do it even if the odds are not in your favor.

- Elon Musk

If there’s anything that you might want to share with me or give any sort of feedback on my writing/thoughts, I would love to hear it from you. Feel free to connect with me on LinkedIn, Facebook or follow me on Github.
