Finding the Minimum: Escaping Saddle Points

Roan Gylberth
Konvergen.AI
Nov 18, 2018

When I did my weekly Arxiv chores (via arxiv-sanity.com, which is a great site! Thanks, Andrej Karpathy), I stumbled across this paper by Du et al., and I was intrigued because I had never asked the question “can gradient descent find the global minimum in deep neural networks?” while training them. This is partly because, in large networks, most local minima are good enough, and struggling to find the global minimum may even lead to over-fitting (Choromanska et al.).

If we read the paper (or even just the title), we can see their answer to the question. This post will not discuss that claim, but rather the studies that have been done on deep neural network optimization. I was fascinated by how much recent work has gone into this problem. So, the next several posts will concentrate on the first two paragraphs of Section 1.2 of the paper and several of the earlier works mentioned there.

We have discussed in previous posts that the loss function of deep neural networks that we want to optimize is non-convex, so it is harder for an algorithm to reach the global minimum than it is for a (strictly) convex function. This is because for a convex function, finding a first-order stationary point is equivalent to finding the global minimum, and this is not the case for a non-convex function. A first-order stationary point of a non-convex function could be a local or global minimum, a local or global maximum, or a saddle point.
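
To make this concrete, here is a tiny sketch (my own toy example, not taken from any of the papers) using f(x, y) = x² - y². The gradient vanishes at the origin, so it is a first-order stationary point, but the Hessian has a negative eigenvalue there, so the origin is a saddle point rather than a minimum:

```python
import numpy as np

# Toy non-convex function f(x, y) = x^2 - y^2 (illustrative example only).
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

def hessian(p):
    return np.array([[2.0, 0.0],
                     [0.0, -2.0]])

p = np.array([0.0, 0.0])
print(np.linalg.norm(grad(p)))         # 0.0 -> first-order stationary point
print(np.linalg.eigvalsh(hessian(p)))  # [-2.  2.] -> a negative eigenvalue: saddle
```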

This raises interest in establishing global convergence guarantees for optimization algorithms when training deep neural networks. As discussed in the paper, one sensible approach is

first to develop a general theory for a class of non-convex problems which satisfy desired geometric properties and then identify that the neural network optimization belongs to this class.

Basically, this approach starts by defining a set of geometric properties and then determining whether neural network optimization satisfies those properties. One example of such a function class is one where:

  1. All local minima of the function are global minima.
  2. Every saddle point has a direction of negative curvature.

These rules ensure that there is always a way down toward the global minimum if you are not there yet. One example of a problem that satisfies this function class is Orthogonal Tensor Decomposition (Ge et al., 2015). The property “every saddle point has a direction of negative curvature” is one of the defining properties of strict saddle functions, introduced by Ge et al., who also showed that noisy gradient descent can find a local minimum of a strict saddle function. For more information and nice visualizations, see Ge’s paper or blog.
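
As a minimal sketch of that idea (my own toy code with arbitrary step size and noise level, not Ge et al.’s exact algorithm), consider noisy gradient descent on the same f(x, y) = x² - y² as above. Plain gradient descent started exactly at the saddle (0, 0) never moves, because the gradient there is zero, but adding a little noise to each step lets the iterate drift onto the negative-curvature direction and escape:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(p):  # gradient of the toy f(x, y) = x^2 - y^2
    return np.array([2 * p[0], -2 * p[1]])

def noisy_gd(p, lr=0.1, sigma=0.01, steps=100):
    # Gradient step plus isotropic Gaussian noise, in the spirit of
    # Ge et al.'s noisy gradient descent; the constants are arbitrary.
    for _ in range(steps):
        p = p - lr * (grad(p) + sigma * rng.standard_normal(p.shape))
    return p

start = np.array([0.0, 0.0])  # exactly at the saddle; plain GD would stay here
print(noisy_gd(start))        # |y| grows: the iterate escapes along the -y curvature
```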

Illustration of Strict and Non-Strict Saddle Point (Chi Jin & Michael Jordan) http://www.offconvex.org/2017/07/19/saddle-efficiency/

Now, do we really need perturbation to make sure that we reach a local minimum? Lee et al. (2016) show in their paper that, with random initialization, gradient descent almost surely avoids saddle points and eventually converges to a local minimum. However, Du et al. (2017) show that gradient descent without perturbation can take exponential time to escape saddle points. This is undesirable in the landscape of deep neural networks, which has many saddle points, because it can significantly slow down training.

So, perturbation such as noisy gradient descent is needed to escape saddle points in polynomial time. However, this method requires at least Ω(d⁴) iterations, where d is the number of dimensions. This is relatively suboptimal compared to the rates of convergence to first-order stationary points, which do not depend on the number of dimensions. This can be problematic because even though we can efficiently find a first-order stationary point, if that point turns out to be a saddle point, it will hinder performance greatly. So can we find a way to reach second-order stationary points as efficiently as first-order stationary points?

To answer this, Jin et al. (2017) modify gradient descent with phasic perturbation, achieving almost the same convergence rate to second-order stationary points as the convergence rate of gradient descent to first-order stationary points. Do note that this perturbation does not remove the dimension dependence completely, but reduces it to a negligible log⁴(d) factor. You can read more about this in their blog post.
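
The rough shape of the idea, in a heavily simplified sketch (the gradient threshold, perturbation radius, and sphere sampling below are my own assumptions, not the constants or ball sampling from Jin et al.), is to run ordinary gradient descent and, whenever the gradient becomes small enough that we might be sitting at a saddle, jump to a random nearby point and continue:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturbed_gd(grad, p, lr=0.1, g_thresh=1e-3, radius=0.05, steps=100):
    # Sketch of perturbed gradient descent in the spirit of Jin et al. (2017):
    # threshold, radius, and sampling here are simplifications, not the paper's.
    for _ in range(steps):
        g = grad(p)
        if np.linalg.norm(g) < g_thresh:
            u = rng.standard_normal(p.shape)
            p = p + radius * u / np.linalg.norm(u)  # random jump of fixed radius
        else:
            p = p - lr * g                          # ordinary gradient step
    return p

# On the toy saddle f(x, y) = x^2 - y^2 from above, this perturbs once at the
# origin and then follows the negative-curvature direction away from it.
print(perturbed_gd(lambda p: np.array([2 * p[0], -2 * p[1]]),
                   np.array([0.0, 0.0])))
```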

With these results, we know that within the mentioned function class, gradient descent can reach the global minimum. But we still need to determine whether the neural network optimization problem belongs to this class. In the next post, we will explore various studies that try to understand the optimization landscape of different neural networks with various activation functions.

References:

  1. S. S. Du, J. D. Lee, H. Li, L. Wang, X. Zhai, “Gradient Descent Finds Global Minima of Deep Neural Networks” (2018)
  2. A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, Y. LeCun, “The Loss Surfaces of Multilayer Networks” (2015)
  3. R. Ge, F. Huang, C. Jin, Y. Yuan, “Escaping From Saddle Points — Online Stochastic Gradient for Tensor Decomposition” (2015)
  4. J. D. Lee, M. Simchowitz, M. I. Jordan, B. Recht, “Gradient Descent Converges to Minimizers” (2016)
  5. S. S. Du, C. Jin, J. D. Lee, M. I. Jordan, B. Póczos, A. Singh, “Gradient Descent Can Take Exponential Time to Escape Saddle Points” (2017)
  6. C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, M. I. Jordan, “How to Escape Saddle Points Efficiently” (2017)
