How to tame the valley — Hessian-free hacks for optimizing large #NeuralNetworks

Freedom Preetham · Published in Autonomous Agents · Oct 4, 2016

Let’s say you have the gift of flight (or you are riding a chopper). You are also a Spy (like in James Bond movies). You are given the topography of a long narrow valley as shown in the image and you are given a rendezvous point to meet a potential aide who has intelligence that is helpful for your objective. The only information you have about the rendezvous point is as follows:

“Meet me at the lowest co-ordinate of ‘this long valley’ in 4 hours”

How do you go about finding the lowest co-ordinate point? More importantly, how do you intend to find it within the stipulated time?

In this article I intend to demonstrate that by delving into the mathematical intricacies, we can understand and answer this question accurately.

Well, for complex Neural Networks with very many parameters, the error surface of the Neural Network looks very much like a long narrow valley of this sort. Finding a “minima” in the valley can be quite tricky when you have such pathological curvatures in your topography.

Note: there are many posts written on second-order optimization hacks for Neural Networks. The reason I decided to write about it again is that most of them jump straight into complex Math without much explanation.

Instead, I have tried to explain the Math as briefly as possible and mostly point to detailed sources to learn from if you are not trained in that particular field of Math.

This post shall be a bit long because of that.

In past posts, we used Gradient Descent algorithms during back-propagation to help minimize the error. You can find the techniques in the post titled “Backpropagation — How Neural Networks Learn Complex Behaviors”.

Limitations of Gradient Descent

There is nothing fundamentally wrong with the Gradient Descent algorithm [or Stochastic Gradient Descent (SGD) to be precise]. In fact, we have seen that it is quite efficient for some of the Feed Forward examples we have used in the past. The problem with SGD arises when we have “Deep” Neural Networks that have more than one hidden layer, especially when the network is fairly large.

Here are some illustrations of the non-monotonic error surface of a Deep Neural Network to give you an idea.

Error Surface — 1
Error Surface — 2

Note that there are many minima and maxima in the illustrations. Let us quickly look at the weight update process in SGD:

SGD weight updates

The problem with using SGD for the illustrations is as follows:

  • Since SGD is a first-order optimization method, it assumes that the error surface always looks like a plane (in the direction of descent, that is) and does not account for curvature.
  • When there is quadratic curvature, we apply some tricks to ensure that SGD does not just bounce off the surface, as shown in the weight update equation.
  • We control the momentum using a pre-determined alpha and control the velocity by applying a learning rate epsilon (a sketch of this update follows the list).
  • The alpha and the epsilon buffer the speed and direction of SGD and slow down the optimization until we converge. We can only tune these hyper-parameters to get a good balance of speed versus effectiveness of SGD. But they still slow us down.
  • In large networks with pathological curvatures, as shown in the illustration, tuning these hyper-parameters is quite challenging.
  • When you are traversing a long narrow valley, the error in SGD can suddenly start rising even as you move in the direction of the gradient. In fact, SGD can almost grind to a halt before it makes any progress at all.
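
To make the role of alpha and epsilon concrete, here is a minimal sketch of the classical momentum update in code. It is an illustration of the update rule described above, assuming a generic gradient function and toy hyper-parameter values; it is not the exact code from the earlier posts.

```python
import numpy as np

def sgd_momentum(grad, w, alpha=0.9, epsilon=0.01, num_steps=100):
    """Classical SGD with momentum.

    grad    -- function returning the gradient of the error at w
    alpha   -- momentum coefficient (buffers the direction)
    epsilon -- learning rate (buffers the speed)
    """
    v = np.zeros_like(w)                    # velocity
    for _ in range(num_steps):
        v = alpha * v - epsilon * grad(w)   # decaying history of gradients
        w = w + v                           # take the buffered step
    return w

# Toy "long narrow valley": e(w) = 0.5 * w0^2 + 50 * w1^2
grad = lambda w: np.array([1.0 * w[0], 100.0 * w[1]])
w_min = sgd_momentum(grad, np.array([10.0, 1.0]))
```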

We need a better method to work with large or Deep Neural Networks.

Second Order Optimization to the Rescue

SGD is a first-order optimization method. First-order methods are methods that work with linear local curves; that is, we assume we can apply linear approximations to solve the equations. Some examples of first-order methods are as follows:

  • Gradient Descent
  • Sub-Gradient
  • Conjugate Gradient
  • Random co-ordinate descent

There are also methods called second-order methods, which consider the convexity or curvature of the equation and make quadratic approximations. A quadratic approximation is an extension of a linear approximation, but it provides an additional term to deal with, which helps fit a quadratic surface to a point on the error surface.

The key difference between the first-order and second-order approximations is that while the linear approximation provides a “plane” that is tangential to a point on the error surface, the second-order approximation provides a quadratic surface that hugs the curvature of the error surface.

If you are new to quadratic approximations, I encourage you to check this Khan Academy lecture on Quadratic approximations.

The advantage of a second-order method is that it does not ignore the curvature of the error surface. Because the curvature is taken into account, second-order methods are considered to have better step-wise performance.

  • The full step of a second-order method points directly to the minimum of the local quadratic curvature (unlike first-order methods, which require multiple steps, with a gradient calculation in each step); see the worked example after this list.
  • Since a second-order method points to the minimum of a quadratic curvature in one step, the only thing you have to worry about is how well the quadratic actually hugs the error surface. That is a good enough heuristic to deal with.
  • Working with the hyper-parameters, given this heuristic, becomes very efficient.
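
As a worked example of the full-step property mentioned in the list above, take an error surface that is exactly quadratic (a standard textbook case, assuming the matrix A is positive-definite):

f(x) = (1/2) x^T A x - b^T x,   \nabla f(x) = A x - b,   H = A

x_1 = x_0 - H^{-1} \nabla f(x_0) = x_0 - A^{-1} (A x_0 - b) = A^{-1} b

So a single full Newton step from any starting point x_0 lands exactly on the minimizer A^{-1} b. On a real error surface the quadratic is only an approximation, which is why the “how well does the curve hug the surface” heuristic matters.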

The following are some second-order methods:

  • Newton’s method
  • Quasi-Newton, Gauss-Newton
  • BFGS, (L)BFGS

Let’s take a look at Newton’s method, which is the base method and is a bit more intuitive than the others.

Yo! Newton, what’s your Method?

Newton’s Method, also called the Newton-Raphson Method, is an iterative technique for approximating the roots of a real-valued function. It is one of the base methods used in second-order convex optimization problems to approximate functions.

Let’s first look at Newton’s method using the first derivative of a function.

Let’s say we have a function f(x) = 0, and we have some initial solution x_0 which we believe is sub-optimal. Then Newton’s method suggests we do the following:

  1. Find the equation for the tangent line to f at x_0.
  2. Find the point at which the tangent line cuts the x-axis and call this new point x_1.
  3. Find the projection of x_1 onto the function, i.e., the point on the curve directly above (or below) x_1.
  4. Now iterate again from step 1, replacing x_0 with x_1.

It really is that simple. The caveat is that the method does not tell you when to stop, so we add a 5th step as follows:

5. If the change in x_n (the current value of x) between iterations is at or below a threshold, then we stop.

Here is the image that depicts the above:

Finding optimal value of X using Newton’s Method.

Here is an animation that shows the same:

animation credit

First-degree-polynomial, One-dimension:

Here is the math for a function which is a first-degree polynomial in one dimension.
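
The equation here appeared as an image in the original post; it is presumably the standard Newton-Raphson update that follows from the tangent-line construction above. The tangent at x_n is y = f(x_n) + f'(x_n)(x - x_n); setting y = 0 and solving for x gives

x_{n+1} = x_n - f(x_n) / f'(x_n)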

Second-degree-polynomial, One-dimension

Now we can work on the Newton approximation for a second-degree polynomial (second-order optimization) in one dimension (before we get to multiple dimensions). A second-degree polynomial is quadratic in nature and needs a second-order derivative to work with. To work with the second derivative of a function, let’s use the Taylor approximation as follows:
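
The Taylor approximation referenced here was shown as an image in the original post; it is presumably the standard second-order expansion around the current point x_n:

f(x_n + \Delta x) \approx f(x_n) + f'(x_n) \Delta x + (1/2) f''(x_n) \Delta x^2

Setting the derivative of the right-hand side with respect to \Delta x to zero gives the update

x_{n+1} = x_n + \Delta x = x_n - f'(x_n) / f''(x_n)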

Second-degree-polynomial, Multiple-dimension

Suppose that we are working on a second-degree polynomial with multiple dimensions. Then we follow the same Newton approach as above, but replace the first derivative with the gradient and the second derivative with the Hessian, as follows:
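
The update referenced here was an image in the original post; it is presumably the same Newton step with the gradient and Hessian in place of the first and second derivatives:

x_{n+1} = x_n - H(x_n)^{-1} \nabla f(x_n)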

A Hessian Matrix is a square matrix of second-order partial derivatives of a scalar-valued function; it describes the local curvature of a multi-variable function.

Specifically, in the case of a Neural Network, the Hessian is a square matrix whose number of rows and columns equals the total number of parameters in the Neural Network.

The Hessian for Neural Network looks as follows:

Hessian Matrix of a Neural Network
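
The matrix itself was shown as an image; its entries are presumably the second-order partial derivatives of the error e with respect to every pair of weights, with one row and one column per parameter:

H_{ij} = \partial^2 e / (\partial w_i \, \partial w_j),   i, j = 1, \dots, N

where N is the total number of parameters, so H is an N × N matrix.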

Why is a Hessian-based approach theoretically better than SGD?

Now, second-order optimization using Newton’s method of iteratively finding the optimal ‘x’ is a clever hack for optimizing over the error surface because, unlike SGD, where you fit a plane at the point x_0 and then determine the step-wise jump, in second-order optimization we fit a tightly hugging quadratic curve at x_0 and directly find the minimum of that curvature. This is supremely efficient and fast.

But!!! Empirically though, can you imagine computing a Hessian for a network with millions of parameters? Of course it gets very inefficient, as the amount of storage and computation required to calculate the Hessian is of quadratic order as well. So, though this is awesome in theory, in practice it sucks.

We need a hack for the hack! And the answer seems to lie in Conjugate Gradients.

Conjugate Gradients, a clever trick.

Actually, there are several quadratic approximation methods for a convex function. But the Conjugate Gradient Method works quite well for symmetric matrices that are positive-definite. In fact, Conjugate Gradients are meant to work with very large, sparse systems.

Note that a Hessian is symmetric about its diagonal, the parameters of a Neural Network are typically sparse, and near a minimum the Hessian of a Neural Network is positive-definite (meaning it has only positive Eigen values). Boy, are we in luck?

If you need a thorough introduction to Conjugate Gradient Methods, go through the paper titled “An Introduction to the Conjugate Gradient Method Without the Agonizing Pain” by Jonathan Richard Shewchuk. I find it quite thorough and useful. I would suggest that you study the paper in your free time to get an in-depth understanding of Conjugate Gradients.

The easiest way to explain the Conjugate Gradient (CG) is as follows:

  • CG descent is applicable to any quadratic form.
  • CG uses a step-size ‘alpha’ similar to SGD, but instead of a fixed alpha, we find the alpha through a line search algorithm.
  • CG also needs a ‘beta’, a scalar value that helps find the next direction, which is “conjugate” to the previous direction.

You can check most of the hairy math behind the CG equations in the paper cited above. I shall jump directly to the Conjugate Gradient algorithm itself:

For solving an equation Ax = b, we can use the following algorithm:

image credit
  • Here r_k is the residual,
  • p_k is the conjugate direction vector, and
  • x_k+1 is iteratively updated from the previous value x_k by adding the product of the step-size alpha_k and the conjugate vector p_k (a sketch in code follows this list).
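
Here is a minimal sketch of that algorithm in code, assuming A is symmetric positive-definite. Variable names mirror the r_k, p_k, and alpha_k quantities above; this is the textbook CG loop rather than code taken from the Shewchuk paper.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iter=None):
    """Solve A x = b for a symmetric positive-definite A."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                     # residual r_k
    p = r.copy()                      # first conjugate direction p_k
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)     # step size along p_k
        x = x + alpha * p             # x_{k+1} = x_k + alpha_k * p_k
        r = r - alpha * Ap            # update the residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:     # converged
            break
        beta = rs_new / rs_old        # scalar that keeps directions conjugate
        p = r + beta * p              # next conjugate direction
        rs_old = rs_new
    return x

# Usage: a small symmetric positive-definite system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = conjugate_gradient(A, b)          # close to np.linalg.solve(A, b)
```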

Given that we know how to compute the Conjugate Gradient, let’s look at the Hessian-free optimization technique.

Hessian-free Optimization Algorithm

Now that we have understood the CG algorithm, let’s look at the final clever hack that allows us to be free from the Hessian.

CITATION: Hessian-free optimization is a technique adapted to Neural Networks by James Martens at the University of Toronto in a paper titled “Deep Learning via Hessian-Free Optimization”.

Let’s start with a second-order Taylor expansion of a function:
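
The expansion referenced here was an image in the original post; around the current point x it is presumably:

f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x + (1/2) \Delta x^T H(x) \Delta x

Minimizing the right-hand side over \Delta x means solving the linear system H(x) \Delta x = -\nabla f(x), which is exactly an Ax = b problem of the kind Conjugate Gradient handles.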

Here we need to find the best delta_x, move to x + delta_x, and keep iterating until we converge. In other words, the steps involved in Hessian-free optimization are as follows:

Algorithm:

  1. Start with n = 0 and iterate.
  2. Let x_n be some initial sub-optimal point x_0, chosen randomly.
  3. At the current x_n, given the Taylor expansion shown above, compute the gradient of f(x_n) and the Hessian of f(x_n).
  4. Given the Taylor expansion, compute delta_x (and hence the next x_n+1 = x_n + delta_x) using the Conjugate Gradient algorithm.
  5. Iterate steps 3–4 until the current x_n converges.

The crucial insight: note that unlike in Newton’s method, where the Hessian is needed to compute x_n+1, in the Hessian-free algorithm we never form the Hessian to compute x_n+1. Instead, we use the Conjugate Gradient.

Clever Hack: since the Hessian only ever appears multiplied by a vector, we just need an approximation of the product of the Hessian with that vector, and we do NOT need the exact Hessian. Approximating the Hessian-vector product is far faster than computing the Hessian itself. Check the following reasoning.

Take a look at the Hessian again:

Hessian Matrix of a Neural Network

Here, the i’th row contains partial derivatives of the form ∂²e / (∂w_i ∂w_j).

Where ‘i’ is the row index and ‘j’ is the column index. Hence the product of the Hessian matrix and any vector ‘v’ is:
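
The product shown here as an image is presumably the limit of gradient differences:

H v = \lim_{\epsilon \to 0} [\nabla e(w + \epsilon v) - \nabla e(w)] / \epsilon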

The above is the directional derivative of the gradient of ‘e’ with respect to ‘w’, taken in the direction ‘v’.

Using finite differences, we can then approximate the above as follows:
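
With a small but finite \epsilon, the approximation referenced here is presumably:

H v \approx [\nabla e(w + \epsilon v) - \nabla e(w)] / \epsilon

which costs only two gradient evaluations per product, instead of ever building the full N × N matrix.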

In fact, a thorough explanation and technique for fast multiplication of the Hessian with a vector is available in the paper titled “Fast Exact Multiplication by the Hessian” by Barak A. Pearlmutter from Siemens Corporate Research.

With this insight, we can completely skip the computation of the Hessian and just focus on approximating the Hessian-vector product, which tremendously reduces the computation and storage required.
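
Putting the pieces together, here is a minimal sketch of one Hessian-free step in code. It assumes a generic gradient(w) function for the network’s error and combines the finite-difference Hessian-vector product with the CG loop sketched earlier; names such as hessian_vector_product and hessian_free_step are illustrative, not taken from the Martens paper.

```python
import numpy as np

def hessian_vector_product(gradient, w, v, eps=1e-5):
    """Approximate H v using finite differences of the gradient."""
    return (gradient(w + eps * v) - gradient(w)) / eps

def hessian_free_step(gradient, w, cg_iters=50, tol=1e-8):
    """One Hessian-free step: solve H * delta = -grad with CG, never forming H."""
    g = gradient(w)
    delta = np.zeros_like(w)          # the step delta_x we are solving for
    r = -g                            # residual of H*delta = -g at delta = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(cg_iters):
        Hp = hessian_vector_product(gradient, w, p)   # only H times a vector
        alpha = rs_old / (p @ Hp)
        delta = delta + alpha * p
        r = r - alpha * Hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p                 # next conjugate direction
        rs_old = rs_new
    return w + delta                  # move to x + delta_x and repeat until convergence

# Toy quadratic valley: e(w) = 0.5 * w0^2 + 50 * w1^2
gradient = lambda w: np.array([1.0 * w[0], 100.0 * w[1]])
w_next = hessian_free_step(gradient, np.array([10.0, 1.0]))   # lands near the minimum
```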

To understand the impact of the optimization technique, check the following illustration.

Note that with this approach, instead of bouncing off the sides of the mountains like SGD does, you can actually move along the slope of the valley until you find a minimum in the curvature. This is quite effective for very large Neural Networks or Deep Neural Networks with millions of parameters.

Apparently, it’s not easy to be a Spy…
