
Implementing L-BFGS optimization from scratch

Abhijit Mondal
Published in Gaussian Machine
14 min read · Jul 4, 2024


In this series of posts we are going to look at how we can use L-BFGS-B optimization in dynamic pricing algorithms. The posts are inspired by our team’s work at Microsoft in building a distributed and scalable spot virtual machine dynamic pricing engine.

The posts are broken down as follows:

  1. Understanding and implementing L-BFGS optimization from scratch.
  2. Understanding and implementing automatic differentiation from scratch.
  3. Understanding and implementing L-BFGS-B for bounded constraint problems.
  4. Improving the run-time performance and memory utilization in the L-BFGS-B optimization algorithm.
  5. Using L-BFGS-B for dynamic pricing problems with different types of constraints.
  6. Optimizing our L-BFGS-B implementation to run on GPU.
  7. Making the L-BFGS-B optimization run in a distributed cluster for a large number of parameters.

In this post, we are going to understand the basics of the L-BFGS optimization algorithm and how it compares with some other 1st order and 2nd order optimization algorithms.

Assume we are solving an unconstrained optimization problem in a single variable x: minimize f(x).

Then, we can write f(x) about a point x0 using the Taylor Series as follows:
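
f(x) = f(x0) + f'(x0)*(x-x0) + (1/2!)*f''(x0)*(x-x0)² + (1/3!)*f'''(x0)*(x-x0)³ + ...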

f’(x0) is the 1st order derivative of f w.r.t. x evaluated at x=x0, and similarly f’’(x0) is the 2nd order derivative of f w.r.t. x evaluated at x=x0 and so on.

Note that the above formula holds true for any arbitrarily complicated function f(x), provided f is sufficiently smooth around x0.

A 1st order approximation of f(x) means we truncate the above Taylor Series after the 2nd term, i.e.
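
f(x) ≈ f(x0) + f'(x0)*(x-x0)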

Similarly, a 2nd order approximation of f(x) means we truncate the Taylor Series after the 3rd term, i.e.
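
f(x) ≈ f(x0) + f'(x0)*(x-x0) + (1/2)*f''(x0)*(x-x0)²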

For any x very close to x0, i.e. x-x0 -> 0, the higher order terms such as (x-x0)³, (x-x0)⁴ and so on become very small, and thus the 2nd order approximation is very close to the actual value of f(x).

Remember that to find the minimum value of a function f(x) (assuming f is continuous and both f'(x) and f''(x) exist), we take the 1st derivative of f(x) w.r.t. x and set it to 0, i.e. f'(x) = 0.

Once we solve the above equation and find the corresponding x_opt at which f'(x_opt) = 0, we can then evaluate the 2nd order derivative f''(x_opt) at x_opt.

If the 2nd order derivative is greater than 0, then x_opt is a minima. If the 2nd order derivative is less than 0, then x_opt is a maxima. Note that if f(x) is a convex function, then there is only one global minima, but if f(x) is non-convex, then f(x) can have multiple local minima apart from a global minima.

The condition f’(x_opt) = 0 and f’’(x_opt) > 0 holds true for local minima. Thus, with non-convex functions, x_opt can be a local minima.

Solving the equation f'(x) = 0 when f(x) is some complicated function may not always be feasible, or may be too cumbersome. Thus, we cannot always use f(x) in its original form to find x_opt.

In such scenarios, the Taylor Series approximations come in handy.

For the 1st order approximation scenario:

Denoting h = x-x0, we have:
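
f(x0+h) ≈ f(x0) + f'(x0)*h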

We want to find an optimum h, s.t. f(x0+h) is minimum among all possible values of h. Note that we cannot use the method of solving df/dh = 0 and then checking d²f/dh² > 0, because the above formula is linear in h: df/dh = f'(x0) is a constant (so setting it to 0 gives no equation in h), and d²f/dh² is identically 0.

Alternatively, we can find h s.t. f(x0+h) < f(x0), i.e.
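
f(x0) + f'(x0)*h < f(x0)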

Thus, f’(x0)*h must be less than 0.

If we choose h = -a*f'(x0) where a > 0, then we have f'(x0)*h = -a*(f'(x0))², which is < 0 because both a > 0 and (f'(x0))² > 0 (assuming f'(x0) ≠ 0).

Thus, if we choose h = -a*f'(x0), i.e. take the next point after x0 as x1 = x0-a*f'(x0), then f(x1) will be smaller than f(x0). This tells us that moving to x1 lowers the value relative to f(x0) and brings us closer to the minima.

If we use the following update strategy for x_n in the n-th iteration, then:
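
x_n+1 = x_n - a*f'(x_n)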

This is what usually happens in the gradient descent algorithm, where ‘a’ is the learning rate.

Note that we have to carefully choose the constant ‘a’.

If 'a' is too high, then the next iterate x_n+1 might overshoot the minima, and it might happen that the iterations either oscillate and do not converge at all, or converge only very slowly.

If 'a' is too low, then the iterations will converge to a local minima, but convergence might be very slow.
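
For instance, here is a minimal sketch of this update on a made-up toy function f(x) = (x - 3)² (my own example, not the pricing objective), using a fixed learning rate 'a':

def f(x):
    # toy objective with its minimum at x = 3
    return (x - 3.0)**2

def df(x):
    # 1st derivative of f
    return 2.0*(x - 3.0)

x, a = 0.0, 0.1
for _ in range(100):
    # x_n+1 = x_n - a*f'(x_n)
    x = x - a*df(x)

print(x, f(x))  # x ends up very close to 3.0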

Another approach, instead of using a constant value for 'a', is to update 'a' in each iteration. One can use heuristics to come up with satisfactory values for 'a', such as the sufficient decrease condition:
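
f(x0 - a*f'(x0)) ≤ f(x0) - c1*a*(f'(x0))², where 0 < c1 < 1 is a small constant (typically c1 = 1e-4).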

It says that after an update to x0, f(x1) should be "sufficiently" reduced w.r.t. f(x0), so that we do not converge towards the minima too slowly. Another way to look at this is that if we plot f as a function of a, then an acceptable step 'a' is one where this curve lies on or below the line f(x0) - c1*a*(f'(x0))², a line whose slope is -c1*(f'(x0))².

The other is the curvature condition:
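
|f'(x0 - a*f'(x0))| ≤ c2*|f'(x0)|, where c1 < c2 < 1 (typically c2 = 0.9).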

It says that, if we use the same descent direction -f'(x0) at both x0 and x1, then the slope of f along this direction at x1 is much less steep than the slope at x0. It implies that at x1 we are closer to the minima than at x0.

In general, these conditions are known as the Wolfe Conditions and are used in the Line Search algorithm we will see later in this post.

If f(x) is a convex function, then the iterations are guaranteed to converge to the global minima when the values of 'a' are small enough. For non-convex functions, we will most often end up in a local minima that depends on the initial starting point x0. One needs to run the iterations with different initial points to be able to reach a "better" minima or the global minima.

Often the 1st order approach we took above can be slow to reach a local minima. To overcome this, one usually uses a 2nd order approximation to the Taylor Series to compute the descent directions.

In this case, we can use the approach where we set df/dh = 0 and check if d²f/dh² > 0 because d²f/dh² exists.

df/dh = 0 implies that:
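
f'(x0) + f''(x0)*h = 0, i.e. h = -f'(x0)/f''(x0)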

If we take the 2nd derivative of f w.r.t. h, we get: d²f/dh² = f’’(x0). Thus in order for f(x0 + h) to be a minima, we also need to satisfy f’’(x0) > 0.

Using the 2nd order approximation, the update to x_n is as follows:
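
x_n+1 = x_n - a*f'(x_n)/f''(x_n)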

To find the optimum value of ‘a’ at each iteration we can perform the above mentioned line search.

Instead of a function of a single parameter x, we can similarly solve the above optimization for functions taking a vector of parameters X. The 1st derivative is replaced by the gradient vector and the 2nd derivative by the Hessian matrix.

For the 2nd order approximation, we can rewrite the equations in vector and matrix terms as follows:
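
f(X0 + h) ≈ f(X0) + g^T.h + (1/2)*h^T.H.h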

X0 is a vector of size m, h is another vector of size m, f is the optimization function that takes in a vector of size m and returns a scalar.

g is the gradient vector and g^T is the transpose of the vector.

H is the hessian matrix and is of size m x m:
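
H[i][j] = ∂²f/(∂x_i ∂x_j), evaluated at X0, for i, j = 1, ..., m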

The update equation looks like:
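
X_n+1 = X_n - a*B_n.g_n, where B_n = H_n^-1 is the inverse of the hessian evaluated at X_n, g_n is the gradient at X_n, and 'a' is the step size.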

The condition corresponding to f''(x0) > 0 for a vector of parameters X0 is that the hessian matrix H is a positive definite matrix, i.e. for any non-zero vector x, we have the following:
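
x^T.H.x > 0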

If the above condition is satisfied for the hessian matrix H, then we can move in the direction -B_n.g_n to reduce the value of the optimization function.

When the number of parameters, i.e. the size m of the vector X, is very large, ~10⁶ or so, storing and computing the hessian matrix or its inverse is not scalable. Storing a matrix of 10¹² floating point values, each of 64 bits, would require approximately 8 TB of RAM.

Also, computing the inverse of an m x m matrix has a time complexity of O(m³).

In the BFGS approach, instead of computing the inverse of the hessian, its values are approximated using the below formula.
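
B_n+1 = (I - r_n*y_n.s_n^T).B_n.(I - r_n*s_n.y_n^T) + r_n*y_n.y_n^T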

where we have the following:
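
y_n = X_n - X_n-1 (the difference between the parameter vectors at iterations n and n-1)
s_n = g_n - g_n-1 (the difference between the gradients at iterations n and n-1)
r_n = 1/(y_n^T.s_n), and I is the m x m identity matrix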

Thus, we see that the inverse hessian matrix at iteration n+1, i.e. B_n+1, depends only on the inverse hessian at iteration n, the difference between the gradients at iterations n and n-1, and the difference between the parameters X at iterations n and n-1.

Now, we do not need to explicitly invert the hessian matrix. The initial inverse hessian, i.e. B0, can be taken as the identity matrix. Another nice property of the above update is that if B0 is positive definite, then B_n is positive definite for all n > 0.

But even with this modification, we still have to keep the matrix B_n of size m x m in memory.

In the Limited Memory BFGS approach, we do not need to store the B_n matrices, except for B0, which is the identity matrix of size m x m. Instead of computing B_n+1 using the recursion from the n-th iteration, suppose we expand the recursion all the way back to B0 and recompute everything from the previous values at each iteration n, i.e.

Let:
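
u_n = I - r_n*y_n.s_n^T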

Then, expanding B_n in the update equation, we can write the following:
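
B_n+1 = u_n.u_n-1.B_n-1.u^T_n-1.u^T_n + r_n-1*u_n.y_n-1.y^T_n-1.u^T_n + r_n*y_n.y_n^T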

Similarly, expanding the B_n-1 term completely, we get the following:
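
B_n+1 = u_n.u_n-1...u_0.B_0.u^T_0...u^T_n-1.u^T_n + W_n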

where
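
W_n = sum over i = 0 to n of [ u_n.u_n-1...u_i+1.(r_i*y_i.y_i^T).u^T_i+1...u^T_n-1.u^T_n ]

(the i = n term is simply r_n*y_n.y_n^T)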

Note that for each iteration n, we have to compute the entire product from iteration 0 to n. Moreover, we need to keep all the previous values y_i and s_i in memory to compute the above terms.

If the number of iterations required to converge is very high, then the amount of memory required, as well as the time complexity of this approach, can very well exceed the vanilla BFGS strategy where we just needed to keep the inverse hessian matrix from the last iteration in memory.

In L-BFGS, instead of using all the previous iterations, only the latest k iterations’ values are considered and the rest are discarded i.e.
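
B_n+1 ≈ u_n.u_n-1...u_n-k+1.B_0.u^T_n-k+1...u^T_n-1.u^T_n + W_n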

Similarly, we use only the latest k values for W_n.

Note that this is an approximation; usually, for k between 5 and 10, the inverse hessian computed with the latest k values is very close to the true inverse hessian.

Let’s estimate the time and space complexity with this approach.

Instead of computing B_n first and then multiplying it with the gradient vector g_n, we can compute the product B_n.g_n directly. This is because if we computed B_n beforehand, we would need to store a matrix of size m x m in memory, which we saw earlier would take up around 8 TB of RAM.

Instead, suppose we compute the following:
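
B_n.g_n = u_n...u_n-k+1.B_0.( u^T_n-k+1...( u^T_n.g_n ) ) + W_n.g_n, evaluated from the right, starting with u^T_n.g_n = g_n - r_n*s_n*(y_n^T.g_n)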

Computing the dot product y^T_n.g_n, where y^T_n is of dimension 1 x m and g_n is of dimension m x 1, results in a scalar quantity. Thus u^T_n.g_n is a vector of dimension m x 1 and only requires keeping O(m) values in memory.

Next we take the dot product of the above quantity with u^T_n-1.

Thus, again we are only storing O(m) data in memory. Continue the above multiplications till we multiply with the matrix B0. Since B0 is an identity matrix, the multiplication of B0 with the above chain of multiplications results in a vector of size m x 1.

The analysis is the same for W_n.g_n. Thus, we see that the space complexity of computing the dot product B_n.g_n is O(m). To compute the terms u_n and W_n, we need to keep the last k pairs of y_n and s_n in memory, i.e. a space complexity of O(km).

Thus the overall space complexity of the above approach is O(k*m), where k is the maximum history we are using.

The time complexity is O(k²m), dominated by the dot product of W_n with g_n: there are k terms in W_n, and for each term, taking the dot product with g_n has a time complexity of O(km) (in fact it is O(jm) for the j-th term in W_n).

But since k = O(10), the overall time complexity per iteration can still be considered to be O(m).

Once we have obtained the dot product B_n.g_n, then we can go ahead and update the parameter vector X_n+1 as follows:
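
X_n+1 = X_n - a*(B_n.g_n)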

The only thing remaining is finding the optimum value of the step size 'a'. There are several ways to calculate the value of 'a'.

One such strategy is as follows:

Let us define: F(a) = f(X_n) - c1*a*g_n^T.(B_n.g_n)

For any 2 step sizes a_low and a_high, we take some step in between them, let's say a_mid = (a_low + a_high)/2.

If F(a_mid) > f(X_n+1), then a_mid is a 'good' step and we can search for larger step sizes in the interval (a_mid, a_high]. If we find any 'good' step in the updated interval then we take that, else we use the current 'good' step.

Similarly, if F(a_mid) ≤ f(X_n+1), then a_mid is not a 'good' step. In this case, we search in the lower interval [a_low, a_mid). But it may happen that there are 'good' steps in (a_mid, a_high] but not in [a_low, a_mid).

If we do not find any ‘good’ step in [a_low, a_mid), then we return some default small value for ‘a’ as the step size.

A ‘good’ step is defined based on the Wolfe conditions shown below:
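
Sufficient decrease: f(X_n - a*B_n.g_n) - f(X_n) ≤ -c1*a*g_n^T.(B_n.g_n)

Curvature: |g_n+1^T.(B_n.g_n)| ≤ c2*|g_n^T.(B_n.g_n)|

where 0 < c1 < c2 < 1 (the code below uses c1 = 1e-4 and c2 = 0.9).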

Let the maximum value of 'a' be 1.0 and a_default = 1e-7; then a_low = 0.0 and a_high = 1.0.

  1. Find a_mid = (a_low + a_high)/2
  2. Calculate X_n+1 and g_n+1 using a_mid in the above formula.
  3. If the sufficient decrease condition is satisfied, then a_mid is a "good" candidate for 'a', and we try to search for larger values of 'a' for which the conditions still hold, by setting a_low = a_mid + 1e-7 and repeating from step 1.
  4. In step 3, if the curvature condition is also satisfied for a_mid, then we return a_mid as the optimum value of 'a'.
  5. If the sufficient decrease condition is not satisfied, then we reduce the value of 'a' by setting a_high = a_mid - 1e-7 and repeat from step 1.

We can stop iterating when the updates to X_n become negligibly small, i.e. when:
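
|X_n+1 - X_n| < epsilon, componentwise, for some small tolerance such as epsilon = 1e-7.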

Let’s get down to some coding.

Python implementation of the above L-BFGS algorithm applied on the 2D Rosenbrock function:

import numpy as np
from collections import deque

def fn(x):
    # Compute the optimization function
    return (1.0-x[0,0])**2 + 100*(x[1,0]-x[0,0]**2)**2

def gfn(x):
    # Gradient of fn(x)
    return np.array([[-2*(1.0-x[0,0]) - 400*x[0,0]*(x[1,0]-x[0,0]**2)],
                     [200*(x[1,0]-x[0,0]**2)]])

def line_search(x, g, d, a0=1e-5, e=1e-6):
    # Perform line (binary) search to find optimal value of a
    left, right = 0.0, 1.0
    a, c1, c2 = a0, 1e-4, 0.9

    while left < right:
        m = (left + right)/2

        x_new = x - m*d
        g_new = gfn(x_new)

        # sufficient decrease condition
        if fn(x_new)-fn(x) < -c1*m*np.dot(g.T, d):
            # curvature condition
            if abs(np.dot(d.T, g_new)) < c2*abs(np.dot(d.T, g)):
                return m

            # sufficient decrease holds: remember this step and search for larger ones
            a = m
            left = m + e
        else:
            right = m - e

    return a

def optimize(x0, k=10, epsilon=1e-7, max_iter=1e5):
    # l-bfgs optimization algorithm
    m = len(x0)

    x = x0
    g = gfn(x)

    # Store the last k pairs (y, s): y = parameter difference, s = gradient difference
    prev_values = deque()

    while max_iter > 0:
        print(f"Current objective value = {fn(x)}")
        max_iter -= 1

        prev_x = x
        prev_g = g

        # q = u_n...u_0 . B0 . u^T_0...u^T_n . g, the B0 part of the expansion
        q = g
        for i in range(len(prev_values)-1, -1, -1):
            y, s = prev_values[i]
            r = 1.0/np.dot(y.T, s)
            q = q - r*np.dot(y.T, q)*s

        for i in range(len(prev_values)):
            y, s = prev_values[i]
            r = 1.0/np.dot(y.T, s)
            q = q - r*np.dot(s.T, q)*y

        # p = W_n . g, the sum of the rank-one "sandwich" terms of the expansion
        p = 0
        for i in range(len(prev_values)-1, -1, -1):
            w = g

            # right part of the i-th term: apply u^T_j for the newer pairs j > i
            for j in range(len(prev_values)-1, i, -1):
                y, s = prev_values[j]
                r = 1.0/np.dot(y.T, s)
                w = w - r*np.dot(y.T, w)*s

            # middle rank-one factor r_i*y_i.y_i^T
            y, s = prev_values[i]
            r = 1.0/np.dot(y.T, s)
            w = r*np.dot(y.T, w)*y

            # left part of the i-th term: apply u_j for the newer pairs j > i
            for j in range(i+1, len(prev_values)):
                y, s = prev_values[j]
                r = 1.0/np.dot(y.T, s)
                w = w - r*np.dot(s.T, w)*y

            p += w

        # Descent direction
        d = p+q

        # Perform line search
        a = line_search(x, g, d)

        x = x - a*d
        g = gfn(x)

        y = x-prev_x
        s = g-prev_g

        prev_values.append((y, s))

        # Maintain only latest k values
        if len(prev_values) > k:
            prev_values.popleft()

        # Exit condition: stop when the update to x is negligibly small
        if np.all(np.abs(y) < epsilon):
            break

    return x

if __name__ == '__main__':
    x0 = np.array([[0.003], [0.006]])
    x = optimize(x0, k=10, epsilon=1e-7)
    print(f"Final solution = {x}")

In the above code, we had to explicitly specify the gradient function, because the gradient is specific to the objective function and for each new objective we have to update the gfn(x) gradient function. Also, if the objective function is complicated, then deriving the gradients by hand can be error-prone and cumbersome.

In the next post we will see how we can use the automatic differentiation technique to automatically find the gradient without having to update it for every new function.

