Convex optimization, unconstrained

rhome
Feb 1, 2020

This post is the first in a series of 3 articles dedicated to convex optimization, organized as follows:

  1. Convex optimization, unconstrained
  2. Convex optimization with equality constraints
  3. Convex optimization with inequality constraints and the Interior Point Method

Introducing Convexity

A real-valued function f is convex if the line segment (or chord) between any two points (x, f(x)) and (y, f(y)) of its graph lies above the graph.

Mathematically, this is equivalent to the following inequality, sometimes called Jensen’s inequality:

f(θx + (1-θ)y) ≤ θf(x) + (1-θ)f(y), for all x, y and all θ in [0, 1]

where the domain of f is a convex set. Alternatively, a function f is said to be concave if its opposite, -f(x), is convex.

An interesting condition for convexity exists for twice-differentiable functions, and is formulated as follows: f is convex if and only if its second derivative f''(x) is nonnegative (in several dimensions, its Hessian ∇²f(x) is positive semidefinite) everywhere on its convex domain.

These functions have nonnegative curvature everywhere on their domain of definition, so you can picture them as having a bowl-like shape: in 2 dimensions, a convex function looks like a bowl-shaped surface.

Convex analysis can easily be the topic of an entire semester in university, so we won’t go into much detail (see the references for more information). Instead, let’s focus on interesting properties of convex functions that can be exploited when searching for their minimum.

A fundamental property of convex functions is that they exhibit a local to global phenomenon: if x* is a local minimum of a convex function f then x* is a global minimum of f.

In practice this is very handy: for example, we can look for a point x* at which the gradient of f vanishes:

∇f(x*) = 0 (1)

By solving this equation, we would find a local minimum of f, say x* (as the gradient only carries local information). However, f has nonnegative curvature, so f(x) will increase if we move x slightly away from x*, and it will keep increasing as we move x further away. So the local minimum we found by solving (1) was actually the global minimum!

For convex functions, we have the following equivalence:

∇f(x*) = 0 if and only if x* is a global minimum of f

In practice, this means we can implement fast algorithms to find x*, and will be guaranteed to have found the global minimum of f. Isn’t it beautiful? 😀
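As a toy illustration of this equivalence (the function below is my own example, not from the original), solving f'(x) = 0 for a convex function directly yields its global minimum:

```python
# Illustrative convex function: f(x) = (x - 3)**2 + 1, with f'(x) = 2*(x - 3).
def f(x):
    return (x - 3) ** 2 + 1

def f_prime(x):
    return 2 * (x - 3)

# Solving f'(x) = 0 gives the stationary point x* = 3.
x_star = 3.0
print(f_prime(x_star))  # 0.0: the gradient vanishes at x*
print(f(x_star))        # 1.0: the global minimum value of f
```

Because f is convex, no other point can have a smaller value, so the search can stop here.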

Now let’s introduce convex optimization problems and their formulation, before diving into popular techniques to solve them.

Convex optimization

In an optimization problem, we are looking for a vector x that minimizes a function f, with x possibly subject to constraints.

An optimization problem is usually written in its standard form:

minimize f(x)
subject to g_i(x) ≤ 0, i = 1, ..., m
           h_j(x) = 0, j = 1, ..., p

For this problem to be convex, it has to meet 3 requirements:

  • the objective function f has to be convex
  • the inequality constraint functions have to be convex
  • the equality constraints have to be affine, i.e. of the form a*x+b*y+c=0.

Recognizing such convex problems when they arise takes some practice. It requires knowledge of families of convex functions and the composition operations that preserve convexity.

There are also several tricks to reformulate a seemingly non-convex problem into a convex one. For more information, you can have a look at the excellent book by S. Boyd and L. Vandenberghe, Convex Optimization [A]. Below are common convex functions and rules listed in this book:

Popular convex functions include affine functions, norms, the exponential eˣ, powers xᵖ for p ≥ 1 (on x ≥ 0), the negative logarithm, and the max and log-sum-exp functions. Operations that preserve convexity include nonnegative weighted sums, composition with an affine mapping, and pointwise maximum.

Now let’s suppose that we are facing such a convex problem and see how to solve it.

In this post we analyze unconstrained problems; we will introduce equality and inequality constraints in the next articles of this series. Note that techniques such as Lagrange multipliers and the Karush-Kuhn-Tucker (KKT) conditions make it possible to convert a constrained problem into an unconstrained one by incorporating the constraints into the objective function. So the techniques detailed below will still be useful with constraints.

Gradient methods

Gradient methods are popular iterative methods for solving convex minimization problems. They consist of 2 steps, repeated until convergence:

  1. choose a search direction s[k];
  2. update the current point: x[k+1] = x[k] + a[k]*s[k], where a[k] is the step size.

These algorithms will produce a minimizing sequence of points {x[0], x[1], ..., x[n]} converging towards the global minimum. For the method to actually descend we need to have:

f(x[k+1]) < f(x[k])

The first-order condition for convexity implies that

f(x[k+1]) ≥ f(x[k]) + ∇f(x[k])·(x[k+1] - x[k])

So for our sequence to decrease, we need to have

∇f(x[k])·(x[k+1] - x[k]) < 0

In other words, the projection of the search direction x[k+1]-x[k] on the gradient of f must be negative.

Gradient Descent

A natural choice for the search direction s[k] is simply to align it with the opposite gradient, s[k] = -∇f(x[k]), in which case we get the gradient descent update:

x[k+1] = x[k] - a[k]*∇f(x[k])

The stopping criterion can be based on the function value:

|f(x[k+1]) - f(x[k])| < ε

Or on the norm of the gradient of f, usually preferred as it is less sensitive to the scale of f:

‖∇f(x[k])‖ < ε

The step size, a[k], can be a small fixed value (e.g. 0.001), or it can decay as a function of the number of steps, for example a[k] = 0.001*decay_rate^(k/decay_steps). There are also techniques such as backtracking line search that adapt the step size at each iteration.
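The decaying schedule above can be written as a small helper function (the default values here are arbitrary placeholders, not recommendations):

```python
def step_size(k, a0=0.001, decay_rate=0.95, decay_steps=100):
    # a[k] = a0 * decay_rate**(k / decay_steps): starts at a0 and shrinks
    # by a factor of decay_rate every decay_steps iterations.
    return a0 * decay_rate ** (k / decay_steps)

print(step_size(0))  # 0.001
```

Smaller decay_rate values (or fewer decay_steps) make the step size shrink faster.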

Here is a toy implementation to illustrate the method for a function of 2 variables:
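A minimal sketch of such an implementation (the objective f(x, y) = x² + 2y², the starting point, and the step size are illustrative choices):

```python
def grad_f(x, y):
    # Gradient of f(x, y) = x**2 + 2*y**2.
    return 2 * x, 4 * y

def gradient_descent(x, y, step=0.1, tol=1e-8, max_iter=10_000):
    for k in range(max_iter):
        gx, gy = grad_f(x, y)
        # Stopping criterion: norm of the gradient below the tolerance.
        if (gx * gx + gy * gy) ** 0.5 < tol:
            break
        # Move along the opposite gradient: x[k+1] = x[k] - a[k]*grad f(x[k]).
        x, y = x - step * gx, y - step * gy
    return x, y, k

x_min, y_min, iters = gradient_descent(3.0, -2.0)
print(round(x_min, 6), round(y_min, 6))  # both close to 0, the minimizer
```

With this fixed step size the iterates contract geometrically towards the minimum at the origin, so convergence takes on the order of a hundred iterations.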


Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms. Optimization algorithms that use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms.

Newton’s Method


Newton’s method is also an iterative method producing a descending sequence of steps converging towards the global minimum. It differs from Gradient Descent in that it uses second-order derivatives to determine the search direction.

Let’s illustrate it with an example. Suppose we are at point x[k]; let’s approximate our function f with a quadratic polynomial around x[k]. The second-order Taylor expansion of f at x[k] gives:

f(x[k] + t) ≈ f(x[k]) + f'(x[k])*t + (1/2)*f''(x[k])*t²

The Newton step is the value t* which minimizes this quadratic approximation. It is easier to visualize its effect on a graph:

One iteration of Newton’s method

Now, is it possible to find a closed formula for t*? Luckily, we can 🙌. Our quadratic approximation is convex, so we can find t* by setting its derivative with respect to t to zero:

f'(x[k]) + f''(x[k])*t* = 0

The Newton step then becomes:

t* = -f'(x[k]) / f''(x[k]) (or, in several variables, t* = -∇²f(x[k])⁻¹∇f(x[k]))

Intuitively, we can see that if our original function f is quadratic, then Newton’s method will find its global optimum in just one iteration. And if f is nearly quadratic, the approximation, and hence the step, should be very accurate.
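We can verify this one-iteration property on a made-up quadratic, q(t) = 2(t - 5)² + 1:

```python
# q(t) = 2*(t - 5)**2 + 1, with q'(t) = 4*(t - 5) and q''(t) = 4.
def q_prime(t):
    return 4 * (t - 5)

q_second = 4.0

t0 = 0.0
t1 = t0 - q_prime(t0) / q_second  # a single Newton step from t0
print(t1)  # 5.0, the exact minimizer of q
```

Since the quadratic approximation of a quadratic is the function itself, the step lands exactly on the minimum.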

We can now formulate the algorithm for Newton’s method, repeated until convergence:

  1. compute the Newton step t[k] = -f'(x[k])/f''(x[k]) (in several variables, -∇²f(x[k])⁻¹∇f(x[k]));
  2. update: x[k+1] = x[k] + t[k];
  3. stop when ‖∇f(x[k])‖ < ε.
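As a sketch, here is a one-dimensional version of the method applied to the convex function f(x) = x⁴ + x² (an illustrative choice, not from the original post):

```python
# Derivatives of f(x) = x**4 + x**2, which is convex with minimum at x = 0.
def f_prime(x):
    return 4 * x ** 3 + 2 * x

def f_second(x):
    return 12 * x ** 2 + 2

def newton(x, tol=1e-10, max_iter=100):
    for k in range(max_iter):
        step = -f_prime(x) / f_second(x)  # Newton step t* = -f'(x)/f''(x)
        x += step
        # Stop once the step becomes negligible.
        if abs(step) < tol:
            break
    return x, k

x_star, iters = newton(2.0)
print(round(x_star, 8))  # 0.0, the minimizer of f
```

Near the minimum the quadratic model is very accurate, so the iterates converge in just a handful of steps.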

Which one to choose?

Should we use Gradient Descent or Newton’s Method? A short answer could be “it depends on the problem”.

If the function f has a positive definite Hessian, Newton’s method is likely to converge faster than Gradient Descent. If not, for example around saddle points, Newton’s method can produce update steps in the wrong direction.

When training neural networks, for example, Gradient Descent is usually preferred. In the book Deep Learning [C] the authors mention: “Beyond the challenges created by certain features of the objective function, such as saddle points, the application of Newton’s method for training large neural networks is limited by the significant computational burden it imposes. The number of elements in the Hessian is squared in the number of parameters, so with k parameters (and for even very small neural networks, k can be in the millions), Newton’s method would require the inversion of a k × k matrix — with computational complexity of O(k³). Also, since the parameters will change with every update, the inverse Hessian has to be computed at every iteration. As a consequence, only networks with a very small number of parameters can be practically trained with Newton’s method.”

Conclusion

This article introduced convex optimization and some popular methods to find the global minimum of unconstrained problems. I hope you found it useful!

In the next 2 posts, we will see how to introduce equality and inequality constraints into the model, and explain Lagrange multipliers, logarithmic barriers, and the Interior Point Method.

Thank you for reading! 🙂

References

[A] Convex Optimization — book by Stephen Boyd and Lieven Vandenberghe. Chapters 3 and 9.

[B] Convex Optimization — article on Wikipedia: https://en.m.wikipedia.org/wiki/Convex_optimization

[C] Deep Learning — book by Ian Goodfellow, Yoshua Bengio and Aaron Courville, Chapters 4 and 8.
