SVM Talking maths: Quadratic Programming and Cholesky factorisation

Maria
7 min read · Sep 28, 2021


Here we talk about the maths behind SVM numerical algorithms.

Setting the scene. Have you ever wondered why one algorithm works and another does not? This story opens a new series of articles designed as a complete guide, with the objective not only to teach you how to write a few lines of function calls, but also to make the mathematics more comprehensible for those who'd like to make informed decisions when choosing an algorithm. It also lets you brush up on your skills before any ML-related interview.

Our very first series is designed to familiarise you with Support Vector Machines (SVM), one of the well-known supervised learning algorithms. The focus of this series is not on the linear algebra behind separating hyperplanes. That has been discussed over and over and over again in numerous articles and ML books — a good one can be found here. Instead, we decided to look at SVM as an optimisation problem and introduce two very different algorithms that help us tackle it. We start with the basics of constrained optimisation and then move on to problem formulations under two effective optimisation techniques.

Here’s our plan — so you will not need to spend extra time reading something you already know.

  1. Talking maths: Quadratic Programming and Cholesky Decomposition
  2. Talking maths: Support Vector Machine as constrained optimisation problem
  3. Talking algorithms: Using Interior Point Methods for SVM training
  4. Talking algorithms: Using Sequential Optimisation for SVM training

In this article we break down the general form of constrained optimisation problem, talk about Lagrangian, its dual form and finally discuss Cholesky factorisation.

What is convexity?

We begin with some mathematical ideas that will come in handy as we go forward. Let’s talk about convexity first. We can have a convex set and a convex function. Imagine that we have a collection of elements S (a set). Informally, S is convex if, whenever you draw a line joining any two points that belong to S, the whole line segment lies inside S. For a function, convexity means that the chord joining any two points on its graph lies on or above the graph; concavity is just the same thing flipped over — now the chord lies below. These are informal definitions that help to picture things in your head. Like this

Let’s now talk about convex functions more formally. Things become slightly more complicated — we need one defining property. A function is convex if its domain S is a convex set (as we have just seen) and if, for any two points x and y that belong to S, the following property holds.

This is also called Jensen’s inequality and it’s a common interview question. Intuitively, you can think of it in terms of expected values — for a convex f we have f(E[X]) ≤ E[f(X)] — and this will help you not to forget it!

Jensen’s inequality
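
The formula image isn’t reproduced here, so as a sketch of the standard convexity inequality the paragraph above refers to (with theta weighting the two points):

    f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y), \qquad x, y \in S, \ \theta \in [0, 1]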

What are affine functions?

We have now discussed the first tool needed to fully grasp the idea behind constrained optimisation. The second building block is affine functions. The easiest way to think about them is to start with a system of linear equations. Remember this concept from linear algebra?

A function is affine if it is the sum of a linear function and a constant, i.e. if it has the form

Definition of affine functions
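
Since the formula image isn’t shown here, a sketch of the usual way to write it, where A is a matrix (the linear part) and b is a constant vector:

    f(x) = A x + b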

There is an interesting feature relating numerical optimisation and convexity. If we take f(S), where f is affine and S is a convex set, then the image of S under f is convex. Just to save you the time of recalling what an image is — it is the set of all outputs.

The natural question now is: why should we care?

Constrained optimisation: Lagrangian and duality come together

To find an optimal hyperplane for the SVM problem, we need to formulate it as a constrained optimisation problem. This is when everything we have discussed so far comes together into one picture. Let’s go 🚀

A convex optimisation problem is one of the form

Constrained optimisation problem with equality and inequality constraints
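
The formula image isn’t reproduced here; as a sketch in the standard textbook notation, with f_0 the objective, f_i the inequality constraint functions and h_j the equality constraint functions:

    \begin{aligned}
    \text{minimise}\quad   & f_0(x) \\
    \text{subject to}\quad & f_i(x) \le 0, \quad i = 1, \dots, m \\
                           & h_j(x) = 0,   \quad j = 1, \dots, p
    \end{aligned}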

and it requires

  • The objective function to be convex
  • The inequality constraint functions to be convex
  • The equality constraint functions to be affine

In this form everything is kept separate, but what should we do if we want to put everything together? That is where the Lagrangian comes in!

The general idea of Lagrangian (L) duality is to take the constraints into account by augmenting the objective function with a weighted sum of the constraint functions.

Lagrangian mapping

And according to the description above, we have the following

Lagrangian form
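
Since the formula image is missing, a reconstruction in the notation introduced above:

    L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{j=1}^{p} \nu_j h_j(x)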

Lambda and nu are Lagrange multipliers. The first is associated with the inequality constraints and the second with the equality constraints. This problem can be flipped into its dual form in the hope that it will be easier to solve. We then have the Lagrangian dual form

Lagrangian dual form mapping

It is defined as the infimum (minimum value) of the Lagrangian over all x

Lagrangian dual form
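
A reconstruction of the missing formula, using the Lagrangian defined above:

    g(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu)

For any lambda ≥ 0 this satisfies g(λ, ν) ≤ p*, which is exactly the lower-bound property discussed just below.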

There is one property of this function that makes it extremely useful. The dual function is a pointwise infimum (inf) of a family of affine functions, and by construction (you can read more about it here and here) it is concave, even when the original problem is not convex. Thus, the dual function gives a lower bound on the optimal value p*.

Confused by the last few sentences? I know — we have been there too. Before we show an example that will clarify this, we need to note that for certain constrained optimisation problems we have a zero duality gap. This essentially means that the same optimal value can be reached by considering either the primal or the dual problem. This property is exhibited, for example, (1) if we have a convex quadratic objective function and affine constraint functions and (2) if the Hessian matrix with respect to (wrt) x is positive semidefinite or positive definite. Good news! SVM is exactly this kind of problem 😄

Just one more very important thing to discuss — at the optimum the Karush-Kuhn-Tucker (KKT) conditions must be satisfied! Don’t be bothered by that too much right now; we will explain how to incorporate them into the problem in a future article!

Showing zero duality gap

Zero duality gap shown

Now we go back to the zero duality gap and show visually what it really means. For illustration purposes, we pick a well-known function and one linear constraint. We want to find the pointwise infimum over x while keeping lambda fixed.

Objective function and one constraint

If we fix lambda and move along the x-axis, we see a parabola-shaped slice of the surface. The nice thing about this plot is that it also clearly shows the other view: if we instead fix x, the Lagrangian is a linear function of lambda. Where the constraint is violated (the infeasible x’s), the slope in lambda is positive, whereas for feasible x’s the slope is negative. We have constructed a Lagrangian, and the way it looks is generic. To get the dual function, note that for fixed lambda this Lagrangian is a quadratic function of x, so we know how to take its inf — in this case the minimum (the vertex of the parabola) — and afterwards we find the maximum over lambda. How does this correspond to the Lagrangian? We slice the surface parallel to the x-axis, take the inf wrt x for every slice, and assign it to the dual function. Plotting on the surface the points that correspond to this inf for each individual lambda shows exactly where the dual function lives on the Lagrangian.
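
The article does not spell out which function is plotted, so as a minimal numerical sketch we assume the textbook example of minimising x² subject to x ≥ 1 (primal optimum p* = 1): fix lambda, slice the Lagrangian, take the infimum over x, then maximise over lambda.

```python
import numpy as np

# Assumed example: minimise f(x) = x^2  subject to  1 - x <= 0  (i.e. x >= 1).
# The primal optimum is p* = 1, attained at x = 1.
x = np.linspace(-1.0, 3.0, 2001)        # grid over the primal variable
lambdas = np.linspace(0.0, 4.0, 401)    # grid over the multiplier

# Lagrangian L(x, lambda) = x^2 + lambda * (1 - x); one row per lambda value.
L = x**2 + lambdas[:, None] * (1.0 - x)

# Dual function g(lambda) = inf_x L(x, lambda): slice the surface at fixed
# lambda and take the minimum over x (the vertex of each parabola).
g = L.min(axis=1)

p_star = 1.0          # primal optimal value
d_star = g.max()      # dual optimal value: max over lambda >= 0 of g(lambda)

print(f"p* = {p_star:.3f}, d* = {d_star:.3f}")        # both ~1: zero duality gap
print(f"optimal lambda ~ {lambdas[g.argmax()]:.2f}")  # ~2, as the KKT conditions predict
```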

Cholesky Factorisation

Now we are almost all set to formulate SVM as a quadratic programming problem — we will do this in the two upcoming articles! First, though, we need to discuss the very last mathematical tool required to fully grasp the idea behind Interior Point Methods — Cholesky decomposition!

Every positive definite matrix can be factored as

where R is upper triangular with positive diagonal elements. We also recall that a positive definite matrix is a symmetric matrix for which

These are the mathematical formulations of a positive definite matrix and of the Cholesky factorisation.
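
The two formula images are missing, so as a reconstruction under the conventions stated above:

    A = R^{T} R \qquad \text{and} \qquad x^{T} A x > 0 \quad \text{for all } x \ne 0

And a quick numerical sanity check (note that numpy.linalg.cholesky returns a lower-triangular factor L with A = L Lᵀ, so R = Lᵀ in the notation above):

```python
import numpy as np

# Build a symmetric positive definite test matrix: B @ B.T is positive
# semidefinite, and adding a diagonal shift makes it strictly positive definite.
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)

L = np.linalg.cholesky(A)   # lower triangular, A = L @ L.T
R = L.T                     # upper triangular with positive diagonal, A = R.T @ R

print(np.allclose(A, R.T @ R))   # True: the factorisation reproduces A
print(np.all(np.diag(R) > 0))    # True: positive diagonal elements
```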

Now, the natural questions are: why do we need all this maths for SVM and how should we tackle it numerically? We will answer all these questions in the upcoming articles 😜 🚀

Test yourself

  1. What is Jensen’s inequality?
  2. How are affine functions defined?
  3. Define the constrained optimisation problem, Lagrangian and its dual form
  4. What kind of matrix can be decomposed as Cholesky Factorisation?
