Mathematical Underpinnings: SVMs
In this, and subsequent, articles we are going to examine the mathematical underpinnings of support vector machines to understand some of the niceties they exploit, such as generalisability and convergence.
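To fix notation (we follow Bishop throughout this series): for a two-class problem with targets t_n ∈ {−1, +1} we use the linear model

$$ y(x) = w^\top \phi(x) + b, $$

and a data point x_n is correctly classified exactly when the sign of the model output matches its target:

$$ t_n\, y(x_n) > 0. $$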
A number of solutions for w and b exist which satisfy the inequality above, and one can be found using, for example, the perceptron algorithm (a single-layer artificial neural network). An SVM instead tries to find the one solution that gives the smallest generalisation error, by maximising the margin: the distance between the decision boundary, y(x) = 0, and the data point lying closest to it.
For a correctly classified point, this distance is

$$ d_n = \frac{t_n\, y(x_n)}{\lVert w \rVert} = \frac{t_n \left( w^\top \phi(x_n) + b \right)}{\lVert w \rVert}. $$
To better understand why, let us take a look at the image below, which illustrates a two-dimensional feature space (with no feature extraction or pre-processing, so that φ(x) = x).
First we have to understand that the weight vector, w, is perpendicular to the decision boundary.
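A one-line argument shows why: for any two points x_A and x_B lying on the decision boundary we have y(x_A) = y(x_B) = 0, and subtracting one from the other gives

$$ w^\top (x_A - x_B) = 0, $$

so w is orthogonal to every vector lying within the boundary.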
Okay, and now let us consider a point x lying above the decision boundary.
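Decomposing x into its orthogonal projection x_⊥ onto the boundary plus a step of signed length r along the unit normal w/‖w‖, and using the fact that x_⊥ lies on the boundary (so w^⊤x_⊥ + b = 0), gives the signed distance directly:

$$ x = x_\perp + r\, \frac{w}{\lVert w \rVert} \quad \Rightarrow \quad y(x) = w^\top x_\perp + b + r\, \lVert w \rVert = r\, \lVert w \rVert \quad \Rightarrow \quad r = \frac{y(x)}{\lVert w \rVert}. $$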
This signed distance r is negative for points below the boundary. Since we want d to be positive on either side of the decision boundary, and for correctly classified points t_n y(x_n) > 0, we take the distance to be

$$ d = \frac{t_n\, y(x_n)}{\lVert w \rVert}. $$
So we want to maximise this distance, d.
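Concretely, the margin is the distance of the point lying closest to the boundary, and we choose the parameters to maximise it; this is the optimisation equation referred to below:

$$ \underset{w,\, b}{\arg\max} \left\{ \frac{1}{\lVert w \rVert} \min_n \left[ t_n \left( w^\top \phi(x_n) + b \right) \right] \right\}. $$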
Okay phew, we’ve gotten this far. So let’s just look at our optimisation equation above. The inverse weight-vector norm has come outside the minimum since it does not depend on the nth point. Two things are happening here: a point is found which lies closest to the decision boundary, and then the parameters are chosen to maximise this distance. Solving this directly would be very complex, but notice that rescaling w → κw and b → κb leaves the distance unchanged; we can use this freedom to set t_n(w^⊤φ(x_n) + b) = 1 for the closest point, which turns the inner minimisation into a constraint:
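$$ t_n \left( w^\top \phi(x_n) + b \right) \geq 1, \qquad n = 1, \dots, N. $$

Maximising 1/‖w‖ subject to these constraints is then equivalent to the quadratic programme

$$ \underset{w,\, b}{\arg\min}\; \frac{1}{2} \lVert w \rVert^2. $$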
And in walks the method of Lagrange multipliers to take us to the next step, which we’ll cover in another article. Let’s just recap what happened here.
We assumed that our data points are linearly separable by a decision boundary, which we then set out to learn. We weren’t happy with any old decision boundary; we wanted the one which maximises the margin, subject to the constraint that all our points are correctly classified. This would be ideal, but of course data is never ideal, and we’ll see how the SVM formulation deals with this in the next article.
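As a quick numerical sanity check, here’s a minimal sketch (not from the original derivation; the toy data and parameter choices are illustrative assumptions) using scikit-learn’s SVC with a linear kernel and a very large C to approximate the hard-margin problem above. If the derivation is right, the closest points should sit at distance 1/‖w‖ from the boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data in two-dimensional feature space
# (illustrative assumption, not from the article).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[2.0, 2.0], scale=0.4, size=(20, 2)),    # class +1
    rng.normal(loc=[-2.0, -2.0], scale=0.4, size=(20, 2)),  # class -1
])
t = np.array([1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin SVM derived above.
clf = SVC(kernel="linear", C=1e6).fit(X, t)

w = clf.coef_[0]       # learned weight vector w
b = clf.intercept_[0]  # learned bias b

# d_n = t_n * y(x_n) / ||w|| for every point: the distances to the boundary.
d = t * (X @ w + b) / np.linalg.norm(w)

print("distance of closest point:", d.min())
print("1 / ||w||                :", 1.0 / np.linalg.norm(w))
# The two numbers agree: the closest points (the support vectors) satisfy
# t_n * (w^T x_n + b) = 1, so the margin equals 1 / ||w||.
```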
[Bishop, Pattern Recognition and Machine Learning]