Support Vector Machines

Aaryan Garg
Vision and Language Group
7 min read · May 14, 2020

You probably know a bit about Machine Learning classifiers, right? The basic aim is to separate training examples belonging to different classes. For example, if we have the heights and weights of a group of people and we want our algorithm to distinguish children from adults, we would find the right combination of height and weight that maps to a person's age, and predict based on that. There are several ways to do this: Logistic Regression, Naive Bayes classification, Decision Trees and k-nearest neighbours, to name a few. What we are going to see in detail in this blog is another one of these techniques, called a Support Vector Classifier. Fancy the name? Not too much, I guess. But I promise you will, by the end of this blog.

Suppose a person puts balls of two different colours, pink and green, on a table in front of you. He then gives you a stick to separate the two colours. You comply and place the stick like this:

These balls represent two different classes and the stick represents the decision boundary

But you could have placed the stick anywhere in the gap between the two colours and still separated them, so why this particular placement? Because we want to separate them the best way possible. That is the difference between a binary logistic regressor and a support vector classifier. A logistic regressor just gives you a valid decision boundary, but a support vector classifier gives you the best decision boundary there is, or, if you want to get a bit more technical, the optimal hyperplane. The optimal hyperplane is preferred because it gives low training error as well as low generalisation error.

However, what we really want to find out is: how does a support vector classifier know which hyperplane is the optimal one? The answer lies in some simple math:

Let us consider the linearly separable examples mentioned above. What we want is to create the widest possible street between the two types of examples that does not touch or contain any point. So, when we choose our optimal hyperplane, we choose the one among the set of separating hyperplanes that is farthest from the closest data points.

Now, let our optimal hyperplane be at a distance b from the origin, and let w be its normal vector. Then, for any point x on the plane, its distance from the origin along the normal w must be equal to b, so the equation of the optimal hyperplane is:
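In symbols, writing w · x for the dot product and using the w and b described above, this is the standard form:

    w \cdot x - b = 0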

vector equation of the optimal hyperplane

Now, taking some margin m and scaling it to 1 for simplicity, we get the equations of the support hyperplanes:

vector equation of the support hyperplane towards y = -1
vector equation of the support hyperplane towards y = +1
Support and the optimal hyperplanes
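With that scaling, the two support hyperplanes take the standard form (using the same w and b as above):

    w \cdot x - b = -1 \quad \text{(support hyperplane towards } y = -1\text{)}
    w \cdot x - b = +1 \quad \text{(support hyperplane towards } y = +1\text{)}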

Now, let us have n training examples, each with D dimensions and a label of either y = +1 or y = -1, and let the examples be linearly separable. To calculate the margin, we compute the distance between the two support hyperplanes.

A bit of vector algebra leads to this result
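That result is the familiar expression for the width of the street in terms of the norm of w:

    \text{margin} = \frac{2}{\lVert w \rVert}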

Now, to merge the two support hyperplane conditions into one, we multiply each by the y of its respective class. Thus, we want all our training examples to satisfy the unified condition:
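For each training example x_i with label y_i, this condition is conventionally written as:

    y_i (w \cdot x_i - b) \ge 1, \quad i = 1, \dots, n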

The points obeying this constraint will lie in the exterior of the margin, that is, outside the street

Now, we want to maximise the margin under the above constraint, because maximum-margin classification is the defining characteristic of a Support Vector Classifier. Thus, we have reached an optimisation problem. But we can't use plain gradient descent to solve it, because this is a constrained optimisation problem, and it is better to use Lagrange multipliers to find the optimal solution. We first transform the maximisation problem into a minimisation problem, since maximising the margin is the same as minimising the squared L2 norm of w:

The squared L2 norm is preferred over the L1 norm because it is smooth and convex, which makes the optimisation more stable.
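So the problem we want to solve, in its usual constrained form, is:

    \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i (w \cdot x_i - b) \ge 1 \;\; \forall i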

So, our Lagrangian turns out to be:
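Introducing a multiplier λ_i ≥ 0 for each constraint, it takes the standard form:

    L(w, b, \lambda) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{n} \lambda_i \left[ y_i (w \cdot x_i - b) - 1 \right]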

As you may have guessed, this is the primal optimisation problem. To transform it into the dual optimisation problem, we set the derivatives of the Lagrangian with respect to w and b to zero, substitute the resulting expression for w back into the Lagrangian, and then maximise with respect to λ. After doing some algebra, we obtain:
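Concretely, the zero-derivative conditions give w = Σ_i λ_i y_i x_i and Σ_i λ_i y_i = 0, and substituting back yields the standard dual problem:

    \max_{\lambda} \; \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j (x_i \cdot x_j)
    \text{subject to} \quad \lambda_i \ge 0, \quad \sum_{i=1}^{n} \lambda_i y_i = 0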

Hence, we have found our optimal hyperplane for linearly separable classes. But what if the person places a pink ball like this:

The yellow examples were misclassified due to the presence of an outlier.

As shown in the above animation, the presence of an outlier can change your decision boundary and cause many examples to be misclassified as y = -1 despite being closer to y = +1, or vice-versa. This is a drawback of the hard-margin classification described above, i.e. it is highly sensitive to outliers. To solve this, we use soft-margin classification, in which we allow some examples to be misclassified in exchange for a better bias-variance trade-off.

Mathematically, we relax the constraint by introducing a slack variable ξ for each example. So, now the constraint is:
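With a slack ξ_i ≥ 0 attached to each example, the relaxed constraint is usually written as:

    y_i (w \cdot x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0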

And our Loss function looks like:
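In its common form, the soft-margin objective penalises the total slack alongside the norm of w:

    \min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i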

Here, C is a hyper-parameter.

Here, the role of C is much greater than one assumes at first. C controls how soft the margin of an SVC is, i.e. how many misclassifications are tolerated: a small C makes the classifier more liberal towards misclassifications, while a large C penalises slack heavily and pushes it back towards a hard margin. Now, when we merge the loss function and the constraint, we obtain an unconstrained loss function:
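Taking the smallest feasible slack for each example, ξ_i = max(0, 1 - y_i (w · x_i - b)), gives the unconstrained form:

    \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} \max\left(0, \; 1 - y_i (w \cdot x_i - b)\right)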

This closely resembles the hinge loss. The weights can therefore be calculated easily using gradient descent.
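As a rough illustration (my own sketch, not code from this post), here is a minimal sub-gradient descent routine for that hinge-loss objective; the learning rate, epoch count, C and the toy data are arbitrary choices:

    import numpy as np

    def train_linear_svc(X, y, C=1.0, lr=0.01, epochs=500):
        # Minimise 0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w.x_i - b))
        # by sub-gradient descent. X: (n, d) array, y: (n,) array of -1/+1 labels.
        n, d = X.shape
        w = np.zeros(d)
        b = 0.0
        for _ in range(epochs):
            margins = y * (X @ w - b)          # y_i * (w.x_i - b) for every example
            active = margins < 1               # points inside the street or misclassified
            grad_w = w - C * (y[active][:, None] * X[active]).sum(axis=0)
            grad_b = C * y[active].sum()       # derivative of -y_i*(w.x_i - b) w.r.t. b is +y_i
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    # Tiny made-up dataset, just to show the call
    X = np.array([[-2.0, -2.0], [-3.0, -1.0], [-2.0, -3.0],
                  [ 2.0,  2.0], [ 3.0,  1.0], [ 2.0,  3.0]])
    y = np.array([-1, -1, -1, 1, 1, 1])
    w, b = train_linear_svc(X, y)
    print("weights:", w, "bias:", b)
    print("predictions:", np.sign(X @ w - b))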

Now, what if under no circumstance can the classes be linearly separated without misclassifying a lot of examples?

In such cases, we project the features into a higher-dimensional feature space, in which the examples become linearly separable.

We increase the feature space from 2D to 3D by adding a third dimension Z, such that Z² = X² + Y²

In the above example, the feature space was projected from 2D to 3D, i.e. from (x, y) to (x, y, z). The green points, having larger z, rise higher along the Z axis than the pink points, as shown, and a plane is then able to separate the two classes. Doing this mapping implicitly, by working only with inner products between examples, is the fabled “Kernel Trick”, and it is this trick that turns a support vector classifier into a support vector machine.
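As a small illustration of the same idea (my own sketch, with a made-up toy dataset), the concentric rings from scikit-learn's make_circles become linearly separable once we append z = x² + y² as a third feature, while an RBF-kernel SVC performs an implicit mapping on the raw 2D points:

    import numpy as np
    from sklearn.datasets import make_circles
    from sklearn.svm import SVC

    # Two concentric rings: not linearly separable in the original 2D space
    X, y = make_circles(n_samples=200, factor=0.4, noise=0.05, random_state=0)

    # Explicit mapping: add z = x^2 + y^2 (a monotone function of the radius),
    # after which a *linear* SVC can separate the rings with a plane
    X3 = np.column_stack([X, (X ** 2).sum(axis=1)])
    linear_in_3d = SVC(kernel="linear", C=1.0).fit(X3, y)

    # Kernel trick: the RBF kernel does an implicit (infinite-dimensional) mapping,
    # using only inner products / distances between the original 2D points
    rbf_in_2d = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

    print("linear SVC on explicit 3D features:", linear_in_3d.score(X3, y))
    print("RBF SVM on raw 2D features:        ", rbf_in_2d.score(X, y))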

The function that computes these inner products in the higher-dimensional feature space, without ever constructing the mapping explicitly, is called the Kernel. There are many different types of kernels which can be used depending on the requirement:

1. Gaussian RBF Kernel:

Gaussian kernels are particularly good at producing circles, spheres and hyper-spheres as decision boundaries. They are also the usual first choice for non-linear decision boundaries, because many real-world quantities are approximately Gaussian-distributed (see the Central Limit Theorem for more information). However, they implicitly map into an infinite-dimensional feature space, which some consider a disadvantage.
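The Gaussian RBF kernel is usually written with a bandwidth σ (or, equivalently, γ = 1/(2σ²)):

    K(x, x') = \exp\left( - \frac{\lVert x - x' \rVert^2}{2\sigma^2} \right)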

2. Polynomial Kernel:

Polynomial kernels are used in specific cases, typically when you know the degree of the decision boundary, because they map into a finite-dimensional feature space, which gives them an advantage over the likes of the Gaussian RBF kernel.
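A common form of the polynomial kernel, with degree d and a free constant c:

    K(x, x') = (x \cdot x' + c)^d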

3. Laplace Kernel:

The Laplacian kernel is closely related to the Gaussian RBF kernel, but it uses the L1 (Manhattan) distance between points instead of the squared Euclidean distance. This makes it less sensitive to the choice of the bandwidth parameter, so it is often tried when the data are noisy or contain outliers.
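It is typically written as follows (some texts use the L2 norm instead of the L1 norm):

    K(x, x') = \exp\left( - \frac{\lVert x - x' \rVert_1}{\sigma} \right)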

4. Sigmoid Kernel:

It is interesting to note that an SVM model using a sigmoid kernel is equivalent to a two-layer perceptron neural network. This kernel is quite popular for support vector machines due to its origin in neural network theory.
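The sigmoid kernel is usually written with a slope α and an intercept c:

    K(x, x') = \tanh\left( \alpha \, (x \cdot x') + c \right)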

Interesting facts:

  • Since the closest examples determine the support hyperplanes, they act as supports for the optimal hyperplane: if they are shifted, the optimal hyperplane itself shifts. That is why their position vectors are called Support Vectors, and that is where the algorithm gets its name. Fancy it now?
  • RBF in the Gaussian RBF Kernel stands for Radial Basis Function: the kernel depends only on the distance between two points, which is what lets the familiar Gaussian bump generalise to data in any number of dimensions.
