Support Vector Machines and the Kernel Trick

Aditya Varshney · Published in The Startup · Nov 13, 2020

The Support Vector Machine (SVM) is a supervised learning algorithm initially proposed by Vladimir Vapnik and his colleagues in 1992. It is one of the most widely used algorithms for classification tasks, although it can handle regression tasks as well. This article aims to summarize and give an in-depth overview of the mathematical concepts on which the Support Vector Machine has been developed.

A few good-to-know terms:

Feature Space

Feature space is the space representing all the features of the data. If our data has 2 features, the feature space is an X-Y plane; similarly, data with 3 features can be represented using the X, Y, and Z dimensions.

Hyperplane

Hyperplanes are used to define boundaries between the various classes. For example, a hyperplane in ℝ² is a line, while a hyperplane in ℝ³ is a plane.

Figure 1: A visual representation of a hyperplane in 2D and 3D respectively

An ℝ² feature space therefore has a 1-D hyperplane, and an ℝ³ feature space has a 2-D hyperplane.

Support Vectors

Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation.

In contrast to other learning algorithms, which aim to learn the differences between the various classes, an SVM aims to learn the similarities between them.

Intuition behind Support Vector Machine

The objective of a Support Vector Machine is to find the hyperplane that separates the classes with the maximum distance to the nearest data points of each class.

Figure 2: Points present on the dashed line are called the support vectors and distance from the dashed line to the optimal hyperplane(red line) is known as the margin

For a given training dataset of n points of the form

(x₁, y₁), …, (xₙ, yₙ)

where yᵢ is either 1 or -1, indicating the class to which the point xᵢ belongs, any separating hyperplane can be written using the equation of a plane

w · x - b = 0

As for the support vectors, they lie on the two parallel hyperplanes defined by the equations

w · x - b = 1

and

w · x - b = -1

for the positive and negative class respectively. Requiring every training point to satisfy the corresponding inequality (w · xᵢ - b ≥ 1 for yᵢ = 1 and w · xᵢ - b ≤ -1 for yᵢ = -1) prevents data points from falling inside the margin; here w is the normal vector of the hyperplane, and the width of the margin is 2/‖w‖.

The basic steps to create a Support Vector Machine can be divided into 3 major tasks:

  1. Select two parallel hyperplanes that separate the data with no points between them.
  2. Maximize the margin between them.
  3. Find the line midway between the two hyperplanes from step 1. This line is called the decision boundary (the red line in Figure 2); the short sketch below shows how to read these quantities off a fitted model.
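As a minimal sketch of these steps (my own illustration, not code from the article, using a small made-up dataset), a scikit-learn SVC with a linear kernel exposes the hyperplane's normal vector, its offset, and the support vectors after fitting:

import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (toy data for illustration only)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel='linear', C=1000)  # a large C approximates a hard margin
clf.fit(X, y)

w = clf.coef_[0]        # normal vector of the decision boundary
b = clf.intercept_[0]   # offset (scikit-learn uses the convention w · x + b = 0)
print("w =", w, "b =", b)
print("support vectors:\n", clf.support_vectors_)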

The aim of the SVM is to maximize the margin between the data points and the hyperplane. This is done by defining and minimizing a suitable loss function.
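The article does not write this loss out, but a standard choice (given here as background, not necessarily the exact formulation the author had in mind) is the soft-margin objective built on the hinge loss:

minimize over w, b:  λ‖w‖² + (1/n) Σᵢ max(0, 1 - yᵢ(w · xᵢ - b))

The second term is zero for points on the correct side of the margin and grows linearly for points that violate it, while the ‖w‖² term keeps the margin 2/‖w‖ as wide as possible.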

Facing Non-Linear Data

So far we have worked with linear data, where creating a hyperplane is quite intuitive. What happens when the data cannot be separated by a linear decision boundary, such as in the example given below?

Figure 3: Non Linear data in 2D

One way is to map all the data points to a higher dimension (in this case, three dimensions), find the boundary there, and then make the classification. Applied to the above data, this may look like:

Figure 4: The same NonLinear Data when plotted in a Higher Dimension

This mapping of our data to a higher dimension has made the task of creating a decision boundary quite easy, but a major trade-off is that, due to the introduction of a new dimension, the calculations become more tedious.
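To make the idea concrete, here is a small sketch of my own (not from the article): for roughly circular data like Figure 3, explicitly adding the squared distance from the origin as a third feature is enough to make the classes linearly separable, so a plain linear SVM can find the boundary in 3-D:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy data: an inner cluster near the origin and an outer ring around it
inner = rng.normal(scale=0.5, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] + rng.normal(scale=0.2, size=(50, 2))
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# Explicit mapping (x1, x2) -> (x1, x2, x1² + x2²)
X_3d = np.c_[X, (X ** 2).sum(axis=1)]

clf = SVC(kernel='linear').fit(X_3d, y)
print("training accuracy:", clf.score(X_3d, y))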

Support Vector Machines can easily be implemented using the scikit-learn library in Python. Below is the bare-minimum code, following the scikit-learn documentation.

from sklearn.svm import SVC

# RBF-kernel SVM; C controls regularization strength, gamma the kernel width
classifier = SVC(kernel='rbf', C=0.1, gamma=0.1)
classifier.fit(X_train, y_train)  # X_train: feature matrix, y_train: class labels
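For completeness, here is a minimal end-to-end sketch (my own, using scikit-learn's make_moons as stand-in data, since the article does not define X_train and y_train); note that the RBF kernel handles the non-linear boundary without any explicit mapping:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy non-linear dataset standing in for the training data
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifier = SVC(kernel='rbf', C=0.1, gamma=0.1)
classifier.fit(X_train, y_train)
print("test accuracy:", classifier.score(X_test, y_test))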

Kernel Trick

In order to overcome this, we use a mathematical tool called the kernel trick, which allows us to operate in the original feature space without spending computational resources on tedious calculations in the higher dimension.

A kernel effectively computes the dot product <f(x), f(y)> of the mapped points. The naive way to obtain this value is to first calculate f(x) and f(y) explicitly and then take their dot product; the kernel trick produces the same number directly from x and y.

For example let x and y be defined as

x = (x1, x2, x3)

y = (y1, y2, y3).

The mapping to 9 dimensions would be

f(x) = (x1x1, x1x2, x1x3, x2x1, x2x2, x2x3, x3x1, x3x2, x3x3)

We can define a kernel that is equivalent to taking the dot product of these mappings:

K(x, y) = (<x, y>)²

Let’s work through a numerical example to get a better intuition of what is being done.

Let,

x = (1, 2, 3)

y = (4, 5, 6).

Then:

f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9)

f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36)

Calculating <f(x), f(y)> gives us

16 + 40 + 72 + 40 + 100+ 180 + 72 + 180 + 324 = 1024

Instead of doing so many calculations, we can apply the kernel directly:

K(x, y) = (4 + 10 + 18)² = 32² = 1024

which makes our computations far easier.
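The same check can be done in a few lines of code (a sketch of my own, just to verify the worked example above):

import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# Explicit mapping f(v): all pairwise products v_i * v_j (9 dimensions)
def f(v):
    return np.outer(v, v).ravel()

explicit = np.dot(f(x), f(y))   # <f(x), f(y)> computed in 9 dimensions
kernel = np.dot(x, y) ** 2      # K(x, y) = <x, y>² computed in 3 dimensions

print(explicit, kernel)         # both print 1024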

This was a very basic kernel. Real-life problems require much more complex kernels, but the benefit is the same: we get the effect of a very high-dimensional mapping without ever computing it explicitly.

Types of Kernels:

Kernels are chosen based on the data we are using. Some popular kernels are listed below; a short code sketch after this section shows how they map onto scikit-learn's SVC.

  1. Linear Kernel: one of the simplest and most straightforward kernels, used mainly for data that is linearly separable. It is defined as the inner product <x, y>, optionally with a constant c added.
  2. Polynomial Kernel: useful when handling normalized data.
  3. Gaussian Kernel: also known as the RBF kernel. It can be expanded to infinite dimensions with the help of a Taylor series expansion. The γ parameter defines how much influence a single training example has: the larger it is, the closer other examples must be to be affected.

  4. Hyperbolic Tangent (Sigmoid) Kernel: generally used in neural networks; it is governed by the equation K(x, y) = tanh(αxᵀy + c), where α is a slope parameter and c an offset.

  5. Bayesian Kernel: a probabilistic kernel defined in terms of the conditional class probabilities of the individual features; it is mainly useful when such probabilities can be estimated from the data.

Apart from the above-mentioned kernels, there exists a variety of others, such as the spherical, wavelet, and Cauchy kernels, at our disposal to create decision boundaries better suited to the data. Once you are aware of them, you can decide whether a particular kernel is well suited to the problem at hand.
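As a rough guide (my own mapping, not part of the article), the first four kernels above correspond directly to values of SVC's kernel argument in scikit-learn; the sigmoid kernel is scikit-learn's name for the hyperbolic tangent kernel, and anything more exotic can be supplied as a custom callable:

from sklearn.svm import SVC

linear_svm  = SVC(kernel='linear')                   # <x, y>
poly_svm    = SVC(kernel='poly', degree=3, coef0=1)  # (gamma * <x, y> + coef0)^degree
rbf_svm     = SVC(kernel='rbf', gamma='scale')       # exp(-gamma * ||x - y||²)
sigmoid_svm = SVC(kernel='sigmoid', coef0=0.0)       # tanh(gamma * <x, y> + coef0)

# A custom kernel can also be passed as a callable, e.g. SVC(kernel=my_kernel)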

Choosing the perfect kernel

Choosing which kernel will map your data so as to create the most appropriate and robust decision boundary is one of the most time-consuming and confusing tasks. A good way to start is to try the linear kernel if the data looks linear. If the data does not seem to be linear, a polynomial or Gaussian RBF kernel (see sklearn.gaussian_process.kernels.RBF in the scikit-learn documentation) may be able to handle the task, together with tuning the regularization parameter C. If C is high, the optimizer will choose a smaller-margin hyperplane, so the misclassification rate on the training data will be lower. If C is low, the margin will be large, even if some training examples end up misclassified.
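In practice this search over kernels and parameters is often automated; below is a small sketch of my own (hypothetical parameter ranges, reusing X_train and y_train from the earlier snippet):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [0.1, 1, 10, 100],           # higher C: smaller margin, fewer training errors
    'gamma': ['scale', 0.01, 0.1, 1],
}

# 5-fold cross-validated grid search over kernel, C and gamma
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)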

Summary

This article gives a brief insight into how tightly the fields of computer science and mathematics are linked. I personally feel computer science is a tool for implementing mathematical concepts whose calculations a human would struggle to carry out by hand. With the help of the kernel trick, we are able to perform calculations in dimensions we cannot even imagine while decreasing the computational complexity at the same time. However, one critical thing to keep in mind is that when we map data to a higher dimension, there is a chance of overfitting the model. Thus, choosing the right kernel function (including the right parameters) and regularization are of great importance.

I hope this article gives you a brief yet holistic overview of one of the most popular classification algorithms in Supervised Learning.

