Support Vector Machine

Rishabh Mall
5 min read · Jan 6, 2019



A Support Vector Machine (SVM) is a discriminative classifier formally defined by a separating hyperplane. In other words, given labeled training data (supervised learning), the algorithm outputs an optimal hyperplane which categorizes new examples. In two-dimensional space this hyperplane is a line dividing the plane into two parts, with each class lying on either side.

The SVM algorithm is designed to look for the points on the graph that lie closest to the dividing line. These points are called support vectors. The algorithm then calculates the distance between the support vectors and the dividing plane. This distance is called the margin. The main goal of the algorithm is to maximize the margin: the best hyperplane is the one for which this margin is as large as possible.
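As a rough sketch of this idea in code (the tiny dataset below is made up purely for illustration), scikit-learn's SVC exposes the support vectors it finds after fitting:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up toy dataset: two small clusters of labeled points.
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 8]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVM looks for the separating line with the largest margin.
clf = SVC(kernel="linear")
clf.fit(X, y)

# The training points closest to the dividing line are the support vectors.
print(clf.support_vectors_)
```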

Pretty simple, isn’t it?

But let’s consider the following example, with a more complex dataset that cannot be divided linearly.

Obviously, this dataset cannot be divided linearly: we cannot draw a straight line that would classify this data. But it can be made linearly separable by adding an additional dimension, which we call the Z axis. Imagine that the coordinates on the Z axis are governed by the following constraint:

z = x²+y²

Thus, the Z coordinate of each point is the square of its distance from the origin.
Below is a visualization of the same dataset along the Z axis.
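To make the trick concrete, here is a small sketch that builds the z = x²+y² coordinate by hand on synthetic circular data (the dataset and parameters are my own illustration, not from the original figures):

```python
import numpy as np
from sklearn.datasets import make_circles

# Synthetic data: one class forms an inner circle, the other an outer ring.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the extra dimension z = x^2 + y^2 (the squared distance from the origin).
z = X[:, 0] ** 2 + X[:, 1] ** 2
X_3d = np.column_stack([X, z])

# Along the new z axis the two classes no longer overlap,
# so a plane z = k can separate them.
print("z range of class 0:", z[y == 0].min(), z[y == 0].max())
print("z range of class 1:", z[y == 1].min(), z[y == 1].max())
```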

Now the data can be divided linearly. Suppose the magenta line separating the two classes is z = k, where k is a constant. Since

z = x²+y²

it follows that

k = x²+y²

which is the equation of a circle. In this way, we can project our linear separator back into the original number of dimensions using this transformation.

As a result, we can classify a non-linear dataset by adding an additional dimension to it and then bringing it back to its original form with a mathematical transformation. However, not every dataset can be transformed with the same ease. Fortunately, the implementation of this algorithm in the sklearn library solves this problem for us.
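A minimal sketch of what that looks like in practice, assuming a synthetic circular dataset and an RBF kernel (both chosen here for illustration):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# The same kind of circular, non-linearly separable data as above.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The RBF kernel lets the SVM handle the extra dimension implicitly.
clf = SVC(kernel="rbf", gamma="scale", C=1.0)
clf.fit(X_train, y_train)

print("test accuracy:", clf.score(X_test, y_test))
```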

Difference between Linear and Non-Linear Data:

In figure A we can separate the target labels linearly with a line (Support Vector Machines classify with such a decision line). A linear classifier can do this with a linear combination of the features. We could use Support Vector Machines to build such a model, but we could also use many other linear classification methods, such as logistic regression.

In figure B we cannot separate the target labels linearly; the data is divided in a more complex way, so we cannot simply use a linear classification method. Fortunately, Support Vector Machines can do both linear and non-linear classification. Let's first take an easier linear example to get an introduction to Support Vector Machines. Later we will look at non-linear classification with Support Vector Machines and see how it works with the kernel trick.

Working of SVM:

So far we have just been drawing a line to separate the differently labeled classes from each other. But how does a Support Vector Machine solve this problem? The SVM wants to find the so-called maximum-margin hyperplane.

The hyperplane is the line with the biggest margin to both groups. We called this line the decision line above, but the mathematically correct term is hyperplane, because in more than two dimensions it will no longer be a line.

We give the Support Vector Machine algorithm a bunch of labeled vectors as a training set. All vectors are p-dimensional, where p is the number of features in our training set. To find the maximum-margin hyperplane, we have to maximize the margin to the nearest points of each target group. In a binary classification, we can declare the labels of the two target groups as -1 and 1. The hyperplane can then be described as the set of points x satisfying

w · x - b = 0

where w is the normal vector to the hyperplane and b is a bias. A normal vector is simply a vector that stands orthogonal to a line or plane. If you are familiar with linear algebra, this may look familiar to you: it is like the Hesse normal form, except that w does not have to be a unit vector. The parameter b/‖w‖ determines the offset of the hyperplane from the origin along the normal vector w. With the hyperplane (decision line) the model can now classify new entries.
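As an illustrative sketch, a fitted linear SVM in scikit-learn exposes the normal vector w as coef_ and the bias as intercept_; note that scikit-learn's convention is w · x + b rather than w · x - b:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up linearly separable training vectors with labels -1 and 1.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [6.0, 6.0], [7.0, 6.5], [6.5, 7.5]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

w = clf.coef_[0]       # normal vector of the hyperplane
b = clf.intercept_[0]  # scikit-learn's bias term (w . x + b convention)

# Classify a new entry by checking which side of the hyperplane it falls on.
x_new = np.array([5.0, 5.0])
print("predicted label:", int(np.sign(w @ x_new + b)))
```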

There are actually different sub-classifiers that behave differently. The soft-margin classifier allows some noise in the training data, while the hard-margin classifier does not allow any noise in the training data.
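In scikit-learn this trade-off is controlled by the C parameter: a small C gives a soft margin that tolerates noisy points, while a very large C approximates a hard margin. A rough sketch with made-up noisy data:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up data with two noisy points sitting near the wrong group.
X = np.array([[1, 1], [2, 1], [1, 2], [4.5, 4.5],
              [5, 5], [6, 5], [5, 6], [1.5, 1.5]])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])

# Small C: a soft margin where violations are cheap, so the margin stays wide.
soft = SVC(kernel="linear", C=0.01).fit(X, y)

# Very large C: violations are penalised heavily, approximating a hard margin.
hard = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors (soft margin):", len(soft.support_vectors_))
print("support vectors (hard margin):", len(hard.support_vectors_))
```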

How does the kernel method work?

The kernel method relies on a so-called kernel function. This function maps the non-linearly separable input space into a higher-dimensional, linearly separable feature space, and in this new feature space the Support Vector Machine can work as normal. The kernel method then maps the solution back, so that in the non-linearly separable input space you get a non-linear solution.

In the example above we have a two-dimensional feature space that is not linearly separable. With the kernel function we can map the input space into a three-dimensional feature space, where we can separate the training set linearly. When we map the solution back to the input space we get a non-linear decision boundary.
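One way to see the kernel trick in code is to hand SVC a kernel function directly. The quadratic kernel below, k(a, b) = (a · b)², corresponds to a feature space built from squared coordinates, closely related to the z = x²+y² mapping above (the kernel choice and data are my own illustration):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

def quadratic_kernel(A, B):
    # k(a, b) = (a . b)^2, evaluated for every pair of rows in A and B.
    return (A @ B.T) ** 2

# The SVM works in the higher-dimensional feature space implied by the kernel
# without ever computing the mapped coordinates explicitly.
clf = SVC(kernel=quadratic_kernel).fit(X, y)
print("training accuracy:", clf.score(X, y))
```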
