Support Vector Machines. Unwound.

Ajay Unagar
Data Science Group, IITR
Mar 25, 2017

After covering 4 of the 12 algorithms in our 12 Algos in 12 Days series, here is one of the most popular and most talked-about algorithms since its development. The thing I like best about SVM is that it is built entirely on a mathematical optimization problem.

Don’t look at the trees. Focus on Supports for now :D

Support Vector Machine is a supervised machine learning algorithm that can be used for both classification and regression problems. SVM is a generalization of a simple classifier called the maximum margin classifier.

Mathematics behind the Algorithm:

The mathematics of SVM is a little bit tricky, and I cannot avoid it. Sorry for the extra maths in this blog. But then one should remember,

Small minds discuss persons. Average minds discuss events. Great minds discuss ideas. Really great minds discuss mathematics.

The goal of SVM is to identify an optimal separating hyperplane which maximizes the margin between different classes of the training data.

Hyperplane: It is basically a generalization of a plane.

  • in one dimension, a hyperplane is called a point
  • in two dimensions, it is a line
  • in three dimensions, it is a plane
  • in more dimensions you can call it a hyperplane
The green one is a plane! Sadly, we can't even visualize a hyperplane. Dimensionality sucks ;/

We can define the margin in two ways.

  • Functional Margin: Let's focus on a binary classification problem with labels {-1, 1} for now. We'll write our classifier as

g(z) = 1 if z ≥ 0, -1 otherwise

The functional margin of the i-th observation is defined below (see the formulas right after this list).

The functional margin of the classifier is then the smallest of the functional margins of the individual training examples.

  • Geometric Margin: It is the functional margin scaled by the norm of the weight vector, i.e. the actual distance of a point from the hyperplane; it is also defined in the formulas below.
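
In symbols, with w and b the weight vector and intercept of the hyperplane, and (x_i, y_i) the i-th of m training examples (y_i ∈ {-1, 1}), one standard way to write these definitions (following the notation of Andrew Ng's notes, referenced later in this post) is:

```latex
% Functional margin of the i-th training example:
\hat{\gamma}_i = y_i \left( w^\top x_i + b \right)

% Functional margin of the classifier: the smallest functional margin
% over all m training examples:
\hat{\gamma} = \min_{i = 1, \ldots, m} \hat{\gamma}_i

% Geometric margin: the functional margin normalized by the norm of w,
% i.e. the signed Euclidean distance of x_i from the hyperplane:
\gamma_i = y_i \left( \frac{w^\top x_i + b}{\lVert w \rVert} \right)
```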

Optimal Separating Hyperplane: The idea is to choose the hyperplane that has the maximum margin from both classes of the training data. Maximizing the distance between the hyperplane and the nearest points of each class (the minimum margin over all the examples) results in the optimal separating hyperplane.

All in one!

Support Vectors: Support vectors are simply the data points that lie nearest to the optimal separating hyperplane.

How do we find this optimal separating hyperplane?

So, basically, what we want to do is maximize the geometric margin, subject to every training example having a functional margin at least that large.

But now, we can set the functional margin to 1, because that is nothing but a rescaling of w and b, which should not affect the optimization problem. So this maximization problem turns into a minimization problem, shown below.
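
Written out in the same notation, the margin maximization and the equivalent minimization problem can be sketched as:

```latex
% Maximize the geometric margin, subject to every training example
% having functional margin at least gamma-hat:
\max_{\hat{\gamma}, w, b} \; \frac{\hat{\gamma}}{\lVert w \rVert}
\quad \text{s.t.} \quad y_i \left( w^\top x_i + b \right) \ge \hat{\gamma}, \quad i = 1, \ldots, m

% Fixing the functional margin to 1 (just a rescaling of w and b)
% turns this into an equivalent convex minimization problem:
\min_{w, b} \; \frac{1}{2} \lVert w \rVert^2
\quad \text{s.t.} \quad y_i \left( w^\top x_i + b \right) \ge 1, \quad i = 1, \ldots, m
```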

I know by now you might be feeling a little overwhelmed.

Okay I admit it. This was too much, but much needed as well. After this you can read Andrew Ng’s notes on SVM.

What if Data is not completely separable?

In practice, real data is messy and cannot be separated perfectly with a hyperplane. In this case, there are no two parallel hyperplanes that separate the classes with no points between them.

So, the constraint of maximizing the margin of the line that separates the classes must be relaxed. This is called the soft margin classifier. It allows some points in the training data to violate the separating line. A tuning parameter called C is introduced, which controls the amount of margin violation allowed.

So, the relaxed problem now looks like this:
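
One standard way to write it is with slack variables ξ_i that measure how much each example violates the margin:

```latex
% Soft-margin SVM: each slack variable xi_i allows example i to violate
% the margin, and C controls how heavily those violations are penalized.
\min_{w, b, \xi} \; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{s.t.} \quad y_i \left( w^\top x_i + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0
```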

Here C is a penalty parameter. This is called a soft margin classifier because we allow data of one class to fall in the region of the other class, but we put a penalty on it. The smaller the value of C, the less sensitive the algorithm is to the training data: it tries to find the separating plane with the largest margin, which results in lower variance but higher bias. On the other hand, the larger the value of C, the more sensitive the algorithm is to the training data, because the plane tries to classify every training example correctly, which results in higher variance and lower bias.

What if Data is not Linearly separable?

In the case of non-linearly separable data, SVM uses the kernel trick. The idea stems from the fact that if the data cannot be partitioned by a linear boundary in its current dimension, projecting it into a higher-dimensional space may make it linearly separable.

The function that lets us work in this higher dimension is called a kernel function; it computes inner products in that space without ever projecting the points explicitly.

Types of Kernels
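
The common choices, written roughly the way scikit-learn defines them (γ, r and d are kernel parameters), are:

```latex
% Linear kernel: a plain inner product, no projection at all.
K(x_i, x_j) = x_i^\top x_j

% Polynomial kernel of degree d (r is an optional constant term).
K(x_i, x_j) = \left( \gamma \, x_i^\top x_j + r \right)^d

% RBF (Gaussian) kernel.
K(x_i, x_j) = \exp\left( -\gamma \, \lVert x_i - x_j \rVert^2 \right)
```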

where x_i and x_j are observation points.

Here, the controlling parameter is gamma for all the kernels except the linear one. In my experience, RBF is the most popular kernel choice for SVM.

SVM for Multiclass Classification:

We have talked about the binary classification problem using SVM. How best to extend it to multiclass classification is still an area of ongoing research.

Mainly there are two methods for Multiclass classification in SVM.

  • One-against-one method: This method constructs k(k − 1)/2 classifiers, where each one is trained on data from two classes.
  • One-against-all method: It constructs k SVM models, where k is the number of classes. The mth SVM is trained with all of the examples in the mth class as positive labels, and all other examples as negative labels. A short scikit-learn sketch of both strategies follows this list.
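
As a rough sketch of how the two strategies map onto scikit-learn (the toy iris data here is just for illustration; SVC already uses one-against-one internally for multiclass problems, while OneVsRestClassifier implements the one-against-all strategy around any binary estimator):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes, so k = 3

# One-against-one: SVC handles multiclass this way internally,
# training k(k-1)/2 = 3 binary classifiers behind the scenes.
ovo_clf = SVC(kernel="rbf", C=1.0).fit(X, y)

# One-against-all: wrap a binary SVM so that k = 3 models are trained,
# each separating one class from all the others.
ova_clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)

print(ovo_clf.predict(X[:5]), ova_clf.predict(X[:5]))
```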

Is it really beautiful?

Pros:

  • It is really effective in high dimensions. If you have more features than training examples, most algorithms perform poorly, but SVM is one of the few that can still save you in this situation.
  • It works best if your data are separable, i.e. the two classes are not mixed.
  • Only the support vectors affect the optimal separating hyperplane, so it is less affected by outliers.

Cons:

  • On large datasets it takes a lot of time, mainly because of the kernel function calculations and finding the optimal hyperplane in higher dimensions.
  • It does not perform well when classes overlap heavily.
  • It only gives you a hard 0–1 classification. Probability estimates are expensive to compute.

PARAMETERS:

  • kernel: ‘linear’, ‘poly’ (polynomial) or ‘rbf’. The type of kernel you want to use.
  • C: penalty on data points that violate the margin or end up on the wrong side of the separating hyperplane.
  • gamma: kernel parameter. Not applicable to the linear kernel though.
  • degree: for the polynomial kernel, the degree of the polynomial you want to use.

Yes, SVM has very few parameters to tune, and it often works well with the default values.

Implementation in Python:

In Python, scikit-learn is a widely used library for implementing machine learning algorithms, and SVM is also available in it. SVM assumes that your inputs are numeric; if your data have categorical inputs, you may need to convert them to binary dummy variables. Here, SVM is applied to an image segmentation dataset in which 19 continuous variables are used to predict the type of image. A minimal sketch of the implementation is shown below.
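
In this sketch, the file name segmentation.csv, the CLASS column, and the train/test split settings are assumptions for illustration, not the exact code behind the reported accuracy:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the image segmentation data: 19 continuous features, one class label.
# "segmentation.csv" and the "CLASS" column name are placeholders.
data = pd.read_csv("segmentation.csv")
X = data.drop(columns=["CLASS"])
y = data["CLASS"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling the features usually helps SVM, since it is distance based.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Fit an SVM classifier and check accuracy on the held-out set.
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X_train, y_train)
print("Accuracy =", accuracy_score(y_test, clf.predict(X_test)))
```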

Accuracy = 0.94

Tuning parameters:
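
A grid search along the following lines can be used to pick C and the kernel, reusing X_train and y_train from the sketch above; the candidate values in the grid are an assumption, not the exact grid from the original run:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for C and the kernel; the grid itself is illustrative.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
}

# 5-fold cross-validated grid search over all parameter combinations.
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
```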

The best parameters: C=100 and kernel=”linear”

This is a simple Python implementation of SVM using the sklearn library.

REFERENCES:

Footnotes:

This is so far the longest blog of the series. It has tried to touch on the basics of each aspect of Support Vector Machines. To understand them thoroughly, refer to the links above.

Thanks for reading :)

Hit ❤ if you enjoy reading this.

Co-author: Roopkishor.
