A Road to Support Vector Machines!

A simple way to get acquainted with SVMs

Shreya Maheshwari
7 min read · Nov 13, 2020

Let’s visualize what a Support Vector Machine is in layman’s terms, so that the next time you come across this term you are no longer scared of it.

Photo by Jake Bittle

SVM is similar to the road in the image above: it separates the trees on both sides with the widest lane possible. A support vector machine is what finds this widest gap between the two sides of the road. The trees and the road sign closest to the road are the support vectors.

So, now you can remember it as a “widest road machine”!

Introduction

A Support Vector Machine (SVM) is a supervised learning algorithm that helps with classification, regression, and outlier detection on structured, semi-structured, and unstructured data, unlike other classification algorithms such as logistic regression. It produces significant accuracy and handles non-linear data efficiently. It can be used for both classification and regression problems, but it is preferred for classification, and it is intrinsically a two-class classifier.

So, if you want to classify text data or images, the Support Vector Machine is a beautiful algorithm that handles complex data well. It is highly preferred since it produces significant accuracy with less computing power.

The idea is pretty simple: it aims at creating a hyperplane that separates, or classifies, the data into classes; in other words, it builds a road that separates the two sides!

Let’s hit the road!! :)

Starting with an example:

Let’s say we have a classification problem that classifies data points based on two parameters (x1 and x2):

Now you can see we have two groups: a group of red circles on the left and a group of green crosses on the right (say, positives and negatives).

Let’s take a new data point and find out which class it belongs to. For that, we first need to find a separation between the two classes.

The separation here will be a 1-D line.

When the number of parameters increases (x1, x2, x3), the dimension of the graph increases too, and a line alone will no longer be a proper separator for data represented in more dimensions. So a hyperplane is used to separate, or classify, the data points in any given dimension.

A hyperplane is a subspace whose dimension is one less than that of its ambient space. If a space is 3-dimensional, its hyperplanes are the 2-dimensional planes; if the space is 2-dimensional, its hyperplanes are the 1-dimensional lines; and the hyperplane of a 1-dimensional space is a point.

Next comes the question, where to place the hyperplane so that it classifies the data efficiently?

The widest possible road! Yes, we place the hyperplane in the middle of the widest possible gap between the classes, called the margin.

The margin is the perpendicular distance between the separating hyperplane and the points of each class closest to it; those closest points are called support vectors.

The goal is to maximize the margin so that when a new data point comes in, it can be classified properly. The optimal position of the hyperplane is the one with the maximum margin.
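To make this concrete, here is a minimal sketch in Python using scikit-learn (the tiny dataset and the names X and y are my own illustrative assumptions, not from the article). It fits a linear SVM on six 2-D points and prints the support vectors, i.e. the trees closest to our road:

```python
import numpy as np
from sklearn.svm import SVC

# Two parameters (x1, x2): circles on the left, crosses on the right
X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])  # 0 = red circle, 1 = green cross

clf = SVC(kernel="linear")  # a linear kernel gives a straight separating line
clf.fit(X, y)

print(clf.support_vectors_)           # the points closest to the hyperplane
print(clf.predict([[3, 2], [7, 6]]))  # classify two new data points
```

Note how only a few of the six points end up as support vectors; the others don’t affect where the road is placed.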

Soft Margins

In a linearly separable case, the Support Vector Machine tries to find the hyperplane that maximizes the margin, with the condition that both classes are classified correctly. But in reality, datasets are almost never linearly separable, so the condition of 100% correct classification by a hyperplane will rarely be met. Let’s face it: data without outliers is a misconception!

So the Support Vector Machine tolerates a few incorrect classifications. Choosing a threshold that allows misclassifications is a trade-off that runs through all of machine learning. Before allowing misclassifications, we pick a threshold that is very sensitive to the training data (low bias) but performs poorly on new data (high variance). After allowing misclassifications, the threshold is less sensitive to the training data (higher bias) and performs better on new data (lower variance).

Allows misclassification

This creates soft margins, which handle two cases: a point lies on the wrong side of the margin but still on the correct side of the separating hyperplane, or a point lies on the wrong side of the separating hyperplane altogether.

How do we tell whether one soft margin is better than another?

The answer is simple: we use cross-validation to determine how many misclassifications and observations to allow inside the soft margin to get the best classification.
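In scikit-learn, the softness of the margin is controlled by the parameter C (a small C allows more misclassifications inside the margin). Here is a hedged sketch of picking C by cross-validation, reusing the toy X and y from the earlier snippet:

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Small C = softer margin (more tolerance), large C = harder margin
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=3)  # cv=3 because the toy set is tiny
search.fit(X, y)

print(search.best_params_)  # the soft margin that cross-validation prefers
```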

Besides soft margins, another problem the Support Vector Machine tackles is data that is not linearly separable; for that it uses the kernel trick.

Kernel Trick

Suppose we have a set of data points for two classes. Can you draw a line that separates them? No; this type of data is linearly non-separable. When we cannot classify the data points in the given dimension, we have to go to a higher dimension using a kernel.

A kernel is a mathematical function that takes data as input and transforms it into the required form. Different SVM algorithms use different types of kernel functions, for example linear, polynomial, radial basis function (RBF), and sigmoid.

Kernel functions only calculate the relationship between every pair of points as if they were in the higher dimension; they don’t actually do the transformation. This trick, computing the high-dimensional relationships without actually transforming the data into the higher dimension, is called the kernel trick.

The kernel trick reduces the amount of computation required by the Support Vector Machine because it avoids the math that transforms the data from the lower dimension to the higher one, and it even makes it possible to calculate relationships in infinite dimensions using the radial kernel.
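A small numeric check of this idea (my own illustration, not from the article): for the polynomial kernel (a · b + 1)² there is an explicit feature map into six dimensions, and the kernel gives exactly the dot product we would get there, without ever building the six-dimensional vectors:

```python
import numpy as np

def poly_kernel(a, b):
    # the relationship "as if" the points were in the higher dimension
    return (np.dot(a, b) + 1) ** 2

def explicit_map(v):
    # the 6-D feature map that this kernel implicitly corresponds to
    x1, x2 = v
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

a, b = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(poly_kernel(a, b))                         # 144.0
print(np.dot(explicit_map(a), explicit_map(b)))  # 144.0, the same value
```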

However, one critical thing to keep in mind is that when we map data to a higher dimension, there is a chance we may overfit the model. Choosing the right kernel function (including the right parameters) and regularization are therefore of great importance.

A few widely used kernel functions are:

Polynomial Kernel

When we want to move our data from a lower to a higher dimension to get a different separating hyperplane for linearly non-separable data, we use the polynomial kernel.

It has the form K(a, b) = (a · b + r)^d, where d is the degree of the polynomial and r is a constant coefficient.

When d = 1, the polynomial kernel computes the relationship between each pair of observations in one dimension, and these relationships are used to find a support vector classifier.

When d = 2, the polynomial kernel computes the relationship between each pair of observations in two dimensions, and these relationships are used to find a support vector classifier.

Similarly, we keep increasing d and finding support vector classifiers.

We can get a good value of d using cross-validation.
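Here is a sketch of that with scikit-learn (make_circles is just a stand-in for a linearly non-separable dataset; the parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A ring of one class around the other: not separable by any straight line
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

search = GridSearchCV(SVC(kernel="poly"), {"degree": [1, 2, 3, 4]}, cv=5)
search.fit(X, y)
print(search.best_params_)  # the degree d that cross-validation prefers
```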

Radial Basis Function Kernel

The radial kernel finds classifiers in infinite dimensions. On a new data point, it behaves like a weighted nearest-neighbor model: the closest observations have a lot of influence on how we classify the new data point, while observations that are further away have little influence on its classification.

The higher the gamma, the more influence the closest points have on the decision boundary, and the more tightly the boundary fits the training data.
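A minimal sketch of gamma’s effect with scikit-learn’s RBF kernel, reusing the circles data from the previous snippet (the two gamma values are arbitrary, chosen only to contrast a smooth and a wiggly boundary):

```python
from sklearn.svm import SVC

rbf_smooth = SVC(kernel="rbf", gamma=0.1)  # far-away points still have influence
rbf_wiggly = SVC(kernel="rbf", gamma=10)   # only very close points matter

rbf_smooth.fit(X, y)  # X, y from the circles example above
rbf_wiggly.fit(X, y)
print(rbf_smooth.score(X, y), rbf_wiggly.score(X, y))  # training accuracy of each
```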

Advantages of SVM

  1. High dimensional input: a lot of algorithms run into trouble with high dimensional input (say, 1000-D), but SVM handles this type of data efficiently.
  2. Sparse document vectors: this is where we tokenize the words in a document so that we can run our machine learning algorithms over them.
  3. Regularization parameter: a parameter that helps to figure out whether the model is under-fitting (bias) or over-fitting, and automatically helps in correcting it.
  4. Memory efficient: it uses only a subset of the training points in the decision function (the support vectors), so it also saves memory.

Disadvantages of SVM

  1. Long training time: the Support Vector Machine doesn’t perform well on large datasets because the required training time is higher.
  2. Choosing an appropriate kernel function is difficult: picking a kernel function that handles the non-linear data is not an easy task; it can be tricky and complex. With a high-dimensional kernel you might generate too many support vectors, which reduces the training speed drastically.
  3. Requires feature scaling: one must scale the features before applying the Support Vector Machine (see the sketch after this list).
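For point 3, a hedged sketch of the usual fix: wrap the scaler and the SVM in a scikit-learn Pipeline, so the scaling is learned from the training data only (X and y are placeholders, as in the earlier snippets):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# StandardScaler puts every feature on a comparable scale before the SVM sees it
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
print(model.predict(X[:5]))  # predictions go through the scaler automatically
```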

Congratulations on making it till the end!

I hope you found it helpful; I’m open to feedback and suggestions below.

Share if you like!
