Machine Learning Algorithms: Support Vector Machines

Vishnu Satheesh
Published in Analytics Vidhya
4 min read · Mar 19, 2021

In this third article of the Machine Learning algorithms series, I will be discussing one of the most popular supervised learning algorithms, Support Vector Machines (SVM). They can be used for both classification and regression problems, but they are most widely used for their classification capabilities. So let's start classifying using SVM!


The ultimate aim of the support vector machines algorithm is to identify the best hyperplane, i.e., the one that separates the classes as widely as possible and thus creates the largest margin.

A classification problem with 2 different classes of data

By looking at the figure above, we can identify the best hyperplane for splitting the data equally. But how does the algorithm identify it? Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation. The margin is defined as the distance between the support vectors of the two classes. Thus, the support vectors are used to maximize the margin of the classifier.

In order to classify, we first need to identify the decision boundary. We will assume that anything below the decision boundary belongs to the negative class and anything above it belongs to the positive class. So, mathematically,
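Using the usual notation, with w as the weight vector and b as the bias term, the decision boundary and the classification rule can be written as:

$$w \cdot x + b = 0 \quad \text{(decision boundary)}$$

$$w \cdot x + b \ge 0 \;\Rightarrow\; y = +1, \qquad w \cdot x + b < 0 \;\Rightarrow\; y = -1$$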

Let's arbitrarily pick two points, x1 in the negative class and x2 in the positive class, lying closest to the hyperplane (i.e., two support vectors).

We can represent this as x2 = x1 + λ · (w / ‖w‖), where λ is the distance between the two points measured along the direction of w.
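Following the usual convention that the support vectors of the two classes satisfy w · x1 + b = −1 and w · x2 + b = +1, substituting the expression for x2 gives:

$$w \cdot \left( x_1 + \lambda \frac{w}{\lVert w \rVert} \right) + b = 1 \;\Rightarrow\; -1 + \lambda \lVert w \rVert = 1 \;\Rightarrow\; \lambda = \frac{2}{\lVert w \rVert}$$

So the distance between the two support vectors, i.e., the margin, is 2/‖w‖.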

The ultimate aim of the support vector machine is to maximize this distance as much as possible. We can interpret this in another way: maximizing the margin 2/‖w‖ is equivalent to minimizing ½‖w‖², subject to every training point being classified on the correct side of the margin.
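Written in the standard hard-margin form, with yi ∈ {−1, +1} denoting the class labels:

$$\min_{w,\,b}\ \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i$$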

In most cases, however, the data points are not as neatly linearly separable as in the figure above. So the hyperplane created must be able to allow some misclassification to occur. Such a situation is handled by a soft margin, and a slack variable is introduced into the formulation to quantify the misclassification.
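A standard way of writing the resulting soft-margin objective is:

$$\min_{w,\,b,\,\epsilon}\ \frac{1}{2} \lVert w \rVert^2 + C \sum_i \epsilon_i \quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 - \epsilon_i, \quad \epsilon_i \ge 0$$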

where C is the regularisation parameter and 𝜖i is the slack variable of the i-th point. The regularisation parameter controls how much misclassification the model tolerates: a low C value allows more outliers than a higher C value.

Kernel Function

When the data is not linearly separable in its original space, a kernel function maps it to a higher dimension, which in turn helps in finding a suitable hyperplane for classification.
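Formally, a kernel computes the inner product of two points after an implicit mapping φ into that higher-dimensional space:

$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$$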

There are different types of Kernel functions available:

Linear Kernel: It is used when the data is linearly separable.
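It is simply the dot product of the two points:

$$K(x_i, x_j) = x_i \cdot x_j$$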

Polynomial Kernel:
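A commonly used form of the polynomial kernel is:

$$K(x_i, x_j) = (x_i \cdot x_j + c)^d$$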

Here c is an arbitrary constant and d is the polynomial degree. The polynomial kernel reduces to the linear kernel when c = 0 and d = 1.

Radial Kernel: It is also called the Gaussian or RBF kernel.
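Its usual form is:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$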

where γ is a hyperparameter that controls the variance of the model. When γ is small, the model behaves much like a linear model. ‖xi − xj‖ represents the Euclidean distance between the points.
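To make the kernels concrete, here is a minimal scikit-learn sketch; the toy dataset and the parameter values (C, degree, coef0, gamma) are illustrative choices rather than recommendations:

```python
# Minimal sketch: training SVM classifiers with different kernels in scikit-learn.
# The dataset and hyperparameter values below are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A small toy dataset that is not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Linear kernel: suitable when the classes are (roughly) linearly separable
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Polynomial kernel: the degree d and constant c map to `degree` and `coef0`
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X_train, y_train)

# RBF (Gaussian) kernel: gamma controls the variance of the model
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X_train, y_train)

for name, model in [("linear", linear_svm), ("poly", poly_svm), ("rbf", rbf_svm)]:
    print(f"{name} kernel test accuracy: {model.score(X_test, y_test):.3f}")
```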

Pros:

1. Highly suitable for high-dimensional data
2. Suitable for non-linear models
3. Works well even with unstructured and semi-structured data like text, images, and trees

Cons:

1. Long training time for large datasets
2. Hyperparameter selection makes them computationally expensive (see the sketch below)
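To illustrate the second point, a typical grid search trains and cross-validates one SVM for every (C, γ) combination, which is where the cost comes from. Here is a minimal sketch with illustrative parameter values:

```python
# Minimal sketch of hyperparameter search for an RBF-kernel SVM.
# Every (C, gamma) pair is trained and cross-validated, which is why
# tuning becomes expensive on large datasets.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", round(search.best_score_, 3))
```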

Links to previous articles of the Machine learning algorithms series:

Hope you had a good read. Give a clap to show your support and follow me for more articles ☺
