Machine Learning Algorithms: Support Vector Machines

Vishnu Satheesh
Published in Analytics Vidhya
4 min read · Mar 19, 2021

In this third article of the Machine Learning algorithms series, I will be discussing one of the most popular supervised learning algorithms, Support Vector Machines (SVM). They can be used for both classification and regression problems, but they are most widely used for their classification capabilities. So let's start classifying using SVM!


The ultimate aim of the support vector machines algorithm is to identify the best hyperplane, i.e., the one that separates the classes as widely as possible and thus creates the largest margin.

A classification problem with 2 different classes of data

By looking at the figure above, we can identify the best hyperplane for splitting the data equally. But how does the algorithm identify it? Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation. The margin is defined as the distance between the support vectors of the two classes. Thus, the support vectors are used to maximize the margin of the classifier.

In order to classify, we first need to identify the decision boundary. We will assume that anything below the decision boundary belongs to the negative class and anything above it belongs to the positive class. So, mathematically,
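Using the usual notation, with w as the weight vector and b as the bias term, the decision boundary and the classification rule can be written as:

$$w \cdot x + b = 0 \quad \text{(decision boundary)}$$

$$w \cdot x + b \ge 0 \;\Rightarrow\; y = +1, \qquad w \cdot x + b < 0 \;\Rightarrow\; y = -1$$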

Let's arbitrarily pick two points, x1 in the negative class and x2 in the positive class, lying closest to the hyperplane (i.e., two support vectors).

We can represent this as x2 = x1 + λ · (w / ‖w‖), where λ is the distance between the two points measured along the direction of w.
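Following the usual convention that the support vectors of the two classes satisfy w · x1 + b = −1 and w · x2 + b = +1, substituting the expression for x2 gives:

$$w \cdot \left( x_1 + \lambda \frac{w}{\lVert w \rVert} \right) + b = 1 \;\Rightarrow\; -1 + \lambda \lVert w \rVert = 1 \;\Rightarrow\; \lambda = \frac{2}{\lVert w \rVert}$$

So the distance between the two support vectors, i.e., the margin, is 2/‖w‖.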

The ultimate aim of the support vector machine is to maximize this distance as much as possible. We can interpret this in another way: maximizing the margin 2/‖w‖ is equivalent to minimizing ½‖w‖², subject to every training point being classified on the correct side of the margin.
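Written in the standard hard-margin form, with yi ∈ {−1, +1} denoting the class labels:

$$\min_{w,\,b}\ \frac{1}{2} \lVert w \rVert^2 \quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i$$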

In most cases, however, the data points are not as neatly linearly separable as in the figure above. So the hyperplane created must be able to allow some misclassification to occur. Such a situation is handled by a soft margin, and a slack variable is introduced into the formulation to quantify the misclassification.
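A standard way of writing the resulting soft-margin objective is:

$$\min_{w,\,b,\,\epsilon}\ \frac{1}{2} \lVert w \rVert^2 + C \sum_i \epsilon_i \quad \text{subject to} \quad y_i \,(w \cdot x_i + b) \ge 1 - \epsilon_i, \quad \epsilon_i \ge 0$$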

where C is the regularisation parameter and 𝜖i is the slack variable of the i-th point. The regularisation parameter controls how much misclassification the model tolerates: a low C value allows more outliers than a higher C value.

Kernel Function

When the data is not linearly separable in its original space, a kernel function maps it to a higher dimension, which in turn helps in finding a suitable hyperplane for classification.
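Formally, a kernel computes the inner product of two points after an implicit mapping φ into that higher-dimensional space:

$$K(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$$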

There are different types of Kernel functions available:

Linear Kernel: It is used when the data is linearly separable.
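It is simply the dot product of the two points:

$$K(x_i, x_j) = x_i \cdot x_j$$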

Polynomial Kernel:
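A commonly used form of the polynomial kernel is:

$$K(x_i, x_j) = (x_i \cdot x_j + c)^d$$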

Here c is an arbitrary constant and d is the polynomial degree. The polynomial kernel reduces to the linear kernel when c = 0 and d = 1.

Radial Kernel: It is also called the Gaussian or RBF kernel.
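Its usual form is:

$$K(x_i, x_j) = \exp\left(-\gamma \lVert x_i - x_j \rVert^2\right)$$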

where γ is a hyperparameter that controls the variance of the model. When γ is small, the model behaves much like a linear model. ‖xi − xj‖ represents the Euclidean distance between the points.
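To make the kernels concrete, here is a minimal scikit-learn sketch; the toy dataset and the parameter values (C, degree, coef0, gamma) are illustrative choices rather than recommendations:

```python
# Minimal sketch: training SVM classifiers with different kernels in scikit-learn.
# The dataset and hyperparameter values below are illustrative assumptions.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A small toy dataset that is not linearly separable
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Linear kernel: suitable when the classes are (roughly) linearly separable
linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)

# Polynomial kernel: the degree d and constant c map to `degree` and `coef0`
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X_train, y_train)

# RBF (Gaussian) kernel: gamma controls the variance of the model
rbf_svm = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X_train, y_train)

for name, model in [("linear", linear_svm), ("poly", poly_svm), ("rbf", rbf_svm)]:
    print(f"{name} kernel test accuracy: {model.score(X_test, y_test):.3f}")
```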

Pros:

1. Highly suitable for high-dimensional data
2. Suitable for non-linear models
3. Works well even with unstructured and semi-structured data like text, images, and trees

Cons:

1. Long training time for large datasets
2. Hyperparameter selection makes them computationally expensive (see the sketch below)
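To illustrate the second point, a typical grid search trains and cross-validates one SVM for every (C, γ) combination, which is where the cost comes from. Here is a minimal sketch with illustrative parameter values:

```python
# Minimal sketch of hyperparameter search for an RBF-kernel SVM.
# Every (C, gamma) pair is trained and cross-validated, which is why
# tuning becomes expensive on large datasets.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative synthetic dataset
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy:", round(search.best_score_, 3))
```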

Links to previous articles of the Machine learning algorithms series:

Hope you had a good read. Give a clap to show your support and follow me for more articles ☺
