Pic Credit: https://unsplash.com/s/photos/3-dimensional

Everything one should know about — Support Vector Machines (SVM)

Aman Kapri · Published in Analytics Vidhya · 8 min read · Feb 23, 2020


For every data science enthusiast who works on machine learning, the SVM algorithm works like magic for almost any supervised problem. It is one of the most popular machine learning algorithms for a reason. In this article, we will look at that reason and all the nitty-gritty details of SVM.

SVM is a supervised machine learning algorithm that can be used for classification or regression problems. The method used for classification is called the “Support Vector Classifier” and the method used for regression is called the “Support Vector Regressor”. Although the intuition for both algorithms is based on support vectors, there is a slight difference in their implementation.

In this article, I am going to explain both methods clearly, along with the difference between them.

Let’s get started!

Support Vector Machine for Classification ~Support Vector Classifier

The main idea behind the support vector classifier is to find a decision boundary with the maximum possible margin separating the two classes. A pure maximum margin classifier, however, is extremely sensitive to outliers in the training data, which makes it unreliable in practice. Choosing a threshold that allows some misclassifications is therefore an example of the bias-variance tradeoff that affects all machine learning algorithms: we accept a little more bias in exchange for much lower variance.

When we allow some misclassifications (via slack variables), the distance between the observations and the threshold is called a “soft margin”.

How do we know which soft margin is better?

The answer is simple: We use cross-validation to determine how many misclassifications and observations to allow inside of the soft margin to get the best classification.
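As a minimal sketch (assuming scikit-learn and a synthetic dataset, neither of which comes from the original article), the C parameter of SVC controls how heavily margin violations are penalized, so comparing a few values of C with cross-validation is one common way to pick the soft margin:

```python
# A minimal sketch (assuming scikit-learn): the C parameter of SVC controls
# how heavily margin violations are penalized, so comparing values of C with
# cross-validation is one way to choose the soft margin.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

for C in [0.01, 0.1, 1, 10, 100]:
    # Smaller C -> wider, more tolerant margin; larger C -> stricter margin.
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C={C}: mean accuracy = {scores.mean():.3f}")
```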

The name support vector classifier comes from the fact that the observations on the edge of the margin, which determine where it is drawn, are called support vectors.

Types of SVM

SVM can be of two types:

  • Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be split into two classes by a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
  • Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a dataset cannot be split by a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

The main idea behind the SVM:

  • Start with the data in relatively low dimensions and check whether it can be separated into different classes.
  • If not, then move the data into a higher dimension.
  • Find the best hyperplane that separates the higher dimensional data into classes.
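As a toy illustration of this idea (assuming scikit-learn and NumPy; the data below is made up for the example), one-dimensional data that cannot be split by a single threshold becomes linearly separable once we lift it into a second, squared dimension:

```python
# A toy illustration (not from the article): one-dimensional data that cannot
# be split by a single threshold becomes linearly separable once we add a
# second, squared dimension.
import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -0.5, 0.0, 0.5, 2.5, 3.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])           # class 1 sits on both extremes

X_low = x.reshape(-1, 1)                      # original 1-D feature
X_high = np.column_stack([x, x ** 2])         # lifted to 2-D: (x, x^2)

print(SVC(kernel="linear").fit(X_low, y).score(X_low, y))    # imperfect
print(SVC(kernel="linear").fit(X_high, y).score(X_high, y))  # separable -> 1.0
```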

Derivation of Maximum Margin in SVM for Linearly Separable Data

Let’s take an example where we have two classes of data points, labelled + and −, which we want to separate in such a way that there is a maximum width between the samples of the two classes.

Maximum width derivation
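As a sketch in the standard notation (a weight vector w normal to the decision boundary and a bias b are assumed here, and the equations are numbered to line up with the references that follow), every negative sample and every positive sample must satisfy:

$$w \cdot x_{-} + b \;\le\; -1 \qquad (2)$$

$$w \cdot x_{+} + b \;\ge\; +1 \qquad (3)$$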

Note that if we introduce a class label yᵢ, equal to +1 for positive samples and −1 for negative samples, both equations take the same form.

Hence, the two conditions can be combined into a single generalized constraint for samples on either side of the boundary.
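In the standard notation (using the label yᵢ introduced above), this constraint is usually written as:

$$y_i\,(w \cdot x_i + b) - 1 \;\ge\; 0$$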

Now, let’s find the width/margin of the hyperplane.
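As a sketch in the same notation: the width of the margin is the projection of the vector connecting a negative and a positive support vector onto the unit normal of the boundary,

$$\text{width} = (x_{+} - x_{-}) \cdot \frac{w}{\lVert w \rVert}$$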

Now we have to find a solution such that the width of the margin is maximum. We will use equations 2 and 3 to find the extremum of the width. To find the extremum of a function subject to constraints, we use Lagrange multipliers.

Next, subtract equation 2 from equation 3 for the support vectors, i.e. the points that lie exactly on the edges of the margin.
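In the notation above (a sketch of the standard step), this gives:

$$w \cdot (x_{+} - x_{-}) = 2 \quad\Rightarrow\quad \text{width} = \frac{2}{\lVert w \rVert}$$

so maximizing the width of the margin is equivalent to minimizing $\tfrac{1}{2}\lVert w \rVert^{2}$.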

To find this constrained extremum, we introduce a Lagrange multiplier α (alpha) for each constraint.
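In standard form (numbered here as equations 4 to 6 to match the references below), the Lagrangian and its stationarity conditions are:

$$L = \frac{1}{2}\lVert w \rVert^{2} \;-\; \sum_{i} \alpha_{i}\,\bigl[\,y_{i}(w \cdot x_{i} + b) - 1\,\bigr] \qquad (4)$$

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i} \alpha_{i}\, y_{i}\, x_{i} \qquad (5)$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i} \alpha_{i}\, y_{i} = 0 \qquad (6)$$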

Substituting equations 5 and 6 into equation 4 eliminates w and b.
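In the standard derivation, this leaves the dual expression:

$$L = \sum_{i} \alpha_{i} \;-\; \frac{1}{2}\sum_{i}\sum_{j} \alpha_{i}\,\alpha_{j}\, y_{i}\, y_{j}\,\bigl(x_{i} \cdot x_{j}\bigr)$$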

For linearly separable data points, we need to find the maximum of the expression L, which depends on the data only through the dot products of the points xᵢ and xⱼ.

But what if the data points are not easily separable? How do we decide on how to transform the data?

SVM algorithms use a set of mathematical functions defined as kernels. The job of a kernel is to take the data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions, for example linear, nonlinear, polynomial, radial basis function (RBF), and sigmoid kernels.

Polynomial Kernel Function

The polynomial kernel is a kernel function commonly used with support vector machines (SVMs) and other kernelized models, that represents the similarity of vectors (training samples) in a feature space over polynomials of the original variables, allowing learning of non-linear models.
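In its common form (using the same symbols as the bullet points that follow), the polynomial kernel can be written as:

$$K(a, b) = (a \cdot b + r)^{d}$$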

  • a and b are two different data points that we need to classify.
  • r determines the coefficients of the polynomial.
  • d determines the degree of the polynomial.

Here, the kernel only needs the dot product of the data points, yet its value equals a dot product of the points in a higher-dimensional coordinate system.

When d=1, the polynomial kernel computes the relationship between each pair of observations in 1-Dimension and these relationships help to find the support vector classifier.

When d=2, the polynomial kernel computes the 2-Dimensional relationship between each pair of observations which help to find the support vector classifier.

Similarly for d = 3, 4, 5…

Note: We use cross-validation to select an optimal value for d
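As a minimal sketch (assuming scikit-learn; the dataset below is synthetic), the degree d can be selected with a cross-validated grid search:

```python
# A minimal sketch (assuming scikit-learn): cross-validated grid search over
# the degree d of the polynomial kernel.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

search = GridSearchCV(
    SVC(kernel="poly", coef0=1),          # coef0 plays the role of r
    param_grid={"degree": [1, 2, 3, 4, 5]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_)                # e.g. {'degree': 2}
```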

The Radial (RBF) Kernel

The RBF kernel is a general-purpose kernel, used when there is no prior knowledge about the data; it implicitly works in an infinite-dimensional feature space. Because the radial kernel finds the support vector classifier in infinite dimensions, it is not possible to visualize what it does. However, with the RBF kernel, the closest observations (nearest neighbors) have the most influence on classifying a new observation.
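In its common form (matching the symbols in the bullet points below), the RBF kernel can be written as:

$$K(a, b) = e^{-\gamma\,\lVert a - b \rVert^{2}}$$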

  • a and b are two feature vectors of two samples.
  • The difference between the vectors is then squared, i.e. it gives squared distance.
  • γ (Gamma) scales the squared distance and thus scales the influence the two vectors/points have on each other. The best value for γ is determined by cross-validation.

When we plug a pair of points into this kernel, the value we get back is their high-dimensional relationship.

Thus a value that is not close to zero indicates two observations that are relatively close to each other, while a value very close to zero indicates two observations that are relatively far from each other.
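As a tiny numeric illustration (assuming NumPy, with γ = 1, values chosen purely for the example): nearby points give a kernel value close to 1, while distant points give a value close to 0:

```python
# A small numeric illustration (gamma = 1): points that are close produce a
# kernel value near 1, points that are far apart produce a value near 0.
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

print(rbf(np.array([1.0, 1.0]), np.array([1.2, 0.9])))   # close -> ~0.95
print(rbf(np.array([1.0, 1.0]), np.array([5.0, 5.0])))   # far   -> ~1e-14
```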

The radial basis kernel can work in infinite dimensions because the exponential function e^x can be represented as an infinite Taylor series expansion.
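As a sketch of why this works: the exponential function can be expanded as an infinite Taylor series,

$$e^{x} = \sum_{n=0}^{\infty} \frac{x^{n}}{n!} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \cdots$$

so the radial kernel can be read as a dot product between feature vectors that contain infinitely many polynomial terms.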

Note about Kernel Functions

Kernel functions only calculate the relationships between every pair of points as if they were in the higher-dimensional space; they don’t actually perform the transformation.

This trick of calculating the high-dimensional relationships without actually transforming the data into the higher dimension is called the “kernel trick”.

The kernel trick reduces the amount of computation required for SVM by avoiding the math that transforms the data from low dimensions to high dimensions.
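As a small check of the trick (assuming NumPy, for the degree-2 polynomial kernel with r = 0; the feature map and the points are chosen only for illustration): the kernel value (a · b)² equals the ordinary dot product of the explicitly transformed points, yet the transformation itself is never constructed:

```python
# A toy check of the kernel trick for K(a, b) = (a.b)^2 in two dimensions:
# the explicit quadratic feature map phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2)
# gives the same number as the kernel, but the kernel never builds phi.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

print(np.dot(phi(a), phi(b)))     # explicit transform, then dot product -> 121.0
print(np.dot(a, b) ** 2)          # kernel trick: same value, no transform -> 121.0
```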

Support Vector Machine for Regression ~Support Vector Regressor

Support Vector Regression (SVR) uses the same principles as the SVM for classification, with only a few minor differences. First of all, because the output is a real number, there are infinitely many possible values, so the problem cannot be framed as separating a finite set of classes.

In the case of regression, a margin of tolerance (epsilon) is set around the fitted function: points that fall inside this tube contribute no error, while points that fall outside it are penalized. This makes the optimization a little more involved than in the classification case.

However, the main idea is always the same: to minimize error by finding the hyperplane (function) that maximizes the margin, keeping in mind that part of the error is tolerated.

Support Vector Regressor
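As a minimal sketch (assuming scikit-learn; the sine data below is synthetic and not part of the original article), the epsilon parameter sets the width of the tolerance tube around the fitted function:

```python
# A minimal sketch (assuming scikit-learn): epsilon controls the width of the
# tolerance tube -- errors smaller than epsilon are ignored by the loss.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(80)

model = SVR(kernel="rbf", C=10, epsilon=0.1)    # epsilon-insensitive tube
model.fit(X, y)
print(model.predict([[2.5]]))                   # prediction near sin(2.5)
```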

Advantages and Disadvantages of Support Vector Machine

Advantages of SVM

  • Guaranteed optimality: owing to the nature of convex optimization, the solution will always be a global minimum, not merely a local one.
  • Abundance of implementations: we can access it conveniently, be it from Python or Matlab.
  • SVM can be used for linearly separable as well as non-linearly separable data. For linearly separable data we can use a hard margin, whereas data that is not perfectly separable calls for a soft margin.
  • SVMs can be extended to semi-supervised learning, where the data is partly labeled and partly unlabeled. This only requires adding a condition to the minimization problem, and the result is known as the Transductive SVM.
  • Feature mapping used to be a heavy computational load on the overall training of the model. However, with the help of the kernel trick, SVM can carry out the feature mapping implicitly using a simple dot product.

Disadvantages of SVM

  • SVM does not perform as well on text data as algorithms designed for handling text structures, since it loses the sequential information, and this leads to worse performance.
  • A vanilla SVM cannot return a probabilistic confidence value the way logistic regression does. This limits interpretability, as the confidence of a prediction is important in several applications.
  • The choice of kernel is perhaps the biggest limitation of the support vector machine. With so many kernels available, it becomes difficult to choose the right one for the data.

That’s all for this article! Thanks for Reading. Please do share if you find this article useful :)
