Support Vector Machines — An Introduction

Karthik Sundar · Published in Delta Force · Jan 24, 2022

Support Vector Machines, abbreviated as SVM, is a machine learning algorithm used in both regression and classification tasks. However, it is most widely used for classification. Before getting into SVM, let us look at Maximal Margin Classifiers and Support Vector Classifiers.

Maximal Margin Classifiers

Let’s take the below-shown training dataset. Our task is to separate the green dots from the red dots. In this case, the Maximal Margin Classifier draws the line that is furthest away from both the closest green dot and the closest red dot. In ML lingo, this line is usually called a hyperplane: a subspace whose dimension is one less than that of the space the data lives in. Here the points are in 2 dimensions, so the hyperplane is a 1-dimensional line.
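As a minimal sketch of this idea in scikit-learn (the toy points below are not from the figure, and a linear SVC with a very large C is only an approximation of a true maximal margin classifier):

import numpy as np
from sklearn import svm

# Two well-separated toy clusters (illustrative values only)
green = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5]])
red = np.array([[5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
X = np.vstack([green, red])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves essentially no room for misclassification,
# which approximates the maximal margin behaviour
hard_margin = svm.SVC(kernel='linear', C=1e10)
hard_margin.fit(X, y)
print(hard_margin.support_vectors_)  # the closest points, which define the margin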

This seems like a solution that may work for all cases, but that’s where we are wrong. Let’s take the case where we have an exceptional green dot much closer to the red cluster and far away from the other green dots. In this case, the Maximal Margin Classifier will give us a hyperplane that is much closer to the red cluster than to the green cluster. When asked to classify a new point near the red cluster, our model will then often predict that it is a green dot. Thus our model will perform poorly on test data, even though it performs well on the training dataset. In ML lingo, our model is said to have low bias and high variance. To solve this issue, we have support vector classifiers.

Support Vector Classifiers

So, instead of drawing the boundary based only on the two closest points, one from each cluster, we let points fall on the wrong side of the margin. Thus we allow for some misclassification, but our model will perform better in most cases. To decide how much misclassification to allow, we use cross-validation.
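In scikit-learn this trade-off is controlled by the C parameter rather than by picking points explicitly; a rough sketch (the dataset and C values here are just illustrative):

from sklearn import svm
from sklearn.datasets import make_blobs

# Two slightly overlapping toy clusters
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# Smaller C tolerates more misclassification (a softer margin),
# which typically means more support vectors
for C in (0.01, 1.0, 100.0):
    clf = svm.SVC(kernel='linear', C=C).fit(X, y)
    print(C, len(clf.support_vectors_))

In practice, the value of C itself is chosen with cross-validation.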

Support Vector Classifiers are very similar to Support Vector Machines. For linear datasets (i.e., datasets that can be separated by a straight line), Support Vector Machines work exactly as described above. However, Support Vector Machines really shine when we have a non-linear dataset.

Support Vector Machines for Non-Linear Data Sets

To classify non-linear data, SVM uses the kernel trick. Basically, it transforms the data to a higher dimension and then finds the hyperplane which will classify the transformed data.

There are several kernels out there, but for now, let’s take a good look at the polynomial kernel.

Note: These kernel functions calculate the higher-order relationships as if the data is transformed into a higher dimension. They don’t actually transform the data, thus these kernel functions are efficient.

Polynomial kernel

The polynomial kernel function is K(a, b) = (a · b + r)^d, where a and b are two data points, r is a constant coefficient, and d is the degree of the polynomial.

So let’s take the case of r = 1/2 and d = 2. For two one-dimensional points a and b, the kernel expands to:

(a · b + 1/2)² = a · b + a² · b² + 1/4 = (a, a², 1/2) · (b, b², 1/2)

So, as you can see, we get a dot product in two dimensions (we can ignore the z-value, as it’s the same for both points). This dot product is enough to calculate the higher-order relationships. The math required to understand why the dot product alone is enough can’t be explained in a single Medium post :(
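We can sanity-check this numerically; a minimal sketch, with arbitrary values for a and b:

import numpy as np

a, b = 1.3, -0.7  # two arbitrary one-dimensional points

kernel_value = (a * b + 0.5) ** 2        # polynomial kernel with r = 1/2, d = 2
phi_a = np.array([a, a ** 2, 0.5])       # a explicitly transformed to 3 dimensions
phi_b = np.array([b, b ** 2, 0.5])       # b explicitly transformed to 3 dimensions

print(kernel_value, np.dot(phi_a, phi_b))  # both print the same number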

As d increases, the model may start to overfit. Overfitting is when our model performs well only on the training dataset. It is the case of low bias and high variance.

Radial Kernel

The function for the radial kernel is K(a, b) = exp(−γ ‖a − b‖²), where ‖a − b‖ is the distance between the two points and γ (gamma) is a parameter we choose.

Radial kernel aka Radial Basis Function calculates the relationship between a pair of points as if they are in infinite dimensions. To understand how it achieves this, we will have another look at the polynomial kernel.

We set r = 0 in the polynomial kernel, so it becomes (a · b)^d. When we add the results for d = 1 and d = 2, we get a · b + a² · b² = (a, a²) · (b, b²), the dot product in two dimensions. Let’s say we don’t stop at d = 2 and continue till infinity. Then we get the dot product in infinite dimensions. So how does this relate to the radial basis function?

Let’s take a gamma value of 1/2 and expand the radial basis function:

exp(−½(a − b)²) = exp(−½a²) · exp(−½b²) · exp(a · b)

Now we use a Taylor expansion to expand the last term:

exp(a · b) = 1 + a · b + (a · b)²/2! + (a · b)³/3! + …

When we try to get the dot product from this, we get:

exp(−½(a − b)²) = [exp(−½a²) · (1, a, a²/√2!, a³/√3!, …)] · [exp(−½b²) · (1, b, b²/√2!, b³/√3!, …)]

i.e., a dot product between two infinite-dimensional vectors.

Thus we can see that the radial kernel calculates the relationships as if they were transformed into infinite dimensions.
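Again, this is easy to sanity-check numerically by truncating the infinite sum after a few terms (a minimal sketch with arbitrary a and b and gamma = 1/2):

import numpy as np
from math import factorial

a, b = 1.3, -0.7  # two arbitrary one-dimensional points

rbf_value = np.exp(-0.5 * (a - b) ** 2)  # radial kernel with gamma = 1/2

# Dot product of the two "infinite-dimensional" vectors, truncated after 10 terms
scale = np.exp(-0.5 * a ** 2) * np.exp(-0.5 * b ** 2)
approx = scale * sum((a * b) ** n / factorial(n) for n in range(10))

print(rbf_value, approx)  # the two values agree to several decimal places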

As gamma increases, the model tends to overfit. On the other hand, if gamma is too low, the model tends to underfit. Generally, we use cross-validation to find a proper value for gamma.
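With scikit-learn, this can be done with a grid search over candidate values; a minimal sketch (the grid below is just an example):

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

X, y = datasets.make_circles(n_samples=300, noise=0.1)

param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1, 10]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # gamma (and C) chosen by 5-fold cross-validation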

Advantages of SVM:

  • Since it uses only a subset of the training points (the support vectors) in the decision function, it is memory efficient.
  • Works well in high-dimensional spaces.

Disadvantages of SVM:

  • Doesn’t work well when we have a lot of noise in our training dataset.
  • Requires a lot of time to train on large training datasets.

Code

First, we import the necessary libraries

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn import svm

Then we make our sample dataset. Here we make a non-linear dataset:

X, y = datasets.make_circles(n_samples=300, noise=0.1)
plt.scatter(X[:, 0], X[:, 1], c=y, marker=".")
plt.show()

Then we instantiate our SVM and fit our data to it. First, we will use the polynomial kernel,

clf = svm.SVC(kernel='poly', C=1.0)
clf.fit(X, y)

Now we will write some code to plot the decision boundary

def plot_decision_boundary(model, ax=None):
    if ax is None:
        ax = plt.gca()

    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # build a 30 x 30 grid over the current axes
    x = np.linspace(xlim[0], xlim[1], 30)
    y = np.linspace(ylim[0], ylim[1], 30)
    Y, X = np.meshgrid(y, x)

    xy = np.vstack([X.ravel(), Y.ravel()]).T

    # evaluate the decision function at every grid point
    P = model.decision_function(xy).reshape(X.shape)

    # plot decision boundary (the zero level of the decision function)
    ax.contour(X, Y, P,
               levels=[0], alpha=0.5,
               linestyles=['-'])

Now we will plot the result,

plt.scatter(X[:, 0], X[:, 1], c=y, s=50)
plot_decision_boundary(clf)
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=50, lw=1, facecolors='none', edgecolors='k')
plt.show()

As you can see, the polynomial kernel didn’t do a very good job here. Usually, we use something called a radial kernel or radial basis function instead. As discussed above, it calculates the relationships between two data points as if they were in infinite dimensions.

When we use the radial basis function,

clf = svm.SVC(kernel='rbf', C=1.0)
clf.fit(X, y)

We get the following result

As you can see, it did a much better job!
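To put a rough number on the difference, a quick optional check with cross-validation (continuing with the same X, y, and svm import from above):

from sklearn.model_selection import cross_val_score

poly_scores = cross_val_score(svm.SVC(kernel='poly', C=1.0), X, y, cv=5)
rbf_scores = cross_val_score(svm.SVC(kernel='rbf', C=1.0), X, y, cv=5)

print(poly_scores.mean(), rbf_scores.mean())  # the rbf kernel should score noticeably higher here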

Thus, we have reached the end of this post. Thanks for reading!!
