Support Vector Machines (part 2)

Ahmed Imam
Published in Analytics Vidhya
8 min read · May 20, 2020

SVM non-linear models and kernel-tricks

In the first part of this tutorial, which covered the SVM linear model (and which I strongly recommend reading first), we mentioned that SVM can solve both regression and classification problems. It is mostly used for classification, because it can separate classes using either linear or non-linear models.

Can we classify non-linearly separable data?

Suppose we have a system like the one shown below and we would like to use SVMs to classify it. At first glance this seems impossible, because the data is not linearly separable. However, that conclusion is incomplete: what matters here is that the data is not linearly separable in two dimensions.

Fig.1 Non-linear separable model in 2-D

So what if we could do what is called a polynomial mapping, transforming our 2-dimensional space into a 3-dimensional space by applying the function 𝜙(x₁, x₂) = (x₁², √2·x₁x₂, x₂²)?

And if we then try to linearly separate the mapped data, our decision boundary will be a hyperplane in ℝ³.

Fig.2 Transformed dataset is linearly separable in 3-D

Here is a basic recipe we can use to classify this dataset:

  1. Transform every two-dimensional vector into a three-dimensional vector.
  2. Train the SVMs using the 3D dataset.
  3. For each new example we wish to predict, transform it before passing it to the SVM classifier.
  4. The SVM classifier relies on dot products between pairs of instances to compute the best weights (coefficients) of the resulting model.

Of course, you are not forced to transform the data into three dimensions; it could be five, ten, or one hundred dimensions. Unfortunately, there is no perfect recipe for choosing the mapping, and finding a good one comes with experience, via trial and error.
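To make the recipe concrete, here is a minimal sketch (on a hypothetical toy dataset, not from the article) that applies steps 1–3 with the same mapping used in the code further down:

import numpy as np
from sklearn import svm

# Hypothetical toy data: class 1 points lie farther from the origin than class 0 points
X_2d = np.array([[1, 0], [0, 1], [-1, 0], [0, -1],
                 [3, 0], [0, 3], [-3, 0], [0, -3]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Step 1: map every 2-D point to 3-D with the polynomial mapping
def transform(x):
    return [x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2]

X_3d = np.array([transform(x) for x in X_2d])

# Step 2: train an SVM with a linear kernel on the 3-D data
clf = svm.SVC(kernel='linear', C=1.0).fit(X_3d, y)

# Step 3: transform each new example before predicting
print(clf.predict([transform([4.0, 0.0])]))  # expected: [1]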

What is a kernel?

One of the main drawbacks of the above method (explicit transformation) is that we must transform every example (instance) and then compute the dot products between the transformed instances. If we have millions or billions of examples and the transformation is expensive (because of the number of dimensions/features), this can take a huge amount of time. This is where kernels come to the rescue.

import numpy as np

# Two examples/instances:
x1 = [3, 6]
x2 = [10, 10]

# Transform a two-dimensional vector x into a three-dimensional vector.
def transform(x):
    return [x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2]

# Run the transformation and take the dot product in 3-D
x1_3d = transform(x1)
x2_3d = transform(x2)
print(np.dot(x1_3d, x2_3d))
# Result = 8100 (up to floating-point rounding)

Now the question is: is there a way to compute this value without transforming the vectors?

And the answer is: Yes, with a kernel!

Let us consider the function (kernel function) in the following code:

def polynomial_kernel(a, b):
    return a[0]**2 * b[0]**2 + 2*a[0]*b[0]*a[1]*b[1] + a[1]**2 * b[1]**2

Then pass the two examples to this new function and check the result:


print(polynomial_kernel(x1, x2)) # 8100

Notice that the kernel function computes the dot product of 𝑥1 and 𝑥2 as if they had been transformed into vectors belonging to ℝ³, and it does that without performing the transformation and without computing the dot product of the transformed vectors!

So, a kernel is a function that returns the result of a dot product performed in another (higher-dimensional) space, without ever computing the coordinates in that space. These implicit features are the key that allows the SVM to find a non-linear decision boundary.
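One practical note (a sketch, not part of the original article): scikit-learn's SVC also accepts a callable kernel, but it must take two data matrices and return the whole Gram matrix of pairwise kernel values, rather than operate on one pair of vectors at a time:

import numpy as np
from sklearn import svm

# Gram-matrix form of the same degree-2 polynomial kernel: K[i, j] = (a_i . b_j)^2
def polynomial_kernel_matrix(A, B):
    return (np.asarray(A) @ np.asarray(B).T) ** 2

# Hypothetical toy data: class 1 points lie farther from the origin
X = np.array([[1, 0], [0, 1], [-1, 0], [0, -1],
              [3, 0], [0, 3], [-3, 0], [0, -3]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = svm.SVC(kernel=polynomial_kernel_matrix, C=1.0).fit(X, y)
print(clf.predict([[4.0, 0.0], [0.5, 0.5]]))  # expected: [1 0]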

More formally, we can write:

Definition: Given a mapping function 𝜙 : X → V, we call the function 𝐾 : X × X → ℝ defined by 𝐾(𝐱, 𝐱′) = ⟨𝜙(𝐱), 𝜙(𝐱′)⟩ᵥ, where ⟨⋅,⋅⟩ᵥ denotes an inner product in V, a kernel function.
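Plugging the mapping used above, 𝜙(𝐱) = (x₁², √2·x₁x₂, x₂²), into this definition shows why polynomial_kernel returns the same number as the explicit transformation:

⟨𝜙(𝐚), 𝜙(𝐛)⟩ = a₁²b₁² + 2a₁b₁a₂b₂ + a₂²b₂² = (a₁b₁ + a₂b₂)² = (𝐚 ⋅ 𝐛)²

so this kernel is simply the squared dot product of the original two-dimensional vectors; for x1 = [3,6] and x2 = [10,10], that is (30 + 60)² = 8100.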

The kernel trick

Tip: Applying the kernel trick simply means replacing the dot product of two examples with a kernel function.

The predicted label in the non-linear case can be expressed as:

ŷ = sign( Σ_{i ∈ S} 𝛼ᵢ 𝑦ᵢ 𝐾(𝐱ᵢ, 𝐱) + b )

Where:

  • (𝛼ᵢ) is called a Lagrange multiplier. In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints; for more information, see here and here.
  • (S) is the set of support vectors

Tip: Looking at this formula, we only need to compute the kernel function on the support vectors, not on all the training vectors.
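To see this concretely, here is a minimal sketch on a hypothetical toy dataset: with gamma=1 and coef0=0, scikit-learn's polynomial kernel reduces to (𝐱 ⋅ 𝐱′)², and the decision function can be rebuilt by summing over the support vectors only.

import numpy as np
from sklearn import svm

# Hypothetical toy data: class 1 points lie farther from the origin
X = np.array([[1, 0], [0, 1], [-1, 0], [0, -1],
              [3, 0], [0, 3], [-3, 0], [0, -3]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# gamma=1, coef0=0 makes sklearn's poly kernel equal to (x . x')^2
clf = svm.SVC(kernel='poly', degree=2, gamma=1, coef0=0, C=1.0).fit(X, y)

# Rebuild the decision function by hand, using only the support vectors
def decision(x):
    K = (clf.support_vectors_ @ x) ** 2                 # kernel between x and each support vector
    return clf.dual_coef_[0] @ K + clf.intercept_[0]    # dual_coef_ already stores alpha_i * y_i

x_new = np.array([2.0, 2.0])
print(decision(x_new), clf.decision_function([x_new])[0])  # the two values should match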

Kernel types

Linear kernel

This is the simplest kernel. It is defined by:

K(𝐱, 𝐱′) = 𝐱 ⋅ 𝐱′

  • Where 𝐱 and 𝐱′ are two vectors. In practice, you should know that a linear kernel works well for text classification.
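As an illustration of that last point, here is a minimal sketch with made-up documents (the TF-IDF features and pipeline are my own assumption, not part of the original article):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical tiny text dataset: 1 = sports, 0 = politics
docs = ["the team won the match", "great goal in the final",
        "parliament passed the new bill", "the senate debated the law"]
labels = [1, 1, 0, 0]

# TF-IDF produces high-dimensional sparse features, which are often close to
# linearly separable -- one reason the linear kernel works well for text
text_clf = make_pipeline(TfidfVectorizer(), SVC(kernel='linear', C=1.0))
text_clf.fit(docs, labels)
print(text_clf.predict(["the team scored a goal"]))  # expected: [1]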

Polynomial kernel

We talked about the polynomial kernel earlier when we introduced kernels; the following is the more generic version of the kernel:

K(𝐱, 𝐱′) = (𝐱 ⋅ 𝐱′ + C)ᵈ

Where:

  • The (𝐶) parameter trades off correct classification of training examples against maximization of the decision function’s margin. For larger values of 𝐶, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower 𝐶 will encourage a larger margin, and therefore a simpler decision function, at the cost of training accuracy. In other words, 𝐶 behaves as a regularization parameter in the SVM.
  • And 𝑑, which represents the degree of the kernel.

Tip: Using a high-degree polynomial is dangerous because you can often achieve better performance on your training set, but it leads to what is called overfitting: the model is too close to the data and does not generalize well.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import svm

# import some data to play with
dataset = pd.read_csv('./Datasets/svm.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
poly1_svc = svm.SVC(kernel='poly', degree=1, C=C).fit(X, y)
poly2_svc = svm.SVC(kernel='poly', degree=2, C=C).fit(X, y)
poly4_svc = svm.SVC(kernel='poly', degree=4, C=C).fit(X, y)
poly6_svc = svm.SVC(kernel='poly', degree=6, C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X['x1'].min() - 1, X['x1'].max() + 1
y_min, y_max = X['x2'].min() - 1, X['x2'].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# titles for the plots
titles = ['SVC polynomial (d=1) kernel',
          'SVC polynomial (d=2) kernel',
          'SVC polynomial (d=4) kernel',
          'SVC polynomial (d=6) kernel']

fig = plt.figure(figsize=(10, 7))
for i, clf in enumerate((poly1_svc, poly2_svc, poly4_svc, poly6_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.cool, alpha=0.8)

    # Plot also the training points
    plt.scatter(X['x1'], X['x2'], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])
plt.show()

RBF or Gaussian kernel

When polynomial kernels cannot do well because of how the data is distributed, we need to use another, more flexible kernel: the Gaussian kernel. It is also called the RBF kernel, where RBF stands for Radial Basis Function. A radial basis function is a function whose value depends only on the distance from the origin or from some fixed point.

The RBF kernel function is:

K(𝐱, 𝐱′) = exp(−γ ‖𝐱 − 𝐱′‖²)

Like the polynomial kernel, the RBF kernel returns the result of a dot product, but this time the implicit space is ℝ∞. You can find the proof here.
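As a quick check (a minimal sketch; the two points and gamma=0.1 are arbitrary choices), computing the RBF kernel by hand matches scikit-learn's implementation:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

a = np.array([3.0, 6.0])
b = np.array([10.0, 10.0])
gamma = 0.1

# RBF kernel by hand: exp(-gamma * ||a - b||^2)
manual = np.exp(-gamma * np.sum((a - b) ** 2))
print(manual, rbf_kernel([a], [b], gamma=gamma)[0, 0])  # the two values should match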

As you can see in the example below, the decision boundary produced by the polynomial kernel classifies the data very poorly, while the Gaussian kernel does an excellent job.

# import some data to play with
dataset = pd.read_csv('./Datasets/svm_rbf.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, -1]

h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
poly3_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.1, C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = 0, 20
y_min, y_max = 0, 20
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# titles for the plots
titles = ['SVC polynomial (d=3) kernel',
          'SVC RBF kernel']

fig = plt.figure(figsize=(10, 7))
for i, clf in enumerate((poly3_svc, rbf_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.cool, alpha=0.8)

    # Plot also the training points
    plt.scatter(X['x1'], X['x2'], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title(titles[i])
plt.show()

Value of gamma

The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameter can be seen as the inverse of the radius of influence of the samples selected by the model as support vectors (see the scikit-learn documentation page).

rbf_svc_g1 = svm.SVC(kernel='rbf', gamma=0.07, C=C).fit(X, y)
rbf_svc_g2 = svm.SVC(kernel='rbf', gamma=2, C=C).fit(X, y)

# titles for the plots
titles = ['SVC RBF (gamma=0.07) kernel',
          'SVC RBF (gamma=2) kernel']

fig = plt.figure(figsize=(10, 7))
for i, clf in enumerate((rbf_svc_g1, rbf_svc_g2)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.cool, alpha=0.8)

    # Plot also the training points
    plt.scatter(X['x1'], X['x2'], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title(titles[i])
plt.show()
  • At this point, we can say that we have covered the key points of the SVM algorithm; for more kernels, you can read this article.
  • For coding SVM in Python using the Scikit-Learn module, you can check here and here.

Part 1: Linear Model of SVM
