Support Vector Machine — with Math — part 1
SVM is one of the most popular supervised machine learning algorithms. It is used for both regression and classification tasks, and it is a discriminative classification algorithm.
Naïve Bayes is a generative classifier: a model is learnt that describes the distribution of each underlying class, and that generative model is used to probabilistically generate labels for new points. SVM, in contrast, is a discriminative classifier: rather than modelling each class, we find a line or curve (in two dimensions) or a manifold (in higher dimensions) that separates the classes from each other.
Formally: "The main objective of SVM is to find the optimal linearly separating hyperplane which maximizes the margin."
INTUITION BEHIND SVM:
Consider the simple case of a classification task in which the two classes of points are well separated.
A linear discriminative classifier would attempt to draw a straight line separating the two sets of data, and thereby create a model for classification. For 2D data like this, it is a task we could do by hand. But immediately we see a problem: there is more than one possible dividing line that can perfectly discriminate between the two classes.
Evidently our simple intuition of “drawing a line between classes” is not enough, and we need to think a bit deeper.
Support vector machines offer one way to improve on this. The intuition is this: rather than simply drawing a zero-width line between the classes, we can draw around each line a margin of some width, up to the nearest point.
In support vector machines, the line that maximizes this margin is the one we will choose as the optimal model. Support vector machines are an example of such a maximum margin estimator.
We can now use sklearn's SVC (Support Vector Classifier) to separate the classes.
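As a minimal sketch, assuming a small synthetic, well-separated dataset generated with make_blobs:

```python
# Fit a linear SVC on well-separated 2D data (synthetic, for illustration only).
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=50, centers=2, random_state=0, cluster_std=0.60)

model = SVC(kernel='linear', C=1e10)  # a very large C approximates a hard margin
model.fit(X, y)

print(model.support_vectors_)  # the training points that lie on the margin
```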
If we plot the fitted classifier, we see a dividing line that maximizes the margin between the two sets of points. A few of the training points just touch the margin. These points are the pivotal elements of this fit and are known as the support vectors; they give the algorithm its name. In Scikit-Learn, the identities of these points are stored in the support_vectors_ attribute of the classifier.
A key to this classifier's success is that, for the fit, only the positions of the support vectors matter; any points further from the margin that are on the correct side do not modify the fit.
Technically, this is because these points do not contribute to the loss function used to fit the model, so their position and number do not matter so long as they do not cross the margin.
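To illustrate this, continuing the sketch above, we can push a correctly classified non-support point even further from the boundary and refit; the support vectors should stay the same:

```python
# Take a training point that is NOT a support vector, move it further away from
# the decision boundary (on its own, correct side), refit, and compare.
import numpy as np

sv_idx = set(model.support_)                       # indices of the support vectors
i = next(j for j in range(len(X)) if j not in sv_idx)

w = model.coef_[0]                                 # normal vector of the fitted hyperplane
direction = np.sign(model.decision_function(X[i:i + 1])[0]) * w / np.linalg.norm(w)

X_moved = X.copy()
X_moved[i] += 5.0 * direction                      # push the point further from the boundary

model_moved = SVC(kernel='linear', C=1e10).fit(X_moved, y)
print(np.allclose(model.support_vectors_, model_moved.support_vectors_))  # expected: True
```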
So, when choosing the optimal hyperplane, we pick, among the set of candidate hyperplanes, the one that is farthest from the closest data points (the support vectors). If the hyperplane is very close to the data points, the margin will be very small; the model may fit the training data well, but it will fail to generalize to unseen data, as explained above. Our goal is therefore to maximize the margin so that the classifier generalizes well to unseen instances.
Summary of how SVM training works
In the training phase, we select a subset of training points against which we will compute the similarity of any test point. These selected training points are called support vectors, since only these points support our decision when selecting the class of a test point. Our hope is that the training phase finds as few support vectors as possible, so that we have to compute fewer similarities.
Once we have selected the support vectors, we assign a weight to each one, which tells how much importance we want to give that support vector while making our decision. Rather than relying on a single training point, each support vector gets its own importance.
To make a decision, we compute a weighted sum of the similarities between the test point and each support vector, and assign the class based on this weighted sum.
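A rough sketch of this decision rule, with placeholder names (support_vectors, labels, alphas, b) and the plain dot product standing in as the similarity measure:

```python
# The class of a test point is the sign of a weighted sum of similarities to the
# support vectors, plus a bias term. All names here are illustrative placeholders.
import numpy as np

def linear_similarity(a, b):
    """The similarity measure used here is simply the dot product."""
    return float(np.dot(a, b))

def decide(x, support_vectors, labels, alphas, b, similarity=linear_similarity):
    weighted_sum = sum(alpha * y_sv * similarity(sv, x)
                       for sv, y_sv, alpha in zip(support_vectors, labels, alphas))
    return 1 if weighted_sum + b >= 0 else -1
```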
Concepts to know before deep diving into SVM.
Length of a vector:
The length of a vector x is called its norm, written ||x||. The Euclidean norm of a vector x = (x1, x2, …, xn) is: ||x|| = √(x1² + x2² + ⋯ + xn²)
Direction of a vector:
The direction of a vector x = (x1, x2) is written as w, and is defined as: w = (x1/||x||, x2/||x||)
Geometrically, cos(θ) = x1/||x|| and cos(α) = x2/||x||, where θ and α are the angles x makes with the horizontal and vertical axes respectively.
Thus, the direction vector w can also be written as: w = (cos(θ), cos(α))
The norm of a direction vector is always 1; therefore w is also called a unit vector.
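A quick numeric check of these definitions:

```python
# The norm of x and its direction (unit) vector w, whose own norm is 1.
import numpy as np

x = np.array([3.0, 4.0])
norm_x = np.linalg.norm(x)            # sqrt(3**2 + 4**2) = 5.0
w = x / norm_x                        # direction vector = (0.6, 0.8)
print(norm_x, w, np.linalg.norm(w))   # 5.0 [0.6 0.8] 1.0
```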
Dot product of 2 vectors:
The dot product of two vectors returns a scalar. It gives us some insights into how the two vectors are related.
Consider two vectors x and y with the angle θ between them. The geometric formula of the dot product is: x⋅y = ||x|| ||y|| cos(θ)
If β and α are the angles that x and y make with the horizontal axis, then θ = β − α, so cos(θ) = cos(β − α) = cos(β)cos(α) + sin(β)sin(α). With cos(β) = x1/||x||, sin(β) = x2/||x||, cos(α) = y1/||y|| and sin(α) = y2/||y||, we get: cos(θ) = (x1y1 + x2y2) / (||x|| ||y||)
Substituting cos(θ) into the geometric formula, we get: x⋅y = x1y1 + x2y2
This is the algebraic formula of the dot product. In general, for two n-dimensional vectors it can be computed as: x⋅y = x1y1 + x2y2 + ⋯ + xnyn
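A quick check that the two formulas agree, using two arbitrary example vectors:

```python
# Geometric vs. algebraic dot product.
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([5.0, 1.0])

algebraic = np.dot(x, y)                                  # x1*y1 + x2*y2 = 19.0

theta = np.arctan2(x[1], x[0]) - np.arctan2(y[1], y[0])   # angle between x and y
geometric = np.linalg.norm(x) * np.linalg.norm(y) * np.cos(theta)

print(algebraic, geometric)                               # both approximately 19.0
```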
Hyperplane
We have seen the term hyperplane earlier, but what is it exactly? A hyperplane is the generalization of a line (in 2D) and a plane (in 3D); it is used to separate data in any number of dimensions.
Let's look at the two-dimensional case first. Two-dimensional linearly separable data can be separated by a line whose equation is y = ax + b. If we rename x as x1 and y as x2, we get:
ax1−x2+b=0
If we define x = (x1,x2) and w = (a,−1), we get:
w⋅x+b=0
This equation is derived from two-dimensional vectors. But in fact, it also works for any number of dimensions. This is the equation of the hyperplane.
Once we have the hyperplane, we can use it to make predictions. We define the hypothesis function h as: h(x) = +1 if w⋅x + b ≥ 0, and −1 if w⋅x + b < 0
The point above or on the hyperplane will be classified as class +1, and the point below the hyperplane will be classified as class -1.
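This hypothesis can be written directly in code, with an illustrative choice of w and b:

```python
# The hypothesis function h: +1 on or above the hyperplane, -1 below it.
import numpy as np

def h(x, w, b):
    return 1 if np.dot(w, x) + b >= 0 else -1

# Illustrative hyperplane a*x1 - x2 + b = 0 with a = 2, b = 1, i.e. w = (2, -1).
w, b = np.array([2.0, -1.0]), 1.0
print(h(np.array([1.0, 1.0]), w, b))   # 2*1 - 1 + 1 =  2 >= 0  ->  +1
print(h(np.array([1.0, 5.0]), w, b))   # 2*1 - 5 + 1 = -2 <  0  ->  -1
```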
So, reiterating: the goal of the SVM learning algorithm is to find a hyperplane that separates the data accurately. There might be many such hyperplanes, and we need to find the best one, which is often referred to as the optimal hyperplane.
Metrics to compare hyperplanes:
Let's first consider the equation of the hyperplane, w⋅x + b = 0. If a point x is on the hyperplane, then w⋅x + b = 0; if it is not on the hyperplane, the value of w⋅x + b is either positive or negative. Among all the training examples, we want to know which point is closest to the hyperplane, so we calculate β = |w⋅x + b| (which is proportional to the perpendicular distance). To formally define the problem:
Given a dataset D, we compute β for each training example, and B is the smallest β we get.
If we have s hyperplanes, each of them will have a Bi value, and we’ll select the hyperplane with the largest Bi value.
The problem with the previous metric is that it could fail to distinguish between a good hyperplane and a bad one. Because we take the absolute value of w⋅x+b, we could get the same value for a correct and an incorrect hyperplane. We need to adjust this metric.
We can use the information in the label y. Let's define f = y(w⋅x + b); f will always be positive if the point is correctly classified and negative if it is incorrectly classified.
To make it formal, given a dataset D, we compute f for each training example, and F is the smallest f we get. In the literature, F is called the functional margin of the dataset.
When comparing hyperplanes, the hyperplane with the largest F will be favorably selected.
It looks like we have found the correct metric. However, this metric is not scale invariant. For example, take the two vectors w1 = (3,4) and w2 = (30,40). Since they have the same unit vector u = (0.6, 0.8), the two vectors w1 and w2 represent the same hyperplane, yet when we compute F, the one computed with w2 will return a larger number than the one computed with w1. We need a metric which is scale invariant.
Now let's see the final version, which is also scale invariant: we divide f by the norm of the vector w. We define: γ = y(w⋅x + b) / ||w||
To make it formal, given a dataset D, we compute γ for each training example, and M is the smallest γ we get. In the literature, M is called the geometric margin of the dataset.
When comparing hyperplanes, the hyperplane with the largest M will be favorably selected.
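The following sketch computes the functional margin F and the geometric margin M on a toy dataset and hyperplane (both chosen purely for illustration), showing that rescaling w and b changes F but not M:

```python
# Functional margin F and geometric margin M of a dataset for a hyperplane (w, b).
import numpy as np

def functional_margin(w, b, X, y):
    # F = smallest value of y_i * (w . x_i + b) over the dataset
    return min(yi * (np.dot(w, xi) + b) for xi, yi in zip(X, y))

def geometric_margin(w, b, X, y):
    # M = F / ||w||, which makes the metric scale invariant
    return functional_margin(w, b, X, y) / np.linalg.norm(w)

# Illustrative data on either side of the hyperplane -x1 + x2 = 0
X = np.array([[1.0, 3.0], [3.0, 1.0]])
y = np.array([1, -1])
w, b = np.array([-1.0, 1.0]), 0.0

print(functional_margin(w, b, X, y))            # 2.0
print(functional_margin(10 * w, 10 * b, X, y))  # 20.0   (scale variant)
print(geometric_margin(w, b, X, y))             # ~1.414
print(geometric_margin(10 * w, 10 * b, X, y))   # ~1.414 (scale invariant)
```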
We now have a perfect metric for comparing different hyperplanes. Our objective is to find an optimal hyperplane, which means we need to find the values of w and b of the optimal hyperplane.
The problem of finding the values of w and b is called an optimization problem.
In the next part, I will show how to derive the SVM optimization problem (hint: duality).