Basics of SVM for Classification.

Vibhuti Siddhpura
9 min read · Sep 14, 2018


A support vector machine (SVM) falls under the category of supervised machine learning and is used for both regression and classification.
In this post I will focus on the classification use of SVMs.
We know that the shape of a separating boundary depends on the dimensionality of the data: a point in 1-D, a line in 2-D, a plane in 3-D, and a hyper-plane in higher dimensions (Image 1.0).

1.0 Dimension and point interpretation

Support vectors are a subset of the data points, the ones that lie on the margin planes (hyper-planes). The whole SVM revolves around its support vectors.

How does it work?
Please note: the data in this whole article is assumed to be a binary classification problem.
SVM tries to find a separating line, called the decision boundary (the same concept as in logistic regression).
There can be infinitely many lines that separate the data into the two groups.
How does SVM decide its decision boundary, and how is that different from logistic regression? SVM uses the "widest street" approach: while choosing its decision boundary (decision surface), it maximizes the distance between the boundary and the nearest points of the positive and negative classes.
That is the basic idea of what SVM does.

1.1 Infinite numbers of line that can separate data
1.2 Decision surface of SVM

There exist infinitely many lines that can separate the data points.
For Image 1.1:
The dashed line separates the data while leaving the maximum space between the two groups of points.

After finding the dashed line, which acts as the separating surface of the SVM, we can find the negative and positive planes as shown in Image 1.2.

Thus only a small subset of the data points plays a part in deciding the decision surface of the SVM; these points are called support vectors. The other points provide no input in deciding the decision surface, in contrast to logistic regression, where the whole data set is used to decide the decision surface.
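As a quick illustration (a minimal sketch using scikit-learn's SVC; the toy points below are made up), a fitted linear SVM exposes exactly which points it used as support vectors:

# A minimal sketch with scikit-learn's SVC (linear kernel).
# The toy data set below is made up for illustration only.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [3, 3],     # one class
              [6, 5], [7, 8], [8, 6]])    # the other class
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)               # only these points define the decision surface
print(clf.coef_, clf.intercept_)          # W and b of the separating hyper-plane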

Why a wide margin?
A wider margin means less overfitting: a slight change in the training data will not affect the performance of the model.
Also, points that are close to the decision boundary have roughly a 50% probability of falling into either class, which makes the model uncertain about them; maximizing the margin overcomes this problem and makes the model more confident.

How to find the plane?

  • Geometrically:
1.3 Geometric interpretation of SVM

For each of the two classes, draw a convex hull such that all of that class's points are either inside or on the polygon.
Find the shortest line segment connecting the two hulls.
Bisect that segment into two equal parts; the plane that perpendicularly bisects it is the decision surface (a rough code sketch of this idea follows below).
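A rough code sketch of this geometric idea (an approximation: the shortest hull-to-hull segment is approximated by the closest pair of hull vertices, which is not exact when the closest point lies on a hull edge; the toy points are made up):

# Approximate the geometric construction of the SVM decision surface.
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

neg = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 2.0], [0.0, 2.0]])  # negative class
pos = np.array([[5.0, 1.0], [7.0, 1.0], [7.0, 3.0], [5.0, 3.0]])  # positive class

neg_hull = neg[ConvexHull(neg).vertices]   # vertices of the negative-class hull
pos_hull = pos[ConvexHull(pos).vertices]   # vertices of the positive-class hull

d = cdist(neg_hull, pos_hull)              # pairwise distances between hull vertices
i, j = np.unravel_index(d.argmin(), d.shape)
a, b = neg_hull[i], pos_hull[j]            # endpoints of the (approximate) shortest segment

midpoint = (a + b) / 2                     # bisect the segment
w = b - a                                  # normal of the bisecting plane
bias = -w.dot(midpoint)                    # decision surface: w.x + bias = 0
print(midpoint, w, bias)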

  • Mathematically:
1.4 Separating hyper-plane

Let us assume we have found the plane π which separates the data as shown in Image 1.4

The plane π+ defines the positive class plane and the plane π- defines the negative class plane. With the usual convention, π is W^T x + b = 0, π+ is W^T x + b = +k, and π- is W^T x + b = -k.

W is the normal vector, perpendicular to all three planes.

We select the two other hyper-planes π+ and π- such that both are parallel to the plane π and equidistant from it.

For simplicity and convenience, we will take k = 1 (any value of k would give the same result).
Constraints for selecting the hyper-planes: every positive point must lie on or above π+ and every negative point on or below π- (this is written formally in the classification section below).

Let the distance between the two planes π- and π+ be "d", and take a point X0 on the plane π-.
We want to find the margin distance "d"; note that "d" is a scalar.
That is why we cannot directly add "d" to the vector X0 to reach the other plane.
For this we have to convert the scalar "d" into a vector.
From the equation of the plane π+ we know the perpendicular (normal) to the plane, which is "W".

Let us take W's unit vector, u = W/||W||.

As "u" is the unit vector of "W", its magnitude is 1 but its direction is the same as that of "W".
Multiplying "u" by "d", we get a vector "h" with magnitude equal to the margin distance "d" and direction the same as "W":
h = d·u

h = d·W/||W|| (by substituting the value of u)

So, h is the vector whose magnitude equals the margin.
Let us add the vector h to our vector X0; the resulting point X0 + h lies on the plane π+. Substituting it into the equation of π+ (see the derivation below) gives the margin distance d = 2/||W||.
To maximize the margin we have to make the norm ||W|| as small as possible.
Equivalently, instead of maximizing 2/||W|| we minimize ||W||/2 (usually written as minimizing ||W||²/2 for mathematical convenience).
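To fill in the missing step (written out here in LaTeX, under the k = 1 convention used above), the substitution goes as follows:

\[
\begin{aligned}
x_0 \in \pi^- &\implies W^{T}x_0 + b = -1,\\
x_0 + h = x_0 + d\,\tfrac{W}{\lVert W\rVert} \in \pi^+ &\implies W^{T}\Big(x_0 + d\,\tfrac{W}{\lVert W\rVert}\Big) + b = 1\\
&\implies \underbrace{W^{T}x_0 + b}_{=-1} + d\,\lVert W\rVert = 1 \implies d = \frac{2}{\lVert W\rVert}.
\end{aligned}
\]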

How to find the class of an unknown data point:
Let us consider a vector Xi and check which class it belongs to. For a correctly classified point, W^T Xi + b ≥ +1 when Yi = +1 and W^T Xi + b ≤ -1 when Yi = -1.

Multiplying both sides by the label Yi combines the two cases into a single constraint.
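Written out (a reconstruction in standard notation of the equations that appeared as images above):

\[
y_i\,(W^{T}x_i + b) \ge 1 \quad \text{for every correctly classified training point } x_i,
\qquad
\hat{y}_q = \operatorname{sign}\big(W^{T}x_q + b\big) \quad \text{for an unknown query point } x_q.
\]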

Real world data:
In the real world, data will mostly not be linearly separable. Consider the points P1, P2 and P3 inside or beyond the margin in the figure below.
Assume that the margin width is 2 (i.e. k = 1), and that in units of Yi·(W^T Xi + b) the point P1 sits at +0.5 (inside the margin), P2 at -0.5 and P3 at -1.5 (both on the wrong side of the decision boundary).

Rewrite each point with respect to 1, i.e. take 1 - Yi·(W^T Xi + b):
P1: 1 - 0.5 = 0.5 (we get 0.5, which becomes our new variable ξ (xi) for P1)
P2: 1 - (-0.5) = 1.5 (we get 1.5, which is ξ for P2)
P3: 1 - (-1.5) = 2.5 (we get 2.5, which is ξ for P3)

So, the new variable ξ (the slack variable) is 0 for points that are correctly classified and outside the margin, between 0 and 1 for points inside the margin, and greater than 1 for misclassified points.
As ξ increases, the point lies farther away from its correct hyper-plane in the incorrect direction.

The SVM formulation now becomes as follows.
Initially we have the hard-margin problem; adding the slack variables ξi and a penalty term gives the soft-margin ("Soft SVM") problem shown below.
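In standard notation (a reconstruction of the equations that appeared here as images, with x_i the training points and y_i ∈ {-1, +1} their labels):

\[
\text{Hard margin:}\quad \min_{W,\,b}\ \tfrac{1}{2}\lVert W\rVert^{2}
\quad \text{s.t.}\quad y_i\,(W^{T}x_i + b) \ge 1 \ \ \forall i,
\]
\[
\text{Soft margin:}\quad \min_{W,\,b,\,\xi}\ \tfrac{1}{2}\lVert W\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i\,(W^{T}x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0 \ \ \forall i.
\]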

C is a hyper-parameter (the regularization parameter) of the soft-margin SVM
and is always > 0.
C controls how heavily misclassification errors are penalized.
As C increases, the optimizer chooses a smaller-margin hyper-plane; it gives more importance to classifying every training point correctly, which leads to over-fitting (a high-variance model).
As C decreases, the optimizer chooses a larger-margin hyper-plane and tolerates more mistakes, which makes the model under-fit (a high-bias model).

Loss:

(Image source Wikipedia)

Loss functions are used to penalize misclassification during modelling.
Hinge loss is the loss used for maximum-margin classification, and that is why it is the loss function of SVM.

Again, another difference from logistic regression: SVM uses the hinge loss while logistic regression uses the logistic (log) loss.
Writing z = Yi·(W^T Xi + b), the hinge loss falls as a straight line from -∞ up to z = 1 and is equal to 0 afterwards.
It means that once a point is classified correctly with enough margin, it receives zero penalty.
For z >= 1 the hinge loss is 0, and for z < 1 the hinge loss is 1 - z.
So, hinge loss = max(0, 1 - z), which is exactly the slack variable ξ introduced above.
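A small numpy sketch of this loss (the sample values of z are arbitrary):

# Hinge loss, where z = y_i * (W.x_i + b).
import numpy as np

def hinge_loss(z):
    """max(0, 1 - z): zero once a point is correctly classified with margin >= 1."""
    return np.maximum(0.0, 1.0 - z)

print(hinge_loss(np.array([-1.0, 0.0, 0.5, 1.0, 2.0])))  # [2.  1.  0.5 0.  0. ]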

We can conclude the following from the above understanding:

Different types of linear classifiers

Dual form of SVM:
Why do we need the dual form? In the soft-margin SVM above, often referred to as the "primal form", we cannot work with a similarity matrix (feature mapping); the dual form can take a similarity matrix as input for producing predictions.

The dual form of the optimization problem contains the dot products Xi^T Xj (Xi transpose Xj), which is what takes care of the feature mapping.

The dual form has multipliers αi, which are Lagrange multipliers; deriving them is a little more math than we need for a basic understanding of SVM.

In the dual form:
- For every Xi we have an αi
- Xi occurs only inside dot products (Xi^T Xj) in the dual form of SVM
- f(Xq) = ∑ αi Yi Xi^T Xq + b
- αi > 0 only for support vectors
- αi = 0 for non-support vectors

So, to calculate f(Xq) the only points that matter are the support vectors, because for the non-support vectors the terms vanish (αi = 0) and only "b" would remain in f(Xq).
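For reference, the dual optimization problem in standard notation (a reconstruction; the article's own equations were images) is:

\[
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j\,y_i y_j\,x_i^{T}x_j
\quad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0,\ \ 0 \le \alpha_i \le C \ \ \forall i.
\]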

Kernel trick:

All of the above explanation holds only when the data is linearly separable or almost linearly separable, which is not the case for most real-world problems.

Assume we must separate a collection of data that is not linearly separable. For such a case, imagine the data in a higher dimension (like throwing all the data points up into the air), so that in the higher-dimensional space the data can become linearly separable.

2D to 3D transform

Please consider the image shown above.
Kernelization can be thought of as a built-in feature transform in SVM.

By changing the data from 2D to 3D we can see how it becomes separable by a plane.
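As a minimal sketch of such a transform (the mapping (x1, x2) → (x1², x2², √2·x1·x2) is one illustrative choice, and the concentric-circles data set is a standard toy example, not from the article):

# Explicit 2-D -> 3-D transform: circles that are not linearly separable
# in 2-D become separable by a plane in this 3-D space.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import LinearSVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

X3d = np.c_[X[:, 0] ** 2, X[:, 1] ** 2, np.sqrt(2) * X[:, 0] * X[:, 1]]

clf = LinearSVC(C=1.0).fit(X3d, y)   # a plain linear SVM now separates the classes
print(clf.score(X3d, y))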

Kernelization transforms data of dimension d into dimension d', with d' > d.
Kernel functions are used in SVM through its dual form, because the dual form contains the dot product Xi^T Xj.

Often the similarity (Xi, Xj) is written as K(Xi, Xj), where K is the kernel function.
Replacing Xi^T Xj with K(Xi, Xj) in the dual form is called the kernel trick.
f(Xq) = ∑ αi Yi K(Xi, Xq) + b: this has now become the kernel SVM.

For kernel SVM, the hard part of finding a feature transform is partially replaced by choosing the correct kernel.

There exist many domain-specific kernels, like the polynomial kernel, string kernel, genome kernel, graph-based kernels, etc.
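For instance (a minimal sketch; the degree-2 polynomial kernel and its parameter values are illustrative choices), the kernel performs the same transform as above implicitly:

# The degree-2 polynomial kernel (gamma=1, coef0=0) corresponds to the
# explicit (x1^2, x2^2, sqrt(2)*x1*x2) mapping used in the previous sketch.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=0.0, C=1.0)
clf.fit(X, y)
print(clf.score(X, y))   # the non-linear data is separated without an explicit transform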

RBF (Radial Basis Function) Kernel:
The RBF kernel behaves somewhat like the similarity measure used in KNN.
The RBF kernel between two samples X and X' from the input space is defined as

K(X, X') = exp(-||X - X'||² / (2σ²)),

where ||X - X'||² is the squared Euclidean distance and σ is a hyper-parameter.

Let d12 be the distance between X and X'. As d12 increases, K(X, X') decreases, because K(X, X') = 1/exp((d12)²/(2σ²)).

As d12 increases, (d12)² increases, which implies that 1/exp((d12)²/(2σ²)) decreases, and hence the kernel value decreases.

This means that RBF kernel somewhat behaves like similarity.

So, K(X0, X) > K(X, X') when d01 < d12, because the closer pair of points gets the larger kernel value.
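A small numpy sketch of the RBF kernel (the points and the value of σ are arbitrary):

# RBF kernel value for two points x and x_prime.
import numpy as np

def rbf_kernel(x, x_prime, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)); decreases as distance grows."""
    d2 = np.sum((np.asarray(x) - np.asarray(x_prime)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

print(rbf_kernel([0, 0], [1, 1]))   # closer points -> value nearer to 1
print(rbf_kernel([0, 0], [3, 3]))   # farther points -> value nearer to 0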

Different cases of σ:
As the distance increases, whether in the +ve or -ve direction (since we use the squared distance), the kernel value decreases.
(Graphs taken from google)

Nu-SVM:

Nu-SVM controls both the error rate and the number of support vectors.
In Nu-SVM the regularization parameter is nu, instead of C as in the soft-margin SVM.

Nu is an upper bound on the fraction of errors and a lower bound on the fraction of support vectors.
It means that if we do not want more than 10% errors in prediction then nu = 0.1, and if we can allow at most 1% errors then nu = 0.01.

Now let us consider an example where nu = 0.01.
So the number of errors allowed is <= 1% of the points,
and the number of support vectors is >= 1% of the points.

If the number of data points = 1,00,000 (i.e. 100,000),

then the number of support vectors would be greater than or equal to 1,000, which is a large number.

So, nu-SVM controls the error, but it might affect run-time performance because it puts a lower bound on the number of support vectors.
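A minimal sketch with scikit-learn's NuSVC (the data set is generated purely for illustration):

# Nu-SVM via scikit-learn's NuSVC on a made-up toy data set.
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)

clf = NuSVC(nu=0.01, kernel="rbf", gamma="scale")
clf.fit(X, y)

# nu upper-bounds the fraction of margin errors and lower-bounds the
# fraction of support vectors among the training points.
print(clf.support_vectors_.shape[0], "support vectors out of", len(X))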

Conclusion:

  • SVMs are useful for both linearly separable and non-linearly separable data.
  • By using the dual form, a similarity matrix (feature mapping) can be used with SVM.
  • Feature transform and feature engineering in SVM are largely replaced by kernelization, so it takes over a hard part of data exploration.
  • SVM works well with high-dimensional data, as SVM itself converts the data into a higher dimension for predicting results.
  • Outliers have very little impact, as only the support vectors are used to train the model, except for the RBF kernel with a small σ, which behaves like KNN with a small k and so can be affected by outliers.
  • We cannot get feature importances from an SVM; however, we can do forward feature selection.
  • The model is not very interpretable, except for the linear SVM.
  • SVM takes more time in training, so with larger data sets SVM is typically not advised.
  • If the number of support vectors is high, low latency cannot be achieved.

Credits:

  1. Applied AI Course
  2. MIT Lectures on SVM
  3. SVM Tutorial by Alexandre KOWALCZYK

Thanks for reading. :)
