Support Vector Machine (SVM) is one of the most powerful classifiers in machine learning. At the same time, it is one of the most difficult to understand.
To understand SVM, you might have to revisit some basic high-school math concepts.
Two critical concepts to master are vectors and the dot product between vectors.
The dot product of two vectors results in a scalar number, and it indicates the similarity between those vectors.
The dot product of X1 = (1, 2, 3, 4) and X2 = (5, 6, 7, 8) will be 1*5 + 2*6 + 3*7 + 4*8 = 5 + 12 + 21 + 32 = 70.
Along with indicating similarity, the dot product of two unit-length vectors returns the cosine of the angle between them.
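As a quick check, the arithmetic above can be reproduced with NumPy; the vectors are the ones from the example, and the last line normalizes each to unit length so the dot product becomes the cosine of the angle:

```python
import numpy as np

# The two example vectors from the text
X1 = np.array([1, 2, 3, 4])
X2 = np.array([5, 6, 7, 8])

# Dot product: 1*5 + 2*6 + 3*7 + 4*8 = 70
dot = np.dot(X1, X2)
print(dot)  # 70

# On unit-length vectors, the dot product is the cosine of the angle
cos_theta = np.dot(X1 / np.linalg.norm(X1), X2 / np.linalg.norm(X2))
print(cos_theta)  # close to 1: the vectors point in similar directions
```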
Let’s start with a simple case. Consider the visualization below. A company has recorded the age and income of its prospective customers and whether or not they purchased its product.
Below is a visualization of what the company recorded.
Now, if you are given a list of new prospective customers’ ages and incomes, you should be able to predict whether or not they will purchase the product.
Let’s use SVM.
Looking at the above visualization, the first thing that comes to mind is to draw a straight line separating the stars from the red dots. Let’s do that.
Looks somewhat better now. But we can draw this separator line in several ways. Some of them are shown in the visualization below.
And depending on which separator line we choose, the prediction for a new data point (the green square) will be different.
So, which separator is correct?
This is where the concept of maximum margin comes into play.
The challenge is to come up with a separator line such that the nearest vectors on either side of it are at maximum margin.
The vectors inside the blue circles in the above visualization are called support vectors. These are the vectors considered for the maximum margin.
The SVM task is to find a separator that separates the two classes (Yes and No) under the constraint that the margin between the separator and the nearest vectors is maximized.
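As a preview of where the math below ends up, here is a minimal sketch using scikit-learn’s SVC with a linear kernel. The (age, income) numbers are made up for illustration, not the article’s dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical (age, income) records: 0 = did not purchase, 1 = purchased
X = np.array([[25, 30], [28, 35], [30, 40],   # No
              [45, 80], [50, 90], [48, 85]])  # Yes
y = np.array([0, 0, 0, 1, 1, 1])

# kernel='linear' fits the maximum-margin straight-line separator
clf = SVC(kernel='linear')
clf.fit(X, y)

print(clf.support_vectors_)     # the points that pin down the margin
print(clf.predict([[40, 60]]))  # classify a new prospective customer
```

The fitted model exposes exactly the objects derived in the rest of this article: the support vectors and a linear decision rule.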
Let’s find an equation representing the separator. Draw a vector w perpendicular to the median line of the separator plane. This perpendicular vector (also called the normal vector) w indicates the orientation of the median line. With an offset term b, which sets how far the median line sits from the origin, the equation of the median line will be

w.x + b = 0

However, you can draw several perpendicular vectors to the separator. That means there can be several vectors w, because we have not yet fixed the length of this vector. Let’s fix the length of vector w by requiring
w.x+b ≥ 1 for Yes points
and w.x+b ≤ -1 for No points
Now we have fixed the scale of w.
Let’s combine the above two equations into one. For that we need a new variable y such that
y = +1 for Yes points
and y = -1 for No points
Multiplying the constraints by y, we get
For support vectors: y(w.x+b)-1 = 0 ……………> Eq(1)
Points on the Yes side (y = +1) will have y(w.x+b)-1 > 0 ……………> Eq(2)
and points on the No side (y = -1) will also have y(w.x+b)-1 > 0, because multiplying the negative value of w.x+b by y = -1 makes it positive ……………> Eq(3)
These equations are called decision rules: every correctly classified point satisfies y(w.x+b)-1 ≥ 0, with equality exactly on the support vectors.
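The decision rules can be sanity-checked numerically. Below is a minimal one-dimensional sketch with a hypothetical separator w = 1, b = -2 (chosen so the margins sit at x = 1 and x = 3; these values are illustrative, not from the article):

```python
import numpy as np

# Hypothetical separator: w and b chosen so the margins pass
# through support vectors at x = 1 and x = 3
w = np.array([1.0])
b = -2.0

def decision(x, y):
    """y*(w.x + b) - 1: zero on the margin, positive for correctly
    classified points beyond it, negative inside or misclassified."""
    return y * (np.dot(w, x) + b) - 1

print(decision(np.array([3.0]), +1))   # support vector: 0.0
print(decision(np.array([5.0]), +1))   # Yes side: 2.0
print(decision(np.array([1.0]), -1))   # support vector: 0.0
print(decision(np.array([-1.0]), -1))  # No side: 2.0
```

Note that the correctly classified point on the No side also gives a positive value; that is precisely what multiplying by y buys us.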
Now that we have equations for the vectors on the separator margins and the vectors on either side, let’s find an equation for the width of the separator (the margin).
The distance between support vectors Xa and Xb (one on each margin) is their difference projected onto the unit normal:

width = (Xa - Xb) . w/|w| ……………> Eq(4)

Replace Xa and Xb using Eq(1). For Xa, y = +1, so
w.Xa + b - 1 = 0
w.Xa = 1 - b
and for Xb, y = -1, so
-(w.Xb + b) - 1 = 0
w.Xb = -1 - b
Substituting these into Eq(4), the width expands to

width = ((1 - b) - (-1 - b)) / |w| = 2/|w|

Maximizing the width of the separator therefore means minimizing |w|, or equivalently minimizing (1/2)|w|², where the square is used so the upcoming derivative comes out cleanly ……………> Eq(7)
Let’s summarize what we got so far.
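In symbols, the pieces so far form the standard maximum-margin (primal) optimization problem:

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i\,(w \cdot x_i + b) - 1 \;\ge\; 0 \quad \text{for every sample } i
```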
This is a constrained optimization problem. We need to use Lagrange multipliers to solve it.
The Lagrangian has the following form:
L = optimization target minus the sum of the constraints, each scaled by a multiplier αᵢ
For us the optimization target is the minimization in Eq(7) and the constraints are the decision rules y(w.x+b)-1 ≥ 0. Putting these into the Lagrangian:

L = (1/2)|w|² - Σ αᵢ [ yᵢ(w.xᵢ + b) - 1 ]
To find an extremum of something you take its derivative and set it to zero. We need to differentiate L with respect to w and b.
Differentiating with respect to w and setting the result to zero gives

dL/dw = w - Σ αᵢyᵢxᵢ = 0, so w = Σ αᵢyᵢxᵢ ……………> Eq(10)

This equation indicates that w is a linear combination of some of the sample vectors xᵢ (those with αᵢ > 0, which turn out to be the support vectors).
Now let’s look at the differentiation of L with respect to b:

dL/db = -Σ αᵢyᵢ = 0, so Σ αᵢyᵢ = 0 ……………> Eq(12)
Let’s get back to the Lagrangian and replace w using Eq(10). Expanding term by term gives

L = Σ αᵢ - (1/2) Σ Σ αᵢαⱼyᵢyⱼ(xᵢ.xⱼ) - b Σ αᵢyᵢ ……………> Eq(14)

Using Eq(12), the last term of Eq(14) is zero. This reduces the equation to

L = Σ αᵢ - (1/2) Σ Σ αᵢαⱼyᵢyⱼ(xᵢ.xⱼ) ……………> Eq(16)

Eq(16) shows that the optimization of the width depends purely on dot products between sample vectors; at the optimum, only the support vectors contribute (they are the points with nonzero αᵢ).
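Eq(16) can be evaluated directly. Here is a tiny sketch with two hypothetical support vectors (one per class) and multipliers αᵢ = 0.25, chosen so that Σ αᵢyᵢ = 0; these values are illustrative, not from the article. Everything reduces to the Gram matrix of pairwise dot products:

```python
import numpy as np

# Hypothetical support vectors, labels, and multipliers
X = np.array([[1.0, 1.0], [3.0, 3.0]])  # one support vector per class
y = np.array([-1.0, 1.0])
alpha = np.array([0.25, 0.25])          # satisfies sum(alpha * y) == 0

# Dual objective: sum(alpha) - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)
G = X @ X.T                             # Gram matrix: all pairwise dot products
v = alpha * y
L_dual = alpha.sum() - 0.5 * v @ G @ v
print(L_dual)  # 0.25
```

For this configuration the value equals (1/2)|w|² with w = Σ αᵢyᵢxᵢ = (0.5, 0.5), as it should at the optimum.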
Now, let’s rewrite the decision rule w.x+b by replacing w with Eq(10). Suppose we get a new vector u and we want to know whether it goes to the Yes class or the No class.
If the following holds, the new vector u belongs to the Yes class:

Σ αᵢyᵢ(xᵢ.u) + b ≥ 0

If the left-hand side is less than 0, the new vector belongs to the No class.
The xᵢ.u term in this equation shows that the decision rule also depends on dot products. Now you know why I started this article by explaining the dot product.
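The dual-form decision rule is equally easy to sketch. The support vectors, labels, multipliers, and bias below are hypothetical values chosen to be mutually consistent, not taken from the article:

```python
import numpy as np

# Hypothetical support vectors, labels, multipliers, and bias
X = np.array([[1.0, 1.0], [3.0, 3.0]])
y = np.array([-1.0, 1.0])
alpha = np.array([0.25, 0.25])
b = -2.0

def classify(u):
    # Decision rule: sum_i alpha_i * y_i * (x_i . u) + b
    score = np.sum(alpha * y * (X @ u)) + b
    return "Yes" if score >= 0 else "No"

print(classify(np.array([4.0, 4.0])))  # far on the Yes side
print(classify(np.array([0.0, 0.0])))  # far on the No side
```

Notice that u enters only through the dot products xᵢ.u, which is exactly the door that kernels will walk through.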
And so far we have only covered linearly separable data. Non-linear data and kernels are still there to explore.
Non-Linear Separation & Kernel Function
So far we have worked on a case where the data was linearly separable. What if the data is not linearly separable? Let’s keep that for the next article.