What’s under the hood of SVM?
What is a Support Vector Machine?
SVM is a supervised machine learning algorithm that can be used for both classification and regression tasks, though it is mostly used for classification. A linear SVM, much like logistic regression, fits a hyperplane that separates two different categories.
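To make that concrete, here is a minimal sketch (mine, not part of the original article) using scikit-learn, which exposes SVM for both tasks: SVC for classification and SVR for regression.

```python
from sklearn.svm import SVC, SVR
from sklearn.datasets import make_classification, make_regression

# Classification: separate two categories with a linear SVM.
X_cls, y_cls = make_classification(n_samples=100, n_features=2,
                                   n_redundant=0, random_state=0)
clf = SVC(kernel="linear").fit(X_cls, y_cls)
print("classification accuracy:", clf.score(X_cls, y_cls))

# Regression: the same core idea reused for continuous targets.
X_reg, y_reg = make_regression(n_samples=100, n_features=2, random_state=0)
reg = SVR(kernel="linear").fit(X_reg, y_reg)
print("regression R^2:", reg.score(X_reg, y_reg))
```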
Geometrical intuition of SVM
Consider two categories, positive (dark circles) and negative (light circles), that are linearly separable. Because they are linearly separable, we can draw any number of hyperplanes that divide the two categories. Suppose we draw two such hyperplanes, H1 and H2, as shown in the figure above.
So SVM's core idea is to find the hyperplane that separates the categories as widely as possible. Now you might be wondering why we need to do this. The answer is simple: while predicting, we look at a probability, i.e. the probability that a point belongs to a particular category. Now observe H2 and the points near it closely: there are two positive points and one negative point very close to H2. If we train our model using H2 and predict their probabilities, those probabilities will be very low because the points lie close to H2, and whenever we see low probabilities we tend to discard such predictions, which in cases like this is the wrong conclusion to draw.
The hyperplane H1, on the other hand, separates the datapoints widely; the distance between the datapoints and the hyperplane is large compared to H2, so the problem of small probabilities goes away and the model tends to make fewer mistakes. The hyperplane that best separates the datapoints is called the "Maximum Margin Classifier". This is the basic idea of SVM, but wait, that's not all; there's more.
Once we have our maximum-margin hyperplane, the next step is to shift it outwards on both sides until we hit the nearest positive and negative points. Let's call these hyperplanes H+ and H-, and the distance between H+ and H- is "d", also called the margin. The points that lie on H+ and H- are called "Support Vectors", hence the name Support Vector Machines.
SUMMARY: In SVM we try to find a hyperplane that maximizes the margin d = distance(H+, H-), because as d increases, the generalization accuracy tends to increase as well.
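As a quick illustration of this idea, here is a small sketch (assuming scikit-learn and made-up blob data) that fits a linear SVM on two linearly separable clusters and reads off the hyperplane and the support vectors, i.e. the points sitting on H+ and H-.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated clusters of points.
X, y = make_blobs(n_samples=60, centers=2, cluster_std=0.8, random_state=42)

# A very large C approximates the "separate as widely as possible" setting.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("hyperplane H: w.x + b = 0, with w =", w, "and b =", b)
print("support vectors (points lying on H+ and H-):")
print(clf.support_vectors_)
```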
Now that we understand the geometric intuition of SVM, it's time to get familiar with the mathematical intuition.
Mathematical Intuition
In this section, we will try to understand the mathematical part of SVM.
Let H be the margin-maximizing hyperplane, such that
H : w.x+b=0
H+ : w.x+b =1
H- : w.x + b = -1
margin = d = 2/||w|| (note that w need not be a unit vector)
where w = the normal vector perpendicular to all three hyperplanes, b = the bias term, and ||w|| = the L2 norm of w. We have set the right-hand sides of H+ and H- to +1 and -1 for simplicity; they could be any constant ±k.
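As a tiny numeric check of the margin formula (the numbers below are made up for illustration, not taken from the article):

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical normal vector of H: w.x + b = 0
b = -2.0                   # hypothetical bias term (does not affect the margin)

# H+ (w.x + b = 1) and H- (w.x + b = -1) are parallel to H, and the gap
# between them is 2 divided by the L2 norm of w.
d = 2 / np.linalg.norm(w)  # ||w|| = 5, so d = 0.4
print("margin d =", d)
```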
Remember that in the geometric intuition we concluded that SVM focuses on maximizing the margin d. This can be written mathematically as follows:
w*, b* = argmax (2/||w||)
What this equation says is that we want the w and b that give the maximum margin; these are the same w and b used to define H+ and H-.
The support vectors are at a distance of 1 from the hyperplane H (in the units fixed by setting H+ and H- to ±1). To measure how far a point xi lies from the hyperplane, we use the quantity w.xi + b, where w is the vector perpendicular to the hyperplane. Hence,
w.xi + b = 1 for positive support vectors and w.xi + b = -1 for negative support vectors, where xi is a single datapoint from the dataset. Similarly, for datapoints that lie beyond H+ and H- on their correct side, w.xi + b > 1 for positive points and w.xi + b < -1 for negative points. When we multiply each value by the corresponding label, for both positive and negative points we get yi*(w.xi + b) ≥ 1, taking the label of positive points as 1 and that of negative points as -1.
For a positive datapoint: yi = 1 and w.xi + b ≥ 1. Therefore yi*(w.xi + b) ≥ 1.
For a negative datapoint: yi = -1 and w.xi + b ≤ -1. Therefore yi*(w.xi + b) ≥ 1.
This is how we conclude that yi*(w.xi + b) ≥ 1 for all correctly classified datapoints.
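Here is a small sketch (with a made-up hyperplane and toy points) verifying that this combined constraint holds for correctly classified points on both sides:

```python
import numpy as np

w, b = np.array([1.0, 1.0]), -3.0   # hypothetical hyperplane w.x + b = 0
X = np.array([[3.0, 2.0],           # positive point (beyond H+)
              [4.0, 3.0],           # positive point (beyond H+)
              [1.0, 0.5],           # negative point (beyond H-)
              [0.0, 1.0]])          # negative point (beyond H-)
y = np.array([1, 1, -1, -1])

margins = y * (X @ w + b)           # yi*(w.xi + b) for every point
print(margins)                      # [2.  4.  1.5 2. ]
print(np.all(margins >= 1))         # True: the constraint holds for all points
```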
In the end, what SVM tries to do is maximize the margin d, as we have seen above over and over again. To achieve this, SVM solves the following constrained optimization problem:
w*, b* = argmax (2/||w||)
s.t ∀i, yi(w.xi+b) ≥ 1
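For the curious, this problem can be handed to a generic convex solver. The sketch below (assuming the cvxpy library and toy separable data) minimizes ||w||, which is equivalent to maximizing 2/||w||, under the constraint above.

```python
import cvxpy as cp
import numpy as np

# Toy, linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()

# Maximizing 2/||w|| is the same as minimizing ||w||.
objective = cp.Minimize(cp.norm(w, 2))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w* =", w.value, "b* =", b.value)
print("margin d =", 2 / np.linalg.norm(w.value))
```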
This is also called the hard-margin formulation, because we have assumed that every point lies on or beyond H+ and H- on its correct side, which is too good a situation, and this harsh world wouldn't let that happen. In reality points can also lie between H+ and H-, and to handle such situations we need to modify our optimization formula.
This is how real-world datapoints may look. Points can lie between H+ and H-, or they can even lie on the wrong side of H. We saw before that all correctly placed points satisfy yi*(w.xi + b) ≥ 1. But that holds only for the good datapoints; the story is different for bad datapoints like 1, 2 and 3 (refer to the diagram above). For each datapoint:
1] yi*(w.xi + b) = 0.5 = 1 - (0.5)
2] yi*(w.xi + b) = -0.5 = 1 - (1.5)
3] yi*(w.xi + b) = -1.5 = 1 - (2.5)
For datapoint 1 the distance (yellow line) is 0.5, and subtracting it from 1 gives us ξ, which indicates how far the datapoint is from its correct hyperplane, i.e. H+ or H-. Similarly, for datapoint 2 the distance is -0.5 (negative because it lies on the wrong side of the hyperplane H) and ξ is 1.5, and for datapoint 3 the distance is -1.5 and ξ is 2.5. Therefore we can say that
yi*(w.xi + b) ≥ 1 - ξi
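The three slack values above can be reproduced in a couple of lines (the signed margins 0.5, -0.5 and -1.5 are simply assumed, as read off the diagram):

```python
import numpy as np

signed_margins = np.array([0.5, -0.5, -1.5])  # yi*(w.xi + b) for points 1, 2, 3
xi = 1 - signed_margins                       # slack ξ for each misplaced point
print(xi)                                     # [0.5 1.5 2.5]
```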
In this situation SVM focuses not only on margin maximization but also on minimizing ξ, which is the misclassification error. Therefore the new, modified optimization equation becomes
w*, b* = argmin [ (||w||/2) + C*(1/n) Σ ξi ]
s.t. ∀i, yi*(w.xi + b) ≥ 1 - ξi, with ξi ≥ 0
where
C -> hyperparameter,
n -> number of datapoints,
Σ -> sum over i from 1 to n.
This modified formulation is also called the soft-margin SVM. In the equation above, the term Σ ξi plays the role of the hinge loss.
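To see the role of C in practice, here is a rough sketch (assuming scikit-learn and overlapping blob data): a small C tolerates more margin violations and keeps the margin wide, while a large C punishes violations and narrows the margin.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so some points must end up inside the margin.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6}: margin = {margin:.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```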
ξi is zero for datapoints that lie on the correct side of their respective hyperplanes (H+/H-), and ξi = 1 - yi*(w.xi + b) for points that lie on the wrong side, hence the value of ξi can be written as
ξi = max(0, 1 - yi*(w.xi + b))
Hence, the optimization equation can also be written as
w*, b* = argmin [ Σ max(0, 1 - yi*(w.xi + b)) + λ||w|| ]
where
λ -> Hyperparameter,
max(0, 1-yi*(w.xi+b)) -> Hinge Loss
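Below is a minimal numpy sketch of this hinge-loss form of the objective; the function name hinge_objective and the toy numbers are my own, purely for illustration.

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    """Average hinge loss max(0, 1 - yi*(w.xi + b)) plus lam * ||w||."""
    losses = np.maximum(0.0, 1 - y * (X @ w + b))
    return losses.mean() + lam * np.linalg.norm(w)

# Toy data and a hypothetical (w, b), just to evaluate the objective once.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.3, 0.2]), 0.0

print(hinge_objective(w, b, X, y, lam=0.1))   # ≈ 0.211 for these numbers
```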
In this article we learned about SVM and the math behind it, but SVM does not stop here. SVM can also be used with kernels, which make a simple SVM far more powerful. I could have explained kernel SVM as well, but I don't want to stretch this article too long.
Train Time Complexity:
SVM's training time complexity is roughly O(n²) (for kernel SVM implementations it typically lies between O(n²) and O(n³)). If the dataset is very large, SVM may take a while to train.
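As a rough illustration (timings depend on the machine and the data, so treat this as a sketch only), kernel-based SVC can be compared against LinearSVC, whose liblinear solver scales better on large linear problems:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=10000, n_features=20, random_state=0)

for model in (SVC(kernel="linear"), LinearSVC()):
    start = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - start
    print(f"{type(model).__name__} trained in {elapsed:.2f} s")
```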
That’s all folks. Thank You :)