Understanding Machine Learning Algorithms — Support Vector Machine (SVM)
SVM (Support Vector Machine) can be used for both regression and classification. In this blog, we derive SVM from first principles.
What you will learn:
- Geometric Intuition
- How SVM works and Math
- Kernel Trick
- How SVM works when we have outliers
- Use cases of SVM
1. Geometric Intuition
Case-1: In Case-1 there are two planes that separate the points. We are okay with these planes because both of them classify the training points correctly.
Case-2: In Case-2 there is also a plane that separates the points. Intuitively it is a better classifier compared with the Case-1 classifiers, because it separates the classes in such a way that when a new, unseen data point arrives, it is still easily classified correctly.
The Idea of SVM
The plane tries to separate the positive and negative points as widely as possible; such a plane is called the Margin-Maximizing Hyperplane. In Case-2 above, the positive and negative points are separated as widely as possible.
2. How SVM Works and the Math
Task: The task is to find the $w$ and $b$ that maximize the margin. The margin width works out to $\frac{2}{\lVert w \rVert}$, and we can write the whole problem mathematically (see below).
The marginal planes pass through the support vectors. For a positive support vector $x_i$ (a positive point, label $y_i = +1$), the plane satisfies $w^{T} x_i + b = +1$; for a negative support vector (label $y_i = -1$), it satisfies $w^{T} x_i + b = -1$.
So, for every point farther from the marginal planes, on either the positive or the negative side, we have $y_i\,(w^{T} x_i + b) > 1$. A correctly classified point must satisfy $y_i\,(w^{T} x_i + b) \ge 1$; if instead $y_i\,(w^{T} x_i + b) < 1$, the point is inside the margin or wrongly classified. We can write the whole problem mathematically as follows.
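In standard hard-margin notation, the optimization problem described above is:

```latex
(w^{*}, b^{*}) \;=\; \arg\max_{w,\,b}\ \frac{2}{\lVert w \rVert}
\quad \text{subject to} \quad
y_i\,(w^{T} x_i + b) \ \ge\ 1 \quad \text{for all } i
```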
There is a problem with this: if some negative points lie in the positive region, or some positive points lie in the negative region, the constraints cannot all be satisfied, and the optimizer will not find any $w$ and $b$.
How can we modify it? Primal (soft-margin) SVM
We introduce one slack variable $\zeta_i$ per point. If a point is incorrectly classified, landing in the wrong region, then $\zeta_i$ is positive; for correctly classified points, $\zeta_i = 0$.
Here, $C$ is a hyperparameter. As $C$ increases, we give more importance to not making mistakes, so the model will tend to overfit the training data; as $C$ decreases, we give less importance to mistakes, so the model will tend to underfit. This form of SVM is called the Primal SVM.
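Written out with the slack variables (standard soft-margin notation, with $n$ training points), the Primal SVM is:

```latex
\min_{w,\,b,\,\zeta}\ \frac{1}{2}\lVert w \rVert^{2} \;+\; C \sum_{i=1}^{n} \zeta_i
\quad \text{subject to} \quad
y_i\,(w^{T} x_i + b) \ \ge\ 1 - \zeta_i, \qquad \zeta_i \ge 0 \quad \text{for all } i
```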
Dual Form of SVM
Using optimization theory (Lagrangian duality), we can show that the Primal SVM is equivalent to the Dual form of SVM. Instead of solving the Primal form, we solve the Dual form, because in the Dual form the data points occur only through the inner products $x_i^{T} x_j$.
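For reference, the Dual form in standard notation (with Lagrange multipliers $\alpha_i$) is:

```latex
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i\,\alpha_j\,y_i\,y_j\,x_i^{T} x_j
\quad \text{subject to} \quad
0 \le \alpha_i \le C, \qquad \sum_{i=1}^{n} \alpha_i\,y_i = 0
```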
3. Kernel Trick
In the dual formula of SVM, instead of using $x_i^{T} x_j$ we can use any similarity function $\text{sim}(x_i, x_j)$. One important class of similarity functions is the kernel functions, so $x_i^{T} x_j$ is often replaced by $K(x_i, x_j)$.
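As a concrete example, here is a minimal sketch of one popular kernel, the RBF (Gaussian) kernel, written as a plain Python function; the function name and the `gamma` width hyperparameter here are standard but illustrative:

```python
import numpy as np

def rbf_kernel(x_i, x_j, gamma=1.0):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2): the similarity is 1 for
    # identical points and decays towards 0 as the points move apart.
    return np.exp(-gamma * np.sum((x_i - x_j) ** 2))
```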
The most important idea in SVM is the kernel trick, and it is super important: the soft-margin SVM hyperplane is very similar to Logistic Regression except for our margin-maximization idea, so if we don't apply the kernel trick we call the model a Linear SVM.
It is the kernel trick that makes SVM so powerful. A Linear SVM will fail in cases where the data is not linearly separable, but if we apply a Kernel SVM it works fairly well.
What does the Kernel Trick do?
Using the kernel trick, Kernel-SVM can solve non-linearly separable data: it effectively transforms the data into a different space and finds the separating hyperplane in that space. There are many kernels available, such as the Polynomial kernel and the RBF kernel. The sketch below shows a Linear SVM failing on data that an RBF-kernel SVM handles easily.
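A minimal sketch, assuming scikit-learn is available (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
kernel_svm = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_train, y_train)

print("Linear SVM accuracy:", linear_svm.score(X_test, y_test))      # near chance
print("RBF-kernel SVM accuracy:", kernel_svm.score(X_test, y_test))  # near 1.0
```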
4. How SVM works when we have outliers
Outliers have very little impact on our model, because only the support vectors matter to the decision boundary; so even if we have outliers in our data, their impact is very minimal, as the sketch below suggests.
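A small sketch of this, assuming scikit-learn (synthetic, illustrative data): we fit a linear SVM with and without one far-away mislabelled point and compare the learned $w$ and $b$.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two well-separated Gaussian blobs, labelled +1 and -1.
X = np.vstack([rng.randn(50, 2) + [2, 2], rng.randn(50, 2) - [2, 2]])
y = np.array([1] * 50 + [-1] * 50)

clean = SVC(kernel="linear", C=1.0).fit(X, y)

# Add a single positive-labelled outlier deep inside the negative region.
X_out = np.vstack([X, [[-6.0, -6.0]]])
y_out = np.append(y, 1)
noisy = SVC(kernel="linear", C=1.0).fit(X_out, y_out)

# With a soft margin, a single outlier's pull on (w, b) is bounded by C,
# so the two boundaries should stay close.
print("w without outlier:", clean.coef_[0], "b:", clean.intercept_[0])
print("w with outlier:   ", noisy.coef_[0], "b:", noisy.intercept_[0])
```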
5. Use cases of SVM
In the case of SVM, explicit feature transformation is replaced by the kernel, so we have to design a good kernel that works well for the problem. Finding the right kernel is not always easy, but we have the RBF kernel as a default; and if we find an appropriate kernel for the given problem, SVM works really well.
In the case of Linear SVM, the decision surface is a hyperplane, whereas Kernel-SVM can have a non-linear decision surface: using the kernel trick we can also solve non-linearly separable data. And if we are given a similarity or distance function, SVM works very well with it, as the sketch below shows.
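As a sketch of that last point, assuming scikit-learn (whose SVC supports kernel="precomputed"): if we can compute a similarity between any two points, we can hand the SVM a Gram matrix directly. The gram_matrix helper here is illustrative:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.1, random_state=0)

def gram_matrix(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2), an RBF-style similarity.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Train on similarities between all pairs of training points (n x n).
model = SVC(kernel="precomputed", C=1.0).fit(gram_matrix(X, X), y)

# Predict using similarities between new points and the training points.
print(model.predict(gram_matrix(X[:5], X)))
```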