The SVM we need to know || The SVM we implemented.

Atul Mishra
Published in Analytics Vidhya
7 min read · Sep 15, 2020

Hey guys, hope you're all doing great. To start with, this post is a continuation of my last two blogs, where we discussed Hands-On Diabetes Prediction, followed by an explanation of Logistic Regression and how it helped in our hands-on application.

  • The support and feedback have been very kind, and they make me more passionate about delivering for and helping the community.

This post continues in the same vein, as here we will be discussing one more algorithm which we used in our Custom Pipeline: SVM (Support Vector Machines).

Just as we covered the what and the why for Logistic Regression, we have to do the same for SVM here. I was tempted to bring in the Fashion MNIST dataset and the DL model I tried for this blog, but that would deviate from the topic at hand, so let's start.

What is SVM?

  • Support Vector Machine is a classical and versatile Machine Learning model which can be used for classification/regression and even for outlier detection.
  • It can perform these operations on linear as well as non-linear data, but the real power of an SVM model shows when you have high-dimensional data with a highly non-linear relationship between the target and predictor variables. I used an SVM model for sentiment analysis and it gave me some good results.
  • The fundamental idea behind Support Vector Machine is to fit the widest/largest possible margin between the decision boundary that separates two or more classes and training instances.
Visual Representation of Binary SVM Model
  • The above image illustrates the concept of a hyperplane in 2D. Here, the line in the middle of the margin is our HYPERPLANE separating the two classes. The mathematical equation of a line is ax + by + c = 0, right?
  • So if we take that hyperplane to be a line, then the data points on either side of the margin satisfy ax + by + c < 0 for class -1 and ax + by + c > 0 for class 1. This mathematics is important because right now we have 2 dimensions; what happens when we go into 3 dimensions, or generalise the same to N dimensions?
  • Now, let's touch on the 3D case too in order to reach the N-dimensional generalisation. A hyperplane in 3 dimensions is simply called a “plane”. Here the equation becomes ax + by + cz + d = 0, and similarly the margins become the positive and negative sides of the plane.
Snip from my SVM Notes -> 3D Plane
  • Can anyone answer this: what is the dimension of a HYPERPLANE in a 3D space? Answer: the number of features minus 1, i.e. 2, since the features in 3D space are {x, y, z}.
  • For D dimensions (10, 1000, a million…), the final equation is: w1·x1 + w2·x2 + w3·x3 + … + wd·xd + c = 0, where x1, x2, x3, …, xd are the dimensions, w1, w2, w3, …, wd are the coefficients, and c is our constant.
  • The model denoted by this equation is known as a Linear Discriminator, and it is a popular technical interview question (a small sketch of it follows right after this list). But so far we have only discussed what SVM is and considered only linearly separable data. Now, in the why section of SVM, we will see the power of SVM over non-linearly separable data.
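To make the linear discriminator concrete, here is a minimal sketch (my own illustration, not from the original pipeline); the weights, constant, and point are made-up numbers, and the class is simply read off the sign of the score.

import numpy as np

# Hypothetical coefficients w1..wd and constant c, purely for illustration.
w = np.array([0.4, -1.2, 0.7])
c = 0.5

# A point x1..xd in the same d-dimensional space.
x = np.array([1.0, 0.3, -2.0])

# The linear discriminator: w1*x1 + w2*x2 + ... + wd*xd + c
score = np.dot(w, x) + c

# The sign of the score tells us which side of the hyperplane we are on.
label = 1 if score > 0 else -1
print(score, label)  # negative score -> class -1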

Why SVM?

  • The classification problem can be solved using Logistic Regression, but wait a minute, isn't Logistic Regression more suitable for binary classification? What if I have more than 2 classes? What if I have higher-dimensional data? What if my data is not linearly separable?
  • In such cases we need the power of SVM to come in and generate a plane/hyperplane which can segregate the classes much better. But there's a catch: how much are you willing to pay in misclassified points in order to achieve a better, more stable model? Let's first see some non-linearly separable examples and then discuss this issue.
Ways and views of non-linear trend
  • These kinds of messy trends do give us a hard time while modelling, but that's where we need to figure out whether we want Hard Margin Classification or Soft Margin Classification.
  • Hard Margin Classification imposes that all the data points must be strictly off the margin and on the correct side of it.
Hard Margin SVM
  • The downsides: hard-margin classifiers generally overfit, and they are only applicable when the data is linearly separable.
  • Along with that, they are sensitive to outliers, which also contributes to their overfitting nature.
  • We will discuss/compare hard vs soft margin too.
  • In order to overcome the challenges faced in hard-margin classification, generalise well, and have a more feasible model, Soft Margin Classification is used.
Soft Margin SVM
  • Here we allow the model to misclassify some data points, which hard-margin classification prohibits. This gives the model the freedom to fit the data in a more sensible way.
  • The objective here is to find a good balance between keeping the margin street as large as possible and limiting margin violations.
  • Now, you might think: since we allow the soft-margin model to make some misclassifications, how is the model optimal? Well, think of it from this perspective: you have to travel from point A to B and there are multiple routes. Path A-C-B is maybe 3 to 4 km longer than path A-D-B, but the shorter road is terrible to travel on, so you take the more suitable path and your journey becomes comfortable, right?
  • Similarly, we allow the misclassification, but we also control it with hyperparameter tuning and slack variables. A slack variable tells you the relative location of a data point with respect to the margin and the hyperplane.
  • Let's see the comparison (and a small sketch of the controlling hyperparameter after it) and move on!
Hard vs Soft
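In scikit-learn, the hyperparameter C controls this trade-off: a very large C approximates hard-margin behaviour, while a small C gives a softer margin that tolerates some violations. The following is a minimal sketch on toy data (the dataset and values are my own assumptions, not from the post's pipeline):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping blobs, so a hard margin is difficult to achieve.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=42)

hardish = SVC(kernel="linear", C=1e6).fit(X, y)  # near hard-margin
soft = SVC(kernel="linear", C=0.01).fit(X, y)    # soft margin

# A softer margin typically keeps more support vectors inside the street.
print("support vectors with C=1e6: ", hardish.n_support_.sum())
print("support vectors with C=0.01:", soft.n_support_.sum())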

It was important for us to go through this, because only then can we answer when an interviewer asks how you choose the best SVM model. Let me quote it one more time: the best SVM model is the one which has the maximum margin between the decision boundary that separates the two classes and the training instances.

Now, the SVM we implemented in the diabetes prediction hands-on was a hard-margin classifier, as we had linearly separable data points. In addition, you might be wondering why I haven't talked about the kernels used in SVM.

  • Well, to begin with, kernels are generally brought into the picture when a soft-margin classifier has to be used on non-linearly separable data points. They deserve a detailed explanation and implementation of their own, which I will be showcasing in upcoming posts. But I won't keep you in the dark, so briefly, here are the different kinds of kernels and how they look (with a short sketch below of how they are selected in code).
SVM Kernels
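As a quick preview of the code side (a sketch on a toy moons dataset, not the post's data), scikit-learn exposes the common kernels through a single argument:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Non-linearly separable toy data.
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# The same estimator, switching only the kernel.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel).fit(X, y)
    print(f"{kernel:>7}: training accuracy = {clf.score(X, y):.3f}")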

Points I would like to share:

  • Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class. One of my colleagues disputed this and shared a link suggesting that SVMs too output probabilities, but that is half true and half false.
  • To clarify, let me put it like this: an SVM model outputs the distance between a test/train instance and the decision boundary, which can be used as a confidence score. However, this score cannot be converted directly into a class probability. Now, in sklearn, if we set probability=True while creating an SVM model, then during training the model calibrates probabilities by fitting a Logistic Regression to the SVM's scores. This injects the predict_proba() and predict_log_proba() methods into the SVM, as the sketch below shows.
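Here is a small sketch of that behaviour on toy data (the dataset is my own assumption): decision_function() returns the signed confidence score, and probability=True adds the calibrated probability methods.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=1)

# probability=True triggers the probability calibration during fit().
clf = SVC(probability=True, random_state=1).fit(X, y)

print(clf.decision_function(X[:3]))  # raw confidence scores (signed distances)
print(clf.predict_proba(X[:3]))      # calibrated class probabilities
print(clf.predict_log_proba(X[:3]))  # log of the calibrated probabilities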

Interview questions:

  • Why is it important to scale the inputs when using SVMs? SVMs try to fit the largest street/margin between the data points and the decision boundary, so if the training set is not scaled, the model will tend to neglect features with small values relative to the others (see the first sketch after this list).
  • An RBF SVM model is trained but it tends to overfit the training data; what can we do about it? Overfitting suggests too little restriction is being imposed on the model, technically called regularisation. So there might be too little regularisation. In order to increase regularisation, we need to decrease gamma (which acts like a regularisation hyperparameter), or C, or even both at once.
  • Try an SVM regressor on the California Housing dataset. Exercise to attempt (a starting sketch follows below).
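Two quick sketches for the questions above, both on assumed toy setups rather than the post's pipeline. First, the effect of scaling: wrap a StandardScaler and an SVC in a pipeline so that no single large-scale feature dominates the margin.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000  # blow up one feature's scale to mimic unscaled data

# Cross-validated accuracy, with and without scaling.
print("unscaled:", cross_val_score(SVC(), X, y).mean())
print("scaled:  ", cross_val_score(make_pipeline(StandardScaler(), SVC()), X, y).mean())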
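And for the exercise, a possible starting point (my sketch, not a worked solution): an RBF-kernel SVM regressor on the California Housing dataset, where the C and epsilon values are arbitrary guesses left for you to tune.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = fetch_california_housing(return_X_y=True)

# Subsample for speed; SVR training scales poorly with dataset size.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5000, test_size=2000, random_state=42)

svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1))
svr.fit(X_train, y_train)

print("R^2 on the test set:", svr.score(X_test, y_test))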

Conclusion

That's all from my end for this post. I know it might have been too theoretical, but I tried to keep it in a technical-yet-non-technical fashion so that everyone can get the gist of how SVMs work, what they are, and can pass on the same explanation.

Also, for the next post, I'm thinking of either bringing in the Fashion MNIST dataset with a DL model, or continuing with ML and covering Decision Trees / Gaussian Naive Bayes. So let me know your views and your choice for the next blog in the comments section. Till then, stay safe… stay hygienic.

Feedback is always an input for continuous improvement.
