Quick Tutorial on Support Vector Machines

Vighnesh Uday Tamse · Published in The Startup · Jul 2, 2020 · 5 min read

Support Vector Machines (SVMs):

SVMs are a powerful class of supervised machine learning algorithms for both classification and regression problems. In the context of classification, they can be viewed as maximum-margin linear classifiers. Why? Well, we'll see that in a bit.

The SVM uses an objective which explicitly encourages lower out-of-sample error (good generalization performance).

For the first part, we will assume that the two classes are linearly separable. For non-linear boundaries, we will see how projecting the data points into a higher dimension allows them to be separated linearly by a plane.

Let's create a dataset with two classes, and let the classes be linearly separable for now.
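Here is a minimal sketch of how such a dataset can be created with scikit-learn's make_blobs; the exact n_samples, cluster_std and random_state values are assumptions chosen to stay consistent with the outputs shown later in the post.

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Two well-separated clusters of 2D points (parameters are illustrative)
X, y = make_blobs(n_samples=50, centers=2, cluster_std=0.60, random_state=0)

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plt.show()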

Linearly separable classes:

Now, we know that we can differentiate these two classes by drawing a line (decision boundary) between them. But we need to find the optimal decision boundary, the one we can expect to generalize best to unseen points.

Many possible separators:

Considering these 3 decision boundaries, the point 'x' can easily be misclassified depending on which boundary we pick. We therefore want our classifier to be robust to this kind of perturbation in the input, which could otherwise lead to a drastic change in the output. We will see how the SVM overcomes this situation by plotting margins.

We know that we can draw millions of lines (decision boundaries) that separate the classes, but we want the one with the best generalization performance, i.e. the lowest out-of-sample error. To achieve this, instead of using a zero-width line as in the graph above, the SVM draws a margin of finite width on both sides of the line, extending up to the nearest data point.

Plotting the margins:

The SVM then picks the decision boundary with the maximum margin as the optimal model.
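To make "maximum margin" precise: writing the decision boundary as the hyperplane $w^\top x + b = 0$ and the class labels as $y_i \in \{-1, +1\}$, the hard-margin SVM solves

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i\,(w^\top x_i + b) \ge 1 \ \text{ for all } i,$$

and the resulting margin width is $2/\lVert w \rVert$, so minimizing $\lVert w \rVert$ is exactly what maximizes the margin.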

SVM in practice:

Now that we have a good understanding of how the SVM works, let's see how to implement it using scikit-learn.
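The model fitted below is a sketch of what produces the output that follows: a linear kernel with a very large C (1e10), which effectively forces a hard margin; X and y are the dataset created above.

from sklearn.svm import SVC  # Support Vector Classifier

# A very large C leaves (almost) no room for margin violations: a hard margin
model = SVC(kernel='linear', C=1E10)
model.fit(X, y)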

SVC(C=10000000000.0, break_ties=False, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale',
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)

This does not seem to be very intuitive. So let's plot the decision boundaries.

Plotting the SVM Decision Boundaries:
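The plot can be reproduced with a small helper along these lines; this is a sketch, and the function name plot_svc_decision_function and its styling are my own rather than anything built into scikit-learn.

import numpy as np

def plot_svc_decision_function(model, ax=None):
    """Plot the decision boundary and margins of a fitted 2D SVC."""
    ax = ax or plt.gca()
    xlim, ylim = ax.get_xlim(), ax.get_ylim()

    # Evaluate the decision function on a 30x30 grid over the current axes
    xx = np.linspace(xlim[0], xlim[1], 30)
    yy = np.linspace(ylim[0], ylim[1], 30)
    YY, XX = np.meshgrid(yy, xx)
    P = model.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)

    # Level 0 is the decision boundary; levels -1 and +1 are the margins
    ax.contour(XX, YY, P, levels=[-1, 0, 1], colors='k',
               linestyles=['--', '-', '--'])
    # Circle the support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=200, facecolors='none', edgecolors='k')

plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='autumn')
plot_svc_decision_function(model)
plt.show()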

The dotted lines here are known as margins. The data points touching these margins are known as support vectors. In scikit-learn, the identity of these points is stored in the support_vectors_ attribute of the fitted classifier.
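With the model fitted above, they can be printed directly:

model.support_vectors_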

array([[0.44359863, 3.11530945],
       [2.33812285, 3.43116792],
       [2.06156753, 1.96918596]])

Overlapping classes:

The data points of the two classes in the above example were clearly separable, i.e. there was no overlap between the points of the two classes. But what if the two classes do overlap?

To handle such cases, we need to tune the hyperparameter C of the SVC model. This process of tuning a model's hyperparameters for a better fit is usually known as hyperparameter tuning. Depending on the value of C, we get a hard or soft margin, which decides how many margin violations (misclassifications) are permissible.

We will now see how changing the value of C affects the fit of our model.
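The comparison below is a sketch: the overlapping dataset is generated with a larger cluster_std (an assumption), and the two C values match the figures described next.

# Overlapping classes: a larger cluster_std makes the two blobs overlap
X2, y2 = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=0)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, C in zip(axes, [10.0, 0.1]):
    clf = SVC(kernel='linear', C=C).fit(X2, y2)
    ax.scatter(X2[:, 0], X2[:, 1], c=y2, s=50, cmap='autumn')
    plot_svc_decision_function(clf, ax)
    ax.set_title('C = {0}'.format(C))
plt.show()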

As you can see in the first figure, where C=10, few or none of the data points are allowed to enter the margin, which is not the case in the second figure, where C=0.1.

Non-linearly separable classes:

So far in our discussion we have seen data that is linearly separable. But what if the data points of the classes are not linearly separable? What if the data looks something like this:
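A common toy example of such data is two concentric circles; here is a sketch of how it could be generated with scikit-learn's make_circles (the exact parameters are assumptions):

from sklearn.datasets import make_circles

# One class forms a ring around the other: no straight line can separate them
X3, y3 = make_circles(n_samples=100, factor=0.1, noise=0.1, random_state=0)

plt.scatter(X3[:, 0], X3[:, 1], c=y3, s=50, cmap='autumn')
plt.show()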

Let's see what happens if we try to fit the SVC model with a linear kernel:
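A sketch of that attempt, reusing the plotting helper from earlier:

clf_linear = SVC(kernel='linear').fit(X3, y3)

plt.scatter(X3[:, 0], X3[:, 1], c=y3, s=50, cmap='autumn')
plot_svc_decision_function(clf_linear)
plt.show()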

This doesn't look good, right? Our linear SVC model is not able to distinguish between the classes at all.

Remember that at the beginning we briefly discussed that if the classes are not linearly separable, we can project the data into a higher dimension and then draw a hyperplane that separates the classes? Let's visualize the data in 3D, since we only have two features to start from.
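One way to do this (a sketch; the particular radial feature r used as the third dimension is an assumption, chosen because it lifts the inner circle upwards) is to compute a new coordinate from the existing two and plot in 3D:

from mpl_toolkits import mplot3d  # registers the 3D projection

# Third dimension: a radial basis function centred on the middle of the data
r = np.exp(-(X3 ** 2).sum(axis=1))

ax = plt.subplot(projection='3d')
ax.scatter3D(X3[:, 0], X3[:, 1], r, c=y3, s=50, cmap='autumn')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('r')
plt.show()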

When we project the data into this higher dimension, it becomes linearly separable and we can separate the classes using a plane.

But there is one problem here. With only two features, projecting into 3D was easy, but in general the feature space needed to make the classes linearly separable can become very high-dimensional, and explicitly computing such a projection for every data point quickly becomes infeasible.

Thanks to the SVM's kernel hyperparameter, we can overcome this. Using what is called the kernel trick, we can separate the classes without explicitly projecting the data into higher dimensions. We just need to change the kernel from 'linear' to 'rbf' (Radial Basis Function).
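Switching kernels is a one-line change; the C=10 below matches the fitted model shown next, and the data is the circles dataset from above.

clf_rbf = SVC(kernel='rbf', C=10)
clf_rbf.fit(X3, y3)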

SVC(C=10, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

Let's plot the decision boundaries.
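Reusing the same helper as before:

plt.scatter(X3[:, 0], X3[:, 1], c=y3, s=50, cmap='autumn')
plot_svc_decision_function(clf_rbf)
plt.show()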

Isn't that powerful and intuitive?

In this post we have tried to understand how the SVM works for both linear and non-linear data and how to implement it. Try implementing it with different sets of data points.

Hope you understood and liked this post. If you have any suggestions or any feedback please do reach out to me. I'll be happy to hear from you.

Will see you in the next post. Till then, take care, stay safe and stay healthy!
