SVM: Support Vector Machine (Beginner's ML Series) Part 1

Everything you need to know about SVM. From theory to math and coding by keeping it simple yet effective

Amir Khan
Analytics Simplified
8 min read · Aug 25, 2019


This article is co-authored by Muhammad Hamza

We are going to cover the following topics:

1- Overview

2- Introduction

3- How does SVM Work

4- Support Vectors & Margin

5- Linear & Non-Linear SVM

6- Hard Margin & Soft Margin in SVM

7- I/O of SVM

8- Hyperparameters of SVM

9- Advantages of SVM

10- Challenges in SVM

11- Disadvantages of SVM

12- Conclusion

Overview

If you have no idea what SVM is, what it is used for, or how it works, and want answers to all these questions, then you have come to the right place. In this article, we are going to explore SVM, starting with a daily-life example, and then try to understand it in a simple way, keeping our discussion brief and precise. So, let's begin:

SVM stands for Support Vector Machine. It is a classification algorithm. Before we go any further, let us discuss what classification is.

Classification

If we flip a coin, we get either heads or tails, meaning there are only two possible outcomes; we can say there are two classes to which a particular flip can belong. Classification is the process of assigning the outcomes of an event (like flipping a coin, an event with two outcomes) to predefined classes (like heads / tails in the coin-flip event).

Introduction

SVM is used to classify inputs (more on inputs later) into one of the predefined classes (like Yes / No or Head / Tail).

If SVM is used to classify between two classes like head / tail, then such a classifier is a binary classifier; otherwise it is a multiclass classifier.

SVM is inherently a binary classifier, but it can also act as a multiclass classifier using strategies such as one-vs-rest or one-vs-one, as the quick sketch below shows.
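A minimal sketch, assuming scikit-learn (whose SVC handles multiclass labels via a one-vs-one scheme internally):

```python
# Minimal sketch: scikit-learn's SVC accepts multiclass labels directly,
# internally reducing the problem to several binary (one-vs-one) classifiers.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes of iris flowers
clf = SVC().fit(X, y)
print(clf.predict(X[:5]))          # predicted class for the first 5 samples
```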

Let us proceed with the algorithm using a real-life example so that we can follow along in a better way.

Assume that in a university, students are enrolled in two courses: statistics and math. Based on the performance of students in these courses, the faculty has to decide whether a particular student will be given access to the machine learning course or not. In other words, whether it will be a Yes or a No. The faculty has data on past students and wants to automate the procedure of granting access to the machine learning course based on performance.

Clearly, this is a classification problem, and we want a machine learning algorithm that can solve it.

As we know, SVM is one of the best algorithms for classification. So, let's dive deep into SVM.

How does SVM Work

The first thing we would like to do with the data is to plot it, so that we can get a general idea of what is available to us.

Source: Kdnuggets

From the figure above, it can be seen that the red circles indicate students with relatively low scores in the courses, while the blue circles represent the pool of students who did well in both. The blue circles form our Yes class, since these students will get access to the machine learning course, and the red circles are the ones who won't.
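To reproduce a plot like the one above, here is a minimal sketch using matplotlib; the scores are made up for illustration, since the article's exact dataset is not available:

```python
# Hypothetical student scores, plotted the way the figure above shows them.
import matplotlib.pyplot as plt

no_students  = [(35, 40), (45, 35), (40, 50), (50, 42)]  # denied access (red)
yes_students = [(70, 80), (85, 75), (78, 88), (90, 82)]  # granted access (blue)

plt.scatter(*zip(*no_students), color="red", label="No")
plt.scatter(*zip(*yes_students), color="blue", label="Yes")
plt.xlabel("Statistics score")
plt.ylabel("Math score")
plt.legend()
plt.show()
```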

The task of SVM is to draw a line between these classes (Yes / No) in such a way that the classes don't mix with each other. There can be infinitely many lines between these two pools of circles (or these two classes). Two possible lines are shown below:

Source: Kdnuggets

We can see that both lines do the job of separating the classes: there is no mixing of classes on either side of each line.

Drawing a line to separate the classes from one another is the whole idea behind SVM.

The line drawn by SVM to separate the classes is called a hyperplane. We have two candidate hyperplanes in the figure above. So, which one do we choose?

It is the answer to this question that differentiates SVM from other classification algorithms: SVM chooses the line (or hyperplane) that achieves the maximum separation between the classes.

In other words, SVM chooses the hyperplane that has the maximum distance from the circles (or points) on both sides. But there are a lot of points in the figure, so which points should we consider for the distance calculation? SVM considers the points closest to the hyperplane, as can be seen in the figure below:

Hyperplane, Margin & Support Vectors (Source: Kdnuggets)

Support Vectors & Margin in SVM

It can be seen that this hyperplane has the maximum distance from the closest points on both sides. These points are the ones that decide the optimal hyperplane, because only the closest points matter. Since these closest points (the two points in the shaded region of the figure above) support the optimal hyperplane, they are called support vectors. The perpendicular distance between the support vectors on either side is called the margin.

We can now say that SVM tries to maximize the margin so that there is a good separation between the classes.

Apart from the hyperplane (the black line in the figure above), we have two other lines on which the support vectors lie. Our main hyperplane (the black line) is called the decision boundary, to differentiate it from the two hyperplanes on its left and right.
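Here is a minimal sketch, assuming scikit-learn and the hypothetical scores from earlier, that fits a linear SVM and inspects the support vectors it found:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: [statistics score, math score] per student
X = np.array([[35, 40], [45, 35], [40, 50], [50, 42],   # No class (0)
              [70, 80], [85, 75], [78, 88], [90, 82]])  # Yes class (1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

print(clf.support_vectors_)       # the closest points that "support" the plane
print(clf.coef_, clf.intercept_)  # w and b of the hyperplane w.x + b = 0
print(clf.predict([[60, 55]]))    # classify a new student
```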

Linear & Non-Linear SVM

Linear SVM

The data we have above can be separated by a straight line between the classes, which means the data is linearly separable. An SVM that draws a straight hyperplane between classes is called a Linear SVM (LSVM).

Non-Linear SVM

But in real life, data is not always linearly separable. In such cases, SVM will not draw a straight line between the classes; rather, the boundary might be a curved line like the one shown below. (Note that we are now talking about Non-Linear SVM.)

Non-Linear Data (Source: Kdnuggets)

In this case, we cannot completely separate the two classes with a straight line. There are two possibilities:

→ Either we draw some curved line to separate the classes completely

→ Or we allow some errors (misclassifications) and separate them linearly

In the former case, we are talking about non-linear SVM. The kernel trick is what helps us achieve that curved line: it is nothing but a way to change the shape of the decision boundary to avoid misclassification. (The simplest kernel is the linear one; note, though, that in scikit-learn's SVC the default kernel is RBF.)
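A minimal sketch, assuming scikit-learn: on data that no straight line can separate (one class nested inside a ring of the other), the RBF kernel bends the boundary while the linear kernel fails:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class sits inside a ring of the other: not linearly separable.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

print(SVC(kernel="linear").fit(X, y).score(X, y))  # poor accuracy
print(SVC(kernel="rbf").fit(X, y).score(X, y))     # close to 1.0
```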

Hard Margin & Soft Margin in SVM

If the data is linearly separable, then SVM can classify it perfectly, with no point allowed inside the margin. Such a margin is known as a Hard Margin. But at times the data is not linearly separable, and in order to draw a linear boundary through such messy data, we have to relax the margin to allow for misclassifications. This relaxed margin is known as a Soft Margin.

Input / Output of SVM

SVM expects numerical input and returns numerical output. Since it is a supervised algorithm, it needs labels along with the input. A sample of inputs (X1, X2) along with labels (Y) is shown below:

Source: Jason Brownlee Book on ML
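In code, that input looks like a numeric matrix X (one row per sample) and a label vector Y; the values below are made up purely for illustration:

```python
import numpy as np

# One row per sample, one column per feature (X1, X2); one label per sample.
X = np.array([[2.33, 4.68],
              [1.78, 3.40],
              [5.92, 1.05]])
y = np.array([0, 0, 1])
```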

Hyperparameters of SVM

Hyperparameters are used to tune an algorithm in order to maximize its accuracy. Every machine learning algorithm has certain hyperparameters.

SVM too has some hyperparameters. Let us discuss some of them.

Hyperparameter C (Soft Margin Constant)

In the latter case of allowing some errors (or misclassifications), we control the amount of error through a parameter called C. It can take any positive value, like 0.01 or even 100 or more, depending on the type of problem and the data we have.

C directly affects the hyperplane. C is inversely related to the width of the margin: the larger the C, the smaller the margin, and vice versa. The effect of C on the margin can be seen below:

C = 0.1 (large margin) Source: yunhaocsblog

Now let us increase the value of C to 10 and see its effect on the margin.

C = 10 (small margin)

So there is no hard and fast rule for selecting C; it depends entirely on the problem and the data at hand.
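One way to see the inverse relationship in code is a small sketch, assuming scikit-learn: for a linear SVM the margin width equals 2 / ||w||, so we can fit the same data with different C values and print the margin:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)

for C in (0.1, 10):
    w = SVC(kernel="linear", C=C).fit(X, y).coef_[0]
    margin = 2 / (w @ w) ** 0.5  # margin width = 2 / ||w||
    print(f"C = {C}: margin width = {margin:.3f}")  # larger C, smaller margin
```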

Hyperparameter Gamma

We discussed earlier that the hyperplane is decided based on the support vectors only; points beyond the support vectors are not given any weight. The gamma hyperparameter, used with non-linear kernels such as RBF, controls how far the influence of each point reaches.

A small value of gamma means that points far beyond the support vectors are also given weight in deciding the hyperplane; a large value means only the nearby points matter.

But why do we do this? Changing the value of gamma changes the shape of the decision boundary, and a new boundary means a new accuracy. At the end of the day, it is accuracy that matters, no matter how we achieve it.
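A minimal sketch, assuming scikit-learn and its RBF kernel, showing how gamma reshapes the boundary (reported here only through training accuracy):

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for gamma in (0.01, 1, 100):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    # Small gamma: smooth, far-reaching boundary. Large gamma: the boundary
    # hugs individual points, which can mean overfitting.
    print(f"gamma = {gamma}: training accuracy = {clf.score(X, y):.2f}")
```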

Kernel Trick in SVM

As discussed earlier, a kernel is used to transform non-separable data into separable data. In practice, this is achieved simply by changing the kernel argument of the SVM model.

Some of the available kernels are linear, polynomial, and radial basis function (RBF).
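In scikit-learn, switching kernels really is just a change of argument, as a quick sketch:

```python
from sklearn.svm import SVC

linear_svm = SVC(kernel="linear")
poly_svm   = SVC(kernel="poly", degree=3)  # polynomial kernel of degree 3
rbf_svm    = SVC(kernel="rbf")             # the "radial" kernel; SVC's default
```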

Advantages of SVM

→ Can work efficiently with small datasets

→ Can work when the number of features is greater than the number of samples

→ Effective in high-dimensional spaces (thanks to the kernel trick)

→ Memory efficient: uses only a subset of the training data (the support vectors)

→ Can control overfitting and underfitting (using the C hyperparameter)

Challenges in SVM

→ Choice of appropriate kernel and its parameters

→ Choosing between hard margin / soft margin

→ Choosing the right value of C

Disadvantages of SVM

→ It doesn't perform well when we have a large dataset

→ Sensitive to noisy data (might overfit)

Conclusion

We hope that you now have a general idea of SVM and are ready to read further on your own.

The math and coding of SVM will be discussed in Part 2 and Part 3 (available soon). Follow us to learn about other machine learning algorithms.

Thank You
