Support Vector Machine Part-1

Understanding the concept behind SVM

Abhinav Abhishek
Analytics Vidhya

--

Pic Credit- https://www.knowledgehut.com/

Introduction

Thanks for reading! In this article we will go through a very powerful and popular machine learning algorithm, the Support Vector Machine (SVM). We will try to understand the underlying concept on which SVM is based, using very simple examples from real-world scenarios. We will also cover a few basic concepts of linear algebra, because they will help in understanding the mathematics behind SVM, which we will explain in the next article.

Support Vector Machine-

A Support Vector Machine is a supervised machine learning algorithm developed by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. It is a discriminative classifier which works by plotting data in an N-dimensional space and then finding the hyper-plane which best separates the data into their respective classes; to find this hyper-plane it uses the concept of the maximum margin hyper-plane. It can be used for both classification and regression, but it is mostly used for classification problems. Today it is one of the most widely used algorithms in machine learning. Neural networks are among the most trusted and popular algorithms in machine learning, but SVM has made a dent in their popularity: using much less computational power than neural networks, it gives very reliable results on both linear and non-linear data.

The main idea or bottom line on which SVM works is this: it tries to find the classifier or decision boundary such that the distance from the decision boundary to the nearest data points of each class is maximum (such hyper-planes are also known as maximum margin hyper-planes). That is why it is also known as a maximum margin classifier.
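As a quick preview, here is a minimal sketch of fitting such a maximum margin classifier with scikit-learn's SVC; the toy data and variable names below are invented purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two separable clusters, one per class (made-up values).
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],   # class 0
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.2]])  # class 1
y = np.array([0, 0, 0, 1, 1, 1])

# A linear-kernel SVC fits the maximum margin hyper-plane.
clf = SVC(kernel="linear")
clf.fit(X, y)

# The nearest points to the boundary are the "support vectors".
print(clf.support_vectors_)
print(clf.predict([[3.0, 3.0]]))  # which side of the boundary?
```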

For people who are familiar with linear classifiers like logistic regression and neural networks, it is easy to visualize the concept of a decision boundary with maximum distance from each class, but let us discuss the entire concept from scratch.

To classify any data, we first need to plot it in some space. This is a general idea, and we can see many examples of it in our day-to-day life. Suppose you were given the task of separating two kinds of fruits, say oranges and mangoes, kept in a bag. You would take them out and place them on a table in two groups, with some appropriate distance between the groups. So, when I say we plot data in a space, think of the table as the space and the fruits as the data points.

In the same way, in a Support Vector Machine each data point is plotted in an N-dimensional space, where N is nothing but the number of features.

So why do we take the dimension of the space to be the same as the number of features?

The reason lies in linear algebra: if we want to plot a point where X1 = 2 and X2 = 3, then we draw a graph as below and place the point such that it is at distance 2 along the X1 axis and 3 along the X2 axis.

Fig 1. Why the dimension of the space equals the number of features.

So, ideally, to find the location of a point we need its distance along each axis. Each axis is a dimension in linear algebra, and the entire graph can be thought of as the space. Hence we need a space with dimension equal to the number of coordinates, and the coordinates are nothing but the values of the features.
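To make this concrete, here is a tiny sketch in Python with matplotlib (purely illustrative) that plots the point X1 = 2, X2 = 3 in a 2-dimensional space:

```python
import matplotlib.pyplot as plt

# One axis per feature: X1 on the horizontal, X2 on the vertical.
x1, x2 = 2, 3
plt.scatter([x1], [x2])
plt.annotate("(2, 3)", (x1, x2))
plt.xlabel("X1")
plt.ylabel("X2")
plt.xlim(0, 5)
plt.ylim(0, 5)
plt.grid(True)
plt.show()
```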

Primary Objective of SVM and Why This Objective Is Important-

Most linear classifiers, including SVM, have one thing in common: their objective is to find the best fitting decision boundary between the classes, such that there are either no or very few misclassified points.

To understand this objective, let us take the same example of separating fruits that we used above.

As our task is to separate two types of fruits kept together in a bag, to differentiate between the two types you would usually look at color, shape, size, etc. So shape, size and color are nothing but the features on the basis of which you can say which fruit is a mango and which is an orange. Let us put these features in a table as below, assuming that the color code for orange is 1 and for mango it is 2.

Fig 2. Table showing features of data set.

So here the number of features is 3, and all the above points will be plotted in a 3-dimensional space.

The point drawn will have coordinate values which are nothing but the values of the features. So, for the first row above, the coordinates will be (1, 3, 6). Now, consider that each such row represents a point (the point also represents the fruit), and we plot each point in an N-dimensional (3 for this example) space. Our objective is then to find the hyper-plane which can separate these points into the two classes, orange and mango, in such a way that we can say the data on one side of the plane belongs to orange and the data on the other side belongs to mango.
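As a small sketch of how such a data set looks as a matrix, one row per fruit and one column per feature; only the (1, 3, 6) row comes from the text above, the other values are invented:

```python
import numpy as np

# Columns: color (orange = 1, mango = 2), shape, size.
X = np.array([
    [1, 3, 6],   # the row above -> the point (1, 3, 6) in 3-D space
    [1, 2, 5],   # another orange (invented values)
    [2, 4, 9],   # a mango (invented values)
])
y = np.array(["orange", "orange", "mango"])

# Each row is one point; the number of columns is the dimension of the space.
print(X.shape[1])  # -> 3 features, so a 3-dimensional space
```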

The example above contains two important concepts: one is how a common example can be represented in a graphical or linear algebra way, and the other is what we actually want to find using SVM or other classifiers, which is nothing but the best decision boundary.

Now, let us take another example to understand our objective in terms of linear algebra, and we will also try to see why this objective is important. Here we will take only 2 features and 2 classes, as that is easy to visualize-

Suppose we have two features, X1 and X2, and two classes, A and B. Based on the number of training examples (suppose we have n of them), we will have n points to draw, and the values of X1 and X2 will be the coordinates of these points-

Fig 3. Graph of the points after plotting them in 2-dimensional space.

Now suppose that after placing the above points in the 2-dimensional space we get a graph like the one above. Since we have two feature columns (X1, X2), and as we know the dimension of the space depends on the number of features, we have chosen a 2-D space.

Now we need a decision boundary; in the above example, the decision boundary is a line separating classes A and B. But a line can be drawn in any direction and at any place. Its orientation and position are decided on several criteria, but the main one we are interested in is that the line should divide the entire data set into two parts (as we have two classes here; for multiple classes, the number of parts will be the number of classes), with points belonging to one class on one side and points belonging to the other class on the other side. So, in the above diagram, we can say that the line drawn is effective and can be our decision boundary.
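As a sketch, a linear classifier such as scikit-learn's LinearSVC learns exactly such a line, w1*x1 + w2*x2 + b = 0; the 2-D training data below is made up:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up 2-D training data for classes A (0) and B (1).
X = np.array([[1, 1], [2, 1], [1, 2],
              [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = LinearSVC().fit(X, y)

# The learned decision boundary is the line w1*x1 + w2*x2 + b = 0.
(w1, w2), b = clf.coef_[0], clf.intercept_[0]
print(f"{w1:.3f}*x1 + {w2:.3f}*x2 + {b:.3f} = 0")
```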

Fig. 4. Graph showing misclassified data.

So, from the above figure, we can see that although the point seems closer to class B, due to the decision boundary it will now be considered class A, as it falls slightly on the left-hand side of the line. We can say that although the line was best fitted for the training data, it failed on the test data; in machine learning terminology we call this overfitting. Therefore, to avoid overfitting, or to reduce misclassification, we not only need a decision boundary, we need to find the best one.

So, from the above examples, we got our objective, to draw the best fitting decision boundary, and we also saw why it is important.

To get the best decision boundary, we have many options; for example, we can apply a neural network and, through gradient descent, draw an arbitrarily shaped boundary to achieve the best classification.

See figure below-

Fig. 5. Graph showing decision boundary drawn using Neural Networks.

But this will require huge computational resources.
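Just as an illustration, and not this article's focus, here is a minimal sketch using scikit-learn's MLPClassifier, a small neural network trained by gradient descent; the XOR-like data below is invented:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Made-up data that no straight line can separate (an XOR-like pattern).
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([0, 0, 1, 1])

# Gradient descent gradually bends an arbitrarily shaped boundary,
# at the cost of many training iterations.
nn = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
nn.fit(X, y)
print(nn.predict(X))
```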

So the question here is: do we have any other effective mechanism to solve this problem without using that many resources?

The answer is yes: Support Vector Machines can help us here. But before we start with the details of SVM, let us first get an idea of the hyper-plane, which we are going to use in the explanation.

Hyper-plane- As we have seen above, we draw a hyper-plane to separate the points, but what exactly is a hyper-plane? A hyper-plane is a geometric entity whose dimension is one less than that of its surrounding space. So, if the space has dimension N, then

Dimension of hyper-plane = N - 1

For example, in a 3-dimensional space the hyper-plane will have 2 dimensions, and as we know a 2-dimensional entity is called a plane, so the hyper-plane in 3-D space is a plane. Similarly, a hyper-plane in 2-D space will have one dimension, and we know a 1-dimensional entity is nothing but a line.

For example, above we had a 2-D space, so the dimension of our decision boundary had to be 1-D; that is why we drew a line to separate our data set. You can think of a plane as the analogue of a line, one dimension up, in 3-D space.
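In standard linear algebra notation (a general formulation, not specific to SVM), a hyper-plane in an N-dimensional space is the set of points satisfying one linear equation, which is exactly why it loses one dimension:

```latex
% A hyper-plane in an N-dimensional space: all points x satisfying one
% linear equation (w is a normal vector, b an offset). One equation
% removes one degree of freedom, hence dimension N - 1.
H = \{\, x \in \mathbb{R}^N : w \cdot x + b = 0 \,\}
% N = 2: w_1 x_1 + w_2 x_2 + b = 0            (a line)
% N = 3: w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0  (a plane)
```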

But before we move further, we should know the concept of the hyper-plane in terms of machine learning. In machine learning, a hyper-plane divides the data set into its respective classes, so if we have two classes, the hyper-plane should divide the data into 2 parts.

Now that we have seen the concept of the hyper-plane, let us see Support Vector Machines in detail-

As we know, the idea behind the Support Vector Machine is to find a plane or decision boundary such that the distance from the nearest points of each class to the decision boundary is maximum. Here we can ask 2 questions: first, why do we need maximum distance, and second, why do we need maximum distance from each class? To understand this, let us take the same example of separating fruits into 2 parts or classes, i.e. orange and mango. This time we will not draw a graph but rather take a very simple approach; see the figure below-

Fig. 6. Graph considering only size and color as features.

Consider only two features, size and color, for now. We can say that if the size is small and the color is orange, the class is orange, and if the size is bigger and the color is yellow, it belongs to mango. Now our goal is to find a decision boundary so that we can say the data on the left-hand side of the decision boundary belongs to the orange class and the data on the right-hand side belongs to the mango class. From the figure above, we can see that we could place the decision boundary anywhere, so let us take 3 cases: one closer to the orange class, denoted in Fig. 7 below as D1; one closer to the mango class, denoted by D2; and, for the decision boundary with maximum distance from each class, let us take the distance between the nearest points of each class, i.e. Po of the orange class and Pm of the mango class, and place our decision boundary exactly in the middle of that distance; it is denoted by Dm in the figure below.

Fig. 7. Graph showing 3 decision boundaries.
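A minimal numeric sketch of the Dm placement (the sizes below are invented, just to illustrate taking the midpoint between the nearest points of the two classes):

```python
# Sizes of the fruits nearest to the boundary (invented values).
po_size = 6.0   # nearest orange, Po
pm_size = 10.0  # nearest mango, Pm

# Dm sits exactly in the middle, leaving equal room on both sides.
dm = (po_size + pm_size) / 2
print(dm)  # -> 8.0
```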

Let us take the case of decision boundary D1. It seems okay, as the distance between it and the nearest point of at least one class, i.e. mango, is maximum. But what will happen when we get an orange (Po) whose size is a little bigger than the others, like in the figure below-

Fig. 8. Graph showing a case of misclassification when D1 is considered.

As per our decision boundary, it belongs to the mango class, as it is on the right-hand side of the decision boundary; but it is actually very close to the orange class, so we can say it is a misclassification case, and our decision boundary is not capable of handling a scenario where the size is a little bigger. In machine learning terminology, our decision boundary does not generalize to handle such variations.

Similarly, decision boundary D2, having maximum distance from the orange class, can seem an appropriate decision boundary; but in the case where the size of a mango is less than usual, shown as Pm below, our decision boundary will put it in the orange class, although the data is actually very close to the mango class. So this decision boundary is also not the best one, as it too fails to handle slight variations in the data.

Fig. 9. Graph showing misclassification when D2 is considered.

Now let us take our decision boundary Dm, which has equal distance from the nearest point of each class.

Fig. 10. Graph showing case of decision boundary having same distance from both classes.

Here we can say that it is the best decision boundary we can have, as it has successfully handled the variations present in points Po and Pm, which were being misclassified by the other 2 decision boundaries.

So, from the above example, we got the answers to our 2 questions-

First, why do we need maximum distance? Because with maximum distance we can avoid the misclassifications that occur due to small variations in the data.

Second, why do we need maximum distance from each class? Because in that case each class has room to absorb variations in its data properly.

When we talk about the distance from the nearest points, we actually have a term for it in linear algebra: the margin. We can define the margin as below-

Margin- The margin can be defined as the distance from the closest points to the decision surface. We can also say the margin is the distance between the decision boundary and each of the classes. So, in the figure below, we can see that the points (X11, X21) and (X12, X22) are closest to the decision surface; hence the distance between these points and the decision surface is the margin.

Let us plot the above points in the graph below to visualize the concept in more detail-

Fig. 11. Graph showing Margin

For the fruit classification example, the points Po and Pm are the nearest points, so the distance between them and the decision surface Dm is the margin.

Fig. 12. Graph showing Margin.
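As a closing sketch, for a linear SVM the total margin width works out to 2/||w||, a standard result, and the support vectors reported by scikit-learn are exactly those closest points; the data below is made up:

```python
import numpy as np
from sklearn.svm import SVC

# Made-up, linearly separable 2-D data.
X = np.array([[1, 2], [2, 1], [1, 1],
              [5, 6], [6, 5], [6, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin

w = clf.coef_[0]
print(clf.support_vectors_)   # the closest points to the boundary
print(2 / np.linalg.norm(w))  # total margin width, 2 / ||w||
```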

Conclusion

So, in this article we tried to understand the following underlying concepts-

· What a support vector machine is.

· The concept of a space, through an example.

· How the dimension of the space is related to the number of features.

· Through a graph and an example, how data points are plotted in a space and how we draw a plane to divide them.

· The idea of the hyper-plane.

· SVM in detail, with a few examples giving the idea of the margin and why we need maximum margin from each class.

In the next article, we will go through the mathematical concepts behind SVM, and we will try to build the intuition behind it using an example from another classifier algorithm, Logistic Regression.
