Support Vector Machines (SVM), among classifiers, are probably the most intuitive and elegant, especially for binary classification tasks. To let you understand the intuition behind, I will explain them into a two-class environment: you will see that everything said will be valid also for multi classes tasks.
Let’s first consider the situation where our data are linearly separable.
Linearly Separable data
Generally speaking, the idea of SVM is finding a frontier which separates observations into classes. For this purpose, let’s consider the following example. Imagine you are a bank and you want to classify your clients among those who are creditworthy and those who are not, based on two features: monthly income and years spent working. To do so, you decide to collect some historical data of your past clients which have been already labeled as creditworthy or not creditworthy:
Our SVM algorithm is then asked to find a boundary which is able to segregate those two classes of clients, like so:
However, that was not the only way the two classes could have been separated. Let’s consider the following alternatives:
Since we can separate observations in numerous ways, SVM is performed so that it can finally find the boundary, called hyperplane, which best segregates the classes.
What does it mean “best segregating”?
In order to answer this question, we have to remember once again that the final goal of every algorithm is well predicting once adapted to new and unlabelled data. It doesn’t matter how accurate an algorithm might be once trained on our train set: if it is not able to properly classify future observations, it is completely worthless.
So, we can have 100% accuracy on our test set using thousands of different hyperplanes, but only one of them is also able to generalize the result, guaranteeing its own functionality even when future data will not “behave” exactly as train data (that is basically the reason why we need predictive models).
The optimum hyperplane is the one which guarantees the largest “area of freedom” for other future observations, an area where they can afford to deviate from their pattern without undermining the model. The SVM, in particular, is looking for a boundary that is maximally far away from any data point. This distance from that decision boundary to the closest data points determines the margin of the classifier.
The best hyperplane is the one which maximizes the margin, under the constraint that each data point must lie on the correct side of the margin. Now let’s look at the following examples:
You can definitely say that the hyperplane in Picture 2 is better than that in Picture 1, since the margin is larger. Note that both of them are perfectly able to segregate data, yet the second one will be more reliable once adapted to new, unlabeled data.
Now a further concept needs to be introduced. If you look at the graph:
you can see that there are some data points which are circled and they play a very important role in building the model. These data points, called Support Vectors, define a very interesting property of SVM optimization problem: only a few points actually end up in the final solution for creating the vector of parameters w which will define the best hyperplane.
The implication of what I’ve been explaining is surprising: every point that is not a support vector can be moved anytime you want, yet it will not affect our decision boundary parameters.
So far we examined a situation where the segregation was visible even to the naked eye, since data were linearly separable. However, what happens when we do not face such a utopian situation?
No-Linearly Separable data
Imagine you are asked to segregate the following data points:
There are two classes: squares and circles, however there is no way to trace a straight line to segregate those classes. How shall we proceed?
The answer to this task is the reason why SVM became so popular, as there is a trick, known as “Kernel Trick” from his inventor, which allows support vector machines to effortlessly go from linear models to non-linear models by using only a slightly different optimization problem from the one we have just seen.
What I want to do is providing an intuitive way to elaborate this task. Let’s see the idea behind with this easy graph:
Again, you want to segregate squares from circles, and here a possible boundary might be the one in green.
Now, the idea is the following: if we lift up the circles in a 3-dimensional plane, using a given transformation function φ(x) we would be able to easily separate observations with the support of a plan:
It would be definitely easier, but how can we compute φ(x)? Well, actually we are not asked to do so. Indeed, moving to a high-dimensional space is the kernel functions’ job, which creates an implicit feature space (in the example, the third coordinate has the only aim to lift up some observations and separate them from the others, but it does not have any meaning for itself) without even computing the coordinates of the data in that space, but rather by simply computing the inner products between the images of all pairs of data in the feature space.
Therefore, the transformation function φ(x) is not needed as we are provided with a list of kernel functions, known also as “similarity functions” . The most popular is the Radial Basis Function (RBF) kernel, or Gaussian kernel function, which looks like that:
With x and y being two vectors of features representing two different observations. And this is the trick: we substitute a hardly computable transformation function with a pre-build similarity function.
So, thanks to this trick, we are able to easily handle no-linear dataset like they were linearly-separable.
In my next article, I’m going to provide a full example in Python of what I’ve been saying so far, so if you found it interesting I suggest you to have a look at it, so that you can see how a SVM algorithm actually works.