Support Vector Machines — What Are They?

Bhanu Kiran
6 min read · Jan 8, 2023


Are they really machines? What are these SVMs people keep talking about? SVMs are quite popular models, and they are heavily based on mathematical methods. Usually, the moment you put math into anything, it becomes more complex and automatically more daunting, but over here, things are as simple as they can be!

If you are somewhat familiar with classification methods, you should be aware of classification by data separation. These methods deal with either a yes or no, binary formats such as 1 or 0, or multiple labels such as cat, dog, flower, etc. Let’s take an example: as you can see below in Fig 1, we have a bunch of data scattered on the graph.

Fig 1. data scattered on a graph

At a glance, if we were to separate this data, we would draw a line or a “boundary” between the two groups of points. But a boundary drawn by person A can be quite different from one drawn by person B, person C, and so on. And all of these boundaries are correct; there is no wrong answer. Let’s plot some of the possible boundaries for Fig 1.

Hyperplane

Fig 2. boundaries

From Fig 2 we can see that there are several boundaries that separate the data points. Each such line of separation is called a hyperplane, and somewhere among them there is one perfect line of separation. The catch here is that there can be more than one hyperplane, as there can be infinitely many boundaries.

If we find a hyperplane that separates the data perfectly, we can say that the data is linearly separable.
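To make “which side of the hyperplane” concrete, here is a tiny Python sketch (the weights and points are made up purely for illustration): in 2D a hyperplane is just a line w·x + b = 0, and the sign of w·x + b tells you which side of it a point falls on.

```python
import numpy as np

# A hyperplane in 2D is just a line: w . x + b = 0.
# These weights and points are made up purely for illustration.
w = np.array([1.0, -1.0])   # normal vector to the hyperplane
b = 0.0                     # offset

points = np.array([[2.0, 0.5],   # should land on the positive side
                   [0.5, 2.0]])  # should land on the negative side

# The sign of w . x + b tells us which side of the hyperplane each point is on.
sides = np.sign(points @ w + b)
print(sides)  # [ 1. -1.] -> the two points fall on opposite sides
```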

But how do you choose the location of the hyperplane? To achieve this we use something called a Maximal Margin Classifier.

Maximal Margin Classifier

The idea here is that if we look at points far away from the boundary, we can be sure of their class: it’s either this or that, yes or no, 1 or 0. But as we move closer to the boundary, it becomes difficult to say. In reality, the lines and data points aren’t as tidy as displayed in Fig 2; you can have a bunch of data points and draw an optimal boundary, yet still have a 0 falling just next to the boundary between the 1s and 0s, or a 0 falling in with the 1s.

In these cases, we look at the closest points, and we find that there is some space between them and the boundary, a space that creates a margin.

Fig 3. margins

We can say from the margins in Fig 3 that the larger the margin, the more separated the points are. The green and orange lines in Fig 3 are called maximal margins.

Fig 4. maximal margins

Now, intuitively, we can find the optimal hyperplane that makes our data linearly separable, and this is done by finding the maximal margin with the greatest separation, which gives us the greatest predictive power.
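If you want to see this in action, here is a minimal sketch using scikit-learn (the toy blobs and the huge C value are my own choices for illustration, not something from the figures): a linear SVC with a very large C behaves like a hard, maximal margin classifier, and the margin width comes out as 2 / ||w||.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy, well-separated data: two blobs (made up purely for illustration).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

# A very large C approximates a hard (maximal) margin: almost no errors allowed.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print(f"hyperplane: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0")

# The margin width of a linear SVM is 2 / ||w||:
# the larger this number, the more separated the two classes are.
print("margin width:", 2 / np.linalg.norm(w))
```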

Now that we have talked about all this, the real question is, what are support vectors?

Support Vectors

In Fig 4 you can observe that some of the points lie on the green and orange lines; these points are called support vectors.

To simplify this, let’s break down the words: support means something that can assist, and a vector in machine learning is a tuple of one or more scalars, i.e. a data point. Now it makes sense! Support vectors are the data points which lie on the maximal margin.

Fig 5. support vectors

The hyperplane can be defined by the points closest to the boundary, and we call them support vectors. The maximal margin hyperplane depends only on the support vectors, and these support vectors define the classification model: the model is simply the set of data points that define the location of the boundary between the two classes.
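Continuing the same hypothetical scikit-learn sketch from above, the fitted model literally stores these points, and nothing else matters for the boundary:

```python
# Continuing the linear SVC sketch from above: the fitted model keeps only
# the points that pin down the margin, i.e. the support vectors.
print(clf.support_vectors_)   # coordinates of the support vectors
print(clf.n_support_)         # how many support vectors each class contributes

# Every other training point could be removed without moving the hyperplane.
```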

All of this makes sense, and now we can start to define how SVMs make predictions. If you have paid attention so far, you might realize that most of the work is done by the maximal margin, and that is exactly the theory behind support vectors: it is the maximal margin classifier.

Maximal margin classifier — prediction

Say, for instance, we have two classes, green and orange.

Fig 6. maximal margin classifier

For these two classes we define a hyperplane, which is done via the maximal margin. Once we have our hyperplane, the prediction depends on where the new instance is located relative to the maximal margin hyperplane.

Fig 7. prediction using hyperplane
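In scikit-learn terms, sticking with the same hypothetical model from before, prediction is just asking which side of the hyperplane the new instance lands on (the new points below are made up for illustration):

```python
import numpy as np

# Two hypothetical new instances (coordinates made up for illustration).
new_points = np.array([[0.0, 5.0],
                       [2.5, 1.0]])

# predict() gives the class label; decision_function() gives the signed score
# w . x + b, whose sign says which side of the hyperplane the point is on.
print(clf.predict(new_points))
print(clf.decision_function(new_points))
```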

But in reality, life is more complex than the figures shown above, and very often the data is scattered all over the place.

Soft Margin

Fig 8. soft margin

In this case in Fig 8. the data towards the left also has some orange points, and at a glance, we know that this is not linearly separable.

The key is that we try to find a line separating most of the data points and accept that some of the points are misclassified. This is called a soft margin, where we let some of the data points fall on the wrong side of the hyperplane. These obviously are errors and misclassifications, so how do we handle them?

The key to handling such situations is to use a cost function, which puts a price on the misclassified instances we accept.

Why? Because soft margin classifiers tend to be more robust than maximal margin classifiers. If I were to take a new instance which is orange and put it on the green side of the hyperplane in Fig 7, the maximal margin classifier would poop itself: the maximal margin has to change, which means my hyperplane will change. In other words, it is very sensitive to new data, and this leads to a risk of overfitting. On the other hand, soft margin classifiers are generally more robust, use cost functions to predict, and make better classifiers.
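In scikit-learn this cost shows up as the C parameter, so a rough sketch of soft versus hard margins on the earlier toy data might look like this (the exact values of C are arbitrary choices of mine):

```python
from sklearn.svm import SVC

# The C parameter is the cost of misclassification:
# small C -> soft margin (some errors tolerated, wider margin, more robust),
# large C -> hard margin (few errors, narrower margin, more sensitive).
soft = SVC(kernel="linear", C=0.1).fit(X, y)
hard = SVC(kernel="linear", C=1e6).fit(X, y)

# A softer margin typically leans on more support vectors, so no single
# point can move the boundary very much.
print("support vectors (soft):", len(soft.support_vectors_))
print("support vectors (hard):", len(hard.support_vectors_))
```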

The main idea is to make the data linearly separable, but what if it looks like there is no way in the world we can make the data linearly separable? For this, we use kernel functions.

Kernel Functions

Fig 9. kernel functions

As you can see in Fig 9, something that seems like we cannot do anything about can be separated with kernel functions. And support vector machines use kernel functions to transform the data. But what are they actually doing?

The kernel function enlarges our feature space, as if projecting the data into a higher-dimensional space where it becomes linearly separable, and you have several kernel functions to choose from:

  1. linear function
  2. polynomial function
  3. radial basis function
  4. sigmoid function

You can do a quick Google search to understand what each of these functions does in detail.
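If you’d rather see it than search for it, here is a small sketch (again assuming scikit-learn, with a made-up ring-shaped dataset) showing a linear kernel failing where an RBF kernel succeeds:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# One class forms a ring around the other: no straight line can separate this.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# The RBF kernel implicitly maps the data into a higher-dimensional space
# where it becomes linearly separable; the linear kernel has no such luck.
print("linear kernel accuracy:", SVC(kernel="linear").fit(X, y).score(X, y))
print("rbf kernel accuracy:   ", SVC(kernel="rbf").fit(X, y).score(X, y))
```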

And as simple as it sounds, that's a support vector machine for you, a bunch of lines, kernel functions, and margins.

