Demystifying Support Vector Machine from Scratch

Find the “best margin” and reduce the “risk of error”.

Rishikesh Sivakumar
SRM MIC
11 min read · Mar 8, 2021


Decision boundaries formed by an SVM

A Brief Insight into Supervised Learning

  1. We all know that supervised learning is the type of machine learning where we work with a labeled dataset.
  2. We further divide this dataset into train and test datasets.
  3. We use the train data to train our machine learning model, test it on the test data, and accordingly calculate the accuracy of our model.
  4. If our model has fairly good accuracy, we can use it to obtain results for a new set of data (a minimal sketch of this workflow follows below).
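
A minimal sketch of that workflow, assuming scikit-learn and its built-in Iris dataset purely as illustrative stand-ins (they are not part of this article's example):

```python
# Minimal train/test workflow sketch (illustrative dataset and default parameters).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# 1. A labeled dataset.
X, y = load_iris(return_X_y=True)

# 2. Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train the model on the train set, then measure accuracy on the test set.
model = SVC()  # an SVM classifier, previewing the rest of the article
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 4. If the accuracy is acceptable, use the trained model on new data:
# model.predict(new_examples)
```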

There are various supervised learning algorithms, and each works best for a particular type or pattern of data. It is therefore important to choose the model that predicts or classifies most accurately for the dataset at hand.

One of the most popular supervised learning algorithms is the Support Vector Machine.

Introduction to SVM

Support Vector Machine is a supervised learning algorithm that is especially useful for datasets with many features, i.e. high-dimensional data. It remains accurate even when the data is complex and not linearly separable, provided a suitable kernel is used. After training, the algorithm produces a hyperplane that separates the classes, so that each class of data lies on its own side of the hyperplane.

For example, if we have 3 different classes, SVM handles the problem with a combination of binary separations (such as one-vs-rest), where each hyperplane keeps one class on one side and the remaining classes on the other.

Hyperplane

Hyperplanes refer to the decision boundaries that are created by SVM models for classifying the data points.

A hyperplane in a dataset consisting of 2 features is just a 1-D line separating the data points, each type on either side of the line. On the other hand, in a dataset consisting of 3 features, the hyperplane will be a 2-D plane which is separating the data points. Look at the example below:

hyperplane

It is difficult to visualize hyperplanes for datasets having more than 3 features, but SVM similarly creates hyperplanes to classify the data points.

Working of SVM Algorithm

  • If we have a labeled training dataset with n features, SVM plots these examples in n-dimensional space.
  • While plotting, each feature corresponds to one axis, so n features means n axes.
  • Then, the SVM model creates a hyperplane that separates the different classes according to the kernel provided (a minimal sketch follows this list).
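
As a hedged sketch of these steps (the 2-feature toy data and parameters below are assumptions for illustration, not from the article), a linear-kernel SVM can be fitted and its hyperplane read off like this:

```python
# Fit a linear-kernel SVM on 2-feature data and inspect the learned hyperplane.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable clusters with n = 2 features, i.e. 2 axes.
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)

clf = SVC(kernel="linear")  # the "kernel provided" in the last step
clf.fit(X, y)

# The separating hyperplane is w·x + b = 0.
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
```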

SVM is also called “Large Margin Classifier”.

Let me explain why with an example.

Large Margin Classifier

Let us take a simple dataset that has two different types of examples based on 2 different features, x1 and x2.

dataset

Here, all the Xs are one type of examples and all the Os are another type of examples. We could use any classification algorithm to create a classification model which could divide these two types of examples.

Our model could divide this dataset in this way:

bad decision boundary

At first glance, this might look like the ideal classification: all the Xs are on one side of the hyperplane and all the Os are on the other, so the model is 100% accurate on the training dataset, and there is a chance it might also do reasonably well on the test dataset. But will this boundary work well on a genuinely new set of examples? No.

A machine learning model is only good or accurate if it keeps classifying accurately when we give new, unseen data to the trained model.

Why does this model not produce the best results with a new dataset?

This decision boundary is actually overfitting the dataset. The algorithm fits the training dataset very closely but does not produce good results on the test dataset or on new data.

Large Margin Classifier (continuation)

Therefore, a more regularized decision boundary, one that generalizes better beyond the training dataset, would be this prediction line:

the best-fit decision boundary

Here, one X ends up on the O side, so the prediction on the training dataset is not quite as accurate as with the previous prediction line. Even so, this model fits better overall and works better on the test dataset and on new data.

This is the kind of classification our SVM, i.e. the Large Margin Classifier, performs. The primary aim of the classifier is to place the prediction hyperplane so that the distance from the hyperplane to the closest example of each class is as large as possible (as shown in the figure below).

large-margin classifier

If we take this hyperplane formed, then the first blue line represents the nearest O from the decision boundary and the second blue line represents the nearest X from the decision boundary. The red arrow indicates the distance between the closest example of each type from the decision boundary.

Here, we can see that the distance between the closest examples and the decision boundary is pretty large, hence this would be the hyperplane formed by the SVM, hence the name Large Margin Classifier.

On the other hand,

bad decision boundary

In this case, we don’t even have to draw the other blue lines as the nearest O and the nearest X both are touching the decision boundary formed by the model.

Therefore, there is no margin between the nearest examples of each class and the hyperplane formed here. Hence, this is not the best decision boundary, and our SVM would form the first hyperplane discussed rather than this one.

The main aim of our SVM model is to keep this margin as large as possible on both sides of the decision boundary. Maximizing the margin gives the model a safety buffer, so new data points can be classified more reliably. Why the classifier insists on a large margin becomes clear from the mathematics behind it.
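
For reference, a hedged sketch of the standard optimization problem behind this intuition (not shown explicitly in this article), written in the y ∈ {1, 0} convention used later in this post:

\min_{\theta}\ \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 \quad \text{subject to} \quad \theta^T x^{(i)} \ge 1 \ \text{if } y^{(i)} = 1, \qquad \theta^T x^{(i)} \le -1 \ \text{if } y^{(i)} = 0

The width of the margin is inversely proportional to the norm of θ, so making θ small while respecting these constraints is the same as making the margin large.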

Now, let’s start with support vectors.

Support Vectors

We learned the objective of our SVM algorithm is to maximize the margin for obtaining the best fitting model. This margin is maximized using support vectors.

Support vectors are the data points that lie closest to the hyperplane. These vectors determine the position and orientation of the hyperplane.

If we delete a few of these support vectors, the whole orientation of the hyperplane changes and we’ll end up with a different hyperplane which is formed with respect to the new support vectors, i.e. the next closest data points.
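
A hedged sketch of how this can be seen in practice with scikit-learn (the toy data below is an illustrative assumption):

```python
# Inspect the support vectors of a trained SVM.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=6)

clf = SVC(kernel="linear", C=1000)  # large C -> little regularization
clf.fit(X, y)

# Only these closest points determine the hyperplane's position and orientation.
print(clf.support_vectors_)  # coordinates of the support vectors
print(clf.n_support_)        # number of support vectors per class
```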

The Mathematics behind SVM

Hinge Loss

The loss function measures how well a model's predictions match the true values. Intuitively, if the predicted values deviate a lot from the true values, the loss takes a very high value, and vice versa.

The Loss function of SVM is similar to that of Logistic regression. The hypothesis of SVM (linear kernel) is as follows:

hypothesis equation
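
Assuming the Andrew Ng-style notation that the rest of this article follows, the hypothesis is typically written as:

h_\theta(x) = \begin{cases} 1 & \text{if } \theta^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}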

Let us now have a look at the hinge loss plot with respect to the hypothesis.

loss function (y = 1)
loss function (y = 0)

We can see that it’s quite similar to the Logistic regression loss function but the slope is sharp and it turns out to become 0 at points 1 and -1.

Coming to the hinge loss:

When y = 1:

  • If θᵀx ≥ 1, the cost is 0.
  • If θᵀx < 1, the cost increases as θᵀx decreases.

When y = 0:

  • If θᵀx ≤ -1, the cost is 0.
  • If θᵀx > -1, the cost increases as θᵀx increases.
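
Putting the two cases together, a common way to write the two pieces of the hinge loss (a hedged reconstruction, since the plots above are not reproduced) is:

\text{cost}_1(\theta^T x) = \max\left(0,\ 1 - \theta^T x\right), \qquad \text{cost}_0(\theta^T x) = \max\left(0,\ 1 + \theta^T x\right)

cost₁ is applied to positive examples (y = 1) and cost₀ to negative examples (y = 0), matching the two cases above.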

There might be some confusion here, since we already set the hypothesis to predict 1 whenever θᵀx >= 0. The hinge loss deliberately demands more than that: it keeps penalizing a positive example until θᵀx ≥ 1 (and a negative example until θᵀx ≤ -1). This extra safety margin is what the support vectors enforce for better precision, and data points lying between the decision boundary and the margin satisfy 0 < θᵀx < 1.

Cost function

Now, using this hypothesis equation and keeping the hinge loss in mind, we can write down the cost function that SVM minimizes for classification.

cost function

where m is the number of training examples.
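
In the notation used so far, the cost function is commonly written as follows (a hedged reconstruction, since the original image is not shown here):

J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)}\,\text{cost}_1\!\left(\theta^T x^{(i)}\right) + \left(1 - y^{(i)}\right)\text{cost}_0\!\left(\theta^T x^{(i)}\right)\right]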

Sometimes our model tends to overfit. The best way to solve this issue is regularization.

Regularization is basically adding a penalty term on the parameters to the cost function. When the cost function is minimized to find the parameters producing minimum error, this penalty keeps the decision boundary less complex, so the model fits better even for new sets of examples.

Therefore, the regularized cost function is given by:

regularized cost function

where m is the number of examples and n is the number of features.
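
As another hedged reconstruction in the same notation, the regularized objective is usually written as:

\min_{\theta}\ C\sum_{i=1}^{m}\left[\, y^{(i)}\,\text{cost}_1\!\left(\theta^T x^{(i)}\right) + \left(1 - y^{(i)}\right)\text{cost}_0\!\left(\theta^T x^{(i)}\right)\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2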

C is the regularization factor. Because C multiplies the data-fit (hinge-loss) term rather than the penalty term, a very large C is similar to having no regularization. When the decision boundary is too complex, C is made small, so that the regularization term dominates and the boundary becomes simpler.

Equation of hyperplane

The equation of the hyperplane created by the SVM algorithm is given by:

equation of hyperplane

Here, b is a constant (the bias term), x is the feature vector of an example in the dataset, and w is the weight vector we calculate, i.e. the vector of parameter values learned so that the resulting hyperplane classifies our dataset with maximum accuracy.

We can see that this hyperplane is similar to the equation of a straight line,

line equation
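
A hedged reconstruction of these two equations: the hyperplane is

w^T x + b = 0

and with just two features x₁ and x₂ this expands to w₁x₁ + w₂x₂ + b = 0, which rearranges to x₂ = -(w₁/w₂)x₁ - b/w₂, i.e. the familiar straight line y = mx + c.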

Let me explain in a little more detail. Let’s take a very simple example.

simple dataset

  • In this example, we have 3 (-) examples and 2 (+) examples. Using these, we have trained our SVM model to classify data points into the (-) and (+) classes.

NOTE:

A real dataset would typically contain far more examples; this is just a simplified illustration.

In this example, the green line is the decision boundary predicted by our SVM model and the yellow region is the margin on both sides of the decision boundary formed with the help of support vectors.

Now let us take the vector perpendicular to our decision boundary (w).

Suppose we have a new example ‘u’ which we want to classify as (+) or (-).

projection of example vector on theta vector

The SVM algorithm projects this new data point vector, the u vector as seen in the above graph, onto the perpendicular vector of the decision boundary, i.e. the w vector.

From basic vector algebra, projecting one vector onto another amounts to taking the dot product of the two vectors. This dot product is then compared with a particular constant, which is essentially where the decision boundary lies.

If the projection is greater than that constant, the example belongs to one class; otherwise it belongs to the other.

In our example, if the dot product is less than the constant, the point lies on the left half of the decision boundary and is therefore classified as (-).

If the dot product is greater than the constant, the point lies on the right side of the decision boundary and is classified as (+).

Therefore, we get this equation:

final equation of hyperplane

Here, y = 1 for all values on or above the threshold set by the decision boundary, and y = 0 for all values below it (the superscript (i) refers to the i-th example).
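
Written out (a hedged reconstruction consistent with the description above), the decision rule is:

w \cdot u^{(i)} + b \ge 0 \ \Rightarrow\ y^{(i)} = 1, \qquad w \cdot u^{(i)} + b < 0 \ \Rightarrow\ y^{(i)} = 0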

This projection view is the key to understanding why SVM is a large margin classifier.

When we project an example vector on the perpendicular vector, the formula is given by:

projection of example on θ vector

where p(i) is the projection of the example vector on the perpendicular vector.
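
The formula is commonly written as the decomposition of the dot product (a hedged reconstruction):

\theta^T x^{(i)} = p^{(i)} \cdot \lVert \theta \rVert

So the margin conditions θᵀx(i) ≥ 1 and θᵀx(i) ≤ -1 become p(i) · ‖θ‖ ≥ 1 and p(i) · ‖θ‖ ≤ -1, which is exactly what the next example uses.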

Let us now consider a situation where the projection is small.

A case of a small projection

So, in this case, p(1) is the projection of the nearest X-type example onto the θ vector and p(2) is the projection of the nearest O-type example. From the margin conditions, we need p(1) · ‖θ‖ ≥ 1 when y = 1 (the X type) and p(2) · ‖θ‖ ≤ -1 when y = 0 (the O type).

In such a situation, when p(1) and |p(2)| are both small, these conditions can only be satisfied if the norm ‖θ‖ is large.

However, the regularized objective pushes the model to keep ‖θ‖ as small as possible. Small projections therefore conflict with that objective, so the optimizer prefers a decision boundary whose nearest examples have large projections, i.e. a large margin.

This is why our SVM algorithm is a Large Margin Classifier.

Finally, let us take a look at the kernels of SVM and why we use kernels in the SVM algorithm.

Different Kernels in SVM

  1. Linear Kernel
  2. Gaussian Kernel
  3. Polynomial Kernel
  4. Radial basis Kernel
  5. Sigmoid Kernel, and so on.
kernels

SVM uses a set of mathematical functions known as kernels. For different types of datasets, different kernels work better.

In the example discussed earlier, a linear decision boundary line was enough for classification. Therefore, a linear kernel works better.

Let's say we need a non-linear decision boundary; for example, suppose we have a dataset like this:

use of kernels

We would need a kernel other than the linear kernel, because a straight line would not be the best fit here. In this case, we could use the Gaussian kernel instead, which would create a decision boundary like this:

non-linear decision boundary

Similarly, according to the dataset provided, we could change the kernel to obtain the best fitting hyperplane.
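
A hedged sketch of switching kernels in practice, using scikit-learn's make_circles as a stand-in for a non-linearly separable dataset (both the dataset and parameters are illustrative assumptions):

```python
# Compare a linear kernel with a Gaussian (RBF) kernel on ring-shaped data.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear")
rbf_svm = SVC(kernel="rbf", gamma="scale")  # Gaussian / RBF kernel

print("linear kernel accuracy:", cross_val_score(linear_svm, X, y, cv=5).mean())
print("rbf kernel accuracy:   ", cross_val_score(rbf_svm, X, y, cv=5).mean())
```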

A few important formulas when we use kernels in SVM (x and z denote two example vectors):

  • Linear kernel: K(x, z) = xᵀz
  • Polynomial kernel: K(x, z) = (xᵀz + c)ᵈ
  • Gaussian kernel: K(x, z) = exp(-‖x - z‖² / (2σ²))
  • RBF kernel: K(x, z) = exp(-γ‖x - z‖²), i.e. the Gaussian kernel with γ = 1/(2σ²)
  • Sigmoid kernel: K(x, z) = tanh(α xᵀz + c)
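
As a small illustration of the Gaussian kernel formula above, here is a hedged numpy sketch (sigma is a free parameter chosen for illustration):

```python
# Evaluate the Gaussian (RBF) kernel between two example vectors.
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))"""
    return np.exp(-np.linalg.norm(x - z) ** 2 / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 0.5])
print(gaussian_kernel(x, z))  # near 1 when x and z are close, near 0 when far apart
```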

Conclusion

SVM is one of the most widely used classification algorithms for high-dimensional datasets. Multi-class classification can also be done easily with the SVM algorithm. Choosing an appropriate kernel for the dataset lets the algorithm classify with higher accuracy. Since SVM solves a convex optimization problem, it will always find the global minimum. It is therefore well worth understanding how SVM works and trying it out on various applications.
