
One Stop for Support Vector Machine

Priyansh Soni
Feb 25, 2022 · 10 min read

Support Vector Machine is one of the most popular Machine Learning algorithms, and it is parametric and linear. Despite being so popular, the working of this algorithm is quite simple. Well, simple in terms of the intuition behind it, but the math is no ordinary giant.

This informative article combines valuable insights from various resources available on the web, which I went through to understand SVM well enough that I can now explain it to a 5-year-old. confidence++

Disclaimer: This article sticks to a strictly theoretical explanation of SVM so that it is easy for everyone to follow.

1. What is SVM?

SVM stands for Support Vector Machine and is a supervised Machine Learning algorithm that is used for both Classification and Regression.

The algorithm aims to find a hyperplane in the n-dimensional feature space such that the boundary distinctly classifies the datapoints.

In layman's terms,

The objective of the algorithm is to find a decision boundary such that the boundary separates the classes distinctly. Look at the image below for a better understanding.

Linear separation between classes (2D)

The blue line in the above picture can separate the two classes distinctly, and this is what SVM aims to do — try to come up with a line/hyperplane, which can separate the classes.
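
To make this concrete, here is a minimal scikit-learn sketch that fits a linear SVM on a tiny, made-up 2-D dataset (the points and labels are assumptions purely for illustration):

import numpy as np
from sklearn import svm

# Tiny, made-up 2-D dataset: two clumps of points (illustrative only)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])   # class labels

# An SVM with a linear kernel searches for the separating line
clf = svm.SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[3, 2], [7, 6]]))    # classify two new points
print(clf.coef_[0], clf.intercept_[0])  # the learned line: w . x + b = 0

The learned coefficients define exactly the kind of blue line in the picture above: points on one side of w · x + b = 0 are predicted as one class, points on the other side as the other.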

2. What is a Hyperplane?

In simple terms, a Hyperplane is a decision boundary that separates classes.

It is called a hyperplane because its form depends on the dimensionality of the feature space. In 1D it is a point, in 2D it is a straight line separating the classes (as seen in the image above), and in 3D it is a plane. In higher dimensions we can no longer picture it, and hence, to generalise, we call it a hyperplane.

Most common images available on the internet

So basically, it’s just a plane.
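
In equation form, a hyperplane in an n-dimensional feature space is simply the set of points x that satisfy one linear equation, where w is the weight (normal) vector and b the offset, and these are exactly what SVM learns:

w · x + b = 0

With n = 1 this equation picks out a single point, with n = 2 a straight line, with n = 3 a plane, and for larger n the hyperplane we can no longer draw.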

3. Optimal Margin Hyperplane

The hyperplane with the maximum margin is considered the optimal margin hyperplane for SVM.
A hyperplane that has the largest margin between itself and the nearest points of each class is called the maximum margin hyperplane.
Look at the image below:

All the lines (a to f) separate the classes, red and blue, perfectly.
But for optimal algorithmic working, we need to choose the line that best separates the points and also generalises well to new test points.

If we look closely, the line ‘f’ is the optimal line that we would want to choose. This is because it separates the points perfectly and has the maximum marginal distance from both classes.

By maximum marginal distance, I mean this:

Maximum marginal distance between points and hyperplane
  • We define a margin to the hyperplane as shown by the dotted lines in the image above.
  • The distance is measured between the points that lie on the margin and the hyperplane.
  • These points, that lie on the margins, are called Support Vectors, hence the name SVM.

Logical aspect: the distance of the support vectors from the hyperplane should be as large as possible so that the classes are differentiated better. The greater the distance, the more confidently the classes can be distinguished.
For example, if the marginal distance is very large, it indicates that the classes are far away from each other and hence easily separable. If the distance is very small, it indicates that the classes are barely separated and lie very close to each other. This can be seen in the image below:

Therefore, for an optimal algorithm, we would want a model that is easily able to distinguish between the classes. And, this is only possible if the marginal distance is maximum, which results in the maximum margin hyperplane.
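
Continuing the toy example from Section 1 (the data is still made up), a fitted linear SVM in scikit-learn exposes these margin points directly through its support_vectors_ attribute, and the width of the margin can be read off the learned weights:

import numpy as np
from sklearn import svm

X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])  # toy data (assumed)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = svm.SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)                    # the points lying on the margins
w = clf.coef_[0]
print("margin width:", 2 / np.linalg.norm(w))  # distance between the two dotted margins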

4. Algorithmic working of SVM

Let’s consider a binary classification for now.

ALGO:

  • Identify the type of data — linear/non-linear.
  • Based on the data that we have, identify the correct hyperplane that can separate the classes.
  • Choose the points nearest to the hyperplane. These are now the support vectors.
  • Based on the distance between the points and the hyperplane, we adjust our hyperplane to achieve the maximum margin.
  • Once we have a maximum margin hyperplane, we are done.
    This is our SVM model

The above points explain the working of an SVM model. While implementing the algorithm, the above steps are considered and formulated in code.

5. Properties of SVM

An optimal SVM model diagram
  1. The region above the hyperplane is called the Positive Hyperplane and the region below it is called the Negative Hyperplane.
  2. The data points that are closer to the hyperplane and influence the position and orientation of the hyperplane are called Support Vectors.
  3. The distance of the support vectors from the hyperplane is called the Margin.
  4. The marginal distance from the hyperplane to the points is represented in terms of vectors.
  5. We aim to maximize this marginal distance for the model to predict better.
  6. For ease of computation, the margins are fixed at a 1-unit (functional) distance from the boundary. Hence the positive margin lies at +1 unit from the boundary and the negative margin at -1 unit (see the equations below).
    This is useful for evaluating the loss function of SVM.
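
With the functional margin fixed at one unit (point 6 above), the two margins and the condition for a training point to sit on the correct side, outside the margin, can be written as (using labels yᵢ ∈ {−1, +1}):

w · x + b = +1   (positive margin)
w · x + b = −1   (negative margin)
yᵢ · (w · xᵢ + b) ≥ 1   for every point (xᵢ, yᵢ) classified correctly and outside the margin

These inequalities are exactly what the Hinge Loss in the following sections measures violations of.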

6. Misclassification in SVM

If a point is predicted as class A but was class B, then this is a misclassification and hence should be penalized. Let’s look at the image below:

  • The point x1 is misclassified since it belongs to class B but is predicted on the wrong side of the boundary, in class A. Similarly, the point x2 is misclassified, since it is predicted on the wrong side of the boundary, in class B.
  • Now, if point x1 is predicted farther on the wrong side of the boundary, then the misclassification loss should be higher. Similarly, if point x2 is predicted closer to the red points, then the loss should be higher.
  • And for a green point, if it lies farther on the green side, i.e. farther inside class A, then the loss should not matter, since it is correctly predicted no matter how far it is on the right side of the boundary. The same holds for a red point lying far inside class B.

Hence, for a point predicted on the wrong side of the boundary, the misclassification loss should grow, and for a point predicted on the correct side of the boundary (no matter how far), the loss should be negligible (0).

We can formulate the above logic and define a loss for the SVM model. This loss is defined as the Hinge Loss.

7. Loss function of SVM

Hinge loss is a function that penalizes the model when it makes an incorrect prediction (misclassification). Let’s look at the graph of Hinge Loss first:

Hinge loss for SVM

In the above diagram, the x-axis represents how far on the correct side of the boundary a point is predicted (correct predictions to the right, incorrect ones to the left) and the y-axis represents the loss incurred.

Based on this, correct predictions fall on the right side of the graph and incorrect predictions fall on the left side. The dotted lines in the image above represent the marginal boundaries of the separating hyperplane. Therefore, correct predictions (to the right of the margin) account for almost 0 loss, while incorrect predictions account for a loss > 0. Predicted points that fall exactly on the boundary (x = 0) are considered neither correctly nor incorrectly classified.

  • If the distance of the predicted point (ŷ) from the boundary is 1 unit on the correct side, then the loss incurred is 0. This is because the point lies exactly on the margin and is correctly classified; any farther on the correct side, the loss stays 0.
  • If the distance of the point from the boundary is 0 units, the point lies on the hyperplane/boundary itself, and the loss incurred is 1, since the point belongs to neither class A nor class B.
  • As the predicted point moves farther towards the wrong side of the boundary, the loss keeps increasing.

The above statements can be considered as:
If the point is on the right side of the hyperplane/decision boundary, then the loss is minimal, no matter how far the point is on the right side of the boundary (towards the blue region in the above graph)
And if the point is on the wrong side of the boundary, then the loss increases as the point moves away from the boundary (towards the red region in the above graph)

So, the above graph can be formulated into a single expression. For a true label y ∈ {−1, +1} and a raw model output ŷ (the signed score w · x + b), the formula for Hinge loss is:

L(y, ŷ) = max(0, 1 − y · ŷ)

  1. When y and ŷ have the same sign (and y · ŷ ≥ 1), the loss is 0, since the point lies on the correct side, on or beyond the margin.
  2. When y and ŷ have opposite signs, the loss grows linearly, since the point lies on the wrong side of the boundary.
  3. When ŷ is 0, the loss is 1, since the point lies exactly on the decision boundary (a small numeric check follows below).
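
The same three cases can be verified numerically with a small sketch (the scores below are made-up numbers purely for illustration):

import numpy as np

def hinge_loss(y_true, score):
    # Hinge loss for a label in {-1, +1} and a raw score = w . x + b
    return np.maximum(0.0, 1.0 - y_true * score)

print(hinge_loss(+1,  2.5))   # correct side, beyond the margin -> 0.0
print(hinge_loss(+1,  0.0))   # exactly on the boundary         -> 1.0
print(hinge_loss(+1, -1.5))   # wrong side of the boundary      -> 2.5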

8. Regularisation in SVM

Now, to understand the concept of regularisation, we need to know the equations of the margin lines through the support vectors and the math behind the calculations. Well, the math is easy to grasp and operates in vector space. And, to keep this article as uncluttered as possible, I’ve collected the math of the algorithm in the article below:

An optimal margin hyperplane is one where the distance of the margins from the boundary is maximum. Considering this, a line drawn perpendicular to the hyperplane, from one margin to the other, represents the width of the margin: the longer this perpendicular segment, the wider the margin for that hyperplane.

Regardless of the math, let’s assume that this orthogonal direction, which links the support vectors and the hyperplane, is represented by a vector w (the normal vector of the hyperplane). With the margins fixed at ±1 units (see the Properties section above), the width of the margin turns out to be inversely proportional to the magnitude of this vector, |w|. Therefore, we have to maximize the term below:

2 / |w|

Now, to maximize the above expression, we can simply minimize the denominator. Hence, to achieve a maximum margin hyperplane, we need to minimize the magnitude of the vector, |w|.

Therefore, the regularisation term becomes:

minimize |w|

Since the modulus is hard to differentiate, we square this term (and conventionally halve it) so that it is easily differentiable. This term, (1/2) · |w|², becomes the L2 regularisation for SVM.

When calculated over all the features, the regularisation can be written as:

(1/2) · Σⱼ wⱼ²,  for j = 1 … n

where n is the number of features.

9. Cost function of SVM

After adding the regularisation in SVM and computing the loss function over the entire dataset, the cost function for SVM can be defined as:

J(w, b) = (1/2) · Σⱼ wⱼ² + C · Σᵢ max(0, 1 − yᵢ · ŷᵢ),  where ŷᵢ = w · xᵢ + b

Here, the first term is the regularisation and the second term is the loss function (Hinge Loss) computed over the entire training set.
The first term is summed over the number of features (n), while the second one is summed over the number of training samples (m).
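
To make the objective concrete, here is a minimal from-scratch sketch that minimises this cost with plain (sub)gradient descent on a made-up toy dataset. The data, learning rate and epoch count are assumptions for illustration only; a real implementation (for example scikit-learn’s SVC, or the SMO algorithm) is considerably more refined:

import numpy as np

# Toy, linearly separable data with labels in {-1, +1} (assumed for illustration)
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

C, lr, epochs = 1.0, 0.01, 1000
w = np.zeros(X.shape[1])
b = 0.0

for _ in range(epochs):
    margins = y * (X @ w + b)
    mask = margins < 1                  # points inside the margin or misclassified
    # Subgradient of  (1/2)|w|^2  +  C * sum( max(0, 1 - y(w.x + b)) )
    grad_w = w - C * (y[mask][:, None] * X[mask]).sum(axis=0)
    grad_b = -C * y[mask].sum()
    w -= lr * grad_w
    b -= lr * grad_b

print("w:", w, "b:", b)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))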

10. Regularisation Parameter — C

C is called the regularisation parameter and is attached in front of the Hinge Loss to control the penalty for misclassification.

In SVM, we apply the regularisation parameter (C) in front of the Loss Function. This parameter behaves inversely to the usual regularisation parameter (λ), which is typically applied in front of the regularisation term; roughly speaking, C acts like 1/λ.

  • When C increases, the model tends to overfit the data
    This is because, when C increases, the model penalizes misclassification more heavily and hence the bias (training error) becomes much lower. This leads to strictly accurate predictions on the training data, which thereby leads to overfitting (high variance).
  • When C decreases, the model tends to underfit the data
    This is because, when C decreases, the model tolerates misclassification and barely penalizes it. This leads to an increase in bias and a decrease in variance (the model’s variability on new data). Hence the model loses information and therefore underfits. A quick experiment below illustrates this.
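
One way to see this trade-off (a hypothetical experiment on made-up, overlapping data) is to fit the same linear SVM with a very small and a very large C and compare how many support vectors each model keeps and how hard it tries to fit the training set:

import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

# Two overlapping clusters, so some misclassification is unavoidable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=0)

for C in (0.01, 100.0):
    clf = svm.SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(clf.support_vectors_)} support vectors, "
          f"training accuracy = {clf.score(X, y):.2f}")

Typically, a small C keeps a wide margin and many support vectors (tolerating mistakes), while a large C shrinks the margin and chases training accuracy, which is exactly the overfitting behaviour described above.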
