Support Vector Machines (SVM)
Support Vector Machines (SVM) is one of the most powerful out-of-the-box supervised machine learning algorithms for classification problems. Unlike many other machine learning algorithms, such as neural networks, SVM does not require much tweaking to obtain good results.
Key Terms
- Support Vectors: The data points closest to the optimal (maximum) hyperplane. These points determine the Positive and Negative hyperplanes that we use for classification.
- Margin: The distance between the support vectors and the optimal hyperplane. The maximized margin is the distance between the two hyperplanes passing through the support vectors.
- Hyperplanes: Decision boundaries that classify data points into their respective classes in a multi-dimensional space.
- Optimal or Maximum Hyperplane: The hyperplane which maximizes the margin of the training data.
In 2D, for a single data point $(x_1, y_1)$, it takes the form of a line: $y_1 = a x_1 + b$.
In 2D, for a set of data points, it takes the vector form $w \cdot x + b = 0$.
- Positive and Negative Hyperplanes: Hyperplanes parallel to the optimal hyperplane and passing through the support vectors.
Before we try to understand SVM, let's refresh a few concepts of vectors.
Vectors
A vector is anything that has both a magnitude and a direction.
Length of a Vector
The length of a vector u is called its norm, written as $\|u\|$. For $u = (u_1, u_2)$ it is
$\|u\| = \sqrt{u_1^2 + u_2^2}$
Direction of a Vector
The direction of a vector u is the unit vector w, defined as
$w = \left( \frac{u_1}{\|u\|}, \ \frac{u_2}{\|u\|} \right)$
The sum of two Vectors
Given two vectors $u = (u_1, u_2)$ and $v = (v_1, v_2)$, the sum of the vectors is given by
$u + v = (u_1 + v_1, \ u_2 + v_2)$
Dot Product
The dot product is the product of the magnitudes of the two vectors and the cosine of the angle between them:
$u \cdot v = \|u\| \|v\| \cos(\theta)$
The dot product tells us what amount of one vector goes in the direction of another, or
$u \cdot v = u_1 v_1 + u_2 v_2$
In generic form, for n-dimensional vectors, the dot product can be represented as
$u \cdot v = \sum_{i=1}^{n} u_i v_i$
We will go into more detail around it later in the article.
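To make these operations concrete, here is a quick NumPy sketch; the vectors u and v are made-up examples:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

norm_u = np.linalg.norm(u)    # ||u|| = sqrt(3^2 + 4^2) = 5.0
direction_u = u / norm_u      # unit vector in the direction of u: (0.6, 0.8)
vector_sum = u + v            # (3+1, 4+2) = (4.0, 6.0)
dot = np.dot(u, v)            # 3*1 + 4*2 = 11.0

# The dot product also equals ||u|| ||v|| cos(theta)
cos_theta = dot / (norm_u * np.linalg.norm(v))
print(norm_u, direction_u, vector_sum, dot, cos_theta)
```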
Concepts
There are mainly two concepts behind SVM: linear and non-linear separability.
Linear Separability
Where data can be separated linearly: by a point in 1D, a line in 2D, a plane in 3D, or a hyperplane in higher dimensions.
Linear separability is one important concept in SVM. Although in practical cases the data might not be linearly separable, we will start from the linearly separable cases (since they are easy to understand and deal with) and then derive the non-linearly separable cases.
Non-Linear Separability
Where data cannot be separated linearly by any point, line, plane, or hyperplane.
Hyperplanes
Hyperplanes can be considered decision boundaries that classify data points into their respective classes in a multi-dimensional space. Data points falling on either side of the hyperplane can be attributed to different classes.
A hyperplane is a generalization of a plane:
- In two dimensions, it’s a line.
- In three dimensions, it’s a plane.
- In more dimensions, you can call it a hyperplane.
Understanding the equation of a Hyperplane
Two Dimensions
In two dimensions, data can be separated by a line, and we already know the equation of a line:
$y = ax + b$
For any given data point $(x_1, y_1)$ on the line, the equation will be
$y_1 = a x_1 + b$
or,
$a x_1 - y_1 + b = 0$
In matrix form we can represent it as
$\begin{pmatrix} -b & -a & 1 \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ y_1 \end{pmatrix} = 0$
Higher Dimensions
When reading about hyperplanes, we will often find that the equation of a hyperplane for more than two dimensions is defined as
$w^T x = 0$
The reason a hyperplane in more than two-dimensional space is represented this way is that it is difficult to write an equation out coordinate by coordinate in higher dimensions.
We will see below how these two hyperplane equations relate.
How do these two forms relate?
In our hyperplane equation, w and x are vectors, and $w^T x$ is how we compute the inner product of two vectors; if we recall, the inner product is just another name for the dot product.
Note that
$w^T x = 0$
is the same thing as
$w \cdot x = 0$
Given the two vectors
$w = (-b, -a, 1)$ and $x = (1, x_1, y_1)$
we get
$w \cdot x = -b \times 1 + (-a) \times x_1 + 1 \times y_1 = y_1 - a x_1 - b$
so $w \cdot x = 0$ recovers our original line equation $y_1 = a x_1 + b$.
So, the two equations are just different ways of expressing the same thing; one is simply better than the other at representing higher dimensions.
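Here is a small NumPy sketch of this equivalence; the values of a and b are arbitrary:

```python
import numpy as np

a, b = 2.0, 1.0                # the line y = 2x + 1
x1, y1 = 3.0, 7.0              # a point on that line

# Form 1: w . x + b = 0 with w = (a, -1) and x = (x1, y1)
w = np.array([a, -1.0])
x = np.array([x1, y1])
print(np.dot(w, x) + b)        # 0.0 -> the point satisfies the equation

# Form 2: augmented w^T x = 0 with w = (-b, -a, 1) and x = (1, x1, y1)
w_aug = np.array([-b, -a, 1.0])
x_aug = np.array([1.0, x1, y1])
print(np.dot(w_aug, x_aug))    # 0.0 -> same line, different notation
```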
Now that we understand some of the concepts used for building SVM let’s try to understand how SVM actually works.
How does SVM work?
SVM uses the dot product to calculate on which side of the hyperplane a data point lies, the negative or the positive one. Let's understand it with an example.
The goal of SVM is to find a hyperplane that separates all the training data into its classes. SVM evaluates candidate hyperplanes and chooses the one that leaves the maximum margin between the separated classes.
We have seen earlier what these hyperplanes look like in different dimensions.
Now, when a new data point, say X, comes in, SVM tries to identify which class it belongs to. To do so, it uses the quantities below.
Here,
X: the unclassified data point
x: the vector from the origin (0, 0) to X
w: a vector perpendicular to the Maximum Hyperplane
c: the distance of the Maximum Hyperplane from the origin, measured along w
The projection of x onto w, a.k.a. the dot product $w \cdot x$, tells us how far X lies in the direction perpendicular to the hyperplane:
If $w \cdot x = c$, the point X lies on the Maximum Hyperplane.
If $w \cdot x > c$, the point X lies on the positive side of the hyperplane.
If $w \cdot x < c$, the point X lies on the negative side of the hyperplane.
It is still possible that a data point lies between the Maximum Hyperplane and the Positive or Negative Hyperplane.
For the classification to be confident, the data point should lie on or beyond the Positive or Negative Hyperplane.
To make this possible, let's write $b = -c$, so that a point on the Maximum Hyperplane satisfies $w \cdot x + b = 0$, and let's assume the margin on each side is equal to 1. With this consideration the new equations will be:
For the Negative Hyperplane,
$w \cdot x + b \le -1$
If true, the data point lies on or beyond the Negative Hyperplane.
For the Positive Hyperplane,
$w \cdot x + b \ge +1$
If true, the data point lies on or beyond the Positive Hyperplane.
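As a minimal sketch of this decision rule, assuming we already have a trained w and b (the values below are invented for illustration):

```python
import numpy as np

# Hypothetical trained parameters of the hyperplane w . x + b = 0
w = np.array([0.5, -0.5])
b = -0.25

def classify(x):
    """Report which side of the margin a point falls on."""
    score = np.dot(w, x) + b
    if score >= 1:
        return "on or beyond the Positive Hyperplane"
    if score <= -1:
        return "on or beyond the Negative Hyperplane"
    return "inside the margin (classification is not confident)"

print(classify(np.array([4.0, 1.0])))   # score = 1.25
print(classify(np.array([1.0, 4.0])))   # score = -1.75
```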
Optimization
The purpose of optimization in SVM is to maximize the margin so that the Positive and Negative hyperplanes are clearly apart. This in turn helps SVM classify data clearly.
From the equations for the Positive and Negative Hyperplanes above we can calculate the distance between these two planes. Subtracting $w \cdot x_- + b = -1$ from $w \cdot x_+ + b = +1$ gives
$w \cdot (x_+ - x_-) = 2$
Divide both sides by $\|w\|$:
$\frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{2}{\|w\|}$
The left-hand side is the width of the margin. The objective here is to maximize this value, because the farther apart the positive and negative hyperplanes are, the better the classification.
To maximize this margin we need to update $(w, b)$ such that, for every training point $x_i$ with label $y_i \in \{-1, +1\}$,
$w \cdot x_i + b \ge +1$ when $y_i = +1$, and $w \cdot x_i + b \le -1$ when $y_i = -1$
In simplified form we can write this pair of conditions as one equation:
$y_i (w \cdot x_i + b) \ge 1$
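To see this margin in practice, here is a hedged sketch using scikit-learn on made-up, linearly separable data; a very large C approximates the hard-margin setting described above, and the margin width is recovered as 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]
print("w =", w, "b =", clf.intercept_[0])
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```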
Regularization
Inverting the optimization function (maximizing $\frac{2}{\|w\|}$ is the same as minimizing $\frac{1}{2}\|w\|^2$) gives us the regularization function. It is still subject to the condition above:
$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1$
In reality, while solving SVM problems, we will have overlaps, errors, and misclassifications. If we simply try to fit non-linear data linearly, we will end up overfitting. So, to avoid overfitting, we need to account for the errors in our regularization function.
So, our regularization function will be
$\min_{w, b} \ \frac{1}{2} \|w\|^2 + C \sum_{i} E_i$
where $E_i$ is the error for the i-th data point and C controls how heavily errors are penalized. We update $(w, b)$ within the boundary of the equation below:
$y_i (w \cdot x_i + b) \ge 1 - E_i$
Example
Here we have two misclassified data points, and the calculated errors for these data points are E1 and E2.
Let's assume every other point is classified correctly, so its error is zero.
The regularization function will then look like
$\min_{w, b} \ \frac{1}{2} \|w\|^2 + C (E_1 + E_2)$
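Here is a small NumPy sketch of this soft-margin objective; the parameters w, b, C and the data are made up, and each error E_i is computed as a hinge loss:

```python
import numpy as np

# Hypothetical parameters and data (made up for illustration)
w = np.array([1.0, -1.0])
b = 0.0
C = 1.0
X = np.array([[2.0, 0.5], [0.5, 2.0], [1.0, 0.9]])
y = np.array([1, -1, -1])    # the third point ends up misclassified

# Error E_i = max(0, 1 - y_i * (w . x_i + b)), zero for well-classified points
scores = X @ w + b
E = np.maximum(0, 1 - y * scores)

objective = 0.5 * np.dot(w, w) + C * np.sum(E)
print("errors E_i:", E)                    # [0.0, 0.0, 1.1]
print("regularized objective:", objective)
```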
Lagrange Formulation
The Lagrange formulation is used in multivariable calculus to minimize or maximize a function subject to constraints. When we have an optimization problem under some conditions or constraints, we use Lagrange; it gives us a new optimization problem that has those constraints built in.
Let's assume we have a function
$f(x, y, z)$
with a given condition or constraint
$g(x, y, z) = 0$
To solve this using Lagrange, let's introduce a new parameter λ, the Lagrangian multiplier. One way to solve this problem is to define a new function F such that
$F(x, y, z, \lambda) = f(x, y, z) - \lambda \, g(x, y, z)$
Now we have four parameters, so we need four equations to find the values for each of them. These equations come from setting the partial derivatives of F to zero:
$\frac{\partial F}{\partial x} = 0, \quad \frac{\partial F}{\partial y} = 0, \quad \frac{\partial F}{\partial z} = 0, \quad \frac{\partial F}{\partial \lambda} = 0$
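As an illustration, here is a SymPy sketch of the same recipe on a made-up two-variable problem, maximizing f = xy subject to x + y = 10; with two variables plus λ we get three equations instead of four:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam")

f = x * y                  # function to maximize (made-up example)
g = x + y - 10             # constraint g = 0

F = f - lam * g            # the new function F
equations = [sp.diff(F, v) for v in (x, y, lam)]
print(sp.solve(equations, (x, y, lam), dict=True))   # x = 5, y = 5, lam = 5
```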
Apply Lagrange
SVM uses this Lagrangian method to optimize the model. We could try to optimize without Lagrange, but as the number of dimensions increases, the performance decreases.
Here we will see how SVM regularizes the model using Lagrangian multipliers.
Regularization:
$\min_{w, b} \ \frac{1}{2} \|w\|^2 + C \sum_i E_i$
Constraint:
$y_i (w \cdot x_i + b) \ge 1 - E_i$
Here we are optimizing a quadratic function with linear constraints, which leads us to the solution of dual problems.
In optimization, the duality principle states that an optimization problem can be viewed from two perspectives: the primal problem and the dual problem. The solution to the dual problem provides a lower bound on the solution of the primal (minimization) problem.
To solve the above quadratic problem with linear constraints, we can use the method of Lagrange.
The Lagrange function is represented as
$L(x, \lambda) = f(x) - \sum_i \lambda_i \, g_i(x)$
Applying it to our regularization function, with multipliers $\alpha_i \ge 0$ for the margin constraints and $\beta_i \ge 0$ for the constraints $E_i \ge 0$, we get
$L(w, b, E, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_i E_i - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + E_i \right] - \sum_i \beta_i E_i$
This equation is the primal formulation.
To solve the primal equation, we set the following conditions:
$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial b} = 0, \quad \frac{\partial L}{\partial E_i} = 0$
Applying these derivatives we get
$w = \sum_i \alpha_i y_i x_i, \quad \sum_i \alpha_i y_i = 0, \quad C = \alpha_i + \beta_i$
Now we will apply the Karush-Kuhn-Tucker (KKT) conditions.
What the KKT (complementary slackness) condition states is that, for a given constraint $g_i \ge 0$ with Lagrangian multiplier $\alpha_i$,
if
$\alpha_i > 0$
then
$g_i = 0$
For simplicity, let's assume the error $E_i = 0$.
Applying KKT we will get
$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0$
A few other important conditions relate to the Lagrange multipliers: $\alpha_i > 0$ only for the points lying exactly on the margin, i.e. the support vectors, and $\alpha_i = 0$ for every other data point.
This is very useful because SVM now only has to compute and use the support vectors for classification, instead of checking all the other data points, of which there can be many. This is also computationally efficient.
Let's plug the equations from the derivatives into our primal function.
Plugging these values, $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$, into our primal function gives us the following formulation, also called the dual formulation:
$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
Now, SVM will maximize the above equation subject to the following conditions:
$\sum_i \alpha_i y_i = 0$
and
$0 \le \alpha_i \le C$
This is how SVM uses the Lagrangian function to optimize the model.
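As a hedged illustration, scikit-learn's SVC solves exactly this dual problem internally, so after fitting it on made-up data we can inspect the support vectors and their multipliers:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors have non-zero multipliers alpha_i
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)   # sklearn stores the product

# Recover w from w = sum_i alpha_i y_i x_i
w = clf.dual_coef_ @ clf.support_vectors_
print("w =", w, "b =", clf.intercept_)
```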
Kernel Trick
SVM often needs to transform data from a lower to a higher dimension for better separability.
There could be many new dimensions, each of them possibly involving a complicated calculation, so transforming every vector to a higher dimension can be a lot of work.
Let's understand this with a few simple examples.
2D to 3D Transformation — Single Vector
Here we have a vector u given by coordinates (x, y).
To transform this vector into 3D space, as mentioned above, we take the dot product of the vector with itself, and the result of this dot product gives us our third dimension: $u \cdot u = x^2 + y^2$.
So now we can represent our vector u in 3D space as
$u_{3D} = (x, \ y, \ x^2 + y^2)$
2D to 3D Transformation — Two Vectors
Now, let's say we have two two-dimensional vectors u and v, and we want to transform both into 3D vectors.
As we have seen above, the third dimension of each is given by the dot product of the vector with itself.
We can represent u and v in the third dimension as
$u_{3D} = (u_1, \ u_2, \ u_1^2 + u_2^2), \quad v_{3D} = (v_1, \ v_2, \ v_1^2 + v_2^2)$
As we know from the section above, SVM will now take the dot product in this new dimension, shown in the sketch below, to calculate on which side of the hyperplane, negative or positive, a data point lies.
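A quick NumPy sketch of this transformation, with arbitrary example vectors:

```python
import numpy as np

def to_3d(p):
    """Lift a 2D point into 3D, using p . p = x^2 + y^2 as the third coordinate."""
    return np.array([p[0], p[1], np.dot(p, p)])

u = np.array([1.0, 2.0])
v = np.array([3.0, 1.0])

u3, v3 = to_3d(u), to_3d(v)
print(u3, v3)            # [1, 2, 5] and [3, 1, 10]
print(np.dot(u3, v3))    # dot product in the new 3D space: 3 + 2 + 50 = 55
```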
Even Higher Dimensions
Let's assume another scenario: we have data in three-dimensional space, and we want to transform it into nine-dimensional space using all pairwise products of the coordinates.
Using the method above, these vectors in nine dimensions will look like
$\phi(u) = (u_1 u_1, \ u_1 u_2, \ u_1 u_3, \ u_2 u_1, \ u_2 u_2, \ u_2 u_3, \ u_3 u_1, \ u_3 u_2, \ u_3 u_3)$
and similarly for $\phi(v)$.
The dot product of these vectors can be given by
$\phi(u) \cdot \phi(v) = \sum_{i=1}^{3} \sum_{j=1}^{3} u_i u_j v_i v_j$
In a general form, for n dimensions, we can represent the dot product as
$\phi(u) \cdot \phi(v) = \sum_{i=1}^{n} \sum_{j=1}^{n} u_i u_j v_i v_j$
This calculation keeps growing as we add more and more dimensions, which eventually creates performance issues. The computational complexity in this case is O(n²).
So how can SVM achieve this without compromising performance?
Here’s a trick: SVM doesn’t need all of these new dimensions to work its magic — it actually can get by with only the dot products between the original vectors in the original dimension.
This is where the kernel trick comes in. It allows us to operate in the original feature space without computing the coordinates of the data in the higher-dimensional space. In other words, we can just calculate the dot product of u-transpose and v in the original dimension.
The kernel function is denoted by k(x, y). For the nine-dimensional mapping above, it is simply the squared dot product of u-transpose and v:
$k(u, v) = (u^T v)^2 = \left( \sum_{i=1}^{3} u_i v_i \right)^2$
The computational complexity in this case is O(n).
As we can see, we really don't need to project all our data into new dimensions. SVM can just calculate the dot products of these vectors in the original dimension and get the same results.
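Here is a short NumPy sketch verifying the trick: the explicit nine-dimensional mapping and the squared dot product in the original space produce the same number (the vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit 3D -> 9D feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

explicit = np.dot(phi(u), phi(v))   # dot product in 9D space, O(n^2)
kernel = np.dot(u, v) ** 2          # kernel trick in the original space, O(n)

print(explicit, kernel)             # both print 1024.0
```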
Types of Kernel Functions
Polynomial Kernel
It is popular in image processing.
$k(x, y) = (x \cdot y + 1)^d$, where d is the degree of the polynomial.
Gaussian Kernel
It is a general-purpose kernel; used when there is no prior knowledge about the data.
$k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$, where σ is the variance and our hyperparameter.
Gaussian Radial Basis Function (RBF)
It is a general-purpose kernel, used when there is no prior knowledge about the data:
$k(x, y) = \exp\left( -\gamma \, \|x - y\|^2 \right)$, where γ is our hyperparameter.
Laplace RBF Kernel
It is a general-purpose kernel, used when there is no prior knowledge about the data:
$k(x, y) = \exp\left( -\frac{\|x - y\|}{\sigma} \right)$, where σ is the variance and our hyperparameter.
Sigmoid Kernel
We can use it as a proxy for neural networks. The equation is
$k(x, y) = \tanh(\alpha \, x^T y + c)$
where α is the slope and c is the intercept.
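As a hedged sketch, here are these kernel functions implemented directly from the formulas above in NumPy; the parameter defaults are placeholders, not recommended settings:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    return (np.dot(x, y) + 1) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def laplace_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    return np.tanh(alpha * np.dot(x, y) + c)

x, y = np.array([1.0, 2.0]), np.array([2.0, 1.0])
for k in (polynomial_kernel, gaussian_kernel, rbf_kernel,
          laplace_rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x, y))
```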
Advantages
- SVM works well even with unstructured and semi-structured data like text, images, and trees.
- It scales relatively well to high dimensional data.
- The risk of overfitting is lower with SVM.
Disadvantages
- Choosing a “good” kernel function is not easy.
- Long training time for large datasets.
- Difficult to understand and interpret the final model, variable weights and individual impact.
- It is hard to tune the hyperparameters of SVM, such as gamma and C.
I hope this article provides you with a good understanding of Support Vector Machines.
If you have any questions or if you find anything misrepresented please let me know.
Thanks!