Support Vector Machines (SVM)
Support Vector Machines (SVM) is one of the most powerful out-of-the-box supervised machine learning algorithms for classification problems. Unlike many other machine learning algorithms, such as neural networks, SVM does not require much tweaking to obtain good results.
Key Terms
- Support Vectors: The data points closest to the optimal (maximum) hyperplane. These points determine the Positive and Negative hyperplanes that we use for classification.
- Margin: The distance between the support vectors and the optimal hyperplane. The maximized margin is the distance between the two hyperplanes passing through the support vectors.
- Hyperplanes: Decision boundaries that classify data points into their respective classes in a multi-dimensional space.
- Optimal or Maximum Hyperplane: The hyperplane which maximizes the margin of the training data.
In 2D, for a single data point $(x_1, y_1)$, it takes the form of a line: $y_1 = a x_1 + b$.
In 2D, for a set of data points, it takes the vector form $w \cdot x + b = 0$.
- Positive and Negative Hyperplanes: Hyperplanes parallel to the optimal hyperplane and passing through the support vectors.
Before we try to understand SVM, let's refresh a few concepts of vectors.
Vectors
A vector is anything that has both a magnitude and a direction.
Length of a Vector
The length of a vector u is called its norm, written as $\|u\|$. For $u = (u_1, u_2)$ it is
$\|u\| = \sqrt{u_1^2 + u_2^2}$
Direction of a Vector
The direction of a vector u is the unit vector w, defined as
$w = \left( \frac{u_1}{\|u\|}, \ \frac{u_2}{\|u\|} \right)$
The sum of two Vectors
Given two vectors $u = (u_1, u_2)$ and $v = (v_1, v_2)$, the sum of the vectors is given by
$u + v = (u_1 + v_1, \ u_2 + v_2)$
Dot Product
The dot product is the product of the magnitudes of the two vectors and the cosine of the angle between them:
$u \cdot v = \|u\| \|v\| \cos(\theta)$
The dot product tells us what amount of one vector goes in the direction of another, or
$u \cdot v = u_1 v_1 + u_2 v_2$
In generic form, for n-dimensional vectors, the dot product can be represented as
$u \cdot v = \sum_{i=1}^{n} u_i v_i$
We will go into more detail around it later in the article.
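To make these operations concrete, here is a quick NumPy sketch; the vectors u and v are made-up examples:

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

norm_u = np.linalg.norm(u)    # ||u|| = sqrt(3^2 + 4^2) = 5.0
direction_u = u / norm_u      # unit vector in the direction of u: (0.6, 0.8)
vector_sum = u + v            # (3+1, 4+2) = (4.0, 6.0)
dot = np.dot(u, v)            # 3*1 + 4*2 = 11.0

# The dot product also equals ||u|| ||v|| cos(theta)
cos_theta = dot / (norm_u * np.linalg.norm(v))
print(norm_u, direction_u, vector_sum, dot, cos_theta)
```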
Concepts
There are mainly two concepts behind SVM: linear and non-linear separability.
Linear Separability
Where data can be separated linearly: by a point in 1D, a line in 2D, a plane in 3D, or a hyperplane in higher dimensions.
Linear separability is one important concept in SVM. Although in practical cases the data might not be linearly separable, we will start from the linearly separable cases (since they are easy to understand and deal with) and then derive the non-linearly separable cases.
Non-Linear Separability
Where data cannot be separated linearly by any point, line, plane, or hyperplane.
Hyperplanes
Hyperplanes can be considered decision boundaries that classify data points into their respective classes in a multi-dimensional space. Data points falling on either side of the hyperplane can be attributed to different classes.
A hyperplane is a generalization of a plane:
- In two dimensions, it’s a line.
- In three dimensions, it’s a plane.
- In more dimensions, you can call it a hyperplane.
Understanding the equation of a Hyperplane
Two Dimensions
In two dimensions, data can be separated by a line, and we already know the equation of a line:
$y = ax + b$
For any given data point $(x_1, y_1)$ on the line, the equation will be
$y_1 = a x_1 + b$
or,
$a x_1 - y_1 + b = 0$
In matrix form we can represent it as
$\begin{pmatrix} -b & -a & 1 \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ y_1 \end{pmatrix} = 0$
Higher Dimensions
When reading about hyperplanes, we will often find that the equation of a hyperplane for more than two dimensions is defined as
$w^T x = 0$
The reason a hyperplane in more than two-dimensional space is represented this way is that it is difficult to write an equation out coordinate by coordinate in higher dimensions.
We will see below how these two hyperplane equations relate.
How do these two forms relate?
In our hyperplane equation, w and x are vectors, and $w^T x$ is how we compute the inner product of two vectors; if we recall, the inner product is just another name for the dot product.
Note that
$w^T x = 0$
is the same thing as
$w \cdot x = 0$
Given the two vectors
$w = (-b, -a, 1)$ and $x = (1, x_1, y_1)$
we get
$w \cdot x = -b \times 1 + (-a) \times x_1 + 1 \times y_1 = y_1 - a x_1 - b$
so $w \cdot x = 0$ recovers our original line equation $y_1 = a x_1 + b$.
So, the two equations are just different ways of expressing the same thing; one is simply better than the other at representing higher dimensions.
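Here is a small NumPy sketch of this equivalence; the values of a and b are arbitrary:

```python
import numpy as np

a, b = 2.0, 1.0                # the line y = 2x + 1
x1, y1 = 3.0, 7.0              # a point on that line

# Form 1: w . x + b = 0 with w = (a, -1) and x = (x1, y1)
w = np.array([a, -1.0])
x = np.array([x1, y1])
print(np.dot(w, x) + b)        # 0.0 -> the point satisfies the equation

# Form 2: augmented w^T x = 0 with w = (-b, -a, 1) and x = (1, x1, y1)
w_aug = np.array([-b, -a, 1.0])
x_aug = np.array([1.0, x1, y1])
print(np.dot(w_aug, x_aug))    # 0.0 -> same line, different notation
```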
Now that we understand some of the concepts used for building SVM let’s try to understand how SVM actually works.
How does SVM work?
SVM uses the dot product to calculate on which side of the hyperplane a data point lies, the negative or the positive one. Let's understand it with an example.
The goal of SVM is to find a hyperplane that separates all the training data into its classes. SVM evaluates candidate hyperplanes and chooses the one that leaves the maximum margin between the separated classes.
We have seen earlier what these hyperplanes look like in different dimensions.
Now, when a new data point, say X, comes in, SVM tries to identify which class it belongs to. To do so, it uses the quantities below.
Here,
X: the unclassified data point
x: the vector from the origin (0, 0) to X
w: a vector perpendicular to the Maximum Hyperplane
c: the distance of the Maximum Hyperplane from the origin, measured along w
The projection of x onto w, a.k.a. the dot product $w \cdot x$, tells us how far X lies in the direction perpendicular to the hyperplane:
If $w \cdot x = c$, the point X lies on the Maximum Hyperplane.
If $w \cdot x > c$, the point X lies on the positive side of the hyperplane.
If $w \cdot x < c$, the point X lies on the negative side of the hyperplane.
It is still possible that a data point lies between the Maximum Hyperplane and the Positive or Negative Hyperplane.
For the classification to be confident, the data point should lie on or beyond the Positive or Negative Hyperplane.
To make this possible, let's write $b = -c$, so that a point on the Maximum Hyperplane satisfies $w \cdot x + b = 0$, and let's assume the margin on each side is equal to 1. With this consideration the new equations will be:
For the Negative Hyperplane,
$w \cdot x + b \le -1$
If true, the data point lies on or beyond the Negative Hyperplane.
For the Positive Hyperplane,
$w \cdot x + b \ge +1$
If true, the data point lies on or beyond the Positive Hyperplane.
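As a minimal sketch of this decision rule, assuming we already have a trained w and b (the values below are invented for illustration):

```python
import numpy as np

# Hypothetical trained parameters of the hyperplane w . x + b = 0
w = np.array([0.5, -0.5])
b = -0.25

def classify(x):
    """Report which side of the margin a point falls on."""
    score = np.dot(w, x) + b
    if score >= 1:
        return "on or beyond the Positive Hyperplane"
    if score <= -1:
        return "on or beyond the Negative Hyperplane"
    return "inside the margin (classification is not confident)"

print(classify(np.array([4.0, 1.0])))   # score = 1.25
print(classify(np.array([1.0, 4.0])))   # score = -1.75
```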
Optimization
The purpose of optimization in SVM is to maximize the margin so that the Positive and Negative hyperplanes are clearly apart. This in turn helps SVM classify data clearly.
From the equations for the Positive and Negative Hyperplanes above we can calculate the distance between these two planes. Subtracting $w \cdot x_- + b = -1$ from $w \cdot x_+ + b = +1$ gives
$w \cdot (x_+ - x_-) = 2$
Divide both sides by $\|w\|$:
$\frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{2}{\|w\|}$
The left-hand side is the width of the margin. The objective here is to maximize this value, because the farther apart the positive and negative hyperplanes are, the better the classification.
To maximize this margin we need to update $(w, b)$ such that, for every training point $x_i$ with label $y_i \in \{-1, +1\}$,
$w \cdot x_i + b \ge +1$ when $y_i = +1$, and $w \cdot x_i + b \le -1$ when $y_i = -1$
In simplified form we can write this pair of conditions as one equation:
$y_i (w \cdot x_i + b) \ge 1$
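To see this margin in practice, here is a hedged sketch using scikit-learn on made-up, linearly separable data; a very large C approximates the hard-margin setting described above, and the margin width is recovered as 2/||w||:

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (made up for illustration)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

w = clf.coef_[0]
print("w =", w, "b =", clf.intercept_[0])
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```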
Regularization
Inverting the optimization function (maximizing $\frac{2}{\|w\|}$ is the same as minimizing $\frac{1}{2}\|w\|^2$) gives us the regularization function. It is still subject to the condition above:
$\min_{w, b} \ \frac{1}{2} \|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1$
In reality, while solving SVM problems, we will have overlaps, errors, and misclassifications. If we simply try to fit non-linear data linearly, we will end up overfitting. So, to avoid overfitting, we need to account for the errors in our regularization function.
So, our regularization function will be
$\min_{w, b} \ \frac{1}{2} \|w\|^2 + C \sum_{i} E_i$
where $E_i$ is the error for the i-th data point and C controls how heavily errors are penalized. We update $(w, b)$ within the boundary of the equation below:
$y_i (w \cdot x_i + b) \ge 1 - E_i$
Example
Here we have two misclassified data points, and the calculated errors for these data points are E1 and E2.
Let's assume every other point is classified correctly, so its error is zero.
The regularization function will then look like
$\min_{w, b} \ \frac{1}{2} \|w\|^2 + C (E_1 + E_2)$
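Here is a small NumPy sketch of this soft-margin objective; the parameters w, b, C and the data are made up, and each error E_i is computed as a hinge loss:

```python
import numpy as np

# Hypothetical parameters and data (made up for illustration)
w = np.array([1.0, -1.0])
b = 0.0
C = 1.0
X = np.array([[2.0, 0.5], [0.5, 2.0], [1.0, 0.9]])
y = np.array([1, -1, -1])    # the third point ends up misclassified

# Error E_i = max(0, 1 - y_i * (w . x_i + b)), zero for well-classified points
scores = X @ w + b
E = np.maximum(0, 1 - y * scores)

objective = 0.5 * np.dot(w, w) + C * np.sum(E)
print("errors E_i:", E)                    # [0.0, 0.0, 1.1]
print("regularized objective:", objective)
```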
Lagrange Formulation
The Lagrange formulation is used in multivariable calculus to minimize or maximize a function subject to constraints. When we have an optimization problem under some conditions or constraints, we use Lagrange; it gives us a new optimization problem that has those constraints built in.
Let's assume we have a function
$f(x, y, z)$
with a given condition or constraint
$g(x, y, z) = 0$
To solve this using Lagrange, let's introduce a new parameter λ, the Lagrangian multiplier. One way to solve this problem is to define a new function F such that
$F(x, y, z, \lambda) = f(x, y, z) - \lambda \, g(x, y, z)$
Now we have four parameters, so we need four equations to find the values for each of them. These equations come from setting the partial derivatives of F to zero:
$\frac{\partial F}{\partial x} = 0, \quad \frac{\partial F}{\partial y} = 0, \quad \frac{\partial F}{\partial z} = 0, \quad \frac{\partial F}{\partial \lambda} = 0$
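As an illustration, here is a SymPy sketch of the same recipe on a made-up two-variable problem, maximizing f = xy subject to x + y = 10; with two variables plus λ we get three equations instead of four:

```python
import sympy as sp

x, y, lam = sp.symbols("x y lam")

f = x * y                  # function to maximize (made-up example)
g = x + y - 10             # constraint g = 0

F = f - lam * g            # the new function F
equations = [sp.diff(F, v) for v in (x, y, lam)]
print(sp.solve(equations, (x, y, lam), dict=True))   # x = 5, y = 5, lam = 5
```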
Apply Lagrange
SVM uses this Lagrangian method to optimize the model. We could try to optimize without Lagrange, but as the number of dimensions increases, the performance decreases.
Here we will see how SVM regularizes the model using Lagrangian multipliers.
Regularization:
$\min_{w, b} \ \frac{1}{2} \|w\|^2 + C \sum_i E_i$
Constraint:
$y_i (w \cdot x_i + b) \ge 1 - E_i$
Here we are optimizing a quadratic function with linear constraints, which leads us to the solution of dual problems.
In optimization, the duality principle states that an optimization problem can be viewed from two perspectives: the primal problem and the dual problem. The solution to the dual problem provides a lower bound on the solution of the primal (minimization) problem.
To solve the above quadratic problem with linear constraints, we can use the method of Lagrange.
The Lagrange function is represented as
$L(x, \lambda) = f(x) - \sum_i \lambda_i \, g_i(x)$
Applying it to our regularization function, with multipliers $\alpha_i \ge 0$ for the margin constraints and $\beta_i \ge 0$ for the constraints $E_i \ge 0$, we get
$L(w, b, E, \alpha, \beta) = \frac{1}{2} \|w\|^2 + C \sum_i E_i - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + E_i \right] - \sum_i \beta_i E_i$
This equation is the primal formulation.
To solve the primal equation, we set the following conditions:
$\frac{\partial L}{\partial w} = 0, \quad \frac{\partial L}{\partial b} = 0, \quad \frac{\partial L}{\partial E_i} = 0$
Applying these derivatives we get
$w = \sum_i \alpha_i y_i x_i, \quad \sum_i \alpha_i y_i = 0, \quad C = \alpha_i + \beta_i$
Now we will apply the Karush-Kuhn-Tucker (KKT) conditions.
What the KKT (complementary slackness) condition states is that, for a given constraint $g_i \ge 0$ with Lagrangian multiplier $\alpha_i$,
if
$\alpha_i > 0$
then
$g_i = 0$
For simplicity, let's assume the error $E_i = 0$.
Applying KKT we will get
$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0$
A few other important conditions relate to the Lagrange multipliers: $\alpha_i > 0$ only for the points lying exactly on the margin, i.e. the support vectors, and $\alpha_i = 0$ for every other data point.
This is very useful because SVM now only has to compute and use the support vectors for classification, instead of checking all the other data points, of which there can be many. This is also computationally efficient.
Let's plug the equations from the derivatives into our primal function.
Plugging these values, $w = \sum_i \alpha_i y_i x_i$ and $\sum_i \alpha_i y_i = 0$, into our primal function gives us the following formulation, also called the dual formulation:
$\max_{\alpha} \ \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
Now, SVM will maximize the above equation subject to the following conditions:
$\sum_i \alpha_i y_i = 0$
and
$0 \le \alpha_i \le C$
This is how SVM uses the Lagrangian function to optimize the model.
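As a hedged illustration, scikit-learn's SVC solves exactly this dual problem internally, so after fitting it on made-up data we can inspect the support vectors and their multipliers:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 1], [2, 3], [6, 5], [7, 7], [8, 6]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Only the support vectors have non-zero multipliers alpha_i
print("support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)   # sklearn stores the product

# Recover w from w = sum_i alpha_i y_i x_i
w = clf.dual_coef_ @ clf.support_vectors_
print("w =", w, "b =", clf.intercept_)
```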
Kernel Trick
SVM often needs to transform data from a lower to a higher dimension for better separability.
There could be many new dimensions, each of them possibly involving a complicated calculation, so transforming every vector to a higher dimension can be a lot of work.
Let's understand this with a few simple examples.
2D to 3D Transformation — Single Vector
Here we have a vector u given by coordinates (x, y).
To transform this vector into 3D space, as mentioned above, we take the dot product of the vector with itself, and the result of this dot product gives us our third dimension: $u \cdot u = x^2 + y^2$.
So now we can represent our vector u in 3D space as
$u_{3D} = (x, \ y, \ x^2 + y^2)$
2D to 3D Transformation — Two Vectors
Now, let's say we have two two-dimensional vectors u and v, and we want to transform both into 3D vectors.
As we have seen above, the third dimension of each is given by the dot product of the vector with itself.
We can represent u and v in the third dimension as
$u_{3D} = (u_1, \ u_2, \ u_1^2 + u_2^2), \quad v_{3D} = (v_1, \ v_2, \ v_1^2 + v_2^2)$
As we know from the section above, SVM will now take the dot product in this new dimension, shown in the sketch below, to calculate on which side of the hyperplane, negative or positive, a data point lies.
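A quick NumPy sketch of this transformation, with arbitrary example vectors:

```python
import numpy as np

def to_3d(p):
    """Lift a 2D point into 3D, using p . p = x^2 + y^2 as the third coordinate."""
    return np.array([p[0], p[1], np.dot(p, p)])

u = np.array([1.0, 2.0])
v = np.array([3.0, 1.0])

u3, v3 = to_3d(u), to_3d(v)
print(u3, v3)            # [1, 2, 5] and [3, 1, 10]
print(np.dot(u3, v3))    # dot product in the new 3D space: 3 + 2 + 50 = 55
```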
Even Higher Dimensions
Let's assume another scenario: we have data in three-dimensional space, and we want to transform it into nine-dimensional space using all pairwise products of the coordinates.
Using the method above, these vectors in nine dimensions will look like
$\phi(u) = (u_1 u_1, \ u_1 u_2, \ u_1 u_3, \ u_2 u_1, \ u_2 u_2, \ u_2 u_3, \ u_3 u_1, \ u_3 u_2, \ u_3 u_3)$
and similarly for $\phi(v)$.
The dot product of these vectors can be given by
$\phi(u) \cdot \phi(v) = \sum_{i=1}^{3} \sum_{j=1}^{3} u_i u_j v_i v_j$
In a general form, for n dimensions, we can represent the dot product as
$\phi(u) \cdot \phi(v) = \sum_{i=1}^{n} \sum_{j=1}^{n} u_i u_j v_i v_j$
This calculation keeps growing as we add more and more dimensions, which eventually creates performance issues. The computational complexity in this case is O(n²).
So how can SVM achieve this without compromising performance?
Here’s a trick: SVM doesn’t need all of these new dimensions to work its magic — it actually can get by with only the dot products between the original vectors in the original dimension.
This is where the kernel trick comes in. It allows us to operate in the original feature space without computing the coordinates of the data in the higher-dimensional space. In other words, we can just calculate the dot product of u-transpose and v in the original dimension.
The kernel function is denoted by k(x, y). For the nine-dimensional mapping above, it is simply the squared dot product of u-transpose and v:
$k(u, v) = (u^T v)^2 = \left( \sum_{i=1}^{3} u_i v_i \right)^2$
The computational complexity in this case is O(n).
As we can see, we really don't need to project all our data into new dimensions. SVM can just calculate the dot products of these vectors in the original dimension and get the same results.
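Here is a short NumPy sketch verifying the trick: the explicit nine-dimensional mapping and the squared dot product in the original space produce the same number (the vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Explicit 3D -> 9D feature map: all pairwise products x_i * x_j."""
    return np.outer(x, x).ravel()

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

explicit = np.dot(phi(u), phi(v))   # dot product in 9D space, O(n^2)
kernel = np.dot(u, v) ** 2          # kernel trick in the original space, O(n)

print(explicit, kernel)             # both print 1024.0
```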
Types of Kernel Functions
Polynomial Kernel
It is popular in image processing.
$k(x, y) = (x \cdot y + 1)^d$, where d is the degree of the polynomial.
Gaussian Kernel
It is a general-purpose kernel; used when there is no prior knowledge about the data.
$k(x, y) = \exp\left( -\frac{\|x - y\|^2}{2\sigma^2} \right)$, where σ is the variance and our hyperparameter.
Gaussian Radial Basis Function (RBF)
It is a general-purpose kernel, used when there is no prior knowledge about the data:
$k(x, y) = \exp\left( -\gamma \, \|x - y\|^2 \right)$, where γ is our hyperparameter.
Laplace RBF Kernel
It is a general-purpose kernel, used when there is no prior knowledge about the data:
$k(x, y) = \exp\left( -\frac{\|x - y\|}{\sigma} \right)$, where σ is the variance and our hyperparameter.
Sigmoid Kernel
We can use it as a proxy for neural networks. The equation is
$k(x, y) = \tanh(\alpha \, x^T y + c)$
where α is the slope and c is the intercept.
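As a hedged sketch, here are these kernel functions implemented directly from the formulas above in NumPy; the parameter defaults are placeholders, not recommended settings:

```python
import numpy as np

def polynomial_kernel(x, y, d=2):
    return (np.dot(x, y) + 1) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.linalg.norm(x - y) ** 2)

def laplace_rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.linalg.norm(x - y) / sigma)

def sigmoid_kernel(x, y, alpha=0.1, c=0.0):
    return np.tanh(alpha * np.dot(x, y) + c)

x, y = np.array([1.0, 2.0]), np.array([2.0, 1.0])
for k in (polynomial_kernel, gaussian_kernel, rbf_kernel,
          laplace_rbf_kernel, sigmoid_kernel):
    print(k.__name__, k(x, y))
```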
Advantages
- SVM works well even with unstructured and semi-structured data like text, images, and trees.
- It scales relatively well to high dimensional data.
- The risk of overfitting is lower with SVM.
Disadvantages
- Choosing a “good” kernel function is not easy.
- Long training time for large datasets.
- Difficult to understand and interpret the final model, variable weights and individual impact.
- It is hard to tune the hyperparameters of SVM, such as gamma and C.
I hope this article provides you with a good understanding of Support Vector Machines.
If you have any questions or if you find anything misrepresented please let me know.
Thanks!