Support Vector Machine (SVM) Explained

Sanjay Singh
Oct 23, 2019 · 6 min read

Support Vector Machine (SVM) is one of the most powerful classifiers in machine learning. At the same time, it is one of the most difficult to understand.

In order to understand SVM, you might have to revisit some basic high-school math concepts.

Two critical concepts to master are vectors and the dot product between vectors.

Dot Product

The dot product of two vectors is a scalar number, and it indicates the similarity between those vectors.


X1 = [1, 2, 3, 4]

X2 = [5, 6, 7, 8]

The dot product of X1 and X2 is 1*5 + 2*6 + 3*7 + 4*8 = 5 + 12 + 21 + 32 = 70

Python Code:

import numpy as np

X1 = np.array([1, 2, 3, 4])
X2 = np.array([5, 6, 7, 8])
print(np.dot(X1, X2))  # 70



Besides indicating similarity, the dot product of two unit-length vectors equals the cosine of the angle between them.
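A quick Python sketch of this fact (the two vectors below are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# Normalize both vectors to unit length
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

# Dot product of unit vectors = cosine of the angle between them
cos_theta = np.dot(a_hat, b_hat)
print(cos_theta)  # ~0.7071, i.e. cos(45 degrees)
```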

Linear Separation

Let’s start with a simple case. A company has recorded the age and income of its prospective customers and whether or not they purchased its product.

Below is a visualization of what the company recorded.

Now, given a list of new prospective customers’ ages and incomes, you should be able to predict whether or not they will purchase the product.

Let’s use SVM.

Line Separator

Looking at the above visualization, the first thing that comes to mind is to draw a straight line separating the stars from the red dots. Let’s do that.

Looks somewhat better now. But we can draw this separator line in several ways. Some of them are shown in the visualization below.

And based on which separator line we choose, the prediction for a new data point (the green square) will be different.

So, which separator is correct?

This is where the concept of maximum margin comes into play.

The challenge is to come up with a separator line such that the margin to the nearest vectors on either side of the line is maximum.

The vectors in the blue circles in the above visualization are called support vectors. These vectors determine the maximum margin.

The SVM’s task is to find a separator that separates the two classes (Yes and No) with the constraint that the margin between the separator and the nearest vectors is maximum.
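Before deriving the math by hand, here is a minimal sketch of this task using scikit-learn (an assumption, not part of the article) on hypothetical age/income data; a large C approximates the hard-margin, maximum-margin separator described above.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical [age, income-in-thousands] samples; 1 = purchased, 0 = did not
X = np.array([[25, 30], [30, 35], [28, 32], [45, 80], [50, 90], [48, 85]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates the hard-margin (maximum-margin) separator
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.support_vectors_)     # the support vectors found by the solver
print(clf.predict([[40, 60]]))  # predict a new prospective customer
```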

Let’s find an equation representing the separator. Draw a perpendicular vector w on the median line of the separator plane. This perpendicular vector (also called the normal vector) w indicates the orientation of the median line. With b a constant that sets the distance of the median line from the origin, the equation of the median line will be

w.x + b = 0
However, you can draw several perpendicular vectors to the separator. That means there can be several w vectors, because we have not yet fixed the length of this vector. Let’s fix the length of vector w by requiring

w.x + b ≥ 1 for Yes points

and w.x + b ≤ -1 for No points

Now we have fixed the size of w.

Let’s combine the above two equations into one. For that we need a new variable y such that

y = +1 for Yes points

and y = -1 for No points

With the addition of y, the two inequalities above combine into

y(w.x + b) ≥ 1
For support vectors: y(w.x + b) - 1 = 0 ……………> Eq(1)

For a support vector on the Yes side (y = +1), this gives w.x + b - 1 = 0 ……………> Eq(2)

and for a support vector on the No side (y = -1), it gives -(w.x + b) - 1 = 0 ……………> Eq(3)

These equations are called decision rules.
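To make the decision rules concrete, here is a tiny sketch with a hand-picked (purely hypothetical) w and b: the quantity y(w.x + b) - 1 is exactly zero for a point sitting on the margin and positive for a correctly classified point beyond it.

```python
import numpy as np

# Hypothetical separator: w and b chosen by hand for illustration
w = np.array([1.0, 1.0])
b = -3.0

def decision_rule(x, y):
    """Return y*(w.x + b) - 1: zero on the margin, > 0 beyond it."""
    return y * (np.dot(w, x) + b) - 1

# A support vector on the Yes side: w.x + b = +1
print(decision_rule(np.array([2.0, 2.0]), +1))  # 0.0
# A point well inside the No side: w.x + b = -2
print(decision_rule(np.array([0.5, 0.5]), -1))  # 1.0
```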

Now that we have equations for the vectors in the separator plane and the vectors on either side of it, let’s find an equation for the width of the separator (the margin).

The distance between support vectors Xa and Xb in the above visualization can be calculated by projecting the vector (Xa - Xb) onto the unit normal:

width = (Xa - Xb) . w / ||w|| ……………> Eq(4)
Substitute Xa and Xb using Eq(2) and Eq(3). For Xa, y is +1, so

w.Xa + b - 1 = 0

w.Xa = 1 - b

and for Xb, y is -1, so

-(w.Xb + b) - 1 = 0

w.Xb = -1 - b
Eq(4) will then expand to

width = (w.Xa - w.Xb) / ||w|| = ((1 - b) - (-1 - b)) / ||w|| = 2 / ||w||

Maximizing the width of the separator, 2/||w||, means minimizing ||w||, which is equivalent to minimizing (1/2)||w||^2 (the square makes the math convenient).

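A quick numerical illustration of why minimizing ||w|| widens the margin (the w below is hypothetical):

```python
import numpy as np

# Hypothetical normal vector of a trained separator
w = np.array([3.0, 4.0])            # ||w|| = 5

# Width of the margin is 2 / ||w||
width = 2 / np.linalg.norm(w)
print(width)                        # 0.4

# A smaller ||w|| gives a wider margin, which is why we minimize ||w||
w_small = w / 2                     # ||w_small|| = 2.5
print(2 / np.linalg.norm(w_small))  # 0.8
```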
Let’s summarize what we got so far:

minimize (1/2)||w||^2 subject to y(w.x + b) - 1 ≥ 0 for every sample ……………> Eq(7)

This is a constrained optimization problem. We can use Lagrange multipliers to solve it.

Lagrange Multiplier

The general form of the Lagrangian is below:

L = (optimization target) - Σ (multiplier α × constraint)

For us, the optimization target is the minimization in Eq(7) and the constraints are the decision rules y(w.x + b) - 1 ≥ 0. Putting these into the Lagrangian gives

L = (1/2)||w||^2 - Σ αi [ yi(w.xi + b) - 1 ]

To find the extremum of a function, you take its derivative and set it to zero. We need to differentiate L with respect to w and b.

This results in

∂L/∂w = w - Σ αi yi xi = 0

w = Σ αi yi xi ……………> Eq(10)
This equation indicates that w is a linear combination of some of the sample vectors x (those with αi ≠ 0, which are the support vectors).
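We can check this relationship numerically. The sketch below assumes scikit-learn (not part of the article), whose fitted linear SVC exposes αi·yi as dual_coef_ and the support vectors as support_vectors_; the toy data is hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data in the spirit of the age/income example
X = np.array([[25, 30], [30, 35], [28, 32],
              [45, 80], [50, 90], [48, 85]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

# dual_coef_ holds alpha_i * y_i for each support vector, so
# w = sum_i alpha_i * y_i * x_i is just a matrix product:
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))  # True
```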

Let’s look at the differentiation of L with respect to b:

∂L/∂b = -Σ αi yi = 0, which gives Σ αi yi = 0 ……………> Eq(12)

Let’s get back to the Lagrangian and replace w using Eq(10). After expanding, this gives

L = Σ αi - (1/2) Σi Σj αi αj yi yj (xi . xj) - b Σ αi yi ……………> Eq(14)

Using Eq(12), the term b Σ αi yi in Eq(14) is zero. This reduces the equation to

L = Σ αi - (1/2) Σi Σj αi αj yi yj (xi . xj) ……………> Eq(16)

Eq(16) shows that the optimization of the width depends purely on the dot products of the support vectors.

Now, let’s rewrite the decision rule w.x + b by replacing w with Eq(10). Suppose we get a new vector u and we want to know whether it goes to the Yes class or the No class.

If Σ αi yi (xi . u) + b ≥ 0, the new vector u belongs to the Yes class.

If it is less than 0, the new vector belongs to the No class.

The dot product xi . u in this rule shows that the decision rule also depends on dot products. Now you know why I started this article by explaining the dot product.
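As a sketch (again assuming scikit-learn and hypothetical data, as above), we can classify a new vector u using nothing but dot products with the support vectors, and verify the result against the library’s own decision function.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data; +1 = Yes class, -1 = No class
X = np.array([[25, 30], [30, 35], [28, 32],
              [45, 80], [50, 90], [48, 85]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

u = np.array([40.0, 60.0])   # new prospective customer

# Decision value built only from dot products with the support vectors:
# sum_i alpha_i * y_i * (x_i . u) + b
score = clf.dual_coef_ @ (clf.support_vectors_ @ u) + clf.intercept_
print("Yes" if score[0] >= 0 else "No")

# Same value as the library's decision function
print(np.allclose(score, clf.decision_function(u.reshape(1, -1))))  # True
```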

And so far we have only covered the linear case. Non-linear data and kernels are still there to explore.

Non-Linear Separation & Kernel Function

So far we have worked on a case where the data was linearly separable. What if the data is not linearly separable? Let’s keep that for the next article.
