SVM DUAL FORMULATION

sathvik chiramana
6 min read · Oct 1, 2019

--

Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression tasks, but it is mainly used for classification. In this blog we will focus on classification and see how SVM works internally.

After going through this article you should have a grasp of the following concepts:

  1. How does SVM work?
  2. What is a Lagrange multiplier and how is it used in SVM?
  3. What is the SVM dual formulation?

Main Task of SVM:

The main task of SVM is to find the best separating hyperplane for the training data set, i.e. the one that maximizes the margin.

There are many hyperplanes that can separate the two classes, and SVM will find the margin-maximizing one.

Reason for a margin-maximizing hyperplane:

The smaller the margin, the higher the chance that points get misclassified. By keeping the margin as wide as possible, we reduce the chance of positive/negative points being misclassified.

Maths behind SVM

(Figure: the separating hyperplane and its margin planes, taken from Stack Exchange)

Here the hyperplane w^T x + b = 0 is the central plane that separates the positive and negative data points. w^T x + b = 1 is the plane on or above which the positive points lie, and w^T x + b = −1 is the plane on or below which all the negative points lie.

The margin is the distance between the planes w^T x + b = 1 and w^T x + b = −1, and our task is to maximize it.

The distance between these two hyperplanes works out to 2/||w||, and this is the quantity we want to maximize.
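One standard way to see where the 2/||w|| comes from: take a point x₊ on the plane w^T x + b = 1 and the closest point x₋ to it on the plane w^T x + b = −1. Then x₊ = x₋ + d·(w/||w||) for some distance d, and subtracting the two plane equations gives w^T (x₊ − x₋) = 2, i.e. d·||w|| = 2, so d = 2/||w||.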

HARD MARGIN SVM:

In hard-margin SVM we assume that all positive points lie on or above the π(+) plane, all negative points lie on or below the π(−) plane, and no points lie inside the margin. This can be written as the constraint y_i (w^T x_i + b) ≥ 1.

Now the whole optimization problem can be written as

max over (w, b):  2/||w||   such that   y_i (w^T x_i + b) ≥ 1 for all i

Maximizing 2/||w|| is the same as minimizing ||w||/2, which we can further rewrite as minimizing ||w||²/2 (squaring does not change the minimizer and makes the objective easier to work with).
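To make this concrete, here is a minimal sketch that solves the hard-margin primal directly with cvxpy on a small made-up separable dataset (the data and variable names are purely illustrative):

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data: two Gaussian blobs, labels +1 and -1
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

w = cp.Variable(2)
b = cp.Variable()

# Primal: minimize ||w||^2 / 2  subject to  y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(cp.sum_squares(w) / 2)
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("margin width =", 2 / np.linalg.norm(w.value))
```

A generic solver like this is fine for tiny examples; practical SVM libraries typically optimize the dual form derived below instead.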

How do we find the solution to an optimization problem with constraints?

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints.

Lagrange multipliers

Let's take a simple example and see why Lagrange multipliers work.

minimize over (x, y):  f(x, y) = x² + y²  → (1)

such that  g(x, y) = x + y − 1 = 0

We can rewrite the constraint as y = 1 − x  → (2)

Now draw the level curves of (1) and the constraint line (2) on the same plot; it will look something like this:

Lagrange observed that the minimum of f(x, y) under the constraint g(x, y) = 0 is attained where the gradients of f and g point in the same direction. From the graph we can clearly see that the gradients of f and g point in (almost) the same direction at the point (0.5, 0.5), so we can declare that f(x, y) is minimized at (0.5, 0.5) subject to g(x, y) = 0.

We can write this mathematically as

∇f(x, y) = λ∇g(x, y)  ⇒  ∇f(x, y) − λ∇g(x, y) = 0

where ∇ denotes the gradient. We multiply the gradient of g by λ because the gradients of f and g point in the same direction but need not have the same magnitude; λ scales one of them so that the two sides match. This λ is called the Lagrange multiplier.
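Working this out for the toy example: ∇f = (2x, 2y) and ∇g = (1, 1), so ∇f = λ∇g gives 2x = λ and 2y = λ, i.e. x = y. Plugging x = y into the constraint x + y = 1 gives x = y = 0.5, the same point we read off the plot.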

Now, back to our SVM hard-margin problem: we can write it in Lagrangian form as follows.
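In standard form (numbered (7.1) here so that we can refer back to it later):

L(w, b, α) = ||w||²/2 − Σ_i α_i [ y_i (w^T x_i + b) − 1 ]  → (7.1)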

where the α_i ≥ 0 are the Lagrange multipliers, one for each training point.

Dual Form Of SVM

A Lagrangian problem is typically solved via its dual form. The duality principle says that the optimization can be viewed from two different perspectives: the first is the primal form, which is a minimization problem, and the other is the dual form, which is a maximization problem.

Recall the Lagrange formulation of SVM, equation (7.1) above.

To solve the minimization over w and b, we take the partial derivatives of L with respect to w and b and set them to zero:
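∂L/∂w = 0  ⇒  w = Σ_i α_i y_i x_i

∂L/∂b = 0  ⇒  Σ_i α_i y_i = 0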

Substituting all of these back into equation (7.1), we get
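L(α) = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)

(the w and b terms cancel, leaving a function of α alone)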

And so the final equation, the dual problem, will be
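max over α:  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)

such that  α_i ≥ 0 for all i  and  Σ_i α_i y_i = 0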

This optimization problem is called the dual problem.

Why do we try to maximize the Lagrangian in SVM?

Let p∗ be the optimal value of the problem of minimizing ||w||²/2 subject to the constraints (the primal). The Lagrangian dual function, obtained by minimizing L(w, b, α) over w and b, is always ≤ p∗: it is a lower bound on the primal optimum. Instead of solving the primal problem directly, we look for the best (largest) lower bound on p∗ by maximizing the Lagrangian dual function; that is the dual problem.

Soft Margin SVM

The idea here is not to insist on zero classification errors on the training data, but to allow a few errors if necessary. So our optimization constraints now become
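y_i (w^T x_i + b) ≥ 1 − ζ_i,  with  ζ_i ≥ 0 for all i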

where ζ_i (zeta) is a slack variable: it measures how far a margin-violating point lies on the wrong side of its margin plane.

However, we also need to control how soft the margin is. That is why we add the parameter C, which determines how much weight the slack terms ζ_i get. The objective then becomes
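min over (w, b, ζ):  ||w||²/2 + C Σ_i ζ_i

such that  y_i (w^T x_i + b) ≥ 1 − ζ_i  and  ζ_i ≥ 0 for all i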

If the value of C is very high, we try very hard to reduce the number of misclassified points, which can result in overfitting; as C decreases, more margin violations are tolerated and the model tends toward underfitting.

The dual form of soft-margin SVM is almost the same as the hard-margin dual; the only difference is that each α_i must lie between 0 and C.
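Written out, the soft-margin dual is:

max over α:  Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j (x_i^T x_j)

such that  0 ≤ α_i ≤ C for all i  and  Σ_i α_i y_i = 0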

The important observations from the dual form of SVM are:

  1. For every x_i there is a corresponding multiplier α_i.
  2. The dual form involves the data only through the inner products x_i^T x_j.
  3. α_i is greater than zero only for the support vectors; for all other points it is 0.
  4. So when predicting for a query point, only the support vectors matter, as the sketch below illustrates.
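As a quick check of observations 3 and 4, here is a minimal sketch using scikit-learn's SVC with a linear kernel on made-up data; SVC exposes the support vectors it found, and dual_coef_ stores y_i·α_i for those points only (every other training point has α_i = 0):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: two blobs with labels +1 and -1
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) + [2, 2], rng.randn(20, 2) - [2, 2]])
y = np.hstack([np.ones(20), -np.ones(20)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print("number of support vectors:", len(clf.support_))   # usually far fewer than 40
print("y_i * alpha_i for the support vectors:", clf.dual_coef_)

# Prediction for a query point x_q only needs the support vectors:
#   sign( sum_i alpha_i y_i (x_i^T x_q) + b )
x_q = np.array([[0.5, 0.5]])
print("prediction:", clf.predict(x_q))
```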
