Optimization of Support Vector Machine
A mathematical explanation of optimization of the linearly separable classifier using quadratic programming.
For linearly separable data the objective function of support vector machines(SVM) defined as;
if you want to know more about how we formulate it read part1.
This is a quadratic and convex objective function with linear constraints. So to optimize this objective function we need to introduce a Lagrange multipliers α’s; one for each in inequality constraints. This is one of the non-linear programming problem called the Quadratic programming problem.
Method of Langrangian multiplier deals with finding local minima or local maxima of a function subject to equality constraints. But in the objective function, we have an inequality constraint so we need to use a generalized version of the Lagrange multiplier method considering inequality constraints also. So, Karush-Kuhn-Tucker(KKT) condition generalised Lagrange multiplier method with considering inequality constraints as;
Now the Lagrangian(L) of the objective function becomes;
where α’s are Lagrange multipliers. Because linear constraint is greater than or equal to type; we make it less than or equal to by multiply by -1 to satisfy the KKT conditioned Lagrangian multiplier method. L has to be minimised concerning (w.r.t.) w and b while maximised for α’s. That means we need to find a saddle point.
In the above figure as you can see X is a saddle point and along with curve AB we minimizing the function and along curve CD maximizing its duality.
The Duality Theorem states that in a convex optimization problem and set of linear constraints if the primal problem has an optimal solution then the dual problem also has an optimal solution. And corresponding values are equal which we called a saddle point. Here the primal problem is to minimize an L with respect to W and b while dual problem is to maximize an L with respect to α’s.
If a linear constraint is violated then L can be increased by increasing the corresponding α but then W and b have to change such that L is decreased. This is the case when a data point is lies within the margin and W and b have to be changed to adjust the margin. For all constraints which are not met as equalities mean greater than 0 then corresponding α’s must be 0 to maximize L.
As we need to minimise L w.r.t. W and b we derivate it and equate it with zero. Such as ;
After derivate L w.r.t. w and b we get,
Rearranging the term of L and replacing the value we get above, then
Now, this is a dual problem which we need to maximise w.r.t. α with constraints as ;
This is solved using numerical methods of quadratic optimisation which gives optimal values of α’s for each training point by maximizing the dual problem L. The training points for which the optimal value of α greater than 0 are called support vectors. For other training points, the value of optimal α is zero which lie on that side of the marginal hyperplane such that;
this inequality holds strictly.
As you can notice in the duality of L formulated above, the inner product of two datapoints of dimension d is taken. As we know the inner product provides some measure of similarity or it tells how apart they are from each other. If they are alike then the inner product is 1 or if they are dissimilar then it is 0.
Insight about inner products regarding L:
So, let’s understand the duality of L in the context of the inner product. As we claim that if α’s are non-zero then the function will be maximized. Those are the support vectors that help to maximise the margin.
Notice from the constraints that all the α’s are positive. Let’s explore some cases that arise with it:
CASE 1: If two features are alike then their inner product is 1. Then there are two subclasses such as:
a) if both feature vector make opposite predictions about the output value of y i.e one is +1 or another one is -1 then;
becomes negative in the second term of function and we are subtracting it, so this adds to the sum and maximizing it. These are the examples we are looking for; the critical ones that tell the two classes apart.
b) if both feature vector make similar predictions about the output value of y i.e. both are +1 or -1 then;
becomes positive in the second term of function and we are subtracting it, this decreases the value of L. So, the algorithm downgrades the similar feature vectors that make the same prediction.
CASE 2: If two feature vectors are dissimilar then their inner product is 0. So, the term;
becomes zero. So they don’t contribute to L.
Using the optimum value of α, we compute an optimum weight such that
Now optimal separating hyperplane is given by
Summary: Briefly, dot(inner)products in the optimization decide which data points help to get optimal hyperplane. These data points are called Support vectors that shape the orientation of the hyperplane.
: Support Vector Machines — Kernels and the Kernel Trick — Martin Hofmann
I hope that this blog helped to understand SVM’s optimization using quadratic programming. Please feel free to comment down your thoughts, feedback or suggestions if any below. Thank you!