AI Sidewalk #3: Delving Into Binary SVMs (Part 2)

Shantanu Phadke
Jul 10 · 5 min read

Last Time

Here’s a link to the part 1 article: https://medium.com/@ssp247/ai-sidewalk-2-delving-into-binary-svms-part-1-d304907dd520

Lagrangian Duality Explored

Last time we saw the following optimization:
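In the usual soft-margin notation (w the weight vector, b the bias, epsilon the slack variables, and C the slack penalty), that optimization is:

```latex
\begin{aligned}
\min_{w,\,b,\,\epsilon}\quad & \|w\|_2 + C\sum_{i=1}^{n}\epsilon_i \\
\text{s.t.}\quad & y_i(w^\top x_i + b) \ge 1 - \epsilon_i, \quad i = 1,\dots,n \\
& \epsilon_i \ge 0, \quad i = 1,\dots,n
\end{aligned}
```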

and its corresponding dual problem:
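In the same notation, the dual is a maximization over the multipliers alpha alone:

```latex
\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j \\
\text{s.t.}\quad & \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1,\dots,n
\end{aligned}
```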

This section first goes into the details of the general Lagrangian, and then the details of how we go about converting the first minimization into the second maximization.

The Lagrangian

In general, the Lagrangian is a tool for solving constrained optimization problems by folding the constraints into the objective. Given the general optimization
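with objective f, m inequality constraints g_i, and p equality constraints h_j, the standard general form is:

```latex
\begin{aligned}
\min_{x}\quad & f(x) \\
\text{s.t.}\quad & g_i(x) \le 0, \quad i = 1,\dots,m \\
& h_j(x) = 0, \quad j = 1,\dots,p
\end{aligned}
```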

Its corresponding Lagrangian would be
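with one multiplier lambda_i >= 0 per inequality constraint g_i and one multiplier nu_j per equality constraint h_j:

```latex
\mathcal{L}(x, \lambda, \nu) \;=\; f(x) \;+\; \sum_{i=1}^{m}\lambda_i\, g_i(x) \;+\; \sum_{j=1}^{p}\nu_j\, h_j(x), \qquad \lambda_i \ge 0
```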

Duality

In general suppose we are given an optimization in the form
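that is, a minimization over x of a maximization over y:

```latex
\min_{x}\ \max_{y}\ f(x, y)
```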

If we switch the maximization and minimization operations we end up with
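The swapped (dual) problem maximizes over y the inner minimum over x:

```latex
\max_{y}\ \min_{x}\ f(x, y) \;=\; \max_{y}\ g(y)
```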

Here g(y) is the dual function of f(x,y).

Duality Gap and Strong Duality

Theoretically, if the solution to the original optimization from above is s1 and the solution to its dual is s2, we define the notion of optimal duality gap as being the difference between these two solutions (s1-s2). Furthermore, strong duality is when the solutions to the primal and dual are equal (s1=s2).
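Concretely, weak duality guarantees that the dual solution never exceeds the primal one, so the gap is non-negative:

```latex
s_2 \;\le\; s_1, \qquad \text{duality gap} \;=\; s_1 - s_2 \;\ge\; 0
```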

KKT (Karush-Kuhn-Tucker) Conditions

Given that strong duality holds, we can use the Karush-Kuhn-Tucker (KKT) conditions to characterize the optimal solution of the dual we derive.

  • 1st KKT Condition (Stationarity): Given that (x*, λ*) is an optimal solution for our Lagrangian and that strong duality holds, the gradient of the Lagrangian with respect to x vanishes at the optimum:
  • 2nd KKT Condition (Complementary Slackness): Since primal feasibility gives g_{i}(x*) ≤ 0 and λ_{i} ≥ 0 for all i, combining these with the first condition yields the following:
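For the general problem above (with no equality constraints), the two conditions read:

```latex
\text{Stationarity:}\qquad \nabla_x \mathcal{L}(x^*, \lambda^*) \;=\; \nabla f(x^*) + \sum_{i=1}^{m}\lambda_i^*\, \nabla g_i(x^*) \;=\; 0
```

```latex
\text{Complementary Slackness:}\qquad \lambda_i^*\, g_i(x^*) \;=\; 0, \quad i = 1,\dots,m
```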

Calculating Duality for Soft-Margin SVMs

First we can start off by converting all of the conditions into the general form we have seen above:
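Rearranged into the general g_i(x) ≤ 0 form, the two constraint families become:

```latex
1 - \epsilon_i - y_i(w^\top x_i + b) \;\le\; 0, \qquad -\epsilon_i \;\le\; 0, \qquad i = 1,\dots,n
```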

Now we plug in these derived inequality constraints into the Lagrangian formula as follows:
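With one multiplier α_i ≥ 0 for each margin constraint and β_i ≥ 0 for each slack non-negativity constraint, the Lagrangian is:

```latex
\mathcal{L}(w, b, \epsilon, \alpha, \beta) \;=\; \|w\|_2 + C\sum_{i=1}^{n}\epsilon_i + \sum_{i=1}^{n}\alpha_i\bigl(1 - \epsilon_i - y_i(w^\top x_i + b)\bigr) - \sum_{i=1}^{n}\beta_i\,\epsilon_i
```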

The base-level optimization for the Lagrangian is:
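In min-max form, with the multipliers constrained to be non-negative:

```latex
\min_{w,\,b,\,\epsilon}\ \max_{\alpha \ge 0,\ \beta \ge 0}\ \mathcal{L}(w, b, \epsilon, \alpha, \beta)
```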

Just to make the notation clear: here epsilon, alpha, and beta are n-element vectors collecting the individual epsilon_i, alpha_i, and beta_i values:

If we switch the maximization and minimization, we can calculate the dual function of the Lagrangian:
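The dual function g(α, β) is the inner minimum of the Lagrangian over the primal variables:

```latex
\max_{\alpha \ge 0,\ \beta \ge 0}\ g(\alpha, \beta), \qquad g(\alpha, \beta) \;=\; \min_{w,\,b,\,\epsilon}\ \mathcal{L}(w, b, \epsilon, \alpha, \beta)
```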

We can now evaluate the inner minimization by carrying out the usual optimization steps: compute the derivatives of the Lagrangian with respect to w, b, and epsilon, and set them equal to zero:
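With the Lagrangian still containing the plain L2-norm of w, the three derivative conditions are (note the awkward w/‖w‖₂ term in the first one, valid for w ≠ 0):

```latex
\frac{\partial \mathcal{L}}{\partial w} = \frac{w}{\|w\|_2} - \sum_{i=1}^{n}\alpha_i y_i x_i = 0, \qquad
\frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^{n}\alpha_i y_i = 0, \qquad
\frac{\partial \mathcal{L}}{\partial \epsilon_i} = C - \alpha_i - \beta_i = 0
```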

Notice that solving for w in the first expression above is rather tedious. Substituting the L2-norm of w with half of the square of the L2-norm of w in the original optimization is a common work-around that yields a cleaner solution without changing the nature of the problem (half the squared norm is monotonically increasing in the norm, so we are still minimizing the norm of w).

After this tweak, the following is the optimal value we get for w:
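With the ½‖w‖₂² objective, the derivative with respect to w becomes w minus the sum, giving directly:

```latex
w^* \;=\; \sum_{i=1}^{n}\alpha_i\, y_i\, x_i
```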

Instead of providing optimal values, the other derivatives give us a set of constraints that have to be met by the second (outer) optimization:
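From the derivatives with respect to b and epsilon_i respectively:

```latex
\sum_{i=1}^{n}\alpha_i\, y_i \;=\; 0, \qquad \alpha_i + \beta_i \;=\; C, \quad i = 1,\dots,n
```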

Do you notice a similarity between what we have derived above and the dual optimization at the very top of the post? If so, you aren't mistaken! After plugging in the constraints from above, we get this outer optimization problem:
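Substituting w* back into the Lagrangian eliminates w, b, epsilon, and beta, leaving a problem in alpha alone:

```latex
\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j
\quad\text{s.t.}\quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C
```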

This objective is exactly the same as the objective from above! Finally, let's reason out the constraints. The first constraint was already derived earlier:
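It is the condition obtained from the derivative with respect to b:

```latex
\sum_{i=1}^{n}\alpha_i\, y_i \;=\; 0
```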

and for the second constraint we just need to manipulate an existing condition, using the facts that C needs to be greater than or equal to 0 (in the limit as C goes to infinity, slack is penalized infinitely heavily and we recover the Hard-Margin SVM!) and that the betas also need to be greater than or equal to 0:
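Starting from the derivative condition on epsilon_i:

```latex
\alpha_i + \beta_i = C,\ \ \beta_i \ge 0 \;\Longrightarrow\; \alpha_i \le C, \qquad \alpha_i \ge 0 \;\Longrightarrow\; 0 \le \alpha_i \le C
```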

Extension

What is the Stationarity KKT Condition for the above case? What about the Complementary Slackness KKT Condition?

Big Picture

The whole point of going through the calculations above was to work through the math behind transforming an optimization involving many variables (w, b, and epsilon) into one involving only a single vector (alpha). This result is much easier to optimize than what we started off with!

Brief Intro to Kernels

There will be a separate post in the future covering the concept of kernels in more depth, but essentially kernels are a means of "transforming" the given data to be more linearly separable — they compute inner products in a higher-dimensional feature space without ever constructing that space explicitly.

Specifically in the context of SVMs, there are several different types of kernels that are commonly utilized:

Polynomial Kernel
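In its usual form, with a constant offset c ≥ 0 and degree d:

```latex
K(x, y) \;=\; (x^\top y + c)^d
```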

Gaussian Kernel
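In its usual form, with bandwidth parameter sigma:

```latex
K(x, y) \;=\; \exp\!\left(-\frac{\|x - y\|_2^2}{2\sigma^2}\right)
```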

Radial Basis Function (RBF) Kernel
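In its usual form, K(x, y) = exp(−γ‖x − y‖₂²); note that the Gaussian kernel above is the special case γ = 1/(2σ²). As a quick illustration, here is a minimal sketch of all three kernels in plain Python (the function names and default hyperparameter values are my own choices, not from the original post); each one replaces the inner product x_i^T x_j in the dual objective:

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))

def polynomial_kernel(x, y, c=1.0, d=2):
    """K(x, y) = (x . y + c)^d, with offset c and degree d."""
    return (dot(x, y) + c) ** d

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma = 1/(2 sigma^2) recovers the Gaussian."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [3.0, 4.0]
print(polynomial_kernel(x, y))   # (11 + 1)^2 = 144.0
print(gaussian_kernel(x, x))     # identical points -> 1.0
```

In a kernelized SVM, the dual objective's x_i^T x_j terms are simply replaced by K(x_i, x_j), so the optimization machinery above carries over unchanged.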

Next Time

Next time we will code up a binary SVM from scratch, train and test it on the University of Wisconsin Breast Cancer Dataset, and perform an analysis of the results!

