Step-by-Step Mathematical Formulation of the Hard Margin SVM

Ahmed Mirghani
Mar 19, 2023


Support Vector Machine (SVM) is a non-parametric classification algorithm based on a geometric representation of binary, linearly separable data. Although there are several extensions of SVM that handle multiclass data, overlapping data, and data that is not linearly separable, this article treats only the basic setup of the problem, in which there are two groups that data could be classified into. Moreover, these two classes should not overlap geometrically, and there should be a clear distance between the classes (a margin) that contains no data points from either of them (this is why the basic form of SVM is called the hard margin SVM). A hyper-plane (a generalization of a line: a line in 2D, a plane in 3D and a hyper-plane from 4D onwards) can then be drawn in the middle of this margin to form a “classification boundary” between the two classes of data. The SVM algorithm defines a systematic approach to approximate the parameters that define this hyper-plane.

Objectives

This article aims to provide a step by step mathematical formulation of the hard margin SVM from the ground up, leaving the extensions of SVM to be discussed in upcoming articles.

Data definition

First things first: the data used in this demonstration has only 2 features. This setup enables us to visualize it in 2D, which makes communicating the geometrical concepts that SVM relies on much easier; all the concepts in 2D then generalize well to higher dimensions. The data may look like the following:

| 𝑥1 | 𝑥2 | 𝑦 |
| --- | --- | --- |
| 𝑥_11 | 𝑥_12 | 𝑦_1 |
| 𝑥_21 | 𝑥_22 | 𝑦_2 |
| … | … | … |
| 𝑥_n1 | 𝑥_n2 | 𝑦_n |

Tabular data, consisting of three columns and n rows

Each row in the table above consists of two components: the first component is a point 𝑋_𝑖 in the space 𝑅² (data from the columns 𝑥1, 𝑥2) such that 𝑋_𝑖 = (𝑥_𝑖1, 𝑥_𝑖2), ∀i = 1, 2, …, n, and the second component is the group 𝑦_𝑖 to which 𝑋_𝑖 belongs. In our basic case there are only two groups; they could be represented numerically by any two whole numbers, but for convenience let 𝑦_𝑖 ∈ {−1, 1}.
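For readers who want to experiment along, a minimal sketch of such a dataset in NumPy could look as follows (the cluster centres, spread and number of rows are illustrative choices, picked only so that the two groups come out linearly separable):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20  # illustrative number of rows

# Group labelled +1: a cloud of points around (2, 2).
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(n // 2, 2))
# Group labelled -1: a cloud of points around (-2, -2).
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(n // 2, 2))

X = np.vstack([X_pos, X_neg])                        # shape (n, 2): columns x1, x2
y = np.hstack([np.ones(n // 2), -np.ones(n // 2)])   # labels y_i in {-1, +1}
```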

The data, when plotted in 2D, should be geometrically linearly separable; therefore, it would look something like the following plot.

Linearly separable data in 2D

The general form of line equation

As aforementioned, the goal of SVM is to figure out a line that separates the two groups of data in the best way. We will elaborate on what “best” means for SVM in the upcoming sections.
Now, it is important to note that the equation used to describe the separation line we are after is not the commonly used slope-intercept form (𝑦 = 𝑚𝑥 + 𝑐). The reason is that the slope-intercept form fails to describe a vertical line, which might be a candidate for the decision boundary in SVM.

Thus, we instead use the general form of the line equation, 𝐴𝑥 + 𝐵𝑦 + 𝐶 = 0, which can describe a vertical line 𝑥 = 𝑎 by setting 𝐴 = 1, 𝐵 = 0 and 𝐶 = −𝑎. The general form has another helpful characteristic: it tells whether a point falls above or below a line. For example, if 𝐿 is a line with the general form 𝑎1𝑥 + 𝑏1𝑦 + 𝑐1 = 0 and 𝑝 = (𝑥1, 𝑦1) is a point, then by plugging 𝑝 into 𝐿 we get a value 𝑣 = 𝑎1𝑥1 + 𝑏1𝑦1 + 𝑐1. If 𝑣 > 0, the point 𝑝 is above the line 𝐿; if 𝑣 < 0, 𝑝 is below 𝐿; finally, if 𝑣 = 0, 𝑝 is on 𝐿.
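A small Python sketch of this sign test (the function name and the sample line and point are made up for illustration):

```python
def side_of_line(a1, b1, c1, x1, y1):
    """Tell where the point (x1, y1) lies relative to the line a1*x + b1*y + c1 = 0."""
    v = a1 * x1 + b1 * y1 + c1
    if v > 0:
        return "above"
    if v < 0:
        return "below"
    return "on the line"

# The point (0, 1) relative to the line y - x = 0 (a1 = -1, b1 = 1, c1 = 0):
print(side_of_line(-1, 1, 0, 0, 1))  # -> "above"
```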
Now let’s redraw the previous figure and, by intuition, add a line between the two groups of data; label the points in the group above the line as (+) and the points in the group below the line as (−). A line (𝑤_1𝑥_1 + 𝑤_2𝑥_2 + 𝑏 = 0) could be expressed in vector notation as 𝑤𝑥 + 𝑏 = 0, where 𝑤 = (𝑤_1, 𝑤_2) and 𝑥 = (𝑥_1, 𝑥_2).

classification boundary

SVM approach

In the introduction, we mentioned that the hard margin SVM finds a clear margin that includes no data points and situates a separation line in the middle of it. The question is: how do we define this margin? Several margins could be formed, and therefore there are many candidates for the classification boundary.

Separation line candidates

The SVM approach involves finding two parallel lines, each of which goes through at least one edge point of one of the two groups of data. The best pair of lines is the one with the largest distance between them; this distance is called “the margin”. The separation line is then situated in the middle between these two parallel lines and is parallel to them. This approach is called the widest street approach, and the edge points of each data group that the parallel lines go through are called the “support vectors”.

Support vectors

As we already mentioned, the hard margin SVM does not allow points inside the margin. This implies that, for each data point 𝑋_𝑖, not only is 𝑤𝑋_𝑖 + 𝑏 > 0 if 𝑦_𝑖 = 1 and 𝑤𝑋_𝑖 + 𝑏 < 0 if 𝑦_𝑖 = −1, but there should also be a threshold 𝑎 > 0 such that 𝑤𝑋_𝑖 + 𝑏 ≥ 𝑎 if 𝑦_𝑖 = 1 and 𝑤𝑋_𝑖 + 𝑏 ≤ −𝑎 if 𝑦_𝑖 = −1.
We can set 𝑎 = 1 without loss of generality, because 𝑤 and 𝑏 can always be rescaled (divided by 𝑎) so that the constraints become 𝑤𝑋_𝑖 + 𝑏 ≥ 1 if 𝑦_𝑖 = 1 and 𝑤𝑋_𝑖 + 𝑏 ≤ −1 if 𝑦_𝑖 = −1. Therefore, the upper line is 𝑤𝑥 + 𝑏 = 1 and the lower line is 𝑤𝑥 + 𝑏 = −1.
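As a tiny sanity check with made-up numbers: if 𝑤 = (1, 1) and 𝑏 = −3, then the point (3, 2) with 𝑦_𝑖 = 1 gives 𝑤𝑋_𝑖 + 𝑏 = 2 ≥ 1, the point (0, 1) with 𝑦_𝑖 = −1 gives 𝑤𝑋_𝑖 + 𝑏 = −2 ≤ −1, and the point (1.5, 1.5) gives 0, i.e. it would sit right on the classification boundary inside the margin, which the hard margin setup does not allow.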

Margin boundaries

Margin width formulation

If we have a formula that describes the margin width in terms of 𝑤 and 𝑏, then we can maximize it and derive the 𝑤 and 𝑏 that define two parallel lines 𝑤𝑥 + 𝑏 = 1 and 𝑤𝑥 + 𝑏 = −1, each of which goes through one or more edge points of its corresponding class of data, with the maximum width between them. To get to this point, we have to consider the following.

1- 𝑤 is perpendicular to the classification boundary 𝑤𝑥 + 𝑏 = 0. This can be proved by taking two points 𝑥_0, 𝑥_1 on 𝑤𝑥 + 𝑏 = 0:

𝑤𝑥_0 + 𝑏 = 0 and 𝑤𝑥_1 + 𝑏 = 0 ⇒ by subtraction: 𝑤(𝑥_0 − 𝑥_1) = 0 ⇒ 𝑤 ⊥ (𝑥_0 − 𝑥_1)

Because 𝑥_0, 𝑥_1 are arbitrary points on 𝑤𝑥 + 𝑏 = 0, the vector 𝑥_0 − 𝑥_1 can point in any direction along the line, hence 𝑤 ⊥ 𝑤𝑥 + 𝑏 = 0.

w perpendicular to wx+b=0

We can use the fact that 𝑤 is perpendicular to 𝑤𝑥 + 𝑏 = 0 to calculate the distance from the line 𝑤𝑥 + 𝑏 = 0 to the line 𝑤𝑥 + 𝑏 = 1 (or to the line 𝑤𝑥 + 𝑏 = −1) as a multiple of the unit vector of 𝑤 (𝑤/||𝑤||), starting from a point 𝑥* on the line 𝑤𝑥 + 𝑏 = 0. The distance from 𝑤𝑥 + 𝑏 = 0 to 𝑤𝑥 + 𝑏 = 1 equals the distance to 𝑤𝑥 + 𝑏 = −1, because 𝑤𝑥 + 𝑏 = 0 is situated exactly in the middle between them. Let’s calculate the distance from 𝑤𝑥 + 𝑏 = 0 to 𝑤𝑥 + 𝑏 = 1 and denote this distance as 𝑘. The corresponding point 𝑥+ on the line 𝑤𝑥 + 𝑏 = 1, reached by moving from 𝑥* along the direction of 𝑤, could be calculated as:

𝑥+ = 𝑥* + 𝑘·(𝑤/||𝑤||)

Plugging 𝑥+ into the equation of the upper line: 𝑤(𝑥* + 𝑘·𝑤/||𝑤||) + 𝑏 = 1 ⇒ (𝑤𝑥* + 𝑏) + 𝑘·(𝑤·𝑤)/||𝑤|| = 1 ⇒ 0 + 𝑘||𝑤|| = 1 ⇒ 𝑘 = 1/||𝑤||.

SVM full picture

We calculated the distance 𝑘 = 1/||𝑤|| from 𝑤𝑥 + 𝑏 = 0 to 𝑤𝑥 + 𝑏 = 1, which is half of the margin width; the full margin is therefore:

margin = 2𝑘 = 2/||𝑤||

margin formula
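As a quick numerical example with a made-up weight vector: if 𝑤 = (3, 4), then ||𝑤|| = 5 and the margin between the two boundary lines is 2/5 = 0.4; the smaller ||𝑤|| is, the wider the margin.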

So far, we have formulated a measure for the margin, and we have to maximize it, as the SVM approach instructs:

maximize 2/||𝑤|| subject to 𝑤𝑥_𝑖 + 𝑏 ≥ 1 if 𝑦_𝑖 = 1 and 𝑤𝑥_𝑖 + 𝑏 ≤ −1 if 𝑦_𝑖 = −1, ∀i = 1, …, n

The two constraints above could be merged into one as follows:

𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) ≥ 1, ∀i = 1, …, n

Maximizing 2/||𝑤|| is equivalent to minimizing ||𝑤||, and we choose to minimize ½||𝑤||² for mathematical convenience:

minimize ½||𝑤||² subject to 𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) ≥ 1, ∀i = 1, …, n

Lagrange multipliers

The Lagrangian multipliers method is used to combine a minimization problem and its constraints into one function called the cost function.
The general form of the Lagrangian multipliers method is:

minimize 𝑓(𝑤) s.t. 𝑔(𝑤) ≤ 0, ℎ(𝑤) = 0. Then:

𝐿(𝑤, 𝛼, 𝛽) = 𝑓(𝑤) + 𝛼𝑔(𝑤) + 𝛽ℎ(𝑤), with the multiplier 𝛼 ≥ 0

There is no equality constraint in the setup of the hard margin SVM, and the inequality constraints are written as 𝑔_𝑖(𝑤, 𝑏) = 1 − 𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) ≤ 0. Therefore:

𝐽(𝑤, 𝑏, 𝛼) = ½||𝑤||² + Σ_𝑖 𝛼_𝑖(1 − 𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏)), with 𝛼_𝑖 ≥ 0, ∀i

Now there is one cost function 𝐽(𝑤, 𝑏, 𝛼) that combines the objective and the constraints. But there is a lemma from the Lagrangian multipliers method we have to go through; it implies the following:

max_{𝛼≥0} 𝐽(𝑤, 𝑏, 𝛼) = ½||𝑤||² if all the constraints 𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) ≥ 1 are satisfied, and +∞ otherwise

⇒ Therefore, the primal problem from the Lagrangian multipliers method can be expressed as follows:

p* = min_{(𝑤,𝑏)} max_{𝛼≥0} 𝐽(𝑤, 𝑏, 𝛼)

primal Lagrangian problem

The Dual problem

The minimization/maximization operations could be performed in the opposite order (first minimize 𝐽(𝑤, 𝑏, 𝛼) w.r.t. (𝑤, 𝑏), then maximize it w.r.t. 𝛼) and still reach the same solution of the problem; this procedure is supported by two theorems, the weak duality and the strong duality theorems. After flipping the operations, the problem is called the dual problem:

d* = max_{𝛼≥0} min_{(𝑤,𝑏)} 𝐽(𝑤, 𝑏, 𝛼)

The dual problem
  • The weak duality theorem implies d* ≤ p*
  • The strong duality theorem implies that d* = p* if and only if there exists a saddle point of 𝐽(𝑤, 𝑏, 𝛼), and this saddle point satisfies the Karush-Kuhn-Tucker (KKT) conditions, which consist of the following terms:

∂𝐽/∂𝑤 = 0 and ∂𝐽/∂𝑏 = 0 (stationarity)
𝛼_𝑖 ≥ 0, ∀i (dual feasibility)
𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) − 1 ≥ 0, ∀i (primal feasibility)
𝛼_𝑖(𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) − 1) = 0, ∀i (complementary slackness)

Karush-Kuhn-Tucker (KKT) conditions

Let’s differentiate 𝐽 w.r.t. 𝑤 and 𝑏 and set the derivatives to zero (the stationarity terms of the KKT conditions):

∂𝐽/∂𝑤 = 𝑤 − Σ_𝑖 𝛼_𝑖𝑦_𝑖𝑥_𝑖 = 0 ⇒ 𝑤 = Σ_𝑖 𝛼_𝑖𝑦_𝑖𝑥_𝑖
∂𝐽/∂𝑏 = −Σ_𝑖 𝛼_𝑖𝑦_𝑖 = 0 ⇒ Σ_𝑖 𝛼_𝑖𝑦_𝑖 = 0

Now expand the function 𝐽 and substitute the terms we got from the derivatives above. By substituting 𝑤 = Σ_𝑖 𝛼_𝑖𝑦_𝑖𝑥_𝑖 and using Σ_𝑖 𝛼_𝑖𝑦_𝑖 = 0, we get

𝐽(𝛼) = Σ_𝑖 𝛼_𝑖 − ½ Σ_𝑖 Σ_𝑗 𝛼_𝑖𝛼_𝑗𝑦_𝑖𝑦_𝑗(𝑥_𝑖·𝑥_𝑗)

Matrix notation

In matrix notation, with 𝑋 the n×2 matrix whose rows are the data points, 𝑦 the vector of labels and 𝛼 the vector of multipliers:

𝐽(𝛼) = 1ᵀ𝛼 − ½(𝛼 ⊙ 𝑦)ᵀ𝑋𝑋ᵀ(𝛼 ⊙ 𝑦)

where ⊙ is an element-wise product.

Now the dual problem is solved only for 𝛼:

maximize 𝐽(𝛼) subject to 𝛼_𝑖 ≥ 0, ∀i, and Σ_𝑖 𝛼_𝑖𝑦_𝑖 = 0

A solution to this maximization problem could be found using the gradient ascent method, or by turning it into a minimization problem and solving it with quadratic programming.
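As a minimal sketch of the second option, assuming NumPy and SciPy are available and reusing the illustrative arrays X and y from the earlier sketch (SciPy's general-purpose SLSQP solver stands in here for a dedicated quadratic programming solver):

```python
import numpy as np
from scipy.optimize import minimize

# K holds the inner products x_i . x_j; X is the (n, 2) data matrix, y the labels in {-1, +1}.
K = X @ X.T

def neg_dual(alpha):
    """Negative of the dual objective J(alpha), so maximization becomes minimization."""
    ay = alpha * y  # element-wise product (alpha ⊙ y)
    return 0.5 * ay @ K @ ay - alpha.sum()

res = minimize(
    neg_dual,
    x0=np.zeros(len(y)),
    method="SLSQP",
    bounds=[(0, None)] * len(y),                           # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x
```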

Recovering w

After solving for 𝛼, we can recover 𝑤 by using 𝑤 = Σ_𝑖 𝛼_𝑖𝑦_𝑖𝑥_𝑖.

Generally, there are only a few non-zero 𝛼’s, and they are related to the support vectors (remember that the support vectors are the points on the lines 𝑤𝑥 + 𝑏 = 1 and 𝑤𝑥 + 𝑏 = −1); this follows from the complementary slackness term of the KKT conditions, where 𝛼_𝑖(𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) − 1) = 0, ∀i.

Only the points on the lines 𝑤𝑥 + 𝑏 = 1 and 𝑤𝑥 + 𝑏 = −1 yield 𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) = 1, and from the KKT condition above, this is the only case where 𝛼_𝑖 can be non-zero. Therefore, 𝑤 is recovered solely from the support vectors and their related 𝛼’s.
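Continuing the sketch above, the recovery of 𝑤 takes one line; the bias 𝑏, which this article does not solve for explicitly, can then be read off any support vector, since 𝑦_𝑖(𝑤𝑥_𝑖 + 𝑏) = 1 there:

```python
# Recover w solely from the non-zero alphas: w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# Standard follow-up (not derived in the article): read b off the support vectors,
# where y_i (w . x_i + b) = 1  =>  b = y_i - w . x_i
support = alpha > 1e-6
b = np.mean(y[support] - X[support] @ w)
```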

In this article, we formulated the basic case of SVM (hard margin SVM) mathematically. The formulation boiled down to a compact cost function written in matrix notation that could be used to calculate the values of 𝛼’s and w programmatically.

References:

1- MIT OCW lecture on SVM [video].

2- NPTEL online course, SVM: The Dual Formulation [video].

3- Kevin P. Murphy, Probabilistic Machine Learning: An Introduction [book].
