Implementation of Hard Margin SVM using Gradient Ascent from scratch, with examples [Python]

Ahmed Mirghani
6 min read · Mar 22, 2023



In the previous article, the mathematical foundation of the hard margin SVM was explained. At the end, the formulation boiled down to a compact cost function J(α) that is meant to be maximized.

In this article, the gradient ascent method is used as the maximization procedure: gradient ascent iteratively improves the approximation of the α values that yield the maximum value of the J function. The iterative procedure of the gradient ascent method is given by the following update rule:

The update rule above means that, for a fixed number of iterations n, at each iteration i the value of α is calculated as the sum of its previous value (from iteration i-1) and the rate of change of J(α) with respect to α, scaled by λ. λ is called the “learning rate” in the lingo of machine learning; it keeps each update dJ/dα from being inappropriately large so that the algorithm does not overshoot the optimal solution, and since the gradient shrinks as the optimum is approached, the step size shrinks with it.
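The update rule itself appeared as an image in the original post. Written out, and assuming the standard hard-margin dual objective J(α) derived in the previous article, it reads roughly as follows:

```latex
% Gradient ascent update for each dual variable alpha_k (a sketch, assuming
% the standard hard-margin dual
% J(alpha) = sum_k alpha_k - 1/2 sum_k sum_j alpha_k alpha_j y_k y_j (x_k . x_j)).
\alpha_k^{(i)} \;=\; \alpha_k^{(i-1)} \;+\; \lambda\,\frac{\partial J}{\partial \alpha_k},
\qquad
\frac{\partial J}{\partial \alpha_k}
  \;=\; 1 \;-\; y_k \sum_{j=1}^{m} \alpha_j^{(i-1)}\, y_j\, (x_j \cdot x_k)
```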

After the final values of the α’s are calculated, the vector w can be recovered by the following equation:

Remember that w and b are calculated using only the support vectors; the reason for this is explained in the section “recovering w” from the previous article.

It is enough to calculate the value of b using only one support vector, but it is commonly calculated as the average of the b values, each computed from one support vector. Let S be the set of all support vectors; then:

  • |S| is the number of elements in S.
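The two recovery equations appeared as images in the original post; in symbols, and using the standard hard-margin relations, they are:

```latex
% Recovering w and b from the support-vector set S (a sketch of the
% standard hard-margin formulas referenced in the text).
w \;=\; \sum_{s \in S} \alpha_s\, y_s\, x_s,
\qquad
b \;=\; \frac{1}{|S|} \sum_{s \in S} \left( y_s - w \cdot x_s \right)
```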

Now, all the elements needed to build a hard margin SVM algorithm are ready. They are incorporated in the following code snippet, which provides an OOP model of the hard margin SVM in Python.
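The original snippet was embedded as a gist and is not reproduced here; the class below is a minimal sketch of what it might look like, so the constructor parameters (n_epochs, learning_rate, sv_threshold) and the simple clipped gradient-ascent loop are assumptions rather than the author's exact code.

```python
import numpy as np


class HardMarginSVM:
    """Minimal sketch of a hard margin SVM trained by gradient ascent on the dual."""

    def __init__(self, n_epochs=1000, learning_rate=0.001, sv_threshold=1e-5):
        self.n_epochs = n_epochs            # number of gradient-ascent iterations
        self.learning_rate = learning_rate  # lambda in the update rule
        self.sv_threshold = sv_threshold    # alphas above this are treated as support vectors
        self.w = None
        self.b = None
        self.alpha = None

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)      # labels expected in {-1, +1}
        n_samples = X.shape[0]

        # Precompute H[i, j] = y_i * y_j * (x_i . x_j)
        H = (y[:, None] * X) @ (y[:, None] * X).T

        # Gradient ascent on the dual objective J(alpha)
        self.alpha = np.zeros(n_samples)
        for _ in range(self.n_epochs):
            gradient = 1.0 - H @ self.alpha              # dJ/dalpha
            self.alpha += self.learning_rate * gradient
            self.alpha = np.clip(self.alpha, 0.0, None)  # enforce alpha_i >= 0

        # Support vectors are the points with non-negligible alpha
        sv = self.alpha > self.sv_threshold

        # Recover w and b using only the support vectors
        self.w = ((self.alpha[sv] * y[sv])[:, None] * X[sv]).sum(axis=0)
        self.b = np.mean(y[sv] - X[sv] @ self.w)
        return self

    def predict(self, X):
        return np.sign(np.asarray(X, dtype=float) @ self.w + self.b)

    def score(self, X, y):
        return np.mean(self.predict(X) == np.asarray(y))
```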

This code has been tried on two toy datasets: one synthetically generated using the function make_blobs from sklearn.datasets, and the other a customized Iris dataset. Let’s take a look at how the implementation above performed on both of them.

Synthetic Dataset

The function make_blobs has been used to generate data that consists of two features and one response variable. The response variable contains binary classes {0, 1} that are linearly separable and have no overlap between them; moreover, the classes have been converted to {-1, 1} to comply with the SVM algorithm setup.
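The data-generation code is not shown above; a possible version, in which the exact make_blobs arguments (n_samples, centers, cluster_std, random_state) are assumptions, looks like this:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Two well-separated blobs in 2D; the argument values are assumptions.
X, y = make_blobs(n_samples=200, centers=2, n_features=2,
                  cluster_std=0.8, random_state=42)

# Convert labels {0, 1} -> {-1, 1} to comply with the SVM setup
y = np.where(y == 0, -1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```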

The visualization of the data above looks as follows:

After several trials with the number of epochs (iterations), it was decided to instantiate the HardMarginSVM class with 5000 epochs. Although the number of epochs looks relatively high, the model ran quite fast; it took 0.06 seconds to fit the training data.

The model’s prediction score on the test data is 100% (it is a simple dataset!).
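A usage sketch consistent with the description above (it assumes the HardMarginSVM sketch and the train/test split from the earlier snippets; the 5000 epochs match the article, while the learning rate is an assumption):

```python
import time

# Fit the hard margin SVM and time the training, then score on the test split
svm = HardMarginSVM(n_epochs=5000, learning_rate=0.001)

start = time.time()
svm.fit(X_train, y_train)
print(f"fit time: {time.time() - start:.2f} s")

print(f"test accuracy: {svm.score(X_test, y_test):.2%}")
```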

After the instance svm has been fitted to the training data, the values of w and b are used to visualize the classification boundary. To make this visualization, remember that the vertical axis is represented by the x2 variable; therefore, we need to extract a formula for x2 in terms of x1, the variable on the horizontal axis.

Expanding wx+b=0, given that the vector w = [w1, w2] and the data has two features x = [x1, x2], gives w1·x1 + w2·x2 + b = 0, which rearranges to x2 = -(w1·x1 + b) / w2.

Following the same method as for wx+b=0, x2 for wx+b=1 and wx+b=-1 is x2 = (1 - w1·x1 - b) / w2 and x2 = (-1 - w1·x1 - b) / w2, respectively.

The three variations of the x2 equation above can be combined into one function; let’s call it get_x2:
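The author's exact implementation is not reproduced here; a sketch of get_x2 might look like this (the parameter names are assumptions):

```python
def get_x2(x1, w, b, offset=0):
    """Solve w1*x1 + w2*x2 + b = offset for x2."""
    w1, w2 = w
    return (offset - w1 * x1 - b) / w2
```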

The offset parameter 👆 can be 0, 1, or -1.

Now, by knowing the min and max values of the data in the x1 axis, the corresponding values in the x2 axis are calculated using the function get_x2. Then, for each offset, the corresponding line is constructed by connecting the two points (min_x1, get_x2(min_x1, *args)), (max_x1, get_x2(max_x1, *args)).
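A plotting sketch along these lines (it assumes matplotlib and the svm, X, y, and get_x2 objects defined in the earlier snippets):

```python
import matplotlib.pyplot as plt

# Scatter the points, then draw the decision boundary (offset=0) and the
# two margins (offset=+1/-1) by connecting the endpoints at min_x1 and max_x1
min_x1, max_x1 = X[:, 0].min(), X[:, 0].max()

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="bwr", alpha=0.7)
for offset, style in [(0, "-"), (1, "--"), (-1, "--")]:
    plt.plot([min_x1, max_x1],
             [get_x2(min_x1, svm.w, svm.b, offset),
              get_x2(max_x1, svm.w, svm.b, offset)],
             "k" + style)
plt.xlabel("x1")
plt.ylabel("x2")
plt.show()
```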

Customized Iris Dataset

The second example uses a real dataset, so that when the hard margin SVM is fitted to it and then plotted, the variables on the horizontal and vertical axes carry semantics, not just x1 and x2 as in the case of the synthetic data. In this example, the famous historical Iris dataset is used; it comes packaged within the Scikit-Learn library and is provided by the function sklearn.datasets.load_iris(*args, **kwargs). Here are five rows sampled randomly from it, along with the code that produces them:
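The sampling code is not reproduced above; one way to load the data into a DataFrame and sample five rows (the random_state is an assumption) is:

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load Iris as a DataFrame: four feature columns plus a 'target' column
iris = load_iris(as_frame=True)
df = iris.frame
print(df.sample(5, random_state=0))
```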

The Iris dataset has three flower classes, ‘Setosa’, ‘Versicolor’, and ‘Virginica’, which are encoded in the target column as 0, 1, and 2 respectively. To visualize the data in 2D, one of the classes should be removed from the target column, and only two features should be kept. To make a reasonable choice about which features and classes to keep, let’s visualize the relationships between the features and see how they influence the clustering of the groups.
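The original plot is not reproduced here; one way to get such a pairwise view, assuming seaborn is available and the df DataFrame from the previous snippet, is:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise feature relationships, colored by class, to help decide which
# features and classes to keep
sns.pairplot(df, hue="target")
plt.show()
```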

Based on the plot above, class 1, which is “Versicolor”, is dropped, along with the features “sepal length (cm)” and “petal width (cm)”.

Now the data has two features, “sepal width (cm)” and “petal length (cm)”, and two classes in the target column, “Setosa” and “Virginica”, encoded as 0 and 2 respectively. They should be converted to -1 and 1 to comply with the SVM setup: 0 → -1 and 2 → 1.
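A sketch of this customization step, assuming the df DataFrame from the loading snippet (the variable names X_iris and y_iris are illustrative):

```python
import numpy as np

# Drop class 1 ("Versicolor"), keep the two chosen features,
# and re-encode the labels 0 -> -1 and 2 -> 1
df2 = df[df["target"] != 1]
X_iris = df2[["sepal width (cm)", "petal length (cm)"]].to_numpy()
y_iris = np.where(df2["target"].to_numpy() == 0, -1, 1)
```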

The visualization of the data after customization looks as follows:

From this point on, we can follow the same steps as in the previous example (a brief sketch follows the list below):

  • Instantiate a HardMarginSVM instance
  • Fit it on the data
  • Visualize the classification boundary
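A brief sketch of these steps on the customized Iris data (it reuses the HardMarginSVM class, get_x2, X_iris, and y_iris from the earlier snippets; the epoch count and learning rate are assumptions):

```python
import matplotlib.pyplot as plt

# Fit on the customized Iris data and draw the boundary and margins
svm_iris = HardMarginSVM(n_epochs=5000, learning_rate=0.001)
svm_iris.fit(X_iris, y_iris)

min_x1, max_x1 = X_iris[:, 0].min(), X_iris[:, 0].max()
plt.scatter(X_iris[:, 0], X_iris[:, 1], c=y_iris, cmap="bwr", alpha=0.7)
for offset, style in [(0, "-"), (1, "--"), (-1, "--")]:
    plt.plot([min_x1, max_x1],
             [get_x2(min_x1, svm_iris.w, svm_iris.b, offset),
              get_x2(max_x1, svm_iris.w, svm_iris.b, offset)],
             "k" + style)
plt.xlabel("sepal width (cm)")
plt.ylabel("petal length (cm)")
plt.show()
```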

The resulting visualization looks as follows:

Conclusion:

There are many extensions built on top of the hard margin version of the SVM algorithm. The most prominent ones are the soft margin SVM, which accounts for overlap between the data classes, and the kernelized SVM, which accounts for non-linear separability; both will be topics of upcoming articles.
