Support Vector Machines

Sourodip Kundu
Published in Analytics Vidhya
Oct 23, 2019 · 8 min read

Intuition behind SVM

In machine learning, support vector machines are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training examples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on the side of the gap on which they fall. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.
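As a quick illustration of the kernel trick in practice, here is a minimal scikit-learn sketch (the make_circles dataset and the parameters are illustrative choices of mine, not from the original post): a linear SVM cannot separate two concentric rings, while an RBF-kernel SVM implicitly maps them into a higher-dimensional space where a separating hyperplane exists.

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# two concentric rings: not separable by any straight line in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# a linear SVM struggles, while the RBF kernel implicitly works in a higher-dimensional feature space
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)
rbf_acc = SVC(kernel='rbf').fit(X, y).score(X, y)
print(linear_acc, rbf_acc)  # the RBF-kernel model should score close to 1.0 here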

Cost Function

An alternative view of logistic regression

  • Start with logistic regression, see how we can modify it to get the SVM

— As before, the logistic regression hypothesis is hθ(x) = g(θ^T x) = 1 / (1 + e^(−θ^T x))

— And the sigmoid activation function g(z) = 1 / (1 + e^(−z)) is an S-shaped curve that rises from 0 towards 1 as z increases

— In order to keep the math compact, we write z = θ^T x

  • What do we want logistic regression to do?

— We have an example where y = 1

> Then we hope hθ(x) is close to 1

> With hθ(x) close to 1, (θ^T x) must be much larger than 0


— Similarly, when y = 0

  • Then we hope hθ(x) is close to 0
  • With hθ(x) close to 0, (θ^T x) must be much less than 0
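To make this concrete, here is a minimal NumPy sketch of the hypothesis (the function names and example numbers are illustrative assumptions): for a large positive z = θ^T x the hypothesis is close to 1, and for a very negative z it is close to 0.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)): approaches 1 for large positive z, 0 for very negative z
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    # logistic regression hypothesis h_theta(x) = g(theta^T x)
    return sigmoid(np.dot(theta, x))

theta = np.array([1.0, 2.0])
print(h(theta, np.array([3.0, 4.0])))    # z = 11  -> close to 1, so we predict y = 1
print(h(theta, np.array([-3.0, -4.0])))  # z = -11 -> close to 0, so we predict y = 0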

This is the classical view of logistic regression

— Let’s consider another way of thinking about the problem

An alternative view of logistic regression

— If you look at the cost function, each example contributes a term like the one below to the overall cost function

−( y log(hθ(x)) + (1 − y) log(1 − hθ(x)) )

The overall cost function of Logistic Regression: J(θ) = −(1/m) ∑i [ y(i) log(hθ(x(i))) + (1 − y(i)) log(1 − hθ(x(i))) ]
  • For the overall cost function, we sum over all the training examples using the above function and have a 1/m term
  • If you then plug in the hypothesis definition (hθ(x)), you get an expanded cost function equation
  • So each training example contributes that term to the cost function for logistic regression
  • If y = 1 then only the first term in the objective matters

— — — If we plot the functions vs. z we get the following graph

  • This plot shows the cost contribution of an example when y = 1 given z

— — So if z is big, the cost is low — this is good!

— — But if z is 0 or negative the cost contribution is high

— — This is why, when logistic regression sees a positive example, it tries to set θ^T x to be a very large value

  • If y = 0 then only the second term matters

— — — We can again plot it and get a similar graph

  • This plot shows the cost contribution of an example when y = 0 given z

— Same deal, if z is very negative then the cost is low

— But if z is large then the cost is massive
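To reproduce the two plots described above, here is a short matplotlib sketch (my own, not the article's original figures) of the per-example cost as a function of z for y = 1 and y = 0.

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-4, 4, 200)
g = 1.0 / (1.0 + np.exp(-z))   # sigmoid of z = theta^T x

cost_y1 = -np.log(g)           # cost contribution of an example with y = 1
cost_y0 = -np.log(1.0 - g)     # cost contribution of an example with y = 0

plt.plot(z, cost_y1, label='y = 1: -log(h)')
plt.plot(z, cost_y0, label='y = 0: -log(1 - h)')
plt.xlabel('z')
plt.ylabel('cost contribution')
plt.legend()
plt.show()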

SVM cost functions from logistic regression cost functions

To build an SVM we must redefine our cost functions

  • When y = 1

— — Take y = 1 function and create a new cost function

— — Instead of the curved line, create two straight line segments which act as an approximation to the logistic regression y = 1 cost function

  • Take the point z = 1 on the z-axis

— — Flat from 1 onwards

— — Grows linearly as z drops below 1

  • This means we have two straight line segments

— — A flat segment where the cost is 0 (for z ≥ 1)

— — A straight line that grows as z falls below 1

  • So this is the new y=1 cost function

— — Gives the SVM a computational advantage and an easier optimization problem

— — We call this function cost1(z)

  • Similarly

— — When y = 0, do the equivalent with the y = 0 cost plot

— — The result is flat (zero cost) for z ≤ −1 and grows linearly as z rises above −1; we call this function cost0(z)

The complete SVM cost function is then: minimize over θ the quantity C ∑i [ y(i) cost1(θ^T x(i)) + (1 − y(i)) cost0(θ^T x(i)) ] + (1/2) ∑j θj²
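As a sketch of what cost1, cost0, and the complete objective look like in code, assuming the common hinge-style approximations cost1(z) = max(0, 1 − z) and cost0(z) = max(0, 1 + z) (the exact slope of the straight segments is not fixed by the discussion above, so treat these as assumptions):

import numpy as np

def cost1(z):
    # flat (zero) for z >= 1, grows linearly as z drops below 1
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # flat (zero) for z <= -1, grows linearly as z rises above -1
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    # C * sum_i [ y_i*cost1(theta^T x_i) + (1 - y_i)*cost0(theta^T x_i) ] + 1/2 * sum_j theta_j^2
    z = X.dot(theta)
    A = np.sum(y * cost1(z) + (1 - y) * cost0(z))
    return C * A + 0.5 * np.sum(theta ** 2)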

Influence of C in SVM

C is essentially a regularization parameter, which controls the trade-off between achieving a low error on the training data and minimizing the norm of the weights. It is analogous to the ridge parameter in ridge regression (in fact in practice there is little difference in performance or theory between linear SVMs and ridge regression, so I generally use the latter — or kernel ridge regression if there are more attributes than observations).

Tuning C correctly is a vital step in best practice in the use of SVMs, as structural risk minimization (the key principle behind the basic approach) is partly implemented via the tuning of C. The parameter C enforces an upper bound on the norm of the weights, which means that there is a nested set of hypothesis classes indexed by C. As we increase C, we increase the complexity of the hypothesis class (if we increase C slightly, we can still form all of the linear models that we could before and also some that we couldn’t before we increased the upper bound on the allowable norm of the weights). So as well as implementing SRM via maximum margin classification, it is also implemented by limiting the complexity of the hypothesis class via controlling C.
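To see this effect numerically, one can fit a linear SVC for several values of C on some made-up data and look at the norm of the weights (for a linear SVM the margin width is 2 / ||θ||, so a larger norm means a narrower margin); the data and the grid of C values below are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# two slightly overlapping blobs, one per class
X = np.vstack([rng.randn(50, 2) - [2, 2], rng.randn(50, 2) + [2, 2]])
y = np.array([0] * 50 + [1] * 50)

for C in [0.01, 1, 100]:
    clf = SVC(kernel='linear', C=C).fit(X, y)
    w_norm = np.linalg.norm(clf.coef_)
    print(C, w_norm, 2.0 / w_norm)  # larger C typically allows a larger ||w||, i.e. a narrower margin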

Decision Boundary And The C Parameter

Let's say we have some data consisting of two classes, as shown in the picture. We have multiple options for the decision boundary: the one shown in (Img 1) separates our data, but it does not look quite right because there is not enough gap between the training data and the decision boundary; the one in (Img 2) is, I think, a much better decision boundary than (Img 1) because it keeps a good gap between the training data points and the decision boundary. So why will the SVM try to find the decision boundary shown in (Img 2)?

We need to understand that there are two parts to the cost function: the first, C·A, is the cost associated with the training data labels, and the second is the regularization term (1/2) ∑ θj².

Where A is the sum of the per-example costs, A = ∑i [ y(i) cost1(θ^T x(i)) + (1 − y(i)) cost0(θ^T x(i)) ]

So now the question is: how will we get a decision boundary like (Img 2) instead of (Img 1)?

What we are trying to do is make θ^T x ≥ 1 whenever y = 1, i.e. push the value of θ^T x above a certain number. There is another way to look at θ^T x: if d is the (signed) distance of a training point from the decision boundary, measured along the direction of θ, then θ^T x can be written as the product below.

θ^T x = d · ||θ||, where ||θ|| = √(∑ θj²)

So in the case of y = 1 we need θ^T x ≥ 1, which on its own would push towards larger θ values; but the regularization term keeps θ small, so for the product d · ||θ|| to still stay ≥ 1 we have to maintain a larger value of d, i.e. a larger distance from the decision boundary. Exactly the same argument applies to θ^T x ≤ −1 for the other class. So basically what I am trying to say here is that we want the distance to be as large as possible, because the larger the distance, the smaller the value of ||θ|| we can reach. That is why we end up with the decision boundary shown in (Img 2) instead of (Img 1).
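A tiny numeric check of θ^T x = d · ||θ|| (the numbers are arbitrary, chosen only for illustration):

import numpy as np

theta = np.array([3.0, 4.0])             # ||theta|| = 5
x = np.array([2.0, 1.0])

z = np.dot(theta, x)                     # theta^T x = 10
d = z / np.linalg.norm(theta)            # signed distance of x from the boundary theta^T x = 0
print(z, d, d * np.linalg.norm(theta))   # 10.0, 2.0, 10.0 -> theta^T x = d * ||theta||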

N.B.

A higher value of C can lead to overfitting and a lower value of C can lead to underfitting, so we have to choose an optimal value of C.
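In practice, a common way to pick C is cross-validation, for example with scikit-learn's GridSearchCV; the toy data and the parameter grid below are assumptions for illustration.

import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) - [2, 2], rng.randn(50, 2) + [2, 2]])
y = np.array([0] * 50 + [1] * 50)

# search a coarse logarithmic grid of C values with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10, 100, 1000]}
search = GridSearchCV(SVC(kernel='linear'), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)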

Using SVM from Sklearn

Applying SVM to some dummy data

import numpy as np
from sklearn.svm import SVC
import matplotlib.pyplot as plt
# X has two features, f1 and f2
X = np.array([[1,1],[2,1],[1,2],[1.5,1.5],[3,4],[2,5],[4,3],[7,2],[3,5],[2,6],[6,2],[3,4],[4,4]])
# y holds the class labels
y = [0,1,0,0,1,1,1,1,1,1,1,1,1]
X_x1 = X[:,0]
X_x2 = X[:,1]
plt.scatter(X_x1, X_x2, c = y)
plt.show()
# fit a linear SVM with a very large C (close to a hard margin)
svcLinear = SVC(kernel='linear', C=100000).fit(X, y)
print(svcLinear.coef_, svcLinear.intercept_)
# take two values of f1: 0 and 5
# we will find the value of f2 from f1 using the equation theta1*f1 + theta2*f2 + theta3 = 0
# where theta1 and theta2 come from coef_ and theta3 is the intercept
# we will get two points from the f1 values and join them to draw the decision boundary
x1 = np.array([0, 5])
x2 = -1 * (svcLinear.intercept_ + svcLinear.coef_[0][0] * x1) / svcLinear.coef_[0][1]
plt.plot(x1, x2)
plt.scatter(X_x1, X_x2, c = y)
plt.axis([0, 8, 0, 8])
plt.show()
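Once fitted, the same model can classify new points; a short usage sketch with arbitrary test points:

# predict the class of a couple of new, unseen points with the fitted model
print(svcLinear.predict(np.array([[1, 1.2], [5, 5]])))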

To see more code on SVM, check my GitHub.

To know more about SVM, click the link: Finding Non-Linear Decision Boundary in SVM

Reference:

  1. Wikipedia
  2. Stanford Machine Learning course by Professor Andrew Ng
  3. Stack Overflow
