Support Vector Classifier and Support Vector Machine

Sweta Garai


The Support Vector Machine was developed in the 1990s and has since gained popularity as one of the best ‘out of the box’ classifiers. SVM is a computation-friendly modelling technique that is widely used in machine learning to predict categorical outcomes. SVM draws its foundation from the ‘Maximal Margin Classifier’.

Maximal Margin Classifier

Let me first introduce hyperplanes, which play a major role in maximal margin classifiers.

Hyperplanes: A hyperplane in a p-dimensional space is a flat affine subspace of dimension p−1, e.g. a line is a hyperplane in two dimensions.

A hyperplane separates a p-dimensional space into two sections. The equation of a hyperplane in p-dimensional space is,

β0+β1X1+β2X2⋯+βpXp =0

If x = (x1, x2, …, xp) satisfies β0 + β1x1 + β2x2 + ⋯ + βpxp > 0, then x lies on one side of the hyperplane; if the expression is < 0, it lies on the other side.

Let us consider an n×p data matrix whose rows are the n observations x1, …, xn and whose columns are the p features. Each observation xi = (xi1, xi2, …, xip) comes with a class label yi that falls into one of two categories, yi ∈ {−1, 1}.

By separating the feature space with a hyperplane in this way, the class of a new observation can be predicted from the side on which it falls.

credit: Monkeylearn
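To make the side-of-hyperplane rule concrete, here is a minimal NumPy sketch. The coefficients and test points are made up purely for illustration; in practice the β's are learned from the training data.

```python
# A minimal sketch of classifying points by which side of a hyperplane
# they fall on. beta0 and beta are illustrative values, not fitted ones.
import numpy as np

beta0 = -1.0                      # intercept (assumed for illustration)
beta = np.array([2.0, 3.0])       # beta1, beta2 for a 2-D feature space

X = np.array([[1.0, 1.0],         # a few test points
              [0.0, 0.0],
              [-1.0, 0.5]])

# Evaluate beta0 + beta1*x1 + beta2*x2 for each point; the sign tells us
# which side of the hyperplane (here, a line) the point lies on.
scores = beta0 + X @ beta
labels = np.where(scores > 0, 1, -1)
print(scores)   # [ 4.  -1.  -1.5]
print(labels)   # [ 1 -1 -1]
```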

A test observation is assigned a class based on which side of the hyperplane it lies. If the data can be perfectly separated by a hyperplane, then there are infinitely many separating hyperplanes (obtained by slightly rotating or shifting one of them), so the hyperplane must be chosen with care. A natural choice is the maximal margin hyperplane, also known as the optimal separating hyperplane: the separating hyperplane that is farthest from the training observations. The dashed lines mark the margin.

credit: figshare

Support Vectors

In the figure above, the three observations lying on the margin are equidistant from the maximal margin hyperplane. They are called support vectors, and moving them even slightly would shift the hyperplane. In essence, the name comes from the fact that they support the hyperplane in a way that the other observations do not.

Thus, the hyperplane depends directly on only a small subset of the observations, which is a key concept behind SVMs.

The maximal margin classifier is the solution to the following optimization problem:

maximize M over β0, β1, β2, …, βp, where M is the margin, i.e. the perpendicular distance from the hyperplane to the nearest training observations,

subject to Σj βj² = 1 (the coefficient vector has unit length),

and yi(β0 + β1Xi1 + β2Xi2 + ⋯ + βpXip) ≥ M for every observation. This ensures that yi and β0 + β1Xi1 + β2Xi2 + ⋯ + βpXip have the same sign, meaning each observation lies on the correct side of the hyperplane, and moreover at a distance of at least M from it.
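As a rough illustration, scikit-learn has no dedicated hard-margin estimator, but on data assumed to be linearly separable, an SVC with a linear kernel and a very large regularization value C behaves almost like the maximal margin classifier. The toy data below is made up for the sketch.

```python
# Approximating a maximal margin classifier with scikit-learn by using
# a very large C on linearly separable toy data.
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters with labels -1 and +1
X = np.array([[1, 1], [2, 1], [1, 2],
              [5, 5], [6, 5], [5, 6]], dtype=float)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # huge C ~ hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)    # the fitted beta's and beta0
print(clf.support_vectors_)         # the observations that define the margin
print(clf.predict([[2, 2], [6, 6]]))
```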

Although this is a great way to predict a class, a hyperplane that perfectly separates the two classes is often too good to be true, and we frequently encounter situations with no separating hyperplane at all. The support vector classifier was developed to address this situation.

Support Vector Classifier

Support vector classifiers are soft margin classifiers: they intentionally allow a few training observations to be misclassified in order to prevent overfitting. They follow a similar optimization problem with slight changes. Slack variables εi are introduced, constrained by a non-negative tuning parameter C. The optimization problem is,

maximize M over β0, β1, β2, …, βp, ε1, ε2, …, εn,

subject to Σj βj² = 1,

yi(β0 + β1Xi1 + β2Xi2 + ⋯ + βpXip) ≥ M(1 − εi),

εi ≥ 0, Σεi ≤ C

The εi’s are called slack variables: they allow individual observations to fall on the wrong side of the margin.

If εi > 0, the observation is on the wrong side of the margin; if εi > 1, it is on the wrong side of the hyperplane.

C bounds the sum of the εi’s and indicates how severely the margin may be violated: it is the budget for the total amount by which the margin can be violated across the n observations.

For C > 0, no more than C observations can be on the wrong side of the hyperplane.

When C increases, we become more tolerant of margin violations and the margin widens, leading to a simpler model with lower variance but a risk of higher bias.

When C decreases, the margin narrows, and the fit hugs the training data more closely, lowering bias but increasing variance and risking overfitting.

C is therefore treated as a tuning parameter for model selection in the bias-variance tradeoff, typically chosen by cross-validation, as sketched below.
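Below is a hedged sketch of choosing this tuning parameter by cross-validation with scikit-learn. One caveat: scikit-learn's C parameter penalizes margin violations, so it behaves roughly as the inverse of the budget C described above (a large scikit-learn C corresponds to a small budget and a narrow margin). The synthetic dataset is just a stand-in.

```python
# Tuning the regularization strength of a linear SVC by cross-validation.
# Note: sklearn's C is a violation penalty, roughly the inverse of the
# "budget" C in the text (large sklearn C -> narrower margin).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)      # the value of C picked by cross-validation
print(search.best_score_)       # its mean cross-validated accuracy
```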

Support vector classifiers are robust to observations that lie far from the hyperplane; the decision rule depends only on the subset of observations near it.

Support Vector Machine

The Support Vector Machine supports classification with non-linear decision boundaries. SVM addresses non-linearity by enlarging the feature space. For example, for a quadratic boundary, rather than the p features x1, x2, …, xp, one could use 2p features by also including the squared term of every feature, as sketched below.
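Here is a small sketch of that idea using explicit feature expansion: quadratic terms are added by hand and a linear classifier is fit in the enlarged space, which yields a non-linear boundary in the original space. The circular toy data is made up, and PolynomialFeatures also adds interaction terms, a slight superset of the 2p features described above.

```python
# Explicitly enlarging the feature space with quadratic terms, then
# fitting a linear SVM in the bigger space. The linear boundary there
# corresponds to a quadratic boundary in the original two features.
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # x1, x2, x1^2, x1*x2, x2^2
    SVC(kernel="linear", C=1.0),
)
model.fit(X, y)
print(model.score(X, y))   # near-perfect on this roughly separable toy data
```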

The solution to the support vector classifier involves the observations only through their inner products, which in a sense measure the similarity between observations:

f(x) = β0 + Σi αi⟨x, xi⟩

The αi are n parameters, one per training observation, and are non-zero only for the support vectors.

This makes computation much easier, since the observations away from the margin are zeroed out by their αi.
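The sketch below checks this representation against a fitted scikit-learn model: its decision function equals the intercept (playing the role of β0) plus a weighted sum of inner products with the support vectors only, where the stored dual coefficients play the role of the non-zero αi (scikit-learn stores them already multiplied by the class labels).

```python
# Verifying that the decision function only involves inner products
# with the support vectors: f(x) = intercept + sum_i dual_coef_i <x_i, x>.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

x_new = X[:3]                                   # a few query points
manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_new.T) + clf.intercept_
print(np.allclose(manual.ravel(), clf.decision_function(x_new)))  # True
```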

Kernel

In SVM, the kernel is a function that quantifies the similarity between two observations.

Linear Kernel: K(xi, xj) = Σk xik xjk, i.e. simply the inner product ⟨xi, xj⟩

Polynomial Kernel: The linear kernel can be replaced by the polynomial kernel as follows,

K(xi, xj) = (1 + Σk xik xjk)^d

This enables fitting the support vector classifier in a higher-dimensional space involving polynomials of degree d. When the support vector classifier is combined with a non-linear kernel, the result is called a Support Vector Machine.
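As a brief illustration, scikit-learn's SVC accepts kernel="poly", which fits the classifier in the enlarged polynomial feature space without ever constructing those features explicitly. The data and degree below are illustrative choices; note that scikit-learn's polynomial kernel is (γ⟨xi, xj⟩ + coef0)^d, a slight generalization of the formula above.

```python
# Fitting an SVM with a degree-2 polynomial kernel; sklearn's poly kernel
# is (gamma * <xi, xj> + coef0) ** degree.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

poly_svm = SVC(kernel="poly", degree=2, coef0=1, C=1.0)
poly_svm.fit(X, y)
print(poly_svm.score(X, y))   # quadratic boundary learned via the kernel
```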

Radial Kernel: It is another popular choice of kernel that has local behaviour, in the sense that only nearby training observations have an effect on the class label of a test observation. The formula is given by K(xi, xj) = exp(−γ Σk (xik − xjk)²), where γ is a positive constant.

The advantage of using kernels, rather than explicitly enlarging the feature space, is computational: the kernel is evaluated directly on pairs of observations, avoiding the cost of working in the (possibly very large) enlarged feature space.
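For completeness, here is a minimal sketch of the radial kernel in practice (it is scikit-learn's default kernel). γ controls how quickly a training observation's influence decays with distance; the value below is just a plausible starting point and, like C, is usually tuned by cross-validation. The two-moons data is synthetic.

```python
# Fitting an SVM with a radial (RBF) kernel on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

rbf_svm = SVC(kernel="rbf", C=1.0, gamma=0.5)
rbf_svm.fit(X, y)
print(rbf_svm.score(X, y))   # non-linear boundary fit via the kernel trick
```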

I hope this article provides a beginner's overview of the support vector machine. SVMs are a very popular machine learning technique and are widely used across a variety of applications.
