What Is SVM?

Saurav Jadhav · The Startup · Sep 9, 2020

Support Vector Machine (SVM) is an approach to classification that uses the concept of a separating hyperplane. It was developed in the 1990s, and it is a generalization of an intuitive and simple classifier called the maximal margin classifier.

To study the Support Vector Machine (SVM), we first need to understand the maximal margin classifier and the support vector classifier.

In the maximal margin classifier, we use a hyperplane to separate the classes. But what is a hyperplane? In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p − 1 (affine meaning it does not necessarily pass through the origin). For example, in a two-dimensional space a hyperplane is a one-dimensional flat subspace, which is nothing but a line. Similarly, in a three-dimensional space a hyperplane is a two-dimensional flat subspace, which is nothing but a plane. Figure 1 illustrates a hyperplane in two-dimensional space.

Figure 1: a hyperplane in two-dimensional space. (Source: An Introduction to Statistical Learning with Applications in R.)

From Figure 1, we can also see this hyperplane as a line dividing the space into two halves. Therefore, it can act as a decision boundary for classification. For example, in the right-hand panel of Figure 2, the points above the line belong to the blue class, and the points below the line belong to the purple class.

Figure 2: left panel, three possible separating hyperplanes; right panel, one separating hyperplane used as a decision boundary. (Source: An Introduction to Statistical Learning with Applications in R.)
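To make the idea of a hyperplane as a decision boundary concrete, here is a minimal sketch in Python. The coefficients beta0, beta1, and beta2 are made-up values chosen purely for illustration (they are not the line from the figures); a point is assigned to a class according to the sign of f(X) = β0 + β1·X1 + β2·X2.

```python
import numpy as np

# Hypothetical coefficients of a hyperplane in two dimensions:
# f(X) = beta0 + beta1*X1 + beta2*X2 = 0. These values are made up
# for illustration; they are not the hyperplane from the figures.
beta0, beta1, beta2 = 1.0, -2.0, 3.0

def classify(points):
    """Assign each point to 'blue' or 'purple' depending on which side
    of the hyperplane it lies on, i.e. the sign of f(X)."""
    f = beta0 + beta1 * points[:, 0] + beta2 * points[:, 1]
    return np.where(f > 0, "blue", "purple")

points = np.array([[0.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
print(classify(points))  # -> ['blue' 'purple' 'blue']
```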

In general, if our data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes. This is because a given separating hyperplane can usually be shifted a tiny bit up or down, or rotated, without coming into contact with any of the observations. Three possible separating hyperplanes are shown in the left-hand panel of Figure 2.

To construct a classifier based upon a separating hyperplane, we must have a reasonable way to decide which of the infinite possible separating hyperplanes to use. A natural choice is the maximal margin hyperplane (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations. That is, we can compute the (perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the observations to the hyperplane, and is known as the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest, that is, the hyperplane that has the farthest minimum distance to the training observations. We can then classify a test observation based on which side of the maximal margin hyperplane it lies on. This is known as the maximal margin classifier. Figure 3 shows the maximal margin hyperplane on the data used in Figure 2.

Figure 3: the maximal margin hyperplane, with the margin indicated by dashed lines. (Source: An Introduction to Statistical Learning with Applications in R.)

Comparing the right-hand panel of Figure 2 to Figure 3, we see that the maximal margin hyperplane shown in Figure 3 does indeed result in a greater minimal distance between the observations and the separating hyperplane, that is, a larger margin. In a sense, the maximal margin hyperplane represents the mid-line of the widest “slab” that we can insert between the two classes.

Examining Figure 3, we see that three training observations are equidistant from the maximal margin hyperplane and lie along the dashed lines indicating the width of the margin. These three observations are known as support vectors, since they are vectors in p-dimensional space (in Figure 3, p = 2) and they “support” the maximal margin hyperplane in the sense that if these points were moved slightly, the maximal margin hyperplane would move as well. Interestingly, the maximal margin hyperplane depends directly on the support vectors, but not on the other observations: a movement of any of the other observations would not affect the separating hyperplane, provided that the observation’s movement does not cause it to cross the boundary set by the margin.
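As a rough sketch of the maximal margin classifier, the snippet below fits a linear SVC from scikit-learn on a tiny, made-up, perfectly separable data set. Since scikit-learn has no pure hard-margin option, a very large penalty C is used to approximate it, and the fitted model exposes the support vectors that determine the hyperplane. The data and parameter values are illustrative assumptions, not anything taken from the figures.

```python
import numpy as np
from sklearn.svm import SVC

# A tiny, perfectly separable toy data set (made up for illustration).
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A linear SVC with a very large penalty C approximates the maximal margin
# (hard margin) classifier: essentially no margin violations are tolerated.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# The support vectors are the observations that lie on the margin and
# determine the separating hyperplane; moving any other point (without
# crossing the margin) would leave the hyperplane unchanged.
print(clf.support_vectors_)
print(clf.predict([[2.0, 2.0], [5.0, 5.0]]))  # classify new observations
```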

The maximal margin classifier requires the existence of a separating hyperplane, but in many cases no separating hyperplane exists. That is, we cannot exactly separate the two classes using a hyperplane. An example of this is given in Figure 4.

Figure 4: two classes that cannot be separated by a hyperplane. (Source: An Introduction to Statistical Learning with Applications in R.)

In this case, we cannot exactly separate the two classes. However, we can extend the concept of a separating hyperplane to develop a hyperplane that almost separates the classes, using a so-called soft margin. The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier. In Figure 4, we see that observations belonging to two classes are not necessarily separable by a hyperplane. In fact, even if a separating hyperplane does exist, there are instances in which a classifier based on a separating hyperplane might not be desirable. Such a classifier will necessarily classify all of the training observations perfectly, and this can lead to sensitivity to individual observations. An example is shown in Figure 5.

Figure 5: the maximal margin hyperplane before (left) and after (right) the addition of a single observation. (Source: An Introduction to Statistical Learning with Applications in R.)

The addition of a single observation in the right-hand panel of Figure 5 leads to a dramatic change in the maximal margin hyperplane. The resulting maximal margin hyperplane is not satisfactory: for one thing, it has only a tiny margin. This is problematic because the distance of an observation from the hyperplane can be seen as a measure of our confidence that the observation was correctly classified. Therefore, it could be worthwhile to misclassify a few training observations in order to do a better job of classifying the remaining observations.

The support vector classifier, sometimes called a soft margin classifier, does exactly this. Rather than seeking the largest possible margin so that every observation is not only on the correct side of the hyperplane but also on the correct side of the margin, we instead allow some observations to be on the incorrect side of the margin, or even the incorrect side of the hyperplane. (The margin is soft because it can be violated by some of the training observations.) An example is shown in the left-hand panel of Figure 6.

Figure 6: a support vector classifier; some observations are on the wrong side of the margin. (Source: An Introduction to Statistical Learning with Applications in R.)

Most of the observations are on the correct side of the margin, but a small set of observations (observations 1 and 8) are on the wrong side of the margin. An observation can be not only on the wrong side of the margin but also on the wrong side of the hyperplane; when no separating hyperplane exists, such a scenario is inevitable. Observations on the wrong side of the hyperplane are misclassified by the support vector classifier. Observations that lie strictly on the correct side of the margin do not affect the support vector classifier, but the observations that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. These observations do affect the support vector classifier.
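As a rough illustration of the soft margin, the sketch below fits linear support vector classifiers on synthetic, overlapping data for several values of scikit-learn's penalty parameter C. (This C is not the same quantity as the budget used in the textbook's formulation, but it plays the analogous tuning role.) A smaller C allows more margin violations, which typically means a wider, softer margin and more support vectors.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping, non-separable clusters; synthetic data that only mimics
# the situation in Figure 6, not the actual data from the book.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

# In scikit-learn the "softness" of the margin is controlled by the penalty
# parameter C: a small C tolerates many margin violations (a wide, soft
# margin), while a large C tries hard to classify every training point
# correctly, approaching hard-margin behaviour.
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C = {C:>6}: {clf.n_support_.sum()} support vectors")
```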

The support vector classifier is a natural approach for classification in the two-class setting if the boundary between the two classes is linear. However, in practice, we are sometimes faced with non-linear class boundaries. For instance, consider the data shown in the left panel of Figure 7.

Figure 7: data with a non-linear boundary between the two classes (left) and a support vector classifier fit to the data (right). (Source: An Introduction to Statistical Learning with Applications in R.)

A support vector classifier, or any other linear classifier, will perform poorly on such data. Indeed, the support vector classifier shown in the right-hand panel of Figure 7 is essentially useless.

We can address this problem by enlarging the feature space using quadratic, cubic, or higher-order polynomial functions of the features. The support vector machine (SVM) is an extension of the support vector classifier that results from enlarging the feature space in a specific way, using kernels. For example, in Figure 7 we have two features, X1 and X2. Rather than using these features as they are, we can also include higher-degree terms such as X1² and X2². The decision boundary is then linear in the enlarged feature space but quadratic, and hence non-linear, in the original one. Kernels let us do exactly this in an efficient manner: we simply specify what kind of decision boundary to use, for example linear, polynomial (with some degree), or radial. Figure 8 illustrates the use of polynomial and radial kernels on the data of Figure 7.
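Before looking at Figure 8, here is a minimal sketch of this feature-enlargement idea on synthetic "circles" data (both the data set and the use of scikit-learn are my own illustrative choices, not anything from the figures): a linear support vector classifier fails on the raw features but works once quadratic terms are added explicitly.

```python
from sklearn.datasets import make_circles
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Synthetic data with a circular class boundary, standing in for the kind
# of data shown in Figure 7.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# A plain linear support vector classifier cannot do much with the raw
# features X1, X2 ...
raw = SVC(kernel="linear").fit(X, y)
print("raw features, training accuracy:", raw.score(X, y))

# ... but explicitly enlarging the feature space with the quadratic terms
# X1^2, X1*X2, X2^2 gives a boundary that is linear in the enlarged space
# yet non-linear in the original X1, X2.
enlarged = make_pipeline(PolynomialFeatures(degree=2), SVC(kernel="linear"))
enlarged.fit(X, y)
print("quadratic features, training accuracy:", enlarged.score(X, y))
```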

Figure 8: an SVM with a polynomial kernel of degree 3 (left) and an SVM with a radial kernel (right), applied to the data of Figure 7. (Source: An Introduction to Statistical Learning with Applications in R.)

In the left-hand panel we used a polynomial kernel of degree 3, and in the right-hand panel we used a radial kernel. Both kernels result in a far more appropriate decision rule. The mathematics of how the decision boundaries and kernels are obtained is too technical to discuss here.
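Here is a minimal sketch of the kernel route on the same kind of synthetic data as above, fitting one SVM with a polynomial kernel of degree 3 and one with a radial kernel via scikit-learn; the parameter values are illustrative defaults rather than tuned choices. On this toy data both kernels typically recover the circular boundary almost perfectly, mirroring the improvement seen in Figure 8.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# The same kind of synthetic, non-linearly separable data as above.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Instead of adding polynomial terms by hand, we let a kernel define the
# shape of the decision boundary: a degree-3 polynomial kernel (coef0=1 so
# that lower-order terms are included as well) and a radial (RBF) kernel.
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0).fit(X, y)

print("polynomial kernel (degree 3), training accuracy:", poly_svm.score(X, y))
print("radial kernel, training accuracy:", rbf_svm.score(X, y))
```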

Reference: An Introduction to Statistical Learning with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
