Support Vector Machine

Published in Analytics Vidhya, May 1, 2020

Support Vector Machine (SVM) is a supervised classifier and is defined by a separating hyperplane. In other words, given a set of labeled data, SVM generates an optimal hyperplane in the feature space which demarcates different classes.

Confusing, isn’t it? Let’s understand it in layman's terms.

Suppose you have a set of points of two types (say □ and ○) on a sheet of paper, and the two types are linearly separable. The job of SVM is to find a straight line that divides the set into two homogeneous groups and that is also situated as far as possible from all the points.

Both straight lines ‘A’ and ‘B’ separate the two types of points as desired. However, only ‘A’ is situated as far as possible from all the points, so SVM will choose ‘A’ as the separating hyperplane. In the image, the light blue band around lines ‘A’ and ‘B’ is called the ‘margin’. It is defined as the distance from the hyperplane to the nearest point, multiplied by 2; in other words, the hyperplane sits in the middle of the margin. The hyperplane with the largest margin is the optimal one.
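The margin definition above can be checked with a few lines of code. This is a minimal sketch on made-up points; the data, the two candidate lines and the helper name `margin` are assumptions for illustration, not taken from the figure. The helper only measures the margin and does not verify that the line actually separates the classes.

```python
import math

def margin(w, b, points):
    """Twice the distance from the line w[0]*x + w[1]*y + b = 0 to its nearest point."""
    norm = math.hypot(w[0], w[1])
    nearest = min(abs(w[0] * x + w[1] * y + b) / norm for x, y in points)
    return 2 * nearest

# Hypothetical toy data: squares on the left, circles on the right.
squares = [(1.0, 1.0), (1.0, 3.0), (2.0, 2.0)]
circles = [(6.0, 1.0), (6.0, 3.0), (5.0, 2.0)]
points = squares + circles

# Line A is x = 3.5 (midway between the classes); line B is x = 2.5 (close to the squares).
margin_a = margin((1.0, 0.0), -3.5, points)
margin_b = margin((1.0, 0.0), -2.5, points)

print("margin of A:", margin_a)  # 3.0
print("margin of B:", margin_b)  # 1.0
```

Both lines separate the toy classes, but A has the larger margin, so it is the one a maximum-margin classifier would prefer.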

Working of SVM

So far, we have seen how a hyperplane segregates two classes. The more important question is how SVM finds that hyperplane. Don’t worry, it’s not as complicated as it appears to be.

To understand this, we need to delve into the working of SVM. In the process, we will look at different scenarios of how a hyperplane can be constructed. Just remember a rule of thumb for identifying the appropriate hyperplane:

Select the hyperplane which segregates the two classes better.

In the example, we need to draw a hyperplane that distinguishes the two classes. We start by plotting three random hyperplanes along with the data set, as shown in the graph below.

Now, we adjust the orientation of the hyperplanes so that each one divides the given classes homogeneously. Here, all three hyperplanes (‘A’, ‘B’ and ‘C’) segregate the classes well, so the pertinent question is “How can we identify the appropriate hyperplane?”

To answer the question, we maximize the distance between the hyperplane and the nearest data point, and use it to decide between the hyperplanes. As we have learned, twice this distance is the ‘margin’.

As can be seen in the graph, the margin for hyperplane ‘B’ is larger than the margins for both ‘A’ and ‘C’. Therefore, we consider hyperplane ‘B’ the best fit.

Another pertinent reason for selecting the hyperplane with a higher margin is robustness. That is to say, if we select a hyperplane with a low margin, a small shift in the data can push points across it, so the chances of misclassification are higher.
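The maximum-margin choice can also be left to a library. The sketch below assumes scikit-learn's `SVC` with a linear kernel and uses made-up, linearly separable points; a very large `C` is used here only to approximate a hard margin. For a fitted linear SVM with weight vector w, the margin width is 2 / ||w||.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data (not from the article's figures).
X = np.array([[1, 1], [1, 2], [2, 1], [4, 4], [4, 5], [5, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal vector of the separating hyperplane
b = clf.intercept_[0]     # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:")
print(clf.support_vectors_)
```

Only the points closest to the hyperplane end up as support vectors; the remaining points could be moved slightly without changing the fitted hyperplane.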

In the scenario given below, it is not possible to draw a linear hyperplane that classifies the given set. How does SVM demarcate the classes then? (Note that so far we have only looked at linear hyperplanes.)

SVM handles this type of problem by introducing additional features computed from the existing ones.

Plotting the points with the new feature on one axis, we get:

Now, we can easily draw a hyperplane which differentiates the two classes.
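To make the idea of an additional feature concrete, here is a minimal sketch. It assumes one common choice of new feature, z = x² + y², and made-up data in which one class sits near the origin and the other on a ring around it; neither the feature nor the data is taken from the article's figures.

```python
import math

# Hypothetical data: one class near the origin, the other on a ring around it.
inner = [(0.5 * math.cos(t), 0.5 * math.sin(t)) for t in (0.0, 1.0, 2.0, 3.0, 4.0, 5.0)]
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in (0.5, 1.5, 2.5, 3.5, 4.5, 5.5)]

def add_feature(point):
    """Append the assumed extra feature z = x^2 + y^2 to a 2-D point."""
    x, y = point
    return (x, y, x * x + y * y)

inner_z = [add_feature(p)[2] for p in inner]
outer_z = [add_feature(p)[2] for p in outer]

# In the new feature the classes fall on either side of a single threshold,
# which means the flat hyperplane z = threshold separates them.
threshold = (max(inner_z) + min(outer_z)) / 2
print("largest z in inner class:", max(inner_z))
print("smallest z in outer class:", min(outer_z))
print("separating threshold:", threshold)
```

No straight line in the original x-y plane separates these two classes, but a single cut on z does.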

Hyper-parameter Tuning

Kernel

The most important hyper-parameter of SVM is the ‘kernel’. Given a list of observations, it maps them into a specific feature space. Generally speaking, the aim is for the observations to become linearly separable after this transformation. Note that the default value of the kernel is ‘rbf’ (this is the default in scikit-learn’s SVC, for example). Let us quickly go through the different types of kernel.

1. Linear Kernel

With the linear kernel, the only parameter to tune is the cost parameter ‘C’.

2. Polynomial Kernel

In its usual form, the polynomial kernel is K(x, x′) = (γ·x·x′ + r)^d. Here, ‘r’ is usually set to zero and ‘γ’ to a constant value. Along with the cost parameter ‘C’, the integer degree ‘d’ has to be tuned. The value of ‘d’ typically ranges from 1 to 10.

3. Radial Kernel

It is the most popularly used kernel. It often outperforms other kernels because of its flexibility in separating observations. Here, the cost parameter ‘C’ and the parameter ‘γ’ have to be tuned.

4. Gaussian Kernel

In the Gaussian kernel, we build an n-dimensional representation by first picking ‘n’ patterns in the data space. The kernel coordinates of a point are then obtained by calculating its distance to each of these chosen data points and taking the Gaussian function of those distances.
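The kernels listed above can be written out in a few lines. This is a sketch using the commonly used textbook forms (linear: x·x′; polynomial: (γ·x·x′ + r)^d; radial: exp(−γ·‖x − x′‖²)); the function names, default parameter values and sample points are assumptions for illustration. The last helper mirrors the Gaussian description: it maps a point to its Gaussian similarity to each of a set of chosen data points.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    return dot(u, v)

def polynomial_kernel(u, v, gamma=1.0, r=0.0, d=2):
    # r = 0 and a constant gamma, with the degree d left to be tuned.
    return (gamma * dot(u, v) + r) ** d

def rbf_kernel(u, v, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def gaussian_features(point, chosen_points, gamma=0.5):
    """Kernel coordinates of `point`: one Gaussian similarity per chosen point."""
    return [rbf_kernel(point, c, gamma) for c in chosen_points]

u, v = (1.0, 2.0), (3.0, 1.0)
print(linear_kernel(u, v))              # 5.0
print(polynomial_kernel(u, v))          # 25.0
print(rbf_kernel(u, v))                 # exp(-2.5), about 0.0821
print(gaussian_features(u, [u, v]))     # [1.0, exp(-2.5)]
```

A point is maximally similar (value 1.0) to itself, and the similarity decays towards zero as the distance to the chosen point grows.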

Regularization

The regularization parameter is also known as the ‘C’ parameter. It tells the SVM optimizer how strongly to avoid misclassifying each training example.

If ‘C’ is very large, the optimizer will choose a hyperplane with a smaller margin if that hyperplane classifies more training points correctly. Conversely, a very small value of ‘C’ will make the optimizer select a hyperplane with a larger margin, even if that hyperplane misclassifies more points.

Given below are examples of two different values of ‘C’. In the left image, the regularization value is low, and hence there is some misclassification. In the other image, a large value of ‘C’ leads to a hyperplane with a smaller margin.
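The trade-off can be observed numerically. The sketch below assumes scikit-learn's `SVC` with a linear kernel and a small made-up data set in which one point from each class strays into the other class, so no line separates them perfectly. The helper name `fit_and_measure` and the two values of `C` are chosen only for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping data: the last point of each class strays into the other class.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [3.2, 3.0],   # class 0
              [3.0, 3.5], [4.0, 4.0], [4.5, 3.5], [2.2, 2.0]])  # class 1
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def fit_and_measure(C):
    """Fit a linear SVM and return (margin width, training accuracy)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_[0])
    return width, clf.score(X, y)

small_c_margin, small_c_acc = fit_and_measure(0.01)
large_c_margin, large_c_acc = fit_and_measure(100.0)

print("C=0.01 -> margin:", small_c_margin, "training accuracy:", small_c_acc)
print("C=100  -> margin:", large_c_margin, "training accuracy:", large_c_acc)
```

The small value of `C` yields the wider margin, while the large value narrows the margin in an attempt to classify more training points correctly.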

Gamma

The gamma parameter characterizes how far the influence of a single training example reaches. A low value of gamma means that points far away from the hyperplane are also considered in the calculation, whereas a high value means that only points close to it are considered.
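The reach of a single training example can be illustrated directly from the radial kernel. This sketch assumes the usual form exp(−γ·distance²); the helper name `rbf_influence` and the chosen distances and gamma values are illustrative only.

```python
import math

def rbf_influence(distance, gamma):
    """RBF kernel value exp(-gamma * distance^2) for two points `distance` apart."""
    return math.exp(-gamma * distance ** 2)

# Influence of one training point at increasing distances, for several gamma values.
for gamma in (0.01, 0.1, 1.0, 10.0):
    values = [round(rbf_influence(d, gamma), 4) for d in (0.5, 1.0, 2.0, 4.0)]
    print(f"gamma={gamma}: {values}")
```

With a low gamma the influence stays close to 1 even at a distance of 4, so far-away points still matter; with a high gamma it drops to practically zero beyond the immediate neighbourhood.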

Margin

The margin is the separation between the hyperplane and the closest points of each class. A margin is considered good if this separation is large for both classes and no point of one class crosses over to the other class.

Conclusion

As for the limitations of SVM, it doesn’t perform well on large data sets because the required training time is high. SVM also falls short when the data is noisy or the given classes overlap, because it is difficult to draw a hyperplane between overlapping classes. Further, SVM doesn’t directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

However, on the positive front, SVM works really well when there is a clear margin of separation and is effective in high-dimensional spaces. It also performs well in cases where the number of dimensions is greater than the number of samples. It uses only a subset of the training points in the decision function (called ‘support vectors’), and is therefore also memory efficient.