Support Vector Machine

Shubhangi Hora
Nov 24, 2018



In the previous article I discussed a popular supervised learning algorithm called Random Forest, which is an extension of Decision Trees. This article talks about Support Vector Machine, which is another supervised learning algorithm that is quite popular.

Support Vector Machine can be used for both Classification (Support Vector Classifier) and Regression (Support Vector Regressor) problems, but is more commonly used for the former.

How does SVM work?

The basic premise of SVC is that it segregates the data points into classes by drawing a hyperplane.

Let’s break this down a bit. The algorithm represents each observation as a point in feature space, using its feature values as coordinates. The points that lie closest to the decision boundary are called support vectors (thus the name Support Vector Machine), because they are the points that determine where the boundary sits. The data points are then divided into classes by a hyperplane. In simple terms, when you’re dealing with only two dimensions (two features), a hyperplane is a line dividing the observations into two classes.

[Figure: an optimal hyperplane lying equally distant from the closest support vectors of each class]

Once the algorithm has learnt the hyperplane and is given new data, it plots the new observation in the same space and assigns it the class of whichever side of the hyperplane it falls on.
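
Here is a minimal sketch of this in code (my own illustration, not from the original article), using scikit-learn’s SVC on the built-in iris dataset: the model learns a hyperplane from the training data and then assigns classes to unseen observations depending on which side they fall on.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a small toy dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = SVC(kernel='linear')        # start with a simple linear hyperplane
clf.fit(X_train, y_train)         # learn the hyperplane from the training data

print(clf.predict(X_test[:5]))    # classes assigned to five unseen observations
print(clf.score(X_test, y_test))  # overall accuracy on the test set
```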

But how does it find the right hyperplane?

There are quite a few parameters that go into determining which hyperplane is most apt for the observations at hand.

  1. Margin:

This is the distance between the hyperplane and the support vectors closest to it on either side. Naturally, a large margin is preferred, since it leaves more room for error and means new data points are more likely to be assigned the correct class.

As can be seen above, the optimal hyperplane is that which is equally distant from the closest support vectors from each category.
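
As a rough numerical sketch of the margin (again my own example, not the article’s): for a linear SVC the learnt weight vector w defines the hyperplane, and the margin width works out to 2 / ||w||. The blobs dataset and the large C value below are just illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two well-separated blobs, so a (nearly) hard margin is achievable
X, y = make_blobs(n_samples=100, centers=2, random_state=6)

clf = SVC(kernel='linear', C=1000)  # a large C approximates a hard margin (C is covered below)
clf.fit(X, y)

w = clf.coef_[0]                    # normal vector of the learnt hyperplane
print("margin width:", 2 / np.linalg.norm(w))
print("closest points (support vectors):\n", clf.support_vectors_)
```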

2. Kernel

When dealing with a linear SVM, the margin is the only important parameter. However, not every dataset has such simple relationships; in many cases the data points cannot be separated by a straight line at all.

In this case, a transformation is applied to the data points, adding another dimension in which an optimal hyperplane can be drawn. This transformation is performed by a kernel function (often called the kernel trick).

[Figure: non-linearly separable data in two dimensions (left) mapped into three dimensions by a kernel function, where a plane separates the two classes (right)]

In the graph on the left, it is impossible to separate the two classes of data points with a single line, and so using a kernel function a third dimension is introduced and the graph on the right is produced. Now, a plane can be drawn that will segregate the two classes. Dimensions are repeatedly added to a graph until a hyperplane can be drawn.

There are many different kernel functions that can be applied to a dataset: ‘linear’, ‘rbf’ (Gaussian Radial Basis Function, the default in scikit-learn’s SVC), ‘poly’ (Polynomial Function), ‘sigmoid’, and ‘precomputed’. To read more about SVM kernels, click here.
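
To make the kernel idea concrete, here is a small sketch of mine (not from the article) on scikit-learn’s make_circles data, which no straight line can separate: the linear kernel struggles, while ‘rbf’ handles the extra dimension implicitly.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in 2-D
X, y = make_circles(n_samples=300, factor=0.3, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(f"{kernel:7s} accuracy: {clf.score(X_test, y_test):.2f}")
```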

3. C (regularization):

SVMs are designed to find an optimal hyperplane with a margin that divides all the data points into discrete categories. This might not always happen though, since actual data might have mislabeled data points and anomalies. A ‘soft margin’ was introduced, which allows the algorithm to ignore some data points or place them on the wrong side of the hyperplane.

C is the parameter that controls this soft margin and how much each support vector influences the hyperplane, and so basically determines how much misclassification can occur.

A high C implies that very little misclassification is allowed, which leads to a small margin hyperplane. A low C implies that a lot of misclassification is allowed, and so this creates a large margin hyperplane.
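
A quick way to see this trade-off (a sketch with arbitrary C values of my choosing) is to count how many support vectors end up defining the hyperplane: a loose, low-C margin leans on many points, while a strict, high-C margin leans on only a few.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# A slightly noisy, overlapping two-class problem
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, class_sep=1.0, random_state=1)

for C in (0.01, 1, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    print(f"C={C}: {clf.n_support_.sum()} support vectors")
```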

4. Gamma:

This determines which support vectors are taken into consideration while drawing the hyperplane, by deciding how far the influence of each support vector reaches.

A high gamma value implies that only points close to the hyperplane are taken into consideration, while a low gamma value implies that points far away are also taken into consideration.
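
As a hedged illustration (the moons dataset and gamma values below are just my picks), you can watch this behaviour by fitting an ‘rbf’ SVC with different gamma values: a small gamma gives each support vector far-reaching influence and a smooth boundary, while a very large gamma hugs the training points, which shows up as near-perfect training accuracy and many support vectors.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: a mildly non-linear problem
X, y = make_moons(n_samples=200, noise=0.2, random_state=3)

for gamma in (0.01, 1, 100):
    clf = SVC(kernel='rbf', gamma=gamma).fit(X, y)
    print(f"gamma={gamma}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```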

For more information on C and Gamma, click here.

The above parameters can be changed to tune the SVM and find the optimal solution. This can be done using a tuning function called GridSearchCV, found in the scikit-learn Python package (specifically in sklearn.model_selection).
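
Here is a minimal sketch of such a search (the grid values are arbitrary choices of mine): GridSearchCV tries every combination of kernel, C and gamma with cross-validation and reports the best one. Scaling the features first is good practice for SVMs, which is why a pipeline is used below.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Scale features, then fit an SVC; grid keys are prefixed with the pipeline step name
pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    'svc__kernel': ['linear', 'rbf'],
    'svc__C': [0.1, 1, 10, 100],
    'svc__gamma': ['scale', 0.01, 0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```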

Pros of SVMs

  • Produce a clear margin of separation between classes
  • Can work on data with high dimensions (many features)
  • Are memory efficient since subsets of data points are used (support vectors)
  • The solution is globally optimal
  • Are suitable for linearly and non-linearly separable data

Cons of SVMs

  • Can’t work efficiently on large datasets, since training takes a lot of time
  • Are inefficient when target classes overlap
  • Don’t produce direct probability estimates (see the sketch below for scikit-learn’s workaround)
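
On that last point, scikit-learn offers a workaround worth knowing (a sketch of mine, not something covered in the article): passing probability=True makes SVC fit an extra calibration step (Platt scaling) internally, which makes predict_proba available at the cost of slower training.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True adds internal cross-validated Platt scaling, so training is slower
clf = SVC(probability=True).fit(X_train, y_train)
print(clf.predict_proba(X_test[:3]))  # calibrated class probabilities for three samples
```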

To read up more on SVMs, click on the following links:

OpenCV, KDnuggets, HackerEarth.

