A Practical Guide to Support Vector Machines (SVM)

Happy Data
SFU Professional Computer Science
13 min read · Feb 10, 2022

Authors: Wei Zhang, Wanyi Su, Zeyu Hu, Jinyao Lu, Yin Yu Kevani Chow

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit {sfu.ca/computing/mpcs}.


Quick Introduction of SVM

SVM stands for “Support Vector Machine”. The SVM algorithm is a powerful supervised machine learning model designed for classification, regression, and outlier detection problems. It is an example of a linear classifier and is mostly used to solve classification problems. Training an SVM is in fact an optimization problem. For classification, the objective of an SVM classifier is to find an optimal decision boundary that partitions the input data into discrete classes while maximizing the margin. If a dataset is linearly separable, we can classify it using a linear SVM. However, data in the real world is complex and usually not a simple linearly separable problem. In that case, we apply the “kernel trick” to map the original data into a higher dimension, and then apply the vanilla linear SVM model to solve the nonlinear problem. Hard and soft margins are also introduced. We will illustrate these ideas in the following parts.

Here, we first introduce several basic concepts in SVM.

“Decision Boundary”
A linear classifier’s decision boundary is also called a “hyperplane”. A hyperplane is an (n-1)-dimensional subspace that separates different classes of data observed in an n-dimensional space: for example, a line for observations in a 2D space and a plane for observations in a 3D space. There can be multiple separating hyperplanes, as the figure below shows.

“Margin”
The margin is defined as the perpendicular distance from the hyperplane to the nearest data points. There are “hard margin” and “soft margin” variants. Our objective is to find the hyperplane with the maximum margin, for example, the middle one in the figure below.

“Support Vector”
The data points closest to the hyperplane, lying on the edges of the margin, are the support vectors.

“Kernel”
The kernel method is a computationally efficient technique. It uses kernel functions to implicitly map raw input data into a higher-dimensional space and compute the dot product there without explicitly computing all the feature terms.

Figure 1. An example of SVM with multiple hyperplanes

Comparison of SVM and Logistic Regression

Since SVM and Logistic Regression are both supervised classification models, and SVM’s loss function can be derived from Logistic Regression’s, it is worth explaining the main differences between the two algorithms and when to use each of them.

Main Differences:

SVM minimizes the hinge loss, whereas Logistic Regression minimizes the log (cross-entropy) loss.

SVM looks for the maximum-margin hyperplane, and its solution depends only on the support vectors; Logistic Regression’s solution depends on every training point.

Logistic Regression outputs class probabilities directly; a standard SVM outputs only the class label.

With the kernel trick, SVM handles non-linear decision boundaries naturally.

When to use either one of the two models:

In a nutshell, a linear SVM is similar to Logistic Regression, and they sometimes have similar performance. However, things may differ when modeling complex, nonlinear data. In that case, an SVM with kernels has more power to solve the classification problem.

Suppose n is the number of features and m is the number of training data points. If n is much larger than m, use Logistic Regression or a linear SVM; for example, in spam classification we may have 10,000 word features and only up to 1,000 email samples. If n is small and m is of intermediate size, it is better to choose an SVM with a Gaussian kernel. If n is small and m is very large (for example, millions of samples), it is recommended to create or add more features and then apply Logistic Regression or an SVM without a kernel. (Artificial Intelligence — All in One, 2017)[1]
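As a rough illustration of this guideline, here is a minimal scikit-learn sketch; the thresholds and the helper function pick_model are our own assumptions for illustration, not a fixed rule:

```python
# A minimal sketch of the guideline above, using scikit-learn.
# The thresholds below are illustrative assumptions, not fixed rules.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def pick_model(n_features, n_samples):
    if n_features >= 10 * n_samples:
        # Many features, few samples (e.g. 10,000 word features, 1,000 emails):
        # a linear model is usually enough (LinearSVC would also work here).
        return LogisticRegression(max_iter=1000)
    if n_samples <= 10000:
        # Few features, intermediate number of samples:
        # an SVM with a Gaussian (RBF) kernel can capture non-linear structure.
        return SVC(kernel="rbf", gamma="scale")
    # Few features, very many samples: kernel SVMs become slow,
    # so create more features manually and fall back to a linear model.
    return LogisticRegression(max_iter=1000)

print(pick_model(n_features=10000, n_samples=1000))
print(pick_model(n_features=20, n_samples=5000))
print(pick_model(n_features=20, n_samples=2000000))
```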

Mathematics behind Support Vector Machines (SVM)

If we dig into the mathematics behind SVM, we may need to spend days reading and understanding the formulas.


In the coming paragraphs we are going to cover the basic math concepts of SVM, which are simple enough to understand.

Let’s go back to a classic example of drawing a straight line to separate the positive samples from the negative samples in a space:

However, which straight line should we draw? That’s the question!
We should draw the line that separates the positive and negative samples with as wide a gap as possible; this line is the decision boundary.

However, can we figure it out in a mathematical way? Let’s look at the graph below:

To determine the decision rule for finding the decision boundary, we assume there are two vectors:

1. Vector u: an arbitrary sample point, represented as a vector from the origin

2. Vector w: a vector perpendicular to the decision boundary

Also, we denote ‘c’ as a constant related to the distance from the origin to the decision boundary. Here are the relationships of these quantities, which give us the decision rules:

If the dot product of vector u and vector w is larger than c (w · u > c), then the point is a positive sample.

If the dot product of vector u and vector w is smaller than c (w · u < c), then the point is a negative sample.

If the dot product of vector u and vector w equals c (w · u = c), then the point lies on the decision boundary.

After establishing the above relationships, we are ready to define the decision rule: we check whether the dot product of vector u and vector w is at least c, that is, whether w · u ≥ c.

Without loss of generality, we can substitute c = -b and rewrite the rule as follows: classify a point as positive if w · u + b ≥ 0.

Hence, we have two rules, one for positive samples and one for negative samples respectively: we require w · x_+ + b ≥ 1 for positive samples and w · x_- + b ≤ -1 for negative samples, where the constants 1 and -1 fix the scale of w and b.

In order to make it mathematically more convenient, we introduce a variable y_i such that y_i = +1 for positive samples and y_i = -1 for negative samples. As a result, we can rewrite and combine the two inequalities into a single constraint:

y_i (w · x_i + b) ≥ 1

For the samples lying exactly on the margin hyperplanes, this expression holds with equality, so the constraint can be rewritten as:

y_i (w · x_i + b) - 1 = 0

After all the maths, do you still remember what we are trying to do? Forgot already?


We are trying to find the line such that the “street” (margin) separating the positive samples from the negative samples is as wide as possible. Therefore, we have to optimize the width. Let’s look at the following graph, which shows the positive and negative samples lying on the margin hyperplanes.

Basically, the width of the street is determined by the difference between a positive support vector x_+ and a negative support vector x_-. To find the width of the margin, we project this difference onto the unit normal to the median line of the hyperplanes, w / ||w||. Then, this is the formula we need:

width = (x_+ - x_-) · w / ||w||

By the previous equation for the samples lying on the hyperplanes, we know that w · x_+ = 1 - b and w · x_- = -1 - b. As a result, the width equation can be rewritten as:

width = ((1 - b) - (-1 - b)) / ||w|| = 2 / ||w||

Finally, we have to maximize the width, which means minimizing ||w||. For mathematical convenience, the optimization objective is usually represented as:

min (1/2) ||w||^2
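To make this concrete, here is a minimal sketch on a toy dataset of our own. With a very large C, scikit-learn’s SVC with a linear kernel approximates the hard-margin objective above, and the learned margin width matches 2 / ||w||:

```python
# A small sketch (toy data of our own) showing that the learned margin width
# equals 2 / ||w|| for a (nearly) hard-margin linear SVM.
import numpy as np
from sklearn.svm import SVC

# Two linearly separable blobs.
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin formulation min (1/2) ||w||^2.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
print("w =", w, "b =", clf.intercept_[0])
print("margin width = 2 / ||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)
```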

However, can this equation solve all the dataset classification problems?

Soft Margin

If the datasets in real life were perfectly linearly separable, our lives would be happy ever after. Unfortunately, we usually only have an “almost” linearly separable dataset, or even worse, a non-linearly separable one. Let’s work on the relatively easy case first: the soft margin.

If we strictly impose that all instances must be off the street and on the correct side, this is called hard margin classification. (5. Support Vector Machines, 2021)[3] However, in most cases it is impossible to classify every point perfectly this way. This method is also sensitive to outliers.


Soft margin classification provides a more flexible solution. It adds two more terms to the equation: the slack variable zeta (ζ) and the hyper-parameter C. ζ measures how far a wrongly placed point is from the margin hyperplane it should be on; the slack is zero when all the points are in the right categories. C is the hyper-parameter that controls to what extent we focus on the classification error.


The new objective function consists of two parts: margin error and classification error. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations. (5. Support Vector Machines, 2021) [3]

Soft-Margin SVM objective: minimize (1/2) ||w||^2 + C Σ ζ_i, subject to y_i (w · x_i + b) ≥ 1 - ζ_i and ζ_i ≥ 0 for every sample i.

When C is set to a small value, the model focuses more on a wide margin and allows points to violate the margin more easily. With a larger C value, it tries to classify as many points correctly as possible and sacrifices margin width.
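A quick sketch of this trade-off on synthetic, slightly overlapping data (the values of C below are our own illustrative choices): a small C gives a wide margin with more violations, while a large C narrows the margin to fit more points correctly.

```python
# A quick sketch (synthetic, slightly overlapping data) of how the
# hyper-parameter C trades margin width against classification error.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    print(f"C={C:>6}: margin width = {2 / np.linalg.norm(w):.3f}, "
          f"training accuracy = {clf.score(X, y):.3f}, "
          f"support vectors = {len(clf.support_vectors_)}")
```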


Soft margin SVM works well in linearly separable cases. However, it does not perform well when the datasets cannot be separated linearly. We need to introduce additional features to handle such cases reasonably. In the next section we’ll move on to the magical “Kernel Trick”.

Kernels

This is the most exciting part of SVM — Kernels.

First of all, what is a kernel? How does it work?

A kernel is a mathematical function. When the support vectors cannot be separated linearly, SVM introduces a kernel. First, we select a convenient and appropriate kernel function. Next, we transform the data into a higher-dimensional space using the kernel function, where the transformed data becomes linearly separable. Finally, we maximize the margin between the transformed support vectors and the target hyperplane.
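To see the “without explicitly computing all the feature terms” part concretely, here is a tiny numerical check; the feature map phi and the vectors are our own illustrative choice:

```python
# A minimal numerical check of the kernel trick: the degree-2 polynomial
# kernel k(x, y) = (x . y)^2 equals the dot product of the explicit
# feature maps phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), but the kernel
# evaluation never constructs phi at all.
import numpy as np

def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

explicit = phi(x) @ phi(y)   # dot product in the 3-D feature space
trick = (x @ y) ** 2         # same value computed in the original 2-D space

print(explicit, trick)       # both print 121.0
```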

What are the examples of kernels?[6]

There are several popular kernels. Let us see a brief introduction of them:

  1. Linear Kernel:

It is most commonly used when a data set has a large number of features.

Equation: k(x, y) = x · y + c, where c is an optional constant.

2. Polynomial Kernel:

Polynomial kernels are ideal for issues in which all of the training data has been standardized.

Equation: k(x, y) = (a (x · y) + c)^d, where x and y are two feature vectors, a is the slope, c is a constant, and d is the degree of the polynomial.

3. Radial Basis function Kernel:

When there is no prior knowledge of the data, this kernel family is a common default. Furthermore, there are several different kinds of RBF kernel.

3.1 Gaussian Kernel

Equation: k(x, y) = exp(-||x - y||^2 / (2σ^2))

3.2 Exponential Kernel

Equation: k(x, y) = exp(-||x - y|| / (2σ^2))

3.3 Laplacian Kernel

Equation: k(x, y) = exp(-||x - y|| / σ)

3.4 ANOVA Kernel

Equation: k(x, y) = Σ_{k=1..n} exp(-σ (x_k - y_k)^2)^d

4. Hyperbolic Tangent (Sigmoid) Kernel

The Hyperbolic Tangent (Sigmoid) Kernel is derived from the science of Neural Networks, where the bipolar sigmoid function is frequently utilized as an artificial neuron activation function.

Equation: k(x, y) = tanh(a (x · y) + c), where a is the slope and c is the intercept.

The following image presents the classification results using various kernels.

https://gist.github.com/WittmannF/60680723ed8dd0cb993051a7448f7805
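As a complementary sketch, the same kind of comparison can be reproduced numerically on a toy non-linear dataset; the “two moons” data and the parameters below are our own choice for illustration:

```python
# A sketch comparing kernels on a toy non-linear dataset (two interleaving moons).
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(f"{kernel:>8} kernel: test accuracy = {clf.score(X_test, y_test):.3f}")
```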

Why are Support Vector Machines (SVM) good?

Clear Margin Separation: The input of an SVM is a dataset and the output is an optimal hyperplane. In one-dimensional space the hyperplane is a point, in two-dimensional space it is a line, and in three-dimensional space it is a plane. [7]

Effective in high-dimensional spaces: When the number of dimensions is greater than the number of samples, SVM’s built-in regularization helps it avoid overfitting the high-dimensional data.[7]

Memory Efficient: The trained model depends only on a subset of the training dataset rather than on the entire training set. This subset is called the support vectors.

Effective in non-linear classification: SVM uses a kernel function to map the data into a high-dimensional space and then separates it linearly there. It can solve complex problems with an appropriate kernel function.

Straightforward classification concept: SVM maximizes the margin between the support vectors and the target hyperplane.

What are the constraints of Support Vector Machines (SVM)?

Although SVM has many advantages, disadvantages still exist:

Sometimes it is difficult to choose an appropriate kernel function: Choosing an appropriate kernel function is not an easy task when the data is non-linear. It can be tricky and complex. Especially when using a high-dimensional kernel, you might end up with too many support vectors, which drastically reduces the training speed. [2]

Extensive memory requirements: The algorithmic complexity and memory requirements of SVM are very high. A lot of memory is needed because all the support vectors must be stored in memory, and their number grows quickly with the size of the training dataset.

Requires Feature Scaling: One must do feature scaling of the variables before applying SVM, as shown in the sketch after this list.

Long training time: Although SVMs have good generalization performance, they can be abysmally slow in the test phase, and training can also take a long time, especially on large datasets.

Difficult to interpret and explain: SVM is not a probabilistic model, so we cannot explain the classification in terms of probabilities. Unlike Decision Trees, the SVM model is therefore difficult for humans to understand and interpret.
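Here is a minimal sketch of the feature-scaling point above; the dataset and pipeline are our own illustrative choice, and MinMaxScaler is the scaler referenced in [4]:

```python
# A minimal sketch of feature scaling before SVM, using MinMaxScaler [4].
# Keeping the scaler and the SVM in one pipeline ensures the scaling
# is learned only from the training data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

unscaled = SVC(kernel="rbf").fit(X_train, y_train)
scaled = make_pipeline(MinMaxScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

print("without scaling:", unscaled.score(X_test, y_test))
print("with scaling:   ", scaled.score(X_test, y_test))
```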

Practical example

We implement SVM models for a fraud detection problem on insurance claims using Python 3. It is a binary classification task with features such as incident type, client age, and education level. The original data source is https://www.kaggle.com/roshansharma/insurance-claim.
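The models can be trained along the following lines. This is a hedged sketch only: the file name, the label column “fraud_reported”, and the preprocessing are assumptions for illustration, not the exact code that produced the numbers below.

```python
# A hedged sketch of how such models might be trained; column names and
# preprocessing are hypothetical and may differ from the actual dataset.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

df = pd.read_csv("insurance_claims.csv")                 # hypothetical file name
y = (df["fraud_reported"] == "Y").astype(int)            # hypothetical label column
X = pd.get_dummies(df.drop(columns=["fraud_reported"]))  # one-hot encode categoricals
# (assumes missing values have already been handled)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for name, model in [("Linear kernel", SVC(kernel="linear")),
                    ("Polynomial kernel with degree 2", SVC(kernel="poly", degree=2)),
                    ("RBF kernel", SVC(kernel="rbf", gamma="scale"))]:
    clf = make_pipeline(MinMaxScaler(), model).fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(name)
    print("  Accuracy Score:", accuracy_score(y_test, pred))
    print("  F1 Score:      ", f1_score(y_test, pred))
```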

Linear kernel:
Accuracy Score: 0.83
F1 Score: 0.6851851851851852

Polynomial kernel with degree 2:
Accuracy Score: 0.795
F1 Score: 0.6306306306306307

RBF kernel with gamma:
Accuracy Score: 0.82
F1 Score: 0.6603773584905661

Because this is a typical fraud detection problem with imbalanced classes, we mainly compare the F1 score, which emphasizes accuracy on the positive class (the fraud cases). The polynomial kernel performs below expectations in both accuracy and F1 score, while, surprisingly, the linear kernel achieves a higher score than the RBF kernel. For further hyperparameter tuning, we recommend techniques such as Grid Search to achieve stronger predictive ability.
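As a sketch of that Grid Search recommendation (the synthetic stand-in data and the parameter grid are our own illustrative choices):

```python
# A minimal sketch of tuning C and gamma for the RBF kernel with
# GridSearchCV, using F1 as the selection metric.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced data standing in for the fraud dataset.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("best cross-validated F1:", search.best_score_)
print("test F1 with best model:", search.score(X_test, y_test))
```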

Real-life applications

SVM is widely used in real-life business scenarios across multiple industries, especially for classifying non-linear and unstructured data. Some examples of SVM classifier applications are listed below:

Image recognition. For example, face detection, hand-written characters classification.

Text and hypertext classification. For example, document topic recognition.

Security-related classification. For example, anomaly detection in network security, fraud detection in banking (like our practical example in the previous part).

Geo-spatial classification. For example, satellite data classification.

Biology-related recognition. For example, cancer classification, protein remote homology detection.

The SVM regressor is also useful in the business world. Support vector regression (SVR), a popular regression form of SVM, can be applied to prediction tasks, for example stock market forecasting.
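A minimal SVR sketch on synthetic data follows; a real stock-market forecast would of course need proper features and validation.

```python
# A minimal sketch of support vector regression (SVR) on synthetic data.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, size=(100, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

svr = SVR(kernel="rbf", C=10, epsilon=0.1).fit(X, y)
print("R^2 on the training data:", svr.score(X, y))
print("prediction at x = 5:", svr.predict([[5.0]]))
```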

Conclusion

This article has introduced what SVM is. To sum up, SVM is a supervised learning algorithm which separates data into different classes through the use of a hyperplane. The chosen hyperplane is the one with the greatest margin between the hyperplane and the nearest points, which yields the greatest likelihood of accurate classification. The article has also compared SVM with Logistic Regression in several aspects. More importantly, it gives mathematical insights into SVM both graphically and through formulas. Finally, both the advantages and the drawbacks of SVM are presented, along with a variety of real-life applications.

Reference

[1] Artificial Intelligence — All in One. (2017). Lecture 13.1 — Clustering | Unsupervised Learning | Introduction — [ Andrew Ng ] [YouTube Video]. On YouTube. https://www.youtube.com/watch?v=Ev8YbxPu_bQ&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN&index=76

[2] Auria, Laura, and R. A. Moro. “Support Vector Machines (SVM) as a Technique for Solvency Analysis.” SSRN Electronic Journal, 2008, core.ac.uk/download/pdf/6302770.pdf, 10.2139/ssrn.1424949.

[3] 5. Support Vector Machines. (2021). 5. Support Vector Machines. O’Reilly Online Learning. https://learning.oreilly.com/library/view/hands-on-machine-learning/9781492032632/ch05.html#idm45022165153592

[4] sklearn.preprocessing.MinMaxScaler. (2021). Scikit-Learn. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

[5] nirajvermafcb. (2017, February 24). Support Vector Machine detail analysis. Kaggle.com; Kaggle. https://www.kaggle.com/nirajvermafcb/support-vector-machine-detail-analysis#Data-Standardisation

[6] Kernel Functions for Machine Learning Applications — César Souza. (2010, March 17). Crsouza.com. http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/#polynomial

[7] SVM | Support Vector Machine Algorithm in Machine Learning. (2017, September 12). Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/09/understaing-support-vector-machine-example-code/
