# Support Vector Machine

Support vector machines (SVMs) are powerful, flexible supervised machine learning algorithms used for both classification and regression, though they are most often applied to classification problems.

SVMs were first introduced in the 1960s and were later refined in the 1990s. Their implementation differs from that of other machine learning algorithms. They have become extremely popular because of their ability to handle multiple continuous and categorical variables.

An SVM model represents the different classes as regions of a multidimensional space separated by a hyperplane. SVM generates the hyperplane iteratively so that the classification error is minimized. The goal of SVM is to divide the dataset into classes by finding the maximum marginal hyperplane (MMH).

Example: SVM can be understood with the example that we used for the KNN classifier. Suppose we see a strange cat that also has some features of a dog. If we want a model that can accurately identify whether it is a cat or a dog, we can create one using the SVM algorithm.

We first train the model on many images of cats and dogs so that it learns the features of each, and then we test it on this strange creature. The SVM draws a decision boundary between the two classes (cat and dog) based on the extreme cases of each class, which are the support vectors. On the basis of the support vectors, it will classify the creature as a cat. Consider the diagram below:

The SVM algorithm can be used for face detection, image classification, text categorization, etc.

# Types of SVM

**SVM can be of two types:**

Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be divided into two classes by a single straight line, the data is called linearly separable, and the classifier used is called a linear SVM classifier.

Non-linear SVM: Non-linear SVM is used for data that is not linearly separable. If a dataset cannot be divided into classes by a straight line, the data is called non-linear, and the classifier used is called a non-linear SVM classifier.

# Linear SVM vs Non-Linear SVM

# Concepts in SVM:

**Support Vectors** − Data points that are closest to the hyperplane are called support vectors. A separating line will be defined with the help of these data points.

**Hyperplane** − As we can see in the above diagram, it is the decision plane or boundary that separates sets of objects belonging to different classes.

**Margin** − It may be defined as the gap between the two lines that pass through the closest data points of different classes. It can be calculated as the perpendicular distance from the hyperplane to the support vectors. A large margin is considered good and a small margin is considered bad.

The main goal of SVM is to divide the dataset into classes by finding a maximum marginal hyperplane (MMH). This can be done in the following two steps −

First, SVM generates hyperplanes iteratively that segregate the classes in the best way.

Then, it chooses the hyperplane that separates the classes correctly with the largest margin.
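
The ideas of support vectors and margin can be inspected on a fitted model. The following is a minimal sketch using scikit-learn's `SVC` with a linear kernel; the six data points are invented for illustration, and the very large `C` is an assumption chosen to approximate a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable 2-D data (made up for this illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations,
# which approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane w.x + b = 0
b = clf.intercept_[0]   # offset of the hyperplane

# The distance between the two margin lines is 2 / ||w||.
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)  # geometrically 5 / sqrt(2) for this data
```

Only the points nearest the boundary appear in `support_vectors_`; moving any other point slightly would leave the hyperplane unchanged.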

# SVM Kernels:

The SVM algorithm is implemented with a kernel, a function that transforms the input data space into the required form. SVM uses a technique called the kernel trick, in which the kernel takes a low-dimensional input space and implicitly maps it into a higher-dimensional space. In simple words, the kernel converts non-separable problems into separable problems by adding more dimensions to them. This makes SVM more powerful, flexible, and accurate. The following are some of the types of kernels used by SVM.
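
To see how adding a dimension can make a problem separable, consider two concentric rings of points. No straight line separates them in 2-D, but an extra feature equal to the squared distance from the origin does. The sketch below builds that feature map explicitly for illustration only; the `ring` and `lift` helpers are hypothetical names, and a real kernel obtains the same effect without computing the mapped points.

```python
import math

def ring(radius, n=8):
    """Return n points evenly spaced on a circle of the given radius."""
    return [(radius * math.cos(2 * math.pi * k / n),
             radius * math.sin(2 * math.pi * k / n)) for k in range(n)]

inner = ring(1.0)  # class 0: small ring
outer = ring(3.0)  # class 1: large ring surrounding the small one

def lift(point):
    """Explicit feature map (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    x1, x2 = point
    return (x1, x2, x1 * x1 + x2 * x2)

# In the lifted 3-D space, the third coordinate is about 1 for the
# inner ring and about 9 for the outer ring, so the plane z = 4
# separates the classes.
inner_z = [lift(p)[2] for p in inner]
outer_z = [lift(p)[2] for p in outer]
print(max(inner_z) < 4.0 < min(outer_z))  # True
```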

**Linear Kernel:**

It is the dot product between any two observations. The formula of the linear kernel is as below −

$$K(x, x_i) = \sum_j x_j \, x_{ij}$$

From the above formula, we can see that the product between two vectors, say $x$ and $x_i$, is the sum of the products of each pair of input values.
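
As a small worked example, the linear kernel can be written in a few lines of plain Python. The function name `linear_kernel` and the sample vectors are chosen here for illustration.

```python
def linear_kernel(x, xi):
    """Sum of the products of each pair of input values (dot product)."""
    return sum(a * b for a, b in zip(x, xi))

x = [1.0, 2.0, 3.0]
xi = [4.0, 5.0, 6.0]

print(linear_kernel(x, xi))  # 1*4 + 2*5 + 3*6 = 32.0
```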

**Polynomial Kernel:**

It is a more generalized form of the linear kernel and can distinguish curved or nonlinear input spaces. Following is the formula for the polynomial kernel −

$$K(x, x_i) = \left(1 + \sum_j x_j \, x_{ij}\right)^d$$

Here $d$ is the degree of the polynomial, which we need to specify manually in the learning algorithm.
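
The polynomial kernel can be sketched the same way. This example assumes the degree applies to the whole term `1 + sum(x * xi)`, which is the usual form of the kernel; the function name and sample values are illustrative.

```python
def polynomial_kernel(x, xi, d=2):
    """Polynomial kernel: (1 + dot(x, xi)) raised to the degree d."""
    dot = sum(a * b for a, b in zip(x, xi))
    return (1 + dot) ** d

x = [1.0, 2.0]
xi = [3.0, 4.0]

# dot(x, xi) = 1*3 + 2*4 = 11, so the kernel value is (1 + 11)^2.
print(polynomial_kernel(x, xi, d=2))  # 144.0

# With d = 1 the kernel reduces to the linear kernel plus one.
print(polynomial_kernel(x, xi, d=1))  # 12.0
```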

**Gaussian kernel:**

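It is a general-purpose kernel, used when there is no prior knowledge about the data. The equation is:

$$K(x, x_i) = \exp\left(-\frac{\lVert x - x_i \rVert^2}{2\sigma^2}\right)$$

Here $\sigma$ controls the width of the kernel. With $\gamma = 1/(2\sigma^2)$ this is the same function as the RBF kernel described next.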

**Radial Basis Function (RBF) Kernel:**

The RBF kernel, the one most commonly used in SVM classification, maps the input space into an infinite-dimensional space. It is a general-purpose kernel, used when there is no prior knowledge about the data.

The following formula explains it mathematically −

$$K(x, x_i) = \exp\left(-\gamma \sum_j (x_j - x_{ij})^2\right)$$

Here, gamma must be positive, and values between 0 and 1 are typical. We need to specify it manually in the learning algorithm. A reasonable starting value of gamma is 0.1.
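
The following sketch computes the RBF kernel directly from the formula and shows how gamma changes the result. The function name `rbf_kernel` and the sample points are illustrative, not taken from any particular library.

```python
import math

def rbf_kernel(x, xi, gamma=0.1):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, xi))
    return math.exp(-gamma * sq_dist)

x = [1.0, 2.0]
near = [1.0, 2.5]   # close to x
far = [6.0, 9.0]    # far from x

print(rbf_kernel(x, x))     # identical points give 1.0
print(rbf_kernel(x, near))  # close points give a value near 1
print(rbf_kernel(x, far))   # distant points give a value near 0

# A larger gamma makes the similarity fall off faster with distance.
print(rbf_kernel(x, near, gamma=1.0) < rbf_kernel(x, near, gamma=0.1))  # True
```

The kernel value behaves like a similarity score: a small gamma gives each training point a wide influence, while a large gamma makes the decision boundary follow individual points more closely.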

Just as we implemented SVM for linearly separable data, we can implement it in Python for data that is not linearly separable. This is done by using kernels.
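
A minimal sketch of this with scikit-learn is shown below. It assumes a synthetic dataset from `make_circles` (one class surrounded by the other), and the sample size, noise level, and train/test split are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data that no straight line can separate: an inner
# circle of one class surrounded by a ring of the other class.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the same classifier with a linear and an RBF kernel and
# compare accuracy on the held-out test set.
scores = {}
for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    scores[kernel] = clf.score(X_test, y_test)
    print(kernel, scores[kernel])
```

On data of this shape the RBF kernel should score far higher than the linear kernel, which can do little better than guessing.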

**Laplace RBF kernel**

It is a general-purpose kernel, used when there is no prior knowledge about the data.

The equation is:

$$K(x, x_i) = \exp\left(-\frac{\lVert x - x_i \rVert}{\sigma}\right)$$

**Hyperbolic tangent kernel:**

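It is related to neural networks, where the hyperbolic tangent is a common activation function. The equation is:

$$K(x, x_i) = \tanh(\kappa \, x \cdot x_i + c)$$

Here $\kappa$ and $c$ are parameters that we need to specify.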

**Sigmoid kernel:**

We can use it as a proxy for neural networks. The equation is:

$$K(x, x_i) = \tanh(\alpha \, x^{T} x_i + c)$$

**Bessel function of the first kind Kernel:**

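We can use it to remove the cross term in mathematical functions. The equation is:

$$K(x, x_i) = \frac{J_{v+1}(\sigma \lVert x - x_i \rVert)}{\lVert x - x_i \rVert^{-n(v+1)}}$$

Here $J$ is the Bessel function of the first kind.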

**ANOVA radial basis kernel:**

We can use it in regression problems. The equation is:

$$K(x, x_i) = \sum_{k=1}^{n} \exp\left(-\sigma (x^{k} - x_i^{k})^2\right)^d$$

# Pros and Cons associated with SVM:

**Pros:**

- It works really well when there is a clear margin of separation.
- It is effective in high-dimensional spaces.
- It is effective in cases where the number of dimensions is greater than the number of samples.
- It uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

**Cons:**

- It doesn't perform well on large datasets because the required training time is high.
- It also doesn't perform very well when the dataset has more noise, i.e. when the target classes overlap.
- SVM doesn't directly provide probability estimates; these are calculated using an expensive five-fold cross-validation. This option is available in the related SVC class of the Python scikit-learn library through its `probability` parameter.
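
As a sketch of the last point, the example below asks scikit-learn's `SVC` for probability estimates by setting `probability=True`. The synthetic dataset and its size are assumptions made for illustration, and the option may behave differently across scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Small synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# probability=True triggers the extra internal cross-validation,
# which makes fitting noticeably slower than a plain SVC.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)

proba = clf.predict_proba(X[:3])        # one row per sample, one column per class
scores = clf.decision_function(X[:3])   # raw signed distances, always available

print(proba)
print(scores)
```

When only a ranking or a class label is needed, `decision_function` or `predict` avoids the extra cost.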

# Metrics for SVM:

**1. Confusion Matrix:**

A confusion matrix is an N × N matrix, where N is the number of classes being predicted. For the problem at hand, we have N = 2, and hence we get a 2 × 2 matrix. Here are a few definitions you need to remember for a confusion matrix:

- Accuracy: the proportion of all predictions that were correct.
- Positive Predictive Value or Precision: the proportion of predicted positive cases that are actually positive.
- Negative Predictive Value: the proportion of predicted negative cases that are actually negative.
- Sensitivity or Recall: the proportion of actual positive cases that are correctly identified.
- Specificity: the proportion of actual negative cases that are correctly identified.
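
These definitions can be checked on a small example. The labels below are invented for illustration; the code counts the four cells of a 2 × 2 confusion matrix and derives each metric from them.

```python
# Invented labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

accuracy = (tp + tn) / len(pairs)   # correct predictions / all predictions
precision = tp / (tp + fp)          # of predicted positives, how many are real
npv = tn / (tn + fn)                # of predicted negatives, how many are real
recall = tp / (tp + fn)             # of actual positives, how many were found
specificity = tn / (tn + fp)        # of actual negatives, how many were found

print(tp, fn, tn, fp)               # 3 1 4 2
print(accuracy, precision, npv, recall, specificity)
```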

**2. F1 Score:**

This is the harmonic mean of Precision and Recall, and it gives a better measure of the incorrectly classified cases than the Accuracy metric.

We use the harmonic mean because it penalizes extreme values: if either Precision or Recall is low, the F1 score is low as well.

To summarize the differences between the F1-score and accuracy:

- Accuracy is used when the True Positives and True Negatives are more important, while the F1-score is used when the False Negatives and False Positives are crucial.
- Accuracy can be used when the class distribution is similar, while the F1-score is a better metric when there are imbalanced classes, as in the above case.
- In most real-life classification problems, imbalanced class distribution exists, and thus the F1-score is a better metric to evaluate our model.
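
The difference shows up clearly on an imbalanced example. The counts below are invented: 100 samples with only 5 positives, and a classifier that finds just one of them. The helper name `f1_from_counts` is hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented counts: 5 actual positives, 95 actual negatives.
tp, fn = 1, 4    # only one positive is detected
tn, fp = 95, 0   # every negative is classified correctly

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = f1_from_counts(tp, fp, fn)

print(accuracy)  # 0.96, which looks excellent
print(f1)        # about 0.33, which exposes the missed positives
```

Precision is 1.0 here but recall is only 0.2, and the harmonic mean stays close to the lower of the two.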

# Applications of SVM

- Sentiment analysis
- Spam detection
- Handwritten digit recognition
- Image recognition challenges

**Reference:** TutorialsPoint, Analytics Vidhya, DataFlair