# A Beginner’s Introduction to SVM

*While many classifiers exist that can classify linearly separable data such as logistic regression, **Support Vector Machines (SVM)** can handle highly non-linear problems using a **kernel trick** which implicitly maps the input vectors to higher-dimensional feature spaces.*

Let’s get into the depth of this in the next few minutes!

This transformation we were talking about rearranges the dataset in such a way that it becomes *linearly separable.*

In this article, we are going to look at how SVM works, learn about kernel functions, hyperparameters, and the pros and cons of SVM, along with some real-life applications. Hope you make something out of it :)

**Support Vector Machines (SVMs)**, also known as support vector networks, are a family of extremely powerful models which use kernel-based learning methods and can be used in *classification* and *regression* problems.

They aim at finding **decision boundaries** that separate observations with differing class memberships. In other words, SVM is a *discriminative classifier* formally defined by a *separating hyperplane.*

## Kernel Functions

*You must be wondering what the kernel functions are — Let us understand what these are in a moment:*

The figure shown below illustrates the idea with a simple 1-dimensional example. Given the points shown, no single vertical line can separate the dataset into its two classes.

Now, if we consider a 2-dimensional representation, as shown in the figure below, there is a hyperplane *(an arbitrary line in 2 dimensions)* that separates the red and blue points; this is exactly the kind of boundary a Support Vector Machine can find.

As we keep increasing the dimension of the space, the ability to separate the data eventually increases. The mapping *x → (x, x²)* used here is a feature mapping; the kernel function corresponds to computing inner products in that higher-dimensional space.
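As a quick sketch of this idea (using made-up toy points and an arbitrary threshold of 2.5), the mapping *x → (x, x²)* turns a 1-D dataset that no single threshold can split into a 2-D one that a horizontal line separates:

```python
import numpy as np

# 1-D points: the two classes interleave, so no single threshold separates them
x = np.array([-3.0, -2.0, -1.0, 1.0, 2.0, 3.0])
y = np.array([1, 1, -1, -1, 1, 1])  # outer points vs. inner points

# Feature mapping x -> (x, x^2): the squared term lifts the data into 2-D
features = np.column_stack([x, x ** 2])

# In the new space, the horizontal line x2 = 2.5 separates the classes
predictions = np.where(features[:, 1] > 2.5, 1, -1)
print(predictions)  # matches y exactly
```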

In the case of a growing dimensional space, the computations become more complex, and the **kernel trick** needs to be applied to perform these computations cheaply.

Hyperplanes are subspaces of dimension one less than the space containing the input data, and they separate the data into classes. If a space is 3-dimensional, its hyperplanes are 2-dimensional planes. If a space is 2-dimensional, its hyperplanes are 1-dimensional lines.

## Working

Consider another scenario, where it is clear that a full separation of the green and red objects would require a curve (which is more complex than a line). Classification tasks based on drawing separating lines to distinguish between objects of different class memberships are known as *hyperplane classifiers.*

Support Vector Machines are particularly suited to handle such tasks.

*How do we know if we’re dealing with the right hyperplane?*

*Now, let us represent the new plane by a linear equation as:*

`f(x) = ax + b`

Let us consider that this equation delivers all *values ≥ 1* for the *green triangle class* and *≤ -1* for the *gold star class.* The distance of this plane from the closest points in both classes is then at least one *(the modulus is one).*

`f(x) ≥ 1` for triangles and `f(x) ≤ -1` for stars, with `|f(x)| = 1` at the closest points on either side (the support vectors).

*The distance between the hyperplane and the point can be computed using the following equation:*

`M1 = |f(x)| / ||a|| = 1 / ||a||`

*The total margin is:*

`1 / ||a|| + 1 / ||a|| = 2 / ||a||`
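A minimal numerical sketch of the margin formulas above, using an arbitrary example weight vector:

```python
import numpy as np

# Example weight vector a and bias b defining the hyperplane f(x) = a.x + b
a = np.array([3.0, 4.0])   # ||a|| = 5
b = -2.0

# Distance from the hyperplane to a point where |f(x)| = 1
m1 = 1.0 / np.linalg.norm(a)

# Total margin: one unit of |f| on each side of the hyperplane
margin = 2.0 / np.linalg.norm(a)

print(m1, margin)  # 0.2 0.4
```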

In order to maximize the separability, we will have to **minimize the ||a||** value, which maximizes the margin 2 / ||a||. The vector *a* is known as the *weight vector.*

Minimizing the weight value is a non-linear optimization task. One of the methods is to use the **Karush-Kuhn-Tucker (KKT)** conditions, using the *Lagrange multipliers λᵢ.*

## Large Margin Intuition

In logistic regression, the output of the linear function is taken and the value is squashed within the range of [0,1] using the sigmoid function. If the value is greater than a threshold value, say 0.5, label 1 is assigned else label 0.
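As a small illustration of the squashing step (the 0.5 threshold is the one mentioned above; the scores are made up):

```python
import numpy as np

def sigmoid(z):
    """Squash a linear score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.0, 3.0])
labels = (sigmoid(scores) > 0.5).astype(int)  # threshold at 0.5
print(labels)  # [0 0 1]
```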

For more on Logistic Regression, see *A Comprehensive Guide to Logistic Regression*.

In the case of *Support Vector Machines*, the output of the linear function is taken directly: if the output is greater than 1, we identify the point with one class, and if the output is less than -1, it is identified with the other class.

Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1, 1]) which acts as the margin.

## Cost Function and Gradient Updates

In the SVM algorithm, we maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is called the *hinge loss.*

The hinge loss can be written as `c(x, y, f(x)) = max(0, 1 - y · f(x))`, where `y` is the true label in {-1, +1} and `f(x)` is the predicted score.

If the predicted value and the actual value are of the same sign (and the point lies outside the margin), the cost is 0.
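A minimal sketch of the hinge loss described above, with made-up scores:

```python
import numpy as np

def hinge_loss(y, fx):
    """Hinge loss max(0, 1 - y*f(x)) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * fx)

# Correct side of the margin (y*f(x) >= 1): zero cost
print(hinge_loss(1, 2.5))    # 0.0
# Correct sign but inside the margin: small cost
print(hinge_loss(1, 0.5))    # 0.5
# Wrong sign: cost grows linearly with the error
print(hinge_loss(-1, 2.0))   # 3.0
```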

If not, we calculate the loss value. We also add a **regularization parameter** to the cost function. The objective of the regularization parameter is to balance margin maximization and loss. After adding the regularization parameter, the cost function looks as below:

`min_w λ||w||² + Σᵢ max(0, 1 - yᵢ ⟨w, xᵢ⟩)`

Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using gradients, we can update our weights.

When there is no misclassification, i.e. our model correctly predicts the class of our data point, we only have to update the gradient from the regularization parameter.

When there is a misclassification, i.e. our model makes a mistake on the prediction of the class of our data point, we include the loss along with the regularization parameter to perform the gradient update.
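The two update rules above can be sketched as a tiny stochastic-gradient trainer. This is an illustrative toy (the data, learning rate, regularization strength, and epoch count are all made up), not a production implementation:

```python
import numpy as np

def train_svm_sgd(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Minimal linear-SVM sketch: SGD on lambda*||w||^2 + hinge loss."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            if y[i] * np.dot(w, X[i]) >= 1:
                # Correct with margin: only the regularizer contributes
                w -= lr * (2 * lam * w)
            else:
                # Misclassified or inside the margin: include the hinge term
                w -= lr * (2 * lam * w - y[i] * X[i])
    return w

# Toy separable data with an appended bias feature
X = np.array([[1.0, 1.0], [2.0, 1.0], [-1.0, 1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_svm_sgd(X, y)
print(np.sign(X @ w))  # should recover the labels [1, 1, -1, -1]
```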

*What if a Linearly Separable Hyperplane doesn’t exist?*

Support Vector Machines can help you find a separating hyperplane, but *only if it exists.* There are certain cases when it is not possible to define such a hyperplane; this happens *due to **noise** in the data.* Another reason can be a **non-linear boundary**.

The following first graph depicts noise and the second one shows a non-linear boundary.

For such problems which arise due to noise in the data, the best way is to reduce the margin itself and introduce *slack* variables.

The non-linear boundary problem can be solved if we introduce a *kernel.* Some of the kernel functions that can be introduced are mentioned below:

A **radial basis function (RBF)** is a *real-valued function* whose value depends only on the distance between the input and some fixed point.

In machine learning, the radial basis function kernel, or RBF kernel, is a popular kernel function used in various kernelized learning algorithms.

The RBF kernel on two samples **x** and **x'**, represented as feature vectors in some **input space**, is defined as:

`K(x, x') = exp(-||x - x'||² / (2σ²)) = exp(-γ ||x - x'||²)`
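A minimal sketch of the RBF kernel in its γ form, with made-up input vectors and an arbitrary γ:

```python
import numpy as np

def rbf_kernel(x, x_prime, gamma=0.5):
    """RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

x = np.array([1.0, 2.0])
x_prime = np.array([1.0, 2.0])
print(rbf_kernel(x, x_prime))      # 1.0 -- identical points have maximal similarity

far = np.array([10.0, -4.0])
print(rbf_kernel(x, far) < 1e-6)   # True -- distant points have similarity near 0
```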

## Tuning Parameters/Hyperparameters

*You do not need to tune parameters in all cases; there are inbuilt defaults in the Scikit-learn toolkit which can be used. Nevertheless, a few parameters can be tuned:*

- *Kernels*
- *Regularization*
- *Gamma*

A lower value of gamma creates a looser fit of the training dataset. On the other hand, a high value of gamma makes the model fit the training data more closely, which can lead to overfitting.

*The ‘C’ and ‘Gamma’ hyperparameters*

**C** is the parameter for the soft margin cost function, which controls the influence of each individual support-vector. This process involves trading error penalty for stability.

A small C tends to emphasize the margin while ignoring the outliers in the training data (soft margin), while a large C may tend to overfit the training data (hard margin).

The **gamma** parameter is inversely related to the standard deviation of the RBF kernel (the Gaussian function), which is used as a similarity measure between two points.

A small gamma value defines a Gaussian function with a large variance. On the other hand, a large gamma value defines a Gaussian function with a small variance.
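In practice, C and gamma are often tuned together. Below is a sketch using Scikit-learn's `GridSearchCV` on the built-in iris dataset; the grid values are arbitrary choices, not recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Search a small grid of C and gamma values with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the best (C, gamma) pair found
print(search.best_score_)    # its mean cross-validated accuracy
```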

## Applications

- *Face detection*
- *Text and hypertext categorization*
- *Classification of images*
- *Bioinformatics*
- *Protein fold and remote homology detection*
- *Handwriting recognition*
- *Generalized predictive control (GPC)*

## Advantages

- *SVMs are very efficient at handling data in high-dimensional spaces*
- *They are memory-efficient because the decision function uses only a subset of the training points (the support vectors)*
- *They are very effective when the number of dimensions is greater than the number of observations*

## Disadvantages

- *It is not suitable for very large datasets*
- *It is not very efficient for datasets that have many outliers*
- *It doesn't directly provide probability estimates*


Hope you enjoyed and made the most out of this article! Stay tuned for my upcoming blogs! Make sure to **clap** and **follow** if you find my content helpful/informative!