Support Vector Machine
What is Support Vector Machine?
“Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both classification and regression problems. However, it is mostly used for classification.
It can provide more complex models that can go beyond linear decision boundaries.
With real-world data, many classification problems aren’t easy: the classes are often located in feature space in such a way that a linear hyperplane can’t act as an effective classifier.
In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a coordinate. Then, we perform classification by finding the hyper-plane that differentiates the two classes well.
Identify the right hyper-plane
But we can have multiple lines as shown below that can separate the data points. How do we choose the best line that separates the dataset into two classes?
Select the hyper-plane that segregates the two classes best. In this scenario, hyper-plane “B” performs this job excellently.
SVM finds the hyperplane that maximizes the margin between the support vectors of the two classes. Hyperplanes are decision boundaries that classify the dataset while maximizing this margin.
What are support vectors?
Support vectors are the data points in the dataset that are nearest to the hyperplane. Removing support vectors will alter the hyperplane separating two classes. Support vectors are critical elements of the dataset as SVM is built on them.
A support vector machine has two main objectives (a short sketch follows this list):
· Find a hyperplane (line) that linearly separates the data points into two classes
· Maximize the margin between the support vectors of the two classes
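As a minimal sketch of these two objectives using scikit-learn: the toy points and the C value below are made up for illustration; the model finds the separating hyperplane and keeps the points closest to it as support vectors.

import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy dataset (two features, two classes)
X = np.array([[1, 2], [2, 3], [3, 3], [6, 5], [7, 8], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C keeps the margin "hard" on separable data
model = SVC(kernel='linear', C=1000)
model.fit(X, y)

print('Support vectors:\n', model.support_vectors_)  # points closest to the hyperplane
print('Hyperplane: w =', model.coef_, ', b =', model.intercept_)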
Let us try to understand this with a few examples:
Case 1: A simple 1-dimensional classification problem for the classifier where the classes are linearly separable
Here, the line is engineered such that the maximum margin boundary happens to be at x=0.
This line separates the two classes of data. A support vector machine finds a hyperplane (line) that linearly separates the data points into two classes.
A harder 1-dimensional classification problem for the classifier where the classes are no longer linearly separable
A simple linear decision boundary just doesn’t have enough expressive power to classify all these points correctly.
So, what can be done here?
One very powerful idea is to transform the input data from a 1-dimensional space to a 2-dimensional space, where the second dimension is simply the square of the first. This transformation makes it possible to solve the problem with a linear classifier.
If we take the inverse of the transformation we just applied and bring the data points back to our original input space, we can see that the linear decision boundary in the 2-dimensional space corresponds to the two points where a parabola crosses the x-axis.
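A small sketch of this idea, assuming some illustrative 1-dimensional points: each value x is mapped to the pair (x, x²), after which a plain linear SVM separates the two classes.

import numpy as np
from sklearn.svm import SVC

# 1-D points: one class sits between the other, so no single threshold separates them
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])

# Add a second feature equal to the square of the first one
X_2d = np.column_stack([x, x ** 2])

clf = SVC(kernel='linear', C=1000).fit(X_2d, y)
print(clf.predict(X_2d))   # all points now classified correctly
print(clf.score(X_2d, y))  # 1.0 on this toy data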
Now, let’s move on from a 1-dimensional problem to a 2-dimensional problem.
Case 2: Here, we have two classes of points, each point described by two features, x0 and x1
Again, it looks impossible for a linear classifier, which in 2-dimensional space is a straight line, to separate these points with any reasonable accuracy.
Now, we map the 2-dimensional problem onto a 3-dimensional feature space to make it linearly separable.
This transformation acts to shape the points into a paraboloid around (0,0).
Now, the white points, since they are close to (0,0), get mapped to points with higher vertical Z values (the new third feature) that are close to 1, while the black points, which are further from (0,0), get mapped to points with Z values that are close to 0 or even negative.
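The exact transformation behind the figure isn’t given here; one mapping consistent with the description (points near (0,0) get a third feature close to 1, distant points get values near 0 or negative) is z = 1 − (x0² + x1²). The sketch below applies that assumed mapping to a synthetic two-ring dataset:

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles

# Two concentric rings: impossible for a straight line in 2-D
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Third feature: an (assumed) inverted paraboloid centred at (0, 0)
z = 1 - (X[:, 0] ** 2 + X[:, 1] ** 2)
X_3d = np.column_stack([X, z])

clf = SVC(kernel='linear', C=10).fit(X_3d, y)
print('Training accuracy in 3-D:', clf.score(X_3d, y))  # close to 1.0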
Inverse of the transformation:
There are lots of different possible transformations we could apply to the data. And there are different kernels available for different transformations.
Here, we are going to focus mainly on what’s called the radial basis function (RBF) kernel and the polynomial kernel.
What is a Kernel Function?
The kernel function in an SVM tells us, given two points in the original input space, what their similarity is in the new feature space.
Radial Basis Function (Gaussian) Kernel: The similarity between two points in the transformed feature space is an exponentially decaying function of the distance between the vectors in the original input space, as given by the formula K(x, x') = exp(-||x - x'||² / (2σ²)).
So, just as we saw with the simple 1D and 2D examples earlier, the kernelized support vector machine tries to find the decision boundary with maximum margin between classes using a linear classifier in the transformed feature space and not in the original input space.
The linear decision boundary learned in the transformed feature space by the linear SVM corresponds to a non-linear decision boundary in the original input space.
Here ‘K’ stands for the RBF kernel. The x and x’ are vectors, where x’ is a “point of reference” vector. The exp() function means the mathematical constant ‘e’ raised to a power. The ||·|| term is the Euclidean distance, and sigma (σ) is a free parameter.
The gamma parameter is inversely related to the standard deviation (σ) of the RBF kernel (the Gaussian function), which is used as a similarity measure between two points; specifically, γ = 1/(2σ²), so the kernel can also be written as K(x, x') = exp(-γ||x - x'||²).
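As a quick numerical check of this formula, the snippet below evaluates K(x, x') = exp(-γ||x - x'||²) by hand for two arbitrary points and compares it with scikit-learn’s rbf_kernel:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

x = np.array([[1.0, 2.0]])
x_prime = np.array([[2.0, 4.0]])
gamma = 0.5  # arbitrary value for illustration

# Manual evaluation of K(x, x') = exp(-gamma * ||x - x'||^2)
manual = np.exp(-gamma * np.sum((x - x_prime) ** 2))

# scikit-learn's implementation of the same kernel
library = rbf_kernel(x, x_prime, gamma=gamma)[0, 0]

print(manual, library)  # both ~0.082, identical up to floating point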
Gamma (γ) defines how far the influence of a single training example reaches.
· A small gamma means a large similarity radius, so points farther apart are considered similar. This results in more points being grouped together and smoother decision boundaries.
· Larger values of gamma mean that points must be very close to be considered similar. This results in more complex, tightly constrained decision boundaries.
Regularization parameter (C): The regularization parameter tells the SVM optimization how much you want to avoid misclassifying each training example.
For larger values of ‘C’, a smaller margin will be accepted if the decision function is better at classifying all training points correctly. A lower ‘C’ will encourage a larger margin, therefore a simpler decision function, at the cost of training accuracy.
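A rough way to see both effects is to fit an RBF-kernel SVM on a synthetic dataset with a few arbitrary (gamma, C) combinations and compare the training accuracy and the number of support vectors kept:

from sklearn.svm import SVC
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Arbitrary (gamma, C) pairs: small values -> smoother model, large values -> tighter fit
for gamma, C in [(0.1, 0.1), (0.1, 100), (10, 0.1), (10, 100)]:
    clf = SVC(kernel='rbf', gamma=gamma, C=C).fit(X, y)
    print(f'gamma={gamma:>5}, C={C:>5} -> '
          f'train accuracy={clf.score(X, y):.2f}, '
          f'support vectors={len(clf.support_vectors_)}')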
Polynomial Kernel: It takes an additional parameter, ‘degree’, which controls the model complexity and the computational cost of the transformation.
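A short sketch of the degree parameter’s effect, again on a synthetic dataset (coef0=1 is set so that higher degrees also include the lower-order terms):

from sklearn.svm import SVC
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Higher degree -> more flexible (and more expensive) decision boundary
for degree in [1, 2, 3]:
    clf = SVC(kernel='poly', degree=degree, coef0=1, C=1).fit(X, y)
    print(f'degree={degree} -> train accuracy={clf.score(X, y):.2f}')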
Python Implementation:
Here, I have used a linear kernel, as I was working on a text classification problem. Most text classification problems are linearly separable.
I also used scikit-learn’s GridSearchCV to select the best value of gamma.
Model: SVM
"""to select best value of gamma using gridsearch """
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_scoreclf = SVC(kernel='linear',C=0.1)
grid_values = {'gamma':[0.001,0.01,0.05,0.1,1,10,100]}
grid_clf = GridSearchCV(clf,param_grid=grid_values)
grid_clf.fit(X_train, y_train)
y_decision_auc=grid_clf.decision_function(X_test)print('Review')
print('Test set AUC :',roc_auc_score(y_test,y_decision_auc))
print('Grid best parameter(max.AUC):',grid_clf.best_params_)
print('Grid best score(AUC):',grid_clf.best_score_)
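As a possible extension (not part of the original implementation): if the classes were not linearly separable, the same workflow could tune an RBF kernel over both gamma and C, reusing the same X_train / y_train split assumed above. The grid values are arbitrary starting points.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# RBF kernel: here gamma actually changes the model, so tuning it is meaningful
rbf_clf = SVC(kernel='rbf')
rbf_grid = {'gamma': [0.001, 0.01, 0.1, 1, 10],
            'C': [0.1, 1, 10, 100]}
rbf_search = GridSearchCV(rbf_clf, param_grid=rbf_grid, scoring='roc_auc')
rbf_search.fit(X_train, y_train)
print('Best parameters:', rbf_search.best_params_)
print('Best cross-validated AUC:', rbf_search.best_score_)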
Pros:
1. Can perform well on a range of datasets.
2. Versatile: Different kernel functions can be specified or custom kernels can be defined for specific data types.
3. Works well for both low and high dimensional data.
4. Robust against outliers (controlled using C).
Cons:
1. Efficiency (run-time speed and memory usage) decreases as the training set size increases.
2. Needs careful normalization of input data and parameter tuning.
3. Selecting the right kernel for non-linear boundaries can be tricky.
4. Difficult to interpret why a prediction was made.
Thank you for reading…