Support Vector Machines (part 1)
Linear Model
Introduction
In machine learning, support vector machines (SVMs; also called support vector networks) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for both classification and regression analysis, though mostly for classification. An SVM works by establishing a hyperplane between the different classes of data. If the data cannot be linearly separated, the SVM projects it into a higher dimension in which it can be separated, using what is known as the kernel trick.
SVM Algorithm (Linear Model)
Some supervised machine learning algorithms require transforming labels into numbers, and SVM is one of them. This algorithm requires that the positive label (spam, if we take email spam checking as an example) has the numeric value of +1 (one), and the negative label (not_spam) has the value of −1 (minus one).
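A minimal sketch of this label encoding (the label names spam/not_spam and the NumPy approach are illustrative assumptions, not a requirement of any particular SVM library):

```python
import numpy as np

# Hypothetical string labels for an email spam-checking dataset.
labels = np.array(["spam", "not_spam", "spam", "not_spam"])

# Encode the positive class as +1 and the negative class as -1,
# as the SVM formulation below expects.
y = np.where(labels == "spam", 1, -1)
print(y)  # [ 1 -1  1 -1]
```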
SVM creates a hyperplane, called the decision boundary, that separates examples with positive labels from examples with negative labels.
The equation of the hyperplane is given by two parameters, a real-valued vector 𝑤 of the same dimensionality as our input feature vector 𝑥, and a real number 𝑏, like this:

𝑤𝑥 − 𝑏 = 0

- Where the expression (𝑤𝑥) means the dot product 𝑤𝑥 = 𝑤(1)𝑥(1) + 𝑤(2)𝑥(2) + … + 𝑤(𝑛)𝑥(𝑛).
- (𝑥) is a feature vector, (𝑤) is the weight vector, the scalar (𝑏) is a bias, and (𝑛) is the number of dimensions of the feature vector (𝑥).
The predicted label for an input feature vector (𝑥) is given like this:

𝑦 = sign(𝑤𝑥 − 𝑏)

- Where sign is a mathematical operator that takes any value as input and returns +1 if the input is a positive number or −1 if the input is a negative number.
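A minimal sketch of this prediction rule in NumPy (the values of 𝑤, 𝑏, and the feature vectors are made up purely for illustration):

```python
import numpy as np

def predict(w, b, x):
    """Predict +1 or -1 for a feature vector x using y = sign(wx - b)."""
    return 1 if np.dot(w, x) - b > 0 else -1

# Illustrative parameters for a two-dimensional feature space.
w = np.array([0.5, -1.0])
b = 0.25

print(predict(w, b, np.array([2.0, 0.5])))  #  1  (wx - b =  0.25 > 0)
print(predict(w, b, np.array([0.0, 1.0])))  # -1  (wx - b = -1.25 < 0)
```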
So, we can express the above equation as the following two constraints:

𝑤𝑥𝑖 − 𝑏 ≥ +1 if 𝑦𝑖 = +1
𝑤𝑥𝑖 − 𝑏 ≤ −1 if 𝑦𝑖 = −1
Hence, for the two-dimensional feature vectors shown below, we can use these constraints to define two additional hyperplanes, whose equations are:

𝑤𝑥 − 𝑏 = +1
𝑤𝑥 − 𝑏 = −1

These hyperplanes pass through our support vectors. (Both hyperplanes are parallel, as shown below.)
Fig. 1: An example of an SVM model for two-dimensional feature vectors.
SVMs seek to maximize the distance between the closest examples of the two classes, which are called support vectors. This distance, 2/||𝑤||, is referred to as the margin between the support vectors.
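As a short justification of the 2/||𝑤|| expression (a standard derivation, not quoted from the article itself), the margin is the distance between the two parallel hyperplanes defined above:

```latex
% For any point x0 lying on the hyperplane wx - b = -1, its distance to
% the hyperplane wx - b = +1 is given by the point-to-hyperplane formula:
%   d = |w x0 - b - 1| / ||w|| = |-1 - 1| / ||w|| = 2 / ||w||
% Hence the margin between the two support-vector hyperplanes is:
\[
  \text{margin} = \frac{2}{\lVert w \rVert}
\]
```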
The goal of the learning algorithm — SVM in this case — is to leverage the dataset and find the optimal values of (𝑤) and (𝑏) to be used in the prediction process. A large margin contributes to better generalization, that is, how well the model will classify new examples in the future.
To achieve that, we need to minimize the Euclidean norm of (𝑤), denoted by ||𝑤|| and given by:

||𝑤|| = √( (𝑤(1))² + (𝑤(2))² + … + (𝑤(𝑛))² )

(The smaller the norm ||𝑤||, the larger the distance between these two hyperplanes.)
So, the optimization problem that we want the machine to solve, so that the hyperplane is equally distant from the closest examples of each class, looks like this:

Minimize ||𝑤|| subject to 𝑦𝑖(𝑤𝑥𝑖 − 𝑏) ≥ 1, for 𝑖 = 1, …, 𝑁

- Where the expression 𝑦𝑖(𝑤𝑥𝑖 − 𝑏) ≥ 1 is just a compact way to write the above two constraints.
Minimizing ||𝑤|| is equivalent to minimizing (1/2)||𝑤||², and the use of this term makes it possible to perform quadratic programming optimization later on (the optimization process becomes easier). The optimization problem for SVM, therefore, looks like this:

Minimize (1/2)||𝑤||² subject to 𝑦𝑖(𝑤𝑥𝑖 − 𝑏) ≥ 1, for 𝑖 = 1, …, 𝑁

Sticking with this formulation is called hard-margin SVM.
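As a minimal sketch of this hard-margin quadratic program (the toy dataset and the use of the CVXPY library are illustrative assumptions, not part of the original article):

```python
import cvxpy as cp
import numpy as np

# Tiny, linearly separable toy dataset (illustrative assumption).
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
b = cp.Variable()

# Hard-margin SVM: minimize (1/2)||w||^2 subject to y_i (w x_i - b) >= 1.
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w - b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```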
Tip: To solve this constrained problem and get the weight vector 𝑤, we need to maximize the dual-variables Lagrangian 𝐿𝑑(α):

𝐿𝑑(α) = Σ𝑖 α𝑖 − (1/2) Σ𝑖 Σ𝑗 α𝑖 α𝑗 𝑦𝑖 𝑦𝑗 (𝑥𝑖 · 𝑥𝑗), subject to α𝑖 ≥ 0 and Σ𝑖 α𝑖 𝑦𝑖 = 0
- Where:
- α𝑖 is the Lagrange multiplier associated with instance 𝑖.
- 𝑥𝑖, 𝑥𝑗 and 𝑦𝑖, 𝑦𝑗 are the features and targets (classes) of instances 𝑖 and 𝑗 in the dataset.
Note that the above expression depends only on dot products between instances. Keep this point in mind, as we will need it later when we talk about the kernel trick.
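To make this dot-product dependence explicit, the solution and the prediction can be recovered from the dual variables as follows (a standard result, stated here for reference rather than quoted from the article):

```latex
% The optimal weight vector is a combination of the support vectors
% (the instances with alpha_i > 0):
\[
  w = \sum_{i} \alpha_i \, y_i \, x_i
\]
% so the prediction for a new point x needs only dot products x_i . x:
\[
  y = \operatorname{sign}\Big( \sum_{i} \alpha_i \, y_i \, (x_i \cdot x) - b \Big)
\]
```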
Dealing with Noise
Fig. 2: SVM soft margins.
In the case of noise (outliers or examples with wrong labels), SVM tries to be less sensitive to the outliers in order to avoid future misclassification, since these outliers would influence the position of the decision boundary (this preserves generalization and avoids overfitting). So an error term is added to the cost function in order to keep a balance between increasing the margin and decreasing the number of misclassified items. SVMs that deal with noise in this way are called soft-margin SVMs.
Cost function
For soft-margin SVM, the optimization problem becomes:

Minimize (1/2)||𝑤||² + C Σ𝑖 𝜉𝑖 subject to 𝑦𝑖(𝑤𝑥𝑖 − 𝑏) ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0, for 𝑖 = 1, …, 𝑁

- Where C is a regularization hyperparameter and the 𝜉𝑖 are slack variables, both described below.
From the above cost function we can see that:
- Small C tends to emphasize the margin while ignoring the outliers in the training data, while large C may tend to overfit the training data.
- Therefore, C regulates the tradeoff between classifying the training data well and classifying future examples well (generalization).
- For every data point 𝑥𝑖, we introduce a slack variable 𝜉𝑖. The value of 𝜉𝑖 is the distance of 𝑥𝑖 from the corresponding class’s margin if 𝑥𝑖 is on the wrong side of the margin, and zero otherwise. Thus, points that are far away from the margin on the wrong side receive a larger penalty.
Tip: This cost function is solved the same way as the hard-margin problem.
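As a minimal sketch of how the C hyperparameter is exposed in practice, using scikit-learn's linear-kernel SVC purely as an illustration (the toy data and parameter values are assumptions, not from the article):

```python
import numpy as np
from sklearn.svm import SVC

# Tiny toy dataset with one noisy (mislabeled) point (illustrative).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [-2.0, -2.0], [-3.0, -1.0], [1.8, 1.8]])
y = np.array([1, 1, 1, -1, -1, -1])  # the last point is "noise"

for C in (0.01, 100.0):
    # Soft-margin linear SVM: a small C favors a wide margin and tolerates
    # the noisy point; a large C tries hard to classify every training point.
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: w={clf.coef_[0]}, b={clf.intercept_[0]:.3f}, "
          f"n_support_vectors={len(clf.support_)}")
```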
Conclusion:
- The main goal of the SVM is to find a hyperplane that classifies all training vectors into classes.
- There could be many hyperplanes that separate the instances of the classes, but it is desired that the chosen hyperplane maximizes the margin between the two classes.
This particular version of the algorithm builds the so-called linear model. It’s called linear because the decision boundary is a straight line (or a plane, or a hyperplane).