Support Vector Machine: The Superhero of Classification

Hey there! Back with another cool machine learning algorithm, and this one’s a classic — the Support Vector Machine (SVM). This algorithm is like the Batman of the ML world — sharp, precise, and incredibly powerful when it comes to classification. Let’s dive into what makes SVM tick and why it’s a go-to choice for many ML enthusiasts and professionals alike.

What’s the Buzz About?

Support Vector Machine (SVM) is a supervised learning algorithm used for classification and regression tasks. Its superpower lies in its ability to find the optimal boundary between different classes. Imagine SVM as an artist sketching a line (or a hyperplane) that best separates different categories in your data. This algorithm is particularly famous for its effectiveness in high-dimensional spaces and its versatility with different kernel functions.

The Core Idea

Picture this: You’re at a party, and there’s a clear divide between the groups on either side of the room. One group loves rock music, the other loves jazz. Your mission? Find the perfect line to separate these groups so you can recommend the right playlist to each side. That’s exactly what SVM does — it finds the optimal hyperplane that best divides the classes in your dataset.

Key Concepts in SVM

1. Hyperplane:

The decision boundary that separates different classes in the feature space. In a 2D space, it’s a line; in 3D, it’s a plane; in higher dimensions, it’s a hyperplane.

2. Support Vectors:

These are the data points that are closest to the hyperplane. They’re the critical elements of the dataset, as they define the position and orientation of the hyperplane. Think of them as the guardians of the boundary.

3. Margin:

The distance between the hyperplane and the nearest data points (the support vectors) from either class. SVM aims to maximize this margin, ensuring the best possible separation between classes.
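
To make the margin idea concrete: in the linear case the hyperplane can be written as w · x - b = 0, the two margin boundaries are w · x - b = +1 and w · x - b = -1, and the distance between those boundaries works out to 2 / ||w||. Maximizing the margin is therefore the same as minimizing ||w||, which is what the training procedure later in this post is built around.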

Why SVM is a Game-Changer:

1. High Dimensionality: SVM shines in high-dimensional spaces and can handle cases where the number of features exceeds the number of samples.

2. Versatility with Kernels: By using kernel functions, SVM can efficiently perform a non-linear classification. Kernels like polynomial, radial basis function (RBF), and sigmoid transform the input space to make it linearly separable.

3. Effective in Complex Spaces: SVM is particularly effective in spaces where the data points are not linearly separable. It transforms the data into higher dimensions where a clear boundary can be found.

Kernels: The Secret Sauce

One of the coolest features of SVM is its use of kernels. Kernels allow SVM to create non-linear decision boundaries. Here are the most popular ones:

- Linear Kernel: Best used when data is linearly separable. It’s simple but sometimes not sufficient for complex data.

- Polynomial Kernel: Adds polynomial features to the data, making it useful for classifying more complex relationships.

- Radial Basis Function (RBF) Kernel: This is the go-to kernel for many practitioners. It maps the data into a higher-dimensional space, making it easier to separate with a linear hyperplane.

- Sigmoid Kernel: Acts like a neural network activation function and is used in certain situations where other kernels might not be effective.
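
If you want to experiment with these kernels without writing any of the math yourself, scikit-learn’s SVC class exposes them through its kernel argument. Here is a minimal sketch (assuming scikit-learn is installed, and using its make_moons toy dataset purely for illustration):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# a small, non-linearly-separable toy dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# try each kernel on the same data and compare test accuracy
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel)
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))

On data like this you will typically see the RBF (and often the polynomial) kernel beat the linear one, since the two moon shapes are not linearly separable.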

Dealing with Soft Margins

In the real world, data isn’t always perfectly separable. That’s where the concept of soft margins comes in. SVM introduces a regularization parameter “C” that controls the trade-off between achieving a low error on the training data and having a large margin. A high C means a smaller margin with fewer misclassifications, while a low C means a larger margin but more room for misclassifications. This flexibility allows SVM to handle noisy data and avoid overfitting.
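
Roughly speaking, the soft-margin problem boils down to minimizing a hinge loss plus a regularization term: loss = lambda * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * (w · x_i - b)). The regularization weight lambda plays the inverse role of C: a large C corresponds to a small lambda (penalize misclassifications heavily), while a small C corresponds to a large lambda (prefer a wider margin). This is the objective the from-scratch implementation later in this post minimizes, with lambda showing up as lambda_param.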

Support Vector Regression (SVR)

While SVM is primarily used for classification, it also extends to regression tasks through Support Vector Regression (SVR). SVR uses the same principles as SVM but aims to fit the best line within a threshold value (a tube around the prediction), balancing fitting the data against keeping the hyperplane as flat as possible.
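
To give a feel for SVR without going into the math, here is a small sketch using scikit-learn’s SVR class on a noisy toy problem (the epsilon parameter is the threshold mentioned above: errors smaller than epsilon are simply ignored):

import numpy as np
from sklearn.svm import SVR

# a noisy sine wave as a toy regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

# fit an RBF-kernel SVR; points within +/- epsilon of the prediction incur no loss
reg = SVR(kernel="rbf", C=10.0, epsilon=0.1)
reg.fit(X, y)
print(reg.predict([[2.5]]))   # should land close to sin(2.5) ≈ 0.6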

For this post I am only going to implement the linear SVM, but the kernel trick will surely show up in a future blog post!

1. Initialization — We start by defining the class and its constructor.

import numpy as np


class SVM:
    def __init__(self, learning_rate=0.001, lambda_param=0.1, n_iters=1000) -> None:
        self.lr = learning_rate            # step size for the gradient updates
        self.lambda_param = lambda_param   # regularization strength
        self.n_iters = n_iters             # number of passes over the training data
        self.w = None                      # weight vector, set in fit()
        self.b = None                      # bias term, set in fit()

learning_rate (lr): This controls how much the model's parameters (w and b) are updated during each iteration of the training process.

lambda_param: This is a regularization parameter that helps prevent overfitting by penalizing large weights.

n_iters: The number of iterations (or epochs) the training loop will run.

w and b: These are the weights (w) and bias (b) that define the decision boundary of the SVM. They are initially set to None and will be initialized during training.

2. The Training Loop — A simple implementation: the model fits the hyperplane to the data and learns the separating boundary.

def fit(self, X, y):
    # convert labels to -1 / +1, the convention the SVM formulation expects
    y_ = np.where(y <= 0, -1, 1)
    n_samples, n_features = X.shape

    # start with a zero weight vector and zero bias
    self.w = np.zeros(n_features)
    self.b = 0

    for _ in range(self.n_iters):
        for idx, x_i in enumerate(X):
            # does this sample sit on the correct side, outside the margin?
            condition = y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1
            if condition:
                # correctly classified: only the regularization gradient applies
                self.w -= self.lr * (2 * self.lambda_param * self.w)
            else:
                # misclassified or inside the margin: also push the boundary
                self.w -= self.lr * ((2 * self.lambda_param * self.w) - np.dot(y_[idx], x_i))
                self.b -= self.lr * y_[idx]

y_ = np.where(y <= 0, -1, 1): The labels y are converted to -1 and 1. SVM requires labels to be either -1 or 1, so any label ≤ 0 is converted to -1, and any label > 0 is converted to 1.

n_samples, n_features = X.shape: The number of samples and features in the dataset X are determined.

self.w = np.zeros(n_features): The weight vector w is initialized to zeros.

self.b = 0: The bias b is initialized to zero.

Training Loop: The model is trained for n_iters iterations. In each iteration, it updates the weights and bias based on each sample in X.

Update Rules: For each data point x_i, the condition y_[idx] * (np.dot(x_i, self.w) - self.b) >= 1 is checked:

  • Condition Met (Correctly Classified): If the condition is met, it means the data point is correctly classified with a margin ≥ 1. The weights are updated slightly to account for regularization: self.w -= self.lr * (2 * self.lambda_param * self.w).
  • Condition Not Met (Misclassified or Within Margin): If the condition is not met, the data point is either misclassified or within the margin, so the weights and bias are updated more significantly:

self.w -= self.lr * ((2 * self.lambda_param * self.w) - np.dot(y_[idx], x_i)): This update adjusts the weights to correct the classification.

self.b -= self.lr * y_[idx]: The bias is also adjusted accordingly.
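
Put together, the loop is doing (stochastic) gradient descent on the hinge-loss objective from the soft-margin section. For a single sample the gradients are: if y_i * (w · x_i - b) >= 1, then dw = 2 * lambda * w and db = 0; otherwise dw = 2 * lambda * w - y_i * x_i and db = y_i. The two branches in the code simply subtract lr times these gradients.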

3. The Predictions — A straightforward implementation of the decision function.

def predict(self, X):
    # decision function: the sign of (w · x - b), matching the convention used in fit()
    linear_output = np.dot(X, self.w) - self.b
    return np.sign(linear_output)

  • Calculate Linear Output:

linear_output = np.dot(X, self.w) - self.b: The model calculates the linear combination of the input features X with the learned weights w, minus the learned bias b. This matches the w · x - b form used during training.

  • Return Predictions:

np.sign(linear_output): The np.sign function returns +1 if the linear output is positive and -1 if it's negative, thus classifying the input data as belonging to one of the two classes.
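
Putting the pieces together, here is a quick usage sketch on synthetic data. I am using scikit-learn’s make_blobs helper purely for convenience to generate two clusters; any two-class numeric dataset would do:

import numpy as np
from sklearn.datasets import make_blobs

# two well-separated clusters as a toy binary classification problem
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

clf = SVM(learning_rate=0.001, lambda_param=0.01, n_iters=1000)
clf.fit(X, y)                     # the 0/1 labels are mapped to -1/+1 inside fit()
predictions = clf.predict(X)      # returns -1 or +1 for each sample

# compare against the true labels, mapped the same way
accuracy = np.mean(predictions == np.where(y <= 0, -1, 1))
print(f"training accuracy: {accuracy:.2f}")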

That’s it for this implementation, folks! I hope everything made sense and that you’ll try implementing this algorithm on your own too!

Happy Coding! See ya next time!
