10 Must-Know Models for ML Beginners: Support Vector Machines

The Ultimate Classifier

Dagang Wei
8 min read · Feb 1, 2024

This article is part of the series 10 Must-Know Models for ML Beginners.

Introduction

Support Vector Machines (SVMs) are a robust and elegant tool within the machine learning arsenal. SVMs are especially well-suited for classification tasks, where the goal is to learn a decision boundary separating data points into distinct categories. In this blog post, we’ll demystify SVMs, exploring their inner workings, advantages, and practical applications.

What is a Support Vector Machine?

At its core, an SVM is a supervised machine learning algorithm that aims to find the optimal hyperplane (a decision boundary) that maximizes the margin between different classes of data points. Let’s dissect this:

  • Hyperplane: In an n-dimensional space, a hyperplane is a flat subspace with one dimension less than the ambient space. Think of it as a line in 2D or a plane in 3D. Read my previous blog post to learn more about the mathematics of hyperplanes.
  • Margin: The margin is the distance between the hyperplane and the closest data points from each class. These closest points are the support vectors, and they alone determine where the boundary sits (the standard formulas for the decision boundary and margin width are sketched right after this list).
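
For the linear case, these two ideas can be written down compactly. As a quick sketch in standard notation (w is the normal vector of the hyperplane and b is the intercept; neither symbol appears elsewhere in this post): the hyperplane is the set of points x satisfying w · x + b = 0, a new point is classified by sign(w · x + b), and the margin that the SVM maximizes has width 2 / ‖w‖.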

Why Use an SVM?

SVMs offer a compelling set of advantages:

  • Robustness to Outliers: Because the decision boundary depends only on the support vectors, points lying far from the margin have no influence on it, and the soft-margin parameter limits how strongly any single unusual point can pull the boundary.
  • High-Dimensional Effectiveness: SVMs excel in situations with many features (dimensions). They can effectively handle complex relationships within data.
  • Versatility: The “kernel trick” is a powerful technique that allows SVMs to perform non-linear classification. With kernels, data can be implicitly transformed into a higher-dimensional space where a linear decision boundary can be established.
  • Regularization: The soft-margin formulation has regularization built in: the C parameter trades off a wide margin against training errors, which helps prevent overfitting (see the short scikit-learn sketch after this list).
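
To see these knobs concretely before the from-scratch version later in this post, here is a minimal scikit-learn sketch: the kernel argument selects the implicit feature space and C sets the regularization strength.

from sklearn.datasets import make_moons
from sklearn.svm import SVC

# A small non-linear toy dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# kernel chooses the (implicit) feature space; C trades margin width against training errors
clf = SVC(kernel='rbf', C=1.0, gamma='scale')
clf.fit(X, y)

print(clf.score(X, y))                # training accuracy
print(clf.support_vectors_.shape[0])  # number of points that became support vectors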

How Does an SVM Work?

  1. Data Representation: Each data point is represented as a vector in an n-dimensional space, where n is the number of features.
  2. Finding the Hyperplane: The SVM algorithm seeks to find the hyperplane that best separates the data points while maximizing the margin.
  3. Optimization: The task of finding the optimal hyperplane boils down to a convex quadratic programming problem, which is solved for the hyperplane parameters; the standard soft-margin objective is written out just after this list.
  4. Support Vectors: The hyperplane’s position is ultimately determined by only a subset of the data points, the support vectors, which lie closest to the decision boundary.
  5. Kernels (Non-Linear Classification): For datasets that are not linearly separable, the kernel trick projects the data into a higher-dimensional space, potentially making linear separation possible. Common kernel choices include linear, polynomial, and radial basis function (RBF) kernels.
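
For reference, the optimization in steps 2–4 is usually written as the soft-margin problem (a sketch in standard notation, where the ξᵢ are slack variables and C is the same regularization parameter used in the code below):

minimize (1/2)‖w‖² + C Σᵢ ξᵢ subject to yᵢ(w · xᵢ + b) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for every training point i.

Solving the dual of this problem produces one coefficient αᵢ per training point, and the prediction for a new point x depends only on the points with αᵢ > 0, i.e. the support vectors:

f(x) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x) + b )

This is exactly the decision function evaluated in the predict method of the from-scratch implementation further down.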

The Kernel Trick

The kernel trick in Support Vector Machines (SVMs) is a method that allows us to operate in a high-dimensional space without explicitly mapping the data points to this space. This might seem a bit abstract at first, so let’s break it down to understand how this projection into a higher-dimensional space occurs conceptually.

Conceptual Understanding of High-Dimensional Projection

1. Original Feature Space: Consider a dataset in a low-dimensional space, like 2D, where the data points are not linearly separable. In other words, you can’t draw a straight line to perfectly classify the data points into their respective categories.

2. Transformation to Higher-Dimensional Space: The essence of the kernel trick is to map these data points to a higher-dimensional space (for instance, from 2D to 3D) where they become linearly separable. The idea is that a pattern that is complex and convoluted in lower dimensions might become simple and linear when viewed in higher dimensions.

3. The Role of the Kernel Function: The kernel function, such as the polynomial or RBF (Radial Basis Function) kernel, implicitly performs this mapping. It calculates the dot product of the data points as if they were in this higher-dimensional space. The crucial point is that we never actually compute the coordinates of the data points in the higher-dimensional space. The kernel function simply computes the relationships (like angles and distances) between the data points as if they were in this higher-dimensional space.

An Example: From 2D to 3D

To make it more tangible, imagine a simple scenario where data points in 2D are not linearly separable. Now, using a polynomial kernel (say, of degree 2), these points are projected into a 3D space.

  • Original Space (2D): The points are (x, y).
  • Transformed Space (3D): The points become (x², √2·xy, y²).

In this 3D space, it’s possible that a plane can be drawn to separate the data points, which wasn’t possible in the original 2D space. The kernel function calculates the relationships between points as if they were in this 3D space, but without us having to manually do the transformation or deal with the complexity of operating in a higher-dimensional space.
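
A few lines of NumPy make this concrete. The sketch below uses the homogeneous degree-2 kernel K(a, b) = (a · b)², which corresponds exactly to the 3D map (x², √2·xy, y²) above, and checks that the kernel value matches the dot product of the explicitly mapped points:

import numpy as np

def phi(p):
    # Explicit degree-2 feature map: (x, y) -> (x^2, sqrt(2)*x*y, y^2)
    x, y = p
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

def poly2_kernel(a, b):
    # Homogeneous polynomial kernel of degree 2: (a . b)^2
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

print(np.dot(phi(a), phi(b)))  # dot product computed in the explicit 3D space -> 1.0
print(poly2_kernel(a, b))      # same value, computed without ever leaving 2D -> 1.0

The (x · z + 1)^degree kernel used in the implementation later in this post works the same way; its implicit feature map simply also includes the lower-order and constant terms.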

The Mathematical Perspective

Mathematically, the kernel trick rests on the fact that many machine learning algorithms, including SVMs, can be expressed entirely in terms of dot products. For a nonlinear transformation Φ applied to two data points x and x′, the kernel function K represents their dot product in the transformed space: K(x, x′) = Φ(x) · Φ(x′).

The kernel trick says that you can replace this dot product with a kernel function that gives the same result but without explicitly computing the transformation Φ. This is a significant computational advantage, especially when dealing with very high-dimensional spaces.

In summary, the kernel trick is a clever computational shortcut that allows SVMs to efficiently classify data in a high-dimensional space without ever explicitly visiting that space. It’s a foundational concept in machine learning that elegantly addresses the issue of non-linearity in datasets, enabling the use of linear models (like SVMs) in complex, real-world problems.

When to Use an SVM and When Not To

SVMs are a good choice for these scenarios:

  • Medium-sized datasets: SVMs are computationally efficient for moderately sized datasets.
  • Clear Separation: They are particularly useful when a clear margin of separation between classes is expected.
  • High-dimensional Data: SVMs perform well when the number of features is high.
  • Non-Linearity: SVMs, along with the kernel trick, are powerful tools for non-linear classification problems.

When to Consider Alternatives

  • Very Large Datasets: The training process of SVMs can become computationally expensive for extremely large datasets. Consider alternatives like neural networks in such cases.
  • Noisy Data with Overlapping Classes: SVMs struggle when the classes overlap heavily and no clean margin exists, even after a kernel transformation; many points end up as support vectors and tuning C becomes difficult.
  • Probabilistic Output: SVMs do not natively produce probability estimates (though a calibration workaround exists; see the sketch after this list). Look into logistic regression or Naive Bayes when well-calibrated probabilities matter.
  • Real-time Predictions: Predicting with a kernel SVM means evaluating the kernel against every support vector, so it can be too slow for latency-critical, real-time applications.
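
That said, if you do need probabilities from an SVM, scikit-learn can calibrate the decision scores for you. This is a sketch of that workaround (Platt-scaling-style calibration enabled via probability=True), not a native SVM output:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# probability=True fits an internal calibration (Platt scaling) on top of the SVM scores,
# which makes training noticeably slower but enables predict_proba
clf = SVC(kernel='rbf', C=1.0, probability=True).fit(X, y)
print(clf.predict_proba(X[:3]))  # calibrated class probabilities for the first three points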

Python Implementation from Scratch

Let’s implement an SVM from scratch for a non-linear classification problem. Note: for real-world applications, use efficient libraries like scikit-learn. The code is available in this Colab notebook.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score

# Kernel functions
def linear_kernel(x1, x2):
    # Linear kernel (plain dot product)
    # Suitable for linearly separable data
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, degree=3):
    # Polynomial kernel
    # Maps the input space into a polynomial feature space of the given degree
    # Useful for non-linearly separable data
    return (np.dot(x1, x2) + 1) ** degree

def rbf_kernel(x1, x2, gamma=1):
    # Radial Basis Function (RBF) or Gaussian kernel
    # Corresponds to an infinite-dimensional feature space
    # Highly effective for many non-linear problems
    return np.exp(-gamma * np.linalg.norm(x1 - x2) ** 2)

# Custom kernel SVM class
class KernelSVM:
    def __init__(self, kernel=linear_kernel, C=1.0):
        self.kernel = kernel                # Kernel function
        self.C = C                          # Regularization parameter
        # SVM parameters (learned during training)
        self.alpha = None                   # Lagrange multipliers
        self.support_vectors = None         # Support vectors
        self.support_vector_labels = None   # Labels of the support vectors
        self.b = 0                          # Intercept term

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y_ = np.where(y <= 0, -1, 1)  # Adjust labels to {-1, +1}

        # Initialize the Lagrange multipliers (alpha)
        self.alpha = np.zeros(n_samples)

        # Precompute the kernel matrix
        K = np.array([[self.kernel(x_i, x_j) for x_j in X] for x_i in X])

        # Simplified, perceptron-style updates of the Lagrange multipliers
        # (a rough heuristic rather than a full QP/SMO solver)
        for _ in range(100):
            for i in range(n_samples):
                if y_[i] * (np.dot(K[i], self.alpha * y_) + self.b) < 1:
                    self.alpha[i] += self.C
                else:
                    self.alpha[i] -= self.C
                # Clamp alpha values within [0, C]
                self.alpha[i] = max(0, min(self.alpha[i], self.C))

        # Extract the support vectors (points with non-negligible alpha)
        idx = self.alpha > 1e-5
        if np.sum(idx) == 0:
            self.b = 0
        else:
            self.support_vectors = X[idx]
            self.support_vector_labels = y_[idx]
            self.alpha = self.alpha[idx]

            # Compute the bias (intercept) term
            self.b = np.mean([y_[i] - np.dot(K[idx, i], self.alpha * self.support_vector_labels)
                              for i in np.where(idx)[0]])

    def predict(self, X):
        y_pred = np.zeros(X.shape[0])
        # Predict using the sign of the decision function
        for i in range(X.shape[0]):
            y_pred[i] = np.sign(sum(self.alpha * self.support_vector_labels *
                                    np.array([self.kernel(X[i], sv) for sv in self.support_vectors])) + self.b)
        return y_pred

# Plot the decision boundary of a trained model
def plot_decision_boundary(model, X, y, kernel_name):
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    ax = plt.gca()
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()

    # Create a grid on which to evaluate the model
    xx, yy = np.meshgrid(np.linspace(xlim[0], xlim[1], 50),
                         np.linspace(ylim[0], ylim[1], 50))

    # Predict on the grid points and reshape for the contour plot
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot the predicted class regions
    plt.contourf(xx, yy, Z, alpha=0.8, cmap=plt.cm.coolwarm)
    plt.title(f'SVM with {kernel_name} Kernel')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

# Generate a non-linearly separable dataset (moons)
X, y = make_moons(n_samples=100, noise=0.1, random_state=42)
# Adjust labels for the SVM (-1, 1)
y = np.where(y == 0, -1, 1)

# Kernels to experiment with
kernels = {
    'Linear': linear_kernel,
    'Polynomial': polynomial_kernel,
    'RBF': rbf_kernel
}

# Train and evaluate the KernelSVM with each kernel
for name, kernel in kernels.items():
    svm = KernelSVM(kernel=kernel, C=1.0)
    svm.fit(X, y)
    y_pred = svm.predict(X)
    accuracy = accuracy_score(y, y_pred)
    print(f'Accuracy with {name} kernel: {accuracy:.2f}')
    plot_decision_boundary(svm, X, y, name)

Output:

Accuracy with Linear kernel: 0.85
Accuracy with Polynomial kernel: 0.79
Accuracy with RBF kernel: 0.99
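
As a sanity check, the same experiment can be run with scikit-learn's optimized solver. The following is a rough equivalent rather than part of the original notebook; the exact accuracies will differ from the numbers above because SVC uses a proper SMO-based optimizer and its own default kernel parameters.

from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.1, random_state=42)

for kernel in ['linear', 'poly', 'rbf']:
    clf = SVC(kernel=kernel, C=1.0)  # degree=3 and gamma='scale' are the library defaults
    clf.fit(X, y)
    acc = accuracy_score(y, clf.predict(X))
    print(f'scikit-learn SVC with {kernel} kernel: {acc:.2f}')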

Conclusion

The Support Vector Machine is a robust and versatile algorithm in machine learning. While it excels in many scenarios, it's essential to understand its strengths and limitations. Implementing an SVM from scratch, as shown above, helps in understanding its underlying mechanics. For more complex applications or larger datasets, use an optimized library like scikit-learn.
