Support Vector Machine (SVM)

Nermeen Abd El-Hafeez
8 min read · Sep 7, 2023


What is a Support Vector Machine (SVM)?

A Support Vector Machine (SVM) is a robust supervised machine learning algorithm used for classification and regression tasks. Its core objective is to find an optimal hyperplane in a multi-dimensional feature space that cleanly separates data points belonging to distinct classes.

In a classification scenario where the dataset consists of two distinct classes, say Class A and Class B, the SVM aims to create a hyperplane that maximizes the margin between these classes.

Margin and Support Vectors

The margin, in this context, refers to the distance between the hyperplane and the nearest data points from each class, aptly termed support vectors. This margin-maximizing hyperplane subsequently serves as the decision boundary, enabling SVM to categorize new data points into either Class A or Class B, based on their relative positions to the hyperplane.
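As a quick illustration, here is a minimal scikit-learn sketch (the toy dataset and the C value are assumptions for demonstration only) that fits a linear SVM and inspects the support vectors and margin:

```python
# Minimal sketch: fit a linear SVM on a toy 2-class dataset (assumed data)
# and inspect the support vectors that define the margin.
import numpy as np
from sklearn.svm import SVC

# Toy 2D data: Class A (label 0) clustered low, Class B (label 1) clustered high
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)   # nearest points to the hyperplane
print("w (normal vector):", clf.coef_[0])           # orientation of the hyperplane
print("b (intercept):", clf.intercept_[0])
print("Margin width:", 2 / np.linalg.norm(clf.coef_[0]))  # distance between the two margin lines
```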

Mathematical Insight into Support Vector Machines

Dot Product

To delve deeper into SVM, consider the concept of a dot product. In the context of two vectors, say a and b, positioned within a two-dimensional Cartesian plane:

  • Vector a possesses a magnitude of 8 and forms a 115-degree angle with the x-axis.
  • Vector b boasts a magnitude of 10 and forms a 45-degree angle with the x-axis.

The angle between these two vectors, denoted by the Greek letter theta (θ), measures 70 degrees, calculated by subtracting 45 degrees from 115 degrees.

The dot product between these vectors involves multiplying the magnitude of vector a by the magnitude of vector b, followed by multiplication with the cosine (cos) of the angle between them, as expressed in the equation:

a • b = |a| × |b| × cos(θ)

This yields a dot product of 27.36 for the given vectors.

When the magnitudes and angles of vectors are unknown, you can compute the dot product using the following formula:

a • b = (ax × bx) + (ay × by)

For instance, vector a measures -3.4 on the x-axis and 7.3 on the y-axis, while vector b measures 7.1 on both the x- and y-axes due to its 45-degree angle. Plugging these values into the formula yields a dot product of (-3.4 × 7.1) + (7.3 × 7.1) = 27.69, which matches the earlier result of 27.36 up to the rounding of the components.
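Both formulas are easy to check in Python; here is a short sketch using the numbers from this example:

```python
import numpy as np

# Dot product from magnitudes and the angle between the vectors
mag_a, mag_b, theta_deg = 8, 10, 70
print(mag_a * mag_b * np.cos(np.radians(theta_deg)))   # ≈ 27.36

# Dot product from components (the same vectors, rounded to one decimal)
a = np.array([-3.4, 7.3])   # 8 at 115°: (8·cos 115°, 8·sin 115°)
b = np.array([7.1, 7.1])    # 10 at 45°: (10·cos 45°, 10·sin 45°)
print(np.dot(a, b))         # ≈ 27.69; the small gap comes from rounding the components
```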

Use of Dot Product in SVM

In SVM, dot products play a pivotal role in determining the position of data points relative to the hyperplane. To determine whether a point X falls on the positive or negative side of the plane, we first treat X as a vector. We then introduce a vector w perpendicular to the hyperplane, where c is the distance from the origin to the decision boundary along w. Projecting X onto w and comparing the projection with c gives the decision rule: if X · w ≥ c (equivalently, X · w - c ≥ 0), the point lies on the positive side; otherwise it lies on the negative side.
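Here is a minimal sketch of that rule, assuming an illustrative w, c, and test points rather than values taken from a trained model:

```python
import numpy as np

# Hypothetical hyperplane: points x satisfying np.dot(w, x) - c == 0
w = np.array([1.0, 1.0])   # normal vector, perpendicular to the hyperplane
c = 5.0                    # offset of the decision boundary along w

def side(X):
    """Return +1 if X falls on the positive side of the hyperplane, else -1."""
    return 1 if np.dot(w, X) - c >= 0 else -1

print(side(np.array([4.0, 4.0])))   # +1: projection onto w exceeds c
print(side(np.array([1.0, 1.0])))   # -1: projection onto w falls short of c
```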

The Role of Margins

Line equation

In the SVM context, the equation of a line plays a crucial role. The equation y = mx + b characterizes a line, where y and x denote coordinates on the line, m represents the slope, and b indicates the y-intercept — the point where the line crosses the y-axis.

For a line written with coefficients a and b and a constant c, the general equation ax + by + c = 0 describes the same line. This general form of a linear equation can be transformed into the more interpretable slope-intercept form y = mx + b by isolating y, which gives a slope of m = -a/b and a y-intercept of -c/b (assuming b ≠ 0).
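A tiny sketch of this conversion, with illustrative coefficients (2x + 3y - 6 = 0):

```python
def general_to_slope_intercept(a, b, c):
    """Convert ax + by + c = 0 to y = mx + k (requires b != 0)."""
    m = -a / b          # slope
    k = -c / b          # y-intercept
    return m, k

print(general_to_slope_intercept(2, 3, -6))   # (-0.666..., 2.0), i.e. y = -(2/3)x + 2
```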

Understanding the Plane Equation

When transitioning to three-dimensional space, a plane equation takes shape as ax + by + cz + d = 0. Here, a, b, and c represent coefficients defining the plane’s orientation or normal vector (perpendicular to the plane), while x, y, and z denote variables corresponding to coordinates in 3D space. d acts as a constant term. The equation ensures that any point (x, y, z) satisfying it resides on the plane. The specific values of a, b, and c determine the plane’s orientation, with d influencing its position along the normal vector.

The Role of Hyperplanes

In an n-dimensional space, hyperplanes are defined using an equation like a1x1 + a2x2 + … + anxn + b = 0. Here, a1, a2, …, an represent coefficients defining the hyperplane’s orientation or normal vector, while x1, x2, …, xn denote variables for coordinates in this n-dimensional space. The constant b remains a crucial term. This equation characterizes a hyperplane within the space.
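A minimal sketch of evaluating such an equation for a point, with hypothetical coefficients:

```python
import numpy as np

# Hypothetical hyperplane in 3D: 1·x1 + 2·x2 + 3·x3 + (-4) = 0
coeffs = np.array([1.0, 2.0, 3.0])   # a1, ..., an: the normal vector
b = -4.0                             # constant term

def evaluate(point):
    """Return a1*x1 + ... + an*xn + b; zero means the point lies on the hyperplane."""
    return np.dot(coeffs, point) + b

print(evaluate(np.array([4.0, 0.0, 0.0])))   # 0.0 -> on the hyperplane
print(evaluate(np.array([1.0, 1.0, 1.0])))   # 2.0 -> on the positive side
```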

Handling Non-Linear Data: The Kernel Trick

However, what if the data isn’t linearly separable? Here’s where the kernel trick comes into play within Support Vector Machines. This ingenious technique enables the effective handling of non-linear data by transforming it into a higher-dimensional space, where it might become linearly separable. This transformation is made possible through kernel functions. Let’s explore how this process works:

Mapping for Non-Linear Data Transformation:

Imagine dealing with 2D data points (x, y) that don’t exhibit linear separability in their native form. To address this, we introduce a mapping function, denoted as Ф: R² → R³, which transforms each 2D data point into a 3D representation. An example mapping function could be:

Ф(x, y) = (x², √2xy, y²)

For instance, if we have a point (2, 3) in 2D space, applying this mapping function leads to a corresponding 3D point. Given x = 2 and y = 3, the 3D point becomes (2², √2 * 2 * 3, 3²) = (4, 6√2, 9). This transformation enhances data separability in the higher-dimensional space.
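A quick sketch of this mapping, applied to the point from the example:

```python
import numpy as np

def phi(x, y):
    """Map a 2D point to 3D: Φ(x, y) = (x², √2·x·y, y²)."""
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

print(phi(2, 3))   # [4.  8.485...  9.], i.e. (4, 6√2, 9)
```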

Understanding Kernels:

Kernels are functions that compute the dot product of data points in a higher-dimensional space directly from their original lower-dimensional representations, without performing the transformation explicitly. They are integral to the kernel trick.

Kernel Trick for Second-Degree Polynomial Mapping:

In practice, explicitly transforming data using specific polynomial mappings (x², √2xy, y²) for each data point can be computationally demanding, especially with large datasets. This is where the kernel trick shines. It allows implicit data transformation, bypassing the explicit calculation of coordinates in the higher-dimensional space for each data point. Instead, it leverages dot products between data points in the higher-dimensional space — a computationally efficient approach. This empowers Support Vector Machines to adeptly handle non-linear data.

Consider two data points, a and b. The expression Ф(a)ᵀ ⋅ Ф(b) represents the dot product of their transformed vectors in the higher-dimensional space. Here’s a breakdown:

  • Ф(a) and Ф(b) represent the transformed vectors of a and b in the higher-dimensional space, achieved through kernel or mapping functions.
  • The ᵀ symbol denotes the transpose operation, flipping a vector between a row vector and a column vector.
  • The dot product operation computes the sum of pairwise products of components from the two vectors.

In essence, Ф(a)ᵀ ⋅ Ф(b) efficiently calculates the dot product of a and b in the higher-dimensional space — a fundamental concept in kernel methods like SVMs. This dot product quantifies the similarity or distance between data points in the higher-dimensional feature space, all while avoiding explicit computation of transformed coordinates.
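For this particular mapping, Φ(a)ᵀ ⋅ Φ(b) simplifies to (a ⋅ b)², so the higher-dimensional dot product can be computed from the original 2D vectors alone. A small sketch verifying this numerically with two illustrative points:

```python
import numpy as np

def phi(v):
    """Φ(x, y) = (x², √2·x·y, y²): the explicit 2D -> 3D mapping."""
    x, y = v
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

a = np.array([2.0, 3.0])
b = np.array([1.0, 4.0])

explicit = np.dot(phi(a), phi(b))   # dot product after the explicit transformation
kernel   = np.dot(a, b) ** 2        # kernel trick: (a · b)², no transformation needed

print(explicit, kernel)   # both 196.0
```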

Common Kernel Types in SVMs and Associated Parameters:

1. Linear Kernel

  • Kernel Function: K(x₁, x₂) = x₁ · x₂
  • Parameters: None
  • Use: Suitable for linearly separable data.

2. Polynomial Kernel

Kernel Function: K(x₁, x₂) = (γx₁ · x₂ + r)ᵈ

Parameters:

  • Degree (d): Controls the degree of the polynomial transformation.
  • Coefficient (γ): A scaling factor for the dot product.
  • Constant Term (r): A constant added to the dot product.

Use: Suitable for non-linear data where ‘d’ controls the level of non-linearity.

3. Gaussian Kernel (RBF — Radial Basis Function)

Kernel Function: K(x₁, x₂) = exp(-γ||x₁ - x₂||²)

Parameters:

  • Gamma (γ): Controls the spread of the Gaussian curve.

Use: Effective for non-linear and complex data.

4. Sigmoid Kernel

Kernel Function: K(x₁, x₂) = tanh(αx₁ · x₂ + r)

Parameters:

  • Alpha (α): Scales the dot product before applying the hyperbolic tangent.
  • Constant Term (r): An offset added to the dot product.

Use: Useful for non-linear data, especially in neural network applications.

5. Laplacian Kernel

Kernel Function: K(x₁, x₂) = exp(-γ||x₁ - x₂||)

Parameters:

  • Gamma (γ): Controls the width of the kernel function.

Use: Similar to Gaussian kernel with a sharper peak.

6. Hyperbolic (Sigmoid) Kernel

Kernel Function: K(x₁, x₂) = tanh(x₁ᵀx₂ + r)

Parameters:

  • Constant Term (r): An offset added to the dot product.

Use: Equivalent to the sigmoid kernel above with α = 1; useful for non-linear classification problems.

7. ANOVA Kernel

Kernel Function: K(x₁, x₂) = ∑ᵢ exp(-γ(x₁ᵢ - x₂ᵢ)²)ᵈ

Parameters:

  • Gamma (γ): Controls the width of each per-feature term.
  • Degree (d): Controls the degree applied to each term.

Use: Suitable for feature selection in high-dimensional spaces.

8. Radial-Basis Function (RBF) Kernel

Kernel Function: K(x₁, x₂) = exp(-γ||x₁ - x₂||²)

Parameters:

  • Gamma (γ): Controls the spread of the Gaussian curve.

Use: The same Gaussian (RBF) kernel listed above; it implicitly maps data to an infinite-dimensional space, making it effective for non-linear data.
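As a practical illustration, here is a minimal scikit-learn sketch comparing a few of these kernels on a synthetic dataset (the dataset and parameter values are assumptions for demonstration, not tuned settings):

```python
# Minimal sketch: comparing a few SVM kernels in scikit-learn on synthetic data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)   # non-linearly separable data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "linear":  SVC(kernel="linear"),
    "poly":    SVC(kernel="poly", degree=3, gamma="scale", coef0=1),   # d, γ, r
    "rbf":     SVC(kernel="rbf", gamma=1.0),                           # γ
    "sigmoid": SVC(kernel="sigmoid", gamma=0.5, coef0=0.0),            # α (passed as gamma), r
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name:8s} accuracy: {model.score(X_test, y_test):.2f}")
```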

In summary, Support Vector Machines (SVMs) are a vital tool in the field of machine learning. They excel in tasks involving classification and regression by finding optimal hyperplanes to separate different data classes. Through the use of dot products and mathematical constructs, SVMs create precise decision boundaries.

SVMs truly shine when dealing with non-linear data. The kernel trick, supported by various kernel types and parameters, allows SVMs to handle complex, real-world datasets effectively.
