A Closer Look at SVM and K-Nearest Neighbors for Data Classification

Suman Kumar Roy
8 min read · Aug 7, 2023

Support Vector Machine

A Support Vector Machine (SVM) is a powerful machine-learning tool used for classifying data. It finds the best line or boundary to separate different groups, making it useful for tasks like image recognition and text classification. SVMs work well even with complex data and can handle both straight and curved separations. They’re widely used in various fields due to their versatility and effectiveness.

Advantages

Support Vector Machines (SVMs) offer several advantages that make them a valuable choice in machine learning:

  1. Effective in High-Dimensional Spaces: SVMs excel when dealing with data in high-dimensional spaces, where traditional methods might struggle. This makes them suitable for complex real-world datasets.
  2. Versatile Kernel Functions: SVMs can handle linear and non-linear relationships between variables through kernel functions. This enables them to capture intricate patterns and make accurate predictions in various scenarios.
  3. Optimal Margin: SVMs aim to find the optimal hyperplane that maximizes the margin between different classes, leading to better generalization and reduced overfitting.
  4. Robust to Outliers: SVMs are less sensitive to outliers in the data due to their focus on support vectors, which are the most critical data points for defining the decision boundary.
  5. Binary and Multi-Class Classification: SVMs inherently handle binary classification, but they can also be extended for multi-class problems using techniques like one-vs-one or one-vs-all.
  6. Regularization Control: SVMs allow control over the regularization parameter, which influences the trade-off between fitting the training data and preventing overfitting.
  7. Interpretability: SVMs provide insight into the importance of different data points through the support vectors, aiding in model interpretability and understanding.
  8. Applicability: SVMs find applications in various fields, including image recognition, text classification, finance, and biology, showcasing their versatility and wide-ranging utility.

Usage

SVMs are widely used for classification tasks, effectively separating data into distinct categories. They excel in scenarios like spam detection, medical diagnosis, and sentiment analysis. In image recognition, SVMs identify objects and patterns. In text analysis, they classify topics and sentiments. In bioinformatics, SVMs predict protein structures and aid medical diagnoses. In finance, they forecast stock prices and detect fraud.

Limitations

SVMs can become computationally demanding for large datasets or high-dimensional spaces. Proper parameter selection is essential, as poorly chosen values can impact performance. They may require significant memory resources, and without proper tuning, they can overfit small datasets. While effective, SVMs lack transparency in decision-making and may be less interpretable. Ensuring proper data scaling is important, and SVMs don’t inherently provide probabilistic outputs.

Understanding SVM

A Support Vector Machine (SVM) is a powerful machine learning algorithm used for both classification and regression tasks. It works by finding an optimal hyperplane that best separates data points belonging to different classes while maximizing the margin between them. This hyperplane serves as a decision boundary, enabling the SVM to classify new, unseen data points based on which side of the boundary they fall.
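For readers who want the math behind “maximizing the margin,” one standard hard-margin formulation looks like this: given training points x_i with labels y_i ∈ {−1, +1}, the SVM finds the weight vector w and bias b that solve

```latex
% Maximize the margin 2/||w|| by minimizing ||w||^2, while keeping every
% training point on the correct side of the margin.
\min_{\mathbf{w},\, b} \ \frac{1}{2}\,\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1, \qquad i = 1, \dots, n
```

A new point x is then classified by the sign of w·x + b, i.e., by which side of the hyperplane it falls on.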

Imagine you have a dataset of two classes, ‘A’ and ‘B,’ representing different types of fruits given in the image below. Each fruit is characterized by its weight and sweetness level. You want to build a model that can predict whether a fruit is of type ‘A’ or ‘B’ based on these two features.

Here’s where the SVM comes in:

  1. Data Visualization: First, you plot the data points on a 2D plane, with weight on the x-axis and sweetness on the y-axis. The classes are represented by different colors and shapes: ‘A’ as blue squares and ‘B’ as red circles.
  2. Finding the Hyperplane: The SVM’s goal is to find the hyperplane that best separates the ‘A’ and ‘B’ fruits while maximizing the margin between them. This margin is the distance between the closest data points from each class to the hyperplane. These closest data points are called support vectors.
  3. Making Predictions: Once the optimal hyperplane is identified, new, unseen fruits can be classified based on which side of the hyperplane they fall. If a fruit lies on the ‘A’ side of the hyperplane, it will be classified as ‘A,’ and if it falls on the ‘B’ side, it will be classified as ‘B.’
  4. Handling Non-Linearity: In cases where the data is not linearly separable, SVMs can use kernel functions to transform the data into a higher-dimensional space where separation is possible. For example, if the original 2D data is not linearly separable, applying a radial basis function (RBF) kernel can create a higher-dimensional space where a hyperplane can separate the classes effectively.

In this example, the SVM algorithm has successfully learned a decision boundary that distinguishes between ‘A’ and ‘B’ fruits. It can now classify new fruits based on their weight and sweetness levels.
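As a rough sketch of how this fruit example could look in code, here is a small scikit-learn snippet; the weight and sweetness values below are made up purely for illustration:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Hypothetical fruit data: [weight (g), sweetness (0-10)]
X = np.array([
    [120, 3.0], [130, 3.5], [110, 2.8], [125, 3.2],   # class 'A'
    [180, 7.5], [175, 8.0], [190, 7.2], [185, 7.8],   # class 'B'
])
y = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])

# Scale the features, then fit a linear SVM; switching to kernel='rbf'
# handles cases where the classes are not separable by a straight line.
model = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
model.fit(X, y)

# Classify a new, unseen fruit by which side of the hyperplane it falls on.
print(model.predict([[150, 5.0]]))
```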

Support Vector Machines are powerful tools in machine learning, capable of handling complex data and finding optimal decision boundaries for various applications.

Support Vector Machine with Decision Boundary and Margin [Link]
Linear and Non-Linear SVM [Link]

Code from Scratch

Below is the code for a linear SVM written from scratch.
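A minimal sketch of one common way to implement a linear SVM from scratch is shown here, training with sub-gradient descent on the hinge loss; the class name and hyperparameter defaults are illustrative choices rather than fixed requirements:

```python
import numpy as np

class LinearSVM:
    """Minimal linear SVM trained with (sub)gradient descent on the hinge loss."""

    def __init__(self, learning_rate=0.001, lambda_param=0.01, n_iters=1000):
        self.lr = learning_rate
        self.lambda_param = lambda_param   # strength of the L2 regularizer
        self.n_iters = n_iters
        self.w = None
        self.b = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        # Map labels {0, 1} to {-1, +1}, as required by the hinge loss.
        y_ = np.where(y <= 0, -1, 1)

        self.w = np.zeros(n_features)
        self.b = 0.0

        for _ in range(self.n_iters):
            for idx, x_i in enumerate(X):
                margin = y_[idx] * (np.dot(x_i, self.w) - self.b)
                if margin >= 1:
                    # Correctly classified outside the margin: only regularize.
                    self.w -= self.lr * (2 * self.lambda_param * self.w)
                else:
                    # Inside the margin or misclassified: hinge-loss subgradient.
                    self.w -= self.lr * (2 * self.lambda_param * self.w - y_[idx] * x_i)
                    self.b -= self.lr * y_[idx]

    def predict(self, X):
        # Sign of the decision function: -1 for one class, +1 for the other.
        return np.sign(np.dot(X, self.w) - self.b)
```

The model expects numeric features (ideally scaled to similar ranges) and returns −1/+1 predictions, which can be mapped back to the original class names.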

In conclusion, SVMs are versatile and powerful machine learning algorithms used for classification and regression tasks. They find optimal decision boundaries to separate data points of different classes, even in high-dimensional or complex scenarios. SVMs offer robustness, effective handling of non-linear data, and application across various fields. Despite certain limitations, SVMs remain a valuable tool for accurate predictions and data analysis.

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a simple yet effective machine-learning technique for classification and regression. It predicts outcomes based on the majority of its nearest data points. KNN is versatile, handling various data types and relationships, and finds use in recommendation systems, image recognition, and more. It’s a valuable tool for quick predictions with minimal training.

Advantages

  1. Simplicity: KNN is easy to understand and implement, making it a straightforward choice for those new to machine learning.
  2. No Separate Training: Unlike many algorithms, KNN doesn’t require an extensive training phase. It swiftly incorporates new data points, making it adaptable to evolving datasets.
  3. Versatility: KNN performs well in both classification and regression tasks, accommodating a wide range of predictive problems.
  4. Complex Pattern Recognition: KNN excels at capturing intricate patterns within data, especially when relationships are non-linear or multifaceted.
  5. Dynamic Data Handling: KNN is naturally suited for scenarios with changing data, making it an effective choice for real-time applications.
  6. Ensemble Capability: KNN can serve as a fundamental building block in ensemble learning methods, contributing to improved overall model performance.
  7. Effective with Small Datasets: KNN’s reliance on nearest neighbors makes it robust and accurate, particularly when working with limited amounts of data.
  8. Simplified Model Tuning: KNN typically involves fewer hyperparameters compared to other algorithms, streamlining the process of model configuration and selection.

Usage

KNN is commonly used for classification, such as in email filtering, medical diagnoses, and sentiment analysis. It contributes to recommendation systems by identifying similar users or items. KNN aids in anomaly detection, recognizing unusual data points. It handles regression tasks, predicting continuous values like housing prices. KNN is valuable in image recognition, identifying objects and categorizing images. It aids in imputing missing data, clustering, and various tasks in fields like bioinformatics, geospatial analysis, and market segmentation.

Limitations

KNN’s computational efficiency decreases with larger datasets. In high-dimensional spaces, KNN’s performance can degrade. It might favor majority classes in imbalanced datasets. Choosing the optimal number of neighbors (‘K’) is essential. KNN is sensitive to noise and outliers. Its decisions are influenced by local data structure, potentially missing global patterns. Handling categorical features requires preprocessing. The choice of distance metric affects results. Features with larger scales can bias KNN’s predictions. Memory usage can be substantial when storing the entire dataset.
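Because distances drive every prediction, the scale-bias issue above is usually handled by standardizing features before fitting. A brief sketch using scikit-learn’s StandardScaler in a pipeline (the synthetic data here is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is rescaled so it would dominate the distance.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X[:, 0] *= 1000

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# StandardScaler puts all features on a comparable scale before KNN
# measures Euclidean distances, removing the bias from the large-scale feature.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```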

Factors for selecting ‘K’

Common considerations include the size and noisiness of the dataset (a small ‘K’ is sensitive to noise, while a large ‘K’ smooths over local structure), preferring an odd ‘K’ in binary problems to avoid tied votes, and validating candidate values of ‘K’ with cross-validation.

Understanding KNN

K-Nearest Neighbors (KNN) is a basic machine learning algorithm used for classification and regression. It predicts outcomes by looking at the ‘K’ closest data points and assigning labels or values based on their majority or average. Similarity between data points is measured using distances. KNN is simple, versatile, and useful for quick predictions.

Imagine you’re a botanist working with different species of iris flowers. You have a dataset containing three species: Setosa, Versicolor, and Virginica. Each iris is described by its sepal length, sepal width, petal length, and petal width.

  • You gather data on 150 iris flowers, with 50 samples from each species. Each flower is represented as a data point with its features.
  • You visualize the data by creating scatter plots of the different iris species, with two features on the axes.
  • You split the dataset into a training set and a test set. Let’s say you allocate 120 samples for training and 30 for testing.
  • You decide to use KNN to classify iris flowers. To begin, you need to choose the number of neighbors, ‘K,’ to consider when making predictions.
  • Now, let’s say you want to classify a test iris with the following measurements: sepal length = 5.0, sepal width = 3.5, petal length = 1.5, petal width = 0.2. KNN identifies the ‘K’ nearest neighbors from the training set based on Euclidean distance. Let’s say ‘K’ is set to 5. The closest five training samples are Versicolor, Versicolor, Setosa, Versicolor, and Setosa.
  • Among these ‘K’ neighbors, the majority class is Versicolor, which appears three times. KNN predicts the test iris as belonging to the Versicolor species.
  • You compare the predicted class with the actual class of the test iris to evaluate the accuracy of the KNN model.

In this example, KNN classified an iris based on the classes of its nearest neighbors. The assumption is that similar iris flowers (based on their features) belong to the same species. The choice of ‘K’ and the distance metric (like Euclidean distance) influence the results.

KNN’s simplicity lies in its reliance on nearby data points for predictions. However, it’s essential to fine-tune ‘K’ and preprocess the data correctly for optimal performance. This example illustrates how KNN can be a useful tool for classification tasks, especially when the relationships between features and classes are intuitive and nearby points tend to have similar labels.
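The same walkthrough translates to only a few lines with scikit-learn. A sketch using the built-in iris dataset, the 120/30 split, and ‘K’ = 5 (parameters chosen to mirror the example above):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()

# 150 samples split into 120 for training and 30 for testing.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=30, stratify=iris.target, random_state=42
)

# 'K' = 5: each prediction is a majority vote among the 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
```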

KNN Classification demo [Link]

Code from Scratch

Below is the from-scratch code for a KNN classifier.
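A minimal from-scratch sketch of the idea is shown here: Euclidean distance plus a majority vote among the ‘K’ nearest training points. The class and function names are illustrative choices:

```python
import numpy as np
from collections import Counter

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

class KNN:
    def __init__(self, k=5):
        self.k = k

    def fit(self, X, y):
        # KNN is a "lazy" learner: training simply stores the data.
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)

    def predict(self, X):
        return np.array([self._predict_one(x) for x in np.asarray(X)])

    def _predict_one(self, x):
        # Distance from x to every stored training point.
        distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k closest training points.
        k_idx = np.argsort(distances)[:self.k]
        k_labels = self.y_train[k_idx]
        # Majority vote among the k nearest neighbors.
        return Counter(k_labels).most_common(1)[0][0]

# Example usage (illustrative): fit on a train/test split of a labeled
# dataset, then compare predict(X_test) with y_test to measure accuracy.
```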

With this implementation, the KNN model reaches more than 80% accuracy, which is a good result for such a simple supervised classifier.

In conclusion, K-Nearest Neighbors (KNN) is a straightforward and intuitive machine learning algorithm used for classification and regression tasks. It operates by identifying the ‘K’ nearest data points to make predictions based on their labels or values. While simple and adaptable, KNN’s performance depends on proper parameter tuning and feature scaling. It serves as a foundational method in machine learning, particularly suitable for quick insights and initial data exploration.

#Quantum30 #Day_7
