Neural Odyssey Series

Classic Machine Learning in Python: K-Nearest Neighbors (KNN)

Proximity-Based Predictions

Amir Lavasani
9 min read · Feb 6, 2024

What is KNN?

KNN relies on a straightforward principle: when given a new, unknown data point, it looks at the K nearest labeled data points and assigns the most common label among them to the new point.

This closeness is determined by a distance metric, commonly Euclidean or Manhattan distance.

The Core Principle of Proximity-Based Learning

At the core of proximity-based learning, such as the K-Nearest Neighbors (KNN) algorithm, lies a fundamental concept: closeness dictates similarity.

Grasping Similarity via Proximity

The concept translates into the algorithm’s behavior by seeking the closest ‘neighbors’ — data points that share proximity — to make decisions about new, unseen data. By assuming that nearby points are alike, the algorithm infers patterns and assigns labels based on this proximity.

Decision Boundaries: A Result of Proximity

The algorithm doesn’t just predict outcomes; it also delineates decision boundaries. These boundaries partition the feature space into regions, where each region signifies a particular class or label. Think of it as virtual borders drawn based on the proximity of data points.
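
To make this concrete, here is a small illustrative sketch (not from the original article; the synthetic dataset and plotting choices are assumptions for demonstration) that trains a KNN classifier on 2D toy data and colors the plane by the predicted class, making the proximity-driven decision regions visible.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 2D data with three clusters (illustrative only)
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

# Fit a simple KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)

# Evaluate the classifier over a dense grid covering the feature space
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 300),
    np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 300),
)
Z = knn.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Regions of constant color are the proximity-based decision regions
plt.contourf(xx, yy, Z, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', s=20)
plt.title("KNN decision regions (k=5)")
plt.show()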

Dall-E generated image with the following concept: Various data points forming groups or clusters, with neighboring points appearing closer together

Anatomy of the KNN Algorithm

  1. Neighbor identification: KNN identifies the 'k' nearest neighbors to a query point based on their proximity.
  2. Decision by majority: It predicts via a majority vote among the neighbors for classification, or by averaging their values for regression.
  3. No explicit training phase: As a lazy learner, KNN has no explicit training step; it stores the entire dataset and defers computation to inference time.
  4. Impact of k-value: The k parameter influences model complexity and can affect overfitting or underfitting (see the sketch after this list).
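
To make point 4 concrete, here is a brief sketch (using a synthetic dataset invented for illustration, not the wine data used later) that sweeps over a few k-values and compares train and test accuracy: very small k tends to overfit (high train accuracy, lower test accuracy), while very large k tends to underfit.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic classification data, purely for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for k in (1, 5, 15, 50, 150):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k:3d}  train acc={knn.score(X_train, y_train):.2f}  test acc={knn.score(X_test, y_test):.2f}")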

Choosing the Right Distance Metrics in KNN

Each distance metric provides a unique perspective in determining proximity and contributes distinct decision boundaries within the KNN algorithm.

Understanding their characteristics aids in selecting the most appropriate metric for a given dataset.

Let's explore some key distance metrics used in KNN:

Euclidean Distance (p=2)

The most common and straightforward distance measure, applicable only to real-valued vectors.

Formula: d(x, y) = √( Σᵢ (xᵢ - yᵢ)² ), the straight-line distance between the query point x and the target point y.

Manhattan Distance (p=1)

Often known as taxicab or city block distance, this metric sums the absolute differences between the coordinates of two points.

Formula: d(x, y) = Σᵢ |xᵢ - yᵢ|, the sum of the absolute differences between corresponding coordinates.

Minkowski Distance

A generalized form that encompasses both Euclidean and Manhattan distances, allowing the creation of various distance metrics based on the parameter p.

Formula: d(x, y) = ( Σᵢ |xᵢ - yᵢ|^p )^(1/p); the equation reduces to Manhattan distance when p = 1 and to Euclidean distance when p = 2.

Hamming Distance

Tailored for Boolean or string vectors, it measures where two vectors disagree. It is commonly referred to as the overlap metric.

Formula: for two vectors of equal length, the Hamming distance counts the positions where the vectors do not match.
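
For reference, all four metrics are available in scipy.spatial.distance. The short sketch below (an illustration, not part of the original article) computes each one for a pair of toy vectors; note that SciPy's hamming returns the fraction of mismatching positions, so it is multiplied by the vector length to recover the count described above.

from scipy.spatial import distance

a, b = [1, 2, 3, 4], [2, 2, 5, 4]

print(distance.euclidean(a, b))         # sqrt(1 + 0 + 4 + 0) ≈ 2.236
print(distance.cityblock(a, b))         # 1 + 0 + 2 + 0 = 3 (Manhattan)
print(distance.minkowski(a, b, p=3))    # generalized form; p=1 and p=2 recover the two above
print(distance.hamming(a, b) * len(a))  # 2 positions differ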

Coding KNN in Python from Scratch

Implementing the K-Nearest Neighbors (KNN) algorithm from scratch allows a deep dive into its mechanics.

Let’s break down the process into distinct parts and code each step comprehensively.

Part 1: Distance Calculation

The core of KNN involves measuring distances between data points. In this step, we’ll create a function to compute the Euclidean distance between two points.

def euclidean_distance(point1, point2):
    """
    Calculate the Euclidean distance between two points.

    Parameters:
    point1 : list or array-like
        Coordinates of the first point.
    point2 : list or array-like
        Coordinates of the second point.

    Returns:
    distance : float
        The Euclidean distance between the two points.
    """
    # Ensure both points have the same dimensions
    assert len(point1) == len(point2), "Points should have the same dimensions."

    # Compute the Euclidean distance
    distance = sum((p1 - p2) ** 2 for p1, p2 in zip(point1, point2)) ** 0.5
    return distance
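
As a quick sanity check (a toy 3-4-5 right-triangle example, not from the article), the function returns the expected straight-line distance:

print(euclidean_distance([0, 0], [3, 4]))  # 5.0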

Part 2: Finding Nearest Neighbors

Next, let’s write a function to find the ‘k’ nearest neighbors of a query point within a dataset.

def find_neighbors(X_train, query_point, k):
    """
    Find the 'k' nearest neighbors of a query point within a dataset.

    Parameters:
    X_train : list or array-like
        Training dataset containing features.
    query_point : list or array-like
        Coordinates of the query point.
    k : int
        Number of neighbors to find.

    Returns:
    neighbors : list
        List of indices of the 'k' nearest neighbors.
    """
    distances = []

    # Calculate distance from the query point to each point in the training set
    for i, data_point in enumerate(X_train):
        distance = euclidean_distance(query_point, data_point)
        distances.append((i, distance))

    # Sort distances in ascending order
    distances.sort(key=lambda x: x[1])

    # Get indices of the 'k' nearest neighbors
    neighbors = [index for index, _ in distances[:k]]
    return neighbors
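
A small illustration with made-up points (not from the article): the two training points closest to the query are returned by index.

X_toy = [[1, 1], [2, 2], [8, 8], [9, 9]]
print(find_neighbors(X_toy, query_point=[1.5, 1.5], k=2))  # [0, 1]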

Part 3: Predicting the Class

Finally, let’s create a function to predict the class of a query point based on the majority class among its nearest neighbors.

def predict(X_train, y_train, query_point, k):
    """
    Predict the class of a query point based on the majority class among its nearest neighbors.

    Parameters:
    X_train : list or array-like
        Training dataset containing features.
    y_train : list or array-like
        Training dataset containing labels.
    query_point : list or array-like
        Coordinates of the query point.
    k : int
        Number of neighbors to consider.

    Returns:
    predicted_class : int or str
        Predicted class label for the query point.
    """
    # Find the indices of the 'k' nearest neighbors, then look up their labels
    neighbors = find_neighbors(X_train, query_point, k)
    neighbor_labels = [y_train[i] for i in neighbors]

    # Count occurrences of each label among neighbors
    label_counts = {}
    for label in neighbor_labels:
        if label in label_counts:
            label_counts[label] += 1
        else:
            label_counts[label] = 1

    # Get the label with the highest count
    predicted_class = max(label_counts, key=label_counts.get)
    return predicted_class
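
To see the three functions working together, here is a minimal end-to-end run on a tiny made-up dataset (the points and labels are purely illustrative):

X_toy = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
y_toy = ['red', 'red', 'red', 'blue', 'blue', 'blue']

# A query point near the first cluster should be labeled 'red'
print(predict(X_toy, y_toy, query_point=[2, 2], k=3))  # 'red'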

Exploring Wine Quality Classification with KNN

Introduction to the Dataset

The wine quality dataset comprises 11 features like acidity, residual sugar, pH, and alcohol content, aiming to predict wine quality on a scale from 1 to 10. With 4898 samples, this dataset serves as a playground for exploring classification techniques.

Static K-value: Starting Point

In this phase, we initiate our exploration by reading the dataset, splitting it into training and testing sets, and applying a KNeighborsClassifier from scikit-learn with a static k-value of 15.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns


def read_data(file_name):
    # Load the dataset with semicolon delimiter
    data = pd.read_csv(file_name, delimiter=';')
    return data


def split_data(data):
    # Extract features and labels
    X = data.drop('quality', axis=1)
    y = data['quality']

    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.2,
        random_state=42
    )
    return X_train, X_test, y_train, y_test


def fit_model(X_train, y_train, k=5):
    # Implement KNN classifier
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Predict on the train set
    train_preds = knn.predict(X_train)

    # Calculate and return accuracy on train set
    train_accuracy = accuracy_score(y_train, train_preds)
    return train_accuracy, knn


def test_model(model, X_test, y_test):
    # Predict on the test set
    test_preds = model.predict(X_test)

    # Calculate and return accuracy on test set
    test_accuracy = accuracy_score(y_test, test_preds)
    return test_accuracy
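
To reproduce the static-k result, the helper functions can be wired together roughly as follows (a sketch; the CSV file name is an assumption, so point it at your local copy of the UCI wine quality file):

# Hypothetical file name; adjust to your local path
data = read_data('winequality-white.csv')
X_train, X_test, y_train, y_test = split_data(data)

# Fit with a static k-value of 15 and evaluate on the held-out test set
train_accuracy, knn_model = fit_model(X_train, y_train, k=15)
test_accuracy = test_model(knn_model, X_test, y_test)
print(f"Test Accuracy: {test_accuracy*100:.2f}%")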

Results:
With a static k-value of 15, the model yields:

  • Test Accuracy: 48.44%

The next step is to find the optimum k-value that yields the best accuracy on the training data.

Optimizing KNN: Fine-Tuning Your Model

Enhanced Accuracy with GridSearchCV

Employing GridSearchCV to find the best k-value produced a noticeable accuracy gain, pushing test accuracy just past 50%. Optimizing the number of neighbors was pivotal in this improvement.

Results:
With an optimized k-value of 19, the model yields:

  • Test Accuracy: 50.94%

def find_best_k_with_grid_search(X_train, y_train, X_test, y_test, param_grid):
    # Create a KNN classifier
    knn = KNeighborsClassifier()

    # Perform GridSearchCV
    grid_search = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)

    # Get the best parameter
    best_k = grid_search.best_params_['n_neighbors']

    # Fit model with best k on entire training set
    best_knn = KNeighborsClassifier(n_neighbors=best_k)
    best_knn.fit(X_train, y_train)

    # Calculate accuracy on train and test set
    train_accuracy = best_knn.score(X_train, y_train)
    test_accuracy = best_knn.score(X_test, y_test)

    return grid_search.best_params_, train_accuracy, test_accuracy


# Assuming you have X_train, y_train variables
max_k = 50

# Define a grid of hyperparameters
parameters = {'n_neighbors': range(2, max_k + 1)}

best_params, train_accuracy, test_accuracy = find_best_k_with_grid_search(
    X_train, y_train, X_test, y_test, parameters
)
print(f"Best k: {best_params['n_neighbors']}")
print(f"Train Accuracy with Best k: {train_accuracy*100:.2f}%")
print(f"Test Accuracy with Best k: {test_accuracy*100:.2f}%")

Weighing Distance for Precision

By incorporating distance-based weighted averaging within GridSearchCV, the model’s accuracy jumped to nearly 51.56%. Considering the proximity of neighbors in the prediction significantly contributed to this progress.

We add both 'uniform' and 'distance' weighting modes to the parameter grid, so the search selects the best combination of k-value and weighting scheme.

parameters = {
    "n_neighbors": range(2, max_k + 1),
    "weights": ["uniform", "distance"]
}
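
One way to use this grid is to reuse the helper from the previous step (a sketch, assuming the train/test splits are still in scope). Note that find_best_k_with_grid_search as written refits using only the best k, so the snippet below refits with both selected parameters before scoring.

best_params, _, _ = find_best_k_with_grid_search(
    X_train, y_train, X_test, y_test, parameters
)

# Refit with both the selected k and the selected weighting scheme
best_knn = KNeighborsClassifier(
    n_neighbors=best_params['n_neighbors'],
    weights=best_params['weights']
)
best_knn.fit(X_train, y_train)
print(f"Test Accuracy: {best_knn.score(X_test, y_test)*100:.2f}%")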

Results:
With an optimized k-value of 43, and using weighted distance, the model yields:

  • Test Accuracy: 51.56%

Ensemble Modeling with Bagging

Implementing Bagging with KNN led to a substantial leap in accuracy, reaching around 60.31%. Combining multiple KNN models through Bagging brought more robust predictions.

Results:
With a bagging ensemble of 100 KNN estimators, each trained on a random 30% subsample of the training data, the model yields:

  • Test Accuracy: 60.31%

def calculate_bagged_knn(X_train, y_train, X_test, y_test, best_k, best_weights):
    # Create a KNeighborsClassifier with best parameters
    knn = KNeighborsClassifier(n_neighbors=best_k, weights=best_weights)

    # Create BaggingClassifier with KNeighborsClassifier as base estimator
    bagged_knn = BaggingClassifier(estimator=knn, n_estimators=100, max_samples=0.3)

    # Fit the BaggingClassifier on the training data
    bagged_knn.fit(X_train, y_train)

    # Calculate training and test accuracies
    train_accuracy = bagged_knn.score(X_train, y_train)
    test_accuracy = bagged_knn.score(X_test, y_test)

    return train_accuracy, test_accuracy
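
A call might look like this (a sketch, reusing the best parameters reported above, k=43 with distance weighting, and the existing train/test splits):

train_accuracy, test_accuracy = calculate_bagged_knn(
    X_train, y_train, X_test, y_test,
    best_k=43, best_weights='distance'
)
print(f"Bagged KNN Test Accuracy: {test_accuracy*100:.2f}%")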

The complete code is accessible on GitHub.

Pros and Cons of the KNN Algorithm

K-Nearest Neighbors (KNN) brings forth advantages and limitations, offering interpretability and adaptability while grappling with computational demands and challenges in high-dimensional spaces.

Advantages of KNN

  1. Interpretable and Fast Development: KNN offers interpretability, allowing users to comprehend its functioning. Its simplicity facilitates the rapid development of models without the complexity of more advanced techniques.
  2. Adaptable to New Data: It easily accommodates new data points without retraining, adjusting its predictions based on new examples added to the dataset.
  3. Few Hyperparameters: Requires minimal parameter tuning — mainly ‘k’ and choice of distance metric — simplifying the training process.
  4. Robustness to Noisy Data: The majority vote over several neighbors makes KNN resilient to noisy training data, aiding effective classification.
  5. Effective with Large Training Data: More training data gives KNN more neighbors to draw on, which generally improves its predictions.

Drawbacks

  1. Computationally Intensive and Resource-Heavy: As a lazy learner, KNN stores the entire training set and computes distances at prediction time, which makes it both memory-hungry and slow at inference.
  2. Curse of Dimensionality: Faces challenges with high-dimensional data, struggling to properly classify data points in higher dimensions, potentially leading to less accurate predictions.
  3. Needs Optimal ‘k’ Selection: The choice of ‘k’ can significantly impact performance, requiring careful selection and optimization, which might be complex at times.
  4. Slower Performance with Increased Data: The algorithm’s efficiency decreases notably as the dataset size or number of predictors/independent variables grows, affecting its speed.

Dall-E generated image with the following concept: Abstract machine learning using proximity-based algorithms

Wrapping Up: Leveraging KNN in Python ML

In this comprehensive exploration of K-Nearest Neighbors (KNN) in Python, we delved into the algorithm’s fundamentals, its pivotal components, and practical implementation aspects.

We’ve reviewed:

Algorithm Insights: Understanding how KNN classifies based on proximity to neighbors.
Significance of K-Value: Recognizing the impact of k-value on model performance and the trade-off between bias and variance.
Distance Metrics Importance: Appreciating the role of distance metrics in shaping decision boundaries and influencing predictions.
KNN Implementation from Scratch: Crafting KNN code from the ground up, enhancing comprehension of its inner workings.
Fine-Tuning on Wine Quality Dataset: Iteratively optimizing KNN on the wine quality dataset to elevate its predictive capacity.
Advantages and Disadvantages: Weighing the pros and cons to comprehend where KNN excels and where it faces limitations.

Hope you enjoyed this exploration of KNN.

Happy training! 👩‍💻

Explore the GitHub Repo 🎉

Resources

  1. Christopher Bishop, Pattern Recognition and Machine Learning (book)
  2. Real Python: The k-Nearest Neighbors (kNN) Algorithm in Python
  3. scikit-learn: Nearest Neighbors Classification
  4. IBM: K-Nearest Neighbors Algorithm
  5. Machine Learning Basics with the K-Nearest Neighbors Algorithm
  6. Medium: K-Nearest Neighbor
  7. K-Nearest Neighbor (KNN) Explained
  8. A Complete Guide to K-Nearest Neighbors
