ML Series: Day 15 — KNN — Part 1

Ebrahim Mousavi
10 min read · May 12, 2024

The k-nearest neighbors (KNN) algorithm is a popular supervised machine learning algorithm used for both classification and regression tasks. It is a prime example of a non-parametric, lazy learning algorithm. KNN is considered lazy because it does not learn a specific model from the training data during a training phase; instead, it uses “lazy,” or deferred, learning: it simply memorizes the training dataset and postpones all computation until a prediction is requested. It also makes no assumptions about the underlying data distribution.

KNN is also a non-parametric algorithm because it does not have a fixed number of parameters that are determined during the training process. The number of parameters in KNN grows with the size of the training data, as each data point becomes a parameter. This is unlike parametric models such as linear regression or neural networks, which have a fixed number of parameters regardless of the training data size.

In KNN, the label assigned to an instance x_i is the majority label among its k closest training samples. This approach allows KNN to make predictions by comparing the similarity between an instance and its neighboring data points.
In the k-nearest neighbors algorithm, a query point is located by searching for its nearest neighbors using a distance metric. The algorithm then classifies or predicts the output based on the majority class or the average value of those k nearest neighbors. For instance, in a classification problem with k=5, when a new data point is presented to the algorithm, the five closest points in the training set (determined by a distance metric such as Euclidean distance) are considered, and the class label of the new point is assigned by majority vote among those five neighbors. In a regression problem, instead of assigning a class label, the algorithm predicts the output value by averaging the target values of its k nearest neighbors. Figure 1 illustrates how the KNN algorithm works.

Fig 1. How the “KNN” algorithm works

The new data point is represented by a question mark to signify its unknown label or value. As mentioned earlier, the algorithm utilizes the majority label among the k closest training samples to classify or predict the output for the new data point.
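
For a quick point of reference before building this from scratch, the same majority-vote prediction can be sketched in a few lines with scikit-learn’s KNeighborsClassifier; the tiny two-class dataset and the choice of k=5 below are made up purely for illustration:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Tiny illustrative dataset: two features, two classes (values are made up)
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                    [8.0, 8.5], [8.5, 8.0], [9.0, 9.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# k=5: the query's label is the majority class among its 5 nearest training points
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

new_point = np.array([[8.2, 8.3]])
print(clf.predict(new_point))     # majority vote among the 5 nearest neighbors
print(clf.kneighbors(new_point))  # distances and indices of those neighbors

The from-scratch implementation later in this post reproduces exactly this behavior without the library.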

The choice of the hyperparameter k is crucial in KNN as it determines the number of neighbors considered for prediction. A smaller value of k results in a more complex decision boundary, which may increase the risk of overfitting the training data. On the other hand, a larger value of k can lead to a smoother decision boundary but might cause underfitting, where the model oversimplifies the relationships between data points.

Determining the optimal value of k often relies on the specific problem at hand and can be accomplished through experimentation or cross-validation. It’s essential to strike a balance between model complexity and generalization by selecting an appropriate value of k that achieves satisfactory performance on unseen data.
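
As a minimal sketch of that selection process, one common approach is to cross-validate a handful of candidate k values and keep the one with the best mean score; the candidate list and the synthetic dataset below are illustrative assumptions, not a prescription:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# 5-fold cross-validated accuracy for each candidate k
for k in [1, 3, 5, 7, 9, 15]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}  mean accuracy={scores.mean():.3f}")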

Figure 2 shows the effect of choosing the value of k.

Fig 2. Selection of different values of k

In Figure 2, the horizontal axis shows the value of k (the number of neighbors) and the vertical axis shows the accuracy of the model.

In this tutorial, we’ll explore the implementation of the K-Nearest Neighbors (KNN) algorithm from scratch in Python. KNN is a fundamental machine learning algorithm used for classification and regression tasks. By building KNN from scratch, we’ll gain a deeper understanding of its inner workings and learn how to apply it to real-world datasets.

Let’s dive in and see how we can implement the KNN algorithm step by step!

If you’re interested in exploring the code further or trying it out yourself, you can find the complete implementation in my GitHub repository. Feel free to clone or fork the repository to experiment with the code: [GitHub repository]

Step 1: Importing Libraries

We begin by importing the necessary libraries. We’ll use NumPy for numerical computations, Matplotlib for data visualization, mlxtend for plotting decision regions, and scikit-learn for generating synthetic data and splitting it into training and testing sets.

import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

Step 2: Defining the KNN Class

The KNN class serves as the core component of our K-Nearest Neighbors (KNN) algorithm implementation. Within this class, we encapsulate all the essential functionality needed to instantiate, train, and utilize the KNN model for making predictions. Key methods within the class include:

  • __init__: Initializes the KNN classifier with a default value of k, representing the number of neighbors to consider during prediction.
  • fit: Fits the model to the provided training data, storing the training features (X_train) and corresponding labels (y_train).
  • euclidean_distance: Computes the Euclidean distance between two data points, a crucial metric used in determining the similarity between instances.
  • predict: Takes a set of input data points and returns an array of predicted labels based on the k-nearest neighbors.
  • _predict: Internal method used by predict to calculate predictions for individual data points by identifying the k-nearest neighbors and applying a majority voting scheme.

By encapsulating these functionalities within the KNN class, we create a modular and reusable implementation of the KNN algorithm, making it easier to understand, maintain, and extend. This is the KNN Class:

class KNNClassification:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        # Distance from x to every stored training sample
        distances = [self.euclidean_distance(x, x_train) for x_train in self.X_train]
        # Indices of the k closest training samples
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_labels = [self.y_train[i] for i in k_indices]
        # Majority vote among the k nearest labels
        most_common = max(set(k_nearest_labels), key=k_nearest_labels.count)
        return most_common

def plot_data(x_train, y_train, x_test, y_test):
    # (o == circle for train) and (s == square for test)
    plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, s=100, cmap='jet', marker='o')
    plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test, s=100, cmap='jet', marker='s')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Generated Data')
    plt.show()

This part of the code serves as the main execution section, where we instantiate the KNN algorithm, train it on synthetic data, visualize the decision boundary, evaluate its performance, and plot the training and test data points along with the predicted labels.

if __name__ == "__main__":
    # Generate a small synthetic classification dataset
    X, y = make_classification(n_samples=20, n_features=2, n_informative=2, n_redundant=0,
                               n_clusters_per_class=1, class_sep=1., random_state=23)

    # Split before fitting so the test points stay unseen by the model
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=19)
    plot_data(x_train, y_train, x_test, y_test)

    knn = KNNClassification(k=3)
    knn.fit(x_train, y_train)

    plot_decision_regions(X, y, clf=knn, legend=2)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('KNN Decision Boundary')
    plt.show()

    y_pred = knn.predict(x_test)

    # Calculate accuracy
    accuracy = np.mean(y_pred == y_test) * 100
    print(f"Accuracy: {accuracy:.2f}%")

    # (o == circle for train), (s == square for test), and (x for predicted test)
    plt.scatter(x_train[:, 0], x_train[:, 1], c=y_train, s=100, cmap='jet', marker='o', label='Train (Circle)')
    plt.scatter(x_test[:, 0], x_test[:, 1], c=y_test, s=100, cmap='jet', marker='s', label='Test (Square)')
    plt.scatter(x_test[:, 0], x_test[:, 1], c=y_pred, s=150, cmap='jet', marker='x', label='Predicted Test (Cross)')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('Train-Test, Predicted')
    plt.legend()
    plt.show()

Challenges of K-Nearest Neighbors

While K-Nearest Neighbors (KNN) is a straightforward and effective algorithm, it’s not immune to challenges. In this discussion, we’ll explore some common issues encountered when using KNN and demonstrate them through code examples. By examining how KNN handles increasing dataset sizes, we aim to illustrate the algorithm’s scalability limitations and shed light on the trade-offs between computational time and memory usage. Let’s dive into the code to uncover these challenges.

import time

# Define a range of sample sizes to test
sample_sizes = [1000, 5000, 10000, 50000, 100000]

# Lists to store time and memory usage for each sample size
execution_times = []
memory_usages = []

# Loop through each sample size
for sample_size in sample_sizes:
    print(f"Testing with {sample_size} data points...")

    # Generate synthetic data
    X, y = make_classification(n_samples=sample_size, n_features=2, n_informative=2, n_redundant=0,
                               n_clusters_per_class=1, class_sep=1., random_state=42)

    # Split data into training and testing sets
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

    # Memory needed to store the training features (KNN keeps all of them)
    train_memory_usage = x_train.size * x_train.itemsize
    memory_usages.append(train_memory_usage)

    # Instantiate and fit the KNN model
    model = KNNClassification(k=3)
    model.fit(x_train, y_train)

    # Measure prediction time for a single test data point
    start_time = time.time()
    model.predict(x_test[[0], :])
    end_time = time.time()
    execution_time = end_time - start_time
    execution_times.append(execution_time)

# Plotting the results
plt.figure(figsize=(10, 5))

# Plot execution time
plt.subplot(1, 2, 1)
plt.plot(sample_sizes, execution_times, marker='o')
plt.xlabel('Number of Data Points')
plt.ylabel('Execution Time (seconds)')
plt.title('Execution Time vs. Number of Data Points')

# Plot memory usage
plt.subplot(1, 2, 2)
plt.plot(sample_sizes, memory_usages, marker='o')
plt.xlabel('Number of Data Points')
plt.ylabel('Memory Usage (bytes)')
plt.title('Memory Usage vs. Number of Data Points')

plt.tight_layout()
plt.show()

After running the above code, we can see the plot:

Fig 3. Computational time and memory usage for KNN

KNN Regression: Predictive Modeling with K-Nearest Neighbors

Regression with KNN involves predicting continuous target values by determining the average of the nearest neighbors’ target values for each data point.

Next, let’s adapt the previous code so that it is suitable for regression.

Differences between Regression and Classification Codes:

  1. Dataset Generation:
  • Previous (Classification): The dataset was generated using `make_classification` to create synthetic classification data.
  • Current (Regression): Synthetic regression data is generated using `make_regression`.

  2. Model Prediction:
  • Previous: The model predicted a discrete class label for each data point by majority vote among its k nearest neighbors.
  • Current: The model predicts a continuous target value by averaging the targets of its k nearest neighbors.

  3. Evaluation Metric:
  • Since this is a regression task, metrics such as Mean Squared Error (MSE) or R-squared are used instead of accuracy.

  4. Visualization:
  • Previous: Decision boundaries were visualized using `plot_decision_regions` to illustrate classification regions.
  • Current: The regression line is visualized to showcase the KNN model’s predictive capabilities for regression tasks.

  5. Label Encoding:
  • No label encoding is required for regression, since the target values are continuous rather than discrete class labels.

Here’s the modified code for regression with KNN:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

class KNNRegression:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X_train = X
        self.y_train = y

    def euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2) ** 2))

    def predict(self, X):
        y_pred = [self._predict(x) for x in X]
        return np.array(y_pred)

    def _predict(self, x):
        distances = [self.euclidean_distance(x, x_train) for x_train in self.X_train]
        k_indices = np.argsort(distances)[:self.k]
        k_nearest_targets = [self.y_train[i] for i in k_indices]
        # Average the targets of the k nearest neighbors
        return np.mean(k_nearest_targets)


if __name__ == "__main__":
    # Generate synthetic regression data
    X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)

    # Instantiate and fit the KNN regression model
    knn = KNNRegression(k=3)
    knn.fit(X, y)

    # Generate a range of feature values for plotting the regression line
    x_range = np.linspace(X.min(), X.max(), num=100).reshape(-1, 1)
    # Predict target values for the feature range
    y_pred = knn.predict(x_range)

    # Plot the data points and the regression line
    plt.scatter(X, y, color='blue', s=30, marker='o', label='Data')
    plt.plot(x_range, y_pred, color='red', linewidth=2, label='KNN Regression')
    plt.xlabel('Feature')
    plt.ylabel('Target')
    plt.title('KNN Regression')
    plt.legend()
    plt.show()

The plot depicts the relationship between the feature and the target variable through the fitted KNN regression line:

Fig 4. KNN regression line
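
As an optional sanity check on the from-scratch model, scikit-learn’s KNeighborsRegressor with uniform weights and the same k should produce essentially the same predictions (up to ties in distance). This snippet assumes X, y, x_range, and y_pred from the script above are still in scope:

from sklearn.neighbors import KNeighborsRegressor

# Same data and the same k as above; uniform weights average the neighbors' targets
sk_knn = KNeighborsRegressor(n_neighbors=3, weights='uniform')
sk_knn.fit(X, y)
y_pred_sklearn = sk_knn.predict(x_range)

# Compare the library predictions with the from-scratch ones
print(np.allclose(y_pred, y_pred_sklearn))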

Exploring the Impact of K in K-Nearest Neighbors (KNN) Algorithm: Balancing Overfitting and Underfitting

In this analysis, we delve into the significance of the parameter “K” in the K-Nearest Neighbors (KNN) algorithm and its effects on classification accuracy, decision boundaries, overfitting, and underfitting.

Overfitting often occurs when the value of K is too small in the K-Nearest Neighbors (KNN) algorithm. With a small K value, the model becomes overly sensitive to noise and outliers in the data, resulting in complex decision boundaries that fit the training data too closely. This can lead to poor generalization to unseen data, as the model may capture noise rather than underlying patterns.

Conversely, underfitting can occur when the value of K is too large. With a large K value, the model averages over too many neighbors, leading to overly simplistic decision boundaries that fail to capture the true structure of the data. This can result in low accuracy on both the training and test datasets, indicating that the model is unable to sufficiently capture the complexity of the underlying patterns.

By systematically varying the value of K and observing its influence on model performance and behavior, we gain insights into the trade-offs involved and uncover optimal strategies for parameter selection in KNN-based tasks, ensuring a balance between overfitting and underfitting.

from sklearn.metrics import mean_squared_error

def test_k_values(X_train, y_train, X_range, k_values):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    axs = axs.flatten()
    for i, k in enumerate(k_values):
        # The KNNRegression class defined in the previous code
        knn_reg = KNNRegression(k=k)

        # Fit the KNN regression model to the training data
        knn_reg.fit(X_train, y_train)

        # Predict target values for the range of feature values
        y_pred = knn_reg.predict(X_range)

        # Training-set MSE for this value of K
        mse = mean_squared_error(y_train, knn_reg.predict(X_train))

        axs[i].scatter(X_train, y_train, color='blue', s=30, marker='o', label='Data')
        axs[i].plot(X_range, y_pred, color='red', linewidth=2, label=f'KNN Regression (K={k}), MSE: {mse:.2f}')
        axs[i].set_xlabel('Feature')
        axs[i].set_ylabel('Target')
        axs[i].set_title(f'KNN Regression (K={k})')
        axs[i].legend()
    plt.tight_layout()
    plt.show()


if __name__ == "__main__":
    # Generate synthetic regression data
    X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

    # Generate a range of feature values for plotting the regression line
    X_range = np.linspace(X.min(), X.max(), num=100).reshape(-1, 1)

    # Test different values of K and plot the results
    k_values = [1, 3, 5, 7]
    test_k_values(X_train, y_train, X_range, k_values)

This is the output of the code:

Fig 5. Results for different values of K, plotted in a grid

Outlining the pros and cons of K-Nearest Neighbors (KNN):

Fig 6. Pros and cons of KNN

Enhancements and Extensions of the K-Nearest Neighbors (KNN) Algorithm: Addressing Limitations and Improving Performance

The K-Nearest Neighbors (KNN) algorithm, while simple and intuitive, has its share of disadvantages. However, researchers and practitioners have developed various improvements and extensions to address these limitations, each aimed at a specific drawback. For instance, weighted (distance-weighted) KNN reduces the influence of distant, less similar neighbors by weighting each neighbor’s vote according to its distance. Similarly, spatial data structures such as KD-trees and Ball trees improve the algorithm’s computational efficiency on large datasets by organizing the training points so that nearest neighbors can be found without comparing the query against every sample. In the next part of our machine learning series, we will delve deeper into these improvements and extensions, exploring how they enhance the performance and versatility of the KNN algorithm.
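
As a brief preview, both of those ideas are available off the shelf in scikit-learn’s KNeighborsClassifier through the weights and algorithm parameters; the synthetic dataset below is only a placeholder:

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Placeholder dataset, large enough that the neighbor-search structure matters
X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

# Distance-weighted voting: closer neighbors get a larger say in the prediction
weighted_knn = KNeighborsClassifier(n_neighbors=5, weights='distance')
weighted_knn.fit(X, y)

# KD-tree (or 'ball_tree') organizes the training points for faster neighbor search
tree_knn = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree')
tree_knn.fit(X, y)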
