A beginner’s guide to building a KNN model: Everything You Need To Know

Learn the basics of the K-Nearest Neighbors (KNN) algorithm and how to build a KNN model step by step for data classification

Queen
7 min read · May 10, 2023

Requirements:

  • A good understanding of the Python programming language, including basic data types, control structures, and functions.
  • Basic machine learning concepts, including the types of machine learning and which algorithms suit labelled or unlabelled data.
  • A good foundation in linear algebra and statistics, including mean, variance, probability, vectors, matrices, etc.
  • Familiarity with data preprocessing techniques, such as standardization, scaling, and normalization.

Introduction to KNN

K-Nearest Neighbors, popularly referred to as KNN, is a supervised learning algorithm that can be used for both classification and regression, though it is mostly used for classification tasks in machine learning.

The basic idea behind KNN is to classify or predict the label of test data by examining the k closest data points in the training set, where k is a hyperparameter chosen by the user.

For example, suppose you’re in a building with two rooms: Room 1 and Room 2. Room 1 is filled with girls and Room 2 is filled with boys. You’re standing in the hallway, and you notice a student standing alone near the entrance of Room 2.


Using the K-Nearest Neighbors (KNN) algorithm, you could predict which room the student belongs to based on the student’s nearest neighbors. In this case, the nearest neighbors are the other students standing closest to them. Since the student is standing close to Room 2, the KNN algorithm predicts that the student is likely a boy and therefore belongs in Room 2 with the other boys.

Therefore, for any new student who appears in the hallway, their nearest neighbours are the students standing closest to them, and the room those neighbours occupy is the predicted classification. Here the new student is the query point, and ‘k’ is the number of nearby students we consider.

The ‘k’ in KNN


In the context of the K-Nearest Neighbors (KNN) algorithm, “k” is not an acronym. It is simply a parameter that represents the number of nearest neighbours to consider when making predictions for a new data point.

In KNN, the algorithm searches for the K nearest data points in the training set to the new data point (the test data) based on a distance metric, such as Euclidean distance, and then uses their labels to predict the label or value of the new data point. Put simply, classifying a new point means finding the labels of the K training points closest to it. The value of K is a hyperparameter that needs to be specified before training the model. You can refer to the two-rooms example above for better understanding.
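To make the idea concrete, here is a minimal from-scratch sketch in Python. The knn_predict helper and the tiny two-rooms dataset are hypothetical, invented purely for illustration: the function computes the Euclidean distance from the new point to every training point, keeps the K smallest, and returns the majority label among them.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbours' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data echoing the two-rooms example
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array(["girl", "girl", "boy", "boy"])
print(knn_predict(X_train, y_train, np.array([5.5, 8.5]), k=3))  # prints "boy"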

Choosing an appropriate value of K is important, as it can have a significant impact on the performance of the KNN algorithm. A smaller value of K results in a more complex model, which may lead to overfitting and poor generalization on new, unseen data. On the other hand, a larger value of K can help reduce the impact of noisy or irrelevant features in the dataset, producing a smoother decision boundary and a simpler model that is less likely to overfit, but it may underfit and perform poorly even on the training data.
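One way to see this trade-off is to plot training and test accuracy across a range of K values. The sketch below uses scikit-learn’s built-in iris dataset purely for illustration: very small K tends to score near-perfectly on the training data (overfitting), while very large K flattens the decision boundary and both scores drop (underfitting).

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

ks = range(1, 26)
train_acc, test_acc = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))  # high at small k
    test_acc.append(knn.score(X_test, y_test))     # peaks at a moderate k

plt.plot(ks, train_acc, label="training accuracy")
plt.plot(ks, test_acc, label="test accuracy")
plt.xlabel("k (number of neighbours)")
plt.ylabel("accuracy")
plt.legend()
plt.show()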

The optimal value of K depends on the nature of the dataset, the problem being solved, and the performance metrics used for evaluation. Typically, the value of K is chosen by performing a grid search or cross-validation on a range of values and selecting the value that gives the best performance on the validation set.

Determining The Value Of ‘K’

  • One common approach is to use a grid search, which involves evaluating the model’s performance using different values of K and selecting the value that yields the best performance on a validation set. The process involves dividing the dataset into training, validation, and test sets. The training set is used to train the KNN model, while the validation set is used to tune the hyperparameters, such as K. Finally, the test set is used to evaluate the performance of the final model.
  • Another approach is to use cross-validation, which involves partitioning the dataset into a number of folds, training the model on all but one fold, and evaluating its performance on the remaining fold. We repeat this process once per fold, with each fold serving as the validation set exactly once, and average the performance metrics across the folds. We then select the value of K (the number of neighbours, not the number of folds) that yields the best average performance, as shown in the sketch after this list.
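In practice, scikit-learn’s GridSearchCV combines both ideas, running a grid search in which each candidate K is scored by cross-validation. A minimal sketch, assuming X_train and y_train already exist as in the implementation section below:

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

param_grid = {"n_neighbors": list(range(1, 31))}  # candidate values of K
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_params_)  # the best K found, e.g. {'n_neighbors': 7}
print(grid.best_score_)   # its mean cross-validated accuracy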

Ultimately, the appropriate value of K depends on the specific characteristics of the dataset, the problem being solved, and the performance metrics used for evaluation. Therefore, it is essential to experiment with different values of K and evaluate their performance using appropriate metrics to select the optimal value.

Limitations To Using KNN Algorithm

  • Computationally expensive: KNN needs to calculate the distance between the query point and all the training points in the dataset, which can be computationally expensive, especially for large datasets. This can lead to slow training and testing times.
  • Sensitive to feature scaling: KNN is sensitive to the scale of the features, which means that features with larger scales can dominate the distance calculation, leading to biased results. Therefore, it’s important to normalize or scale the features before applying KNN (see the scaling sketch after this list).
  • Sensitive to outliers: KNN is sensitive to outliers, which can greatly affect the distance calculation and lead to inaccurate predictions. Outliers can have a significant impact on the nearest neighbors, especially when k is small.
  • Dimensionality problem: KNN can suffer from the curse of dimensionality: as the number of features or dimensions grows, the volume of the feature space grows exponentially, training points become sparse, and far more data is required to make accurate predictions, which can lead to overfitting or underfitting.
  • Requires optimal value of k: The choice of the optimal value of k can be difficult, and it can greatly affect the performance of the algorithm. A value of k that is too small can lead to overfitting, while a value that is too large can lead to underfitting.
  • Class imbalance: KNN can perform poorly when dealing with class imbalance, where one class has significantly more samples than the other class. In such cases, the majority class can dominate the nearest neighbors, leading to biased predictions.
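To address the scaling issue in particular, here is a minimal sketch using scikit-learn’s pipeline utilities, assuming X_train, X_test, y_train, and y_test already exist:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# StandardScaler gives each feature zero mean and unit variance, so no
# single feature dominates the Euclidean distance calculation
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))

Putting the scaler inside the pipeline also ensures the scaling parameters are learned from the training data only, avoiding leakage from the test set.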

Overall, KNN can be a useful algorithm for certain types of problems, but it is wise to weigh these limitations against the problem you are trying to solve before applying KNN to your datasets.

Step-by-Step Implementation of a KNN Classifier Model

Below is a dummy step-by-step example of implementing a KNN classifier.

Step 1: Import the necessary Python packages.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # For data visualization

Step 2: Load your data

dataset = pd.read_csv("dataset.csv")

# To display the first few rows of your data
dataset.head()

Step 3: Data preprocessing — This step depends on the dataset you are working with. In some cases, the raw data may be clean, well-structured, and not require much preprocessing. However, in most cases, the raw data is noisy, incomplete, and may contain irrelevant features, and preprocessing is necessary to make the data suitable for machine learning.

If the data contains missing values or outliers, then data cleaning may be necessary to handle these issues. If the data has many features, then data reduction techniques may be necessary to reduce the dimensionality of the data and improve the model’s performance.
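As an illustration only (the right choices depend entirely on your data), here are a couple of common preprocessing moves on the dataset loaded above:

# Inspect how many values are missing in each column
print(dataset.isna().sum())

# One simple option: drop rows that contain missing values
dataset = dataset.dropna()

# Alternatively, impute missing numeric values with the column median
# dataset = dataset.fillna(dataset.median(numeric_only=True))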

Step 4: Define your feature and target variables.

X = dataset.iloc[:, :-1].values  # Features: every column except the last
y = dataset.iloc[:, -1].values   # Target: the last column

Step 5: Split your dataset into training and test data by specifying the percentage of data to be used for training and testing.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

Step 6: Instantiate and train the KNN model.

from sklearn.neighbors import KNeighborsClassifier
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

Step 7: Now we can make predictions on the test set.

y_pred = knn.predict(X_test)

Step 8: Evaluate the model on the test set.

accuracy = knn.score(X_test, y_test)
print(accuracy)

Step 9: Print the evaluation metrics of the model against the dataset

from sklearn.metrics import confusion_matrix, classification_report

confusion_result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(confusion_result)

classification_result = classification_report(y_test, y_pred)
print("Classification Report:")
print(classification_result)

In Conclusion,

The KNN algorithm can be applied to image processing tasks, such as object detection and image segmentation, to identify different parts of an image, such as faces or buildings, by comparing their features to those of their nearest neighbors.

It can also be applied in Natural Language Processing, for example classifying news articles into different categories based on their textual features; text classification and sentiment analysis are common use cases. Beyond that, KNN is used for tasks such as anomaly detection, recommendation systems, etc.
