K Nearest Neighbours — Introduction to Machine Learning Algorithms

Sachinsoni
7 min read · Jun 11, 2023


Primarily used to solve classification-type problems (it also supports regression)

In the vast realm of machine learning algorithms, few techniques are as versatile and intuitive as the K-nearest neighbors (KNN) algorithm. If you’re seeking a powerful tool for pattern recognition, classification, and regression tasks, KNN offers a straightforward yet effective approach that leverages the power of proximity.

So, what exactly is the K-nearest neighbors algorithm? At its core, KNN is a supervised machine learning algorithm that excels at classification and regression tasks. The KNN algorithm works based on the idea that similar things are closer to each other.

Imagine you have a dataset of different flowers, each described by their petal length and width. You also have labels indicating whether each flower is a rose or a daisy. Now, you encounter a new flower and want to determine its type. This is where KNN comes into play: it looks at the new flower’s nearest neighbors to decide its type. Let’s say we set K = 3, meaning we’ll consider the three closest neighbors to make our prediction. To determine the type of the new flower using KNN, here’s what we’ll do:

  1. Measure the distance: Calculate the distance between the new flower and all the flowers in the dataset. In our case, we can use the Euclidean distance, which is like measuring the straight-line distance between two points. The distance is computed based on the petal length and width.
  2. Find the K nearest neighbors: Identify the K flowers with the shortest distances to the new flower. These are the K nearest neighbors. For instance, if K = 3, we select the three flowers that are closest to our new flower.
  3. Majority voting: Among the K nearest neighbors, count how many are roses and how many are daisies. Whichever type has the majority becomes our prediction for the new flower. For example, if two neighbors are roses and one is a daisy, we predict that the new flower is a rose.

So, this is the idea behind how KNN works. A minimal from-scratch sketch of these three steps is shown below.
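To make the steps concrete, here is a small Python sketch of the algorithm (my own illustration; the petal measurements, labels, and new flower below are made-up values purely for demonstration):

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # 1. Measure the distance: Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # 2. Find the K nearest neighbors: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # 3. Majority voting: the most frequent label among those neighbors wins
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Made-up petal length/width measurements and labels, purely for illustration
X_train = np.array([[4.5, 1.5], [5.0, 1.7], [1.2, 0.4], [1.4, 0.3], [4.8, 1.6]])
y_train = np.array(['rose', 'rose', 'daisy', 'daisy', 'rose'])

print(knn_predict(X_train, y_train, np.array([4.6, 1.4]), k=3))  # -> 'rose'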

Now we will learn how to implement KNN in code, using the breast cancer detection dataset from Kaggle in a Jupyter Notebook. Let’s start:

import pandas as pd
import numpy as np

# Load the dataset and preview the first few rows
df = pd.read_csv('data.csv')
df.head(5)

The dataset looks like this:

Two columns in the dataset (id and Unnamed: 32) are unnecessary and only occupy memory, so we need to drop them.

df.drop(columns=['id', 'Unnamed: 32'], inplace=True)
df.head()

# Split into features (all columns after the first) and target (the first column,
# the diagnosis label), keeping 20% of the rows aside for testing
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(df.iloc[:, 1:], df.iloc[:, 0],
                                                    test_size=0.2, random_state=2)

Since the column values in our dataset are not on the same scale, we need to standardize them to get better accuracy from our model. Standardization, also known as Z-score normalization, transforms the data so that it has zero mean and unit variance. This is achieved by subtracting the mean of each feature and dividing by its standard deviation. The StandardScaler class in scikit-learn performs standardization.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(x_train)  # fit on the training data only
X_test = scaler.transform(x_test)        # apply the same scaling to the test data

Now we need to create an object of the KNeighborsClassifier class from the scikit-learn library.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Train a KNN classifier with K = 3 and evaluate it on the test set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
accuracy_score(y_test, y_pred)

And I got 99.12% accuracy. Here a question arises: how do we select the best value of K in order to get the highest accuracy?

The answer is very simple: by running an experiment for each value. Below is the code for that:

scores = []
for i in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    scores.append(accuracy_score(y_test, y_pred))

import matplotlib.pyplot as plt
plt.plot(range(1, 16), scores)
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.show()

From the above graph, we see that the accuracy is highest when K = 3, so this is how we select the K value by experiment. Click here for the GitHub link.

Overfitting and Underfitting in KNN

  1. Overfitting: Overfitting happens when a model becomes too complex, capturing noise or irrelevant patterns from the training data. In the context of KNN, overfitting can occur when we set K to a very small value, such as 1 or 2. This causes the model to become overly sensitive to local variations in the training data, leading to poor generalization to unseen data.

For example, consider a case where K = 1. The decision boundary between classes will follow each individual training sample closely, resulting in a model that tries to memorize the training data rather than learning the underlying patterns. This can lead to overfitting, where the model performs well on the training data but fails to generalize to new, unseen examples.

Decision boundary in case of overfitting

2. Underfitting: Underfitting occurs when a model is too simple to capture the underlying patterns in the data. In the case of KNN, underfitting can happen when we set K to a large value, such as the total number of training samples. This results in a model that is too generalized and oversimplifies the underlying patterns in the data.

For instance, if we set K to the total number of training samples, the decision boundary will be smooth and less influenced by individual samples. This might lead to underfitting, as the model may fail to capture local variations and intricate patterns present in the data.

Decision boundary in case of underfitting
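Neither decision-boundary figure is reproduced here, but the effect is easy to visualise. The sketch below (my own illustration on scikit-learn’s synthetic make_moons data, not the breast-cancer dataset) draws the decision regions for a very small K and a very large K; the first boundary wraps around individual points, while the second is almost a single smooth region.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.3, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, k in zip(axes, [1, 151]):   # K = 1 tends to overfit, a very large K tends to underfit
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)

    # Predict on a dense grid to draw the decision regions
    xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
                         np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    ax.contourf(xx, yy, Z, alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', s=20)
    ax.set_title(f'K = {k}')
plt.show()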

How to avoid Overfitting and Underfitting in case of KNN?

To avoid overfitting and underfitting in KNN, it is important to choose an appropriate value for K. A small K value can lead to overfitting, while a large K value can result in underfitting. The optimal K value depends on the specific dataset and problem at hand. It is common to use techniques such as cross-validation or grid search to find the optimal K value that balances bias and variance, maximizing the model’s performance on unseen data.
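As an example, a cross-validated grid search over K can be run on the training data prepared earlier (a sketch, assuming the X_train and y_train variables from the code above; the printed best K is illustrative, not a guaranteed result):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Search K = 1..15 with 5-fold cross-validation on the training data,
# instead of picking K by looking at test-set accuracy directly
param_grid = {'n_neighbors': list(range(1, 16))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

print(grid.best_params_)   # e.g. {'n_neighbors': 3}
print(grid.best_score_)    # mean cross-validated accuracy for the best K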

Limitations of KNN

  1. Large dataset: KNN is a lazy learning technique: it does essentially nothing during the training phase, so training is fast, but prediction becomes slow as the dataset grows. For example, imagine a dataset with 1 lakh (100,000) rows and 500 columns; at prediction time the model has to calculate the Euclidean distance to all 100,000 points in a 500-dimensional space, which consumes a lot of time, and no one likes a model that is slow at prediction.
  2. Curse of Dimensionality: The curse of dimensionality refers to the phenomenon where the feature space becomes increasingly sparse as the number of dimensions (features) grows. In high-dimensional spaces, the notion of proximity or similarity becomes less meaningful. This can negatively impact the performance of KNN, as the algorithm relies on the assumption that similar instances are close to each other. In large datasets with high-dimensional feature spaces, the accuracy and effectiveness of KNN may decrease (the short sketch after this list shows how distances concentrate as the number of dimensions grows).
  3. Irrelevant or Outlier Features: In large datasets, it’s common to encounter features that are irrelevant or noisy, meaning they do not contribute much to the underlying patterns or have inconsistent or misleading information. KNN considers all features equally in the distance calculation, which can lead to the inclusion of irrelevant or noisy features in the decision-making process. This can result in suboptimal predictions and decreased performance.
  4. Imbalanced dataset: In an imbalanced dataset, the majority class typically has significantly more samples than the minority class. Since KNN relies on the nearest neighbors for classification, a large number of neighbors from the majority class can overpower the neighbors from the minority class. As a result, the majority class tends to dominate the decision-making process, leading to a bias towards the majority class in the predictions.
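To see why proximity loses meaning in high dimensions, here is a small numpy experiment (my own sketch, not part of the original walkthrough) that compares the nearest and farthest distances from a random query point as the number of features grows; the ratio approaches 1, meaning the “nearest” neighbor is barely nearer than any other point.

import numpy as np

rng = np.random.default_rng(0)
for d in [2, 10, 100, 1000]:
    X = rng.random((1000, d))   # 1000 random points in d dimensions
    q = rng.random(d)           # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    # As d grows, the nearest point is almost as far away as the farthest one
    print(f'd={d:4d}  min/max distance ratio: {dist.min() / dist.max():.3f}')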

Also, KNN is often called a black-box model for the following reasons:

  1. Lack of Explicit Rules or Parameters: KNN does not learn explicit rules or parameters from the training data like other models such as decision trees or linear regression. Instead, it stores the training instances in memory and makes predictions based on the similarity to neighboring instances. This lack of explicit rules or parameters makes it difficult to understand the inner workings of the model.
  2. No Feature Importance or Coefficients: KNN does not provide feature importance or coefficients that indicate the influence of each feature on the prediction. Since KNN considers all features equally in the distance calculation, it does not offer insights into which features contribute more or less to the final decision.
  3. Lack of Model Summary or Explanation: KNN does not provide a concise summary or explanation of the model’s behavior. Unlike models such as decision trees or linear regression, which can be summarized by inspecting the learned rules or coefficients, KNN does not offer a straightforward way to summarize its behavior or reasoning.

I hope this blog improved your knowledge of the KNN algorithm.
