K-Nearest Neighbor (KNN) Algorithm in Machine Learning

Rizwana Yasmeen
11 min read · Jun 21, 2023

The K-Nearest Neighbors (KNN) technique is a well-known supervised machine-learning algorithm that can be used for classification and regression tasks. It makes predictions for new data points based on their similarity to existing, labeled data points.

The “K” in K-NN refers to the number of nearest neighbors to take into account while making a prediction. The method operates by comparing a new, unlabeled data point to the labeled data points in the training dataset. It finds the K nearest neighbors using a distance metric, such as Euclidean distance or Manhattan distance, which measures the similarity of data points.

For classification problems, K-NN assigns the class label to the new data point based on the majority vote of its K nearest neighbors. For example, if the majority of the K nearest neighbors are from class A, the algorithm predicts that the new data point will be from class A as well.

In regression problems, K-NN predicts the numerical value of the target variable for the new data point by taking the average or weighted average of the target values of its K nearest neighbors.
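
For instance, a minimal sketch of K-NN regression using scikit-learn's KNeighborsRegressor on a small made-up dataset might look like this:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Small made-up dataset: one feature, one numerical target.
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2, 6.1])

# With K = 3, the prediction is the average of the 3 nearest targets.
knn_reg = KNeighborsRegressor(n_neighbors=3)
knn_reg.fit(X_train, y_train)

# The neighbors of x = 3.6 are x = 3, 4, 5, so the prediction is their mean (about 4.07).
print(knn_reg.predict([[3.6]]))
```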

The choice of K is a crucial parameter in K-NN. A small value of K (e.g., K=1) can lead to overfitting, where the algorithm becomes sensitive to noisy data points. On the other hand, a large value of K can lead to underfitting, where the algorithm may overlook local patterns in the data. Hence, selecting an appropriate value of K is essential to achieve optimal performance.

How does K-NN function?

Step 1: Select the number K of neighbors.

Step 2: Calculate the Euclidean distance between the new data point and every point in the training data.

Step 3: Take the K training points with the smallest distances as the nearest neighbors.

Step 4: Count the number of data points in each category among these K neighbors.

Step 5: Assign the new data point to the category with the greatest number of neighbors.

Step 6: Our model is ready.
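
These steps can be sketched directly in code. Below is a minimal from-scratch illustration using NumPy and a made-up toy dataset, not an optimized implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Step 2: Euclidean distance from x_new to every training point.
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 3: indices of the K closest neighbors.
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: count the categories among the neighbors and take the majority.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy dataset with two categories, 'A' and 'B'.
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]])
y_train = np.array(['A', 'A', 'A', 'B', 'B', 'B'])

print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 'A'
```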

Suppose we have a new data point that needs to be assigned to one of two categories, A or B. The process looks like this:

  • Firstly, we choose the number of neighbors, so we choose K = 5.
  • Next, we calculate the Euclidean distance between the new data point and each existing data point. The Euclidean distance is the familiar straight-line distance between two points from geometry; its formula is given in the next section.
  • After calculating the Euclidean distances, we find that among the five nearest neighbors, three belong to Category A and two belong to Category B.
  • Since the majority of the nearest neighbors are from Category A, the new data point is assigned to Category A. A code sketch of this example follows below.
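
The same kind of decision can be reproduced with scikit-learn's KNeighborsClassifier; here is a rough sketch on a made-up 2D dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 2D points: Category A clustered on the lower left, B on the upper right.
X = np.array([[1, 2], [2, 2], [2, 3], [3, 2],   # Category A
              [4, 3], [5, 4], [6, 5], [6, 6]])  # Category B
y = np.array(['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'])

clf = KNeighborsClassifier(n_neighbors=5)  # K = 5
clf.fit(X, y)

# For a point near the boundary, the majority of its 5 nearest neighbors is Category A.
print(clf.predict([[3, 3]]))  # -> ['A']
```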

Distance Metrics Used in K-NN Algorithm

As we know, the KNN algorithm helps us find the points or groups closest to a query point. To do that, we need a measure of how far apart two points are. The most commonly used distance metrics are listed below:

· Euclidean Distance

· Manhattan Distance

· Minkowski Distance

· Hamming Distance

· Jaccard Distance

Euclidean Distance: Euclidean distance is a measure of the straight-line distance between two points in a Euclidean space. In mathematics, it is calculated using the Pythagorean theorem.

Let’s consider two points in an n-dimensional Euclidean space, represented as P = (p₁, p₂, …, pₙ) and Q = (q₁, q₂, …, qₙ), where pᵢ and qᵢ are the i-th coordinates of P and Q.

The Euclidean distance between these two points, denoted d(P, Q), is calculated as follows:

d(P, Q) = sqrt((q₁ − p₁)² + (q₂ − p₂)² + … + (qₙ − pₙ)²)

In other words, the Euclidean distance is the square root of the sum of the squared differences between the corresponding coordinates of the two points.

For example, in a two-dimensional space (n = 2), the Euclidean distance between points P(x₁, y₁) and Q(x₂, y₂) can be calculated as:

d(P, Q) = sqrt((x₂ − x₁)² + (y₂ − y₁)²)

This formula represents the length of the straight-line segment connecting the two points.

The Euclidean distance is commonly used as a distance metric in various applications, including machine learning algorithms like KNN, where it measures the similarity or dissimilarity between data points based on their feature values.

It’s important to note that the Euclidean distance assumes that the dimensions or features are continuous and numerical. If the data contains categorical or binary features, other distance metrics such as Hamming distance or Jaccard distance may be more appropriate.
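
As a quick sketch, the Euclidean distance can be computed with NumPy:

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance: square root of the sum of squared coordinate differences."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((q - p) ** 2))

print(euclidean_distance([2, 3], [5, 7]))  # sqrt(3**2 + 4**2) = 5.0
```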

Manhattan Distance: Manhattan distance, also known as the L1 distance or the city block distance, is a metric used to measure the distance between two points in a grid-like space. It calculates the sum of the absolute differences between the coordinates of the two points along each dimension.

Let’s consider a simple example with two points, A and B, in a two-dimensional grid:

Point A: (2, 3), Point B: (5, 7)

To calculate the Manhattan distance between A and B, we compute the absolute differences between their coordinates along each dimension and sum them up:

Manhattan distance = |2 − 5| + |3 − 7| = 3 + 4 = 7.

In this case, the Manhattan distance between A and B is 7. It represents the minimum number of steps required to move from point A to point B, considering only horizontal and vertical movements, as if navigating through a city block grid.

The Manhattan distance can also be extended to higher-dimensional spaces. For example, consider two points in a three-dimensional space:

Point A: (2, 3, 1), Point B: (5, 7, 2)

Manhattan distance = |2 − 5| + |3 − 7| + |1 − 2| = 3 + 4 + 1 = 8.

In this case, the Manhattan distance between A and B is 8, taking into account the absolute differences along each dimension.
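
A minimal NumPy sketch of the same calculation:

```python
import numpy as np

def manhattan_distance(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(np.abs(q - p))

print(manhattan_distance([2, 3], [5, 7]))        # |2-5| + |3-7| = 7
print(manhattan_distance([2, 3, 1], [5, 7, 2]))  # 3 + 4 + 1 = 8
```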

Minkowski Distance: The Minkowski distance is a generalized distance metric that encompasses both the Euclidean distance and the Manhattan distance as special cases. It allows for flexibility in adjusting the level of emphasis on different dimensions. The Minkowski distance is defined as:

d(x, y) = (∑ᵢ |xᵢ − yᵢ|ᵖ)^(1/p),

where x and y are two vectors of equal length representing points in an n-dimensional space, |xᵢ − yᵢ| represents the absolute difference between the corresponding elements of x and y, and p is a parameter controlling the level of emphasis on different dimensions.

When p = 1, the Minkowski distance reduces to the Manhattan distance:

d(x, y) = ∑ᵢ |xᵢ − yᵢ|.

When p = 2, the Minkowski distance reduces to the Euclidean distance:

d(x, y) = (∑ᵢ (xᵢ − yᵢ)²)^(1/2).

Let’s consider an example to illustrate the calculation of the Minkowski distance. Suppose we have two points, A and B, in a two-dimensional space:

Point A: (2, 3), Point B: (5, 7)

Let’s calculate the Minkowski distance between A and B for different values of p:

p = 1 (Manhattan distance):

d(A, B) = |2 − 5| + |3 − 7| = 3 + 4 = 7.

p = 2 (Euclidean distance):

d(A, B) = ((2 − 5)² + (3 − 7)²)^(1/2) = √(9 + 16) = √25 = 5.

p = 3:

d(A, B) = (|2 − 5|³ + |3 − 7|³)^(1/3) = ∛(27 + 64) = ∛91 ≈ 4.498.

In this example, we can observe how different values of p affect the resulting Minkowski distance between the two points. A higher value of p places more weight on the largest coordinate differences (as p grows, the distance approaches the maximum coordinate difference, the Chebyshev distance), whereas p = 1 weights all coordinate differences equally.

The Minkowski distance is a versatile distance metric used in various machine learning algorithms and applications, particularly when dealing with data in high-dimensional spaces where different dimensions may have varying levels of importance.
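
The worked values above can be checked with SciPy's minkowski function, as a quick sketch:

```python
from scipy.spatial.distance import minkowski

A, B = [2, 3], [5, 7]

print(minkowski(A, B, p=1))  # p = 1, Manhattan distance: 7.0
print(minkowski(A, B, p=2))  # p = 2, Euclidean distance: 5.0
print(minkowski(A, B, p=3))  # p = 3: (27 + 64)**(1/3) ≈ 4.498
```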

Hamming Distance: Hamming distance is a metric commonly used to measure the similarity or dissimilarity between two binary feature vectors. It is particularly useful when dealing with categorical or binary data, where each feature can take only two possible values.

Mathematically, let’s consider two feature vectors, x and y, each consisting of n binary features. We can represent each feature vector as a binary array, where x = [x₁, x₂, …, xₙ] and y = [y₁, y₂, …, yₙ], and xᵢ and yᵢ represent the i-th features of vectors x and y, respectively.

The Hamming distance between x and y, denoted as d(x, y), can be calculated using the following formula:

d(x, y) = ∑ᵢ(xᵢ ⊕ yᵢ),

where ⊕ denotes the element-wise XOR operation. This means that the Hamming distance is the sum of the bitwise differences between the corresponding features of the two vectors.

To clarify, the XOR operation (⊕) returns 1 if the bits being compared are different and 0 if they are the same. Therefore, summing up all the XOR results gives us the count of positions at which the two strings differ.

Let’s consider an example to illustrate this:

A = 101101, B = 100111

To calculate the Hamming distance between A and B, we compare the corresponding bits of the two strings:

d(A, B) = (1 ⊕ 1) + (0 ⊕ 0) + (1 ⊕ 0) + (1 ⊕ 1) + (0 ⊕ 1) + (1 ⊕ 1) = 0 + 0 + 1 + 0 + 1 + 0 = 2.

In this case, the Hamming distance between A and B is 2, indicating that the two strings differ at two positions.
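
A short Python sketch of this calculation:

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary strings differ."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bit_a != bit_b for bit_a, bit_b in zip(a, b))

print(hamming_distance("101101", "100111"))  # -> 2
```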

Jaccard Distance: The Jaccard distance is a measure of dissimilarity between two sets. It is defined as the size of the symmetric difference of the sets divided by the size of their union. Mathematically, the Jaccard distance between sets A and B is calculated as:

J(A, B) = 1 − |A ∩ B| / |A ∪ B|

where:

A ∩ B represents the intersection of sets A and B (the elements that are common to both sets).

A ∪ B represents the union of sets A and B (all the unique elements from both sets).

|A| denotes the cardinality (size) of set A.

To illustrate with an example, let’s consider two sets A and B:

A = {1, 2, 3}, B = {2, 3, 4, 5}

First, we calculate the intersection and union of the sets:

A ∩ B = {2, 3}

A ∪ B = {1, 2, 3, 4, 5}

The size of the intersection, |A ∩ B|, is 2, and the size of the union, |A ∪ B|, is 5. Therefore, the Jaccard similarity coefficient, which we can denote S(A, B), is:

S(A, B) = |A ∩ B| / |A ∪ B|

= 2 / 5

= 0.4

Finally, we can calculate the Jaccard distance using the formula:

J(A, B) = 1 − S(A, B)

= 1 − 0.4

= 0.6

So, in this example, the Jaccard distance between sets A and B is 0.6, indicating a relatively high level of dissimilarity between the sets.
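
A minimal sketch using Python sets:

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention: two empty sets are considered identical
    return 1 - len(a & b) / len(union)

print(jaccard_distance({1, 2, 3}, {2, 3, 4, 5}))  # 1 - 2/5 = 0.6
```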

Advantages of K-NN:

Simplicity: K-NN is a simple and easy-to-understand algorithm. It does not make any assumptions about the underlying data distribution, making it a non-parametric algorithm.

No training phase: K-NN is a lazy learning algorithm, meaning it does not explicitly build a model during the training phase. It memorizes the training data and performs computations at the time of prediction. This can be beneficial when the training data is frequently updated.

Versatility: K-NN can be applied to both classification and regression tasks. It can handle multi-class classification problems and can adapt to different types of data.

Robust to outliers: K-NN is relatively robust to noisy data and outliers, especially for larger values of K, because predictions are based on a local neighborhood of points rather than a single example.

Interpretable results: K-NN provides transparency in the decision-making process. The prediction is based on the nearest neighbors, allowing users to interpret and understand the reasons behind the classification or regression outcomes.

Disadvantages of K-NN:

Computationally expensive: The K-NN algorithm can be computationally expensive, especially when dealing with large datasets. As it requires calculating distances between the new example and all training examples, the time complexity grows linearly with the number of training instances.

Memory requirements: Since K-NN memorizes the entire training dataset, it requires a significant amount of memory to store the training examples. As the dataset grows larger, the memory requirements also increase.

Sensitivity to feature scaling: K-NN calculates distances between data points, and the choice of distance metric can be sensitive to the scale of features. It is crucial to perform feature scaling or normalization to ensure that all features contribute equally to the distance calculations.
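
In practice this is usually handled by standardizing the features before fitting K-NN; a minimal scikit-learn sketch on a made-up dataset:

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up dataset; exaggerate one feature's scale to mimic unscaled data.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X[:, 0] *= 1000

# Scaling inside a pipeline ensures every feature contributes comparably to distances.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X, y)
print(model.score(X, y))
```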

Determining the optimal value of K: The selection of the value for K, the number of neighbors, is critical. Choosing an inappropriate K value can lead to overfitting or underfitting. It requires careful tuning and validation to find the optimal K for a given problem.
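
A common way to tune K is cross-validation; a rough sketch using scikit-learn's GridSearchCV (with a made-up dataset) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": [1, 3, 5, 7, 9, 11, 15]}

# 5-fold cross-validation over the candidate values of K.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # e.g. {'knn__n_neighbors': 7}
```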

Imbalanced data: K-NN can be biased towards the majority class in imbalanced datasets. Since it considers the majority vote of the K nearest neighbors, it may struggle to accurately predict minority class instances.

Common scenarios where the K-NN algorithm is useful:

The K-Nearest Neighbors (K-NN) algorithm is suited to both classification and regression tasks. Here are some frequent applications where it works well:

Classification: K-NN is commonly used for classification tasks in which the goal is to assign labels to new cases based on their resemblance to labeled examples. It is used in spam detection, sentiment analysis, document categorization, image recognition, and recommendation systems.

Anomaly Detection: K-NN can be used to detect outliers or anomalies in data. It can discover data points that differ significantly from the majority by computing the distance of an instance to its K nearest neighbors. This is useful for detecting fraud, detecting network intrusions, and identifying defective products in manufacturing.

Collaborative Filtering: K-NN can be used in recommendation systems to identify related persons or objects based on their features or ratings. It can propose things or users of interest to a given user by locating the nearest neighbors, allowing for personalized recommendations.

Regression: While K-NN is best recognized for classification, it may also be used for regression tasks. K-NN can estimate continuous values instead of assigning labels by averaging or weighting the values of the K nearest neighbors. It can be used for predicting house prices, stock market trends, and any other numerical prediction problem.

Bioinformatics: K-NN has uses in bioinformatics, where it can help with gene expression analysis, protein-protein interaction prediction, and disease diagnosis based on patient data.

Geospatial Analysis: K-NN is used in geospatial applications such as predicting real estate prices based on nearby amenities, analyzing crime patterns, and classifying land use from remote sensing data.

Recommender Systems: K-NN is extensively used in collaborative filtering-based recommender systems. It can recommend relevant products, movies, music, or articles to users by detecting comparable users or items based on their preferences or characteristics.

Conclusion:

The K-Nearest Neighbors (K-NN) algorithm is a popular and versatile machine learning algorithm used for both classification and regression tasks. It operates based on the concept of finding the K nearest neighbors to a given data point and making predictions or decisions based on their characteristics.
