K-Nearest Neighbor Algorithm (KNN)

Farheenshaukat
4 min readMay 3, 2023

--

K-Nearest Neighbor algorithm is a simple and easy-to-understand machine learning algorithm used for classification and regression analysis. It is a supervised learning algorithm that is based on the principle of finding the k closest data points in the training set to a new data point and then using the majority class or average value of the k-nearest neighbors to classify or predict the value of the new data point.

To classify a new data point, the algorithm calculates the distance between the new data point and all the data points in the training set, then it selects the k-nearest neighbors using a distance metric such as Euclidean distance, Manhattan distance, or Minkowski distance. Finally, it assigns the new data point the class or value that is most common among the k-nearest neighbors.

Firstly, we will choose the number of neighbors, so we will choose the k=5.

Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:

By calculating the Euclidean distance we got the nearest neighbors, as three nearest neighbors in category A and two nearest neighbors in category B. Consider the below image:

As we can see the 3 nearest neighbors are from category A, hence this new data point must belong to category A.

Finding the Best Value of k:

One of the most important parameters in KNN is the value of k. The value of k determines the number of neighbors that are considered when classifying a new data point. A small value of k can lead to overfitting, while a large value of k can lead to underfitting. Therefore, it is important to find the optimal value of k that gives the best performance.

One way to find the best value of k is to use the elbow method. This method involves plotting the accuracy of the model against different values of k and selecting the value of k where the accuracy starts to plateau. Another method is to use cross-validation. This involves splitting the dataset into several smaller subsets, training the model on each subset, and evaluating its performance. The value of k that gives the best average performance across all the subsets is selected as the optimal value of k.

Advantages of KNN Algorithm:

  • It is simple to implement.
  • It is robust to the noisy training data
  • It can be more effective if the training data is large.

Disadvantages of KNN Algorithm:

  • Always needs to determine the value of K which may be complex some time.
  • The computation cost is high because of calculating the distance between the data points for all the training samples.

Implementing KNN:

Implementing KNN in Python is relatively straightforward. We start by importing the necessary libraries such as NumPy, Pandas, and Scikit-learn. Then we load the dataset and split it into training and testing sets. We then use Scikit-learn’s KNeighborsClassifier() function to create the KNN model and train it using the training set. Finally, we evaluate the model’s performance using the testing set.

Here is a sample code for implementing KNN in Python:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the dataset
data = pd.read_csv('dataset.csv')
# Split the dataset into training and testing sets
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Create the KNN model and train it
k = 5
classifier = KNeighborsClassifier(n_neighbors=k)
classifier.fit(X_train, y_train)
# Evaluate the model's performance
accuracy = classifier.score(X_test, y_test)
print("Accuracy:", accuracy)

Conclusion:

K-Nearest Neighbor algorithm is a simple and powerful machine learning algorithm used for classification and regression analysis. It works by finding the k closest data points in the training set to a new data point and using the majority class or average value of the k-nearest neighbors to classify or predict the value of the new data point. Implementing KNN in Python is straightforward, and finding the optimal value of k can be done using the elbow method or cross-validation. I hope this article has given you a basic understanding of how KNN works and how to use it in your machine learning projects.

If you have any questions or comments, please feel free to leave them below.

Thanks

--

--