K-Nearest Neighbors: A Complete Guide

Hema Kalyan Murapaka
Feb 28, 2023 · 5 min read


In this blog, I’ll provide a detailed overview of the K-Nearest Neighbors (KNN) algorithm, including how it works, its advantages and disadvantages, and some practical examples of how it can be used in real-world scenarios. We will also discuss some common variations of the KNN algorithm and share tips for achieving better performance. By the end of this post, you will have a solid understanding of the KNN algorithm and how to use it effectively in your machine learning projects.

The K-Nearest Neighbors (KNN) algorithm is a popular supervised machine learning algorithm used to solve both classification and regression problems, though it is more widely used for classification. It is a non-parametric, instance-based learning algorithm, often called a lazy learner because it defers all computation until prediction time. KNN predicts the output for a new observation based on its similarity to the observations in the training data.

Here is a step-by-step guide on how to implement the KNN algorithm (a runnable sketch follows this list):

Step 1: Load the data.
Step 2: Initialize the value of K.
Step 3: Perform the following operations for each point in the test data:
Step 3.1: Calculate the distance between the test point and each training point using a distance metric such as Euclidean distance, Manhattan distance, or Hamming distance.
Step 3.2: Sort the obtained distances in ascending order.
Step 3.3: Choose the top K entries from the sorted distances.
Step 3.4: Assign the most frequent class among those K neighbors to the test point (for regression, average their values).
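
To make these steps concrete, here is a minimal from-scratch sketch in Python (the function names, interface, and default k are illustrative, not part of any library):

import numpy as np
from collections import Counter

def euclidean(a, b):
    # Step 3.1: straight-line distance between two feature vectors
    return np.sqrt(np.sum((a - b) ** 2))

def knn_predict(X_train, y_train, x_test, k=3):
    # Step 3.1: distance from the test point to every training point
    distances = [euclidean(x, x_test) for x in X_train]
    # Steps 3.2 and 3.3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3.4: majority vote among the k nearest labels
    labels = [y_train[i] for i in nearest]
    return Counter(labels).most_common(1)[0][0]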

Let’s understand the mathematical logic behind KNN with an example.

Let’s initialize the value of k (a common rule of thumb is k ≈ √n, where n is the number of records). Here, k will be 3.

Image from Online Source

In this representation, the numerical value 0 indicates male and the numerical value 1 indicates female.
Let’s find the class of the new instance, i.e., Angelina, whose age is 5.

Let’s calculate the distance (in this case we use Euclidean distance) between the new instance and all records present in the dataset. For illustration, let’s first calculate the distance between Angelina and Ajay.
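
For two points p and q with n features, the Euclidean distance is d(p, q) = √((p₁ − q₁)² + … + (pₙ − qₙ)²). With the two features used here (age and the 0/1 gender code), this reduces to √((age₁ − age₂)² + (gender₁ − gender₂)²).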

Following the same procedure, let’s calculate all the remaining distances.

Sort the obtained distances in ascending order and choose the first 3 records. We get:

As Cricket has the highest frequency among these 3 neighbors, we assign its class to the new instance, i.e., the class of Angelina would be Cricket.

Let’s discuss the syntax to implement KNN in scikit-learn:

class sklearn.neighbors.KNeighborsClassifier(n_neighbors=5,
weights='uniform', algorithm='auto', leaf_size=30, p=2,
metric='minkowski', metric_params=None, n_jobs=None)

> n_neighbors = (int), Default: 5
The number of neighbors to use; this is the K value.

> weights = (uniform, distance, callable), Default: uniform
Weight function used in prediction. The possible values are:
uniform: uniform weights; all points in each neighborhood are weighted equally.
distance: weight points by the reciprocal of their distance, so closer neighbors have greater influence.
callable: a user-defined function that takes an array of distances and returns an array of the same shape containing weights.

> algorithm = (auto, ball_tree, kd_tree, brute), Default: auto
The algorithm used to compute the nearest neighbors.

> leaf_size = (int), Default: 30
This parameter affects the speed of tree construction and querying, as well as the memory required to store the tree. The optimal value varies with the nature of the problem.

> metric = (str or callable), Default: 'minkowski'
The distance metric used for distance computation.

> p = (int), Default: 2
The power parameter for the Minkowski metric. With p=1 this is equivalent to Manhattan distance, and with p=2 to Euclidean distance.

> metric_params = (dict), Default: None
Additional keyword arguments for the metric function.

> n_jobs = (int), Default: None
The number of parallel jobs to run for the neighbor search.
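
As a quick illustration (the parameter values below are arbitrary choices, not recommendations), a classifier that weights neighbors by distance and uses Manhattan distance could be created like this:

from sklearn.neighbors import KNeighborsClassifier

# 7 neighbors, closer neighbors weighted more heavily,
# and p=1, which turns the Minkowski metric into Manhattan distance
clf = KNeighborsClassifier(n_neighbors=7, weights='distance', p=1)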

Refer to the scikit-learn KNeighborsClassifier documentation for further information.

Let’s dive into the practical implementation.

The procedure consists of the following basic steps:

1. Importing Required Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

2. Importing Data

df = pd.read_csv('user_data.csv')

Let the dataset be as follows:

Image from Author

3. Data Pre-processing

In this step, we usually perform data cleaning, data transformation, data integration, and data reduction. Since we already have clean data, we move directly to assigning the independent and dependent variables.

x = df.iloc[:, [2, 3]].values   # columns 2 and 3 as the features
y = df.iloc[:, 4].values        # column 4 as the target label

4. Split the Data

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=0)
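
Because KNN relies on raw distances, features on very different scales can dominate the neighborhood computation. A common extra step, not part of the original snippet, is to standardize the features after splitting:

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only, then apply the same
# transformation to the test set to avoid data leakage
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)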

5. Train the Model

from sklearn.neighbors import KNeighborsClassifier  
classifier = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
classifier.fit(x_train, y_train)

6. Test the Model

y_pred = classifier.predict(x_test)

7. Model Evaluation

from sklearn.metrics import classification_report, accuracy_score

report = classification_report(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print(report)
print(accuracy)

Advantages:

  1. It is simple to implement.
  2. It is robust to noisy training data.
  3. It can be more effective when there is a lot of training data.

Disadvantages:

  1. The value of K must always be determined, which can sometimes be difficult; one common approach is sketched below.
  2. The computational cost is high because the distance from each test point to every training sample must be computed.
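
One common way to handle the first point is to score several candidate values of K with cross-validation on the training set and keep the best one. Here is a minimal sketch (the variable names follow the earlier snippets, and the search range is arbitrary):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Try odd values of K (odd values avoid ties in binary classification)
# and keep the one with the best cross-validated accuracy
best_k, best_score = 1, 0.0
for k in range(1, 30, 2):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            x_train, y_train, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, best_score)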

If you learned something new or enjoyed reading this article, please clap it up 👏 and share it so that others will see it. Feel free to leave a comment too.

Follow me on:

Email: kalyanmurapaka274@gmail.com

LinkedIn: https://www.linkedin.com/in/hema-kalyan-murapaka-3048b422b

Instagram: https://www.instagram.com/im_kalyan_274

Twitter: https://twitter.com/HemaKalyan26
