K-Nearest Neighbor (KNN) Using Python

Dharmaraj

Introduction

The KNN algorithm is a supervised machine learning model: it predicts a target variable from one or more independent variables. K-NN stores all the available data and classifies a new data point based on its similarity to the stored points. This means that when new data appears, it can easily be assigned to the most suitable category using the K-NN algorithm.

Formula for KNN

Several distance metrics can be used to solve KNN-based problems. Most often we use the Euclidean distance between two points: d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²).
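
As a quick illustration (a sketch I'm adding here, not part of the original post), the same distance can be computed directly in Python:

import numpy as np

def euclidean_distance(p, q):
    # square root of the sum of squared coordinate differences
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

# distance between two hypothetical fruit feature vectors (mass, width, height)
print(euclidean_distance([120, 6.0, 8.4], [150, 7.1, 7.9]))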

How does KNN work?

K-NN works according to the following steps (a minimal from-scratch sketch follows the list):

  • Select the K value.
  • Calculate the Euclidean distance from the new data point to every point in the training data.
  • Take the K nearest neighbors according to the calculated distances.
  • Among these K neighbors, count the number of data points in each category.
  • Assign the new data point to the category with the maximum number of neighbors.
  • The model is now ready to classify new points.
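
To make these steps concrete, here is a minimal from-scratch sketch of the algorithm. This is my own illustration, assuming numeric feature arrays; the actual implementation below uses scikit-learn.

from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # step 2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((np.asarray(X_train) - np.asarray(x_new)) ** 2).sum(axis=1))
    # step 3: indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    # steps 4-5: majority vote among the neighbors' labels
    labels = np.asarray(y_train)[nearest]
    return Counter(labels).most_common(1)[0][0]

# toy values just for illustration
print(knn_predict([[100, 6.3, 8.0], [180, 8.0, 6.8], [116, 6.0, 7.5]], [3, 1, 4], [110, 6.2, 7.6], k=1))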

Choosing K value

Choosing the K value is very important for model accuracy. One option is the elbow method, although the elbow method for KNN is different from the one used in K-Means clustering. Another method is to loop over a range of K values and check the model accuracy at each step. In this blog, we use the second method (see the "Find the best K value" section below).

Implementation

Import packages

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier

Checking Correlation

data = pd.read_table('Fruits_Data.txt')
data.head()
# calculate the correlation (on newer pandas, use data.corr(numeric_only=True) if the file has text columns)
corr = data.corr()
# plot the heatmap
sns.heatmap(corr, cmap="Blues", annot=True)
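
Besides the heatmap, a pairplot can give a complementary view of how well the features separate the classes. This is an extra sketch of mine; it assumes the dataset has a 'fruit_name' column, which may differ in your file:

import seaborn as sns
import matplotlib.pyplot as plt

# scatter plots of each feature pair, colored by fruit type
sns.pairplot(data, vars=['mass', 'width', 'height'], hue='fruit_name')
plt.show()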

Split the data and Train the model

X = data[['mass', 'width', 'height']]
y = data['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=3)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)
print("Score for your model is", score)

Predict the new input data

# 1: 'apple', 2: 'mandarin', 3: 'orange', 4: 'lemon'
res = knn.predict([[120, 6.0, 8.4]])[0]
if res == 1:
    print("Apple")
elif res == 2:
    print("Mandarin")
elif res == 3:
    print("Orange")
else:
    print("Lemon")

Find the best K value

k_range = range(1, 20)
scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    scores.append(knn.score(X_test, y_test))
plt.figure()
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.scatter(k_range, scores)
plt.xticks(range(0, 21));

The above plot shows the accuracy for each K value. K = 3, 4, and 5 each give an accuracy of 75% on this data, so any of these is a reasonable choice for the best K value. Among these three, I chose K = 3.
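
Because a single train/test split can be noisy on a small dataset, a more robust variant (my addition, not part of the original post) scores each K with cross-validation and picks the best:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cv_scores = []
for k in range(1, 20):
    knn = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 cross-validation folds on the training data
    cv_scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())
best_k = cv_scores.index(max(cv_scores)) + 1
print("Best K by cross-validation:", best_k)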

Click here for the full source code with the dataset.

Have doubts? Need help? Contact me!

LinkedIn: https://www.linkedin.com/in/dharmaraj-d-1b707898

Github: https://github.com/DharmarajPi
