A Beginner’s Guide to the K Nearest Neighbor (KNN) Algorithm With Code

Jijo Abraham · Published in Analytics Vidhya · Sep 21, 2019 · 6 min read

Today, let’s discuss one of the simplest algorithms in machine learning: the K Nearest Neighbor (KNN) algorithm. In this article, I will explain the basic concept of KNN and how to implement a machine learning model using it in Python.

Machine learning algorithms can be broadly classified into two categories:

1. Supervised Learning

2. Unsupervised Learning

In supervised learning, we train our model on a labelled data set and ask it to predict the label for an unlabeled point. For example, a cancer prediction model is trained on many clinical test results that are labelled as either positive or negative. The trained model can then predict whether an unlabeled test result is positive or negative.

On the other hand, unsupervised learning is done on an unlabeled data set.

KNN belongs to supervised learning. Having said that, let’s set aside machine learning and KNN for the next five minutes and visit Bob’s new fruit store in town. As a promotional offer, Bob has set up a puzzle and is giving free fruit to those who solve it. The puzzle is explained below.

Bob’s puzzle

Bob has arranged four types of fruits: Apples, Oranges, Strawberries and Grapes, as shown in the image above. He has covered some of the fruits with a black cloth and numbered them. Whoever correctly predicts the names of these hidden fruits can take them for free! But making a wrong prediction means losing everything you have won. If you are not confident about a fruit, you can mark it as unsure. Now take a PAUSE and see if you can get them all right.

Once you have made your predictions, let’s cross-check them with the answers below:

1, 2, 3 → Oranges

4 → Unsure if it’s an Orange or an Apple

5, 6 → Apples

7 → Strawberry

8 → Unsure if it’s a Grape or a Strawberry

9 → Grape

10 → Unsure if it’s an Apple or a Grape

If your predictions match the answers above, you already know what KNN is and have implemented it! Wondering how? To understand that, let’s take a closer look at how you made the predictions.

In the image, we observe that similar fruits are arranged together. We can easily classify 1, 2 and 3 as Oranges since they are surrounded by Oranges alone, so there is a high probability that the hidden fruits are also Oranges. In other words, a hidden fruit will most likely be of the same type as the majority of its neighbors. The same logic applies to 5, 6, 7 and 9.

For 10, we are unsure whether it’s an Apple or a Grape, because it is surrounded by both Apples and Grapes. In other words, the neighbors of 10 belong to both Apples and Grapes, so 10 can be either an Apple or a Grape. The same goes for 4 and 8.

In short, the KNN algorithm predicts the label for a new point based on the labels of its neighbors. KNN relies on the assumption that similar data points lie close together in the feature space. In the above example, based on the labels (Apples, Oranges, Strawberries, Grapes) of the neighbors, we can predict the label for a new data point (a hidden fruit).

Nearest Neighbor

K in KNN is the number of nearest neighbors we consider when making the prediction. We determine the nearness of a point based on its distance (e.g. Euclidean, Manhattan, etc.) from the point under consideration. For example, if K=5, we consider the 5 nearest points and take the majority label of these 5 points as the predicted label.
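To make the distance computation concrete, here is a minimal sketch using numpy; the two points and their coordinates are invented purely for illustration:

import numpy as np

# Two illustrative points in a 2-D feature space (hypothetical values)
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance between the two points
print(np.sqrt(np.sum((p - q) ** 2)))  # 5.0

# Manhattan distance: sum of absolute differences along each axis
print(np.sum(np.abs(p - q)))  # 7.0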

Let’s examine how the neighbors are determined in our previous example. Consider the figure above. Our aim is to predict the label for the point marked X. If K=3, of the 3 points nearest to X, 2 are Strawberries and 1 is an Orange, so we predict the label for X as Strawberry. If K=5, of the 5 points nearest to X, 3 are Oranges and 2 are Strawberries, so we predict the label for X as Orange.
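To make the neighbor-voting step concrete, here is a minimal from-scratch sketch of a KNN prediction; the coordinates and labels are invented for illustration, loosely mirroring the figure:

import numpy as np
from collections import Counter

def knn_predict(train_points, train_labels, x, k):
    # Euclidean distance from x to every training point
    distances = np.sqrt(((train_points - x) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbors
    return Counter(train_labels[nearest]).most_common(1)[0][0]

# Hypothetical 2-D coordinates: 3 Oranges and 2 Strawberries
points = np.array([[1, 1], [1, 2], [2, 1], [5, 5], [5, 6]])
labels = np.array(['Orange', 'Orange', 'Orange', 'Strawberry', 'Strawberry'])
x = np.array([4, 4])

print(knn_predict(points, labels, x, k=3))  # Strawberry (2 of the 3 neighbors)
print(knn_predict(points, labels, x, k=5))  # Orange (3 of the 5 neighbors)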

From the above example, we can see that the predicted label changes as K varies. Thus K is the hyperparameter of KNN that must be tuned to find its optimal value. On the labelled training data, we experiment with different values of K and choose the one that gives the best result. Once the K value is fixed, it can later be used for predicting unlabeled data points.

Let’s Code It…

We will implement the KNN model on the iris data set. The iris data set consists of data on 3 species of iris flowers, namely Setosa, Versicolour and Virginica. Each data point has 4 features and a label (species) associated with it. The features are sepal_length, sepal_width, petal_length and petal_width. Based on these features, we need to predict the output label, i.e. the species of the flower.

You can download the iris data set from the link below: https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv

Once the download is complete, load the data set into your Python code. I have used a Jupyter Notebook for coding.

import pandas as pd
import numpy as np

# Load the iris data set from the downloaded CSV file
iris = pd.read_csv('iris.csv')

Below is a sample of the data set.

iris[20:25]
Sample data from iris data set

First, we need to separate the species (label) column from the rest of the data set.

X=iris.drop(columns=['species'])
Y=iris['species']

X contains the features and Y contains the species.

Input features and Output labels

In machine learning, we train our model on the training data and tune the hyperparameters (K for KNN) using the model’s performance on cross validation (CV) data. So let’s split the data into train and CV sets using the train_test_split() function from the sklearn library. You can read more about it in the sklearn documentation.

from sklearn.model_selection import train_test_split

# 80/20 split, stratified so each species is proportionally represented in both sets
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=20)

Since KNN works based on the distance between data points, it’s important that we standardize the data before training the model. Standardization helps avoid problems due to differences in scale. We use the StandardScaler() class from sklearn for data standardization. You can read more about it in the sklearn documentation.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training data, then apply the same transformation
# to both sets; the results must be assigned back to take effect
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

Now let’s train our KNN model using an arbitrary K value, say K=10. That means we consider the 10 closest neighbors when making a prediction. Thanks to sklearn, we can train the KNN model with just 3 lines of code! You can read more about the KNeighborsClassifier() class in the sklearn documentation.

from sklearn import neighbors

KNN_model = neighbors.KNeighborsClassifier(n_neighbors=10, n_jobs=-1)
KNN_model.fit(X_train, y_train)

Let’s check how well our trained model performs at predicting the labels of the cross validation data.

pred=KNN_model.predict(X_val)
print("Accuracy={}%".format((sum(y_val==pred)/y_val.shape[0])*100))

Output: Accuracy=90.0%
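Equivalently, sklearn provides a built-in accuracy_score helper that computes the same value:

from sklearn.metrics import accuracy_score

print("Accuracy={}%".format(accuracy_score(y_val, pred) * 100))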

That means the model correctly predicts the labels of 90% of the validation points. Let’s check whether this accuracy can be improved by tuning the hyperparameter K to its optimal value.

In the code snippet below, the model’s performance is evaluated for each K value using the F1-Score, a performance metric that ranges from 0 to 1; the higher the F1-Score, the better the model. You can read more about the F1-Score in the sklearn documentation.

from sklearn import neighbors
from sklearn.metrics import f1_score

f1_list = []
k_list = []
for k in range(1, 10):
    clf = neighbors.KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_val)
    f = f1_score(y_val, pred, average='macro')
    f1_list.append(f)
    k_list.append(k)

Let’s plot the F1-Score vs. K value graph.
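A minimal matplotlib sketch for this plot could look like the following (assuming matplotlib is installed):

import matplotlib.pyplot as plt

plt.plot(k_list, f1_list, marker='o')
plt.xlabel('K value')
plt.ylabel('F1-Score')
plt.title('F1-Score vs K value')
plt.show()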

Let’s get the best K value, i.e. the one that gives the maximum F1-Score.

best_f1_score=max(f1_list)
best_k=k_list[f1_list.index(best_f1_score)]
print("Optimum K value=",best_k," with F1-Score=",best_f1_score)

Output: Optimum K value= 3 with F1-Score= 1.0

Now let’s predict using the best K value, i.e. K=3, and check the accuracy.

KNN_model=neighbors.KNeighborsClassifier(n_neighbors=3,n_jobs=-1)
KNN_model.fit(X_train,y_train)
pred=KNN_model.predict(X_val)
print("Accuracy={}%".format((sum(y_val==pred)/y_val.shape[0])*100))

Output: Accuracy=100.0%

Thus we can see that we get 100% accuracy by using the optimal K value. Note that we achieved 100% accuracy only because the data set we used is simple; we may not get such high accuracy on real-life data sets, which are much more complex.

Advantages of KNN

  • No training time is required (the model simply stores the training data)
  • It’s simple and easy to implement
  • New data points can be added to the training set at any time, since no retraining is required

Disadvantages of KNN

  • Requires feature scaling
  • Does not work well when the dimensionality of the data is high
  • Sensitive to outliers
  • Prediction is computationally expensive, as we need to compute the distance between the point under consideration and all training points (see the sketch below)
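One common way to reduce this prediction cost is to index the training points in a space-partitioning structure such as a KD-tree. As a sketch, sklearn’s KNeighborsClassifier supports this through its algorithm parameter (by default it picks a strategy automatically):

from sklearn import neighbors

# Index the training points in a KD-tree instead of using brute-force search;
# this speeds up neighbor queries on low-dimensional data like iris
fast_knn = neighbors.KNeighborsClassifier(n_neighbors=3, algorithm='kd_tree')
fast_knn.fit(X_train, y_train)
print(fast_knn.predict(X_val))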
