An Overview of K Nearest Neighbor

Sandeep Panchal · Published in Analytics Vidhya · Dec 10, 2019 · 6 min read

*** My warm welcome to all the readers! ***

Contents:

  1. Introduction to KNN
  2. Understanding KNN with Toy Example
  3. Selection of K value
  4. Applications of KNN
  5. Summary
  6. References

1. Introduction to KNN

K Nearest Neighbor, in short KNN, is one of many machine learning algorithms. KNN comes under the category of supervised learning. It is both a classification and a regression algorithm. It works on the basic idea of “classifying a point based on the majority class among its nearest neighbors.” Here, ‘K’ is a natural number, and an odd number is usually preferred. Suppose we have red balls on one side and blue balls on the other, and we place a new ball (whose color is unknown and needs to be classified) somewhere near the red and blue balls. We need to find out whether the new ball belongs to the red category or the blue category. Taking K as 5, we look at the 5 nearest neighbors of the new ball and find that 3 of them are blue balls and 2 are red balls. Based on the majority count, we classify the new ball as blue.
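
To make the idea concrete before we touch any library, here is a minimal plain-Python sketch of that majority vote; the ball positions are made-up values for illustration:

    from collections import Counter

    # Toy 1-D ball positions with known colors (illustrative values only)
    balls = [(1.0, 'red'), (1.5, 'red'), (2.0, 'blue'),
             (2.2, 'blue'), (2.4, 'blue'), (3.0, 'red')]
    new_ball = 2.1  # the ball whose color we want to predict
    k = 5

    # Sort by distance to the new ball and keep the K nearest neighbors
    nearest = sorted(balls, key=lambda b: abs(b[0] - new_ball))[:k]

    # Majority vote among the K nearest neighbors
    votes = Counter(color for _, color in nearest)
    print(votes.most_common(1)[0][0])  # prints 'blue' (3 blue vs 2 red)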

2. Understanding KNN with Toy Example:

Let’s consider the classification technique (binary classifier).

I am not taking a big data set with n dimensions, as it becomes hard to picture n dimensions in our mind and understand how the algorithm is working. For better and easier understanding, I am creating a toy example.

2.1 Creating Toy Data Frame:

I have created a toy example that contains 10 rows and 2 columns. The column ‘Numbers’ represents the independent variable (x), which has numbers from 1 to 10. The column ‘Label’ represents the dependent variable (y), which has the binary class labels ‘0’ and ‘1’.

Toy Data Frame
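
A minimal sketch of how this toy frame can be built, assuming pandas:

    import pandas as pd

    # 10 rows: numbers 1-5 labeled class 0, numbers 6-10 labeled class 1
    df = pd.DataFrame({
        'Numbers': list(range(1, 11)),
        'Label':   [0] * 5 + [1] * 5,
    })
    print(df)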

We can see from the data frame above that numbers 1 to 5 belong to class 0 and numbers 6 to 10 belong to class 1. Our objective now is: given any number (whole or fractional), our model should be able to predict the class of the new data point, i.e. either 0 or 1.

We assign the ‘Numbers’ column to ‘x’ and the ‘Label’ column to ‘y’. Below is the code.

Assigning Columns
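
A minimal sketch of that assignment, reusing the df built above (the double brackets keep x two-dimensional, as scikit-learn expects):

    # Independent variable (x) and dependent variable (y)
    x = df[['Numbers']]  # shape (10, 1)
    y = df['Label']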

2.2 Training Model:

We will now train the model with n_neighbors (K) set to 3 and ‘metric’ set to ‘euclidean’. K as 3 means we consider only the 3 nearest neighbors when deciding whether a given point belongs to class 0 or 1. Here, we find the nearest neighbors with the ‘euclidean’ distance, also known as the ‘L2 norm’, which works on the same principle as the Pythagorean theorem. Below is the code for training the model.

Code sample for training model
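
A minimal sketch of that training step, assuming scikit-learn and the x and y defined above:

    from sklearn.neighbors import KNeighborsClassifier

    # K = 3 nearest neighbors, with distances measured by the euclidean (L2) metric
    knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
    knn.fit(x, y)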

Note: We can use other distance metrics, like Manhattan distance (L1 norm), Minkowski distance, Hamming distance, cosine distance, etc.

2.3 Predicting a Query Point:

Here comes the interesting part: predicting a new point. Let’s get into imagination. Just picture in your mind all 10 numbers (1 to 10), where 1–5 belong to class 0 and 6–10 belong to class 1. Suppose I want to predict the class of the number 5.4. We find the 3 nearest numbers to 5.4, and those are 4, 5, and 6. Out of these 3 nearest points, 2 belong to class 0 and 1 belongs to class 1. We can decide the class of 5.4 based on 2 methods: 1- majority-based and 2- probability-based.

1- Majority-based: The majority of the nearest points (2) belong to class 0 and the minority (1) belongs to class 1. Hence, the number 5.4 is classified as 0.

2- Probability-based: The probability of the number 5.4 belonging to class 0 is 0.66666667 and to class 1 is 0.33333333. Based on the higher probability, the number 5.4 is classified as 0.

Code sample for prediction
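
A sketch of both predictions with the fitted model from the training step (the variable name knn is carried over from that sketch):

    query = [[5.4]]  # the new point whose class we want to predict

    print(knn.predict(query))        # majority-based -> array([0])
    print(knn.predict_proba(query))  # probability-based -> array([[0.66666667, 0.33333333]])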

3. Selection of K value:

3.1 Odd K value:

As I mentioned above, the K value is preferably an odd number, and there is a good and convincing reason for it. If the K value is an even number, there is a good chance that K/2 of the nearest points belong to one class and the other K/2 belong to the other class. Suppose we take the K value as 10. If 5 nearest points belong to class 0 and the other 5 belong to class 1, it becomes a tie, and deciding which class the new point belongs to is a real problem. This is the reason for taking an odd K value.

3.2 Right K value with Scikit Packages:

Selecting the right K value plays a big role in model performance. If we choose the wrong K value, it may lead to overfitting or underfitting problems. How do we choose the right K value then? Well, we have many methods, like GridSearchCV, K-fold cross-validation, RandomizedSearchCV, a simple for loop, etc., which we can leverage to find the right K value, as sketched below.
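
As an illustration, here is a minimal sketch of a GridSearchCV search over odd K values; x_train and y_train are placeholders for a real training split (our 10-row toy frame is too small for a 5-fold search):

    from sklearn.model_selection import GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Search only odd K values, cross-validated over 5 folds
    param_grid = {'n_neighbors': list(range(1, 50, 2))}
    search = GridSearchCV(KNeighborsClassifier(metric='euclidean'),
                          param_grid, cv=5)
    search.fit(x_train, y_train)  # x_train/y_train: your real training split
    print(search.best_params_)    # e.g. {'n_neighbors': 47}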

Note: As the K value decreases, the model tends toward overfitting; as it increases, the model tends toward underfitting. So, selecting the right K value is a major part of using KNN.

3.3 Right K value with Graph:

While tuning the hyperparameter with the scikit packages, we obtain different K values and their corresponding accuracy or error. To select the right K value graphically, we need to plot both the train and validation accuracy (or error). I have plotted the train and validation accuracy, and we will use the plot to understand the selection. Below is the graph.

Train and Validation Accuracy Graph
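
For reference, a minimal sketch of the code that could produce such a graph, assuming matplotlib and hypothetical x_train/y_train and x_val/y_val splits:

    import matplotlib.pyplot as plt
    from sklearn.neighbors import KNeighborsClassifier

    k_values = list(range(1, 60, 2))
    train_acc, val_acc = [], []
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
        knn.fit(x_train, y_train)                    # real training split
        train_acc.append(knn.score(x_train, y_train))
        val_acc.append(knn.score(x_val, y_val))      # held-out validation split

    plt.plot(k_values, train_acc, label='Train accuracy')
    plt.plot(k_values, val_acc, label='Validation accuracy')
    plt.xlabel('K (n_neighbors)')
    plt.ylabel('Accuracy')
    plt.legend()
    plt.show()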

From the graph above, we can see the training curve gradually declines while the validation curve gradually rises. We should select a K value at which both the train and validation curves are close to each other. In our case, at K = 47, the train and validation curves are close to each other, so we can select K as 47.

Note: Suppose that from a certain K value, say 43, both the train and validation curves become constant, i.e. the curves neither decline nor rise. In this scenario, we might face a dilemma in choosing the K value, as the curves are close to each other at more than one point. In this case, we should select the smaller K value: if the curves are close to each other at K values 43, 45, 47, 49…, then we should select K = 43.

4. Applications of KNN:

  1. Recommender Systems: Sites like Amazon and Flipkart recommend products based on our previously viewed, searched, and purchased items; Netflix, IMDb, etc., recommend movies and web series based on our search and watch history.
  2. Concept Search: Classifying documents based on similar concepts. There are many more applications besides these.

5. Summary:

  • KNN is a supervised learning algorithm and is used both for classification and regression.
  • It is easy to understand and to apply with the scikit packages.
  • If the data has a huge number of dimensions, KNN becomes time-consuming. Hence, applying KNN to very high-dimensional data is not advisable.
  • In most cases, tuning only K will do great, unlike other algorithms where more than one hyperparameter must be tuned.

6. References:

  1. https://www.appliedaicourse.com/course/11/Applied-Machine-learning-course
  2. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html


*** Thank you all for reading this blog. Your suggestions are very much appreciated! ***
