k-Nearest Neighbors for Dummies

Anuj Shrivastav · Published in Analytics Vidhya · 5 min read · Mar 2, 2020

You’re the average of the five people you spend the most time with


Let’s start with the core idea of k-Nearest Neighbors (abbreviated as kNN).

Given a query point xₜ, we find the k nearest neighbors of that point in the given dataset and predict a class yₜ by a majority vote over the classes of those k neighbors. The same idea extends to regression, but instead of majority voting we take the mean or median of the k nearest neighbors’ target values.

Failure cases of kNN:

Case 1: When positive and negative classes are randomly distributed


Case 2: When the query point xₜ is far away from both classes. In this case, we can’t be sure which class xₜ belongs to.


How do we decide whether a point is far away from both classes? What should the threshold be?

Well, deciding the threshold is data-specific, but an acceptable way of deciding would be (see the sketch after this list):

  • Find the 1st-nearest-neighbor distance for each point in our training data.
  • Sort all these distances.
  • Take, let’s say, the 90ᵗʰ or 95ᵗʰ percentile value (p’) (or any other percentile value as per requirement).
  • Now, if the distance from the query point xₜ to its 1st nearest neighbor is greater than p’, then we can say that xₜ is far from both classes.
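Here is a minimal sketch of that recipe in Python, assuming X_train (an n × d NumPy array) and x_t (the query point) already exist; the function name is_far_away and the use of scikit-learn’s NearestNeighbors are illustrative choices, not part of the original article:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def is_far_away(X_train, x_t, percentile=95):
    """Return True if x_t's 1st-nearest-neighbor distance exceeds the p' threshold."""
    nn = NearestNeighbors(n_neighbors=2).fit(X_train)
    # For a training point, the closest neighbor is itself (distance 0),
    # so its "1st nearest neighbor distance" is the distance in column 1.
    dists, _ = nn.kneighbors(X_train)
    p_prime = np.percentile(dists[:, 1], percentile)           # the threshold p'
    # Distance from the query point to its 1st nearest neighbor in the training data
    query_dist, _ = nn.kneighbors(x_t.reshape(1, -1), n_neighbors=1)
    return query_dist[0, 0] > p_prime
```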

How do we find the neighbors of a given point?

It’s the same as finding neighbors in real life: people who live close to your house are your neighbors. Similarly, data points that are close to the given point will be its neighbors. This closeness is defined by the distance between the data points. There are a lot of distance measures used in Machine Learning; you can read about them at https://towardsdatascience.com/how-to-measure-distances-in-machine-learning-13a396aa34ce. Generally, Euclidean distance is used in kNN.
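For instance, the Euclidean distance between two points is simply the square root of the sum of squared coordinate-wise differences; a quick NumPy illustration:

```python
import numpy as np

def euclidean_distance(a, b):
    # sqrt of the sum of squared coordinate-wise differences
    return np.sqrt(np.sum((a - b) ** 2))

print(euclidean_distance(np.array([1.0, 2.0]), np.array([4.0, 6.0])))   # 5.0
```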

Time and Space Complexity for a simple implementation of kNN

Given a training dataset X with n data points and d dimensions:

For a query point xₜ, we have to compute the distance to each of the n points in the training data.

One distance calculation takes d operations (d = number of dimensions of the data).

∴ For n distance calculations, we have to perform n*d operations, i.e. O(n*d) time.

Then we have to pick the nearest k points based on all these distances. Generally, k is very small compared to the size of the dataset (mostly k = 3, 5, 11, 15 or so), and hence this selection takes ≈ O(n) time.

Then we have to decide the class using majority voting, which takes constant, or O(1), time.

∴ Time Complexity of kNN: O(n*d) + O(n) + O(1) = O(n*d)

When we say we have trained a kNN model, it simply means we have loaded the training data into RAM. No computation is performed during the training phase, and hence the Space Complexity of kNN is O(n*d).
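Putting the three steps together, here is a rough sketch of such a simple (brute-force) kNN classifier, assuming X_train and y_train are NumPy arrays (an illustration, not the article’s own code):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_t, k=5):
    # O(n*d): Euclidean distance from the query point to every training point
    dists = np.sqrt(((X_train - x_t) ** 2).sum(axis=1))
    # ≈ O(n): indices of the k smallest distances (no full sort needed)
    nearest = np.argpartition(dists, k)[:k]
    # O(1): majority vote over the k neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]
```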

Decision Surface for kNN as k changes

[Figure: kNN decision surfaces for different values of k (source: GeeksForGeeks)]

As k increases, the decision surface becomes smoother.

What if k = n (the number of data points)?

In that case, there won’t be any real decision surface: the whole dataset is taken into consideration, and the class of any query point is decided by the majority class of the training data. That would be a highly underfitting model.
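To see why, note that with k = n the prediction no longer depends on the query point at all; a tiny illustration:

```python
from collections import Counter

# With k = n, every training point is a "neighbor", so the prediction is simply
# the majority class of the whole training set, whatever the query point is.
y_train = ['+', '+', '-', '+', '-']
print(Counter(y_train).most_common(1)[0][0])   # '+'
```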

Overfitting is basically when we have found a function that fits the training data so well that it takes every minute detail into consideration, including noisy outliers. The model will give approximately no error on the training data, but will fail on unseen data. In kNN, this happens when k is small.

Underfitting is basically when we have found a function that fits the data only loosely or, in other words, is an imperfect way to describe the data. It will give errors both on the training data and on unseen data. In kNN, this happens when k is large.

Finding the right value of k

So we have seen how the value of k affects our model. But how should we determine the value of k?

In practice, it is done using Cross-Validation, which could be a topic for my next article, but in a nutshell, we split our dataset into three parts: Train Data, CV (Cross-Validation) Data, and Test Data. We train our model on the Train Data and, for different values of k, calculate accuracy on the CV Data. The k which gives maximum accuracy (or minimum error) is selected as the final k. After that, we evaluate the model with this final k on the Test Data.
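A rough sketch of this procedure with scikit-learn, assuming X and y hold the full dataset (the split sizes and candidate k values below are only illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 60% train, 20% CV, 20% test
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9, 11, 15]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = model.score(X_cv, y_cv)            # accuracy on the CV data
    if acc > best_acc:
        best_k, best_acc = k, acc

final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_model.score(X_test, y_test))   # final check on the Test Data
```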


Weighted kNN

Sometimes we want to give more weight to points nearer to the query point than to points farther away. Hence we assign a weight to each of the k nearest neighbors of the query point.

One way to assign weights is to use the inverse of the distance. Let’s understand this using an example:

Suppose k = 5 and the neighbors of the query point are: two negative points at distances 0.1 and 0.2, and three positive points at distances 1, 2 and 4. With weight = 1/distance, the negative neighbors get weights 10 and 5, while the positive neighbors get weights 1, 0.5 and 0.25.

Now we calculate the total weight of each class.

weight(-ve class) = 10 + 5 = 15

weight(+ve class) = 1 + 0.5 + 0.25 = 1.75

Since weight(-ve class) > weight(+ve class), ∴ we assign the query point to the negative class. (Note that plain majority voting with k = 5 would have picked the positive class, since 3 of the 5 neighbors are positive.)
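The same calculation in code, using the distances assumed in the example above; note that scikit-learn’s KNeighborsClassifier(weights='distance') applies this inverse-distance weighting out of the box:

```python
import numpy as np

# Distances and labels of the k = 5 nearest neighbors from the example above
dists  = np.array([0.1, 0.2, 1.0, 2.0, 4.0])
labels = np.array(['-', '-', '+', '+', '+'])

weights = 1.0 / dists                           # inverse-distance weights: 10, 5, 1, 0.5, 0.25
for cls in np.unique(labels):
    print(cls, weights[labels == cls].sum())    # '+' -> 1.75, '-' -> 15.0
```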

Well, that’s all for now.

