Published in Analytics Vidhya

k-nearest neighbours


KNN is a simple and intuitive machine learning algorithm. It can be used for both classification and regression. It is a supervised learning method, where we are given both the inputs x and the labels y.

KNN is non-parametric, which means it makes no assumptions about the underlying data distribution, so its accuracy depends on the quality of the data. It is easy to use and, because there is no model to fit, quick to set up.

Yet it has certain shortcomings: it is poor at classifying data points that lie near a class boundary, and we have to find a good value of k (the number of nearest neighbours on which the prediction is based).

Suppose we have a test point (the black point) and we have to predict the class it belongs to. We calculate its Euclidean distance from every training point, then find the k smallest distances (the nearest neighbours). After that we take a majority vote among those k neighbours: the class with the most votes becomes the class of the test point.
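To make the vote concrete, here is a minimal sketch on made-up data. The points, labels and query are hypothetical values chosen for illustration. Note that the single closest point is "blue", but two of the three nearest neighbours are "red", so the majority decides.

```python
import numpy as np
from collections import Counter

# Hypothetical toy data: five labelled 2-D points and one query point.
points = np.array([[1.5, 1.6], [1.0, 1.0], [2.0, 1.0], [6.0, 6.0], [7.0, 6.0]])
labels = ["blue", "red", "red", "blue", "blue"]
query = np.array([1.5, 1.5])
k = 3

# Euclidean distance from the query to every point.
distances = np.sqrt(((points - query) ** 2).sum(axis=1))

# Indices of the k smallest distances (the nearest neighbours).
nearest = np.argsort(distances)[:k]

# Majority vote among the neighbours' labels.
votes = Counter(labels[i] for i in nearest)
predicted = votes.most_common(1)[0][0]
print(predicted)  # red
```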

This is about as brute-force as an algorithm gets, so treat it as a baseline: any other algorithm you use should do better than this (higher accuracy or lower complexity). There is no training involved in this prediction. All the work happens at query time.

Let's code KNN

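import numpy as np

def dis(x1, x2):
    # Euclidean distance between two feature vectors
    return np.sqrt(np.sum((x1 - x2) ** 2))

def knn(X, Y, query, k=5):
    val = []
    m = X.shape[0]

    # Distance from the query to every training point, paired with its label
    for i in range(m):
        d = dis(query, X[i])
        val.append((d, Y[i]))

    # Sort by distance and keep the k nearest neighbours
    val = sorted(val, key=lambda pair: pair[0])
    val = np.array(val[:k])

    # Count how often each label occurs among the k neighbours
    new_val = np.unique(val[:, 1], return_counts=True)

    # Majority vote: the most frequent label wins
    max_freq_index = new_val[1].argmax()
    pred = new_val[0][max_freq_index]

    return pred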

Step-by-step explanation of what we did in this code!

  1. Load the data
  2. Initialize K to your chosen number of neighbors
  3. For each example in the data
     3.1 Calculate the distance between the query example and the current example from the data.
     3.2 Add the distance and the index of the example to an ordered collection
  4. Sort the ordered collection of distances and indices from smallest to largest (in ascending order) by the distances
  5. Pick the first K entries from the sorted collection
  6. Get the labels of the selected K entries
  7. If regression, return the mean of the K labels
  8. If classification, return the mode of the K labels.
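The steps above, including the regression/classification split in steps 7 and 8, can be sketched in one function. This is an illustrative version, not the only way to write it: the name `knn_predict` and its `task` parameter are assumptions introduced here, and the sort is done with NumPy instead of a Python loop.

```python
import numpy as np

def knn_predict(X, y, query, k=5, task="classification"):
    """Hypothetical helper: predict for a single query with plain k-NN."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    query = np.asarray(query, dtype=float)

    # Step 3: distance from the query to every example.
    distances = np.sqrt(((X - query) ** 2).sum(axis=1))

    # Steps 4-6: sort by distance, keep the k nearest, collect their labels.
    nearest_labels = y[np.argsort(distances)[:k]]

    if task == "regression":
        # Step 7: mean of the k labels.
        return float(nearest_labels.mean())

    # Step 8: mode of the k labels.
    values, counts = np.unique(nearest_labels, return_counts=True)
    return values[counts.argmax()]

# Made-up 1-D data: three points near 1 and two points near 10.
X = [[0.0], [1.0], [2.0], [10.0], [11.0]]
print(knn_predict(X, [0, 0, 0, 1, 1], [1.2], k=3))                                 # 0
print(knn_predict(X, [1.0, 2.0, 3.0, 20.0, 22.0], [1.2], k=3, task="regression"))  # 2.0
```

The only difference between the two tasks is the last step, which is why the same neighbour search serves both.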

How can we choose the value of k?

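A common rule of thumb is to start with K close to the square root of N, where N is the total number of samples. This is a heuristic, not a guarantee, so it is worth checking a few nearby values on held-out data.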
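One way to check the square-root heuristic is to compare a few candidate values of k on a validation split. The sketch below assumes synthetic two-class data and hypothetical helper names (`knn_classify`, `validation_accuracy`). The candidate list is arbitrary, and the rule-of-thumb value is nudged to an odd number to reduce tied votes.

```python
import numpy as np

def knn_classify(X_train, y_train, query, k):
    # Majority vote among the k training points closest to the query.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    nearest = y_train[np.argsort(distances)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]

def validation_accuracy(X_train, y_train, X_val, y_val, k):
    # Fraction of validation points whose predicted class matches the label.
    preds = np.array([knn_classify(X_train, y_train, q, k) for q in X_val])
    return float((preds == y_val).mean())

# Synthetic two-class data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(3.0, 1.0, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
order = rng.permutation(len(X))
train, val = order[:150], order[150:]

sqrt_k = int(round(np.sqrt(len(train))))          # rule-of-thumb starting point
candidates = sorted({1, 3, 5, 7, 9, sqrt_k | 1})  # `| 1` forces an odd value

scores = {
    k: validation_accuracy(X[train], y[train], X[val], y[val], k)
    for k in candidates
}
best_k = max(scores, key=scores.get)
print(scores)
print("best k on this split:", best_k)
```

On real data the best k depends on the split, so repeating this over several splits (cross-validation) gives a steadier answer.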

To get hands-on with KNN, you can apply it to the MNIST dataset!


