Introduction to k-Nearest Neighbours

soumik dey
4 min read · Aug 26, 2020


K-Nearest Neighbour (also known as KNN) is a classification technique that falls under supervised learning algorithms. In machine learning, people face classification problems more often than regression, so it is worth knowing the relevant algorithms and being aware of their advantages and flaws.

Let's discuss classification first. You have input matrices as well as output vectors, and the task is generally binary or multiclass classification. Behind the scenes, the maths that goes on is this:

You have input values (x1, x2, x3, …, xn) as well as results (y1, y2, …, yn). You take a training set from the dataset, fit the values, and come up with an equation, or let's say a model. Then you predict by putting the x-values into the equation, compare against the test-set y-values, and check the accuracy, i.e. how well your model works. Can your model classify all points correctly or not?

KNN tells us which class a new data point on the dataset falls under. It measures the distance from the new point to the training points nearest to it, and the closest points determine the given point's classification by majority vote. Sometimes this works well and sometimes it gives a misleading result. The k-value can be 1, 3, 5, 7, etc. Why am I taking odd numbers? Because if, say, k = 4, then for an unknown data point two neighbours of one class and two of another class may be equally close, and the classification becomes ambiguous. A small sketch of this voting idea is shown below.
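To make the voting idea concrete, here is a minimal from-scratch sketch, assuming NumPy and a tiny made-up 2-D dataset (the points and labels are illustrative only):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels decides the class
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> 0
```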

A disadvantage of KNN is that it works very slowly on a big dataset; it is essentially a time-complexity issue. Predicting a single point costs O(n·d), where n is the number of training points and d is the number of dimensions (features), so as the dataset and the dimensionality grow, it takes more and more time. There is a space issue as well, since the whole training set has to be kept in memory.

The second disadvantage concerns distance: if the nearest training points are at a great distance from the given point, it is not wise to trust the KNN prediction.

Now let's discuss how to use KNN and apply it to our model:

1. Calculate the distance:

So, there are different techniques to find the distance:

Euclidean distance:

d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²), the distance from vector p to vector q.

So, by applying the simple Pythagorean theorem, we can find the distance of a new data point from any other point.

Manhattan distance (taxicab distance):

It is the sum of the absolute differences between the coordinates of the two points: just perform the vector subtraction and add up the absolute values, d(p, q) = |p1 − q1| + |p2 − q2| + … + |pn − qn|.

Cosine similarity: It is related to cosine distance by the formula cosine distance = 1 − cosine similarity. The formula to determine the cosine similarity of two vectors p and q is: cos(θ) = (p · q) / (‖p‖ ‖q‖).
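Here is a small sketch of the three measures, assuming two example NumPy vectors p and q (the numbers are arbitrary):

```python
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # straight-line distance
manhattan = np.sum(np.abs(p - q))           # sum of absolute differences
cosine_similarity = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
cosine_distance = 1 - cosine_similarity     # 1 - similarity

print(euclidean, manhattan, cosine_distance)
```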

2. Find the k-value:

Now, the next task is to find the k-value. As I said, we can take k = 1, 3, 5, 7, etc. But how do we choose? The first approach is to split the whole dataset into a train and a test set, fit the model on the training set for different k-values, evaluate each on the test set, and check the accuracies; that tells you where to stop. But this is a naive approach: you can never fully trust the result, because if the split was not done well the model can overfit or underfit. So it is better to try the k-fold cross-validation approach. I am sharing a link on how to implement it; have a look: https://machinelearningmastery.com/k-fold-cross-validation/
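As a rough sketch of that idea, assuming the built-in Iris dataset stands in for your own X and y, cross-validated accuracy can be compared across odd k-values like this:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try odd k-values and keep the one with the best mean cross-validated accuracy
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```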

So, let's dive into the code and see how KNN works.

Data preprocessing and splitting into train and test data

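A minimal sketch of this step, assuming the Iris dataset stands in for the actual data (scaling matters because KNN is distance-based):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features so no single feature dominates the distance calculation
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```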

Applying KNeighborsClassifier from scikit-learn with the estimated k-value

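A sketch of this step, continuing from the split above (k = 5 here is only an assumption; use whatever value cross-validation favoured on your data):

```python
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train and X_test come from the preprocessing sketch above
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
```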

Checking how well our model works

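A sketch of the evaluation step, continuing from the predictions above:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# y_test and y_pred come from the previous sketches
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```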

Want to give any suggestions? Please let us know. I hope you liked the article.

Thank you for your time.
