K-Nearest Neighbor Algorithm

Sarath SL
Published in Analytics Vidhya
4 min read · Sep 10, 2019

The k-nearest neighbor (KNN) algorithm is one of the algorithms used in machine learning for classification and regression problems. KNN uses existing data and classifies new data points based on the similarities and features of the existing data points.

One use of the KNN algorithm is in search. While shopping online, you have probably noticed that similar options appear on the same page as the item you are viewing. Whenever the task is some form of “find items similar to this one”, it can be framed as a KNN search.

KNN belongs to the supervised learning domain and finds intense application in pattern recognition, data mining and intrusion detection.

KNN is a non-parametric, lazy learning algorithm. Its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.

  1. First step of KNN algorithm building model is to handle the data and pre-process it before generating training and test data.
  2. The choice of k is critical — a small value of k means that noise will have a higher influence on the result, while a large value makes the algorithm computationally expensive and defeats the basic philosophy behind KNN (that points that are near each other are likely to have similar densities or classes).

A simple approach to selecting k is to set k = n^(1/2), where n is the number of training samples.

In practice, the best choice depends a lot on your individual case; sometimes it is best to run through each candidate value of k and decide for yourself.
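The square-root rule of thumb above can be sketched in a few lines (the sample count of 150 is a hypothetical stand-in):

```python
import math

# Rule-of-thumb choice of k: the square root of the number of training samples.
# n = 150 is a hypothetical example value.
n = 150
k = int(math.sqrt(n))

# Prefer an odd k for binary classification so a majority vote cannot tie.
if k % 2 == 0:
    k += 1

print(k)  # 13
```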

When we plot the data records in a mathematical space, the classified data will look like the picture shown here, where circles and triangles are two different classes of records.

Now, when a new data point is introduced into this space, KNN first calculates the distance between the new point and every existing data point in order to find the k most similar instances.

For a better understanding, suppose that after observing and processing the data we settle on k = 6. The KNN algorithm will then find the 6 nearest data points to the new one.

The rhombus is the new data point.

To calculate the distance between two data points in this space, several distance formulas are available — Euclidean, Manhattan and Minkowski distance are the most common choices.
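These distance metrics are straightforward to write out; here is a minimal sketch of the three common ones:

```python
import math

def euclidean(p, q):
    # Straight-line distance: sqrt(sum of squared coordinate differences).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # City-block distance: sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(p, q))

def minkowski(p, q, power=3):
    # Generalisation of both: power=1 gives Manhattan, power=2 gives Euclidean.
    return sum(abs(a - b) ** power for a, b in zip(p, q)) ** (1 / power)

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```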

Once we have the k nearest neighbors, their class labels are used to determine the class label of the unknown record.
This is done by taking a majority vote among the k nearest neighbors.

Here in the .gif animation, we have k value as 3 and after finding the nearest neighbor of the new data point , the new record will tend to shift to triangle class based on the majority vote as triangle data record appears more in the k nearest radius.

KNN is termed a lazy algorithm because it doesn’t learn a discriminative function from the training data but instead “memorizes” the training dataset.

For example, the logistic regression algorithm learns its model weights (parameters) during training time. In contrast, there is no training time in K-NN. Although this may sound very convenient, this property doesn’t come without a cost: The “prediction” step in K-NN is relatively expensive! Each time we want to make a prediction, K-NN is searching for the nearest neighbor(s) in the entire training set!

Let’s Code:
“KNN falls in the supervised learning family of algorithms. Informally, this means that we are given a labelled data set consisting of training observations (x,y) and would like to capture the relationship between x and y. More formally, our goal is to learn a function h:X→Y so that given an unseen observation x, h(x) can confidently predict the corresponding output y.”

  • First, get all the libraries required for data reading, pre-processing and visualization.
  • Once the data is ready for building machine learning models, split it into training and test data.
  • Build the model with different k values to find the optimal value.
  • Plot the misclassification error vs. k using cross-validation.
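The steps above can be sketched with scikit-learn; since the article’s own dataset is not shown, the bundled Iris dataset is used here as a stand-in:

```python
# A minimal sketch of the workflow above, assuming scikit-learn and its
# bundled Iris dataset as stand-in data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Build the model with different k values and record the
# cross-validated misclassification error for each.
errors = {}
for k in range(1, 26, 2):  # odd k values avoid tied votes
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=5)
    errors[k] = 1 - scores.mean()  # misclassification error

# Refit with the k that minimises the error and check test accuracy.
best_k = min(errors, key=errors.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print(best_k, final_model.score(X_test, y_test))
```

Plotting `errors.keys()` against `errors.values()` gives the misclassification-error-vs.-k curve described in the last step.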

Find the Python code here

Thank you!
Keep Supporting
