K-Nearest Neighbors Algorithm

Sachin D N · Published in Analytics Vidhya · Aug 6, 2020

KNN is a non-parametric and lazy learning algorithm. Non-parametric means there is no assumption about the underlying data distribution; in other words, the model structure is determined from the dataset itself. This is very helpful in practice, since most real-world datasets do not follow theoretical mathematical assumptions.

KNN is one of the simplest and most traditional non-parametric techniques for classifying samples. Given an input vector, KNN computes the distances to the labeled vectors and then assigns the not-yet-labeled point to the class of its K nearest neighbors.

Lazy learning means the algorithm does not build an explicit model during training; all of the training data is used in the testing phase. This makes training faster and the testing phase slower and costlier, in both time and memory: in the worst case, KNN needs to scan all the data points to answer a query, and storing all the training data requires a lot of memory.

K-NN for classification

Classification is a type of supervised learning. It specifies the class to which data elements belong and is best used when the output takes finite and discrete values. Given an input, the model predicts one of these classes.

For example, consider deciding whether a review is positive or negative: given a new query point (a review), classification is all about determining (predicting) whether that review is positive or negative.

Classification is all about learning a function that maps the given points to their class labels.

How does the K-NN algorithm work?

In K-NN, K is the number of nearest neighbors. The number of neighbors is the core deciding factor. K is generally an odd number if the number of classes is 2. When K = 1, the algorithm is known as the nearest neighbor algorithm; this is the simplest case. Suppose P1 is the point whose label we need to predict. We first find the single closest training point to P1 and then assign that point's label to P1.

More generally, suppose P1 is the point whose label we need to predict. We find the k closest training points to P1 and classify P1 by a majority vote of those k neighbors: each neighbor votes for its class, and the class with the most votes is taken as the prediction. To find the closest points, we measure the distance between points using distance measures such as Euclidean distance, Hamming distance, Manhattan distance, and Minkowski distance.

K-NN has the following basic steps (a small code sketch follows the list):

  1. Calculate distance
  2. Find closest neighbors
  3. Vote for labels
  4. Take the majority Vote
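As a rough illustration of these four steps, here is a minimal NumPy sketch of plain K-NN classification; the helper name knn_predict and the toy data are purely illustrative, not part of any library.

    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_query, k=3):
        # 1. Calculate the distance from the query to every training point (Euclidean here)
        distances = np.linalg.norm(X_train - x_query, axis=1)
        # 2. Find the k closest neighbors
        nearest = np.argsort(distances)[:k]
        # 3. Collect the labels (votes) of those neighbors
        votes = y_train[nearest]
        # 4. Take the majority vote
        return Counter(votes).most_common(1)[0][0]

    # Illustrative toy data: two well-separated classes in 2-D
    X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [7, 6], [6, 7]])
    y_train = np.array([0, 0, 0, 1, 1, 1])
    print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 0

In practice the same steps are implemented far more efficiently by scikit-learn's KNeighborsClassifier.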

Failure Cases of K-NN:

1. When the query point is far away from all the data points.

2. When the data set is jumbled (the classes are completely mixed together).

The image above shows such a jumbled data set, with no useful structure; in this situation the algorithm is likely to fail.

Distance Measures in K-NN: the four distance measures most commonly used in machine learning are listed below.

  1. Euclidean Distance
  2. Manhattan Distance
  3. Minkowski Distance
  4. Hamming Distance

Euclidean Distance

The Euclidean distance between two points in either the plane or 3-dimensional space measures the length of a segment connecting the two points. It is the most obvious way of representing distance between two points. Euclidean distance marks the shortest route of the two points.

The Pythagorean theorem can be used to calculate the distance between two points, as shown in the figure below. If the points (x1, y1) and (x2, y2) are in 2-dimensional space, then the Euclidean distance between them is √((x1 − x2)² + (y1 − y2)²).

The Euclidean distance corresponds to the L2 norm of a vector.

A norm measures the length of a vector; the distance between two vectors is the norm of their difference.

The Euclidean distance of a point x = (x1, x2, …, xd) from the origin is given by ||x||₂ = √(x1² + x2² + … + xd²).
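As a small NumPy illustration of both formulas (the points a and b are arbitrary examples):

    import numpy as np

    a = np.array([1.0, 2.0])
    b = np.array([4.0, 6.0])

    # Euclidean distance between two points = L2 norm of their difference
    print(np.linalg.norm(a - b))   # sqrt((1-4)^2 + (2-6)^2) = 5.0

    # Euclidean distance of a point from the origin = L2 norm of the vector itself
    print(np.linalg.norm(a))       # sqrt(1^2 + 2^2) ≈ 2.236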

Manhattan Distance

The Manhattan distance between two vectors (think of city blocks) is equal to the one-norm of their difference. The distance function (also called a "metric") involved is also known as the "taxi cab" metric.

The Manhattan distance between two vectors corresponds to the L1 norm of a vector.

For the L2 norm we take the square root of the sum of squared differences between the vector elements; for the L1 norm we take the sum of absolute differences between the vector elements.

The Manhattan distance between two points (x1, y1) and (x2, y2) is:
|x1 − x2| + |y1 − y2|.

The Manhattan distance of a point x = (x1, x2, …, xd) from the origin is given by ||x||₁ = |x1| + |x2| + … + |xd|.

Minkowski Distance

Minkowski distance is a metric in a normed vector space. It is used to measure the distance (similarity) between vectors: given two vectors, it tells us how far apart they are, and it generalizes both the Euclidean and Manhattan distances.

The Minkowski distance corresponds to the Lp norm of a vector:

d(x, y) = (|x1 − y1|^p + |x2 − y2|^p + … + |xd − yd|^p)^(1/p), where p ≠ 0 and p is always greater than 0 (p > 0).

Euclidean distance from Minkowski distance

When p = 2, Minkowski distance is the same as the Euclidean distance.

Manhattan distance from Minkowski distance

When p = 1, Minkowski distance is the same as the Manhattan distance.
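A short sketch using SciPy's distance functions can confirm both special cases (the vectors a and b are arbitrary examples):

    from scipy.spatial.distance import minkowski, cityblock, euclidean

    a = [1.0, 2.0]
    b = [4.0, 6.0]

    # Minkowski with p = 1 reduces to the Manhattan (L1) distance
    print(minkowski(a, b, p=1), cityblock(a, b))   # 7.0 7.0

    # Minkowski with p = 2 reduces to the Euclidean (L2) distance
    print(minkowski(a, b, p=2), euclidean(a, b))   # 5.0 5.0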

Hamming Distance

Hamming distance is a metric for comparing two binary data strings. When comparing two binary strings of equal length, the Hamming distance is the number of bit positions in which the two strings differ.

It is used in text preprocessing and with binary or Boolean vectors.

Consider two Boolean vectors x1 and x2:

x1=[0,1,0,1,1,0,1]

x2=[1,1,0,0,1,0,0]

Hamming Distance(x1, x2) = the number of locations/dimensions in which the binary vectors differ.

For the above example, Hamming Distance(x1, x2) = 3.
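A one-line Python check of the example above (no library needed):

    x1 = [0, 1, 0, 1, 1, 0, 1]
    x2 = [1, 1, 0, 0, 1, 0, 0]

    # Hamming distance = number of positions where the two vectors differ
    print(sum(a != b for a, b in zip(x1, x2)))  # 3 (positions 0, 3, and 6 differ)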

Hamming distance is also used for gene-code (DNA) sequences.

Cosine distance and cosine similarity:

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.

The cosine similarity is advantageous because even if two similar documents are far apart by Euclidean distance (for example, because of the size of the documents), they may still be oriented close together. The smaller the angle, the higher the cosine similarity.

The relation between cosine similarity and cosine distance can be defined as below.

  1. Similarity decreases when the distance between two vectors increases.
  2. Similarity increases when the distance between two vectors decreases.

Cosine Similarity and Cosine Distance:

Cosine similarity says that to find the similarity between two points or vectors we need to find the Angle between them.

The formula to find the Cosine Similarity and Distance is as below:

Cosine Similarity = cos θ

Cosine Distance = 1 − cos θ

When x1 and x2 are very similar, cosine similarity(x1, x2) is close to 1; when they are very dissimilar, cosine similarity(x1, x2) is close to −1.

θ is the angle between x1 and x2.

Cosine distance is based on the angle between two points, whereas Euclidean distance uses the geometric distance between the two points.

If A and B are unit vectors, then ||A|| = ||B|| = 1 and the similarity is cos(θ) = A · B.

Relationship between Euclidean distance and Cosine distance.

If x1 and x2 are unit vectors, then ||x1|| = ||x2|| = 1, and

[Euclidean-distance(x1, x2)]² = 2(1 − cos θ) = 2 × cosine-distance(x1, x2).
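This relationship is easy to verify numerically; the sketch below uses two arbitrary unit vectors that are 60 degrees apart.

    import numpy as np

    # Two unit vectors, 60 degrees apart (||x1|| = ||x2|| = 1)
    x1 = np.array([1.0, 0.0])
    x2 = np.array([np.cos(np.pi / 3), np.sin(np.pi / 3)])

    cos_sim = np.dot(x1, x2)               # for unit vectors, cos(theta) = x1 . x2
    cos_dist = 1.0 - cos_sim               # cosine distance
    sq_euclidean = np.sum((x1 - x2) ** 2)  # squared Euclidean distance

    print(cos_sim)                     # 0.5
    print(sq_euclidean, 2 * cos_dist)  # both ≈ 1.0, as the identity predicts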

The performance of the K-NN algorithm is influenced by three main factors :

  1. The distance function or distance metric used to determine the nearest neighbors.
  2. The decision rule used to derive a classification from the K-nearest neighbors.
  3. The number of neighbors used to classify the new example.

Decision surface for K-NN as K changes:

When K = 1, the decision boundary has sharp edges and non-smooth curves, and the classifier makes no mistakes on the training data.

When K = 5, the decision surface is a smoother curve and the classifier makes a few small mistakes.

In K-NN the smoothness of the decision surface increases as K increases.

When K = n, the classifier assigns every query point to the majority class; with K = n the classifier makes many more errors.

We have played with the K-NN decision surface on some toy data sets, shown in the images below.

From the images we can observe that as K increases, the decision surface becomes smoother.

To see the full code and implementation in Python, visit here.

Overfitting and Underfitting of a Model

Using Dtrain, the process of finding the right function for the given data is called fitting.

When K = 1 the model overfits the data, because it makes no errors on the training data.

When K = n the model underfits the data, because it makes many errors; it assigns every query point to the majority class.

The balance between overfitting and underfitting is a well-fit model. It still makes some errors, but machine learning is not perfect and it is fine to make small mistakes.

But how can we be sure whether our model is underfitting or overfitting?

The answer is: by plotting the errors.

We want our model to have maximum accuracy (or minimum error) on the cross-validation dataset.

Training Error: the error a trained model makes when it is run again on the data it was trained on. There is one logical assumption here, by the way: your training set does not include identical training samples belonging to different classes, i.e. conflicting information. Some real-world datasets might have this property, though.

Cross-validation Error: the error measured on the held-out data when choosing the best K by cross-validation.

When the training error is low and the validation error is high, we face an overfitting problem, as shown in the image above.

When the training error is high and the validation error is also high, we face an underfitting problem, as shown in the image above.

We choose a model whose training error and validation error are both small and close to each other, shown as the best fit in the image above.

How to find the best K?

By cross-validation, we find the optimal K.

Cross-validation simply means making some of the data seen by the function (for training) and keeping some of the data unseen by the function (for validation).

Cross-validation (CV) is one of the techniques used to test the effectiveness of machine learning models; it is also a resampling procedure used to evaluate a model when we have limited data. To perform CV, we keep aside a sample/portion of the data which is not used to train the model, and later use this sample for testing/validation.

Below are the few common techniques used for CV.

  1. Train_Test Split approach.

In this approach, we randomly split the complete data into training and test sets, then train the model on the training set and use the test set for validation, ideally splitting the data 70:30 or 80:20. If our data is large and the train and test samples have the same distribution, this approach is acceptable.

We can split the data into train and test sets manually using slicing, or we can use scikit-learn's train_test_split method for this task. Complete documentation is here.
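A minimal sketch of this approach with scikit-learn (the Iris dataset and K = 5 are just placeholder choices):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Hold out 30% of the data for testing (a 70:30 split)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))  # accuracy on the held-out test set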

There is one problem with this approach: with limited data there is a possibility of high bias, because we would miss some information about the data that was not used for training.

So for limited data there is another approach, called K-Fold cross-validation.

K-Fold Cross Validation:

K-Fold is popular and easy to understand, and it generally results in a less biased model compared to other methods, because it ensures that every observation from the original dataset has a chance of appearing in both the training and the test set. It is one of the best approaches if we have limited input data.

This method follows the steps below.

  1. Randomly split your entire dataset into k "folds"
  2. For each fold, build your model on the other k − 1 folds of the dataset, then test the model on the kth fold to check its effectiveness
  3. Record the error you see on each of the predictions
  4. Repeat this until each of the k folds has served as the test set
  5. The average of your k recorded errors is called the cross-validation error and will serve as your performance metric for the model

Repeat this process until every fold has served as the test set, then take the average of your recorded scores; that average is the performance metric for the model.

This takes more time, but with this method every part of the data is used for training and every part of the data is used for cross-validation.
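A compact sketch of choosing K with K-fold cross-validation in scikit-learn; the Iris dataset, the 10 folds, and the odd candidate values of K are illustrative choices.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # Evaluate odd values of K with 10-fold cross-validation
    k_values = list(range(1, 30, 2))
    cv_scores = [cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
                 for k in k_values]

    best_k = k_values[int(np.argmax(cv_scores))]
    print(best_k, max(cv_scores))  # the K with the highest mean validation accuracy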

To see the full code and implementation of cross-validation in Python, visit here.

K-NN for Regression

In regression the output is no longer part of a small finite set of classes; the output Yi is a real number (an infinite set of possible values).

The steps for K-NN regression are as follows (a short sketch follows the list):

  1. Find the optimal K for the given points by cross-validation (or K-fold cross-validation).
  2. For the given query point, find its K nearest neighbors.
  3. Take the mean or median of the targets of the K nearest neighbors (the median is less prone to outliers).
  4. For K-NN classification we take the majority vote of the K nearest neighbors; for K-NN regression we take the mean (or median) of their target values.
  5. K-NN regression is thus a simple extension of K-NN classification.
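A short scikit-learn sketch of these steps; the toy 1-D data and K = 3 are only for illustration, and KNeighborsRegressor averages the neighbors' targets.

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    # Toy 1-D regression data: y is roughly 2x plus a little noise
    X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

    # K-NN regression predicts the mean of the targets of the K nearest neighbors
    reg = KNeighborsRegressor(n_neighbors=3)
    reg.fit(X, y)
    print(reg.predict([[3.5]]))  # average of the 3 nearest targets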

Weighted K-NN:

Weighted K-NN gives each neighbor an importance (weight) based on its distance from the query point.

Weighted K-NN is a modified version of k nearest neighbors. The simplest method is to take the majority vote, but this can be a problem if the nearest neighbors vary widely in their distances and the closest neighbors more reliably indicate the class of the object.

Consider the image above and assume 5-NN. By the distance measure we get 3 positive points and 2 negative points; taking the majority vote, we conclude that the given query point is positive.

Now assume the distances from the query point to its 5 nearest neighbors are as shown in the image below.

But this is not quite right: compared to the positive points, the negative points are much closer to the query point, so we should consider the weights of the 5 nearest-neighbor points.

The relationship between a point's weight and its distance is inverse: when the distance increases the weight decreases, and when the distance decreases the weight increases (in the example below, weight = 1 / distance).

Now consider the weight of each point and take the sum of the weights for the positive class and the negative class.

Weight of the negative class: 10 + 5 = 15

Weight of the positive class: 3.33 + 1.25 + 0.66 = 5.24

The class with the highest total weight becomes the predicted class, so the negative class is our prediction from weighted K-NN.
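A small sketch of this weighted vote; the distances are assumed values, chosen only so that the inverse-distance weights match the numbers above.

    # Assumed distances from the query point to its 5 nearest neighbors
    neighbors = [
        ("negative", 0.1),  # weight 1/0.1 = 10
        ("negative", 0.2),  # weight 1/0.2 = 5
        ("positive", 0.3),  # weight 1/0.3 ≈ 3.33
        ("positive", 0.8),  # weight 1/0.8 = 1.25
        ("positive", 1.5),  # weight 1/1.5 ≈ 0.66
    ]

    weights = {}
    for label, dist in neighbors:
        weights[label] = weights.get(label, 0.0) + 1.0 / dist

    # {'negative': 15.0, 'positive': ~5.25} (5.24 above comes from rounding each weight first)
    print(weights)
    print(max(weights, key=weights.get))  # 'negative' wins the weighted vote

scikit-learn exposes the same idea through KNeighborsClassifier(weights='distance').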

Time Complexity of k-NN

  1. Let's look at the time complexity of k-NN. We are in a d-dimensional space.
  2. To make it easier, let's assume we've already processed some number of inputs, and we want to know the time complexity of adding one more data point.
  3. When training, k-NN simply memorizes the labels of each data point it sees.
    This means adding one more data point is O(d).
  4. When testing, we need to compute the distance between our new data point and all of the data points we trained on.
  5. If n is the number of data points we have trained on, then our time complexity for training is O(dn) in total. Classifying one test input is also O(dn).

Limitations of K-NN

  1. Large storage requirements.
  2. Computationally intensive recall.
  3. K-NN has high time as well as space complexity.
  4. K-NN is not suitable for the low-latency applications used at internet companies.

K-D Tree

Because of the large space and time complexity of K-NN, the K-D tree is used to reduce both.

A K-D tree (also called a K-dimensional tree) is a binary search tree where the data in each node is a K-dimensional point in space. In short, it is a space-partitioning data structure for organizing points in a K-dimensional space.

A balanced k-d tree has a guaranteed depth of about log2(n), where n is the number of points in the set. Traditionally, k-d trees store points in d-dimensional space, which are equivalent to vectors in d-dimensional space.

Pick the X-axis, project each point onto it, compute the median, and split the data at the median.

Then pick the Y-axis, project each point onto it, compute the median, and split the data at the median.

Repeat these steps, alternating the axis at each level, to build the tree.

A non-leaf node in K-D Tree divides the space into two parts, called half-spaces.

Points to the left of this space are represented by the left subtree of that node and points to the right of the space are represented by the right subtree.

If there is just one point, form a leaf with that point. Otherwise, divide the points in half by a line perpendicular to one of the axes and recursively construct k-d trees for the two sets of points. Divide the points perpendicular to the axis with the widest spread.

Find the Nearest Neighbors using K-D Tree

Search recursively to find the point in the same cell as the query. On the way back, search each subtree in which a point closer than the best one found so far might exist.

Keep a variable holding the closest point to the query found so far. Prune a subtree once its bounding box shows that it cannot contain any point closer than the current best. Search the subtrees in the order that maximizes the chance of pruning.
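A minimal sketch of building and querying a k-d tree with scikit-learn; the random 3-D points, the leaf size, and k = 5 are arbitrary choices.

    import numpy as np
    from sklearn.neighbors import KDTree

    rng = np.random.default_rng(0)
    X = rng.random((1000, 3))            # 1000 random points in 3-D

    tree = KDTree(X, leaf_size=40)       # build the k-d tree once
    query = rng.random((1, 3))

    # Distances and indices of the 5 nearest neighbors of the query point
    dist, ind = tree.query(query, k=5)
    print(ind, dist)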

For a step-by-step implementation of building a K-D tree, visit here.

Time Complexity of k-D Tree

A K-D tree runs in O(log n) average time per search in a reasonable model (assuming d is small). Storage for the k-d tree is O(n), and preprocessing time is O(n log n), again assuming d is small.

Limitations of k-D Tree

  1. When d is not small, the time complexity increases dramatically.
  2. A K-D tree works well only when the data is uniformly distributed and the dimension is small, but most real-world data is not uniformly distributed.
  3. The K-D tree was not invented for K-NN; it was mainly invented for information retrieval and computer graphics.

Locality Sensitive Hashing

It’s a beautiful technique to find similar points that are nearest neighbors. It uses the concept of Hashing for this.

You may be familiar with how a hash table is constructed. If not, please follow this link for a quick refresher on hashing concepts. It is a very efficient data structure that allows us to perform lookups in O(1) average time.

LSH is a hashing based algorithm to identify approximate nearest neighbors. In the normal nearest neighbor problem, there are a bunch of points (let’s refer to these as training set) in space and given a new point, the objective is to identify the point in the training set closest to the given point.

Locality Sensitive Hashing is a set of techniques that dramatically speed up search-for-neighbors or near-duplication detection on data. These techniques can be used, for example, to filter out duplicates of scraped web pages at an impressive speed, or to perform near-constant-time lookups of nearby points from a geospatial data set.

Hash functions typically have these key properties:

  1. They map some type of input, such as strings or floats, to discrete values, such as integers.
  2. They’re designed so that two inputs will result in hash outputs that are either different or the same based on key properties of the inputs.

How does it work?

We apply a hash function to each available data point. This gives us the bucket or the key where the point can be placed.

We try to maximize collisions so that similar points go to the same bucket.

Now if we get an unlabelled query point and apply the hash function then it will go to the same bucket where similar points exist.

This makes it easy to find the nearest neighbors for the query point, and subsequently we can apply k-NN to predict its class.

Locality Sensitive Hashing using Cosine Similarity

The problem we are trying to solve is to predict the class of a new data point, given a dataset with pre-classified data points.

LSH is a probabilistic and randomized algorithm, so it is not perfect, yet it is extensively used in computer vision.

What is Cosine Similarity?
At a high-level cosine similarity can tell us how similar two points are. To do this we compute the vector representation for the two points and then find the angle between the two vectors.

The similarity between vectors a and b can be given by cosine of the angle between them.

We can use this concept to calculate the hash value for a data point.

Now that we know cosine similarity we can use this to calculate LSH values for data points. To do this we divide the space using hyperplanes.

For simplicity, consider a 2-D space with the X-Y axes. We can divide this into several regions by using 3 planes (or lines) π1, π2, and π3.

Each data point x1, x2, x3, x4, x5 will then reside in one of these regions. For each plane, we can find on which side of it the point lies by using the concept of the normal vector.

This way we can find a value for each plane; for each plane the value will be either +1 or −1. We can use these values to calculate the hash key.

If two points fall in the same bucket for a hyperplane π, then they are likely to be close to each other.

Once we have the hash table in place we can use this to determine the key for a new data-point. And then find the nearest neighbors.

Say the new point lands in the bucket with key = 1. Then we know it is near the points in that bucket. Next, apply k-NN to find its classification.

Sometimes we could still miss some of the true nearest neighbors with cosine similarity as our distance measure. To solve this problem, construct multiple hash tables for all the points and finally combine the nearest neighbors found in the corresponding buckets.

As the number of hyperplanes increases, the number of slices increases, and therefore the number of points per slice decreases.
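Below is a rough, self-contained sketch of this hyperplane-based hashing; the number of hyperplanes, the toy data, and the helper name lsh_hash are all illustrative choices, not a library API.

    import numpy as np

    rng = np.random.default_rng(0)

    def lsh_hash(points, hyperplanes):
        # One bit per hyperplane: which side of the plane the point falls on
        # (the sign of the dot product with the plane's normal vector)
        signs = points @ hyperplanes.T > 0
        return ["".join("1" if s else "0" for s in row) for row in signs]

    d, n_planes = 5, 3
    hyperplanes = rng.normal(size=(n_planes, d))  # random normal vectors define the planes

    X = rng.normal(size=(10, d))                  # toy data points
    buckets = {}
    for idx, key in enumerate(lsh_hash(X, hyperplanes)):
        buckets.setdefault(key, []).append(idx)

    # A query point is hashed with the same hyperplanes and looked up in its bucket
    query = X[0] + 0.01 * rng.normal(size=d)      # a point very close to X[0]
    key = lsh_hash(query.reshape(1, -1), hyperplanes)[0]
    print(buckets.get(key, []))                   # very likely contains index 0

k-NN is then run only on the points inside the retrieved bucket instead of the whole dataset.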

Locality Sensitive Hashing using Euclidean Distance

It is quite similar to Locality Sensitive Hashing (LSH) for cosine similarity, which we covered earlier; it is better to go through that section before proceeding.

The difference lies in the way we compute the hash value. As we have seen, we can divide the region using planes, and in each region we can have data points.

Divide each plane (projection line) into small parts, project each data point onto the planes, and for each data point take the distance along each plane and use it to calculate the hash value.

Project all the points perpendicularly onto the plane; similar (closer) points should go to the same region (bucket).

Break up the whole region into pieces (regions).

If two points are close to each other, then there is a high chance that they will map to the same interval. Conversely, if the points are far away, it is unlikely they will fall into the same bucket.

This way we can find a value for each plane; for each plane the value can be positive or negative. We can use these values to calculate the hash key.

Once we have the hash table in place we can use this to determine the key for a new data-point. And then find the nearest neighbors.

Say the new point lands in the bucket with key = 1. Then we know it is near the points in that bucket. Next, apply k-NN to find its classification.
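A rough sketch of this Euclidean variant (random projections chopped into intervals of width w); the projection count, the bucket width, and the helper name euclidean_lsh_key are assumptions made only for illustration.

    import numpy as np

    rng = np.random.default_rng(1)

    def euclidean_lsh_key(x, projections, offsets, w=1.0):
        # Project the point onto each random direction, shift by a random offset,
        # and chop the projection line into buckets of width w
        return tuple(int(v) for v in np.floor((projections @ x + offsets) / w))

    d, n_proj = 5, 4
    projections = rng.normal(size=(n_proj, d))  # random projection directions
    offsets = rng.uniform(0, 1.0, size=n_proj)  # one random shift per direction

    a = rng.normal(size=d)
    b = a + 0.01 * rng.normal(size=d)           # a point very close to a
    print(euclidean_lsh_key(a, projections, offsets))
    print(euclidean_lsh_key(b, projections, offsets))  # very likely the same key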

For more information about Locality Sensitive Hashing, visit here.

K-NN as Probabilistic Class Label

KNN is a remarkably simple algorithm with proven error-rates.

Consider 2 Class Classification Y ∈ {0,1} for given points X.

Assume 7-NN (K = 7). Suppose the query point Xq has 4 red neighbors, taken as the negative class (label 0), and 3 blue neighbors, taken as the positive class (label 1); then by the majority vote of the 7 nearest neighbors it is classified as 0.

Instead of the majority vote, if we take a probabilistic approach, we also get the certainty of the prediction.

For the negative class, P(Yq = 0) = 4/7 ≈ 0.57.

For the positive class, P(Yq = 1) = 3/7 ≈ 0.43.

From this probabilistic approach we conclude that the query point belongs to the negative class with probability 57% and to the positive class with probability 43%.
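A tiny sketch of these vote fractions, using the 4 negative and 3 positive neighbor labels from the example above; for uniformly weighted neighbors, scikit-learn's KNeighborsClassifier.predict_proba reports the same fractions.

    from collections import Counter

    # Labels of the 7 nearest neighbors of the query point: 4 negative (0), 3 positive (1)
    neighbor_labels = [0, 0, 0, 0, 1, 1, 1]

    counts = Counter(neighbor_labels)
    k = len(neighbor_labels)
    print({label: count / k for label, count in counts.items()})
    # {0: 0.571..., 1: 0.428...}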

Pros of K Nearest Neighbors

  • A simple algorithm, and hence the predictions are easy to interpret.
  • Non-parametric, so it makes no assumption about the underlying data pattern.
  • Can be used for both classification and regression.
  • The training step is much faster for nearest neighbors compared to other machine learning algorithms.

Cons of K Nearest Neighbors

  • KNN is computationally expensive, as it searches for the nearest neighbors of the new point at the prediction stage.
  • High memory requirement, as KNN has to store all the data points.
  • The prediction stage is very costly.
  • Sensitive to outliers; accuracy is impacted by noise or irrelevant data.

This is a small introduction to K-Nearest Neighbors.


Thanks for reading and for your patience. Let me know if there are any errors in my post. Let’s discuss in the comments if you find anything wrong in the post or if you have anything to add...

Happy Learning!!
