k-Nearest Neighbors

Raghunath D
9 min read · Apr 2, 2019

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement, non-parametric, lazy learning, supervised machine learning algorithm that can be used to solve both classification and regression problems using feature similarity.

Learning the KNN machine learning algorithm is a great way to introduce yourself to machine learning and classification in general. At its most basic level, KNN classifies a point by finding the most similar data points in the training data and making an educated guess based on their labels.

  • Although very simple to understand and implement, this method has seen wide application in many domains, such as in recommendation systems, semantic searching, and anomaly detection.

What is K-Nearest Neighbors?

K-Nearest Neighbors is:

  • Non-parametric, as it does not make any assumption about the underlying data distribution
  • Lazy, as KNN has no training step; all data points are used only at prediction time, which makes the prediction step costly
  • Supervised, as the target variable is known
  • Used for both classification and regression
  • Based on feature similarity: the nearest neighbors of a new point determine the class or value it is assigned

Example: Let’s take a real-life example to understand this.

  • You moved to a new neighborhood and want to be friends with your neighbors, so you start to socialize with them. You decide to pick neighbors that match your thinking, interests and hobbies. Here, thinking, interests and hobbies are the features. You decide your neighborhood friend circle based on similarity in these features. This is analogous to how KNN works.

Just for reference, KNN sits under the Nearest Neighbors section (sklearn.neighbors) of scikit-learn’s list of algorithms. BTW, the scikit-learn documentation is amazing! Worth a read if you’re interested in machine learning.
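As a quick illustration of that, here is a minimal sketch of KNN classification with scikit-learn using its bundled iris dataset; the choice of n_neighbors=3 and the 30% test split are arbitrary, purely for the sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris dataset and hold out 30% of it for testing.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 nearest neighbors
knn.fit(X_train, y_train)                  # "fit" essentially just stores the training data

print(knn.predict(X_test[:5]))             # predicted classes for the first 5 test samples
print(knn.score(X_test, y_test))           # mean accuracy on the test set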

Breaking it down

When we say a technique is non-parametric, it means that it does not make any assumptions about the underlying data distribution.


In other words, the model structure is determined from the data. If you think about it, this is pretty useful, because in the “real world” most data does not obey the typical theoretical assumptions (as in linear regression models, for example). Therefore, KNN could, and probably should, be one of the first choices for a classification study when there is little or no prior knowledge about the data distribution.

KNN is also a lazy algorithm (as opposed to an eager algorithm). What this means is that it does not use the training data points to do any generalization. In other words, there is no explicit training phase. Lack of generalization means that KNN keeps all the training data. To be more exact, all (or most) the training data is needed during the testing phase.

  • We must be able to keep the entire training set in memory unless we apply some type of reduction to the data-set, and performing predictions can be computationally expensive because the algorithm parses through all data points for each prediction. For these reasons, KNN tends to work best on smaller data-sets that do not have many features.

A supervised machine learning algorithm (as opposed to an unsupervised machine learning algorithm) is one that relies on labeled input data to learn a function that produces an appropriate output when given new unlabeled data.

KNN can be used for classification — the output is a class membership (predicts a class — a discrete value). An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors. It can also be used for regression — output is the value for the object (predicts continuous values). This value is the average (or median) of the values of its k nearest neighbors.

KNN Algorithm is based on feature similarity: How closely out-of-sample features resemble our training set determines how we classify a given data point:

Example of k-NN classification (figure): the test sample, at the center of the circles, should be classified either into the first class of blue squares or the second class of red triangles. With k = 3 (inner circle) it is assigned to the second class, because there are 2 triangles and only 1 square inside the inner circle. With k = 5 (outer circle) it is assigned to the first class (3 squares vs. 2 triangles inside the outer circle).
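To make that vote concrete, here is a tiny sketch that mirrors the caption; the neighbor labels are hard-coded in order of increasing distance, purely as an assumption to reproduce the figure:

from collections import Counter

# Neighbor labels ordered from nearest to farthest, mirroring the caption:
# the 3 nearest are 2 triangles and 1 square; the next 2 are both squares.
neighbors_by_distance = ["triangle", "triangle", "square", "square", "square"]

for k in (3, 5):
    votes = Counter(neighbors_by_distance[:k])
    print(k, votes.most_common(1)[0][0])   # k=3 -> triangle, k=5 -> square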

What is ‘k’ in KNN?

k is the number of nearest neighbors considered when classifying or predicting a new data point.

Referring to our friend-circle example in the new neighborhood: we select 3 neighbors that we want to be very close friends with, based on common thinking or hobbies. In this case, k is 3.

KNN looks at the k nearest neighbors to decide where the new data point belongs. This decision is based on feature similarity.

KNN Algorithm

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.

“Birds of a feather flock together.”

KNN captures the idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in our childhood — calculating the distance between points on a graph.

Note: If you are unfamiliar with or need a refresher on how this calculation is done, thoroughly read “Distance Between 2 Points” in its entirety, and come right back.

There are other ways of calculating distance, and one way might be preferable depending on the problem we are solving. However, the straight-line distance (also called the Euclidean distance) is a popular and familiar choice.
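For example, the straight-line distance between two arbitrary points can be computed directly in Python (math.dist requires Python 3.8+; the points below are made up):

import math

p = (1.0, 2.0)
q = (4.0, 6.0)

# Straight-line (Euclidean) distance: sqrt((1-4)^2 + (2-6)^2) = 5.0
print(math.dist(p, q))
print(math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q))))  # the same thing, spelled out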

The gist of the kNN algorithm is:

1. Compute a distance value between the item to be predicted and every item in the training data-set
2. Pick the k closest data points (the items with the k lowest distances)
3. Get the labels/values of the selected k neighbors
   - If regression, return the mean of the 'k' neighbors
   - If classification, return the mode of the 'k' neighbors
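This gist translates almost line-for-line into code. Below is a rough sketch in plain Python; the function names and the toy data are my own, purely for illustration:

import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3, task="classification"):
    # 1. Compute a distance from the query to every item in the training set.
    distances = [(euclidean(x, query), label) for x, label in zip(train_X, train_y)]
    # 2. Pick the k closest data points.
    k_nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    labels = [label for _, label in k_nearest]
    # 3. Mode of the labels for classification, mean of the values for regression.
    if task == "classification":
        return Counter(labels).most_common(1)[0][0]
    return sum(labels) / k

# Tiny toy data-set: two "red" points near the origin, two "green" points far away.
train_X = [(1, 1), (2, 2), (8, 8), (9, 9)]
train_y = ["red", "red", "green", "green"]
print(knn_predict(train_X, train_y, query=(3, 3), k=3))  # -> "red"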

Step 1: Choose a value for K. For binary classification, an odd K helps avoid ties.

Step 2: Find the distance from the new point to each point in the training data.

Step 3: Find the K nearest neighbors to the new data point.

  • For classification, count the number of data points in each category among the k neighbors. The new data point is assigned to the class with the most neighbors.
  • For regression, the new data point’s value is the average of the values of its k neighbors.
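For the regression case, scikit-learn’s KNeighborsRegressor averages the neighbors’ target values in just this way (a uniform average by default); a minimal sketch with made-up one-dimensional data:

from sklearn.neighbors import KNeighborsRegressor

# Made-up 1-D data: the target is roughly 10x the feature value.
X = [[1], [2], [3], [4], [5], [6]]
y = [10, 20, 30, 41, 49, 60]

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)

# The 3 nearest neighbors of 4.4 are x=3, 4 and 5, so the prediction
# is the average of their targets: (30 + 41 + 49) / 3 = 40.0
print(reg.predict([[4.4]]))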

There are two important decisions that must be made before making classifications. One is the value of K that will be used; this can either be decided arbitrarily, or you can try cross-validation to find an optimal value. The next, and the most complex, is the distance metric that will be used.

Choosing the right value for ‘k’

To select the ‘k’ that’s right for our data, we run the KNN algorithm several times with different values of ‘k’ and choose the ‘k’ that minimizes the error while preserving the algorithm’s ability to make accurate predictions on unseen data (a cross-validation sketch follows the list below).

Here are some things to keep in mind:

  1. As we decrease the value of K to 1, our predictions become less stable. Just think for a minute: imagine K=1 and a query point surrounded by several red points and one green point, where the green point happens to be the single nearest neighbor. Reasonably, we would think the query point is most likely red, but because K=1, KNN incorrectly predicts that the query point is green.
  2. Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus, more likely to make more accurate predictions (up to a certain point). Eventually, we begin to witness an increasing number of errors. It is at this point we know we have pushed the value of K too far.
  3. In cases where we are taking a majority vote (e.g. picking the mode in a classification problem) among labels, we usually make K an odd number to have a tiebreaker.
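Here is a rough sketch of that cross-validation approach using scikit-learn’s cross_val_score on the iris dataset; the range of k values and the 5 folds are arbitrary choices:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Evaluate odd values of k with 5-fold cross-validation and keep the best one.
scores = {}
for k in range(1, 22, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])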

Distance metric

There are many different ways to compute distance, as it is a fairly ambiguous notion, and the proper metric to use is always going to be determined by the data-set and the task at hand.

Distance can be calculated using

  • Euclidean distance
  • Manhattan distance
  • Minkowski Distance
  • Hamming Distance

Euclidean distance is the square root of the sum of the squared differences between the coordinates of two points. It is also known as the L2 norm.

Manhattan distance is the sum of the absolute differences between the coordinates of two points. It is also known as the L1 norm.

Minkowski distance is a generalization of the two: when p=1 it becomes the Manhattan distance, and when p=2 it becomes the Euclidean distance.

Hamming distance is used for categorical variables. In simple terms, it tells us whether two categorical values are the same (distance 0) or not (distance 1); for vectors, it counts the positions that differ.

Another common metric is Cosine similarity. Rather than calculating a magnitude, Cosine similarity instead uses the difference in direction between two vectors.

General formula for cosine similarity: cos(θ) = (A · B) / (‖A‖ ‖B‖)
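These metrics are short enough to sketch in plain Python (libraries such as SciPy ship tested implementations; the functions below are only for illustration):

import math

def euclidean(a, b):          # L2 norm
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):          # L1 norm
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=2):     # p=1 -> Manhattan, p=2 -> Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):            # number of positions at which the values differ
    return sum(x != y for x, y in zip(a, b))

def cosine_similarity(a, b):  # compares direction, not magnitude
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(minkowski((0, 0), (3, 4), p=2))               # 5.0, same as euclidean((0, 0), (3, 4))
print(hamming(("red", "small"), ("red", "large")))  # 1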

A few Applications and Examples of KNN

  • Credit ratings — comparing a person’s financial characteristics to a database of people with similar financial features. By the very nature of a credit rating, people with similar financial details tend to receive similar credit ratings, so an existing database can be used to predict a new customer’s credit rating without repeating all the calculations.
  • Should the bank give a loan to an individual? Would an individual default on his or her loan? Is that person closer in characteristics to people who defaulted or did not default on their loans?
  • In political science — classifying a potential voter as “will vote” or “will not vote”, or as “vote Democrat” or “vote Republican”.
  • More advanced examples could include handwriting recognition (like OCR), image recognition and even video recognition.

Some pros and cons of KNN

Pros:

  • No assumptions about data — useful, for example, for nonlinear data
  • Simple algorithm — to explain and understand/interpret
  • Relatively high accuracy — often a strong baseline, though usually not competitive with more sophisticated supervised learning models
  • Versatile — useful for classification or regression

Cons:

  • Computationally expensive at prediction time — every prediction requires computing distances to all of the training data
  • High memory requirement — stores all (or almost all) of the training data
  • The algorithm gets significantly slower as the number of examples and/or predictors/independent variables increases
  • Sensitive to irrelevant features and the scale of the data
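Because of that sensitivity to feature scale, features are usually standardized before applying KNN; a minimal scikit-learn sketch (the wine dataset and the parameter choices here are arbitrary):

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Standardizing first keeps features with large ranges from dominating the distance.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))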

Summary

The k-nearest neighbors (KNN) algorithm is a simple, non-parametric, lazy learning, supervised machine learning algorithm that can be used to solve both classification and regression problems. It’s easy to implement and understand, but has the major drawback of becoming significantly slower as the size of the data grows.

KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (k) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the values (in the case of regression).

A few other features of KNN:

  • KNN stores the entire training dataset which it uses as its representation.
  • KNN does not learn any model.
  • KNN makes predictions just-in-time by calculating the similarity between an input sample and each training instance.

I hope this helps in understanding what the K-Nearest Neighbor algorithm is.
