K Nearest Neighbors

Shubhangi Hora
4 min read · Oct 7, 2018


In the previous article, I explained the Logistic Regression algorithm, which, despite its name, is a classification algorithm most commonly used in cases where there are two output classes.

K Nearest Neighbors is also a classifier, and as the name suggests, this algorithm assigns a class to a data point based on the classes of that data point’s nearest neighbors. K is a positive integer that tells the algorithm how many nearest neighbors to consider when classifying a data point. Once the nearest neighbors have been determined, each neighbor ‘votes’ for its own class, and the class that receives the majority of the votes is ultimately assigned to the new data point.

[Image: a new data point (red star) surrounded by data points from class A and class B, with circles marking its 3 and 6 nearest neighbors]

For example, in the image above, the red star is the new data point that needs to be classified. If the value of k is 3, then the three data points closest to the star are its nearest neighbors: two data points from class B and one from class A. Since the majority of the star’s neighbors belong to class B, the red star is assigned class B.

However, if the value of k is 6, then the majority of the star’s neighbors belong to class A. Hence, the class assigned to the red star is class A.
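Here is a minimal sketch of that voting step in Python. The neighbor labels are made up to match the picture above (two class B points nearest, then four class A points):

```python
from collections import Counter

# Hypothetical labels of the star's neighbors, ordered from nearest
# to farthest, made up to match the image: 2 B's, then 4 A's.
neighbor_labels = ["B", "B", "A", "A", "A", "A"]

for k in (3, 6):
    votes = Counter(neighbor_labels[:k])   # tally classes among the k nearest
    winner = votes.most_common(1)[0][0]    # class with the most votes
    print(f"k = {k}: {dict(votes)} -> class {winner}")
```

Running this prints class B for k = 3 and class A for k = 6, exactly as in the example.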

But how does the algorithm decide which data points are the nearest neighbors?

The algorithm calculates the distance between the new data point and all the other data points. There are different distance functions that can be used to do this; the most common are the Euclidean, Manhattan, and Minkowski distances.

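To make these concrete, here is a minimal sketch of how the three distances could be computed for a pair of made-up points using NumPy:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance: 5.0
manhattan = np.sum(np.abs(a - b))                  # sum of absolute differences: 7.0
p = 3
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)  # generalizes both: p=1 is Manhattan, p=2 is Euclidean

print(euclidean, manhattan, minkowski)
```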

Additionally, there are several methods, built on top of these distance functions, for determining which data points are a given point’s nearest neighbors. Let’s look at a few:

1. Brute Force Method

In this method, the Euclidean distance between the new data point and every other data point is calculated. The k points with the smallest distances are the nearest neighbors. These neighbors then vote for their classes, and the class assigned to the majority of them is assigned to the new data point as well.

The Brute Force method works well when the training set is small and has a small number of features, since the time taken grows with both the training set size and the number of features. This is because the Euclidean distance has to be calculated between the new point and every other point, each time a new point is given to the algorithm.
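As an illustration, here is a minimal from-scratch sketch of the Brute Force method; the function name and data are made up for the example:

```python
import numpy as np
from collections import Counter

def brute_force_knn(X_train, y_train, new_point, k=3):
    """Classify new_point by majority vote among its k nearest neighbors."""
    # Euclidean distance from new_point to every training point
    distances = np.sqrt(np.sum((X_train - new_point) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the neighbors' classes
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [8, 9], [9, 8]])
y_train = np.array(["A", "A", "B", "B"])
print(brute_force_knn(X_train, y_train, np.array([1.5, 1.0])))  # prints "A"
```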

2. KD-Tree Method

Since the Brute Force method doesn’t scale well to large data sets, a variety of other methods have been introduced to make the K Nearest Neighbors algorithm more efficient. One such method is the KD-Tree method. It stores aggregate distance information in a tree structure, so that each time a new data point is introduced, the algorithm doesn’t have to calculate the distance between it and all the other data points again.

This method has a better time complexity than the Brute Force method, but it remains efficient only while the number of features/dimensions is small (roughly fewer than 20).

Let’s assume we have three data points: A, B, and C. If A is far from B and B is close to C, then we know that A is also far from C, without having to calculate the distance between A and C. This kind of reasoning is how the KD-Tree method prunes distance calculations while determining the nearest neighbors. Once they are found, the class assigned to the majority of the nearest neighbors is assigned to the new point.
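A minimal sketch of this idea using SciPy’s KDTree (assuming SciPy is available; the data is random):

```python
import numpy as np
from scipy.spatial import KDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 2))  # 1,000 random 2-D points

tree = KDTree(points)           # build the tree once, up front

# Distances and indices of the 3 nearest neighbors of a query point
distances, indices = tree.query([0.5, 0.5], k=3)
print(indices, distances)
```

The tree is built once; each query then only visits a fraction of the points instead of comparing against all of them.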

To read more about KD-Tree and another method called Ball Tree, click here.

Click here to see how to use the KNN classifier in Python using Scikit-Learn.
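In the meantime, here is a minimal sketch with Scikit-Learn’s KNeighborsClassifier, using its bundled iris data set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_neighbors is k; algorithm can be 'brute', 'kd_tree', 'ball_tree', or 'auto'
knn = KNeighborsClassifier(n_neighbors=5, algorithm="auto")
knn.fit(X_train, y_train)

print("Accuracy:", knn.score(X_test, y_test))
```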

The K Nearest Neighbors algorithm is often called the ‘Lazy Learner’, because instead of learning a model during the training period, it simply memorizes the whole data set. Once a data set is loaded into the algorithm, it plots all the data points onto an n-dimensional graph, shown below:

[Image: data points from two classes, in two different colors, plotted on a graph]

The two different colors represent two different classes. Every time a new data point is introduced, the algorithm plots it onto this same graph, then calculates the distances and determines the nearest neighbors based on whichever method has been specified.

Other classifiers, on the other hand, learn parameters during their training period. For example, the Logistic Regression algorithm learns the model coefficients during the training period, and so when a new data point is given to it, it just uses the model coefficients to calculate the probability of the classes, and then assigns the class with the highest probability.

Additionally, K Nearest Neighbors can also be used for regression problems. The difference is that instead of the neighbors voting for their classes, the average of the nearest neighbors’ values is taken, and that average is assigned to the new data point. This, however, is not recommended.
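For completeness, a minimal sketch of the regression variant using Scikit-Learn’s KNeighborsRegressor, with made-up data:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up one-dimensional data: y is roughly 2x with some noise
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])

reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X, y)

# The prediction is the average of the y values of the 3 training
# points nearest to x = 3.5
print(reg.predict([[3.5]]))
```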

Click here to read more!

Next, I’ll be discussing Linear Discriminant Analysis, so stay tuned!
