k-Nearest Neighbors

SarahXu
3 min read · May 29, 2020


In this blog you will learn about k-Nearest Neighbors and see how to implement the algorithm from scratch in Python.

k-Nearest Neighbors

The k-nearest neighbors (kNN) classification algorithm is one of the simplest methods in data mining classification techniques. It is useful for building recommendations, similar to what Netflix does. "k nearest neighbors" means that each sample can be represented by its k nearest neighbors.

The core idea of the kNN algorithm is that if most of an unknown sample's nearest neighbors belong to a certain category, then the unknown sample belongs to that category as well. You determine which samples are the nearest neighbors by calculating the distance between the unknown sample and the known samples. You should repeat these calculations at different neighborhood radii (different k values) to avoid getting a false positive or negative.

Implement k-Nearest Neighbors

We use the Euclidean distance, d(x, y) = sqrt((x₁ − y₁)² + … + (xₙ − yₙ)²), to measure the straight-line distance between two vectors.
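A minimal sketch of such a distance function (the name euclidean_distance is just an illustrative choice):

```python
import math

def euclidean_distance(row1, row2):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))
```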

The full kNN class is listed below.
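Here is a sketch of what such a class could look like, reusing the euclidean_distance helper above (the class and method names are illustrative, not canonical):

```python
from collections import Counter

class KNN:
    """A from-scratch k-nearest-neighbors classifier (illustrative sketch)."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # kNN is a lazy learner: "fitting" just stores the training data.
        self.X_train = list(X)
        self.y_train = list(y)
        return self

    def _get_neighbors(self, test_row):
        # Distance from the test point to every training point.
        distances = [
            (label, euclidean_distance(test_row, train_row))
            for train_row, label in zip(self.X_train, self.y_train)
        ]
        # Sort by distance and keep the labels of the k closest points.
        distances.sort(key=lambda pair: pair[1])
        return [label for label, _ in distances[: self.k]]

    def predict(self, X):
        # Majority vote among the k nearest neighbors of each test point.
        return [
            Counter(self._get_neighbors(row)).most_common(1)[0][0]
            for row in X
        ]
```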

To locate the neighbors, we calculate the distance between the test point and every point in the training set. Once the distances are calculated, we sort them in ascending order and select the top k to return the most similar neighbors.
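As a quick illustration on made-up toy data, the neighbor-selection and voting steps behave like this:

```python
# Toy data: two "red" points near (1, 1) and two "black" points near (8, 8).
X_train = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]]
y_train = ["red", "red", "black", "black"]

model = KNN(k=3).fit(X_train, y_train)
print(model._get_neighbors([1.2, 1.1]))         # ['red', 'red', 'black']
print(model.predict([[1.2, 1.1], [8.2, 8.5]]))  # ['red', 'black']
```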

Next, we apply the kNN algorithm to the Iris dataset. The example is below.
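A sketch of that experiment, assuming the KNN class above and scikit-learn's built-in Iris loader (the 80/20 split and k = 5 are arbitrary choices here):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the Iris dataset and hold out 20% of it for testing.
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Fit the from-scratch classifier and predict on the held-out set.
model = KNN(k=5).fit(X_train, y_train)
predictions = model.predict(X_test)
```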

Now, we check the accuracy and compare it with the implementation in the sklearn library.
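One way to run that comparison, using scikit-learn's KNeighborsClassifier with the same k:

```python
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Accuracy of the from-scratch classifier.
print("from scratch:", accuracy_score(y_test, predictions))

# Accuracy of scikit-learn's kNN with the same k, for comparison.
sk_model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("sklearn:", accuracy_score(y_test, sk_model.predict(X_test)))
```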

K Value

Even though the from-scratch code resulted in a lower accuracy score, it gives us a chance to learn more about this algorithm. We know that different k values produce different results for our sample. The image below shows how important it is to choose a good k value.

This image shows that if we choose a low k value, such as k = 1, we could easily classify our red sample as black, which we can see is wrong.

If we instead choose a k value that is too large, our image would look like the one below.

Again, the sample would be classified as black, which is wrong.

Choosing an appropriate k value gives us a better result.
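One rough way to see this effect on the Iris split above is to sweep k and watch the test accuracy (an illustration, not a rigorous model-selection procedure):

```python
# Sweep odd k values and report the test accuracy for each,
# to see how the choice of k changes the result.
for k in range(1, 16, 2):
    model = KNN(k=k).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"k={k:2d}  accuracy={acc:.3f}")
```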

Takeaways

1. The core idea of the kNN algorithm: given a training data set and a new input instance, find the k instances in the training data set that are closest to the new instance. If the majority of these k instances belong to a certain class, classify the input instance into that class.

2. The chosen k value should be neither too large nor too small.
