A Guide to k-Nearest Neighbors

A straightforward guide explaining step by step how to implement the algorithm

Johan Mazorra
The Startup
10 min read · Jul 31, 2020


(Image via Analytics Vidhya)

k-Nearest Neighbors Explained

The k-Nearest Neighbors algorithm is quite a straightforward method. The whole training dataset is stored, and when a prediction is required, the k most similar records to the new record are found in that stored data. From these neighbors, a summarized prediction is made.

Similarity between records can be measured in several ways. A good starting point is the Euclidean distance, since we are mostly dealing with tabular data when approaching Data Science problems in Python. It is the straight-line distance between two points (I will expand on the Euclidean distance and explain its substance in another section of this article).

Once the neighbors are found, a summary prediction can be made by returning the most common outcome or by taking the average. KNN is therefore very flexible and can be used for classification or even regression problems.

Also, keep in mind that KNN is a supervised learning algorithm, meaning that the examples in the dataset must have labels assigned to them: their classes must be known. There are two other significant details to point out about KNN. First, KNN is a non-parametric algorithm. This means no assumptions about the structure of the data are made when the model is used; rather, the model is built entirely from the given data. Second, there is no explicit training step when using KNN. KNN does not generalize the training set into a fitted model, so all of the training data is needed, and used, whenever the model is asked to make predictions.

Furthermore, we will be following three basic steps for implementing the k-Nearest Neighbors algorithm:

1) Formulate the Euclidean distance.

2) Fit the model to the training data and return the predictions.

3) Obtain the Nearest Neighbors.

The Importance of the Value Of K

(Image via Wikimedia Commons, CC)

Before we proceed with the steps for implementing the KNN algorithm, let me explain why the value of K matters. The fundamental risk when using KNN is that a poor value of K (the wrong number of neighbors to consider) may be picked. If this happens, the predictions that are returned can be off considerably. It is therefore important, when using a KNN algorithm, to pick a proper value for K. We want a value for K that maximizes the model's ability to make predictions on unseen data while minimizing the number of errors it makes.

Lower values of K mean that the predictions delivered by KNN are less stable and reliable. To get an intuition for why this happens, consider a situation where we have seven neighbors around a target data point. Assume that the KNN model is working with a K value of two (we are asking it to look at the two nearest neighbors to make a prediction). If the majority of the neighbors (five out of seven) belong to the Blue class, but the two nearest neighbors just happen to be Red, the model will predict that the query sample is Red. Despite the model's guess, in such a situation Blue would be the better call.

If so, why not simply pick the highest K value we can? Because telling the model to consider too many neighbors will also reduce accuracy. As the radius the KNN model examines increases, it will eventually start considering data points that are closer to other groups than they are to the target data point, and misclassification will start to happen. For instance, even if the point that was originally picked sits in one of the red regions above, if K is set too high the model will reach into other regions to gather points. When using a KNN model, various values of K are tried to see which gives the model the best performance, as sketched below.
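As a rough illustration of that search (a minimal sketch using scikit-learn's built-in iris data purely as a convenient example dataset; the range of K values tried here is arbitrary):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Try a range of K values and keep the one with the best cross-validated accuracy.
scores = {}
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print("Best K:", best_k, "with mean accuracy", round(scores[best_k], 3))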

I. So what exactly is the Euclidean Distance?

Euclidean distance is the distance between two points in Euclidean space. Euclidean geometry was originally formulated by the Greek mathematician Euclid around 300 B.C. to study the relationships among angles and distances. This system of geometry is still in use today and is the one that high school students typically study. Euclidean geometry specifically covers spaces of two and three dimensions, but it can easily be generalized to higher dimensions.

A straightforward way to understand how the Euclidean distance is derived is to:

1) Take two points P and Q in a two-dimensional Euclidean space. We will describe P with the coordinates (p1, p2) and Q with the coordinates (q1, q2). Next, construct a line segment with endpoints P and Q. This line segment forms the hypotenuse of a right triangle. Continuing, we note that the lengths of the legs of this triangle are given by |p1 - q1| and |p2 - q2|. The distance between the two points is then given by the length of the hypotenuse.

2) Then, use the Pythagorean theorem to determine the length of the hypotenuse. This theorem states that c² = a² + b², where c is the length of a right triangle's hypotenuse and a, b are the lengths of the other two legs. This gives us c = (a² + b²)^(1/2) = ((p1 - q1)² + (p2 - q2)²)^(1/2). The distance between two points P = (p1, p2) and Q = (q1, q2) in two-dimensional space is therefore ((p1 - q1)² + (p2 - q2)²)^(1/2).

3) Extend the result to a three-dimensional space. The distance between points P = (p1, p2, p3) and Q = (q1, q2, q3) can then be given as ((p1 - q1)² + (p2 - q2)² + (p3 - q3)²)^(1/2).

4) Generalize the solution to the distance between two points P = (p1, p2, …, pn) and Q = (q1, q2, …, qn) in n dimensions. This general solution can be given as ((p1 - q1)² + (p2 - q2)² + … + (pn - qn)²)^(1/2).

Now, you may be asking: “How would I translate this into code? Where would I even start?” The answer to that is actually quite simple. Let me guide you through it with a sample Euclidean distance function:
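Here is a minimal sketch of such a function, assuming NumPy is available and that the two points are passed as equal-length sequences of coordinates:

import numpy as np

def euclidean_distance(p, q):
    # Straight-line distance between two points of equal dimension:
    # ((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)^(1/2)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

For example, euclidean_distance([0, 0], [3, 4]) returns 5.0, the length of the hypotenuse of a 3-4-5 right triangle.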

II. Fit the model and return the predictions

The next step is to fit the KNN classifier, which is the most straightforward step to follow. This step is not strictly necessary for making predictions; however, loading the whole training dataset into memory is what the accurate predictions in the following step rely on.

In Python we would implement it like this:
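Here is a minimal sketch of that step, written as a small classifier class (the class name and its attributes are illustrative choices, not a fixed API):

import numpy as np

class KNNClassifier:
    def __init__(self, k=5):
        # Number of neighbors to consider when making a prediction.
        self.k = k

    def fit(self, X_train, y_train):
        # KNN is a lazy learner: "fitting" simply stores the training data
        # so it can be searched at prediction time.
        self.X_train = np.asarray(X_train, dtype=float)
        self.y_train = np.asarray(y_train)
        return self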

Then, after fitting the model we will add the predictive function:
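A possible sketch of that prediction step, reusing euclidean_distance() from step I (the class from the previous snippet is re-opened via subclassing only to keep this snippet short; in a single file the method would simply sit inside the class body):

from collections import Counter
import numpy as np

class KNNClassifier(KNNClassifier):
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        predictions = []
        for x in X:
            # Distance from this query point to every training point.
            distances = [euclidean_distance(x, x_train) for x_train in self.X_train]
            # Indices of the k smallest distances, i.e. the nearest neighbors.
            nearest_idx = np.argsort(distances)[:self.k]
            nearest_labels = self.y_train[nearest_idx]
            # Majority vote: the most frequent label among the neighbors wins.
            predictions.append(Counter(nearest_labels).most_common(1)[0][0])
        return np.array(predictions)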

Let me clarify what this piece of code is doing. Basically, we have a few for loops: a main loop which in turn contains several inner loops.

The main loop iterates through the data points of X. On each iteration, the euclidean_distance() function from the first step is called to compute the distance to each data point in X_train (yes, there are more efficient ways to do this). Since we are not interested in every distance value for every pair of data points, we discard most of these Euclidean distance values and keep only the k smallest ones (with k being the number of neighbors).

Finally, we simply iterate over the neighbors, count the occurrences of their labels, and return the most frequent one as our prediction.

III. Obtaining the nearest neighbors

We now have a technique for making predictions and can measure how accurate our KNN classifier is. Ideally, we would also want a method that simply returns the predicted neighbors themselves.

The function below returns a list of tuples containing the k closest neighbors and the Euclidean distances to the corresponding inputs:
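A sketch of such a helper is below; the function name and the (distance, neighbor) ordering of the tuples are my choices for illustration, and it reuses the euclidean_distance() function from step I:

def get_neighbors(X_train, x_query, k):
    # Return the k nearest training points to x_query as
    # (distance, training_point) tuples, sorted from closest to farthest.
    distances = [(euclidean_distance(x_query, x_train), x_train) for x_train in X_train]
    distances.sort(key=lambda pair: pair[0])
    return distances[:k]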

Now, let’s test the performance of the KNN classifier we built by comparing it against Sklearn’s own KNN classifier. I created two tests with different datasets to compare their prediction accuracy and see whether one of the two achieves a better result. For the first test, I used the wine quality dataset, and for the second I used the digits dataset directly from sklearn’s dataset module.

Load sklearn’s packages and prepare the wine quality data.
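One way this setup might look is sketched below; the exact file name, separator, and target column depend on which copy of the wine quality dataset is used (assumed here to be the UCI winequality-red.csv file), and the scaling step is my own addition, since Euclidean distances are sensitive to feature scale:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumed local copy of the UCI wine quality data (semicolon-separated).
wine = pd.read_csv("winequality-red.csv", sep=";")

X = wine.drop(columns=["quality"]).values
y = wine["quality"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Standardize features so that no single column dominates the distance.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)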

KNN Classifier from Sklearn
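A minimal sketch of that baseline (n_neighbors=5 is just an illustrative choice):

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

sk_model = KNeighborsClassifier(n_neighbors=5)
sk_model.fit(X_train, y_train)

sk_predictions = sk_model.predict(X_test)
print("Sklearn KNN accuracy:", accuracy_score(y_test, sk_predictions))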

Result for the accuracy of Sklearn’s KNN classifier

KNN Classifier built from scratch
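And the same evaluation with the classifier sketched earlier, using the same value of k so the comparison stays fair:

from sklearn.metrics import accuracy_score

scratch_model = KNNClassifier(k=5)
scratch_model.fit(X_train, y_train)

scratch_predictions = scratch_model.predict(X_test)
print("From-scratch KNN accuracy:", accuracy_score(y_test, scratch_predictions))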

Result for the accuracy of the KNN classifier built from scratch.

For the second test, the only difference is the dataset being imported (in this case the digits dataset) and how the data is loaded; everything else is exactly the same. Let me show you below:
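A sketch of that loading step for the digits data (the split parameters are again illustrative):

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Load the handwritten digits data straight from sklearn's dataset module.
X, y = load_digits(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)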

As stated above, everything else will be set up in the same manner as in the first test. So let’s check the results below starting with sklearn’s KNN classifier.

Yes! As we can see, my personal goal of beating Sklearn’s KNN classifier algorithm was achieved.

Conclusion

Finally, we implemented a rather simple and highly accurate version of the KNN classifier algorithm. In these tests its performance was superior to Sklearn’s KNN classifier, which is a satisfying achievement, since I put a lot of effort into beating it. In the future, I plan to build more predictive models and implement them in a way that outperforms other machine learning libraries.

Link to the repository for the KNN classifier algorithm that was built from scratch:
