K Nearest Neighbor Classifier — from scratch

Dean Hadzi
Lambda Build Week Projects Journey
3 min read · Oct 26, 2020

The k-nearest neighbors algorithm is a common machine learning tool for both Classification and Regression problems. In a nutshell, the algorithm takes a data point we are interested in and finds the k closest points to it in the feature space.

Once these neighbors are found, the result can be presented in one of two ways: as a predicted class membership (the class most represented among those neighbors) in the case of Classification, or as the average of those neighbors’ values if we are using the method for Regression.

Let’s see how this can be implemented from scratch…

In reproducing the algorithm we have to follow three basic steps:

1. Calculate the Euclidean Distance

This measurement is a single number: the actual length of the line segment between two points. To calculate it, we first verify that the data point in question has the same number of coordinates as every point in our train set. Once that condition is satisfied, we simply sum the squared differences of the matching coordinates and take the square root of that sum.

A simple example might be able to clarify this point. In this example we will have two points, A with coordinates (2, 5) and B with coordinates (-3, 7).

Step 1: square the difference of each pair of matching coordinates.

(A1 - B1)² and (A2 - B2)². In our case that would be (2 - (-3))² and (5 - 7)².

Step 2: sum up the items from step 1. In our case those are 25 and 4, so the sum is 29.

Step 3: take the square root of the sum. sqrt(29) ≈ 5.385. This number is the Euclidean distance between A and B.
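In Python, the whole calculation fits in a few lines. This is just a rough sketch (the function name euclidean_distance is mine and may differ from the linked repo):

```python
from math import sqrt

def euclidean_distance(point_a, point_b):
    """Length of the line segment between two points with matching coordinates."""
    # Both points must have the same number of coordinates.
    assert len(point_a) == len(point_b), "points must have the same dimensions"
    # Sum the squared differences of matching coordinates, then take the root.
    return sqrt(sum((a - b) ** 2 for a, b in zip(point_a, point_b)))

print(euclidean_distance((2, 5), (-3, 7)))  # 5.385..., same as the worked example
```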

2. Get the Neighbors

After completing step 1, we repeat it over and over until we have calculated the distance between EACH point in our train set and the data point in question.

Continuing the example from before, let’s imagine we are trying to classify point A. Our train set consists of five other points: B, C, D, E and F. We compute the Euclidean distance between A and B, A and C, and so on… This would be very tedious for a human, but a computer can crunch a huge number of these calculations with a simple for loop.

Each calculation yields a result, and the smaller the result, the better: small values mean the points are close together in the feature space. Let’s assume for a second that our calculations yielded the following five results, in order: 5.385, 4.12, 0.17, 1.25 and 2.56.

We then sort these results in ascending order and keep only as many as our k-value specifies. If k is 3, we keep the three smallest and discard the rest.
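Sticking with plain Python, this step could be sketched roughly like this, assuming each training row is a (coordinates, label) pair and reusing the euclidean_distance helper from above:

```python
def get_neighbors(train_set, query_point, k):
    """Return the k training rows closest to the query point."""
    # Each row is assumed to be a (coordinates, label) pair.
    distances = [(row, euclidean_distance(query_point, row[0])) for row in train_set]
    # Sort by distance, ascending, and keep only the k nearest rows.
    distances.sort(key=lambda pair: pair[1])
    return [row for row, _ in distances[:k]]
```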

3. Make a Prediction

To finalize the process, let’s also assume that each point has a picture associated with it, say of cats and dogs. The point of interest is a picture of an animal we are not entirely sure about. We predict what it is by simply taking a majority vote among the neighbors we kept in the previous step.

The three smallest values belonged to points D, E and F. D has a picture of a dog at its coordinates, while E and F each have a picture of a cat. By counting the frequency of each category among our neighbors, we would predict that the data point in question is actually a picture of a cat.
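Here is a minimal sketch of this final step, using the helpers above. The coordinates and labels below are invented purely so that D, E and F come out as the three nearest neighbors:

```python
from collections import Counter

def predict_classification(train_set, query_point, k):
    """Predict a label by majority vote among the k nearest neighbors."""
    neighbors = get_neighbors(train_set, query_point, k)
    labels = [label for _, label in neighbors]
    # The most common label among the k neighbors wins the vote.
    return Counter(labels).most_common(1)[0][0]

train_set = [((-3, 7), 'dog'),     # B
             ((6, 6), 'dog'),      # C
             ((2.1, 5.1), 'dog'),  # D
             ((2.5, 6.1), 'cat'),  # E
             ((3.5, 7), 'cat')]    # F
print(predict_classification(train_set, (2, 5), k=3))  # 'cat'
```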

And that is basically it.

Of course, this is a very simple explanation of the algorithm, which proves to be surprisingly robust and decent in its predictions despite its simple mechanics. It also works in higher-dimensional spaces, where each point is described by hundreds of features. As such, it is often used to establish a baseline in your research before you start investing valuable resources into more complex methods.

I hope this explanation helped you understand it better. For an actual Python code implementation, visit the following link: https://github.com/deanhadzi/cs_unit_1_build
