Creating a k-Nearest Neighbors Algorithm with some spare Python and NumPy I found lying around

Emma Rose
4 min read · Jun 26, 2020
Photo by Jari Hytönen on Unsplash

I have used k-Nearest Neighbors for quite a variety of projects. It was at the center of a simple recommendation system I created. For those who are unfamiliar with the concept, it isn’t too complicated. Say you have a large dataset with various measurements of cats: their weight, length, color, whether they have a cute meow. If you put in all of this information about your cat, k-Nearest Neighbors would bring you the cat most similar to yours. Or, if you want a party, the 100 cute cats most similar to yours.

It measures the distance between your input and all the data you are comparing it against, then returns the closest item to your input.

That is all there is to it. It is simple, versatile, and effective for many types of problems.

Photo by Tran Mau Tri Tam on Unsplash

The key to this handy code is the function used to measure Euclidean distance. To put that in layman’s terms, it is the length of a straight line between two points.

Euclidean distance, illustrated; created by Wikipedia user Kmhkmh

Wikipedia has a wonderful, intimidating explanation of this simple math. In simpler terms, each number in the input array is subtracted from its corresponding number in the original data. Imagine the weight of your cat being subtracted from the weight of the first cat in the list. To take care of negative values, each of these differences is squared. The squared differences are then added together, and we take the square root of that sum (to undo the squaring), and ta-da, we have the Euclidean distance!
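In symbols, that is the familiar formula

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

where $p$ and $q$ are the two cats being compared and $n$ is the number of measurements we have for each.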

Euclidean distance function
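The original post embeds this as an image of a gist, so here is a minimal sketch of what such a function might look like, assuming two equal-length lists or NumPy arrays (the name e_distance is the one the post uses later):

```python
import numpy as np

def e_distance(row1, row2):
    """Euclidean distance between two equal-length sequences of numbers."""
    row1 = np.asarray(row1, dtype=float)
    row2 = np.asarray(row2, dtype=float)
    # Subtract corresponding values, square away the negative signs,
    # add the squared differences together, then take the square root.
    return np.sqrt(np.sum((row1 - row2) ** 2))
```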

After computing this for every single cat in the list, the next step is to simply sort them by distance. The cat with the smallest total distance across all the categories (weight, length, color, cute meow) is the one most similar to yours. The k-NN function will return as many of the nearest neighbors as you wish, and provide the distance measured for each.

Below you will see the e_distance function implemented in the final k-NN function.

k-Nearest Neighbors
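Since this gist is also embedded as an image, what follows is a sketch rather than the post’s exact code, under the assumption that each data row stores its feature values first and a label (kitten or not) last; the signature is illustrative:

```python
def knn(train, test_row, k):
    """Return the k rows of train closest to test_row, each with its distance."""
    n_features = len(test_row)
    # Measure the distance from the input to every row's feature columns.
    scored = [(e_distance(test_row, row[:n_features]), row) for row in train]
    # Sort ascending so the most similar rows come first, then keep k of them.
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]
```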

So if you imagine a small set of cat data, documenting the weight, length, and whether each cat is a kitten or not, it may look like this. We then take the data from your cat (assuming it is not a kitten), put it into our k-NN function, and expect the two closest cats as output. You can imagine the power of this algorithm as we scale up the size and usefulness of the data.
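The data and the call are also shown as images in the original post; a hypothetical stand-in (the numbers here are made up for illustration, with the last column marking 1 for kitten and 0 for adult) could look like this:

```python
# Columns: weight (kg), length (cm), is_kitten (1 = kitten, 0 = adult).
cats = [
    [4.2, 46.0, 0],
    [1.1, 25.0, 1],
    [3.8, 44.0, 0],
    [0.9, 23.0, 1],
    [5.0, 49.0, 0],
]

my_cat = [4.0, 45.0]  # your cat's weight and length; no label yet

# Ask for the two closest cats.
neighbors = knn(cats, my_cat, k=2)
for distance, row in neighbors:
    print(round(distance, 2), row)  # e.g. 1.02 [4.2, 46.0, 0]

# A majority vote over the neighbors' labels predicts kitten vs. adult.
labels = [row[-1] for _, row in neighbors]
print("kitten" if sum(labels) > len(labels) / 2 else "adult")
```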

k-Nearest Neighbors Output

The first number you see is the measured distance between your cat and the closest one. Then that cat’s measurements are printed out, and as you can see it is very close in weight and length to your cat. The function then correctly predicts that your cat is an adult, using the two closest cats as a reference.

This simple tool has its pros and cons like anything else, but I find it exciting and interesting that simple math can do a job in seconds that might take many humans eons.

If you are interested, you can find my code here.

I would like to think that closest neighbor cat looks a bit like this.

Photo by Tran Mau Tri Tam on Unsplash
