K-Nearest Neighbor Classifier; From Scratch, in Python.

Jonathan Mendoza
Can It Be Predicted?
6 min read · Aug 31, 2020

This classifier has one of the simplest assumptions: points with similar attributes belong to the same class. While this assumption is not always true, it is a quick and easy way to efficiently classify data. By calculating the distance between points in n-dimensional space, we can attempt to classify a data-point relative to its neighboring points.

Photo by Darius Bashar on Unsplash

The idea: given a data-set for training, you can pass in a label-less data-set containing the attributes of an observation, and the model will predict what the label should be. This is no easy task for humans to do on our own. For something as simple as flowers, there are so many visual similarities between species that a layman could easily confuse them. By taking measurements of the sepal length and width, as well as the petal length and width, we can harness the power of machine learning to aid us in our endeavor. Conveniently, the iris data-set from the UCI Machine Learning Repository contains these measurements, so we'll take a look at the K-nearest neighbor (KNN) algorithm through this lens to gain a better understanding of how the classifier functions.

Images obtained from fs.fed.us

Creating the algorithm

In order to find the distance between two points, we'll call upon the Euclidean distance formula, which produces the straight-line distance between two points in n-dimensional space.

Euclidean distance formula
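In symbols, for two points p and q with n attributes each, the distance is:

d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}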

In Python, we can use NumPy's linear algebra module (np.linalg), which contains the norm function; the norm of the difference between two points is exactly their straight-line distance. We'll use this to create a function which calculates and returns the Euclidean distance for two points, and call this function in our homemade KNN classifier:

Euclidean distance in python
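A minimal sketch of that helper (the function and parameter names here are my own choices) could look like this:

import numpy as np

def euclidean_distance(point_a, point_b):
    """Return the straight-line (Euclidean) distance between two points."""
    # The norm of the difference vector is exactly the Euclidean distance.
    return np.linalg.norm(np.array(point_a) - np.array(point_b))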

There are other formulas to calculate the distance between two points, but for the purposes of this article, we'll keep it as simple as possible. Now that we can calculate the distance between any two arbitrary points, we can focus on figuring out what the nearest neighbors are. The 'k' in KNN represents the number of neighbors the user would like the algorithm to use to calculate the class label. Let's create a Python class to package our functions:

Initialize KNN class
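One way to set this up is sketched below; the only state we need at this point is 'k', and the default value of 5 is my own choice rather than anything fixed by the article:

class KNN:
    """K-nearest neighbors classifier, built from scratch."""

    def __init__(self, k=5):
        # Number of neighbors that vote on the predicted class label.
        self.k = k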

Next, we can create two functions to help us calculate and return the ‘k’ nearest neighbors:

Calculate and return ‘k’ nearest neighbors
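A sketch of how these two functions could look, written as methods added to the KNN class; it assumes the training features and labels are stored on the instance as self.X_train and self.y_train (attribute names are my assumption, set later by fit) and that it reuses the euclidean_distance helper from above:

def k_neighbors(self, test_row):
    """Find the 'k' training points closest to a single test row."""
    distances = []
    for i in range(len(self.X_train)):
        # Distance between the test point and each row of the training data.
        dist = euclidean_distance(test_row, self.X_train[i])
        distances.append((dist, self.y_train[i]))
    # Sort by distance, ascending, and keep the top 'k' (distance, label) pairs.
    distances.sort(key=lambda pair: pair[0])
    return distances[:self.k]

def get_nn(self, X_test):
    """Collect the 'k' nearest neighbors for every row of the target data-set."""
    # Cast to arrays in case a different data-type (e.g. a DataFrame) is passed.
    self.X_train = np.array(self.X_train)
    X_test = np.array(X_test)
    return [self.k_neighbors(row) for row in X_test]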

In the function k_neighbors, we loop over the length of the training data-set, calculating the distance between our test point and each training point, represented by each row of our data-set. The distances are added to a list and sorted in ascending order. Then, we pull out and return the top 'k' neighbors' distances and their labels. In the function get_nn, we first cast our training data and test data to arrays, in case a different data-type is passed. Next, we loop through our target data-set and call the k_neighbors function to calculate the distances between each row of our target data and the training data. We now have a list containing, for each target row, a list of tuples representing its 'k' nearest neighbors. This is the meat of the KNN algorithm, but we'll take it a step further and create a classifier based on the results.

In order to classify our target data, we count the number of times each known class neighbors our test point, then return the class with the majority count. To do this, we create a function which calculates the number of occurrences of an element in a list.

Count occurrences of an element in a list
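A minimal sketch of that counter, written here as another method of the class (since the article packages all of the functions into the KNN class at the end):

def vote_count(self, labels):
    """Count how many times each class label appears in a list of labels."""
    counts = {}
    for label in labels:
        if label in counts:
            counts[label] += 1
        else:
            counts[label] = 1
    return counts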

The vote_count function iterates through a list, adding each element to a Python dictionary or incrementing that element's value if it is already present, then returns the dictionary. Lastly, we'll need to create our fit, predict, and evaluate functions to complete the classifier.

Fit, Predict, Evaluate functions for KNN classifier.
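A sketch of how these three methods could be written, following the description below; the attribute and variable names are my assumptions:

def fit(self, X_train, y_train):
    """Store the training data, including the labels."""
    self.X_train = np.array(X_train)
    self.y_train = np.array(y_train)

def predict(self, X_test):
    """Predict a class label for every row of the target data-set."""
    predictions = []
    for neighbors in self.get_nn(X_test):
        # Each neighbor is a (distance, label) tuple; the labels cast the votes.
        counts = self.vote_count([label for _, label in neighbors])
        # The predicted class is the label with the most votes.
        predictions.append(max(counts, key=counts.get))
    return predictions

def evaluate(self, X_test, y_test):
    """Return the fraction of correct predictions on a labelled test set."""
    predictions = self.predict(X_test)
    correct = sum(pred == true for pred, true in zip(predictions, np.array(y_test)))
    return correct / len(predictions)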

The fit method takes in the training data, including the labels. The predict method takes the target data-set and calls the get_nn function, which returns our list of 'k' nearest neighbors for each row. Then we iterate through each row of neighbors and get the vote_count for each class present. Lastly, we append the class with the most votes to our list of predictions. If we have the class labels for our target data, we can then evaluate our predictions. Finally, we package the functions into the KNN class, and we're now ready to make predictions on any number of data-sets.

Results

In the case of the iris data-set, this algorithm performed exactly as well as the scikit-learn model, yielding ~98% accuracy. The main difference between the scikit-learn model and this one is the run-time: the scikit-learn model runs in about 0.01 ms, while the one presented here runs in about 0.04 ms.

Now, to test our new KNN class, we can save it to a file named knn.py and run the following code in Python, from within the same folder:

Using iris data-set to test KNN model
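A sketch of that test script follows; the file name iris.csv and the column names Id and Species are assumptions on my part, so adjust them to match your own copy of the data:

import pandas as pd
from sklearn.model_selection import train_test_split

from knn import KNN  # the from-scratch classifier saved in knn.py

# Load the iris data-set (file name and column names are assumptions).
df = pd.read_csv('iris.csv')

# Drop columns that carry no predictive information, such as a row Id.
df = df.drop(columns=['Id'])

# Separate the observational data (features) from the label data.
X = df.drop(columns=['Species'])
y = df['Species']

# Split into a training set and a testing set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)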

First, we'll want to import pandas, our model, and train_test_split. Then we can begin processing the data-set: drop unnecessary columns, and separate our observational data from our label data. Next, we want to split the data into a training set and a testing set, using the sklearn library.

Lastly, we initialize, fit, predict and evaluate our algorithm:

K-nearest neighbors classifier, from scratch.
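Something along these lines, reusing the variable names from the split above:

knn = KNN(k=5)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
accuracy = knn.evaluate(X_test, y_test)
print(f'From-scratch KNN accuracy: {accuracy:.2%}')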

In only a couple of lines, we can see that our algorithm yields great results for this data-set, with just under 98% accuracy. If we want to compare it to the scikit-learn algorithm, we can do so in tandem:

K-nearest neighbors classifier, sci-kit learn.
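A comparison along these lines uses scikit-learn's KNeighborsClassifier on the same split:

from sklearn.neighbors import KNeighborsClassifier

sk_knn = KNeighborsClassifier(n_neighbors=5)
sk_knn.fit(X_train, y_train)
print(f'scikit-learn KNN accuracy: {sk_knn.score(X_test, y_test):.2%}')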

By importing the KNN classifier from sklearn, fitting it, and evaluating the predictions through the score method, we get the exact same result as the one we built from scratch!

Conclusion

This simple yet powerful example of a supervised machine learning algorithm can be altered to meet the needs of your research and analysis. Understanding the underlying work which goes into creating the predictions allows us to determine what data it may or may not be suitable for. According to the scikit-learn website, this classifier is not suitable for data-sets which contain more than about 10 dimensions, as the accuracy begins to suffer greatly. Also, data which has many similarities between different classes may not be suitable for this algorithm, as you will get a lot of false predictions due to the nature of the algorithm.

You can find the code for this project on my github.
