The “Lazy” Algorithm: k Nearest Neighbour

Manali Shinde
One Datum At A Time
6 min read · Feb 23, 2018


Predictive modelling is likely one of the most used tools in data science. While it ranges from the fairly simple to very complex machine learning algorithms, predictive modelling may be the entire reason data science has become such a popular field. One might say it is the essence of data science.

One of the primary predictive modelling algorithms that many beginner data scientists learn is kNN: k nearest neighbour (can you tell I’m Canadian?). The basis of k nearest neighbours is that it allows for instance-based learning. You approximate the class of a new data instance according to its position relative to the surrounding past data instances, or observations. For an unseen or “new” data instance, the algorithm returns a prediction based on the attributes seen most often among the instances around it.

One really simple way to go about understanding this is through a great 5 minute video by Rapid Miner Inc. on YouTube. In sum, the video describes a situation where a cup has fallen and broken in a particular manner. You then come across similar cups, perhaps of different shapes and sizes, and you want to predict whether these new cups will also fall and break in that same fashion. You might plot each cup’s characteristics: the material it’s made of, its height, and maybe its thickness. According to where a new cup’s data points land, you can then predict how the new cup will behave, based on data you have already observed. By measuring the distance from the new cup to the past observations, you can group it into a category according to its most similar neighbours. Simple, right?

Below is an example of the type of clustering you can observe with a kNN algorithm.

The reason kNN is so powerful and popular is that it does not assume anything about the underlying data; the only tool it uses is the distance calculated between two unique data instances.

The Flower Classification

If you are a data science buff, you may have already come across the popular Iris flower species dataset, one of the most commonly used datasets for testing kNN.

In this problem, you have 150 iris flower observations from three different species. Measurements of sepal length, sepal width, petal length, and petal width were taken, and the task is to predict whether a flower belongs to the species setosa, versicolor, or virginica.

The Code — i.e. The Fun Part

When starting this model, it is important to first load the data into Python 3 and split it into training and test datasets. The test dataset holds the instances whose species we are trying to predict, by comparing each one against the training set. For the model to be considered a “success”, it should have above 90% accuracy; usually, 96%+ is considered the best.

Implementing kNN

There are six steps to implementing kNN:

  1. Loading the data file and splitting the data: this is described above
  2. Similarity: measuring the Euclidean Distance between two data instances
  3. Neighbours: finding the k nearest neighbours of a test instance using the Euclidean distances already calculated
  4. Response: the predicted response of the test instance based on its neighbours
  5. Accuracy: the percentage accuracy of the predictions
  6. Main: bringing all the functions together so that Python can give you an output

1. Loading the File

Load the iris dataset and split it into training and test sets. The training set is used to make the predictions, while the test set is used to measure the accuracy of those predictions.
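
A minimal sketch of this step, assuming the raw data has been saved locally as iris.data (that filename and the 67/33 split ratio are my choices, borrowed from the tutorial linked at the end, not fixed by anything above):

    import csv
    import random

    def loadDataset(filename, split):
        # Read the CSV file and randomly assign each observation
        # to the training set or the test set
        trainingSet, testSet = [], []
        with open(filename, 'r') as csvfile:
            for row in csv.reader(csvfile):
                if not row:
                    continue  # skip any blank lines in the raw file
                # the first four columns are measurements, the last is the species
                instance = [float(value) for value in row[:4]] + [row[4]]
                if random.random() < split:
                    trainingSet.append(instance)
                else:
                    testSet.append(instance)
        return trainingSet, testSet

    trainingSet, testSet = loadDataset('iris.data', 0.67)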

2. Similarity

One of the most important functions in the kNN algorithm is the one that finds the Euclidean Distance (ED). Mathematically, this is the square root of the sum of the squared differences between two arrays of numbers. While it is tedious to take all of that in when put in words, the best way to understand what the code is doing is through the equation below:

Euclidean Distance: d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)

The line distance += pow((instance1[x] - instance2[x]), 2) tells Python to accumulate what is inside the square root; the function then returns the square root of that total, giving us the ED. An array here is basically the coordinates of a data instance on a graph. To use this successfully, however, it is important to ensure that both arrays use the same units and are numeric: for instance, comparing two different sepal lengths.
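
Putting that line in context, the full distance function might look something like this sketch, where length is the number of numeric attributes to compare (so the species label at the end of each instance is left out):

    import math

    def euclideanDistance(instance1, instance2, length):
        # Accumulate the squared differences over the first `length`
        # attributes, then take the square root of the total
        distance = 0
        for x in range(length):
            distance += pow((instance1[x] - instance2[x]), 2)
        return math.sqrt(distance)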

3. Neighbours

Once the ED can be calculated, the next task is to find the k most similar instances for a given unseen instance. This means telling Python to find the distance to ALL of the training instances, then select the subset with the smallest distance measurements. When looking at a new data instance, we find the cluster of closest neighbours around it in order to make the prediction.
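
A sketch of that step, assuming the euclideanDistance function above and leaving the choice of k up to the caller:

    import operator

    def getNeighbors(trainingSet, testInstance, k):
        # Measure the distance from the test instance to every training
        # instance, sort ascending, and keep the k closest
        length = len(testInstance) - 1  # compare numeric attributes only
        distances = []
        for trainInstance in trainingSet:
            dist = euclideanDistance(testInstance, trainInstance, length)
            distances.append((trainInstance, dist))
        distances.sort(key=operator.itemgetter(1))
        return [distances[x][0] for x in range(k)]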

4. Response

The two steps above give us an idea of the whereabouts of the data instance. Next, we need a response: the prediction the neighbours suggest, according to the characteristic the instance is most similar to. This is done by telling Python to take a “vote”. Each neighbour contributes a vote of 1 for its own class, and the majority vote becomes the prediction with which we can determine which species the new data instance belongs to.
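
One way to code that vote, assuming each instance carries its species label as its last attribute (as in the loading sketch above):

    def getResponse(neighbors):
        # Each neighbour votes once for its own species; the label
        # with the most votes becomes the prediction
        classVotes = {}
        for neighbor in neighbors:
            response = neighbor[-1]  # the species label is the last attribute
            classVotes[response] = classVotes.get(response, 0) + 1
        return max(classVotes, key=classVotes.get)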

5. Accuracy

Finally, the only way we can know whether this prediction is successful, and something we can actually use, is by evaluating how accurate it is. This is done by comparing each prediction to the actual label in the test set; if it is correct, it receives a value of 1. From there you can calculate the percentage accuracy of the predictions. We want to aim for more than 90%, and hopefully it should reach around 96%.
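
A sketch of that scoring, comparing each prediction to the actual label in the test set:

    def getAccuracy(testSet, predictions):
        # Score 1 for every prediction that matches the actual test
        # label, then convert the tally into a percentage
        correct = 0
        for x in range(len(testSet)):
            if testSet[x][-1] == predictions[x]:
                correct += 1
        return (correct / float(len(testSet))) * 100.0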

6. Main

Finally, once we have defined all our functions, telling Python how to find the distances, neighbours, responses, and accuracy, we can move on and tell Python to give us the actual prediction. This code is the main kNN routine; it cannot be written unless we define all the functions above, which is what keeps the code simple and easy to understand.
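
Putting it all together with the functions sketched above; note that k = 3 is my assumption here (a common default for this dataset), since the article doesn’t pin down a value of k:

    def main():
        # Roughly two thirds of the data for training, one third for testing
        trainingSet, testSet = loadDataset('iris.data', 0.67)
        print('Train set: ' + repr(len(trainingSet)))
        print('Test set: ' + repr(len(testSet)))
        # Generate a prediction for every instance in the test set
        predictions = []
        k = 3
        for x in range(len(testSet)):
            neighbors = getNeighbors(trainingSet, testSet[x], k)
            result = getResponse(neighbors)
            predictions.append(result)
            print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
        print('Accuracy: ' + repr(getAccuracy(testSet, predictions)) + '%')

    main()

Running main prints one predicted/actual pair per test instance, followed by the overall accuracy.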

The result compares the predicted data instances to the actual test instances and gives us an accuracy of about 95.5%, which means our prediction was pretty successful in finding the k nearest neighbours and determining which cluster each data instance belongs to.

What’s Next?

After finding the predictions and accuracy, we can then move on to interpreting the data. This can be done through a few different tools:

a) Regression

b) Normalization

c) Alternative Distance Measure

While these tools are beyond the scope of this article, for more information on them (and for the main resource for this tutorial) you can visit: https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/

Hope you found this tutorial useful and interesting! Stay tuned for more fun coding and case studies, and if you have any tips, tricks, or tools of the trade you’d like to share, please do leave them in a comment below!

Thank you, and happy coding :)
