KNN from scratch — Easy Peasy

Priyansh Soni
4 min readJun 8, 2022


This article will walk you through the working of KNN, implemented from scratch in absolute Python. "Absolute Python" is a nice way of saying that I just like Python. #Java sucks.

I have discussed the working, algorithm, properties, assumptions and whatnot of KNN in the link below:

OUTLINE —

  1. KNN algorithm
  2. Algorithm steps with code
  3. Complete code for KNN

1. KNN Algorithm

If you are not familiar with the background of how KNN works, do refer to the above-mentioned link. It’s barely a 10-minute read.

And for those who are, the algorithm is fairly simple:

  • For each point in the training dataset, calculate its distance from the test point (Euclidean distance is most commonly used)
  • Store these distances in an ordered dict/list
  • Sort the distances in ascending order of their values
  • Select the first K points (the k nearest to the test point) from the list
  • Based on the problem type (regression/classification), make the appropriate prediction from these k nearest points

2. Algorithm Steps with Code

Our aim for the code is to make 3 functions:

  1. euclideanDistance — for calculating and returning the Euclidean distance between a test point and a training point
  2. fit — for fitting the training data to our model
  3. predict — for predicting the output class for a given test point

Dataset: Suppose we have the famous Iris dataset, with 4 numerical features, and we have to output the class to which each flower belongs.

Now a training point for this dataset will be like : [5.1, 3.5, 1.4, 0.2, Setosa]
Similarly, a test point for this dataset will be like : [4.7, 2.9, 1.8, 0.5]

Therefore each training point instance carries 5 values (4 features plus the label), while each test point carries only the 4 features.
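To make the later snippets concrete, here is a tiny stand-in for data of this shape — four numeric features per row, with the labels held separately. The values are illustrative, not the real Iris measurements:

```python
# Toy data shaped like the Iris rows described above:
# four numeric features per training point, labels in a parallel list.
# (Values are illustrative, not the real dataset.)
X_train = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
y_train = ["Setosa", "Setosa", "Versicolor", "Virginica"]

# A test point has only the 4 features — the label is what we predict.
test_point = [4.7, 2.9, 1.8, 0.5]
```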

  • euclideanDistance
    We will make a function that calculates the Euclidean distance between a test point (testPt) and a training point (trainPt). This distance is the square root of the sum of the squared differences between each feature of the test point and the corresponding feature of the training point.
    This function is marked private by prefixing its name with ‘__’. Python name-mangles such methods, signalling that they are not meant to be used outside the class.
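A minimal sketch of this helper, using the testPt/trainPt names from the description above (the exact body of the original code block is not shown, so this is one plausible version):

```python
import math

class KNN:
    # Double-underscore prefix triggers Python's name mangling,
    # keeping the helper out of the class's public interface.
    def __euclideanDistance(self, testPt, trainPt):
        # Square root of the sum of squared per-feature differences.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(testPt, trainPt)))
```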
  • fit
    We will make a function that assigns the training data (X_train, y_train) to our model. Since KNN is a lazy learner, the training data is just stored and no operation is performed during training. The Euclidean distances and the nearest neighbors are computed only when a test point comes in during prediction. Therefore the fit method just stores the training data in variables of the same name, which are then used during prediction.
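Because no computation happens at training time, fit can be as small as this (a sketch under the same variable names as above):

```python
class KNN:
    def fit(self, X_train, y_train):
        # Lazy learner: just store the training data for use at prediction time.
        self.X_train = X_train
        self.y_train = y_train
```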
  • predict
    We will make a function that predicts the label of every test point (X_test) passed into it. The steps for building such a predict function are stated below. Refer to the code step by step for a better understanding:
  • Iterate over every point in X_test — by index.
  • For each test point, iterate over every point in X_train, again by index.
  • For one test point, find the Euclidean distance to every training point. Store these distances in a dictionary/map with the training point index as the key and the Euclidean distance as the value.
  • The outer loop repeats the above step for every test point.
  • Now sort the dictionary in increasing order of Euclidean distance, so that for a particular test point the nearest neighbors come at the start.
  • Then get the k nearest neighbors for that test point by selecting the first k entries of the sorted dictionary. These are training point indexes, since we iterated over indexes in the loop.
  • Generate the labels corresponding to these k nearest neighbors by looking up y_train at each of the k nearest training point indexes.
  • Finally, store the mode of these labels in another dictionary, with the test point index as the key and the predicted output (mode of the labels) as the value.
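Before wiring it into the class, the steps above can be sketched as a standalone function. The name and signature here are mine for illustration; the article's own block combines this with the class:

```python
import math
from collections import Counter

def predict(X_test, X_train, y_train, k=3):
    predictions = {}
    for i in range(len(X_test)):                     # iterate over test indexes
        distances = {}
        for j in range(len(X_train)):                # iterate over train indexes
            distances[j] = math.sqrt(
                sum((a - b) ** 2 for a, b in zip(X_test[i], X_train[j]))
            )
        # sort training indexes by ascending distance and keep the first k
        nearest = sorted(distances, key=distances.get)[:k]
        labels = [y_train[j] for j in nearest]       # labels of the k neighbors
        predictions[i] = Counter(labels).most_common(1)[0][0]  # mode of labels
    return predictions
```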

3. Complete Code for KNN

A more generalized and usable version, combining all the above code blocks into a class, can be seen below:
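Since the original code block is not shown here, this is one way the pieces above fit together, keeping the method and variable names used throughout the article:

```python
import math
from collections import Counter

class KNN:
    """K-nearest-neighbors classifier, assembled from the pieces above."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # Lazy learner: just store the training data.
        self.X_train = X_train
        self.y_train = y_train

    def __euclideanDistance(self, testPt, trainPt):
        # Square root of the sum of squared per-feature differences.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(testPt, trainPt)))

    def predict(self, X_test):
        predictions = {}
        for i in range(len(X_test)):
            # distance from this test point to every training point, by index
            distances = {
                j: self.__euclideanDistance(X_test[i], self.X_train[j])
                for j in range(len(self.X_train))
            }
            # indexes of the k nearest training points
            nearest = sorted(distances, key=distances.get)[: self.k]
            labels = [self.y_train[j] for j in nearest]
            # mode of the neighbors' labels is the prediction
            predictions[i] = Counter(labels).most_common(1)[0][0]
        return predictions
```

Usage mirrors the familiar fit/predict convention: `model = KNN(k=3)`, then `model.fit(X_train, y_train)` and `model.predict(X_test)`.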

The code for a regression problem is similar. We just have to get the y_train values of the k nearest training points to the particular test point and take their average. This average is the output value for that test point.
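As a sketch of that change, here is a regression variant (the class name is mine; only the final prediction step differs from the classifier above — an average instead of a mode):

```python
import math

class KNNRegressor:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X_train, y_train):
        # Lazy learner: just store the training data.
        self.X_train = X_train
        self.y_train = y_train

    def __euclideanDistance(self, testPt, trainPt):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(testPt, trainPt)))

    def predict(self, X_test):
        predictions = {}
        for i in range(len(X_test)):
            distances = {
                j: self.__euclideanDistance(X_test[i], self.X_train[j])
                for j in range(len(self.X_train))
            }
            nearest = sorted(distances, key=distances.get)[: self.k]
            # average of the neighbors' y-values instead of the mode
            predictions[i] = sum(self.y_train[j] for j in nearest) / self.k
        return predictions
```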
