KNN Classifier from scratch

Shashank Parameswaran
3 min read · Mar 19, 2022


This article intends to help the reader understand how to code the KNN algorithm from scratch in Python. The concept can be used and implemented in other languages as well.

The K nearest neighbors algorithm is a supervised data classification method. It will classify any data point into the group that is more commonly found near it. The value of K decides how many points the algorithm would consider. For example, if K=10 and a point P lies near 6 points of group A, 3 points of group B, and 1 point of group C, then the algorithm would put point P in group A.
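As a quick sketch of that majority vote in Python (the labels below are made up to match the example):

```python
from collections import Counter

# Hypothetical labels of the K=10 nearest neighbors of point P:
# 6 from group A, 3 from group B, 1 from group C
neighbor_labels = ["A"] * 6 + ["B"] * 3 + ["C"]

# Majority vote: the most common label among the neighbors wins
print(Counter(neighbor_labels).most_common(1)[0][0])  # prints "A"
```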


KNN is a non-parametric method, which means that it makes no assumptions about the underlying data distribution. The one assumption it does make is that any point can be explained by looking at its neighbors. It is also a lazy learning algorithm, which means that it builds a model only after it receives the input that needs to be classified (unlike OLS, which fits its parameters up front). This makes sense, as classifying any point in the test dataset requires both the test dataset and the training dataset at prediction time.

How does it work?


The working is quite simple. For each data point that needs to be classified, we find the K points in the training dataset that are closest to it. K is a hyperparameter that we need to specify. Here, by closest, we mean closest in Euclidean distance.

Here is how we get the Euclidean distance between any two points (x1, y1) and (x2, y2). (Don’t get confused by the ‘y’ here; it does not indicate the response variable ‘Y’.)

d = √((x2 − x1)² + (y2 − y1)²)

The above formula is for 2 variables. It can be extended to any number of variables by adding the corresponding squared differences inside the square root. You could use other distance measures as well, such as Manhattan distance, Hamming distance, or Minkowski distance.
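In code, this generalized distance is a short NumPy function (a minimal sketch; the name euclidean_distance is my own):

```python
import numpy as np

def euclidean_distance(p, q):
    """Euclidean distance between two points with any number of variables."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance((1, 2), (4, 6)))  # 5.0
```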

How to code?

We will use a sample dataset with 200 observations each in the training and test datasets. Here is the link to the datasets. The dataset has 2 predictor variables (X1 & X2) and 1 response variable (Y). The response variable takes 2 values: 0 & 1.

Below is the knn_classifier function that implements the algorithm.
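What follows is a minimal sketch of such a function, assuming the features and labels are passed in as NumPy arrays (the original embedded code may differ in detail):

```python
import numpy as np
from collections import Counter

def knn_classifier(X_train, y_train, X_test, k=5):
    """Predict a label for each test point by a majority vote
    among its k nearest training points (Euclidean distance)."""
    predictions = []
    for point in X_test:
        # Distance from this test point to every training point
        distances = np.sqrt(np.sum((X_train - point) ** 2, axis=1))
        # Indices of the k closest training points
        nearest = np.argsort(distances)[:k]
        # Majority vote among their labels
        vote = Counter(y_train[nearest]).most_common(1)[0][0]
        predictions.append(vote)
    return np.array(predictions)
```

With the dataset described above, a call could look like knn_classifier(train[["X1", "X2"]].to_numpy(), train["Y"].to_numpy(), test[["X1", "X2"]].to_numpy(), k=5), assuming the CSVs were read into pandas DataFrames named train and test.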

Please note that the test dataset does not have the ‘Y’ label. We will not check the accuracy of our model; instead, we will compare our predictions with an already existing implementation of the KNN algorithm, KNeighborsClassifier from sklearn.
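One way to run that comparison, reusing the sketch above (the variable names X_train, y_train, and X_test are carried over from it):

```python
from sklearn.neighbors import KNeighborsClassifier

# sklearn's implementation, fit on the same training data
sk_knn = KNeighborsClassifier(n_neighbors=5)
sk_knn.fit(X_train, y_train)
sk_preds = sk_knn.predict(X_test)

# Fraction of test points on which the two implementations agree
our_preds = knn_classifier(X_train, y_train, X_test, k=5)
print((our_preds == sk_preds).mean())
```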

As you can see, almost all of our predictions match those of KNeighborsClassifier.

K is the major hyperparameter here. You can try out different values of K and decide which one fits best (see the sketch below). The advantages of the KNN algorithm are that it is easy to understand and implement, it can be used for non-linear data, it can handle multi-class classification, and you don’t need to build a model up front.
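Since the test set here has no labels, one way to compare values of K is to hold out part of the training data as a validation set. A minimal sketch:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the training data to score each candidate K
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)

for k in (1, 3, 5, 7, 9):
    accuracy = (knn_classifier(X_tr, y_tr, X_val, k=k) == y_val).mean()
    print(f"K={k}: validation accuracy {accuracy:.2f}")
```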

However, there are some disadvantages. It is computationally expensive, since you need to find the K nearest neighbors for every test data point, and the work grows as the number of variables increases. Prediction also slows down when the number of training data points is large.
