ML Algorithms From Scratch — Part 1 (K-Nearest Neighbors)

Rishabh Rao · Published in TheCyPhy · Oct 12, 2020

Have you been so lost in using model.fit and model.predict that you’ve forgotten the underlying principles of ML algorithms? If yes, you’ve come to the right place.

Sorry to break it to you, but it isn’t.

Introduction

A lot of times, we start leveraging Machine Learning libraries so much that we tend to forget the underlying concepts behind them. Although Scikit-Learn, MLXTEND, and many such libraries have made our lives easier, at the same time, they have dulled our habit of thinking things through. I am not saying that using these libraries is a bad thing; in fact, it’s the smart thing to do. But if you are getting rusty with the concepts, it’s time for you to think again.

Learning a new concept can be intimidating at first, but if you try to get to the crux of it, you’ll find that it’s actually quite simple. Most ML algorithms are nothing but concepts built on top of high-school mathematics.

What I often see happening is that when we learn a new concept, say KNN, we immediately jump to applying it to some real-world dataset using sklearn. In the process, we lose out on the nitty-gritty details of that concept. There are plenty of failure cases we would have encountered if we had, even once, thought about implementing the algorithm from scratch.

In this series of blog posts, I will try to explain some of the most common Machine Learning algorithms. We’ll start with KNN in this blog, first from a mathematical perspective, and then we’ll implement it from scratch using simple Python code. Note that we won’t be working on optimizing the code for performance; for that task, we already have optimized libraries at our disposal. Rather, our goal is to develop a way of thinking about how to approach the algorithm. So without further ado, let’s get started with K-Nearest Neighbors.

K-Nearest Neighbors

Neighbors usually show lots of resemblance.

Most of us probably started our Machine Learning journey with this algorithm, because it’s so intuitive and basic that anyone can grasp it.

Here, I am assuming you know the fundamentals of Machine Learning, like what a dataset is, what a target variable is, what train and test sets are, and so on, as I won’t be able to go into the details of those without making this blog post too long to read.

K-NN comes under the Supervised Learning branch of Machine Learning, wherein for each data point in our dataset, we are also given its class label, using which our model tries to learn a function to map these input data points to the output class label.

For example, suppose you are trying to teach your child to distinguish between a dog and a cat. What you’d do is show your child images of dogs and cats while telling him which one is which. In this case, you are supervising your child’s learning by giving him both the input image and its label, be it of a dog or a cat, so that in the future, if your child sees a dog that looks similar to those pictures, he will be able to identify it based on what he learned from you.

Coming back to K-NN, as the name ‘K-Nearest Neighbors’ suggests, this algorithm tries to find the nearest neighbors of a given point in the surrounding spatial region. Let’s take an example to understand it better. Say you’ve somehow wandered into a random classroom in your college, and you don’t know which branch it belongs to. What would you do? You’d ask the students of that class about their branch. So, in this case, what you did was look at the people around you and, based on their attributes, make a prediction about the branch.

Just a random classroom.

This is exactly what K-NN does. It intuitively looks at its nearest neighbors (based on some similarity/distance measure) to classify a given query point. Let us try and look at the mathematics behind it.

Mathematics Behind K-NN

The basic assumption that the K-NN algorithm makes is that our data lives in a feature space of d dimensions. Here, ‘d’ can be thought of as the number of independent features that describe our data, with each feature corresponding to one dimension of this d-dimensional space.

This assumption holds well most of the time, but if the dimensionality becomes too large, it leads to the curse of dimensionality, which we’ll probably discuss in some other blog.

The distance metric usually chosen is the Euclidean Distance, which we’ve all been studying since the 6th grade, I believe. Still, for the sake of completeness, I’ll write the equation below.
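
For two points x = (x_1, …, x_d) and q = (q_1, …, q_d) in this d-dimensional space, the Euclidean distance is simply the Pythagorean theorem generalized to d dimensions:

    d(x, q) = \sqrt{\sum_{i=1}^{d} (x_i - q_i)^2}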

The algorithm then finds the K nearest points in this d-dimensional space and uses them to decide the class label of the query point.

Distance Metrics

One often encounters the question of which distance metric to use. Well, there is no single right answer; it actually depends on the use-case. Generally, we take the Euclidean Distance, as discussed above; however, it has certain failure cases in high-dimensional spaces. Following are some of the distance measures you can look at; we won’t be going into further detail on these.

  1. Manhattan Distance
  2. Euclidean Distance
  3. Minkowski Distance
  4. Hamming Distance
  5. Cosine Distance

Each of the distance metrics mentioned above is suited to a certain set of conditions. For example, Cosine Similarity is generally preferred for text data, where we don’t care much about the magnitude of the vectors, just the angle between them, while Hamming Distance is used for gene sequences, and so on.
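
To make these concrete, here is a small NumPy sketch of how each of these distances could be computed for a pair of vectors. The vectors and the choice of p = 3 for Minkowski are purely illustrative; libraries such as scipy.spatial.distance provide optimized versions of all of them.

    import numpy as np

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])

    manhattan = np.sum(np.abs(x - y))                  # L1 norm of the difference
    euclidean = np.sqrt(np.sum((x - y) ** 2))          # L2 norm of the difference
    minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)  # general Lp norm, here p = 3
    cosine = 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))  # 1 - cos(angle between x and y)

    # Hamming distance is defined on equal-length sequences: the number of
    # positions at which they differ (e.g. for gene sequences).
    s1, s2 = "GATTACA", "GACTATA"
    hamming = sum(c1 != c2 for c1, c2 in zip(s1, s2))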

Step by Step Procedure

Let us look at the step by step procedure of how K-NN works.

  1. We are given a dataset, D = {x, y}, where x represents the features in a d-dimensional space (taking d = 2 here, for simplicity) and y is the class label, which can be either ‘Red’, ‘Blue’, or ‘Green’. Our task is to classify a given point into one of these 3 colors.
  2. As we talked about earlier, for supervised learning we try to find a mapping function from our input x to the output y. That learning usually happens during the training phase of a model. However, K-NN is a lazy learning algorithm, meaning that it doesn’t actually learn anything during training; it just holds the inputs and outputs of the training data in RAM.
  3. During the prediction/test time, we send our query point Xq. Now, for each of the points in our dataset, we calculate the distance of that point from this query point Xq.
  4. We provide the K value, which defines how many nearest neighbors to look for before making the prediction. So, after finding the distance of the query point from each of the training points, we sort these distances in ascending order.
  5. Next, we only keep the K-nearest neighbors of our query point Xq, based on the sorted distances.
  6. We look at the color of each of these nearest neighbors and using the majority vote, classify our query point as either of ‘Red’, ‘Green’, or ‘Blue’.
  7. We repeat the above steps for the rest of the query points.

An illustration of the K-Nearest Neighbor model. Source: https://www.researchgate.net/publication/321751429/figure/fig6/AS:668472602284040@1536387696362/An-illustration-of-K-nearest-neighbor-model.jpg

Let’s look at it from the above diagram’s perspective. Suppose our query point is Xu, and we have chosen K = 5. As we can see from the image, the algorithm finds the 5 points nearest to our query point, as annotated by the black arrows. Since we have 4 ‘Red’ points and 1 ‘Green’ point among these 5 nearest neighbors, we classify our query point as ‘Red’ with a probability of 4/5.
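
In code, the majority vote for this single query point Xu could look something like the snippet below, with the neighbor labels hard-coded from the figure just for illustration:

    from collections import Counter

    # labels of the K = 5 nearest neighbors of Xu, read off the figure
    neighbor_labels = ['Red', 'Red', 'Red', 'Red', 'Green']

    votes = Counter(neighbor_labels)             # Counter({'Red': 4, 'Green': 1})
    predicted_label, count = votes.most_common(1)[0]
    probability = count / len(neighbor_labels)   # 4/5 = 0.8

    print(predicted_label, probability)          # Red 0.8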

Choosing the right K

This is again a question to which there is no single right answer; it actually varies from dataset to dataset. For one dataset, a K of 7 could work well, while for another, a K of 11 could be better. The value of K is usually found by doing Cross-Validation (I’ll provide a link in the references which you can check out to learn more about it).
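
As a quick illustration, a common recipe is to cross-validate a handful of candidate K values and keep the one with the best average score. The sketch below uses sklearn for brevity; the candidate values and the toy dataset are just placeholders:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=500, n_features=5, random_state=42)

    # evaluate a few candidate values of K with 5-fold cross-validation
    for k in [1, 3, 5, 7, 9, 11]:
        scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
        print(k, round(scores.mean(), 3))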

Enough Theory, Let’s Code

We’re almost there. Thank you for being patient.

Now, coming to the main topic this series is focused on: implementing the algorithm from scratch by ourselves, without the help of any library except for the essentials, viz. NumPy and/or Pandas.

We will use sklearn’s datasets module to generate a two-class classification toy dataset and split it into train and test sets with a test size of 30%.

Making a 2-Class classification dataset, with 5 columns/features, and splitting into train and test.
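
A minimal version of that snippet could look as follows; the exact make_classification arguments and the random seed are assumptions on my part:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    # 2-class toy dataset with 5 features
    X, y = make_classification(n_samples=1000, n_features=5, n_classes=2, random_state=42)

    # 70/30 train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)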

Looks like we are all set to start building our K-NN Code from scratch.

We will first create a class named KNNClassifier, as we are implementing it only for classification purposes for now. It can be extended to regression by simply replacing the mode with the mean in the predict function. This class consists of 5 methods:

  1. __init__ — to initialize our class variables.
  2. fit — to load the training data into RAM.
  3. predict — to predict the class label for a given query point.
  4. predict_proba — to predict the probability of each class label.
  5. eucl_distance — to calculate the Euclidean distance between the query point and a training point.

Here’s the coded implementation of the same:

Defining the KNNClassifier Class.
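
A from-scratch sketch along those lines is shown below. The five methods match the list above; internal details, such as storing the sorted class labels in classes_, are my own choices:

    # KNN.py
    import numpy as np

    class KNNClassifier:

        def __init__(self, n_neighbors=5):
            # store the hyperparameter K; nothing is learned at construction time
            self.n_neighbors = n_neighbors

        def fit(self, X_train, y_train):
            # K-NN is a lazy learner: "fitting" just keeps the training data in memory
            self.X_train = np.asarray(X_train)
            self.y_train = np.asarray(y_train)
            self.classes_ = np.unique(self.y_train)
            return self

        def eucl_distance(self, point1, point2):
            # Euclidean distance between two d-dimensional points
            return np.sqrt(np.sum((point1 - point2) ** 2))

        def predict_proba(self, X_test):
            # for each query point: fraction of its K nearest neighbors in each class
            probas = []
            for x_query in np.asarray(X_test):
                distances = np.array([self.eucl_distance(x_query, x) for x in self.X_train])
                nearest_idx = np.argsort(distances)[: self.n_neighbors]
                nearest_labels = self.y_train[nearest_idx]
                probas.append([np.mean(nearest_labels == c) for c in self.classes_])
            return np.array(probas)

        def predict(self, X_test):
            # majority vote = the class with the largest share of the K neighbors
            probas = self.predict_proba(X_test)
            return self.classes_[np.argmax(probas, axis=1)]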

To use it, we will import this class from a module named KNN.py and use it with the dataset that we created above.

Making predictions with the custom implemented KNN Classifier.
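
A sketch of that usage, assuming the KNN.py module above and the X_train/X_test split created earlier:

    from KNN import KNNClassifier

    # fit on the training split and predict on the test split
    knn = KNNClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    y_pred = knn.predict(X_test)
    y_proba = knn.predict_proba(X_test)

    accuracy = (y_pred == y_test).mean()
    print(y_pred[:10], accuracy)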

Here are the outputs from our custom implemented K-NN Algorithm.

Output of Our Custom Implemented KNN

We will cross-verify it with sklearn’s KNN, and see if our implementation was correct or not.

Code snippet for generating predictions from sklearn’s KNeighborsClassifier.
Output from sklearn’s KNN Implementation
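
A sketch of that comparison, assuming the same K = 5 and the same train/test split as above:

    from sklearn.neighbors import KNeighborsClassifier

    sk_knn = KNeighborsClassifier(n_neighbors=5)
    sk_knn.fit(X_train, y_train)

    sk_pred = sk_knn.predict(X_test)
    sk_proba = sk_knn.predict_proba(X_test)

    # compare against the predictions of the custom implementation
    print((sk_pred == y_pred).all())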

Thus, our outputs match exactly with sklearn’s KNN, which implies that our custom implementation was successful.

Important Points

We have followed the exact same process as described in the step-by-step procedure. One thing to note here is that this implementation won’t be as fast as sklearn’s, because the code is not optimized in any way whatsoever. We have coded it just to get a gist of the internals of the algorithm. We could also add more functionality, such as a class-weight parameter, returning the nearest neighbors, or parallelizing the distance computations across CPU cores, but that was not the motive of this blog.

Anyway, I’ll leave it to the readers to try to optimize the code. Finally, I’ll wrap it up with a few pros and cons of the K-NN algorithm.

Pros of K-Nearest Neighbors Algorithm

  1. It is one of the most basic ML algorithms you will come across, and it’s easy to implement from scratch, unlike complex algorithms such as Decision Trees or Ensembles.
  2. It works fairly well when the dimensionality of the data is not too large.
  3. The outputs from KNN are interpretable in a way that we can look at the nearest neighbors and compare the query point with them.
  4. It can also give probabilistic output for the predicted class label, which shows the confidence of prediction for a given query point.

Cons of K-Nearest Neighbors Algorithm

In spite of all the pros, K-NN comes with a bunch of cons as well. Some of them have been listed below:

  1. As the dimensionality of data points increases, the performance of KNN degrades, attributed to the curse of dimensionality.
  2. The test-time complexity of KNN is very high, O(nd) per query point, where n is the number of training points and d is the dimensionality.
  3. The test-time space complexity of KNN is also very high, O(nd), because it has to store all training points in RAM to compute distances from the query point.
  4. Knowing which distance metric to use for a particular type of data is essential for the KNN algorithm to perform well.

For these reasons, KNN is not suitable when we have a large number of data points and/or high dimensionality.

Further Readings

I’ll list out a few things which I couldn’t cover to preserve the brevity of the blog. I’ll also add references for the readers to completely grasp all the concepts related to KNN.

  1. Finding the Right Value of K: Underfitting and Overfitting or Bias-Variance Tradeoff.
    https://towardsdatascience.com/a-simple-introduction-to-k-nearest-neighbors-algorithm-b3519ed98e
  2. Performing Regression using KNN Algorithm
    https://www.analyticsvidhya.com/blog/2018/08/k-nearest-neighbor-introduction-regression-python/
  3. Choices of Distance Metrics to be used with KNN
    https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
  4. The IPython Notebook for this blog can be found here.

End Note

With this, we have completed the implementation of our first ML algorithm from scratch. It was quite easy, wasn’t it? In the next post of this series, we will look at the nitty-gritty details of Naive Bayes and will try to implement it from scratch as well.

Thank You for Reading.

Keep Learning!!
