One Stop for KNN

Priyansh Soni
9 min read · Feb 17, 2022


And Ed Sheeran calls them beautiful

Nearest Neighbors? How near are these neighbors? Well, I hope they don’t bite!

This article is also a summary of all the web articles I went through in order to understand the KNN algorithm and the perks of using it. Well, there aren’t many perks to be honest, but the algo is super dope.

KNN, which stands for K-Nearest Neighbors, is a Machine Learning algorithm popularly used for classification. It works similar to the saying: “Birds of a feather flock together”.
And well, that is it. That is what KNN is and how it works in a nutshell.
Simple. Bye!

Dude?
~
Okay Okay!

OUTLINE:

  • What is KNN?
  • Properties of KNN
  • KNN Algorithm
  • Choosing an optimal K
  • Distance metrics
  • Data preparation for KNN
  • Assumptions of KNN
  • Pros and Cons of using KNN

1. So, what is KNN?

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that is used for both classification and regression problems, though it is mostly used for classification. It simply uses the classes of the k nearest neighbours of a new point in order to predict the class of that point.

The k nearest points to the test point are found using a distance formula, and the labels/values of those nearest points are collected. A summary (the “gist”) of these neighbours is then used to classify/predict the test point.

By gist, I mean:
For Regression, we return the mean of the K-nearest points, and that mean is the prediction for the test point.
For Classification, we return the mode of the K-nearest points, and that label is the prediction for the test point.

KNN with 2D feature space

In the above picture, we need to classify the new test point into either class A or class B based on the features/predictors x1 and x2. Let’s suppose we choose a K value of 5. The 5 nearest points to the test point are selected. Since most of these 5 nearest points (3 of them) fall into category B, the new test point will be labelled as category B.

We made this prediction by selecting the mode of the k-nearest points because it was a classification problem. Similarly we can evaluate the mean of the k-nearest points for a regression problem.
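To make the mean-vs-mode idea concrete, here is a minimal sketch of how the final prediction is derived once the k nearest labels/values are known. The arrays and the K = 5 value below are made up purely for illustration:

```python
import numpy as np
from collections import Counter

# Hypothetical labels/values of the k = 5 nearest neighbours of a test point
neighbour_labels = np.array(["B", "A", "B", "B", "A"])   # classification case
neighbour_values = np.array([3.2, 2.9, 3.5, 3.1, 2.8])   # regression case

# Classification: take the mode (majority vote) of the neighbour labels
predicted_label = Counter(neighbour_labels).most_common(1)[0][0]
print(predicted_label)           # -> "B"

# Regression: take the mean of the neighbour values
predicted_value = neighbour_values.mean()
print(predicted_value)           # -> 3.1 (approximately)
```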

2. Properties of KNN

Disclaimer : Do not freak out half-way reading the below 3 points. Kindly hold down your horses.

  1. Memory-based learning or Instance-based learning: the KNN algorithm uses/memorises the entire training dataset in order to make a prediction. When a new test point comes in, it generalises to this new instance based on its similarity to the training data. It is called instance-based because it builds its hypotheses from the training instances themselves.
  2. Lazy learner: KNN is often termed a lazy learner since no model is learnt at training time. The learning happens when a prediction is requested for a new test point. This makes training fast, and testing slow and memory-consuming, because of the computation happening under the hood at prediction time.
  3. Non-parametric learner: KNN is a non-parametric algorithm. This means that it doesn’t have to estimate or minimise/maximise any parameters (coefficients and constants) the way Linear Regression/Logistic Regression do, e.g. a and b in y = ax + b.

I know it might’ve gone like a bouncer to many, but this will be clear when we look at the working of the algorithm below.

3. Algorithmic working of KNN

The algorithm of KNN is quite simple and easy to grasp:

  • For each point in the training dataset, calculate its distance from the test point (Euclidean distance is most commonly used for this purpose)
  • Store these distances in an ordered dict/list
  • Sort the distances in ascending order of their values
  • Select the first K points (the k nearest to the test point) from the list
  • Based on the problem type (regression/classification), make the appropriate prediction from these k nearest points (a minimal from-scratch sketch follows below)
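Put together, those steps translate almost line-for-line into code. Below is a minimal from-scratch sketch; the function name, the toy dataset and the choice of Euclidean distance are my own, for illustration only, not a reference implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5, classification=True):
    """Predict the label/value of a single test point with plain KNN."""
    # 1. Distance of every training point from the test point (Euclidean)
    distances = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))

    # 2-4. Sort the distances and pick the indices of the k nearest points
    nearest_idx = np.argsort(distances)[:k]
    nearest_labels = y_train[nearest_idx]

    # 5. Mode for classification, mean for regression
    if classification:
        return Counter(nearest_labels).most_common(1)[0][0]
    return nearest_labels.mean()

# Tiny made-up dataset: two features, two classes
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 2.2]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([1.1, 2.1]), k=3))   # -> 0
```

Notice that there is no “fit” step at all: all the work happens at prediction time, which is exactly the lazy, instance-based behaviour described above.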

Since the algorithm computes the distance to every training point and stores these distances somewhere (a list/ordered dict), it is memory-consuming: for larger datasets (a million data points or so), the memory used to store the distances becomes large.

Since the algorithm needs/memorises all the training data points to calculate distances from the test point, it is therefore a memory-based learning technique.

Since the distances are calculated from the test point to the rest of the training points, a test point is required before any of this computation and sorting of distances can even begin.
Therefore, KNN is termed a lazy learner, because the learning only starts when a test point comes in for prediction.

Lastly, we are only computing the distance of these points from the test point. No other parameter, like a coefficient of some independent variable, is being estimated. It just deals with finding distances, and hence KNN is a non-parametric learner.

Being a non-parametric learner, this algorithm does not have a loss or a cost function as well.

Ohhhh!!!
~Yes, yes, I know!

4. Choosing an optimal K-value

Now this is the fun part!

Choosing an optimal K is a crucial step if you want the algorithm to perform on steroids for the kind of data you have.

Here are some points we need to consider before choosing an optimal K:

  • Selecting an odd-numbered K value is preferred (at least for binary classification), because an odd number of neighbours avoids ties when taking the majority vote (the mode).
  • As K decreases, the model tends to have lower bias and higher variance, and hence can overfit the data.
    For a small K, the algorithm only looks at the very nearest points to the test point. This reduces the error on the training data, since those nearest points classify the test point from its absolute vicinity.
    By absolute vicinity, I mean that for a test point surrounded by 1000 points, a small K (say 2) will only look at the two nearest points. This means we are not generalising our model and are capturing noise along with the information. This reduces the error (bias), but increases the variability of the model (variance) on new test points. Hence, the model is overfitted.
  • As K increases, the model tends to have higher bias and lower variance, and hence can underfit the data.
    As K increases, the model starts to ignore noise and becomes more general for new test points. But after a certain point, the model starts to lose information as well. By this I mean that for a very high K value, the model generalises to such an extent that points get misclassified. And therefore, the model underfits the data.
Choosing an optimal K

As seen in the above picture, a small K fits the data points too well, so a new point has a higher chance of being predicted wrongly, since the model is not generalised. A large K, on the other hand, smooths the decision boundary so much that there is again a higher chance of misclassification for a new test point, since the model is too generalised.

Therefore, we need to choose a value of K such that the bias-variance trade-off is maintained. The model should neither overfit nor should it lose information and underfit the data.
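A quick way to see this trade-off on real data is to compare train and test accuracy for a very small and a very large K. The sketch below is illustrative only; it assumes scikit-learn is available and uses its bundled iris dataset, and the K values of 1 and 75 are arbitrary extremes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for k in (1, 75):   # very small K vs. very large K
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"K={k:2d}  train acc={model.score(X_tr, y_tr):.2f}  "
          f"test acc={model.score(X_te, y_te):.2f}")

# K=1 memorises the training set (perfect train accuracy) but can be jumpy on
# new points; a very large K over-smooths the predictions and both scores drop.
```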

And for the above reason, K is chosen with a cross-validation test, and with a popular technique called the Elbow method.

Well, you can check out cross-validation in more detail on the internet.

But in a nutshell, cross-validation means repeatedly training and testing your model on different portions (folds) of your data to get a generalised result.
And we can use this technique to find an optimal K: we try various K values across these folds and then check which K performs the best.
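In practice, this “try several K values with cross-validation” routine is usually done with a grid search. Here is a hedged sketch assuming a scikit-learn workflow; the iris dataset, the K range of 1 to 20 and the 5-fold CV are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"n_neighbors": list(range(1, 21))}      # candidate K values
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5,
                      scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)        # best K and its CV accuracy
```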

The Elbow Method is a technique with which we can select a K value by looking at the result each K gives.

The steps for an elbow method are stated below:

  • We choose K values in a range (say from 1 to 10)
  • For every K value in this range, we fit the KNN model on the training set and get the result/error on the test set.
  • This “result/error” can be anything: accuracy score, recall score, MAE, RMSE, etc., depending on the problem statement.
  • The result/error is then plotted, with K on the x-axis.
  • The K after which the error/result becomes nearly constant, or barely improves, is selected.
  • The point where this happens is called the elbow point because the curve forms an elbow-like shape, and we choose the K at this elbow (a sketch of these steps follows below).
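Here is a hedged sketch of those steps, again assuming a scikit-learn style workflow; the iris dataset, the K range of 1 to 20 and the use of 5-fold CV misclassification error are purely illustrative choices:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_values = range(1, 21)
errors = []
for k in k_values:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    errors.append(1 - acc)                  # misclassification error for this K

plt.plot(list(k_values), errors, marker="o")
plt.xlabel("K")
plt.ylabel("CV error")
plt.title("Elbow method: pick the K where the curve flattens out")
plt.show()
```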

5. Distance metrics

We generally use the following distance metrics to measure the distance between points in KNN:

1. Euclidean distance — most popularly used, generally called the L2 norm, is given by:
d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

2. Manhattan Distance — called the L1 norm, is given by:
d(x, y) = Σᵢ |xᵢ − yᵢ|

3. Minkowski Distance — called the Lp norm, is given by:
d(x, y) = ( Σᵢ |xᵢ − yᵢ|ᵖ )^(1/p)

Here the sum runs over all the features i of the two points x and y. Notice that Minkowski distance with p = 2 is just the Euclidean distance, and with p = 1 it is the Manhattan distance.

~Well, that is all the math you have to do for KNN!
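And if you would rather let the computer do even that much math, here is a small numerical sketch of the three metrics in plain NumPy (the two example points are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((x - y) ** 2))           # L2 norm
manhattan = np.sum(np.abs(x - y))                   # L1 norm
minkowski = np.sum(np.abs(x - y) ** 3) ** (1 / 3)   # Lp norm with p = 3

print(euclidean, manhattan, minkowski)   # ~3.606, 5.0, ~3.271
```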

6. Data preparation for KNN

  1. Scaled data — since we use distance metrics for the calculations, features on larger scales will dominate the distance, so the data needs to be scaled (standardised/normalised) first. A scaling pipeline is sketched after this list.
  2. Dimensionality reduction — more features increase the complexity and therefore the computation, which is bad for KNN because it is a lazy learner and would need more time and more memory. Hence we should try to reduce the dimensionality using techniques like PCA, LDA, SFS, etc.
  3. Missing value imputation — a missing point might have been one of the nearest neighbours for a new test point, so we should always impute missing values beforehand.
  4. Outlier treatment — outliers result in wrong predictions, since a point near an outlier can be classified/predicted the same as the outlier, which might be incorrect. Therefore, we should always treat outliers before moving forward with the algorithm.
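A common way to bake the scaling step in is to put a scaler and KNN into one pipeline, so the features are standardised before KNN ever measures a distance. This is a hedged sketch assuming scikit-learn; the iris dataset and K = 5 are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Standardise the features, then run KNN on the scaled data
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
print(cross_val_score(knn, X, y, cv=5).mean())   # scaled features -> fair distances
```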

7. Assumptions of KNN

  1. Data is in feature space — KNN assumes that the distance between data points in the feature space can be evaluated using a distance metric like Euclidean, Manhattan, etc.
  2. Data points are vectors with class labels — KNN assumes that each data point is a feature vector with a class label (e.g. positive or negative), though KNN works with an arbitrary number of classes as well.
  3. Similar things exist in close proximity — KNN assumes that similar points lie close to each other in the feature space.

8. Pros and Cons of using KNN

Pros:

  • easy to implement
  • versatile (used for regression and classification)
  • gives high accuracy on smaller datasets, if K is chosen carefully
  • can be used for non-linear data

Cons:

  • requires more time in testing (lazy-learner)
  • high computation is required for large datasets since testing takes time
  • high memory consumption for large datasets
  • sensitive to scale of the data
  • sensitive to outliers in the data

Well, that was pretty much all about KNN, except for when you really dive into research and find out how KNN is implemented in different languages, what the memory and time complexity of the algorithm is, and some more things that they teach you in a masters degree. All of that is beyond the scope of this article and is probably sitting in some statistical book on Machine Learning. So, good luck finding that book in some fancy-ass library.

And always remember, Thomas Shelby is busier than you, so better find some work eh!

