K Nearest Neighbours (KNN) explained
When we hear the word machine learning we see complex calculations, assumptions parameters, and partial derivatives inside of our head. But , this one is different. A simple, easy and lazy one. No complex calculation , no partial derivative, no hyperplane.
What is k nearest neighbour :
K nearest neighbour or simply KNN is a machine learning algorithm that is used primarily for classification tasks, although it can perform regression task. It estimates the likelihood that a data point will become a member of one data group or another based on what group the data points nearest to it belong to.
How KNN works:
Like I said before KNN is a lazy learning and non parametric algorithm. It’s called lazy learning algorithm or lazy learner because it doesn’t perform any task when while training. It just stores the data. That’s it. No calculations, no probability estimation, no partial derivatives or no training model building. It does nothing. It only builds a model when you test the model or a query is performed. like, you asked the model which class it belongs to by giving an unknown data point.
Let’s understand the working of KNN :
KNN use feature similarity to predict the the values of new data points. How it does that? Let’s understand that by a popular pseudocode. (I am pasting this by modifying a bit from what every other article writes)
Data. Load your data set. It’s obvious though.
Step — 2:
Perform data preprocessing. Handling missing values, perform dimensionality reduction if required.
Step — 3:
Choose a value for K.
The ‘K’ in ‘KNN’ stands for how many neighbouring point you want to consider for voting. We will discuss this on later in the post. It’s an integer. Not too small not too big.
Step — 4:
4.1 calculate distance between your query example and every training row e.g. existing data points used in training. You can measure the distance by using any of the methods: Euclidean , Manhattan , minkowski or Hamming distance.
People use the Euclidean distance most and it’s simple.
4.2 sort the distances in ascending order.
4.3 choose top k rows or entries the from the sorted array.
4.4 get the labels of those k entries, find the most frequent class among them(Voting), that will be our predicted class for our query example, assign that class for the query example.
4.5 *For Regression calculate the mean of those k nearest data point. That will be our label for query example.
It’s a simple procedure for finding the most frequent class in the neighbourhood . Like it is in the real world election. When counting the votes in a region the officers in charge count the finite number of voting responses. And classify which candidate got how many votes. And sort them in descending order. Like this
John : 14369
Shrikant : 9650
Laura : 8771
So john is the winner. Its the same procedure in KNN. Out of those K nearest data points we find the class which occurred the most number of times.
In this picture we can see that out of 5 nearest data points 3 of them are class red and 2 are class blue. So we will pick red class for our query .
How to choose optimal value for k:
There is no specific method for choosing ‘K’. you have to choose it manually. So we have to experiment with different values of ‘K’ , train the model multiple times and measure the model performances and pick the best performing ‘K’ to go forward.
When we are dealing with binary classification it’s better to choose an odd value for ‘K’. otherwise it may happen that number of neighbour of the both classes are same. In the same manner the value of ‘K’ must not be a multiple of number of classes present. Another way to choose ‘K’ is sqrt(n) where n is the number of sample present in the training.
‘K’ should not be very low like 1 or 2. it will result in overfitting. In such cases it will be noisy and subjected to effect of outliers.
in this pic out of four data points around our query three are red and one is blue, but that one blue is the closest. If k = 1 it will choose class blue as nearest neighbour and classify as blue, but we know that’s not the case.
K with larger values, in most cases, will give rise to smoother decision boundaries. But it should not be too larger. Otherwise, groups with a fewer number of data points will always be outvoted by other groups. Plus, a larger K will be computationally expensive.
When we have a large amount of data, that means when we have too many features to deal with there is a chance of overfitting causing inaccurate model. It will be harder for our model to measure the distance. One more problem is that many data point can be equidistant.
So the the procedure dimensionality reduction is the way. In such case scenario consider principle component analysis (PCA) and feature selection in data preparation step.
We have not written any article on these topics but soon will.
Advantages and disadvantages of KNN:
1. It’s pretty simple to implement and easy
2. There’s no need to build a model, tune several parameters, or make additional assumptions.
3. It can both be used as regression and classification.
4. It can handle multi class classification pretty well
1. Requires high memory storage
2. Need to determine a suitable ‘K’
3. Sensitive to irrelevant features
4. It gets significantly slower when n (number of examples) increases and number of features or independent variables increases
5. Computation cost is high as measuring the distance between all the data point in the training sample.
Application of KNN:
KNN is used in various fields.
By calculating some factor and using KNN we can a voter will cast his vote to which party.
KNN can be used in banking system to predict weather an individual is fit for loan approval.
It’s a great application of classification with KNN. We will write an article about Recommender system very soon.
1. We need to clean the data very well before model building phase. Additionally, the implementation assumes that all columns contain numerical data and that the last column of our data has labels that we can perform some function on.
2. Domain knowledge is important while choosing the value for ‘K’ .
3. For model evaluation use confusion matrix for check the model accuracy when preforming classification.
4. Other statistical methods such as the likelihood-ratio test are also used for validation.
that’s it from this article. please comment your feedback. if there is some mistakes please address that.
connect me on