Understanding the k-nearest neighbors algorithm
So I have been reading a lot about data science of late, because, you know, it's in. But there is so much on the internet about machine learning and deep learning (yes, these two terms are different!) that my head spun with how much I had to re-learn. I had actually learnt most of these topics in college, but was dumb enough to forget them because, as we all know, we were all stupid in college. Well, at least I was 😛.
So I took out my books from my college days and started from there. I have a book called “Machine Learning in Action” by Peter Harrington. It is an amazing hands-on book if anyone wants to learn the basics and get into this field with the fundamentals sorted out. So yes, this book is highly recommended, and I am going to explain things from this book itself.
There are three types of machine learning techniques:
- Supervised learning, where you train the algorithm on an already labelled, structured data set and then try to predict the outcome for new data using the experience gained from the training data set.
- Unsupervised learning, where you run the algorithm over an unlabelled data set and expect it to recognize patterns in the data, then use those patterns to interpret new data.
- Reinforcement learning, where an agent learns through a feedback loop of rewards and penalties, using its current experience to reinforce its behaviour and strengthen its model for the future. It is famous for training game bots that learn to beat the game.
I am going to start with supervised learning methods. To understand things better, here is some basic machine learning terminology:
- Training data set: The data set you try to train your algorithm on.
- Testing data set: The data set you run your algorithm on, to test the correctness of the model.
- Error rate: The fraction of test samples the model gets wrong, i.e. the mismatch between the expected and actual results when the model is run on the testing data set. A common rule of thumb is to aim for under 5% (see the sketch after this list).
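To make the error rate concrete, here is a minimal Python sketch (the function name and inputs are mine, not from the book):

```python
# A minimal sketch of computing the error rate: the share of predictions
# that do not match the expected labels. Names are illustrative.
def error_rate(predicted, expected):
    wrong = sum(1 for p, e in zip(predicted, expected) if p != e)
    return wrong / len(expected)

print(error_rate(["cat", "dog", "cat"], ["cat", "cat", "cat"]))  # 0.333...
```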
Training a machine learning model typically involves the following steps:
- Collect and clean the data set to be studied.
- Select the right features on which to study the data set.
- Split your data set in an 80/20 ratio, where 80% becomes the training data set and the remaining 20% the testing data set (a common split; see the sketch after this list).
- Run your model over the remaining 20% of the data and validate it. If the error rate is higher than your tolerance, go back to step 2 and repeat; otherwise, congratulations, you have trained a model that can predict values for unknown data.
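Here is a rough sketch of the 80/20 split in Python (the function and variable names are my own, not the book's):

```python
import numpy as np

# A rough sketch of an 80/20 split, assuming `data` is a NumPy array with one
# sample per row and `labels` is a NumPy array of the matching class labels.
def split_80_20(data, labels, seed=42):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(data))  # shuffle so the split is random
    cut = int(0.8 * len(data))            # index where the 80% portion ends
    train_idx, test_idx = indices[:cut], indices[cut:]
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]
```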
Today I am going to discuss the k-nearest neighbors algorithm, a classic classification algorithm where you try to classify something into a group or class on the basis of knowledge built while training your model.
So this algorithm basically says: find me the best-suited class or group for the test data point provided as input to the system.
It computes the Euclidean distance from the input point to every data point seen during the training phase, takes the k nearest ones (k can vary from 1 to n), and returns the class or group with the highest vote among those k neighbors.
Now what is Euclidean distance? It is the straight-line distance between two points in an n-dimensional space.
For example, in a 2D space, for two points (x1, y1) and (x2, y2) it can be written as:
Euclidean distance = √((x1 − x2)² + (y1 − y2)²)
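Here is that formula as a small Python sketch (my own, not from the book), which generalizes naturally to n dimensions:

```python
import numpy as np

# Euclidean distance between two points in n-dimensional space.
def euclidean_distance(p, q):
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance((1, 2), (4, 6)))  # sqrt(3² + 4²) = 5.0
```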
The whole algorithm runs on this simple equation: it classifies the new data point based on its distance from the previously seen groups, and the closest ones win.
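To tie it all together, here is a minimal sketch of the whole classifier (my own simplified version; the book has its own implementation):

```python
import numpy as np
from collections import Counter

# A minimal k-nearest-neighbors sketch of the idea above. `train_data` is an
# (m, n) array of m training points and `train_labels` holds their classes.
def knn_classify(query, train_data, train_labels, k=3):
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(np.sum((train_data - query) ** 2, axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those k neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage: two small clusters, then classify a new point near cluster B
data = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]
print(knn_classify(np.array([0.1, 0.2]), data, labels, k=3))  # -> "B"
```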
To see a working example of this algorithm, have a look at my GitHub project, where I try to identify the handwritten digits 0–9; you will find the link here.
Disclaimer: My code is heavily inspired by the book I mentioned in this article.
