K-Nearest Neighbors (KNN) is one of the most basic algorithms used for classification.
KNN is a non-parametric algorithm (meaning it makes no underlying assumptions about the distribution of the data) belonging to the supervised learning family. The KNN algorithm can also be used for regression problems; the only difference is that the prediction becomes the average of the nearest neighbors rather than a vote among them.
Intuition behind the algorithm :
In the K-NN algorithm the output is a class membership. An object is assigned the class that is most common among its K nearest neighbors, where K is the number of neighbors considered, always a positive integer. Thus, if K = 1, the object is simply assigned the class of its single nearest neighbor.
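The majority-vote idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation; the function name `knn_predict` and the toy dataset are made up for this example, and Euclidean distance is assumed:

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    dists = [math.dist(query, p) for p in train_points]
    # Indices of the k smallest distances
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny illustrative dataset: two well-separated clusters
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["red", "red", "red", "blue", "blue", "blue"]

print(knn_predict(points, labels, (2, 2), k=3))   # -> red
print(knn_predict(points, labels, (8, 7), k=3))   # -> blue
```

With K = 1 the same function reduces to "copy the label of the single closest point", exactly as described above.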
Two questions arise when we try to understand K-NN:
1. How do we decide the value of K?
2. Which point counts as the nearest, i.e., which distance metric should be used?
How do we decide the value of K?
The following are the different boundaries separating the two classes.
If you observe the graph, with increasing values of K the boundaries become smoother. As K grows toward the size of the dataset, the boundary finally becomes all blue or all red, depending on the overall majority class.
Let's examine the training and validation error rates for varying values of K.
If we observe the training error rate graph, the error increases with increasing K, and the error is zero at K = 1. This is because the closest point to any training data point is the point itself, so the prediction is always correct when K = 1.
If the validation error curve were similar, our choice of K would be 1.
By observing the validation error rate we can see that at K = 1 we were overfitting the boundaries. In the validation graph, the error rate initially decreases and reaches a minimum; after that point, it increases with increasing K. The value of K at which the validation error reaches its minimum is the one that should be used for predictions.
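The K-selection procedure above can be sketched as a simple sweep: compute the error rate for each candidate K and keep the K with the lowest held-out error. Everything here (the `error_rate` helper and the toy data) is made up for illustration; in practice you would use a real validation split or cross-validation:

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    dists = [math.dist(query, p) for p in train_points]
    nearest = sorted(range(len(dists)), key=dists.__getitem__)[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

def error_rate(points, labels, eval_points, eval_labels, k):
    """Fraction of evaluation points that KNN misclassifies."""
    wrong = sum(
        knn_predict(points, labels, q, k) != y
        for q, y in zip(eval_points, eval_labels)
    )
    return wrong / len(eval_points)

train_pts = [(1, 1), (2, 2), (3, 3), (7, 7), (8, 8), (9, 9)]
train_lbl = ["a", "a", "a", "b", "b", "b"]

# Training error at K=1 is zero: each point's nearest neighbour is itself.
print(error_rate(train_pts, train_lbl, train_pts, train_lbl, k=1))  # -> 0.0
```

Evaluating `error_rate` on held-out validation points instead of the training points themselves gives the U-shaped curve described above, and its minimum picks K.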
Which point counts as the nearest, i.e., which distance metric should be used?
The most popular distance metrics are Euclidean (the most common choice), Manhattan, and Chebyshev distance.
Euclidean Distance :
The Euclidean distance between two vectors (or points) a and b is the straight-line distance, defined as √∑i(ai−bi)² over the dimensions of the vectors.
Manhattan Distance :
The Manhattan distance between two vectors (or points) a and b is defined as ∑i|ai−bi| over the dimensions of the vectors.
It is called the Manhattan distance because, in an idealized grid city, all paths from the bottom left to the top right that move only along the grid have the same length.
Chebyshev Distance :
In mathematics, Chebyshev distance (or Tchebychev distance), maximum metric, or L∞ metric is a metric defined on a vector space where the distance between two vectors is the greatest of their differences along any coordinate dimension. It is named after Pafnuty Chebyshev.(Wikipedia)
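The three metrics above can be written directly from their definitions. A minimal sketch (function names are my own) for two points of any dimension:

```python
def euclidean(a, b):
    """Straight-line distance: sqrt of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    """Sum of absolute differences along each coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Greatest absolute difference along any single coordinate."""
    return max(abs(x - y) for x, y in zip(a, b))

a, b = (1, 2), (4, 6)
print(euclidean(a, b))  # -> 5.0
print(manhattan(a, b))  # -> 7
print(chebyshev(a, b))  # -> 4
```

Which metric fits best depends on the data; Euclidean is the usual default, while Manhattan can be more robust when coordinates vary on grid-like or independent scales.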
Applications of KNN :
- If you’re searching for semantically similar documents (i.e., documents containing similar topics), this is referred to as Concept Search.
- The biggest use case of K-NN search might be Recommender Systems. If you know a user likes a particular item, then you can recommend similar items for them.
Pros of KNN :
- No assumptions about the data — useful, for example, for nonlinear data
- Simple algorithm — easy to explain and understand/interpret
- High accuracy (relatively) — reasonably high, though not competitive with the best supervised learning models
- Versatile — useful for both classification and regression
Cons of KNN :
- Computationally expensive — because the algorithm stores all of the training data
- High memory requirement
- The prediction stage can be slow when the training set is large
- Sensitive to irrelevant features and to the scale of the data