K-Nearest Neighbor classification

Lazy learners vs Eager learners

Classification methods such as Bayesian classifiers, SVMs, rule-based classifiers, etc. build a generalization (classification) model to classify new test tuples. This model is constructed from the training data set before any unknown tuples arrive. In other words, these kinds of classification methods are already eager (or ready) to classify unknown tuples, so they are called “eager learners”.

“Lazy learners” wait until the last minute to classify a new tuple. They do not construct any generalization model; a lazy learner simply stores the training set. For a given unknown tuple, it finds similar tuples in the stored training data. Lazy learners do less work when receiving training tuples and more work when classifying new unknown tuples, so classification is computationally expensive and requires efficient storage. On the other hand, they are well suited to implementation on parallel hardware.

K-NN (K-nearest neighbor) classification

The KNN classification method was proposed in the 1950s but only became popular around the 1960s, when computing power started to increase. It is a widely used classification method in pattern recognition applications. KNN is a lazy learner and learns by analogy, meaning that it compares the test tuple with a set of training tuples that are similar to it. The “K” means that the algorithm compares the test tuple with its K nearest neighbors in order to classify it.

If k = 1, the algorithm selects the single closest neighbor to the test tuple in the pattern space.

k = 1

If k > 1, say k = n, the algorithm selects the n neighbors closest to the test tuple and assigns the majority class among those neighbors to the test tuple, as in the sketch below.
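A minimal sketch of this voting step in Python (the function names and the (attributes, class_label) layout of the training data are illustrative assumptions, not from the original article; any distance metric can be passed in):

```python
from collections import Counter

def knn_classify(test_tuple, training_data, k, distance):
    """Classify test_tuple by majority vote among its k nearest training tuples.

    training_data is assumed to be a list of (attributes, class_label) pairs,
    and distance is any metric over attribute vectors (e.g. Euclidean).
    """
    # Rank all stored training tuples by distance to the test tuple
    # (lazy learner: all the work happens at classification time).
    neighbors = sorted(training_data, key=lambda item: distance(test_tuple, item[0]))[:k]
    # Majority vote over the class labels of the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```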

How to find nearest neighbors?

KNN uses a distance metric to define “closeness” (similarity) and, using it, takes the k closest neighbors. Euclidean distance, Manhattan distance, etc. can be used as distance metrics. The following example shows how to use Euclidean distance.

Eg:- Let X = (x1, x2, …, xn) be a point in an n-dimensional pattern space, where x1, x2, …, xn are the values of attributes A1, A2, …, An. For two points (tuples) X = (x1, x2, …, xn) and Y = (y1, y2, …, yn), the Euclidean distance between X and Y is

Figure 1 — Euclidean distance between two points
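In symbols, dist(X, Y) = sqrt((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²). A small sketch of this metric (the function name is just for illustration):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length attribute vectors x and y."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Example: the distance between (1, 2) and (4, 6) is 5.0
print(euclidean_distance((1, 2), (4, 6)))
```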

Attribute normalization

Usually it is better to normalize attributes before calculating distances. The reason is that attributes with initially large ranges (eg:- salary) can outweigh attributes with initially smaller ranges (eg:- binary attributes). Attribute normalization addresses this issue.

eg:- Min-max normalization maps a value v of attribute A to v’ = (v − min_A) / (max_A − min_A), where min_A and max_A are the minimum and maximum observed values of A, so v’ lies in the range [0, 1].
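A minimal sketch of min-max normalization applied to one numeric attribute (the helper name and the zero-range handling are illustrative):

```python
def min_max_normalize(values):
    """Rescale a list of numeric attribute values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; map them to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Example: salaries with a large range become comparable to small-range attributes.
print(min_max_normalize([30000, 45000, 120000]))  # [0.0, 0.166..., 1.0]
```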

Distance for categorical (non-numeric) attributes

Assume X and Y are two instances of an attribute which represents colors (blue, red, green, white, etc). Hamming distance is one possible metric.

Hamming distance

If the two are identical (X and Y have the same color, e.g. blue), the difference is taken as 0.

If X and Y are not identical, the difference is set to 1. There are also more sophisticated schemes that apply differential grading (eg:- assigning a larger difference to blue and white than to blue and black).
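A sketch of the simple 0/1 rule for a single categorical attribute (the function name is illustrative; differential grading schemes would replace the equality test with a lookup table):

```python
def categorical_difference(x, y):
    """0/1 (Hamming-style) difference between two categorical attribute values."""
    return 0 if x == y else 1

print(categorical_difference("blue", "blue"))   # 0 (identical)
print(categorical_difference("blue", "white"))  # 1 (not identical)
```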

Deal with missing values

If the values of attribute A are missing in X and/or Y, assume the maximum possible difference. For non-numeric categorical attributes, take the difference as 1 if one or both values are missing.

If A is a numeric attribute, take the difference as 1 if the value is missing from both X and Y; otherwise, if only one (normalized) value v’ is present, take max(|v’|, |1 − v’|).
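A sketch of these missing-value rules for a single attribute, assuming numeric values have already been min-max normalized into [0, 1] and missing values are represented as None (both are illustrative assumptions):

```python
def attribute_difference(x, y, numeric):
    """Per-attribute difference using the missing-value rules described above."""
    if x is None and y is None:
        return 1.0  # both missing: assume the maximum possible difference
    if x is None or y is None:
        if not numeric:
            return 1.0  # categorical attribute with one value missing
        v = x if x is not None else y
        return max(abs(v), abs(1 - v))  # numeric value, normalized into [0, 1]
    if numeric:
        return abs(x - y)
    return 0.0 if x == y else 1.0

print(attribute_difference(0.8, None, numeric=True))   # 0.8
print(attribute_difference(None, None, numeric=True))  # 1.0
```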

Selecting a value for “k”

Compute the error rate of the classifier for each value of k, increasing k from 1, and select the k value that gives the minimum error rate, as in the sketch below.
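A minimal sketch of this selection loop, assuming a held-out validation set and reusing the illustrative knn_classify helper and distance function from the earlier sketches (the validation split itself is an assumption, not prescribed by the article):

```python
def select_k(training_data, validation_data, max_k, distance):
    """Return the k in [1, max_k] with the lowest validation error rate."""
    best_k, best_error = 1, float("inf")
    for k in range(1, max_k + 1):
        errors = sum(
            1
            for attributes, label in validation_data
            if knn_classify(attributes, training_data, k, distance) != label
        )
        error_rate = errors / len(validation_data)
        if error_rate < best_error:
            best_k, best_error = k, error_rate
    return best_k
```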

K-NNC for predictions

Nearest-neighbor methods can also be used for numeric prediction: take the average of the real-valued labels associated with the k nearest neighbors as the predicted value of the test tuple.
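A sketch of this averaging step, reusing the illustrative (attributes, label) layout and distance function from the classification sketch, with real-valued labels this time:

```python
def knn_predict(test_tuple, training_data, k, distance):
    """Predict a numeric value as the mean label of the k nearest training tuples."""
    neighbors = sorted(training_data, key=lambda item: distance(test_tuple, item[0]))[:k]
    return sum(label for _, label in neighbors) / k
```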

Disadvantages and methods to improve efficiency

K-NNC suffers from poor accuracy when the data contains noisy or irrelevant attributes. This can be overcome by including methods for tuple pruning and attribute weighting.

K-NNC is also extremely slow when classifying test tuples (it is a lazy learner). With k = 1, a single classification takes O(|D|) comparisons, where |D| is the number of training tuples. This can be improved using presorting and different structures to store the tuples (with a tree-based structure, the search time can be reduced to O(log |D|)). With a parallel implementation, the time can become O(1), independent of |D|.
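As one concrete example of a tree-based index, a sketch using SciPy’s k-d tree (one possible choice among several; the random data here is purely illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over the stored training tuples (still "lazy": no class
# model is learned, only a spatial index for faster neighbor search).
training_points = np.random.rand(1000, 4)  # 1000 tuples, 4 numeric attributes
tree = cKDTree(training_points)

# Query the 3 nearest neighbors of a test tuple in roughly O(log |D|) time.
distances, indices = tree.query(np.random.rand(4), k=3)
print(distances, indices)
```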

Other techniques such as partial distance calculation (use a distance threshold and measure the distance over only a subset of the attributes; if the partial distance already exceeds the threshold, halt the comparison and move on to the next candidate neighbor), pruning (removing tuples that are useless), etc. can also be used, as in the sketch below.
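A minimal sketch of the partial-distance idea for squared Euclidean distance (the threshold handling and early-exit convention are illustrative):

```python
def partial_squared_distance(x, y, threshold):
    """Accumulate squared differences attribute by attribute and stop early.

    Returns None if the running sum exceeds the threshold, signalling that this
    candidate cannot be among the nearest neighbors found so far.
    """
    total = 0.0
    for xi, yi in zip(x, y):
        total += (xi - yi) ** 2
        if total > threshold:
            return None  # halt: check the next candidate neighbor instead
    return total
```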