K-Nearest Neighbor (KNN) Algorithm & its metrics

Sravanthi Dande
6 min read · Oct 19, 2021


Explanation of KNN algorithm:

  • KNN is a supervised machine learning algorithm.
  • It is non-parametric, as it makes no assumptions about the form of the mapping function between inputs and outputs.
  • It is used for both classification and regression.
  • When KNN is used for classification, the algorithm assigns the new data point to the class that occurs most often among its nearest neighbors (majority voting).
  • When KNN is used for regression, the average of the target values of the nearest neighbors is taken as the output.
  • It measures the similarity between the new data point and the available training data, and puts the new point into the most similar category.
  • Concretely, KNN calculates the distance between a test data point and every row in the training data, selects a specified number of closest examples (K), and then takes a majority vote over their labels (classification) or averages them (regression).
  • A KNN model can be built with KNeighborsClassifier() from scikit-learn's sklearn.neighbors module.
  • Here we use Euclidean distance to calculate the distance (similarity) between two data points. A minimal end-to-end sketch follows below.
[Figures: Example 1 and Example 2 illustrating the KNN algorithm]
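To make the steps above concrete, here is a minimal sketch of KNN classification with scikit-learn. The bundled Iris dataset and the parameter choices are illustrative assumptions, not from the original article:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data: 150 iris flowers, 4 numeric features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# K = 5 neighbors; the default metric is Minkowski with p=2 (Euclidean)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
```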

Examples of KNN algorithm usage:

Recommendations for:

  • products on Amazon
  • videos on YouTube
  • movies on Netflix, etc.

Advantages:

  • Simple and easy to implement.
  • It is flexible, as it can be used for both classification and regression.

Disadvantages:

  • As the volume of data grows (many training samples or many features), prediction gets slower.
  • Determining a good value of K can be tricky.
  • Because the distance from a query point to every training sample must be computed, the computational cost at prediction time is high.

Various steps in the KNN algorithm (pseudocode):

1) Import the libraries

2) Explore, clean, and prepare the data (read the data from a .csv file, check the shape of the data, check for null values and handle them).

3) Split the data into X_train, X_test, y_train and y_test (a sketch of steps 1–3 follows below).
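A minimal sketch of steps 1–3, assuming a hypothetical file data.csv with a label column named target (both names are placeholders, not from the original article):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Step 2: read and inspect the data ("data.csv" and "target" are placeholders)
df = pd.read_csv("data.csv")
print(df.shape)                 # check the shape of the data
print(df.isnull().sum())        # count missing values per column
df = df.dropna()                # one simple way to handle nulls

# Step 3: separate features and label, then split into train and test sets
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```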

4) Set the number of neighbors for the KNN classifier (choose a value for K).

  • The right value of K is usually found by trying several values and picking the one that works best for the data (see the grid-search sketch after this list).
  • A very small K makes the prediction sensitive to noise and outliers in the training data.
  • A larger K smooths out noise, but if K is too large the model underfits, because distant, less relevant neighbors start influencing the prediction.
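One common way to pick K is a cross-validated grid search; the sketch below shows that approach on illustrative data (the search range 1–20 is an assumption):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # illustrative data

# Try K = 1..20 with 5-fold cross-validation and keep the best value
param_grid = {"n_neighbors": range(1, 21)}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("Best K:", search.best_params_["n_neighbors"])
print("Best cross-validated accuracy:", search.best_score_)
```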

5) Build the model using KNeighborsClassifier() and fit it on X_train and y_train. Internally, a single prediction works as follows (a from-scratch sketch follows this list):

  • Calculate the Euclidean distance from the test data point to every row in the training set.
  • Sort the calculated distances in ascending order.
  • Pick the first K entries.
  • Get the labels of those K entries.
  • If it is a regression model, return the mean of the K labels.
  • If it is a classification model, return the mode of the K labels (pick the most frequent class).
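A minimal from-scratch sketch of that procedure for classification, using NumPy (the function name knn_predict and the toy data are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote of its k nearest neighbors."""
    # 1. Euclidean distance from the query point to every training row
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # 2./3. Sort in ascending order and pick the first k entries
    nearest = np.argsort(distances)[:k]
    # 4. Get the labels of those k entries
    labels = y_train[nearest]
    # 5. Classification: return the mode (most frequent class)
    return Counter(labels).most_common(1)[0][0]

# Toy data: two 2-D clusters labeled 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> 0
```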

6) Measure the training and testing accuracy for a range of K values and plot a graph of the number of neighbors (K) versus accuracy.
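A sketch of this step, again on the illustrative Iris data:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit one classifier per K and record train/test accuracy
ks = range(1, 21)
train_acc, test_acc = [], []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

plt.plot(ks, train_acc, label="Training accuracy")
plt.plot(ks, test_acc, label="Testing accuracy")
plt.xlabel("Number of neighbors (K)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```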

7) Evaluate the model by predicting labels for X_test and storing them in y_pred.

8) To check how good the model is, measure the accuracy score, F1 score, precision, recall, confusion matrix, and specificity (a combined sketch follows below).
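A sketch of steps 7 and 8 on illustrative binary data. Specificity has no built-in scikit-learn scorer, so it is derived from the confusion matrix here:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative binary classification data
X, y = make_classification(n_samples=300, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = knn.predict(X_test)                         # step 7

print("Accuracy :", accuracy_score(y_test, y_pred))  # step 8
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("Specificity:", tn / (tn + fp))                # true-negative rate
```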

Why is KNN called instance-based learning?

  • KNN is called instance-based learning because it uses the raw training instances themselves to make predictions.
  • It is also referred to as a lazy learning algorithm, because no learning happens before prediction time: at training time, all the model does is store the training set.

Distance measure used for similarity and its explanation with mathematical formula and geometric picture:

Euclidean distance is the most popular distance measure.

  • It is the distance between two real-valued vectors: the true straight-line distance between two points.
  • It is calculated as the square root of the sum of squared differences between a new data point and an existing data point from the training set, across all input attributes.
  • This distance measure is appropriate if the input variables are of a similar type (e.g., all measured widths and heights).
  • It is the default metric used by scikit-learn for the KNN algorithm (Minkowski with p = 2, which is equivalent to Euclidean).
Calculating the distance between two points X1 = (x1, y1) and X2 = (x2, y2) in 2-D space:

d(X1, X2) = √( (x2 − x1)^2 + (y2 − y1)^2 )

If we are calculating for a D-dimensional space, the generalized formula is:

d(a, b) = √( Σ (aᵢ − bᵢ)^2 ), summing over all D attributes i = 1…D
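In code, that formula is a one-liner with NumPy (the example points are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

d = np.sqrt(((a - b) ** 2).sum())  # square root of summed squared differences
print(d, np.linalg.norm(a - b))    # both print ~7.071
```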

Other most popular distance measures:

Hamming distance:

  • It is used to calculate the distance between two binary data strings (and, more generally, any two equal-length strings).
  • The distance between two equal-length binary strings is the number of bit positions in which the two bits are different.
  • The result shows how many attributes were different.
  • E.g., “ABCDE” and “AGDDF” have the same length of 5. Hamming distance goes letter by letter and checks whether the characters match: except for the first ‘A’ and the fourth ‘D’, the remaining 3 positions differ, so the Hamming distance here is 3.
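A minimal sketch of that computation in pure Python (the helper name hamming_distance is an illustrative choice):

```python
def hamming_distance(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("ABCDE", "AGDDF"))  # -> 3
```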

Manhattan distance:

  • Also called city block distance or taxicab distance: it measures the distance a taxi would drive between two destinations along a city's street grid.
  • It is the distance between two real-valued vectors, calculated as the sum of the absolute differences of their components:

d(a, b) = Σ |aᵢ − bᵢ|, summing over all D attributes i = 1…D

  • It is a good distance measure when the input variables are not of a similar type (e.g., age, gender, height, etc.).
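A one-line NumPy sketch (the example points are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 8.0])

print(np.abs(a - b).sum())  # Manhattan distance: 3 + 4 + 5 = 12.0
```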

Minkowski distance:

  • It is the generalization of the Euclidean and Manhattan distances.
  • It is used for real-valued vector spaces.
  • It can be calculated only in a normed vector space, i.e., a space in which every vector has a length (norm), and that length cannot be negative.

The Minkowski formula is:

d(a, b) = ( Σ |aᵢ − bᵢ|^p )^(1/p), summing over all D attributes i = 1…D

Here p is the order parameter that determines which distance is computed:

If p = 1, it is the Manhattan distance.

If p = 2, it is the Euclidean distance.
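A sketch verifying both special cases with SciPy (the example points are arbitrary):

```python
from scipy.spatial.distance import cityblock, euclidean, minkowski

a = [1.0, 2.0, 3.0]
b = [4.0, 6.0, 8.0]

print(minkowski(a, b, p=1), cityblock(a, b))  # p = 1: Manhattan distance (12.0)
print(minkowski(a, b, p=2), euclidean(a, b))  # p = 2: Euclidean distance (~7.071)
```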

A few other distance measures are:

Tanimoto distance

Jaccard distance:

  • It looks at two binary data sets and counts the instances where both values are equal to 1.
  • The Jaccard index is the number of these 1-to-1 matches divided by the size of the union (the positions where at least one value is 1):

J(A, B) = |A ∩ B| / |A ∪ B|

The Jaccard distance is obtained by subtracting the Jaccard index from 1 (i.e., from 100%):

d(A, B) = 1 − J(A, B)
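A minimal sketch for two binary vectors (the vectors are arbitrary examples); scikit-learn's jaccard_score computes the same index:

```python
import numpy as np
from sklearn.metrics import jaccard_score

a = np.array([1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 0, 1, 1, 0])

intersection = np.sum((a == 1) & (b == 1))  # positions where both are 1
union = np.sum((a == 1) | (b == 1))         # positions where at least one is 1
jaccard_index = intersection / union
print(jaccard_index, jaccard_score(a, b))   # both 0.5
print("Jaccard distance:", 1 - jaccard_index)
```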

Mahalanobis distance

Cosine distance:

  • It is mainly used to calculate the similarity between two vectors.
  • Cosine similarity is the cosine of the angle between the two vectors, which indicates whether they are pointing in the same direction:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

The cosine distance is then 1 − cos(θ).
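A minimal NumPy sketch (the vectors are arbitrary examples); scipy.spatial.distance.cosine returns the same distance:

```python
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, different length

similarity = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(similarity)                    # 1.0: the vectors point the same way
print(1 - similarity, cosine(a, b))  # cosine distance: ~0.0
```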

The best distance measure can be selected based on the properties of the data. We can also experiment with different distance measures and K values, and keep the combination that gives the best accuracy.
