Birds of a feather flock together

The story of the KNN classifier

Zvonimir Boban
6 min read · Aug 20, 2022

This post will introduce you to the K nearest neighbors (KNN) classifier and how it can be implemented using the R programming language. Along the way, we will talk about the historical origin of the method and discuss the importance of having separate train and test datasets.

About classification models

How well is our classifier performing?

Classifier algorithms in general are aimed at predicting some categorical property of a data point based on its features. Of course, in order to assess anything, we have to be able to measure the classifier accuracy. The simplest such measure in a classification setting is the error rate, defined as the proportion of incorrect classifications.
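
For instance, given a vector of predicted labels and a vector of true labels, the error rate is just the proportion of mismatches. A minimal R sketch (the label vectors below are made up for the example):

```r
# hypothetical predicted and true class labels
predicted <- c("cat", "dog", "dog", "cat", "dog")
actual    <- c("cat", "dog", "cat", "cat", "cat")

# error rate: proportion of incorrect classifications
mean(predicted != actual)
#> [1] 0.4
```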

Choosing the best model — why we need the test set

At first glance, picking the optimal model seems straightforward. We take the data, construct candidate models and pick the one with the lowest error rate. The problem with this approach is that we are not interested in minimizing the error rate on the dataset used for model construction, but on previously unseen data. The simplistic approach described above would lead to an issue called overfitting, meaning the model becomes so attuned to the training data that it loses the ability to generalize to any other dataset.

This problem can be remedied by splitting the data into a train and test dataset. The training set is used for model construction, and the test set for its evaluation. Consequently, the best model is the one with the lowest error rate on the test dataset. The sets used for training and testing should come from the same population. However, in order to produce an unbiased assessment of the error rate on real data, no information from the test set should be used during model construction.

The KNN classifier

Historical background and intuitive explanation

The earliest mention of the KNN algorithm dates back to 1951, in a technical report¹ written for the US Air Force. The original report was reprinted in 1977 and again in 1989² (the latter accompanied by a formal review of the classifier's properties).

This method is probably the most intuitive classifier algorithm, and all of us actually use it on a daily basis. In short, it could best be described by an old English proverb:

Birds of a feather flock together.

In other words, the algorithm assigns classes to data points based on their surroundings (their “nearest neighbors”). The figure below illustrates the working principle of the KNN algorithm. If a decision was made based on the three nearest neighbors (the inside of the solid circle), the x point would be classified as a triangle. However, extending the model to five nearest neighbors (the inside of the dotted circle) would yield a square as the estimate.

How does the KNN algorithm perform classification? Image by the Author.

The KNN algorithm

Technically speaking, the method can be reduced to the following steps:

  1. Splitting the data into training and testing sets
  2. Choosing how many nearest neighbors will be taken into account when determining the unknown class of our data point (setting the value of the K parameter)
  3. Categorizing the data point to the class that occurs most frequently among its K nearest neighbors (a minimal from-scratch sketch of this step follows the list)
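
To make step 3 concrete, here is a minimal from-scratch sketch of classifying a single point. The function name and its arguments are mine, chosen for the illustration, and Euclidean distance (discussed next) is assumed as the distance measure:

```r
# classify one new point by majority vote among its K nearest neighbors
knn_predict <- function(new_point, train_features, train_labels, k = 3) {
  # Euclidean distance from the new point to every training point
  distances <- apply(train_features, 1, function(row) sqrt(sum((row - new_point)^2)))
  # labels of the k closest training points
  neighbors <- train_labels[order(distances)[1:k]]
  # the most frequent class among the neighbors wins
  names(which.max(table(neighbors)))
}
```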

In order to be able to determine how close the neighbors are, we need a measure of distance. Different measures can be used in this step. However, the most commonly used one is the Euclidean distance, named after the Greek mathematician Euclid, also referred to as the father of geometry. For two points a and b in an n-dimensional space, the Euclidean distance can mathematically be expressed as

d(a, b) = √((a₁ − b₁)² + (a₂ − b₂)² + … + (aₙ − bₙ)²)

In two dimensions (n = 2), this equation reduces to the well-known Pythagorean theorem, c² = a² + b², where c is the hypotenuse and a and b are the other two sides of a right-angled triangle.
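
Translated to R, the distance calculation is a one-liner. The 3-4-5 right triangle below is just a toy example to check that the function agrees with the Pythagorean theorem:

```r
# Euclidean distance between two points given as numeric vectors
euclidean <- function(a, b) sqrt(sum((a - b)^2))

# a 3-4-5 right triangle: the distance is the hypotenuse
euclidean(c(0, 0), c(3, 4))
#> [1] 5
```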

In order to give equal importance to all input parameters, we have to account for different scales used in each of them. For example, our model might use the height in meters, and weight in kilograms. Using the Euclidean measure of distance, a height difference of 1 m and a difference in weight of 1 kg would contribute equally to the model. However, we know that a difference in height of 1 m is much more important than a weight difference of 1 kg. The problem can be solved by scaling the input parameters to a [0,1] interval. For the i-th value of some input variable a, this is done using the following equation:

aᵢ(scaled) = (aᵢ − min(a)) / (max(a) − min(a))
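
In R, the scaling can again be written as a small helper function. The numbers below are made-up heights in centimeters, just to show the effect:

```r
# scale a numeric vector to the [0, 1] interval
scale01 <- function(a) (a - min(a)) / (max(a) - min(a))

scale01(c(150, 165, 180, 195))
#> [1] 0.0000000 0.3333333 0.6666667 1.0000000
```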

Example: iris flower species classification

Exploratory data analysis

The iris dataset contains information on the physical characteristics of three iris species. We will try to use these characteristics to correctly predict the species using the KNN classifier.

The dataset is built into R, so we can access it by just typing its name into the console. In all of my posts, I am using the tidyverse set of packages to manipulate the data, so I am loading them here prior to any data analysis. If you are not familiar with data wrangling using the tidyverse, I also wrote a helpful post on that topic.
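
The load itself is a single line:

```r
# tidyverse: data wrangling (dplyr) and plotting (ggplot2)
library(tidyverse)
```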

Let’s take a peek at the first couple of rows:
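
```r
# a quick look at the first six rows (using the base head function)
head(iris)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa
#> 6          5.4         3.9          1.7         0.4  setosa
```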

We can see that the dataset contains five columns. The first four columns are lengths and widths of iris flower parts, and the last column contains the species label. Consequently, the first four columns contain input variables, and the last one is the output parameter of interest.

We can also plot the data in order to see how well the species are separated based on the flower characteristics.
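
A scatter plot of the petal dimensions, colored by species, does the job. Here is one possible ggplot2 version:

```r
# scatter plot of petal dimensions, one color per species
# (ggplot2 is loaded as part of the tidyverse above)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point(size = 2, alpha = 0.8) +
  labs(x = "Petal length (cm)", y = "Petal width (cm)")
```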

Distribution of iris flower species based on the dimensions of their petals.

The graph clearly shows that the species are well separated based on the dimensions of their petals. There is only a small overlap between the versicolor and virginica species. Based on these insights, we expect that our classifier shouldn’t have a hard time figuring out the correct classes.

Applying the algorithm to the data

Following the algorithm recipe laid out in the previous section, we first have to scale the data and split it into a training and testing dataset.
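
One way to do both steps is shown below; the 70/30 split ratio, the random seed and the variable names are arbitrary choices for this sketch:

```r
# scale all numeric columns to the [0, 1] interval
# (the scaling helper from earlier)
scale01 <- function(a) (a - min(a)) / (max(a) - min(a))

iris_scaled <- iris %>%
  mutate(across(where(is.numeric), scale01))

# split into training (70%) and testing (30%) sets
set.seed(123)
train_idx <- sample(nrow(iris_scaled), size = 0.7 * nrow(iris_scaled))

train_set    <- iris_scaled[train_idx, 1:4]
test_set     <- iris_scaled[-train_idx, 1:4]
train_labels <- iris_scaled$Species[train_idx]
test_labels  <- iris_scaled$Species[-train_idx]
```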

To create the model, we will use the knn function from the class library. The first four arguments to the function are the most important. They are, respectively, the training set, the testing set, the training set labels, and the k parameter (the number of neighbors to take into account when performing classification). The function returns the predicted labels for the test dataset.
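
With the data prepared as above, the call itself is short; k = 5 here is just an illustrative choice:

```r
library(class)

# predict the species of the test set observations
predicted_species <- knn(
  train = train_set,
  test  = test_set,
  cl    = train_labels,
  k     = 5
)
```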

Let’s now check how accurate our model was.
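
Comparing the predicted labels with the true test set labels gives both a confusion matrix and the overall accuracy:

```r
# confusion matrix: predicted vs. true species
table(Predicted = predicted_species, Actual = test_labels)

# overall accuracy (1 - error rate)
mean(predicted_species == test_labels)
```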

The obtained accuracy is great. We have only made three mistakes. However, this result was expected since the iris species are quite different with respect to their physical characteristics. Consequently, the groups don’t overlap very much, so the classifier had an easy job. Moreover, as expected from the above graph, the setosa species was classified perfectly, and the mistakes were made only for the two overlapping species. The datasets encountered “in the wild” will usually be much more complex and noisy, so the accuracy will most often be lower as well.

[1] E. Fix and J.L. Hodges: “Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties”, Report No. 4, Project No. 21-49-004, USAF School of Aviation Medicine, Randolph Field, Texas (1951).

[2] Bernard W. Silverman and M. Christopher Jones: “E. Fix and J.L. Hodges (1951): An Important Contribution to Nonparametric Discriminant Analysis and Density Estimation: Commentary on Fix and Hodges (1951)”, International Statistical Review / Revue Internationale de Statistique 57(3) (1989), pp. 233–238.
