Distance metrics and K-Nearest Neighbor (KNN)

Luigi Fiori
5 min read · May 22, 2020


Welcome back for this new blog post, guys!

Today we will be going over a really common and useful algorithm used for both classification and regression problems: K-Nearest Neighbor (KNN).

https://miro.medium.com/max/2556/1*OtKDB5QyX1LGnyuA4azGrg.png

While outside the scope of this post, it is worth mentioning the K-means algorithm, used for clustering tasks, which shares a principle with KNN: grouping data points together using a distance metric in order to create homogeneous groupings.

In order to understand how this algorithm works, we first need to understand distance metrics and how they let us decide when two objects are similar.

The assumption behind it is that “distance helps us quantify similarity”: more similar objects are more likely to belong to the same class.

Working with our dataset, we can treat each column as a separate dimension; we can then plot each data point and calculate the distances between them.

Depending on the context of the problem several distance metrics can be used.

We will be going over:

  • Manhattan distance
  • Euclidean distance
  • Minkowski distance

Manhattan distance

https://miro.medium.com/max/480/1*52bc_cCCMgVF-SYZuWpTLQ.png

The easiest way to remember Manhattan distance is to use the analogy that gives this distance metric its name: look at the picture above, but picture this grid as the famous grid of streets in Manhattan. To get from one point to another you cannot cut diagonally through a block; you have to travel along the streets.

The formula to calculate Manhattan distance is:

d(x, y) = ∑ᵢ |xᵢ - yᵢ|

The left side of the equals sign just means “the distance between point x and point y”. The ∑ just means “the cumulative sum of each step”: for each dimension i, we take the absolute difference between the two points’ values and add it to the running total.

So far, this discussion has explored Manhattan distance in a 2-dimensional space. However, all of the distance metrics we are going to learn can generalize to an n-dimensional space.
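As a quick illustration, here is a minimal sketch in plain Python, assuming each point is represented as a list of coordinates of equal length; the function name manhattan_distance is my own choice for the example.

# Manhattan distance: the sum of the absolute differences across every dimension.
def manhattan_distance(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(manhattan_distance([1, 2], [4, 6]))  # |1 - 4| + |2 - 6| = 3 + 4 = 7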

Euclidean distance

https://miro.medium.com/max/2678/1*yjxwoNqIwb1fXW26HvK_Kw.png

The equation at the heart of this distance is the Pythagorean theorem: a² + b² = c².

The formula to calculate Euclidean distance is:

d(x, y) = √( ∑ᵢ (xᵢ - yᵢ)² )

For each dimension, we subtract one point’s value from the other’s to get the length of that “side” of the triangle in that dimension, square it, and add it to our running total. The square root of that running total is our Euclidean distance.

Just as with Manhattan distance, we can generalize this equation to 𝑛 dimensions.
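Here is the same kind of sketch for Euclidean distance, again assuming points are equal-length lists of coordinates and using a helper name of my own.

import math

# Euclidean distance: square root of the sum of squared differences per dimension.
def euclidean_distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([1, 2], [4, 6]))  # sqrt(9 + 16) = 5.0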

Minkowski distance

The Minkowski distance is a generalized distance metric defined on a normed vector space. A normed vector space is just a fancy way of saying a vector space where each vector is assigned a length (a norm) by a function. It can be any function, as long as it meets two criteria:

  1. the zero vector (just a vector filled with zeros) will output a length of 0, and
  2. every other vector must have a positive length

Don’t worry too much about the specifics of the mathematical definition above. Instead, let’s try to gain an intuition for what Minkowski distance actually measures. Both the Manhattan and Euclidean distances are actually special cases of Minkowski distance; the only thing that changes is the exponent. Let’s take a look at the formula:

d(x, y) = ( ∑ᵢ |xᵢ - yᵢ|^p )^(1/p)

With p = 1 we recover Manhattan distance, and with p = 2 we recover Euclidean distance.
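To make that relationship concrete, here is a minimal sketch in plain Python, again assuming each point is a list of coordinates; the function name minkowski_distance is my own, not from any library.

# Minkowski distance for n-dimensional points; p is the exponent from the formula above.
def minkowski_distance(x, y, p):
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

# p = 1 recovers Manhattan distance, p = 2 recovers Euclidean distance.
print(minkowski_distance([1, 2], [4, 6], p=1))  # 7.0
print(minkowski_distance([1, 2], [4, 6], p=2))  # 5.0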

KNN

KNN is a distance-based classifier, meaning that it implicitly assumes that the smaller the distance between two points, the more similar they are. In KNN, each column acts as a dimension. In a dataset with two columns, we can easily visualize this by treating the values of one column as X coordinates and the other as Y coordinates. Since this is a supervised learning algorithm, we must also have the labels for each point in the dataset, or else we wouldn’t know what to predict!

It is important to note that KNN is quite particular compared to other classifiers: during the “fit” step, KNN just stores all the training data and the corresponding labels, and no distances are calculated at this point.

All the work is done during the “predict” phase.

During this step, KNN takes the point we want a class prediction for and calculates the distance between that point and every single point in the training set. It then finds the K closest points, or neighbors, and examines the labels of each. You can think of each of the K closest points getting to 'vote' on the predicted class: each neighbor votes for its own class, and the class with the most votes among the K nearest neighbors becomes the prediction.
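To give an intuition for that voting step, here is a toy sketch of a predict function, assuming the training data is a list of (point, label) pairs and reusing the euclidean_distance helper from the earlier example; it is meant for illustration, not efficiency.

from collections import Counter

def knn_predict(training_data, new_point, k=3):
    # Compute the distance from the new point to every training point.
    distances = [
        (euclidean_distance(point, new_point), label)
        for point, label in training_data
    ]
    # Keep the K closest neighbors and let their labels vote.
    neighbors = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

training_data = [([1, 2], "A"), ([2, 3], "A"), ([8, 9], "B"), ([9, 9], "B")]
print(knn_predict(training_data, [2, 2], k=3))  # "A"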

Choosing an appropriate distance metric is essential and will depend on the context of the problem at hand.

Finding the Best Value for K

Changing the value of K can affect the performance of the model, so the question is: what is the best value to use for K?

https://elvinouyang.github.io/assets/images/Introduction%20to%20Machine%20Learning%20with%20Python%20-%20Chapter%202%20-%20Datasets%20and%20kNN_files/Introduction%20to%20Machine%20Learning%20with%20Python%20-%20Chapter%202%20-%20Datasets%20and%20kNN_31_1.png

While the best value for K is not immediately obvious for any problem, there are some strategies that we can use to select a good or near optimal value.

In general, the smaller K is, the tighter the “fit” of the model, which also makes it more sensitive to noise; larger values of K smooth out the decision boundary.

Finding an optimal value of K requires some iterative investigation.

The best way to find an optimal value for K is to choose a minimum and maximum boundary and try them all! In practice (see the sketch after this list), this means:

  1. Fit a KNN classifier for each value of K
  2. Generate predictions with that model
  3. Calculate and evaluate a performance metric using the predictions the model made
  4. Compare the results for every model and find the one with the lowest overall error, or highest overall score!
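Here is a hedged sketch of that loop using scikit-learn, with the Iris dataset standing in for your own data and an arbitrary K range of 1 to 20 chosen purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

best_k, best_score = None, 0.0
for k in range(1, 21):
    # Fit a KNN classifier for this value of K.
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # Generate predictions and evaluate them.
    preds = model.predict(X_test)
    score = f1_score(y_test, preds, average="weighted")
    # Keep track of the best-performing K.
    if score > best_score:
        best_k, best_score = k, score

print(f"Best K: {best_k} (F1 = {best_score:.3f})")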

Note that KNN isn’t the best choice for extremely large datasets. This is because the algorithm has to compute the distance from the query point to every single training point at prediction time, so the cost grows quickly with the number of samples and features.

Evaluating model performance

Evaluating classification performance for KNN works the same as for any other classification algorithm: we need a set of predictions and the corresponding ground-truth labels for each of the points we made a prediction on. We can then compute evaluation metrics such as Precision, Recall, Accuracy and F1-Score.
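As a minimal sketch, assuming y_test and preds come from a fitted KNN model like the one in the previous snippet, these metrics can be computed with scikit-learn:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("Accuracy :", accuracy_score(y_test, preds))
print("Precision:", precision_score(y_test, preds, average="weighted"))
print("Recall   :", recall_score(y_test, preds, average="weighted"))
print("F1-Score :", f1_score(y_test, preds, average="weighted"))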

Conclusions

To generate a prediction for a given data point, KNN finds the k nearest data points and then predicts the majority class among those k points.

An incredibly important decision when using the KNN algorithm is choosing an appropriate distance metric, as this has a monumental impact on the output of the algorithm.

KNN is extremely resource intensive for large datasets, because as the number of data points and features increases, the number of distance calculations required at prediction time grows as well.

Thanks for reading guys, that’s all for today.

See you next time!


Luigi Fiori

The goal is to turn data into information, and information into insight.