9 Distance Measures in Data Science

The advantages and pitfalls of common distance measures

Maarten Grootendorst
TDS Archive

Distance Measures. Image by the author.

Many algorithms, whether supervised or unsupervised, rely on distance measures. Measures such as Euclidean distance and cosine similarity appear in algorithms such as k-NN, UMAP, and HDBSCAN.

Understanding distance measures is more important than you might realize. Take k-NN, a technique often used for supervised learning. By default, it often uses Euclidean distance, which is a great distance measure in its own right.

However, what if your data is high-dimensional? Would Euclidean distance still work? Or what if your data consists of geospatial information? Perhaps haversine distance would then be a better alternative!
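To make the geospatial question concrete, here is a minimal sketch in plain Python. The function names, the Earth radius constant, and the sample coordinates are my own assumptions for illustration. It compares Euclidean distance computed naively on (latitude, longitude) degrees with haversine distance, which measures along the surface of a sphere.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; the Earth is not a perfect sphere


def euclidean(p, q):
    """Straight-line distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Two hypothetical pairs, each 10 degrees of longitude apart:
# one pair on the equator, one pair at 60 degrees north.
equator_pair = ((0.0, 0.0), (0.0, 10.0))
north_pair = ((60.0, 0.0), (60.0, 10.0))

euclid_equator = euclidean(*equator_pair)
euclid_north = euclidean(*north_pair)
hav_equator = haversine_km(*equator_pair)
hav_north = haversine_km(*north_pair)

print(euclid_equator, euclid_north)  # identical when computed on raw degrees
print(hav_equator, hav_north)        # the northern pair is much closer in km
```

Euclidean distance on raw degrees treats both pairs as equally far apart, while haversine distance reflects that lines of longitude converge toward the poles.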

Knowing when to use which distance measure can help you go from a poor classifier to an accurate model.
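As a small illustration of how much the choice of measure can matter, the sketch below implements a toy k-NN in plain Python where the distance function is a parameter. The data points, labels, and function names are all invented for this example, and it is not how any particular library implements k-NN. It only shows that the same query can receive a different label depending on the measure.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Straight-line distance: sensitive to the magnitude of the vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def cosine_distance(p, q):
    """1 - cosine similarity: depends only on the angle between the vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)


def knn_predict(points, labels, query, k, metric):
    """Return the majority label among the k points closest to the query."""
    ranked = sorted(zip(points, labels), key=lambda pl: metric(pl[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points lying along the x-axis direction with
# large magnitude, and two points lying closer to the y-axis direction.
points = [(10.0, 1.0), (8.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
labels = ["x-ish", "x-ish", "y-ish", "y-ish"]

# The query points in the x-axis direction but has a small magnitude.
query = (2.0, 0.2)

pred_euclidean = knn_predict(points, labels, query, k=3, metric=euclidean)
pred_cosine = knn_predict(points, labels, query, k=3, metric=cosine_distance)

print(pred_euclidean)  # y-ish: the small-magnitude points are closest
print(pred_cosine)     # x-ish: the points with the same direction are closest
```

Neither answer is wrong in general. Which one is right depends on whether magnitude or direction carries the meaning in your data.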

In this article, we will go through many distance measures and explore how and when they can best be used. Most importantly, I will discuss their disadvantages so that you can recognize when to steer clear of certain measures.

NOTE: For most distance measures, long, elaborate papers could be, and have been, written on their use cases, advantages, and disadvantages. I will try to cover as much as…
