The Dissimilarity of Numeric Data

Ali Fallahi
5 min read · Jul 26, 2020


Abstract

Calculating the differences between objects is one of the fundamental tasks in data mining, and several measures exist for computing the dissimilarity of numeric data. In this article, we discuss the Euclidean and Manhattan distances, the two most common dissimilarity measures for objects described by numeric attributes. A separate section covers the Minkowski distance, which generalizes both.

1. Introduction

With the rapid growth of data on the internet in recent years, users have faced increasing difficulty finding what they are looking for. Recommender systems, a subclass of information filtering systems, help users reach what they need faster and more easily. The most common technique for building a recommender system is collaborative filtering (CF). To illustrate the CF technique, consider an example: user A watched and liked Harry Potter, while users B and C liked Harry Potter and also watched and loved Lord of the Rings. Because user A shares the opinion of users B and C about Harry Potter, CF infers that user A is also likely to enjoy Lord of the Rings. In this example, a recommender system based on collaborative filtering would recommend Lord of the Rings to user A.

This raises an essential question: how can we calculate the similarity or dissimilarity between objects? In the following sections, we explain some of the most popular distance measures for objects described by numeric attributes: the Euclidean and Manhattan distances, and the Minkowski distance as a generalization of the two.

The rest of this article is organized as follows: Section 2 explains dissimilarity measures for numeric data. In Section 3, we review the measures through some examples. Finally, we conclude in Section 4.

2. Dissimilarity measures in numerical data

2.1 Euclidean distance

The first and most common measure for calculating the dissimilarity of numeric data is the Euclidean distance, also known as the distance “as the crow flies.”

Example of the Euclidean distance

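The Euclidean distance is the length of the straight line between the starting point and the destination. If we consider two objects i and j, each described by p numeric attributes, as follows

$$i = (x_{i1}, x_{i2}, \ldots, x_{ip}), \qquad j = (x_{j1}, x_{j2}, \ldots, x_{jp})$$

the Euclidean distance between these two objects can be calculated from the formula below:

$$d(i, j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$$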
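As a minimal Python sketch of this calculation, assuming each object is given as a sequence of numeric attribute values of equal length. The function name `euclidean_distance` and the sample points are illustrative and do not come from the article.

```python
import math

def euclidean_distance(i, j):
    """Straight-line distance between two objects with numeric attributes."""
    if len(i) != len(j):
        raise ValueError("both objects must have the same number of attributes")
    # Sum the squared differences attribute by attribute, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

# Illustrative points (an assumption, not from the article): a 3-4-5 right triangle.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```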

2.2 Manhattan distance

Another well-known dissimilarity measure is the Manhattan distance, also known as the taxicab or city block distance. In contrast to the Euclidean distance, the Manhattan distance counts the city blocks we need to pass when moving from the starting point to the destination.
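In symbols, under the standard definition, the Manhattan distance between two objects i and j described by p numeric attributes is the sum of the absolute differences of their attribute values. The notation x_{if} for the value of attribute f of object i is an assumption made here for illustration:

$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$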

For instance, on the map below, to reach the destination the taxi driver first has to move five blocks to the left and then three blocks north.

Example of the Manhattan distance
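The taxi route described above can be checked with a short Python sketch. The coordinates are assumptions made for illustration: the start is placed at the origin and the destination five blocks to the left and three blocks north of it. The function name `manhattan_distance` is likewise hypothetical.

```python
def manhattan_distance(i, j):
    """City-block distance: the sum of absolute attribute differences."""
    if len(i) != len(j):
        raise ValueError("both objects must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(i, j))

start = (0, 0)          # assumed origin
destination = (-5, 3)   # five blocks to the left, three blocks north
print(manhattan_distance(start, destination))  # 8
```

For comparison, the straight-line (Euclidean) distance for the same trip is the square root of 5² + 3² = 34, roughly 5.83 blocks, which is shorter than the 8 blocks the taxi has to drive.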

Important tips about Euclidean and Manhattan distances

Both the Euclidean and Manhattan distances satisfy some important mathematical properties:

  • Non-negativity

The distance between two objects i and j is always greater than or equal to zero: d(i,j) >= 0

  • Identity of indiscernibles

The distance from any object to itself is zero: d(i,i) = 0

  • Symmetry

Distance is a symmetric measure. d(i,j) = d(j,i)

  • Triangle inequality

Based on the triangle inequality, the direct distance from i to j cannot be greater than the distance from i to j via a detour through k: d(i,j) <= d(i,k) + d(k,j)

Equivalently, each side of a triangle is at least as long as the absolute difference between the other two sides: d(i,j) >= |d(i,k) - d(k,j)|
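These properties can be spot-checked numerically. The sketch below is only an illustration on a handful of assumed sample points, not a proof. The helper name `satisfies_metric_properties` and the tolerance `tol` (used to absorb floating-point rounding) are hypothetical.

```python
import math
from itertools import product

def euclidean(i, j):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))

def manhattan(i, j):
    return sum(abs(a - b) for a, b in zip(i, j))

def satisfies_metric_properties(d, points, tol=1e-9):
    """Check the four properties for every triple drawn from the given points."""
    for i, j, k in product(points, repeat=3):
        if d(i, j) < 0:                        # non-negativity
            return False
        if d(i, i) > tol:                      # distance to itself is zero
            return False
        if abs(d(i, j) - d(j, i)) > tol:       # symmetry
            return False
        if d(i, j) > d(i, k) + d(k, j) + tol:  # triangle inequality
            return False
    return True

# Sample points chosen for illustration only.
points = [(0, 0), (3, 4), (-5, 3), (1.5, -2)]
print(satisfies_metric_properties(euclidean, points))  # True
print(satisfies_metric_properties(manhattan, points))  # True
```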

2.3 Minkowski distance

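The Minkowski distance is a generalization of the Euclidean and Manhattan distances. For two objects i = (x_{i1}, ..., x_{ip}) and j = (x_{j1}, ..., x_{jp}) described by p numeric attributes, it is defined as

$$d(i, j) = \sqrt[h]{|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h}$$

where h is a real number greater than or equal to 1 (h >= 1).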

In the Minkowski distance formula, h = 1 gives the Manhattan distance, and h = 2 gives the Euclidean distance.
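The two special cases can be illustrated with a small Python sketch. The function name `minkowski_distance` and the sample points are assumptions made for illustration. The points reuse the hypothetical layout of five blocks left and three blocks north.

```python
def minkowski_distance(i, j, h):
    """Minkowski distance of order h (h >= 1) between two numeric objects."""
    if h < 1:
        raise ValueError("h must be greater than or equal to 1")
    if len(i) != len(j):
        raise ValueError("both objects must have the same number of attributes")
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

i, j = (0, 0), (-5, 3)
print(minkowski_distance(i, j, 1))  # Manhattan case: 5 + 3 = 8
print(minkowski_distance(i, j, 2))  # Euclidean case: sqrt(34), about 5.83
```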

Supremum distance

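The supremum distance (also called the Chebyshev distance) is the limit of the Minkowski distance as h approaches infinity. It equals the largest difference between the two objects in any single attribute:

$$d(i, j) = \lim_{h \to \infty} \left( \sum_{f=1}^{p} |x_{if} - x_{jf}|^h \right)^{1/h} = \max_{f} |x_{if} - x_{jf}|$$

where x_{if} is the value of attribute f for object i and p is the number of attributes. The supremum distance is helpful when we want the maximum difference between two objects across their attributes.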
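A brief Python sketch can make the limiting behavior concrete. The function names and sample points are assumptions for illustration. The loop prints the Minkowski distance for growing h, which shrinks toward the supremum distance.

```python
def supremum_distance(i, j):
    """Largest absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(i, j))

def minkowski_distance(i, j, h):
    """Minkowski distance of order h (h >= 1)."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

i, j = (0, 0), (-5, 3)
print(supremum_distance(i, j))  # 5

# As h grows, the Minkowski distance approaches the supremum distance.
for h in (1, 2, 10, 100):
    print(h, minkowski_distance(i, j, h))
```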

3. Examples

Now it's time to review and test the measures above with some examples.

Example of calculating the dissimilarity of numeric data

Considering the picture above, the Euclidean, Manhattan, and supremum distances between the start and destination points can be calculated as follows:

  • Euclidean:
  • Manhattan:
  • Supremum:

4. Conclusion

In this article, we discussed the dissimilarity of numeric data. The Euclidean, Manhattan, and Minkowski distances are some of the most common measures for calculating it. Finally, some examples were presented to clarify the topic.

The main reference for this article was:

Data Mining: Concepts and Techniques

Authors:
Jiawei Han, Micheline Kamber and Jian Pei


Ali Fallahi

I am a Ph.D. student in Computer Software Engineering. Generally, here I write about recommender systems and machine learning.