Software Intelligence: Equivalence Relations and Deduping your Music Library

Dimitri Papaioannou
Software Science
Published in
5 min readJul 8, 2020

Motivation

One way or another, I always end up with duplicate songs or pictures in my digital libraries, as well as duplicate contacts in my address book. For music, two songs having the same title does not necessarily make them the same song, so a more complex relation is required. I have developed some techniques for deduping and what they all have in common is that they are looking for equivalence relations. In this article I will introduce equivalence relations and apply them to the simple example of deduping a music library.

The Equivalence Relation

Relations are extensively used in Computer Science. For example, in a Relational Database, a relation is established from the table Song to the table Artist by creating a foreign key. In a graph, node x is relate to node y if there is an edge from x to y. In Object Oriented design, class Song can be related to class Artist via Composition if Song contains Artist as a field. The examples are countless.

In Set Theory a relation R on a Set A is a set of ordered pairs {(x,y)|x,yA}. Consequently, a relation is simply a subset of A×A, the Cartesian product of A with itself.

Some relations have interesting properties that make them special. A relation R is reflexive if (x,x)∈R,∀xR. In other words, if each element of A is related to itself. For example, the relation ≥ in the set of real numbers is reflexive since xx is true for every real number. Similarly, the relation “Craig is as fast as Nancy” defined on the set of people is reflexive, since I am as fast as myself, even though sometimes I get ahead of myself.

A relation R is symmetric if (x,y)∈R⟹(y,x)∈R, which means that if
x if related to y then y is also related to x. Friendships in Facebook are considered to be symmetric since if Bob is friends with Tim, then Tim is also a friend of Bob, even though in real life friendships may not be as reciprocal. Following on Twitter are not symmetric since you may be following Kim Kardasian but, since you are reading this blog, it is very unlikely that she is following you.

A relation R is transitive if whenever (x,y)∈R and (y,z)∈R, then (x,z)∈R. For example is taller is a transitive relation since if Tim is taller than Bob and Bob is taller than Greg, then Tim is also taller than Greg. The relations ≥,≤,⊂, ⊃ are other famous transitive relations, and in general, every type of comparison is expected to be transitive.

This leads us to the relation that has it all: A relation that is reflexive, symmetric, and transitive, is an Equivalence Relation. Equivalent relations are omnipresent in Math and Software, though not so much in real life where some animals are always more equal than others. For example, modular arithmetic is an equivalence on the set of integers, equality of norm defines an equivalence on a set of vectors, connectivity defines an equivalence on a graph.

When two elements of the set A are related via an equivalence relation, we say that they are equivalent and the relationship is denoted by xy.

Given a set A and an equivalence relation R, for any element xA, the set of all the elements equivalent to x is called the equivalence class of x and is denoted by [x]. Each equivalence class is a subset of A. The following theorem is very important, albeit rather obvious. I supply the proof at the end of this article.

Theorem: The collection of all equivalence classes under an equivalence relation R partitions A into disjoint subsets.

Note: A partition of a set A is collection of disjoint subsets Bi, such that ∪iBi=A.

Finding duplicates

Now let’s look at how equivalences are used to find duplicates in a music library. As I mentioned in the introduction, the same title does not necessarily identify two songs as duplicates, especially if the title is “Track 1”. I have noticed that if both the title and the size of the track match, then they are almost certainly duplicates.

So I can identify duplicates by creating an equivalence relation on the set of my music files. I define two mp3s to be equivalent if the have the same title and the same size. Then I partition the set of songs into subsets using this equivalence relation. Any subset that has more than one element consists of duplicate songs.

So if my song class is this (Kotlin):

Then I can use this code to find duplicates:

The succinctness of Kotlin perhaps makes it difficult to follow how the task is accomplished. Here is the same code in Java:

I represent the equivalence class by an an object identified by title and size. The data class in Kotlin gives me equals() and hashCode() for free, but in Java I need to explicitly define them to use this object as a key to a map.

In the following code, I construct a map where the key is the equivalence class and the value is the list of songs that belong to that equivalent relation, which I build as I iterate through the list of songs.Finally, I retain only the equivalence classes that contain more than one element. These are the ones that contain duplicates.

Admittedly, in this case I got lucky in terms of performance. Because I am able to identify each equivalence class by looking up a key in a map, which is an O(1) operation, the complexity of the code is O(n). In other situations, looking up the equivalence class may not be as straightforward. For example, when I am looking for duplicate contacts in an address book, two contacts will be identical if they share one of many email addresses or if they share one of many phone numbers. It is not possible to determine this relation by a single lookup, so I would have to become more creative, since a brute force approach will result in O(n²) complexity. I will leave this harder problem for another blog post.

Proof of the Theorem

The proof of the theorem can be found in the original blog post on blogger: https://tecacet.blogspot.com/2020/01/equivalence-relations-and-deduping-your.html

Originally published at http://tecacet.blogspot.com.

--

--

Dimitri Papaioannou
Software Science

I am a software scientist, applied mathematician, and physics enthusiast . My passion is to develop solutions at the intersection of software and mathematics.