How to identify and remove duplicate files with Python

Ewelina Fiebig
dida Machine Learning
Jan 20, 2021

Suppose you are working on an NLP project. Your input data probably consists of files in formats such as PDF, JPG, XML or TXT, and there are a lot of them. In large data sets it is not unusual for some documents with different names to have exactly the same content, i.e. they are duplicates. There can be various reasons for this; probably the most common one is improper storage and archiving of the documents.

Regardless of the cause, it is important to find the duplicates and remove them from the data set before you start labeling the documents.

In this blog post I will briefly demonstrate how the contents of different files can be compared using the Python module filecmp. After the duplicates have been identified, I will show how they can be deleted automatically.

Example documents

For the purpose of this presentation, let us consider a simple data set containing six documents.

Figure: the six example documents. From top left to bottom right: doc1.pdf, doc2.jpg, doc3.pdf, doc4.pdf, doc5.pdf and doc6.jpg.

We see that the documents “doc1.pdf”, “doc4.pdf” and “doc5.pdf” have exactly the same content. The same applies to “doc2.jpg” and “doc6.jpg”. The goal is therefore to identify and remove the duplicates “doc4.pdf”, “doc5.pdf” and “doc6.jpg”.

Finding the duplicates

The module filecmp offers a convenient function for this purpose: filecmp.cmp(f1, f2, shallow=True). It compares the files named f1 and f2 and returns True if they seem to be identical and False otherwise. The shallow parameter controls what the comparison is based on. With shallow=True, files whose os.stat() signatures (file type, size and modification time) are identical are taken to be equal without being read. Setting shallow=False makes the function compare the contents of the files.
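To make the effect of shallow=False concrete, here is a minimal sketch. The helper name same_content and the throwaway text files are assumptions made for illustration only; they are not part of the example data set.

```python
import filecmp
import os
import tempfile


def same_content(path_a, path_b):
    """Return True if both files contain exactly the same bytes."""
    # shallow=False makes filecmp compare the file contents instead of
    # trusting matching os.stat() signatures.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Small demonstration with temporary files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.txt")
    b = os.path.join(tmp, "b.txt")
    c = os.path.join(tmp, "c.txt")
    for path, text in [(a, "hello"), (b, "hello"), (c, "world")]:
        with open(path, "w") as fh:
            fh.write(text)
    print(same_content(a, b))  # True: same bytes under different names
    print(same_content(a, c))  # False: same size, different bytes
```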

Example Python code for finding the duplicates could therefore look like this:
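What follows is a minimal sketch of such a search, not necessarily the exact listing used for this post. It assumes that all documents sit in a single flat folder; the function name find_duplicate_classes and the directory argument are illustrative choices. The file names are sorted so that the result does not depend on the order in which the operating system lists them.

```python
import filecmp
import os


def find_duplicate_classes(directory):
    """Group the files in `directory` into lists of files with identical content."""
    # Sorting makes the result independent of the OS-specific listing order.
    file_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )
    classes = []
    for name in file_names:
        path = os.path.join(directory, name)
        for class_ in classes:
            # Comparing against one representative per class is sufficient.
            representative = os.path.join(directory, class_[0])
            if filecmp.cmp(path, representative, shallow=False):
                class_.append(name)
                break
        else:
            # No existing class matched, so this file starts a new class.
            classes.append([name])
    return classes


# Example call (the folder name "data" is a placeholder):
# print(find_duplicate_classes("data"))
```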

Output:

[['doc1.pdf', 'doc4.pdf', 'doc5.pdf'], ['doc2.jpg', 'doc6.jpg'], ['doc3.pdf']]

The output above is a list of the identified “equivalence classes”, i.e. lists of documents with the same content. Note that it is enough to compare a given document with only one representative from each class, e.g. the first one, class_[0], because all members of a class have identical content.

We learn, for example, that the document “doc1.pdf” has the same content as the documents “doc4.pdf” and “doc5.pdf”. Furthermore, the document “doc2.jpg” has the same content as “doc6.jpg” and the document “doc3.pdf” has no duplicates. All this corresponds to what we have observed in the image above.

Removing duplicates

The next step is to remove the duplicates “doc4.pdf”, “doc5.pdf” and “doc6.jpg”. Example Python code that accomplishes this task could look like this:
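Here is a minimal sketch of the removal step. It assumes the equivalence classes are available as a list of lists of file names, in the format of the output shown earlier; the function name remove_duplicates and its parameters are illustrative. The first file of each class is kept and the rest are deleted permanently, so it is worth trying this on a copy of the data first.

```python
import os


def remove_duplicates(directory, classes):
    """Delete all but the first file of each class; return the removed names."""
    removed = []
    for class_ in classes:
        # class_[0] is kept as the single surviving copy.
        for name in class_[1:]:
            os.remove(os.path.join(directory, name))
            removed.append(name)
    return removed
```

For the example data set, this would delete “doc4.pdf”, “doc5.pdf” and “doc6.jpg” and leave “doc1.pdf”, “doc2.jpg” and “doc3.pdf” in place.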

There are certainly other ways to write the code or generally compare files. In this article I simply demonstrated one of the many possibilities.

I would also like to encourage you to take a closer look at the filecmp module. In addition to the filecmp.cmp() function, it also offers other functions such as filecmp.cmpfiles(), which compares files with the same names in two directories and may therefore suit your needs even better.
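As a rough illustration of filecmp.cmpfiles(), the sketch below compares the files that exist under the same name in two folders. The helper name compare_common_files and the two directory arguments are assumptions for this example. Because cmpfiles() pairs files by name, it does not find duplicates that are stored under different names.

```python
import filecmp
import os


def compare_common_files(dir_a, dir_b):
    """Compare files with the same name in two directories by content."""
    # Only names present as regular files in both directories are compared.
    common = sorted(
        name for name in os.listdir(dir_a)
        if os.path.isfile(os.path.join(dir_a, name))
        and os.path.isfile(os.path.join(dir_b, name))
    )
    # cmpfiles returns three lists of names: matching files, differing
    # files, and files that could not be compared.
    match, mismatch, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)
    return match, mismatch, errors
```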

Originally published at https://dida.do.
