How to Find Duplicate or Similar Images quickly with Python

Somil Shah
May 9, 2020


Tired of cleaning a WhatsApp Images folder with what feels like billions of images? Check this article out for a quick way out!

Images and video files are among the biggest storage fillers and the most tedious to clean up. They are just everywhere!

Our Smartphones and Computers are filled with millions of images from various sources. Almost Everyone is facing the issue of running out of space but no one likes to do the hard work of manual cleaning.

Finding and Deleting Duplicates as well as Finding Similar Images

The Solution — Hashing

What’s Hashing?

Simply put, Hashing is the transformation of any data into a (usually shorter) fixed-length value or key that represents the original data.

Just as we have unique fingerprints, a good hash is effectively unique to its data: identical data always produces the same hash, and different data almost never does. There are lots of Hashing Algorithms out there, each catering to specific needs.

How is it relevant?

Because Hashing generates a fingerprint for each piece of data, we can use it to find duplicates: identical images will have the same fingerprint.

But there’s a small catch

Most Hashing Algorithms will work for finding exact duplicates, but very few can find similar images. Why? Because general-purpose hashes (such as MD5 or SHA) are designed so that even a small change in the data produces a very different hash.

What do we do then?

We need a Hashing Algorithm developed specifically for images, such as Average Hashing. It addresses the issue by producing hashes that differ only slightly for similar pictures.

Average Hashing

Average Hashing is a very powerful algorithm specifically made for images.

It works in these specific steps:

  1. Reduce the size of the image (for better performance and to remove high-frequency detail).
  2. Convert to grayscale.
  3. Compute the mean pixel value of the entire image.
  4. Use the mean as a threshold: each pixel becomes a 1 bit if it is brighter than the mean and a 0 bit otherwise.
  5. Construct the hash by joining these bits together.
Average Hash Working
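
The five steps above can be sketched by hand with Pillow and NumPy. This is only an illustration under a few assumptions: the function name average_hash_sketch, the Lanczos resampling filter, and the hex packing at the end are choices made here, and the ImageHash library's own implementation may differ in its details.

```python
import numpy as np
from PIL import Image


def average_hash_sketch(image, hash_size=8):
    """Return (bit matrix, hex string) for a PIL image.

    Hypothetical helper written for illustration only.
    """
    # Step 1: shrink the image to hash_size x hash_size pixels.
    small = image.resize((hash_size, hash_size), Image.LANCZOS)
    # Step 2: convert to grayscale ("L" = 8-bit luminance).
    gray = small.convert("L")
    pixels = np.asarray(gray, dtype=np.float64)
    # Step 3: mean pixel value of the whole (shrunken) image.
    mean = pixels.mean()
    # Step 4: threshold against the mean to get a matrix of True/False bits.
    bits = pixels > mean
    # Step 5: join the bits row by row and pack them into a hex string.
    bit_string = "".join("1" if bit else "0" for bit in bits.flatten())
    hex_width = (hash_size * hash_size + 3) // 4
    return bits, format(int(bit_string, 2), "0{}x".format(hex_width))
```

For an image whose left half is white and whose right half is black, every row of the bit matrix is 11110000, so the hex string is f0 repeated eight times.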

Know More

http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html

Let’s get to Code

Finding Duplicates

The ImageHash Library provides us with the Average Hash algorithm already, so it gets easy to implement.

We need Pillow (PIL) and NumPy as additional dependencies for the code.

The Working

We simply compare 2 hashes and if they are the same, one of them is a duplicate.

Reading the Code:

  1. fnames is a list containing the image file names.
  2. dirname is the directory in which the images are stored.
  3. The average_hash() function of the ImageHash library takes in the image and the hash_size (default 8).
  4. A hash_size of 8 means the image will be resized to an 8x8 matrix, giving a 64-bit hash. A larger hash_size captures more detail and so makes false matches less likely, but it costs more time and memory; a smaller one is faster but coarser.
  5. The hashes variable is a dictionary of the form {“Hash”: “Image”,…}. It maps each hash to the image that produced it.
  6. So, if the same hash is found again, the image is declared a duplicate and stored in the duplicates list.
  7. These duplicates can then be deleted easily.

You can find the code for this in my Github Repo:

Finding Similar Images

The Working:

As we saw earlier, the hash of an image is finally stored as a matrix of 0/1 bits. To measure how similar 2 images are, we compare their hashes using the Hamming Distance.

What is Hamming Distance?

Don’t be put off by the name, it’s very simple. The Hamming Distance between two bitstrings of equal length is the number of positions at which their bits differ. Let’s understand it further with the help of an example:

Let’s consider 2 bitstrings: 100 and 010

To get the Hamming Distance, we first take the XOR (exclusive OR) of these 2 bitstrings:

100 ⊕ 010 = 110

The number of 1s in the result is the Hamming Distance of these strings, i.e. 2.
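
The same calculation can be written in a few lines of plain Python. The function name hamming_distance is just a label chosen for this sketch, and it assumes both inputs are strings of 0s and 1s of equal length.

```python
def hamming_distance(bits_a, bits_b):
    """Count the positions at which two equal-length bitstrings differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bitstrings must have the same length")
    # XOR leaves a 1 exactly where the two inputs disagree.
    xor = int(bits_a, 2) ^ int(bits_b, 2)
    # Counting those 1s gives the Hamming Distance.
    return bin(xor).count("1")


print(hamming_distance("100", "010"))  # prints 2
```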

Know More

https://www.tutorialspoint.com/what-is-hamming-distance

Reading the Code:

  1. similarity is a parameter which you can change depending on how similar you want the images to be. For 70% similarity, pass 70.
  2. As we saw earlier, Average Hashing finally converts the images into 0/1 bit arrays.
  3. The core logic is that if at most (100 - similarity)% of the bits differ when comparing 2 images, the image is accepted as similar.
  4. NumPy’s “count_nonzero” helps us achieve this task with excellent performance.
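
The comparison described above can be sketched as follows. This is an assumption-laden illustration, not the repo code: the function name is_similar and its default of 80 are made up here, and the inputs are taken to be two 0/1 bit arrays of the same shape (with ImageHash, the bit array of a hash is available as its .hash attribute).

```python
import numpy as np


def is_similar(hash_bits_a, hash_bits_b, similarity=80):
    """Return True if the two bit arrays are at least `similarity`% alike.

    Hypothetical helper for illustration.
    """
    a = np.asarray(hash_bits_a, dtype=bool)
    b = np.asarray(hash_bits_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("hashes must have the same shape")
    # Largest number of differing bits we still accept,
    # i.e. (100 - similarity)% of all bits, rounded down.
    max_diff_bits = (100 - similarity) * a.size // 100
    # a != b is True wherever the bits disagree; count those positions.
    diff_bits = np.count_nonzero(a != b)
    return bool(diff_bits <= max_diff_bits)
```

With 64-bit hashes and similarity set to 80, up to 12 differing bits are accepted (20% of 64 is 12.8, rounded down).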

Again, the code is available in my Github Repo:
