Detecting duplicate images using Python
With thousands of icons being uploaded to Iconfinder.com every month, the risk of pirated content also increases. To keep out the swindlers we have been working on a new clever image duplication technique.
One of the features that came out of our little hackaton and will be rolling out in the next couple of weeks is the ability to detect duplicate icons upon submission. So, for example if a user downloads an existing icon and tries to re-upload it in order to sell it and make a profit (yes, we had some of those cases) we will be able to detect it as a duplicate of an already existing icon and mark the account as fraudulent.
A frequently used solution to identify if a given file already exists in a large collection of files is to calculate a hash for each individual file in the dataset, store the hashes in a database and then when we want to locate a particular file calculate the hash for that file and look up to see if it’s hash value exists in the database.
Picking a hashing algorithm
A commonly used type of hash algorithms for this are the cryptographic hashing algorithms. Implementations of the most common used ones like MD5, SHA-1, SHA-256 are available in the standard library of almost any programming language and they are really easy to use and work really well for the most simple use cases.
For instance, in Python you would simply import the hashlib module and call a function to generate the hash value for a particular string or file.
>>> import hashlib# Calculating the hash value of a string.
>>> hashlib.md5('The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'# Loading an image file into memory and calculating it's hash value.
>>> image_file = open('data/cat_grumpy_orig.png').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'
This would work fine and dandy if we were sure the files uploaded aren't tampered with but due to the nature of how cryptographic hashing algorithms work even the slightest change to the input will result in an avalanche effect so the generated hash of the new file will be totally different from the one of the original file.Let's take an example where we add a period at the end of the sentence.# Original text.
>>> hashlib.md5('The quick brown fox jumps over the lazy dog').hexdigest()
'9e107d9d372bb6826bd81d3542a419d6'# Slight modification of the text.
>>> hashlib.md5('The quick brown fox jumps over the lazy dog.').hexdigest()
'e4d909c290d0fb1ca068ffaddf22cbd0'
As you can see in the example above, the values 9e107d9d372bb6826bd81d3542a419d6 and e4d909c290d0fb1ca068ffaddf22cbd0 are almost completely different (except for few characters).In the case of images if the background color is changed, the image is cropped or rotated or if just one pixel is modified out of the original image, we won't be able to match the hash of the image to an already existing one. This is not really practical since we want to detect similar images, even if they have been modified a little.For example when we modify the color of the nose on the cat the computed hash value will be mostly different.[caption id="attachment_4515" align="alignleft" width="256"]
Original image[/caption][caption id="attachment_4512" align="alignleft" width="256"]
Modified image[/caption]# Load the original image into memory and calculate it's hash value.
>>> image_file = open('data/cat_grumpy_orig.png').read()
>>> hashlib.md5(image_file).hexdigest()
'3e1f6e9f2689d59b9ed28bcdab73455f'# Load the modified image into memory and calculate it's hash value.
>>> image_file_modified = open('data/cat_grumpy_modif.png').read()
>>> hashlib.md5(image_file_modified).hexdigest()
'12d1b9409c3e8e0361c24beaee9c0ab1'
What we needed for our use case is a perceptual hashing algorithm. A perceptual hashing algorithm that takes a fingerprint of a multimedia file by deriving it from various features from its content so it can take into account transformations on a given input and yet be flexible enough to distinguish between dissimilar files.There are a number of perceptual hashing algorithms but the one that I'm gonna present is the dHash (difference hashing) algorithm which computes the difference in brightness between adjacent pixels, identifying the relative gradient direction.dHashBefore we dive into the algorithm, let's start with some basics. A color image is comprised out of red, green and blue (RGB) pixels and we can think of it as a list of sets of red, green and blue values. For example, let's take an image, load it using Python's imaging library (PIL) and print the pixel values.[caption id="attachment_4518" align="aligncenter" width="300"]
Test image[/caption]>>> from PIL import Image
>>> test_image = Image.open('data/test_image.jpg')# The image is an RGB image with a size of 8x8 pixels.
>>> print 'Image Mode: %s' % test_image.mode
Image Mode: RGB
>>> print 'Width: %s px, Height: %s px' % (test_image.size[0], test_image.size[1])
Width: 4 px, Height: 4 px# Get the pixel values from the image and print them into rows based on
# the image's width.
>>> width, height = test_image.size
>>> pixels = list(test_image.getdata())
>>> for col in xrange(width):
... print pixels[col:col+width]
...
[(255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)]
[(0, 0, 0), (212, 45, 45), (51, 92, 154), (130, 183, 47)]
[(206, 210, 198), (131, 78, 8), (131, 156, 180), (117, 155, 201)]
[(104, 133, 170), (215, 130, 20), (153, 155, 155), (104, 142, 191)]
The first three pixels have an intensity value of 255 for red, green and blue respectively and 0 for other values, white has all three values at their maximum intensity (255) and black has all three values at their lowest intensity (0). The rest of the color pixels are a combination of red, green and blue at different intensities.Now let's get back on track, the dHash algorithm has four steps, I'm gonna go through each one and illustrate it's effects on the original and modified image.1. Grayscale the image.By grayscaling the image we reduce each pixel value to a luminous intensity value. For example, a white pixel (255, 255, 255) will be reduced to an intensity value of 255 and a black pixel (0, 0, 0) will be reduced to an intensity value of 0.[caption id="attachment_4516" align="alignleft" width="256"]
Original image (after step 1)[/caption][caption id="attachment_4513" align="alignleft" width="256"]
Modified image (after step 1)[/caption]2. Shrink the image to a common size.By shrinking the image to a common base size, for example 9x8 pixels, where the width is 1px larger than the height (you'll understand why the odd size in step 3). This way we remove all the high frequencies and details of the image and we are left with a sample size of 72 intensity values. This also means that resizing or stretching an image won't affect it's hash value, all images are normalized to a common size.[caption id="attachment_4517" align="alignleft" width="180"]
Original image (after step 2)[/caption][caption id="attachment_4514" align="alignleft" width="180"]
Modified image (after step 2)[/caption]3. Compare adjacent pixels.After the previous two steps we are left with a list containing intensity values, we then compare each intensity value to it's adjacent value for each row resulting in an array of binary values.>>> from PIL import Image
>>> img = Image.open('data/cat_grumpy_orig_after_step_2.png')
>>> width, height = img.size
>>> pixels = list(img.getdata())
>>> for col in xrange(width):
... print pixels[col:col+width]
...
[254, 254, 255, 253, 248, 254, 255, 254, 255]
[254, 255, 253, 248, 254, 255, 254, 255, 255]
[253, 248, 254, 255, 254, 255, 255, 255, 222]
[248, 254, 255, 254, 255, 255, 255, 222, 184]
[254, 255, 254, 255, 255, 255, 222, 184, 177]
[255, 254, 255, 255, 255, 222, 184, 177, 184]
[254, 255, 255, 255, 222, 184, 177, 184, 225]
[255, 255, 255, 222, 184, 177, 184, 225, 255]The first value (254) will be compared to the second value (254), the second value to the third and so on, giving us 8 boolean values per row.>>> difference = []
>>> for row in xrange(height):
... for col in xrange(width):
... if col != width:
... difference.append(pixels[col+row] > pixels[(col+row)+1])
...
>>> for col in xrange(width-1):
... print difference[col:col+(width-1)]
...
[False, False, True, True, False, False, True, False]
[False, True, True, False, False, True, False, False]
[True, True, False, False, True, False, False, False]
[True, False, False, True, False, False, False, True]
[False, False, True, False, False, False, True, True]
[False, True, False, False, False, True, True, False]
[True, False, False, False, True, True, False, False]
[False, False, False, True, True, False, False, True]
4. Convert the difference into bits.
To make the hash easy to store and use, we convert it into a hexadecimal string. A True value will have a binary value of 1 and a False value will have one of 0.
Python implementation
A full implementation of the algorithm in Python:def dhash(image, hash_size = 8):
# Grayscale and shrink the image in one step.
image = image.convert('L').resize(
(hash_size + 1, hash_size),
Image.ANTIALIAS,
)pixels = list(image.getdata())# Compare adjacent pixels.
difference = []
for row in xrange(hash_size):
for col in xrange(hash_size):
pixel_left = image.getpixel((col, row))
pixel_right = image.getpixel((col + 1, row))
difference.append(pixel_left > pixel_right)# Convert the binary array to a hexadecimal string.
decimal_value = 0
hex_string = []
for index, value in enumerate(difference):
if value:
decimal_value += 2**(index % 8)
if (index % 8) == 7:
hex_string.append(hex(decimal_value)[2:].rjust(2, '0'))
decimal_value = 0return ''.join(hex_string)Comparing hashesIn the most simple case where the images differ only slightly the hashes most likely will be the same so we can directly compare them.>>> from PIL import Image
>>> from utility import dhash, hamming_distance
>>> orig = Image.open('data/cat_grumpy_orig.png')
>>> modif = Image.open('data/cat_grumpy_modif.png')
>>> dhash(orig)
'4c8e3366c275650f'
>>> dhash(modif)
'4c8e3366c275650f'
>>> dhash(orig) == dhash(modif)
TrueIf we kept an SQL database with our hashes and we wanted to see if the '4c8e3366c275650f' hash existed in our database we could simply do something like this:SELECT pk, hash, file_path FROM image_hashes
WHERE hash = '4c8e3366c275650f';Now, in the case that the images differ a bit more and the hashes are not exactly the same we need to compare them by calculating the minimum number of substitutions required to change one string into the other, this is called a hamming distance.On the Wikipedia page there is some sample Python code that computes the hamming distance between two string but we can calculate the hamming distance and query based on it directly in MySQL:SELECT pk, hash, BIT_COUNT(
CONV(hash, 16, 10) ^ CONV('4c8e3366c275650f', 16, 10)
) as hamming_distance
FROM image_hashes
HAVING hamming_distance < 4
ORDER BY hamming_distance ASC;It will do a binary XOR of the two values, the first one from the database and the second one the one that we query, and count the bits that are different amongst the two values. Because BIT_COUNT works only on integers and we stored the hash as a hex string (base 16) we have to convert it to a decimal value (base 10), the same goes for the value that we want to query by.
Ending words
Even though I've presented the algorithm using Python snippets the algorithm is pretty general and can be implemented in any other programming language.As I said in the introduction, we will be using the algorithm in the future on Iconfinder to warn us if a submitted icon already exists in our collection but it can also have other practical uses. For example, because images with similar features have the hamming distance of their hashes close, it can also be used as a basic recommendation system where the recommended images are within a certain hashing distance of the current image.