Premise:- Urbanclap gives professionals a platform to showcase their work in the form of images, and this has caused the media we handle to explode in the recent past. We currently have about 14 million images, growing at roughly 10 percent a month. Many cases of plagiarism were being reported by professionals themselves, leading to a burgeoning demand for an automated system that could detect similar images as and when they are added. What was required was the foundation of a solution that could effectively detect duplicates with an algorithm that fits the constraints of the system we already had. Even though our data was increasing at such a rapid rate, we wanted to avoid setting up Hadoop clusters for distributed processing as much as possible, and therefore needed an algorithm that could bypass that requirement.
Conceptualizing Similar Images:- Before deciding whether a pair of images is in fact similar, we need to define what we mean by 'similar' in some logical context. Operations on images can be broadly classified into manipulations, which do not alter image authenticity (e.g. rotation), and modifications, which do (e.g. minor or major changes in the information content). Two images separated by a minor modification can be said to be similar, but a series of even minor modifications can change the essential information of the original image, rendering the result 'not similar' to it. On the other hand, no number of manipulations renders the result 'non-similar' to the original. Manipulation is thus inherently transitive, whereas modification is not. Ideally, any algorithm should be tweaked to handle both.
Hash Function Selection:- The selection of a hash function should be guided by several factors:-
- No or minimal avalanche effect:- Unlike cryptographic hash functions, these should not be susceptible to minor 'modifications': a small change in the input should produce only a small change in the hash. Here is an explanation of the avalanche effect.
- Realistic sizing:- The size of the image hash should be tuned to our needs. For instance, a hash on the order of the bytecode representation of the image itself is not feasible: even though it would capture the image data in its entirety, it would pose realistic storage and retrieval constraints.
- Hashes produced should be comparable:- It is imperative that hashes be comparable in the sense that we can quantify the similarity of a pair of images. The comparison metric may be proportional to either the difference or the similarity of the images; which one to use necessarily depends on the hash function. In our system, for simplicity and for processing requirements that I shall explain later, we settled on the Hamming distance between hashes (the number of differing bits) as a measure of disparity between image fingerprints. There were other possible choices, such as cross-correlation, which measures similarity instead. Moving on, these hashes should also support an algorithm that, for the purpose of comparison, can sweep over the massive dataset and yield the similarity set in linear time.
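As a minimal sketch of the comparison we settled on, with hashes represented as plain Python integers, the Hamming distance is just the number of set bits in the XOR of the two values:

```python
def hamming_distance(h1: int, h2: int) -> int:
    # XOR leaves a 1 exactly at each bit position where the hashes
    # disagree; counting those 1s gives the Hamming distance.
    return bin(h1 ^ h2).count("1")

# Two toy 16-bit hashes differing in exactly three bit positions.
a = 0b1011001110001111
b = 0b1011011110001010
print(hamming_distance(a, b))  # prints 3
```

A small distance means the fingerprints (and, hopefully, the images) are close; identical hashes give a distance of zero.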
Now, there are various implementations of perceptual hash functions, such as the discrete cosine transform (mostly used in JPEG image compression, as it preserves essential media information), hashes based on the radial variance of pixels, etc. I would strongly recommend going through this diploma work by Christoph Zauner for an in-depth understanding.
I shall try to explain this part in as much detail as possible, as I believe the audience of this blog will be particularly interested in it. Several factors, such as resource constraints, time complexity issues, the availability of efficient open-source hash generation libraries, and the unique requirements of the dataset we process, ultimately shaped and directed the judgments we made.
# Hash Function:-
We can come up with our own implementation of a hash function or simply use some of the already available ones. You can go through this or this as some of the open-source libraries, although the choice shall largely depend on factors like your dataset and calculation time (if your algorithm requires hash creation in real time). You can also go through this interesting article comparing such metrics to help with your choice. Since in our system we consciously decoupled the hash generation phase from the hash comparison phase, calculation time did not significantly matter.
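To make "our own implementation" concrete, here is a sketch of one of the simplest perceptual hashes, the average hash. This is an illustrative stand-in, not the function we use in production: it assumes the image has already been resized to a small grayscale grid (a real library handles the resizing and grayscale conversion), and each bit records whether a pixel is above the mean intensity, so a minor modification flips only a few bits rather than avalanching:

```python
def average_hash(pixels):
    """Average hash of a pre-scaled grayscale image.

    `pixels` is a 2-D list of grayscale values (e.g. an 8x8 grid that a
    real pipeline would produce by resizing the image first).
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    h = 0
    for p in flat:
        # One bit per pixel: 1 if brighter than the mean, else 0.
        h = (h << 1) | (1 if p > mean else 0)
    return h

# A toy 4x4 "image": bright top half, dark bottom half.
img = [
    [200, 210, 205, 220],
    [198, 201, 209, 215],
    [10, 12, 8, 15],
    [9, 11, 14, 7],
]
print(format(average_hash(img), "016b"))  # prints 1111111100000000
```

Note how the hash directly mirrors the image structure: the eight bright pixels map to the eight set bits.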
# Hash Generation Pipeline:-
I strongly advise decoupling the hash generation phase from the other computation and rendering phases, as depending on the library you use it can be intensive in terms of both memory and time. You can push images through an event system like Kinesis (again, if using AWS Lambda, beware of the strict size and time constraints Amazon places on Lambda, which your hash generation algorithm or library may or may not fit within) or Kafka and compute hashes on the run. We ourselves therefore finally maintained an entire database of image hashes, populated through events, that was later used for batch processing.
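The shape of that decoupling can be sketched as follows. This is a toy in-process version: the queue stands in for Kafka/Kinesis, the dict stands in for the hash database, and `fake_hash`, `on_image_uploaded`, and `hash_worker` are all illustrative names, not our real code:

```python
import queue

# Stand-ins for the event stream and the hash database.
events = queue.Queue()
hash_db = {}  # image_id -> perceptual hash

def fake_hash(image_bytes: bytes) -> int:
    # Placeholder for a real (and possibly expensive) perceptual
    # hash library call.
    return sum(image_bytes) % 256

def on_image_uploaded(image_id: str, image_bytes: bytes):
    # The upload path only enqueues an event; no hashing happens
    # inside the request cycle.
    events.put((image_id, image_bytes))

def hash_worker():
    # Runs separately (a worker fleet or consumer group in practice),
    # draining events and populating the hash database that the batch
    # comparison phase reads later.
    while not events.empty():
        image_id, image_bytes = events.get()
        hash_db[image_id] = fake_hash(image_bytes)

on_image_uploaded("img-1", b"\x01\x02\x03")
hash_worker()
print(hash_db)  # prints {'img-1': 6}
```

The key property is that the upload path never blocks on hash computation, and the hash database becomes the single input to everything downstream.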
# Similarity Threshold identification:-
This is the most crucial step, and it can vary across datasets, hash functions, hash sizes, etc. As mentioned before, we settled on the Hamming distance between two hashes. It took several iterations of generating similarity pairs on test data before we settled on a Hamming distance (bit difference between image hashes) threshold that could justifiably capture similar image pairs while refraining from producing pairs that were not perceptually similar. Luckily, this number was small enough for us to exploit later in an optimized approach for generating similar pairs on a large dataset.
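One simple way to run those iterations, assuming you have hand-labeled test pairs (the labels and distances below are made up for illustration), is to sweep candidate thresholds and keep the one that classifies the most pairs correctly:

```python
# Hypothetical labeled test pairs: (hamming distance, judged similar by a human?).
labeled = [(0, True), (2, True), (3, True), (5, False), (9, False), (12, False)]

def score(threshold: int) -> int:
    # A pair is predicted "similar" when its distance is within the
    # threshold; count how many predictions match the human label.
    return sum((d <= threshold) == similar for d, similar in labeled)

# Sweep all plausible thresholds and keep the best-scoring (smallest wins ties).
best = max(range(13), key=score)
print(best, score(best))  # prints 3 6
```

In practice you would weigh false positives and false negatives differently, but the loop structure is the same; the important outcome for us was that the winning threshold came out small.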
# Similar Pair generation approaches:-
Memory-intensive approach:- In our case, since we are calculating the Hamming distance between two image hashes, which is in fact the bit offset between two binary numbers, we used a BK-tree: a data structure that can find the set of hashes within a given offset of a particular hash in really efficient time, O(l1*l2*log n), where l1 is the length of the hash being searched, l2 is the average length of the hashes in the structure, and n is the number of hashes in it. This approach, although fast, requires the entire dataset of hashes to be held in memory and compared against the query hash. It is certainly not the most scalable solution, for two basic reasons: first, it requires an in-memory data dump; second, it requires vertical scaling of machines as the dataset grows (which is in fact happening at a rapid pace for us).
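A minimal BK-tree over Hamming distance looks like this (a sketch, not our production code). The structure works for any discrete metric: each child edge is labeled with its distance to the parent, and the triangle inequality lets a query prune every subtree whose edge label falls outside `[d - threshold, d + threshold]`:

```python
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

class BKTree:
    """BK-tree over a discrete metric, here Hamming distance."""

    def __init__(self):
        self.root = None  # node = (hash, {edge_distance: child_node})

    def add(self, h: int):
        if self.root is None:
            self.root = (h, {})
            return
        node = self.root
        while True:
            d = hamming(h, node[0])
            if d in node[1]:
                node = node[1][d]      # descend along the existing edge
            else:
                node[1][d] = (h, {})   # attach a new leaf at distance d
                return

    def query(self, h: int, threshold: int):
        # Return all stored hashes within `threshold` of `h`.
        results, stack = [], [self.root] if self.root else []
        while stack:
            value, children = stack.pop()
            d = hamming(h, value)
            if d <= threshold:
                results.append(value)
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - threshold, d + threshold] can contain matches.
            for edge, child in children.items():
                if d - threshold <= edge <= d + threshold:
                    stack.append(child)
        return results

tree = BKTree()
for h in (0b0000, 0b0001, 0b0111, 0b1111):
    tree.add(h)
print(sorted(tree.query(0b0011, 1)))  # prints [1, 7]
```

The pruning is what gives the fast lookups; the drawback discussed above is that the whole tree lives in one machine's memory.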
CPU-intensive approach:- This method requires a bit of engineering finesse. Since the Hamming distance between two numbers can be calculated by simply XORing them, we can exploit this fact: for the hash of the image whose duplicates we need to find, we generate the set of all hashes within the threshold Hamming distance of it; images whose hashes fall in this set are the similar images. Suppose:
- our maximum Hamming distance (threshold) for considering two images similar is 2
- hashes in our system are 3-digit binary numbers
- the hash of the image for which we are trying to find similar images is, say, 2 (binary representation 010)
- We maintain a list of 3-digit binary numbers (the hash length from point 2 of the suppositions) that have at most 2 set bits (the maximum Hamming distance from point 1): 100, 010, 001, 110, 101, 011 (call this list A).
- We XOR the hash of the image (010, from point 3 of the suppositions) with each binary number in list A, yielding the set of hashes at a Hamming distance of at most 2 from it: 010 ^ 100 = 110, 010 ^ 010 = 000, 010 ^ 001 = 011, 010 ^ 110 = 100, 010 ^ 101 = 111, 010 ^ 011 = 001 (call this list B).
- Once we have list B, we can directly search for these hash values in our database and pick up the images that match them.
This process requires no in-memory dump of hash values, and it can scale horizontally. There is one caveat, though: this approach works only when we can keep the size of list A (the masks with at most the threshold number of set bits) minimal, since that size grows combinatorially with the Hamming distance threshold. Remember, I said above in the 'Similarity Threshold identification' section that this maximum value was luckily pretty small for us; this is the reason it mattered.
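The steps above can be sketched as a small candidate generator. One deliberate difference from the worked example: this version also includes the all-zero mask (distance 0), so exact duplicates are caught along with near-duplicates; the mask count is the sum of C(bits, k) for k up to the threshold, which is the combinatorial growth the caveat warns about:

```python
from itertools import combinations

def candidate_hashes(h: int, bits: int, max_dist: int) -> set:
    """All hashes within `max_dist` Hamming distance of `h`.

    Builds every XOR mask with at most `max_dist` set bits (list A in
    the text) and applies each to `h` (yielding list B).
    """
    candidates = set()
    for k in range(max_dist + 1):
        # Every choice of k bit positions gives one mask with k set bits.
        for positions in combinations(range(bits), k):
            mask = 0
            for p in positions:
                mask |= 1 << p
            candidates.add(h ^ mask)
    return candidates

# The worked example above: 3-bit hashes, hash 0b010, threshold 2.
# Only 0b101 (distance 3) is excluded.
print(sorted(candidate_hashes(0b010, 3, 2)))  # prints [0, 1, 2, 3, 4, 6, 7]
```

The resulting set is exactly what gets fed into a plain indexed lookup against the hash database, which is why the whole approach shards and scales horizontally.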
# Similar Image pair storage and rendering pipeline:-
I would advise keeping a separate database that stores the pairs of similar images, as this simplifies database lookups and also facilitates making the entire procedure incremental.
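An illustrative shape for such a store (the table and column names here are hypothetical, not our production schema) keeps each pair once in canonical order, so a pair is never recorded twice in opposite orders and re-running a batch stays idempotent:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE similar_pairs (
        image_a  TEXT    NOT NULL,
        image_b  TEXT    NOT NULL,
        distance INTEGER NOT NULL,
        PRIMARY KEY (image_a, image_b)
    )
""")

def record_pair(a: str, b: str, distance: int):
    # Canonical ordering: (x, y) and (y, x) map to the same row, and
    # INSERT OR REPLACE makes repeated batch runs idempotent.
    a, b = sorted((a, b))
    db.execute(
        "INSERT OR REPLACE INTO similar_pairs VALUES (?, ?, ?)",
        (a, b, distance),
    )

record_pair("img-2", "img-1", 3)
record_pair("img-1", "img-2", 3)  # same pair, reversed order
print(db.execute("SELECT * FROM similar_pairs").fetchall())
# prints [('img-1', 'img-2', 3)]
```

Keeping this table separate from the hash database also means the rendering path only ever reads precomputed pairs and never touches the batch machinery.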
The actual implementation details at Urbanclap are a bit more complicated than what I have described above, catering to the specific needs and magnitudes of our system, but I hope this article has given an overall glimpse of it.
Originally published at medium.com.