Getting tired of re-uploads.

Kilian Brachtendorf
Jan 16

Dismantling 9Gag with perceptual image hashing. Part 1


The website in question, 9Gag, is a social media platform featuring user-generated content in the form of videos, images and gifs. During my first semesters of university, consuming its content right before going to bed was part of my daily routine and a great time sink.

Since then, the quality of the submitted entries has gradually decreased, and serious flaws in the website's operation have become apparent:

  1. Reposts: Similar to other well known social media sites, users are incentivised to upload popular content. Some individuals therefore reupload content that has already been posted by someone else. Seeing the same images over and over gets tiresome pretty quickly.
  2. Filtering promotional content and debunking vote manipulation: The company behind the platform is accused of featuring promotional content disguised as normal posts. Let's figure out if we can differentiate legitimate and artificial content. A bit more information can be found here.
  3. Meme classification: There is no way to browse a specific meme category. Figure out how to categorize the data in a suitable way.

The above tasks are well suited to being solved with a technique known as perceptual image hashing, which I stumbled upon early last year while reading Dr. Neal Krawetz's blog post "Kind of Like That".

Fitting: an image found while looking at my scraped data …

All source code is available on GitHub. If you want to follow along, clone the project https://github.com/KilianB/JImageHash.git and take a look at the example package.

Note: We will be working with around 15,000 images (~1 GB of data). IO operations take up the majority of the time. To reduce the time spent on file loads, using an SSD as temporary storage is highly encouraged.

Data Retrieval

To find and remove duplicates we first have to gain access to the images saved on the website.

… and the correct answer

I was already set on firing up Eclipse and writing a hideous web-scraping script, possibly using Selenium. Luckily, spending 4 minutes looking through the network requests paid off, and the (hidden) endpoint of the internal 9Gag API surfaced.

https://9gag.com/v1/group-posts/group/default/type indubitably looks like an ordinary REST API call, and sure enough, when queried it returns a bunch of metadata in a well-structured JSON format. What a fortunate discovery.

{"meta":{"timestamp":1547216490,"status":"Success","sid":"9gVQ01EVjlHTUVkMMRVSzwEVJBTTn1TY"},"data":{"posts":[{"id":"amBvdPj","url":"http:\/\/9gag.com\/gag\/amBvdPj","title":"Gervais with a solution for The Oscars","type":"Photo","nsfw":0,"upVoteCount":825,"downVoteCount":27.....
,"nextCursor":"after=a1QWwXY%2Ca5MWxZG%2Ca1QWw8P&c=10"}}

Appending the nextCursor attribute to the URL allows us to retrieve the next batch of data. Even better, slightly modifying the URL enables a more specific search.

Using https://9gag.com/v1/group-posts/group/default/type/SECTION_NAME/?[query=QUERY][&after=CURSOR] allows us to scrape data from the different sections.

Note: The section name needs to be all lowercase. The nextCursor token goes missing at an arbitrary depth (usually around 600–1000 successive calls), marking the end of the available data.
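
Below is a minimal sketch of that pagination loop, assuming the endpoint format shown above. The GagPager class is hypothetical, and the cursor is pulled out with a crude regex instead of a proper JSON parser; the real crawler lives in the example package.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GagPager {

    private static final String BASE = "https://9gag.com/v1/group-posts/group/default/type/hot";
    private static final Pattern CURSOR = Pattern.compile("\"nextCursor\":\"([^\"]+)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String cursor = "";
        // Follow nextCursor until the token goes missing, i.e. no more data
        while (cursor != null) {
            String url = cursor.isEmpty() ? BASE : BASE + "/?" + cursor;
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
            String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            // ... hand body over to a JSON parser and persist the post metadata ...
            Matcher m = CURSOR.matcher(body);
            cursor = m.find() ? m.group(1) : null;
            Thread.sleep(500); // be gentle, the servers rate-limit aggressive clients
        }
    }
}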

A post on 9Gag roughly follows this life cycle:

  • new posts are published in the fresh category,
  • if they receive enough attention they are moved to the trending section,
  • if they are among the most commented, upvoted and shared posts they will be promoted to hot.

After downloading the metadata and saving it in a database (allowing future SQL queries to rapidly answer more specific questions), the images themselves were downloaded.

Source code to recreate the example. For specifics see github.

Keep an eye out for rate limiting (spam prevention). Either my internet connection got clogged, their servers were busy, or at some point Cloudflare did not like the constant requests hitting their servers while I was repeatedly re-running the examples to ensure code stability.

Taking a quick look at the gathered data

To stroll around in the data it is recommended to install the DBeaver Eclipse extension via the marketplace; alternatively, the H2 database engine allows usage via the browser. None of these are required to proceed with the example project.

Roughly 17,000 images to work with. About 3/4 of them are images, most of the rest gifs; videos make up a small fraction of the posts. For now we will only treat the very first frame of a video or gif to find duplicates, since the thumbnail is conveniently available in the scraped data.

Content at different sections

We can see that gifs and videos are more likely to make it to the hot section. Articles are an arbitrary combination of text, images and gifs (<0.2% of all submissions).

Example of different tables in the SQL database

Upon closer inspection it became apparent that the lifecycle isn't exactly as described earlier. Posts can be in trending, hot and fresh at the same time. Of the 17,000 original entries we are left with 15,000 unique posts. The hot section covers about 11 days' worth of data, trending 4 days and fresh 1½ days.

Interesting here are the promoted field (sadly unused), the downvote count of a post (which was hidden from the website some time ago), and the fact that downvotes and upvotes are available even if the vote-masked field is set to true. The vote mask indicates that the votes should be hidden from the user to prevent the bandwagon effect.

Come on, 9Gag API team, don't ever send data to the client that you don't want the user to see. Someone could go right ahead and write a browser extension to expose the details.

Data preprocessing and cleansing

Remember that only a random thumbnail image is available for videos and gifs. Of the 6,000 hot entries, around 55 contain a black frame.

Unusable data

Due to their similarity, duplicate detection will not yield any meaningful results for these images. For now let's ignore them and revisit these special cases at a later time.

But how do we filter out the unusable data? Taking a look at each pixel defeats the purpose of perceptual hashing. Luckily, JPEG compression already does the heavy lifting: images with homogeneous colors have a very small file size. Take a look at the peak of 4 kB files. A good start, but if we naively throw out all images smaller than this threshold we also get rid of valid files like a1QRP6.jpg. The key is to cut files based on the ratio between file size and image dimensions: a1QRP6, while having a small file size, is also very small in resolution. A maximum file size of 5 kB and a ratio of 0.025 bytes/pixel have proven to be pretty good at filtering these frames.
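
A minimal sketch of such a filter, assuming the gagImages/invalidImages folder layout of the example project and the thresholds just mentioned:

import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import javax.imageio.ImageIO;

public class BlackFrameFilter {

    public static void main(String[] args) throws IOException {
        File invalidDir = new File("invalidImages");
        invalidDir.mkdirs();
        for (File file : new File("gagImages").listFiles((d, n) -> n.endsWith(".jpg"))) {
            long bytes = file.length();
            if (bytes >= 5 * 1024) {
                continue; // big files carry enough detail to be usable
            }
            BufferedImage img = ImageIO.read(file);
            if (img == null) {
                continue; // not a decodable image
            }
            double bytesPerPixel = (double) bytes / (img.getWidth() * img.getHeight());
            // Homogeneous (e.g. black) frames compress extremely well
            if (bytesPerPixel < 0.025) {
                Files.move(file.toPath(), new File(invalidDir, file.getName()).toPath());
            }
        }
    }
}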

After running the code, a folder named invalidImages will be present in the working directory. If the filter caught a few legitimate images, feel free to move them back to the gagImages folder.

Before we dive into duplicate detection, some statistics

Here is a quick plot visualizing the upvotes and comments of posts in each section. As expected, the more advanced the category, the more upvotes and comments a post has.

Surprisingly, a few posts in trending exist which have a higher count than the highest-upvoted post in hot. This indicates that the promotion of entries is not solely based on these two factors.

And rightly so: the official FAQ states that the category of an entry depends on:

Post engagement — the number of upvotes/comments/shares/views a post receives

Timely factors (posts that are timely or those which receive better engagement within a short period of time)

We don't have information on how often a post was shared, but we can certainly correlate the numbers with a post's creation time. This will be tackled at a later time to create a model that predicts which entries should be moved to a higher category, comparing the prediction error against where they are now in an attempt to find manipulated entries. Just another quick chart before moving to perceptual image hashing:

Regression of comments and upvotes

We expect posts with more votes to also have more comments. This relationship holds true for the majority of the data. Some points are far away from the mean, and the last time I pulled this data (some months ago) the hot regression was reversed.

Hypothesis: Vote-manipulated entries will have a skewed ratio between votes and comments, because upvotes are straightforward to fabricate. Anything that is not normally distributed is usually worth a look, but we haven't yet accounted for other variables which might influence these numbers, like time of day and weekday.

Perceptual image hashing. Similarity approximation for images

Perceptual hashing follows a simple objective. Instead of creating an unpredictable cryptographic hash value, which changes drastically even with small input changes, perceptual hashes shall stay close to one another if the input changes only slightly.

How does perceptual image hashing work?

The following sections describe the fundamental concepts employed by duplicate detection with perceptual hashing and get a bit technical. Skip right ahead if this isn't of interest to you.

Given an input image, the common steps usually include scaling all images to a fixed size and aggregating a section of the image into a single value. Each rescaled pixel is mapped to either a 0 or a 1 depending on whether its luminosity is greater than the image's average. (This is called the AverageHash.)

64 bit hash of a meme

Scaling the input image lets us compensate for different image dimensions, and working on luminosity decouples the hash from hue and saturation changes. If detection of these properties is desired, alternatives exist.

The above example demonstrates the creation of an Average hash, reducing the 64 kB original image by a factor of 8000 down to 64 bits of information. The similarity between two images can then be approximated by calculating the hamming distance of the created hashes.

e.g. hashes of 01111010 and 01110011 have a hamming distance of 2 and a normalized distance of 0.25 (2/8).

Computing this value is incredibly cheap, as we only require an XOR between two numerical values and a count of the set bits. In other words, count the number of differing bits at the same position in both hashes.

01111010 ^ 01110011 = 00001001 => 2 bits set
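
Putting the pieces together, here is a from-scratch sketch of the AverageHash and the distance computation described above (the library implementation differs in details such as the rescaling method and the luminosity formula):

import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

public class MiniAverageHash {

    /** Condense an image into a 64 bit average hash. */
    public static long averageHash(BufferedImage input) {
        // 1. Rescale to a fixed 8x8 grid to be independent of the original dimensions
        BufferedImage scaled = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.drawImage(input, 0, 0, 8, 8, null);
        g.dispose();

        // 2. Compute a simple luminosity (here: plain RGB average) and the overall mean
        int[] lum = new int[64];
        int avg = 0;
        for (int i = 0; i < 64; i++) {
            int rgb = scaled.getRGB(i % 8, i / 8);
            lum[i] = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
            avg += lum[i];
        }
        avg /= 64;

        // 3. Map each pixel to 1 if it is brighter than the average, else 0
        long hash = 0;
        for (int i = 0; i < 64; i++) {
            hash = (hash << 1) | (lum[i] > avg ? 1 : 0);
        }
        return hash;
    }

    /** Hamming distance: XOR the hashes and count the differing bits. */
    public static int distance(long hashA, long hashB) {
        return Long.bitCount(hashA ^ hashB);
    }
}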

Computing an image similarity score of two images with JImageHash
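
Along the lines of the library's README at the time of writing (package paths may differ between versions, and the image paths are placeholders):

import com.github.kilianB.hashAlgorithms.AverageHash;
import com.github.kilianB.hashAlgorithms.HashingAlgorithm;
import com.github.kilianB.matcher.Hash;

import java.io.File;

public class SimilarityDemo {
    public static void main(String[] args) throws Exception {
        HashingAlgorithm hasher = new AverageHash(64);

        Hash hash0 = hasher.hash(new File("image0.jpg"));
        Hash hash1 = hasher.hash(new File("image1.jpg"));

        // 0 = identical hashes, 1 = every bit differs
        double distance = hash0.normalizedHammingDistance(hash1);
        System.out.println(distance < 0.2 ? "likely duplicates" : "distinct images");
    }
}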

The Average hash is quick to compute, but still suffers a detection penalty when faced with color and scale transformations. It's quite common for re-uploads to be transformed by poor cropping or to carry noise from repeated compression (and some people take screenshots with their phones [like, seriously guys?]).

Also, it should be noted that we are able to create hashes with an arbitrary bit resolution. The higher the bit resolution, the more memory a hash consumes and the more expensive it is to compute. On the other hand, high bit resolutions track more details of the image. Finding the right value is use-case dependent, and some degree of generalization is usually desired.

Difference hash with a high bit resolution to visualize the effect.

We have countless other ways to transform an image into a hash value.

Instead of comparing to the average, we could compare a pixel to its neighbors. Remember that rescaling takes place first, so we are not comparing megapixels to each other, but in this case 64 pixels. The DifferenceHash follows an approach similar to edge detection algorithms and accounts for the gradient of an image.

We can get even more creative. First perform a rotational mapping on the image and place its pixel values into bins before computing the hash. This makes us mostly immune against rotational transformation attacks.

Or compute the angles of gradients, similar to the HOG feature extractor. In particular the PerceptiveHash should be mentioned, which applies the discrete cosine transformation (type II), extracting the frequencies of an image, and performs pretty well at distinguishing images.

Left: rotational mapping. Middle: perceptive hash. Right: HOG feature extraction

Perceptual image hashing is an approximation technique. Images are mapped to hashes in a much smaller dimensional space, therefore hash collisions are bound to happen. We are able to counteract this issue by applying multiple hash functions to an image. Play around with the benchmark package to get a feel for the distance values generated by different algorithms.

To reduce the risk of false positives, chain multiple algorithms back to back to incorporate different image features into the decision process. First using the fast AHash and only computing the PHash afterwards might be a good idea. It really depends on what we are after.

Create the image representation of hashes as seen in this blog post.

A naive approach to finding duplicates

Now that we have defined a similarity measurement for images, we can compute the distances between images and see if we can find any close matches.

JImageHash's matchers package helps to bundle multiple hashing algorithms back to back, as well as to compare an arbitrary number of images, for both in-memory and persistent use cases.

Find duplicates in a batch of images

Created hashes hold a reference to the original image's file location, allowing the matcher to be used with thousands of images at a time without running out of heap space.

This specific class utilizes a lazily populated binary tree (remember the hashes are just chained 0s and 1s) to save hashes.

The beauty of this approach is that we can employ a depth-first search and prune the search space quickly, granting good performance on a k-nearest-neighbor search.

Consecutive matchers do a pass with the first supplied algorithm (in this case the average hash with a normalized distance of .4). If images are considered a match, the second algorithm (the perceptive hash with a target distance of .2) is queried.
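
The same two-pass idea spelled out by hand; a sketch of just the decision logic, without the binary-tree pruning the matcher performs (thresholds as above):

import com.github.kilianB.hashAlgorithms.AverageHash;
import com.github.kilianB.hashAlgorithms.PerceptiveHash;

import java.io.File;
import java.io.IOException;

public class TwoPassCheck {

    public static boolean isDuplicate(File a, File b) throws IOException {
        // Cheap first pass: average hash with a generous threshold
        AverageHash aHash = new AverageHash(64);
        if (aHash.hash(a).normalizedHammingDistance(aHash.hash(b)) > 0.4) {
            return false;
        }
        // More expensive second pass: perceptive hash with a strict threshold
        PerceptiveHash pHash = new PerceptiveHash(64);
        return pHash.hash(a).normalizedHammingDistance(pHash.hash(b)) <= 0.2;
    }
}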

This approach is viable for finding duplicates in a huge set of images. Sadly, memes are a special case because they are similar by definition: they use the same background image, only altering a small section (mostly the text).

Let's go ahead and try it anyway.

Sample Output:

[Result Distance:0,000 Normalized Distance 0,000, Value:G:\gagImages\HOT\a73wRqL.jpg, Result Distance:7,000 Normalized Distance 0,109, Value:G:\gagImages\HOT\aA3erx9.jpg]
[Result Distance:0,000 Normalized Distance 0,000, Value:G:\gagImages\HOT\a73WwNe.jpg, Result Distance:8,000 Normalized Distance 0,125, Value:G:\gagImages\HOT\a5MR4WO.jpg]

The images a73wRqL and aA3erx9 are most likely duplicates. If we take a random look at some of the found images, we find pairs like these:

Memes get mistakenly classified as duplicates (because, by definition, they are).

The perceptual hash already does a great job at separating images, and applying the very strict distance threshold of .16 ensures that we get very few false positives. The first pair demonstrates the immunity to different image sizes and saturation changes (the right image looks like a warm image filter was applied). The results found are very likely duplicates. We could call it a day and be happy that we already deleted a good portion of the reposts, but there is always a way to improve results.

The strict settings result in many duplicates going unnoticed. If we increase the distance, all memes are categorized as matches. As with other data mining algorithms, a trade-off somewhere in the confusion matrix has to be made. (Recall, precision and ROC curve are keywords to look up.)

1. Failed Approach: Using random forest categorization.

In the following, a failed attempt to find duplicates is described. Skip it if you just want to know how it's done correctly.

Labeled test data generation

The issue at hand: finding the correct threshold and hashing algorithm combination.

A variation of the above code was used to find potential duplicates. Afterwards, a minimalistic GUI was written to display the images side by side. 15 minutes later, all potential duplicate candidates were labeled as either true duplicates or false matches at the press of a button.

With labeled test data, the task can be treated as a classification/optimization problem, and known machine learning techniques can be used to train a model: support vector machines (it's not necessarily a binary classification problem …), neural networks, or whatever you like. I did not want to label a great number of images, and there isn't a sufficient amount of test data available just yet. Using a convolutional neural network with the hash bits as inputs might be the right way to gain the best results; this path will be looked at in a later part of this blog series. If you want to do any additional reading, take a look at person detection tasks using the HOG feature descriptor. We could swap out the descriptor for the hashes and may get good results.

Darwin

At this point I usually default back to using genetic algorithms somewhere in the mix. (Please take a look at Darwin, another one of my experimental projects, featuring n-parental genetic algorithms for numerical and categorical problems, with multithreading as well as charting support and full customization via strategy patterns. [Done with the shameless self-promotion for now.]) But the goal for JImageHash was to be as self-contained as possible and not to include any beta features.

Decision trees are easy to implement, random forests allow for better predictions, so here we go.

Two great videos explaining the concept can be found on Youtube. They served as a first reference for the early version of the RandomForestCategorizer.

The idea: different image categories will create different distance measurements for different algorithms. If we compare the distances between images using different types of algorithms (AverageHash, DifferenceHash (top-bottom, left-right, diagonal gradient), PerceptiveHash, MedianHash …) we might be able to find a threshold combination suited to classifying the images into distinct categories.

The random forest had 2 output categories: is this image considered a duplicate or isn't it. The variables were the distances to the closest hash already in the set.

Artificial decision tree

The hope was to find a combination of rules: if AHash produces a distance of <.2 and DHash a distance of >.4 while MHash is <.3, it's probably a confession bear meme, and it's up to the tree to figure out, based on the labeled test cases, whether it's really a match or not. During training this went well and improved the overall detection rate by a tiny bit. Out-of-bag errors quickly followed and indicated a classification error of .3 for false positives. Ouch. We caught all true duplicates, but do we really want a third of the images to be classified as duplicates when they truly are not? Overfitting?

Take a step back and you may notice that the premise of this approach simply isn't correct. A tree can't be constructed if the variables change depending on the needle. The variables of a tree should not depend on the distance to the closest hash in the set but on a fixed point instead. I didn't catch the reason immediately (for all fellow Germans: "I couldn't see the forest for the trees" ;) ). The RandomForestMatcher has been degraded to Experimental status and will make a comeback employing the fuzzy hashes (see below) in future releases.

2. Approach: Cluster images before the duplicate search

Recap time once again. We want to find duplicates and be able to classify memes (is it a confession bear or a bad luck brian, and if so, are they duplicates?). If we tackle the second problem, maybe we can solve the first at the same time by running a duplicate search just within the found cluster.

The categorize package includes classes we can utilize to group images into similar categories.

For now we use labeled test images as our cluster midpoints; at a later time we will get rid of this constraint and compute clusters on the fly. https://imgflip.com/memetemplates is a website featuring a great number of blank memes we can use as templates to match our images against.

Snippet to download the images
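
A rough sketch of such a scraper; the regex and the page count are guesses at the page structure rather than the actual TemplateImageDownloader implementation:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TemplateScraper {

    // Assumption: template thumbnails are served from i.imgflip.com
    private static final Pattern IMG = Pattern.compile("src=\"(//i\\.imgflip\\.com/[^\"]+\\.jpg)\"");

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        for (int page = 1; page <= 18; page++) { // page count chosen arbitrarily; adjust as needed
            HttpRequest request = HttpRequest.newBuilder(
                    URI.create("https://imgflip.com/memetemplates?page=" + page))
                    // Requests without a user agent get rejected
                    .header("User-Agent", "Mozilla/5.0")
                    .build();
            String html = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            Matcher m = IMG.matcher(html);
            while (m.find()) {
                String imageUrl = "https:" + m.group(1);
                // ... download imageUrl, again with the user agent header set ...
            }
        }
    }
}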

The web page does not employ any fancy javascript, so the download URLs can quickly be retrieved with the above snippet. The class TemplateImageDownloader in the example folder takes care of parsing and downloading. If you do this yourself, be sure to set the user agent in both instances or the request will fail.

Now we additionally have 831 template images to work with.

Empty template images

Calculating the distance of every input image to the nearest template image:

Compute the distance to the nearest meme template
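
Conceptually this boils down to a brute-force nearest-neighbor search over the template hashes. A sketch (the example project routes this through the matcher classes instead; folder names are assumptions):

import com.github.kilianB.hashAlgorithms.PerceptiveHash;
import com.github.kilianB.matcher.Hash;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class NearestTemplate {

    public static void main(String[] args) throws Exception {
        PerceptiveHash hasher = new PerceptiveHash(64);

        // Hash all template images once up front
        List<Hash> templates = new ArrayList<>();
        List<String> names = new ArrayList<>();
        for (File template : new File("memeTemplates").listFiles()) {
            templates.add(hasher.hash(template));
            names.add(template.getName());
        }

        // For each input image find the closest template (= its cluster)
        for (File image : new File("gagImages").listFiles()) {
            Hash hash = hasher.hash(image);
            double best = 1;
            int cluster = -1;
            for (int i = 0; i < templates.size(); i++) {
                double d = hash.normalizedHammingDistance(templates.get(i));
                if (d < best) {
                    best = d;
                    cluster = i;
                }
            }
            System.out.printf("%s -> %s (cluster %d, distance %.3f)%n",
                    image.getName(), names.get(cluster), cluster, best);
        }
    }
}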

Note: Usually you would want to use a ConsecutiveMatcher in this case, which returns all neighbors within a given distance (much better performance). But we will expand on the usage of the CategoricalMatcher later on.

// Confession bear cluster (it's input image 162 in my case)
if (cluster == 162) {
    outputDistance();
}

Before looking at the data, let us hypothesize how the distances behave and, as a sanity check, compare with our actual results. The hamming distances of random images checked against each other should follow a normal distribution. If there is a cluster of matching images, we expect a bimodal distribution, with the lower mode indicating matches and the higher mode indicating noise.

X-axis: normalized hamming distance. Y-axis: count of images

Optimal distribution. Green: true positives (matching), red: true negatives (non-matching)

Two totally distinct distributions would allow us to nicely pick a threshold between the peaks (preferably the antimode) and cleanly split the data.

Left: Confession Bear template image checked against all images. Right: trimmed minor mode

As always, the truth isn't as great, and we have at least some overlap. We might consider cutting at a distance of .4ish to allow for some wiggle room. Potential false positives can be handled later.

Sadly this only works for the confession bear. How about the other meme templates?

Distance of all images to the closest meme template image
Interestingly the median hash has a peak at distance 0. We will dig into this at a later time.

The charts above represent the distance distribution for all images checked against their most similar template image. Based on these values it's up to us to decide whether an image is actually a meme or not. If it is, the cluster variable gives us the meme template the image matches.

First we have to decide which algorithm is a good pick. The normalized hamming distance of all hashes to the closest meme cluster falls between 0 and 0.4. This is reasonable: the expectation of each bit of a hash matching the corresponding bit of a random image's hash is 0.5, and with multiple template hashes to choose from, the probability of a closer one being present is respectively higher, therefore no image ever reaches a distance of .5. (Additionally, the perceptive hash does not have quite a .5 probability of each bit being a 0 or a 1.)

We are looking for a negatively skewed distribution. It indicates that some images, when compared to each other, generate a smaller hamming distance than the expected normal distribution would suggest; in this case we have images that are more similar to each other.

Now it is time to choose a cutoff value. The best approach would be to look at each category individually and choose a threshold by calculating the antimode, giving it the desired margin of error (e.g. in terms of standard deviations). But we do not have enough data to work with yet.

The average hash has a very high standard deviation, resulting in false positives. The perceptive hash and difference hash provide cleaner alternatives, with the perceptive hash being the most "crisp"-looking distribution. By eyeballing we could go with a value of .4 for now.

For a first improved version the code could be altered as follows:

if (distance < .4) {
    // do a second pass with a different algorithm to check
    // if the images are still close, then classify as meme
    // and carry on with strict duplicate detection
} else {
    // relaxed duplicate detection
}

Image Clustering with “Fuzzy Hashes”

Meme templates are NOT what average memes really look like. We used the blank template hashes to decide whether images are memes; what if we do not have template images, e.g. in the case of new meme categories?

Additionally, distances towards the template image are artificially increased by the fact that memes actually carry text, which the template images do not possess.

They will be close, but a margin of error is introduced due to template images being "templates".

We could apply further image preprocessing, at a high performance penalty and not necessarily producing better results:

var hasher = new AverageHash(100);
// Smooth the image before hashing to reduce the impact of noise and text
hasher.addFilter(Kernel.gaussianFilter(5, 5, 4));

// Alternatively: an average hash applying a normalized box filter kernel
var kernelHasher = new AverageKernelHash(64, Kernel.boxFilterNormalizedSep(4, 4));

The CategoricalMatcher used earlier has many more capabilities than what we have used so far. It does not depend on labeled images but can utilize an ugly hybrid of k-means and DBSCAN, with some twists attached, to find its own categories. Hashes are added and grouped if they are within a given distance of the nearest cluster. Grouped hashes form a cluster and are represented by a centroid.

Due to your high attention you will have noticed that this is just an approximate nearest-neighbor search. Individual points may drop out of clusters, and at some point we are required to recompute and reassign the points. The more images are present, the less often this action has to be performed.

Every clustering matcher features a method recomputeCategories() taking care of this issue. Due to the transitive nature of the hashes (if hash A is close to hash B and far away from C, hash B is also far away from C) we can speed up the process a bit, either by creating binary trees or by actually doing a k-means pass on the clusters beforehand to divide the search space, similar to a kd-tree.

Now we are getting creative: fuzzy hashes with bit probabilities and approximate matching. A spicy type of perceptual image hashing I haven't seen anywhere else yet! (If there are papers or articles about this type of matching, please let me know. I am more than interested in reading them.)

Combining multiple hashes into one cluster

The CategoricalMatcher employs matching on the mode hash as seen above. Distance calculations using this approximation are reasonable and have the benefit of still being cheap to compute.

That being said, bits can only be either 0 or 1, so with an even number of hashes added we might run into ambiguity issues when neither 0s nor 1s prevail.

Fuzzy Image Matching

By sparing a little more computation power, a hash can instead be represented as the probability of each bit being a 0 or a 1. This also allows us to compute the certainty of a bit being either a 0 or a 1.

Combining Hashes Into A Fuzzy Hash

The green and orange sections of the hash indicate bits with a high probability of being 0 or 1 respectively, the intermediate colors indicating uncertainty. As expected, the regions surrounding the text are indecisive bits, because they are not identical in all images. You are always left with some noise due to different image sizes and saturation in the aggregated images.

This insight opens up a whole different level of image matching. Remember that the bit position of a hash stays consistent with the matched image feature. Now we are able to severely punish images which perform worse on the certain parts of the image (e.g. by squaring the distances).
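
A toy sketch of both ideas: the per-bit probabilities forming the fuzzy hash, and a distance that squares mismatches on certain bits (the actual weighting in the library differs):

public class FuzzyHash {

    private final double[] oneProbability; // per-bit probability of the bit being a 1
    private int added;

    public FuzzyHash(int bits) {
        oneProbability = new double[bits];
    }

    /** Fold another boolean hash into the running per-bit probabilities. */
    public void add(boolean[] hashBits) {
        added++;
        for (int i = 0; i < hashBits.length; i++) {
            oneProbability[i] += ((hashBits[i] ? 1 : 0) - oneProbability[i]) / added;
        }
    }

    /**
     * Weighted distance: a mismatch on a certain bit (probability near 0 or 1)
     * costs a lot, a mismatch on an undecided bit (probability near 0.5) almost nothing.
     */
    public double weightedDistance(boolean[] hashBits) {
        double distance = 0;
        for (int i = 0; i < hashBits.length; i++) {
            double certainty = Math.abs(oneProbability[i] - 0.5) * 2; // 0 = undecided, 1 = certain
            boolean expected = oneProbability[i] > 0.5;
            if (hashBits[i] != expected) {
                distance += certainty * certainty; // squaring punishes certain mismatches harder
            }
        }
        return distance / hashBits.length;
    }
}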

The WeightedCategoricalMatcher takes the weighted distance into account but performs respectively slower due to the increased complexity. After feeding all our images to either the Categorical- or the WeightedCategoricalMatcher, we can employ the same trick as before: calculate the distance of each fuzzy hash to the template images to see which fuzzy hash corresponds to which meme category.

After that, instead of using the template images to match new images, we can use the fuzzy hashes. This reduces the error we talked about earlier, as the uncertain parts of the hash are masked. The first computation of the mask is a bit expensive, but as soon as enough images are added the mask stays consistent and does not need recomputation. A one-time task given enough images.

Including file IO, clustering takes around 75 seconds, and recalculation of the categories is done within 10 seconds, covering multiple days of data. For a one-time operation this is fine. (Times vary depending on the threshold variable, on whether the weighted or normal categorizer is chosen, and on the hashing algorithms.)

The categories are computed without the template images taken into account. The big advantage of this approach is that we can take a look at the bigger clusters (more entries) and reliably find the meme categories, as well as potential other memes we have no template image for.

Extracting similar memes into folders

The ~15,000 images are categorized into a total of 12,816 clusters, of which 12,676 represent non-meme images. Of these, 1,051 clusters contain more than one image (potential duplicates) and 11,625 are pure single images.

Cluster with potential duplicates

After extracting the memes, the above categories are left. The first cluster contains the images seen on the right: a meme category we do not have a template for. We could now take the fuzzy hash and add it to our recognition library to find these in the future.

The second, fourth and several other clusters all contain screenshots of tweets or facebook posts. Here an apparent weakness of the perceptive hash comes into play: while it is great at distinguishing even distorted images, it has a tough job if the frequencies are similar.

Right off the bat, 6 reposts of the same video: http://9gag.com/gag/aZLyOPV

Category: 25 IClustDistance: 0.009 Images: 6 [G:\9Gag\gagImages\aZLyOPV.jpg, G:\9Gag\gagImages\aR1756q.jpg, G:\9Gag\gagImages\aWYx3R3.jpg, G:\9Gag\gagImages\aZLy9GX.jpg, G:\9Gag\gagImages\a0Q7vKq.jpg, G:\9Gag\gagImages\agnBrwW.jpg]

This dog has been shared 6 times as well:

[G:\9Gag\gagImages\ayBeOdp.jpg, G:\9Gag\gagImages\av8zbjn.jpg, G:\9Gag\gagImages\aLg0NgV.jpg, G:\9Gag\gagImages\arGM1Ey.jpg, G:\9Gag\gagImages\aE2Pe3O.jpg, G:\9Gag\gagImages\a9KngOm.jp

This is it for now. In the next post we will take a look at implementing a second filtering step to ensure that the matched images are actually duplicates. (For the sake of simplicity this example was cut short at the end and ignored everything above a distance of .4, which is very, very generous for duplicates.)

Summary

Without taking meme duplicates into account, the naive approach found 66 true duplicates in the hot section alone. That is a 1% repost rate. Hot, trending and fresh combined have a stunning 5% duplication rate (with roughly 99% true-positive certainty). We still have 1,183 clusters with roughly 2.5 k images to look at.

We will also handle the issue that currently the category clusters need to be rebuilt every time we run this example. This costs time. We can make use of serialization and cut down on all the work we have already done.

What's up next

Problems still to tackle include proper gif and video hashing. It is unlikely that video reposts are greatly modified, but the roughly 100 blank frames still need to be addressed.

We will also take a look at the issue of finding vote-manipulated content. For this we can keep track of upvote and downvote counts, as well as incorporate comments and time.

Scaling this approach into the millions

The approach outlined above is reasonable for the 17,000 images, and even 30,000 images are fine, but how does it look after a million or two million images? Clustering performance stands and falls with the number of clusters present, as new images have to be checked against all existing clusters to find the nearest neighbor. A divide-and-conquer approach is possible (similar to the k-means approximation already used in the CategoricalMatcher), repeatedly building a tree of aggregated clusters. This works fine for a while but collapses as soon as we have to rebuild the tree from the bottom up because clusters shift around.

The initial reason to cluster the data boiled down to differentiating meme and ordinary images. Now, after computing the fuzzy hashes for the meme categories, this can be done quickly by checking new images against the roughly 700 hashes we got. (Still an approximation, since we do not know if there are any non-meme images the new image is more closely related to.) Clustering the ordinary data isn't necessary anymore.

If it's not a meme, we can proceed with a combination of the ordinary ConsecutiveMatcher and CumulativeMatcher, which rely on the good old binary-tree approach; their complexity therefore only scales with the tiny fraction of added hashes needed to resolve hash collisions, similar to how hashmaps work.

Additionally, the BloomFilterMatcher may help out: a probabilistic data structure which allows us to answer in constant time whether an image has definitely not been seen before. Chaining multiple bloom filters with low-bit-resolution hashes might reduce the time needed to check for expensive hash collisions.
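
To illustrate the idea, a toy bloom filter over 64-bit hash values, deriving k bit indices from a single hash (not the BloomFilterMatcher implementation):

import java.util.BitSet;

public class HashBloomFilter {

    private final BitSet bits;
    private final int size;
    private final int k; // number of derived index functions

    public HashBloomFilter(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    public void add(long imageHash) {
        for (int i = 0; i < k; i++) {
            bits.set(index(imageHash, i));
        }
    }

    /** false = the hash has definitely never been added; true = maybe seen before. */
    public boolean mightContain(long imageHash) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(imageHash, i))) {
                return false;
            }
        }
        return true;
    }

    private int index(long imageHash, int i) {
        // Derive k indices by mixing the hash with the function number
        long mixed = imageHash * 0x9E3779B97F4A7C15L + i;
        return Math.floorMod((int) (mixed ^ (mixed >>> 32)), size);
    }
}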
