Cleaning Up Collections: Detecting Duplicates with Image Fingerprints

How Urban Archive deals with duplicates

Sam Addeo
Urban Archive
Jul 22, 2020


At Urban Archive, the bulk of our work centers around collections. We’re always looking to add new ones to our platform in the same way that we’re always looking to clean and enrich what we already have. Very naturally, this process of combining collections from different sources and standardizing their image metadata has led to some duplication.

Why does image duplication matter?

For us, duplication is important for a variety of reasons. From a technical standpoint, it impacts search results and pagination on Urban Archive. It can also add redundancy or create confusion for users who are seeing double.

Take a look at the screenshot above, for example. The search query for “Mount Vernon” links us to two identical images of the site in Brooklyn. The subtext beneath the image results also suggests there are slight differences in the image metadata. Generally speaking, this kind of duplication stems from collection cataloging practices, manufactured reproductions, or even poor acquisition policies, all of which happen “pre-Urban Archive.” Our platform is well suited to these kinds of problems because our software can easily identify and visualize repetitive datapoints across collections and institutions. My goal in this article is to examine what’s happening and, ideally, dedupe the Mount Vernon image asset.

How we identify duplicate content

There are many different ways to attempt to identify duplication. Manually sifting through records on the map can be a good starting point, but it would be far too inefficient given the scope of our work at Urban Archive.

Image duplication of Mount Vernon Replica on Urban Archive.

At this point, we have more than 100K images on our platform. To do this work effectively, we’ve needed to develop a simple but systematic approach to detecting image duplicates in collections, using a method known as fingerprinting.

Image fingerprinting

If you’re unfamiliar with the term, fingerprinting is the process of analyzing an image and then computing and assigning a unique value to that image based on its content or visual appearance. That unique value makes the process for identifying duplicates in our image data substantially easier.

Our fingerprint tool identifies where there are image duplications within collections and across institutions.

Fingerprinting allows us to quickly search for duplicate images. The “search” results map out where there is duplication on our platform as well as where the metadata varies, if at all, as seen in the screenshot below.

The image fingerprint for Mount Vernon on Urban Archive.

How does it work?

Our image fingerprint tool detects exact duplicates of an image based on a SHA-1 hash of its bytes. This is a very simple approach that can’t detect, say, a cropped copy or the same image saved at a different level of JPEG compression. A more advanced tool like TinEye or Google Image Search can detect similarity between images in addition to exact matches.
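The idea can be sketched in a few lines of Python. This is not our production code, just a minimal illustration of the technique: hash each file’s raw bytes with SHA-1, then group files that share a fingerprint. The directory layout and function names here are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha1_fingerprint(path: Path) -> str:
    """Compute a SHA-1 hash of the file's raw bytes, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def find_exact_duplicates(image_dir: str) -> dict:
    """Group files in image_dir that share an identical SHA-1 fingerprint.

    Returns only the groups with more than one file, i.e. exact duplicates.
    """
    groups = defaultdict(list)
    for path in Path(image_dir).glob("*"):
        if path.is_file():
            groups[sha1_fingerprint(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Because the hash is computed over the exact bytes, two files match only if they are byte-for-byte identical, which is exactly why this approach misses crops and re-compressions.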

Once the duplicate is detected, our team can review the record before selecting a new primary source. Because any change made at this point in the process can potentially break links or permanently erase something important, we want to make sure we’re properly outlining the impact of selecting a new primary source.

In the case of Mount Vernon, I decided to keep the primary source and merge the records, deleting redundant data and moving any remaining metadata to the selected primary source. Now, when searching for Mount Vernon on Urban Archive or finding it on the map, users are no longer seeing double.
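A merge like this can be sketched as a simple record-level operation. The following is a hypothetical illustration of the general idea, not Urban Archive’s actual merge logic: the primary record’s fields win, and the duplicate only contributes metadata the primary is missing.

```python
def merge_duplicate(primary: dict, duplicate: dict) -> dict:
    """Merge a duplicate record into the primary record.

    The primary's existing values are kept; empty or missing fields are
    filled from the duplicate, and everything else in the duplicate is
    discarded as redundant.
    """
    merged = dict(primary)
    for key, value in duplicate.items():
        if merged.get(key) in (None, "") and value not in (None, ""):
            merged[key] = value
    return merged
```

The important design point, echoed above, is that the merge is reviewed by a person first, since a bad merge can break links or permanently erase metadata.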

Snapshot of a “Mount Vernon” search (left) and view of the enlarged image (right) on Urban Archive.

Closing thoughts

This tool is far from perfect and still relies on manual moderation to some extent, though that is partly by design. That said, we think it’s a step in the right direction because it’s a low-risk, incremental change to our process. No matter how robust processes like this might be, cleaning data can always get messy, and our platform will always be subject to more “mess” as new data comes in. Our goal, now and in the future, is to better understand our data and its associations so that we can consistently improve data quality over time.

Any thoughts? Drop us a comment below––and thanks for reading!
