Why we created ‘Imageid’ and saved 47% of the moderation effort

Diego Essaya
Published in Taringa!
8 min read · May 2, 2016

Images are a key component of the content that users publish on Taringa!. In a previous post we described the set of processes and tools that allow us to moderate user-generated content. In this article I will describe Imageid, one of the tools we developed to solve a very specific problem: recognizing similar images.

Our manual image moderation tool in action

In order to decide whether a Post can be categorized as Brand & Family safe, we have to categorize each and every image in the Post and all of its comments. For this, we developed an internal tool that lets our moderators review images as they are published. As soon as our moderators started using this tool, they noticed that some groups of images were coming up for moderation over and over again: “I already approved/banned that image! Why am I moderating it again?”

This is because our tool identifies an image uniquely by its URL. But the same image can be uploaded and re-uploaded many times, so it can live at many different URLs. This happens very frequently in our content:

  • Several users have the same avatar picture
  • Some users like to separate sections of their content using images, and they use the same “separator” repeatedly (I know, right?)
  • Memes are used all over the place

This was hurting the efficiency of our moderation team, because a large part of their effort was wasted re-moderating images that had already been reviewed.

The solution?

So we thought: can we automatically detect that the same image has already been moderated? This raised the question: what is the definition of “the same image”? How can we recognize that two images living under different URLs are effectively the same?

Well, it is very easy to recognize that two files are exactly equal: just compare them byte by byte. But this solves only part of the problem: very frequently an image is downloaded in one format, say JPG, converted to another format, say PNG, and re-uploaded. It is the same image, but definitely not the same file!

What’s more, images can be resized, cropped, and have all sorts of effects applied, like color correction. And then we have image editing: adding or removing small parts, overlaying text, etc. At which point does the image stop being the same image and need moderation again?

So this is what we set out to build: a system capable of automatically identifying groups of similar images.

First idea: hashing

Imagine we already have a database containing every image published in Taringa!. Let’s say 50 million images. A user publishes a new image under a new URL. We want to know if there is a duplicate of the image already stored in our database. How are we going to compare this new image against all others?

If we compare the images one by one, the process will surely be too slow. Remember: comparing for exact equality means comparing byte by byte. If the average image is 1 megabyte, our whole database weighs 50 terabytes! Reading and comparing 50 TB every time we receive a new image is just not practical.

This is where we start applying some concepts from Computer Science. Instead of storing the whole image in our database, we can store a hash calculated from the image. A hash is like a small fingerprint that results from applying a deterministic math formula to the whole image. One of the most widely used hash functions for files is MD5, which produces a 128-bit (16-byte) fingerprint. MD5 has these key properties:

  • If two images are exactly the same (i.e., same format, same pixels), they end up having the same MD5 hash.
  • If we change just one pixel of an image, we end up with a completely different MD5 hash.
  • It is possible, but extremely unlikely, that two different images end up having the same MD5 hash.

MD5 hash calculated from three different images. The MD5 hash is usually expressed in hexadecimal notation, where each character represents 4 bits

If instead of storing the images in our database we store the MD5 hash of each one, we go from 50 terabytes down to 800 megabytes! And we have another key advantage: hashes are indexable, which means that a database engine can search among millions of hashes in milliseconds.

So when we receive a new image to process, we calculate its MD5 hash and search it in the database. If it is found, we can say that we found the exact same image living under two different URLs. Otherwise, we store the new hash in the database.
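The exact-duplicate check is easy to sketch in Go (the language Imageid is written in). This is a minimal illustration using the standard crypto/md5 package; the actual database lookup is Imageid’s own and is omitted here:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// md5Hex returns the MD5 fingerprint of raw file bytes as a hex string,
// the form in which a hash would be stored and indexed in the database.
func md5Hex(data []byte) string {
	sum := md5.Sum(data) // 128-bit (16-byte) digest
	return hex.EncodeToString(sum[:])
}

func main() {
	img := []byte("raw image bytes would go here")
	fmt.Println(md5Hex(img))
	// A real pipeline would now SELECT this hex string from an indexed
	// column; a hit means the exact same file already exists.
}
```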

Perceptual hashing

We still haven’t solved our “similar images” problem. Remember that MD5 catches only exact duplicates: if we change just one pixel, or convert the image to another format, we end up with a completely different MD5 hash, so we will not detect it as the same image.

It turns out there is a different form of hashing called perceptual hashing. The key property of a perceptual hash is that if two images are similar, they should end up with similar hashes.

One way of obtaining a perceptual hash is to apply a series of image transformations: scale the image down to 8x8 pixels (64 pixels in total), convert it to grayscale, then to black & white (1 bit per pixel) using the average intensity as the threshold, and finally read off the 64 pixel values as a 64-bit hash.

Steps to calculate a perceptual hash from an image
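The steps above can be sketched in Go. This is a hedged illustration of the simple “average hash” variant, not Imageid’s actual pHash/dHash combination; it assumes the scaling and grayscale conversion have already been done, so the input is just 64 intensities:

```go
package main

import "fmt"

// aHash computes a 64-bit average hash from an image that has already
// been scaled down to 8x8 and converted to grayscale (64 intensities).
func aHash(gray [64]uint8) uint64 {
	// Average intensity across the 64 pixels.
	var sum int
	for _, p := range gray {
		sum += int(p)
	}
	avg := uint8(sum / 64)

	// Each pixel becomes one bit: 1 if at or above the average.
	var h uint64
	for i, p := range gray {
		if p >= avg {
			h |= uint64(1) << uint(63-i)
		}
	}
	return h
}

func main() {
	// Synthetic gradient: dark top-left to bright bottom-right.
	var gray [64]uint8
	for i := range gray {
		gray[i] = uint8(i * 4)
	}
	fmt.Printf("%016x\n", aHash(gray)) // prints 00000000ffffffff
}
```

Note how the hash captures only coarse structure: small edits to the original image barely move the 8x8 averages, so the resulting bits mostly stay the same.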

In our solution we ended up using a combination of two different perceptual hash algorithms: pHash and dHash, resulting in a 128-bit hash for each image (in addition to the 128-bit MD5 hash).

Searching similar hashes

Now we have in our database the perceptual hash for each of the 50 million images. We receive a new image for processing. We compute its perceptual hash. Can we efficiently search for a similar hash present in the database? Traditional databases do not provide this feature, but we can implement it ourselves using metric trees.

A metric tree is a data structure that allows us to search elements efficiently by distance. Have you ever used your phone to search for restaurants close to your location? How does that work? The map database surely contains millions of restaurants, most of them in remote parts of the world. Does the software go through each and every restaurant in the world and compute its distance to you? That would be very inefficient. Instead, all the restaurants in the world are stored in a metric tree.

Let’s imagine we are searching for restaurants in a small neighborhood in Buenos Aires. The first element of the metric tree is a random restaurant which, say, happens to be located in Poland. Let’s call it restaurant A. The rest of the restaurants in the world are divided into two groups: those that are near A, and those that are not. By near we mean closer than, say, 10,000 km. Since we are in Buenos Aires, we are in the far group, so we can rule out every restaurant in the “near restaurant A” group. And this effectively rules out half of the database! We keep applying the same process to the remaining restaurants: each step rules out a large part of the database, until we end up with a tiny group of restaurants near our location. Actual mapping software uses a different form of metric tree, but the concept is similar.

Metric tree constructed from 6 points in the plane

Practical uses of metric trees are not limited to geographic databases. The same concept can be applied to any problem in which elements can be compared for distance. So, remember our perceptual hashes? Can we compute a distance between two of them? Yes, we can compute the Hamming distance!

The concept of Hamming distance is very simple: it is the number of bits that differ. Some examples for 8-bit hashes:

  • 10010111 and 10010111 have a Hamming distance of 0 (since all bits are exactly the same).
  • 10010011 and 10010111 have a Hamming distance of 1 (they differ in a single bit).
  • 11010011 and 10010111 have a Hamming distance of 2.
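In Go, the Hamming distance between two hashes is essentially a one-liner: XOR the hashes and count the set bits. This sketch uses the standard library’s math/bits popcount; a manual bit-counting loop works just as well:

```go
package main

import (
	"fmt"
	"math/bits"
)

// hamming counts the bits that differ between two 64-bit hashes:
// XOR leaves a 1 exactly where the inputs disagree, then we count the 1s.
func hamming(a, b uint64) int {
	return bits.OnesCount64(a ^ b)
}

func main() {
	fmt.Println(hamming(0b10010111, 0b10010111)) // 0: identical
	fmt.Println(hamming(0b10010011, 0b10010111)) // 1: one differing bit
	fmt.Println(hamming(0b11010011, 0b10010111)) // 2: two differing bits
}
```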

Since we are working with 128-bit perceptual hashes, the maximum possible distance between any two hashes is 128. We define that two hashes are near if their Hamming distance is at most 8.
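One classic metric tree for Hamming distance is the BK-tree. The following is a minimal sketch, not Imageid’s actual implementation, and it uses 64-bit hashes instead of 128-bit ones for brevity. The pruning step is the same triangle-inequality trick as ruling out the restaurants near A:

```go
package main

import (
	"fmt"
	"math/bits"
)

func hamming(a, b uint64) int { return bits.OnesCount64(a ^ b) }

// bkNode is a node in a BK-tree: each child hangs off an edge labeled
// with its Hamming distance to this node's hash.
type bkNode struct {
	hash     uint64
	children map[int]*bkNode
}

// insert walks down the tree, following the edge labeled with the
// distance to the current node, until it finds an empty slot.
func (n *bkNode) insert(h uint64) {
	for {
		d := hamming(n.hash, h)
		if d == 0 {
			return // hash already present
		}
		child, ok := n.children[d]
		if !ok {
			n.children[d] = &bkNode{hash: h, children: map[int]*bkNode{}}
			return
		}
		n = child
	}
}

// search collects all hashes within maxDist of q. A child at edge
// distance d can only contain matches if |d - dq| <= maxDist (triangle
// inequality), so whole subtrees are ruled out without visiting them.
func (n *bkNode) search(q uint64, maxDist int, out *[]uint64) {
	dq := hamming(n.hash, q)
	if dq <= maxDist {
		*out = append(*out, n.hash)
	}
	for d, child := range n.children {
		if d >= dq-maxDist && d <= dq+maxDist {
			child.search(q, maxDist, out)
		}
	}
}

func main() {
	root := &bkNode{hash: 0xFF00FF00FF00FF00, children: map[int]*bkNode{}}
	root.insert(0xFF00FF00FF00FF0F) // distance 4 from root: near
	root.insert(0x0000000000000000) // distance 32 from root: far

	var near []uint64
	root.search(0xFF00FF00FF00FF00, 8, &near)
	fmt.Printf("%d hashes within distance 8\n", len(near)) // 2
}
```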

Go Imageid, go

All the concepts explained so far allowed us to develop Imageid, our similar image indexing service. This is how Imageid works upon receiving a new image to process:

  • Download the image.
  • Calculate the MD5 hash. If the hash is already indexed, the new URL is stored as an exact duplicate of the original image.
  • Calculate the perceptual hash, and search for similar hashes in the database. We consider two images similar when their perceptual hashes have a Hamming distance of at most 8.
  • If no similar hash is found, store the URL, the MD5 hash and the perceptual hash, marking the image as not similar to any other.
  • If a similar hash is found, record the new URL as similar to the original image’s URL.

Imageid was developed in the Go programming language. We picked Go for this project because we wanted a robust service capable of multitasking: it would have to download several images simultaneously, process them, and answer queries from external services. We were not disappointed: Go’s goroutines make programming multithreaded systems a breeze.
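As an illustration of why goroutines fit this workload, here is a hedged sketch of a worker pool in the style Imageid could use; processImage is a placeholder for the real download-hash-index step, not Imageid’s actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// processImage is a stand-in for downloading an image, hashing it,
// and looking it up in the index.
func processImage(url string) string {
	return "indexed " + url
}

// processAll fans URLs out to a pool of worker goroutines and collects
// the results: several images are in flight at any moment.
func processAll(urls []string, workers int) []string {
	in := make(chan string)
	out := make(chan string)

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range in {
				out <- processImage(u)
			}
		}()
	}
	// Close the results channel once every worker has finished.
	go func() { wg.Wait(); close(out) }()

	// Feed the URLs to the pool.
	go func() {
		for _, u := range urls {
			in <- u
		}
		close(in)
	}()

	var results []string
	for r := range out {
		results = append(results, r)
	}
	return results
}

func main() {
	results := processAll([]string{"a.jpg", "b.png", "c.gif"}, 4)
	fmt.Println(len(results), "images processed")
}
```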

We deployed Imageid in March 2015. At the time of this writing (May 2016) it has indexed a grand total of 45 million images. Of those, it detected that 27% are exact duplicates of some other image (i.e., they have the same MD5 hash), and 47% are similar to some other image (i.e., their perceptual hashes have a small Hamming distance). This effectively saves 47% of the moderation effort!

One nice thing about Imageid is that it does not require much hardware to run. It is currently deployed on a single server with an i7 processor and 16 GB of RAM, hosting both the Imageid service and the MariaDB database. Imageid currently processes ~150 images per minute and responds to ~100 queries per second.

Imageid is one of the most fun projects I have ever worked on: I learned the concepts of perceptual hashing and metric trees, and in the process I added a new programming language to my toolbelt. Oh, and did I mention that Imageid is open source?

Imageid is the first open source project published by Taringa! We are very happy to be part of the open source community, and we hope to see Imageid used as part of other projects.
