How reCAPTCHA Works

Plus, how to cheat it, and how it contributes to the common good.

bmelton
Jan 21, 2014

Almost everybody has seen reCAPTCHA, the Google-owned web service that protects forums, blogs and websites from spam, but very few people seem to realize how it works, that its results feed Google’s book digitization project, or that you can usually pass a CAPTCHA by getting it only half right.

Image courtesy of http://irevolution.net/2013/06/17/recaptcha-for-disaster-response/

Book Digitization?

As mentioned, Google uses reCAPTCHA to assist with its book digitization efforts. OCR software does the bulk of the work in converting print copies of old books into editable text, but OCR, even in this day and age, is still not quite good enough to cope with blurry or indecipherable images.

Note that, in the image above, the OCR output for the scan is far from accurate. The result, while almost certainly better than typing the text in manually, cannot be confidently called ‘good’, or really, even close. “This” is interpreted as “nils”, “portion” becomes “pntkm”, and so on. Cross-referencing these results against a dictionary can often rule out nonsense words, such as “pntkm”, but that approach has flaws of its own: a misreading that happens to be a real word slips through, and legitimate but unusual words get flagged. I hesitate to imagine how The Call of Cthulhu might have been interpreted by OCR alone.
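To make the dictionary cross-check concrete, here is a minimal sketch in Python. The word list and the function name `flag_suspect_words` are my own inventions for illustration, and I am assuming “nils” is in the dictionary (it is a plural of “nil”), which is exactly the kind of flaw that lets a misreading through.

```python
# Hypothetical, tiny word list; a real system would load a full dictionary.
# "nils" is included on purpose: it is a real word, so the bad OCR reading
# of "This" would pass a dictionary check unnoticed.
KNOWN_WORDS = {"this", "nils", "portion", "of", "the", "text", "call"}


def flag_suspect_words(ocr_words, dictionary=KNOWN_WORDS):
    """Return the OCR tokens that do not appear in the dictionary."""
    suspects = []
    for word in ocr_words:
        # Ignore surrounding punctuation and letter case when looking up.
        normalized = word.strip(".,;:!?\"'").lower()
        if normalized and normalized not in dictionary:
            suspects.append(word)
    return suspects


print(flag_suspect_words("nils pntkm of the text".split()))  # ['pntkm']
print(flag_suspect_words("The Call of Cthulhu".split()))     # ['Cthulhu']
```

The first call catches “pntkm” but misses “nils”, and the second flags “Cthulhu” even though it is spelled exactly as intended. Both are reasons a dictionary alone is not enough.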

The fix, of course, is human involvement, but not in the way you might think. It would be very expensive and slow to have a human read through every page of OCR output, and very likely, the person hired would not be terribly keen on the job. Beyond that, even dedicating a team of people to the task doesn’t completely protect against human error (nor may that be 100% possible).

If a single person were to read “The Call of Cthulhu”, they might catch the majority of the errors most of the time, but without comparing each word of the OCR output to the scanned original, which would be a very arduous process, the human pass is still likely to leave errors behind.

You can improve the confidence of the results by having multiple editors read the same OCR text, check it against the same original reference book, and then compare their results, but that of course requires more labor and more time. The more editors and cross-checking you add, the more confidence you gain in the result, and the more it costs.
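As a rough illustration of what cross-checking several editors looks like, here is a small sketch that lines up their transcriptions word by word and reports where they differ. The function name and sample sentences are made up, and it assumes every editor produced the same number of words, which real transcripts would not guarantee.

```python
def find_disagreements(transcripts):
    """Return (word_index, variants) for each position where editors differ.

    Assumes all transcripts have the same number of words; zip() silently
    stops at the shortest one otherwise.
    """
    word_lists = [text.split() for text in transcripts]
    disagreements = []
    for index, words in enumerate(zip(*word_lists)):
        variants = sorted(set(words))
        if len(variants) > 1:
            disagreements.append((index, variants))
    return disagreements


editors = [
    "this portion of the text",
    "this portion of the test",
    "this portion of the text",
]
print(find_disagreements(editors))  # [(4, ['test', 'text'])]
```

Only the positions that come back need a second look, which is where the extra editors earn their keep.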

Google, however, uses another process, which I like to call ‘Surprise Editing’.

OMG. What is Surprise Editing?

Surprise Editing is the process of letting millions of people read small portions of the text, having them all transcribe it, saving those results to a pool, and then cross-checking the results. As you’ve probably already figured out, that’s exactly what the reCAPTCHA system is.

Each CAPTCHA includes words that were scanned in from a paper book, and whose OCR transcription Google has low confidence in. If you’ve ever remarked that reCAPTCHA images look a lot like old print copies of books, you were really onto something.

So how does Google know if the answers are correct if it doesn’t know what the words are? Bonus points if you’re also asking why CAPTCHA mechanisms tend to have at least two words in the clue. That’s the real magic behind Google’s reCAPTCHA system. For each CAPTCHA image you see, Google presents two words: one is a ‘clean’ word, whose text they already know, and the other is a ‘dirty’ word, whose text they do not.

When I mentioned earlier that you could usually pass a reCAPTCHA quiz by getting it only half right, this is why: Google can only check your answer against one of the words in the CAPTCHA, because they are counting on you to tell them what the other one is. Sneaky, eh? Let’s look at an example.

In the above image, let’s assume that “modern-day” came from a dictionary of known words, while the word “trieste” comes from a book that Google has scanned. In that case, you would be able to successfully complete the CAPTCHA by typing “asdfa modern-day”. Assuming we guessed correctly, in that “modern-day” is the known-good word and “trieste” is the questionable one, you’d be allowed to pass the CAPTCHA and continue on to whatever it was protecting. Because you successfully read “modern-day”, Google assumes that your answer for “trieste” is probably pretty good, and your transcription of it (asdfa) goes into a database of possible transcriptions.
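The check described above can be sketched in a few lines of Python. To be clear, this is my guess at the logic, not Google’s actual code: the function `check_challenge`, the in-memory `guesses_for_unknown` store, and the image id `scan-0001` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical in-memory store: unknown-word image id -> submitted guesses.
guesses_for_unknown = defaultdict(list)


def normalize(text):
    """Compare answers without regard to case or stray whitespace."""
    return text.strip().lower()


def check_challenge(control_answer, unknown_id, control_position, response):
    """Pass the user if the control word is right; keep the other word as a guess.

    control_position is 0 or 1: which of the two displayed words is the
    known-good one (the user is never told).
    """
    words = response.split()
    if len(words) != 2:
        return False
    typed_control = words[control_position]
    typed_unknown = words[1 - control_position]
    if normalize(typed_control) != normalize(control_answer):
        return False  # control word wrong: fail and discard the guess
    guesses_for_unknown[unknown_id].append(normalize(typed_unknown))
    return True


# The image shows the unknown word first and the control word second.
print(check_challenge("modern-day", "scan-0001", 1, "asdfa modern-day"))    # True
print(check_challenge("modern-day", "scan-0001", 1, "trieste modern-dya"))  # False
print(guesses_for_unknown["scan-0001"])                                     # ['asdfa']
```

Notice that the nonsense answer “asdfa” gets stored while a correct “trieste” paired with a typo in the control word is thrown away. The unknown word is never checked at submission time, only later, against everyone else’s answers.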

Google continues to show “trieste” as a possible word to thousands of other folks entering results into reCAPTCHAs, and after enough of their answers agree, Google can conclude that the image really does read “trieste”.
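Here is one way the consensus step might look, as a minimal sketch. The thresholds (`min_votes`, `min_share`) are numbers I picked for illustration; I don’t know what Google actually uses.

```python
from collections import Counter


def consensus(guesses, min_votes=3, min_share=0.6):
    """Return the agreed transcription, or None if there is no consensus yet.

    min_votes and min_share are illustrative thresholds, not real ones.
    """
    if not guesses:
        return None
    word, votes = Counter(guesses).most_common(1)[0]
    if votes >= min_votes and votes / len(guesses) >= min_share:
        return word
    return None


submitted = ["trieste", "trieste", "asdfa", "trieste", "triste", "trieste"]
print(consensus(submitted))             # trieste
print(consensus(["trieste", "asdfa"]))  # None
```

A junk answer like “asdfa” ends up as a single vote that the majority drowns out, which is why the system can afford to accept it in the first place.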

Of course, if you fail to get the “modern-day” portion correct, your results are not saved, you do not pass the CAPTCHA, and you are presented with another one to try. The catch to getting it half right is that you don’t necessarily know which word is the known word and which is the scanned word, but with a little trial and error, you should be able to get past a CAPTCHA by getting only one of the words correct.

Feel free to play with the reCAPTCHA at the top of this page (http://www.google.com/recaptcha/learnmore).

So How do the Audio reCAPTCHAs Work?

Good question, but with a very simple answer: the audio reCAPTCHAs use only known-good words, presumably drawn from the subset of the master dictionary of known-good words that also have audio recordings.

Altogether, the reCAPTCHA system is pretty clever, and a great way to gamify information gathering. Google’s pretty good at this: they used to have an image game where you ‘competed’ with another person to describe an image shown on the screen within 30 seconds, sort of a “Battleship” of images. Google used the words the two players agreed on in much the same way reCAPTCHA uses agreement between the people solving it.

Also, reCAPTCHA is for a good cause. When I first learned how reCAPTCHA worked, years ago, I made it a point to always use reCAPTCHA rather than any of its competitors, so that I was at least doing a little bit of good in getting books into digital formats in the process.
