reCaptcha: How we are training Google’s AI by proving “I am not a Robot”

Jyoti Malhan
3 min read · May 28, 2020


We have been part of a spectacular journey. Unknowingly, we have contributed to digitising millions of books and magazines containing vast human knowledge. We have also helped bring autonomous cars onto the roads by making AI capable of complex object recognition.

CAPTCHA is an acronym for “Completely Automated Public Turing test to tell Computers and Humans Apart”. It is a human-validation test that many sites use to prevent spam.

reCAPTCHA is a CAPTCHA with a second purpose: the same test that prevents spam was also used to help a book digitisation project.

How did we train Google’s AI to read text?

In text-based reCaptcha there are two words. One is the real test, taken from Google’s dataset of known words, and the other comes from a dataset of unknown words that are yet to be transcribed. To find out what the unknown word says, Google made us type both. As users we don’t know which word is the test, so we try to type both words without any error.
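The two-word idea can be sketched in a few lines of Python. This is only an illustration of the principle, not Google's actual implementation: the names `submit`, `transcription` and `votes`, and the threshold `MIN_AGREEING_VOTES`, are hypothetical and chosen for this example.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching answers we want before
# trusting a transcription. The real value is not given in this article.
MIN_AGREEING_VOTES = 3

# votes[unknown_id] counts each transcription users have typed for that word.
votes = defaultdict(Counter)


def normalise(text):
    """Compare answers case-insensitively and ignore surrounding spaces."""
    return text.strip().lower()


def submit(control_answer, typed_control, unknown_id, typed_unknown):
    """Return True if the user passes; record their guess for the unknown word."""
    if normalise(typed_control) != normalise(control_answer):
        return False  # failed the real test, so the other answer is discarded too
    votes[unknown_id][normalise(typed_unknown)] += 1
    return True


def transcription(unknown_id):
    """Return the agreed transcription, or None if there is no consensus yet."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= MIN_AGREEING_VOTES else None
```

Only users who get the known word right contribute a vote, which is why our answers for the unknown word can be trusted once enough of us agree.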

In this way, millions of us were shown unknown words, and we effortlessly taught Google’s AI to read by feeding highly accurate answers into reCaptcha.

Text based reCaptcha

How did we train Google’s AI to recognise objects?

In 2012, Google introduced reCaptcha with a collage of photos, asking users to label traffic lights and signs in public places. It later expanded to everyday images such as roads, shops, cats, crosswalks, dogs and almost anything else we can think of.

Image based reCaptcha

Let’s understand this with an example. In the snippet below we need to find the rivers. The scenario is the same as with text: one image, already recognised as a river, is the real test, and the remaining five are unknown images used to train the system. The set can contain any number of river images, but at least one of them is a known one, to check that we are human and that we are labelling accurately. Just as with text, everyone ticks the images carefully and gives the right input, as millions of others do, and the result is Google getting a new labelled image in its database.
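Here is a minimal sketch of how such an image grid could be graded and turned into training labels. It is an assumption-laden illustration, not Google's real system: the function names `grade` and `agreement` and the tile ids are made up for this example.

```python
from collections import defaultdict

# Per unknown tile: how many passing users ticked it, and how many saw it.
yes_votes = defaultdict(int)
total_votes = defaultdict(int)


def grade(selected, known_labels, unknown_tiles):
    """Check the ticked tiles against the known tiles, then log the rest.

    selected:      set of tile ids the user ticked
    known_labels:  dict of tile id -> True/False (is it a river?)
    unknown_tiles: tile ids whose label is not known yet
    """
    passed = all((tile in selected) == is_river
                 for tile, is_river in known_labels.items())
    if passed:
        # Only answers from users who passed the real test are counted.
        for tile in unknown_tiles:
            total_votes[tile] += 1
            yes_votes[tile] += int(tile in selected)
    return passed


def agreement(tile):
    """Fraction of passing users who ticked this tile (None if no votes yet)."""
    if total_votes[tile] == 0:
        return None
    return yes_votes[tile] / total_votes[tile]
```

A tile whose agreement stays close to 1 after many users could then be stored as a river image, and one close to 0 as not a river. Where exactly to draw that line is a design choice this sketch leaves open.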

Google made us do the boring work of data preparation 😁

The whole reCaptcha thing is a good lesson for all of us on how to create data to train AI.

In reCaptcha V2 and V3, Google tracks our movements (scrolling, clicking) to check whether we are human or not. What Google is trying to achieve through V2 and V3 is a story for next time!
