How my CAPTCHA completely destroys AIs

Ivan Arena
5 min read · Jun 29, 2023


With the progress made in the field of Artificial Intelligence, most modern CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) have now become obsolete: there are several publicly available AI models capable of recognizing text as well as objects (oftentimes the very subjects of CAPTCHAs) in a matter of milliseconds.

One way to make CAPTCHAs more effective could be to add more and more “noise” to the words or objects (or whatever else) shown, but AIs have proved to work quite well even in noisy environments; furthermore, the tests should always remain reasonably easy for humans to solve, or they would lose their purpose.

The solution I came up with manages to be extremely effective at deceiving even the most recent state-of-the-art models, while still being almost immediately solvable for humans.

How to beat AI

When designing the CAPTCHA, I tried to keep it simple and asked myself what it is that humans can do better than computers. The answer was actually pretty straightforward: there is something seemingly hard-wired into our brains, presumably a result of evolution, that makes us incredibly fast at pattern matching, that is, at associating an element with certain rough features to one we have already seen in the past. This process is almost instantaneous and totally unconscious. Now, you could argue that computers are also quite good at pattern matching, and that is certainly (often) true, but my point here is that we are far better, and I will prove it to you with my CAPTCHA.

The concept is really basic: you pick an image of some subject, you pick another image of a different subject, and you put one on top of the other with some degree of transparency, so that the result is a single image showing the two subjects overlaid (a minimal sketch of the idea follows the figure below).

Figure: a shark image plus a chair image, blended into a single shark-chair overlay.
Source: Shark, Chair, https://osf.io/jum2f/; Shark+Chair, own work
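
Here is a minimal sketch of that overlay step in Python using Pillow. The actual generator is written in Node.js, so this is only an illustration of the idea; the file names, output size and alpha value are assumptions.

```python
# Minimal sketch of the overlay idea using Pillow.
# The real generator is a Node.js service; paths, size and alpha here are illustrative.
from PIL import Image

def overlay_captcha(path_a: str, path_b: str, alpha: float = 0.5, size=(512, 512)) -> Image.Image:
    """Blend two subject images into a single semi-transparent composite."""
    img_a = Image.open(path_a).convert("RGBA").resize(size)
    img_b = Image.open(path_b).convert("RGBA").resize(size)
    # alpha=0.5 gives both subjects roughly equal visual weight
    return Image.blend(img_a, img_b, alpha)

if __name__ == "__main__":
    captcha = overlay_captcha("shark.jpg", "chair.jpg")  # hypothetical input files
    captcha.convert("RGB").save("shark_chair_captcha.jpg")
```

Tuning the alpha value is the interesting part: too close to 0 or 1 and one subject dominates, around 0.5 both stay readable to a human while the pixel statistics stop matching either subject cleanly.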

Take a look at the next few examples and ask yourself how much time it takes you to distinguish the two subjects of each picture (I’m taking for granted that you will, in fact, succeed in distinguishing them).

Figure: a few example CAPTCHAs. Source: own work

Now, take a look at how two famous image segmentation models performed on these examples, and compare that to your own performance.

As you can see, none of the subjects was recognized by the AI. If that’s not enough to convince you, let me show you the tests I ran on a batch of generated CAPTCHAs.

The results in detail

Below is the output of a test suite that runs two widely used AI models, one for image classification and the other for object detection, against a batch of generated CAPTCHAs (in this case, 100 for the image classification test and 1000 for the object detection test) and compares the models’ answers with the ground-truth subjects. A rough sketch of the test loop is shown right below, followed by the actual output.
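
The sketch below shows the kind of check the notebook performs, using the Hugging Face pipelines for ResNet-50 and DETR-ResNet-50. The file names, answer lists and the simple substring matching rule are assumptions for illustration; the real crash tests live in the repository's Jupyter Notebook.

```python
# Rough sketch of the crash-test loop: run an image-classification model and an
# object-detection model over generated CAPTCHAs and count recognized subjects.
from transformers import pipeline

classifier = pipeline("image-classification", model="microsoft/resnet-50")
detector = pipeline("object-detection", model="facebook/detr-resnet-50")

# Each entry: (path to a generated CAPTCHA, the two subjects it contains) -- hypothetical examples
batch = [("captcha_001.jpg", ["baseball", "donut"]),
         ("captcha_002.jpg", ["cat", "owl"])]

def hits(predicted_labels, answers):
    """Count how many of the true subjects appear among the model's labels."""
    return sum(any(ans in lab.lower() for lab in predicted_labels) for ans in answers)

for path, answers in batch:
    cls_labels = [p["label"] for p in classifier(path, top_k=3)]
    det_labels = [p["label"] for p in detector(path)]
    print(f"{path}: answers={answers}, "
          f"classifier found {hits(cls_labels, answers)}/2, "
          f"detector found {hits(det_labels, answers)}/2")
```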

=============================================== ResNet50 Tests =================================================================
Answers = ['baseball', 'donut']; Found: NONE (classified as ['pickup', 'pickup truck', 'police van']);
Answers = ['cat', 'owl']; Found: NONE (classified as ['green snake', 'grass snake', 'electric locomotive']);
[...]
Answers = ['banana', 'chair']; Found: NONE (classified as ['snow leopard', 'ounce', 'Panthera uncia']);
Answers = ['baseball', 'owl']; Found: NONE (classified as ['leaf beetle', 'chrysomelid', 'weevil']);
=================================================================================================================================
Images analysed: 100; subjects recognized: 0.
=================================================================================================================================
============================================== DETR-ResNet50 Tests ==============================================================
Answers = ['car', 'donut']; Found: car (detected with confidence 0.999);
Answers = ['baseball', 'beer']; Found: NONE (['sports ball', 'sports ball'] detected with confidence [0.948, 0.986]);
[...]
Answers = ['orange', 'owl']; Found: NONE ([] detected with confidence []);
Answers = ['beer', 'owl']; Found: NONE (['vase', 'vase'] detected with confidence [0.924, 0.988]);
=================================================================================================================================
Images analysed: 1000; at least one subject recognized: 513; both subjects recognized: 35.
at least one subject recognized: 51.3%; both subjects recognized: 3.5%.
=================================================================================================================================

As you can see, the image classifier always fails: it never recognizes even one of the subjects portrayed. The object detector, instead, actually performs fairly well, although evidently not well enough: it manages to detect one of the two subjects roughly half of the time and, sporadically (usually between 2% and 4% of the time), both of them. In this particular run, both subjects were recognized in only 35 images out of 1000, which means 96.5% of the CAPTCHAs withstood the strongest attack. Moreover, the fraction of fully solved CAPTCHAs is so small that the poorly performing ones could even be removed by hand, since they are always the least noisy and you can spot them by eye. Given these facts, I think I can confidently conclude that my CAPTCHA beats AI (at least for now).

From the idea to the product

Over a weekend, using Node.js along with Express, I developed a solution that automates the generation of these CAPTCHAs and lets anybody make use of them through an API. All the code is publicly available and fully documented in my repository, which also features a basic React demo for you to try out the system. In the repository you can also find a Jupyter Notebook with customisable crash tests that run ResNet-50 and DETR-ResNet-50, widely used models for image classification and object detection, against the CAPTCHAs. A rough sketch of how a client might talk to such an API follows below.

Source: Own work
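
To give an idea of the flow, here is a hedged client-side sketch in Python. The base URL, routes and field names are illustrative guesses, not the repository's documented API; check the repo's documentation for the real endpoints.

```python
# Hypothetical client usage: fetch a generated CAPTCHA from the Express API,
# show it to the user, then submit the user's answer for verification.
# Routes and field names below are assumptions, not the documented API.
import requests

BASE_URL = "http://localhost:3000"  # assumed local dev server

# Ask the service for a fresh overlay CAPTCHA (hypothetical route)
challenge = requests.get(f"{BASE_URL}/captcha/new").json()
print("Show this image to the user:", challenge.get("image_url"))

# Submit the two subjects the user typed (hypothetical route and fields)
verdict = requests.post(
    f"{BASE_URL}/captcha/verify",
    json={"id": challenge.get("id"), "answers": ["shark", "chair"]},
).json()
print("Human?", verdict.get("valid"))
```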
