CAPTCHA 😇

Robots are not allowed

Published in Plain and Simple · 7 min read · Nov 27, 2020

Photo of Wall-E by Jason Leung on Unsplash

Spam is a problem for any website that lets people post comments. A very common countermeasure is the CAPTCHA: a Completely Automated Public Turing test to tell Computers and Humans Apart. Being able to tell humans and computers apart not only helps to prevent spam, it is also one way to slow down brute-force attacks against login systems or content scraping. Slowing those actions down is also called rate limiting. A CAPTCHA can also help with analytics: in the end, you want to analyze the behavior of people, not optimize the website for bots.
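To make the rate-limiting idea a bit more concrete, here is a minimal sketch of a sliding-window limiter that a login endpoint could consult before checking a password. The class name, the limits, and the injectable clock are illustrative assumptions, not taken from any particular framework.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_attempts` per client within `window_seconds`."""

    def __init__(self, max_attempts=5, window_seconds=60.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._attempts = defaultdict(deque)  # client id -> timestamps of recent attempts

    def allow(self, client_id):
        now = self.clock()
        attempts = self._attempts[client_id]
        # Forget attempts that have fallen out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False  # over the limit: reject, or show a CAPTCHA instead
        attempts.append(now)
        return True
```

A site could, for example, only show a CAPTCHA once `allow()` returns `False`, so that regular visitors are rarely bothered.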

The idea behind CAPTCHAs is to give the website user a problem that is hard to solve for a computer program but easy to solve for a human.

Most CAPTCHAs are boring optical character recognition (OCR) problems, but there are more interesting ones.

Image from Oliver Widder (Geek and Poke, CC-BY-3.0)

Distorted Characters Captchas

As a developer creating a website, you can easily print text on an image. For a long time, the reverse operation was pretty hard: given an image with text, extract the text. This is optical character recognition (OCR), a core component of scanning software. This meant you could quickly generate such images, let users type in the characters they saw, and you were done: you could tell humans and bots apart.
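The server side of such a scheme can be sketched in a few lines. This is a minimal sketch under assumptions: the function names and the alphabet are made up for illustration, and drawing the text onto an image is left out because it needs an image library.

```python
import hmac
import secrets

# Assumption: skip look-alike characters such as 0/O and 1/I.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"


def new_challenge(length=6):
    """Return the random text that would be drawn onto the CAPTCHA image."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def check_answer(expected, submitted):
    """Compare case-insensitively, ignoring surrounding whitespace."""
    return hmac.compare_digest(
        expected.upper().encode(), submitted.strip().upper().encode()
    )
```

The expected text would be stored in the user's server-side session and never sent to the browser; only the rendered image is.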

Over time, developers of spambots figured out ways to solve those simple captchas. The response was to change the color of the characters, add background noise, draw lines across the characters, and rotate the characters in the image. The result was something like this:

A typical Captcha. Image from Wikimedia Commons by GWirken (source)

Like many very simple CAPTCHAs, it lacks support for blind people. The usual solution is to add an audio alternative in which you have to recognize a spoken word. That, in turn, causes problems for non-native speakers.

Google did some great work with reCAPTCHA. They used the reader (or spammer) to digitize the books they had scanned. This means that if someone used very good algorithms to bypass their CAPTCHA, the spammer would help Google. This is a very nice way to end up in a win-win situation, isn’t it?

Some of the nice properties of reCAPTCHA back in the day were:

reCAPTCHA had support for blind people.
The characters you have to type in were actual words. This made it a lot easier for humans to recognize the characters.
reCAPTCHA was very easy to use. No need for the developer to install image libraries.
If you can’t read it, just reload it.

Here is a screenshot showing what reCAPTCHA looked like in 2011:

The screenshot was taken by Martin Thoma

Today, this type of reCAPTCHA doesn’t exist anymore. Advances in machine learning led to a situation where well-written bots can recognize those OCR CAPTCHAs better than humans can. Something had to change. xkcd has two suggestions 😁

If you want to see more examples of solved OCR captchas, have a look at PWNtcha. Adam Geitgey shows how you can use machine learning to break OCR captchas. A similar article was written by Roberto Iriondo. There seem to also be businesses around solving captchas. That sounds incredibly fishy.

KittenAuth: Image Classification

I first saw this system as KittenAuth around 2011, but it was certainly implemented by many people with some variations. The idea is simple: the user gets 9 pictures and has to spot the cats:

The KittenAuth system. Source: ThePCSpy.com

I have seen this type of captcha being used in Google products. There I had to select all buses or all pedestrians. Just as with reCAPTCHA and the OCR problem, users solve a simple task and thereby help with an actual problem: annotating a dataset for machine learning algorithms.
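The "spot the cats" idea can be sketched as follows. This is not KittenAuth's actual code: the function names, the grid of 9 with 3 cats, and the image identifiers are assumptions for illustration.

```python
import random


def build_challenge(cat_images, other_images, grid_size=9, cats_in_grid=3, rng=random):
    """Pick a shuffled grid of image ids and the set of ids that show cats."""
    cats = rng.sample(cat_images, cats_in_grid)
    others = rng.sample(other_images, grid_size - cats_in_grid)
    grid = cats + others
    rng.shuffle(grid)
    return grid, set(cats)


def verify_selection(expected_cats, selected):
    """The user passes only by selecting exactly the cat images."""
    return set(selected) == expected_cats
```

In a real deployment the set of cat ids would stay on the server, the image URLs would not reveal the label, and `rng` would be a cryptographically strong source such as `random.SystemRandom()`.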

Basic human knowledge: NLP

Assuming you know the language of your visitors, you can ask them very basic questions: questions which require context and are easy to answer for a human, but hard to answer for a machine. Text CAPTCHA is an example of this type. They ask you questions like:

Is “milk”, “hotel” or “brain” a body part?
How many letters are in the word “devotional”?
The word “tamers” has which letter in the 2nd position?
Enter the smallest number of 28, thirteen, twenty, 60, fifty six or 78:
Is the knee, leg, ear, or ankle above the waist?

I don’t think this type is very good, as the spammer has to do almost the same amount of work as the programmer. The spammer has to parse the different types of questions, which I guess isn’t too hard, while it is hard for the programmer to automatically generate many different types of questions.

The spammer might just ask Google: what is 7 minus 3 times 2? or what is the number of horns on a unicorn times the answer to life, the universe, and everything?.
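To illustrate how little work the parsing can be, here is a minimal sketch of a bot that handles two of the question templates listed above with regular expressions. The function name `solve` and the exact patterns are assumptions; a real bot would need one pattern per template it has observed.

```python
import re


def solve(question):
    """Answer two known question templates; return None for anything else."""
    # Normalize typographic quotes so one pattern covers both styles.
    q = question.replace("“", '"').replace("”", '"')

    m = re.search(r'How many letters are in the word "(\w+)"', q)
    if m:
        return str(len(m.group(1)))

    m = re.search(
        r'The word "(\w+)" has which letter in the (\d+)(?:st|nd|rd|th) position', q
    )
    if m:
        word, position = m.group(1), int(m.group(2))
        if 1 <= position <= len(word):
            return word[position - 1]

    return None  # unknown template: the bot gives up or requests a new question
```

Questions that need world knowledge, such as which word names a body part, are not covered by this sketch and would require a lookup table or a search engine.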

Mathematics

I’ve seen some CAPTCHAs asking basic math questions like 2 + 9 = …. They combine the OCR problem with arithmetic, but they are still very easy to bypass if you want to write a bot.
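Once the question is available as text (the OCR step is assumed to be solved already), the arithmetic part takes only a few lines. The function name and the supported operators are assumptions for this sketch.

```python
import operator
import re

# Only the operators a basic math CAPTCHA is likely to use.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def solve_arithmetic(text):
    """Evaluate 'a <op> b' at the start of the text; return None if it doesn't match."""
    m = re.match(r"\s*(\d+)\s*([+\-*])\s*(\d+)", text)
    if m is None:
        return None
    left, op, right = int(m.group(1)), m.group(2), int(m.group(3))
    return OPS[op](left, right)
```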

Sometimes they are not that easy:

Hard math CAPTCHA — found on random.irb.hr. Screenshot taken by Martin Thoma

This is an interesting aspect of CAPTCHAs: You could also filter humans. In this case, you would filter all people who didn’t receive a good math education.

Social CAPTCHA: Face recognition

In 2011, Facebook wrote about its social CAPTCHA (source). The idea is that you know the names of your friends, but a stranger doesn’t.

PLAYTHRU

To get through the PLAYTHRU CAPTCHA, you have to play a short game. It was available via areyouhuman.com. I love the idea and I’m pretty certain this works well, as long as the game is complex enough and solving the captcha automatically doesn’t become too interesting for attackers. Again, the issue here is that it doesn’t scale: the developer who designs the captcha game likely has to invest a similar amount of time as the attacker does.

Screenshot of the app taken by Martin Thoma

A short demo of the PLAYTHRU system

Technical Burdens: JavaScript

For many bots, executing JavaScript like a normal browser is quite a burden. But being a bit harder than the normal access does not mean it’s impossible. Filip Vitas shows how to bypass a slider captcha.

The Invisible Captcha explores the possibility of a JavaScript CAPTCHA. Combining this with a honeypot CAPTCHA might be powerful. The idea behind a honeypot captcha is to provide something that attracts bots to interact with it, but that humans would not interact with.
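A common form of honeypot is a form field that is hidden from people but visible to bots that read the raw HTML. The following is a minimal sketch; the field name `website`, the form markup, and the function name are hypothetical.

```python
HONEYPOT_FIELD = "website"  # hypothetical field name; hidden from humans via CSS

FORM_HTML = """
<form method="post" action="/comment">
  <textarea name="comment"></textarea>
  <!-- Hidden from people by CSS; naive bots tend to fill in every field. -->
  <input type="text" name="website" style="display:none" tabindex="-1" autocomplete="off">
  <button type="submit">Send</button>
</form>
"""


def looks_like_bot(form_data):
    """Flag the submission if the hidden honeypot field was filled in."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())
```

A human never sees the field and leaves it empty, so any non-empty value is a strong hint that a script filled in the form. A bot that renders CSS or skips hidden fields will not be caught by this check alone.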

Modern Systems

reCAPTCHA is still alive and keeps evolving. The latest version is only a checkbox which you have to tick. How the algorithm works is not publicly known, but there is some speculation about how it’s done. The idea is to take into account the behavior of the user on the web, not only on the single page but across multiple pages.

Alternatives to reCAPTCHA are MTCaptcha, VisualCaptcha, Geetest, and hCaptcha.

Summary

Over the years, many different creative CAPTCHA systems were developed and also disappeared. The original distorted-character CAPTCHA system is still around in many places, but provides little value due to machine learning. Harder machine learning problems can be a better solution, but a combination will likely yield the best results: (1) make it hard by requiring JavaScript, (2) add a honeypot, and (3) use something other than distorted characters.

What’s next?

In this series about application security (AppSec), we have already explained some of the techniques of attackers 😈 and of defenders 😇:

Part 1: SQL Injections 😈
Part 2: Don’t leak Secrets 😇
Part 3: Cross-Site Scripting (XSS) 😈
Part 4: Password Hashing 😇
Part 5: ZIP Bombs 😈
Part 6: CAPTCHA 😇
Part 7: Email Spoofing 😈
Part 8: Software Composition Analysis (SCA) 😇

And these are still to come:

CSRF 😈
DOS 😈
Credential Stuffing 😈
Cryptojacking 😈
Single-Sign-On 😇
Two-Factor Authentication 😇
Backups 😇
Disk Encryption 😇

Let me know if you are interested in more articles around AppSec / InfoSec!