Breaking captchas from scratch (almost)

Javier Segovia
lemontech-engineering
6 min read · Dec 27, 2016

Breaking captchas using ImageMagick + Tesseract

If you need to automate tasks or extract data (web scraping) from a site, you may run into the old, annoying captchas (because the site doesn’t offer a public API we could request politely).

This post shows how to solve this type of captcha (not Google reCAPTCHA) using ImageMagick and Tesseract.

If you’re not familiar with Tesseract, it’s an OCR (Optical Character Recognition) engine originally created by HP, and it’s what we’ll use to recognize the characters in an image. OCR is not a silver bullet: its training is based on clean images, so to get the best results from Tesseract you have to optimize the images first. To achieve that, you should do the following:

1. Clean: Captchas usually present “noise” to keep a naive OCR from breaking them and to force real humans to solve them. This noise can be dots, stripes, distortions, and so on. It means we have to remove every non-alphanumeric element from the image.

2. Binary image: To help the OCR, the image should be reduced to pixels that are either black or white; it should have no gray areas (or color areas, duh!). Keeping only “true positive” pixels makes it easier to detect the features and patterns that identify each character.

3. Remove blank spaces: Not processing blank spaces improves both performance (fewer pixels to read) and results.

4. Configuration: Tesseract has tons of configuration options, but for this task only a few of them are useful. https://github.com/gali8/Tesseract-OCR-iOS/wiki/Advanced-Tesseract-Configuration

Now let’s write an example!

This project will be written in Ruby, so we’ll need the rmagick and rtesseract gems to interact with ImageMagick and Tesseract.

First, let’s write a class that generates captchas.

This code is extracted from rcaptcha; I edited it to support a custom image resolution.

To generate a captcha, we write:

require_relative 'captcha'
captcha_path = 'captcha.jpg'
text = 'foobar'
width = 400
height = 200
text_size = 80
captcha = Captcha.generate text, width, height, text_size
File.open(captcha_path, 'wb') { |f| f.write(captcha) }

This will give us an image like this:

FYI, if you send this image to an OCR, you will probably get a perfect result, since it’s easy to read. But if you combine uppercase and lowercase letters with numbers, you probably won’t succeed. Anyway, this post is about showing a way to solve it.

For testing, we’ll create random text for the captchas using the faker gem:

...
require 'faker'
text = Faker::Lorem.characters(6)
captcha = Captcha.generate text, width, height, text_size
File.open(captcha_path, 'wb') { |f| f.write(captcha) }

Now let’s improve the image to extract characters with OCR.

We should understand the captcha first. We know it always has roughly centered text with six characters: lowercase letters and numbers. The characters are blue, and there are a lot of dots (noise) in different colors, some even the same blue as the characters. So, let’s first crop the image to the known location:

# Read image
img = Magick::Image.read(captcha_path).first
# args X, Y, width, height
img.crop! 50, 60, 300, 80
img.write 'captcha_solved.jpg'

Then we’ll scale the image down to reduce computation. Be careful when doing this: the smaller the image, the less information remains, and in this case the information is the pixels available to read from.

img.scale! 0.75
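To make the trade-off concrete, here is a quick back-of-the-envelope pixel count for the 0.75 factor (assuming the 300×80 crop from the previous step):

```ruby
# Pixel counts before and after scaling the 300x80 crop by 0.75.
before  = 300 * 80                               # 24000 pixels
after   = (300 * 0.75).to_i * (80 * 0.75).to_i   # 225 * 60 = 13500 pixels
savings = 1.0 - after.to_f / before              # ~44% fewer pixels to read
```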

Now, convert the image to grayscale:

# transform image into gray scale 
img = img.quantize(128, Magick::GRAYColorspace)

We do this to make the noise easier to clean. Now we can force into white every pixel above a threshold we define; in this case I chose 180 on the usual 0–255 scale (the multiplication by 256 maps it onto ImageMagick’s 16-bit channel range of 0–65535).

# force every pixel above the threshold to white
img = img.white_threshold(180 * 256)
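To see what the thresholding does numerically, here is the same idea applied to plain 16-bit channel values (a hypothetical illustration on a flat array, not RMagick itself):

```ruby
# RMagick channels are 16-bit (0..65535), so 180 on the usual 0..255
# scale becomes 180 * 256. Values above the threshold are forced to
# pure white; darker values (our blue-ish characters) are left alone.
THRESHOLD = 180 * 256  # 46080
pixels = [10_000, 50_000, 30_000, 65_000]
pixels.map { |p| p > THRESHOLD ? 65_535 : p }
# => [10000, 65535, 30000, 65535]
```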

The remaining dots are equal or similar in color to the characters, but believe it or not, this step helps us a LOT! Now let’s convert the pixel colors into “binary colors”: just 0 or 255.

# transform image into binary colors
img = img.quantize(2, Magick::GRAYColorspace)

We still have these annoying dots, and if you look closely, there are white dots or blank spaces inside our characters. Because they’re relatively small they can be harmless, but what if they were bigger? That could be a huge problem, because the OCR might recognize the character as a different one: just imagine an “8” with one of its curves erased by a blank spot turning into a “3”, and vice versa.

This is an issue we need to handle, and to do that we can “average” neighboring pixels, converting each pixel into the average value within a radius.

I like to add a blank border to the image first, so noise near the edges can also be cleaned within the radius:

# Add border to avoid noise there
img.border!(5, 5, 'white')

Because there are fewer black noise dots than white spaces, I’ll start by cleaning those, using a radius of 2 pixels, since the dots are just 1 pixel in size:

process img, 'white', 2

And then, fill the blank spots inside the characters using a bigger radius:

process img, 'black', 3
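The `process` helper itself lives in the example repo and isn’t shown here, so here is a hypothetical pure-Ruby sketch of the idea, working on a plain 2D array of 0 (black) / 255 (white) values: a pixel flips to the target color only when its neighborhood within the radius already leans that way, so isolated specks get absorbed.

```ruby
# Hypothetical sketch of a `process`-style neighbourhood filter.
# grid   : 2D array of 0/255 values
# target : the color (0 or 255) we want to absorb stray pixels into
# radius : how far around each pixel we look
def despeckle(grid, target, radius)
  h = grid.length
  w = grid.first.length
  grid.each_with_index.map do |row, y|
    row.each_with_index.map do |pixel, x|
      neighbours = []
      (-radius..radius).each do |dy|
        (-radius..radius).each do |dx|
          ny = y + dy
          nx = x + dx
          next if dy == 0 && dx == 0
          next if ny < 0 || nx < 0 || ny >= h || nx >= w
          neighbours << grid[ny][nx]
        end
      end
      avg = neighbours.sum / neighbours.size.to_f
      leaning = avg >= 128 ? 255 : 0
      # flip only when the neighbourhood already leans toward `target`
      leaning == target ? target : pixel
    end
  end
end

# A lone 1-pixel black dot in a white field gets absorbed:
grid = Array.new(5) { Array.new(5, 255) }
grid[2][2] = 0
despeckle(grid, 255, 2)[2][2] # => 255
```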

Now let’s soften the edges using a Gaussian blur:

# soft edges
img = img.gaussian_blur 0.5, 0.5
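As a toy illustration of why blurring softens edges (this is a plain moving average on a 1D array, not RMagick’s actual Gaussian kernel):

```ruby
# Each value is averaged with its immediate neighbours (edges clamped),
# turning the hard 0 -> 255 jump into a gradual ramp.
values = [0, 0, 255, 255]
soft = values.each_index.map do |i|
  left  = values[[i - 1, 0].max]
  right = values[[i + 1, values.length - 1].min]
  (left + values[i] + right) / 3
end
soft # => [0, 85, 170, 255]
```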

And lastly, we trim the image to remove blank spaces:

img.fuzz = 1
img.trim!
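`trim!` crops away border rows and columns that match the background color, with `fuzz` allowing near-matches. Here is a hypothetical pure-Ruby equivalent on a 0/255 array (exact matches only, no fuzz tolerance):

```ruby
# Drop all-white rows and columns, but only from the borders.
def trim_white(grid)
  white = ->(row) { row.all? { |p| p == 255 } }
  rows = grid.dup
  rows.shift while rows.any? && white.call(rows.first)
  rows.pop   while rows.any? && white.call(rows.last)
  return [] if rows.empty?
  first = rows.map { |r| r.index  { |p| p != 255 } }.compact.min
  last  = rows.map { |r| r.rindex { |p| p != 255 } }.compact.max
  rows.map { |r| r[first..last] }
end

grid = [
  [255, 255, 255, 255],
  [255,   0,   0, 255],
  [255, 255, 255, 255]
]
trim_white(grid) # => [[0, 0]]
```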

Now our image is ready to read. Remember the Tesseract configuration I mentioned before? There are also some parameters you can pass when running Tesseract:

https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage

But three settings are essential:

  1. PSM: Because this kind of captcha is a single “word”, we set the value to 7 to treat the image as a single line of text.
  2. LANG: Tesseract has default training sets for different languages, and because we’re not reading special characters we set English (“eng”) as the language.
  3. OPTIONS: We can pass a lot of parameters to the OCR, but let’s keep it simple: just set a whitelist of the characters we expect in the image, in this case “abcdefghijklmnopqrstuvwxyz1234567890”.

For the options, we need to create a file with the parameters and save it, under whatever name we want, in the Tesseract config directory:

# /usr/share/tesseract-ocr/tessdata/configs/captcha
tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz1234567890

And finally, let’s try to solve the captcha with the OCR:

require 'rtesseract'
solved_path = 'captcha_solved.jpg'
img.write solved_path
text = RTesseract.new(solved_path,
                      lang: :eng,
                      options: :captcha,
                      psm: 7)
text.to_s_without_spaces # => "16qe9o"

If we pass the original image to the OCR, it returns an empty string because it can’t make out the text.

I created a repo with this example, including a script to test the solution. It reaches about 80% accuracy, which is really useful: if we can’t solve a particular captcha, we can just refresh it and try again.

Conclusion

There are more ways to solve captchas. A smarter one is machine learning: you can build a dataset of multiple types of captchas with different noise and use it to train a model with better accuracy. But if you can’t do that, I hope this post is useful to you.


Software Engineer, Game Dev, AI enthusiast (In Skynet We Trust), hobbyist photographer and sarcasm native language speaker. CTO at sosafeapp.com