Solving CAPTCHAs Using PyTorch (Without Using OCR)

“If I couldn’t pass a captcha, does that mean I am a robot?”

Saksham Sarwari
The Startup
Jul 13, 2020 · 7 min read


Cover image: Class Activation Maps (CAMs)

After completing Jeremy Howard’s excellent deep learning course, I was wondering if I could crack real-world CAPTCHAs with basic neural nets instead of the conventional OCR technique. I decided to give it a try using fastai’s PyTorch-based library and was able to build a model that correctly solves over 93% of CAPTCHAs. Here is a very basic outline of the two approaches I used (working code for both can be found here):

  1. Solving for one character at a time (and analyzing it with Class Activation Maps)
  2. Solving the whole captcha in one go using complete one-hot encoding

Let’s start with a general question pertinent to the “current AI surge”:

Is “Deep Learning” worth the hype?

It is, indeed. The reason is the powerful theorem it rests on: the Universal Approximation Theorem (UAT). In simple terms, the UAT says that:

you can always come up with a deep neural network that will approximate any complex relation between input and output.

Dataset

The images in the dataset look like this:

A sample captcha from the dataset, labeled “eigment”

The dataset has the following properties:

  • Each captcha consists of either 6 or 7 characters
  • A character might appear multiple times in a single captcha

The label for each image is given by its filename: ‘eigment.png’ for the example above. This allows for easy extraction of the labels while training.
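
Since the filename is the label, extracting labels is a one-liner. A minimal sketch (the directory layout and helper names are illustrative, not taken from the original code):

```python
from pathlib import Path

def captcha_label(path: Path) -> str:
    """Full captcha text, e.g. 'eigment' for 'eigment.png'."""
    return path.stem

def first_char_label(path: Path) -> str:
    """Label used by the position-1 classifier in Part 1."""
    return path.stem[0]

img = Path("data/captchas/eigment.png")   # illustrative path
print(captcha_label(img))                 # -> 'eigment'
print(first_char_label(img))              # -> 'e'
```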

Part-1: One character at a time

In this approach, we train a separate model for each character position. Then, we run these models one by one on an image to solve for each character of the captcha.

Solving for the first character of the captcha

This is a typical classification task. The input is the image of the captcha, the output is a single label corresponding to the first character.

First character training sample with labels

Now we train the model as a normal image-classification problem. For this, we first define our learner object with a particular architecture (ResNet-50 here) and then start training with the usual “fit_one_cycle” function:
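
The original shows this step as a screenshot; below is a rough equivalent assuming the fastai v2 API (the dataset path, image size and validation split are my assumptions):

```python
from fastai.vision.all import *

path = Path("data/captchas")                       # illustrative dataset location

def label_first_char(fname):                       # first character of the filename
    return fname.name[0]

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), label_func=label_first_char,
    valid_pct=0.2, seed=42, item_tfms=Resize((60, 200)))

learn = cnn_learner(dls, resnet50, metrics=accuracy)
learn.fit_one_cycle(14)                            # first pass: 14 epochs
```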

Result after 14 epochs

Initially, I trained for only 14 epochs because I wanted to see which pair of characters the model struggles with the most.

Actual vs predicted labels

From this confusion matrix, we can clearly see the “difficult” pair. Unsurprisingly, it’s “m” & “n”; intuitively too, I’d say “m” and “n” are hard to tell apart.

We now train for the full 20 epochs and achieve an accuracy of 98% for the first character.

Result after complete training

Similarly, we train a model for each position of the captcha (from 1 to 7). For the CAPTCHAs that have only 6 characters, we assign a unique label “X” as the 7th character. This works pretty well, and the model learns to read the white space at the end of a 6-character image as the label “X”.

CAM (Class Activation Map)

CAMs tell us which part of the input image is important in the classification process. For the above example, we’d expect high activation around the first character.
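
The post builds these maps with fastai; the sketch below applies the classic CAM recipe to a plain torchvision ResNet-50 instead (last-block feature maps weighted by the final layer’s weights for the chosen class), so the model head, number of classes and input size are assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(num_classes=26).eval()        # illustrative: 26 character classes

feats = {}
def save_feats(module, inp, out):              # forward hook on the last conv block
    feats["act"] = out.detach()
model.layer4.register_forward_hook(save_feats)

@torch.no_grad()
def class_activation_map(img, cls):
    """img: (1, 3, H, W) tensor; cls: class index to visualise."""
    model(img)                                 # populates feats["act"] via the hook
    act = feats["act"][0]                      # (2048, h, w) feature maps
    w = model.fc.weight[cls]                   # (2048,) weights for this class
    cam = F.relu(torch.einsum("c,chw->hw", w, act))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    # upsample to the input resolution so it can be overlaid on the captcha
    return F.interpolate(cam[None, None], img.shape[-2:], mode="bilinear")[0, 0]

x = torch.randn(1, 3, 60, 200)                 # dummy input just to exercise the sketch
cls = model(x).argmax(1).item()
heatmap = class_activation_map(x, cls)         # (60, 200) attention map
```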

Similarly, for positions 2, 3 and 4:

Class Activation Maps

Training all 7 classifiers (one for each position)

We loop our previous code to train classifiers for all 7 positions:
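
The loop itself is only shown as a screenshot in the original, so here is a sketch of what it might look like with the fastai setup assumed earlier (keeping the learners in a list and the label helper are my choices):

```python
from functools import partial
from fastai.vision.all import *

path = Path("data/captchas")                       # illustrative dataset location

def char_at(pos, fname):
    """Character at position `pos` of the filename; 'X' pads 6-character captchas."""
    name = fname.stem
    return name[pos] if pos < len(name) else "X"

learners = []
for pos in range(7):                               # positions 1..7 (0-indexed here)
    dls = ImageDataLoaders.from_name_func(
        path, get_image_files(path), label_func=partial(char_at, pos),
        valid_pct=0.2, seed=42, item_tfms=Resize((60, 200)))
    learn = cnn_learner(dls, resnet50, metrics=accuracy)
    learn.fit_one_cycle(20)                        # 20 epochs per position, as above
    learners.append(learn)
```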

Here are the final accuracies I got for each position:

  • First character : 98.34%
  • Second character : 97.16%
  • Third character : 96.45%
  • Fourth character : 94.09%
  • Fifth character : 94.44%
  • Sixth character : 96.57%
  • Seventh character : 98.46%

Running all 7 models on a test case:
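
A sketch of this step, assuming the `learners` list built in the training-loop sketch above (the padding label “X” is stripped from the output):

```python
def solve_captcha(img_path, learners):
    """Predict each position with its own model and stitch the string together."""
    chars = []
    for learn in learners:
        pred, _, _ = learn.predict(img_path)          # predicted character for this position
        chars.append(str(pred))
    return "".join(c for c in chars if c != "X")      # drop the 6-character padding label

print(solve_captcha("data/captchas/eigment.png", learners))  # expected: 'eigment'
```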

Hence, we have successfully solved our captcha.

Now let’s look at another approach where, instead of solving character by character, we solve the whole captcha in one go using one-hot encoding.

Part-2: Whole captcha in one go

One naive approach is to use the whole captcha text as the label for the image. This would make it a typical classification task. However, since every image contains a unique captcha, this approach gives each image its own label, and each label has only one corresponding image: “ankaser” would be just as different from “ankarse” as it is from “jazched”. It ignores the fact that a captcha is made up of 7 parts. A model trained this way might memorize the training images, but clearly it won’t be predictive on new data. Hence, let’s think of some other approach.

We know that 26 different characters (a–z) are present in our dataset. We can model a captcha as a vector of length 26, where each element of the vector corresponds to one of the possible characters (a–z). The number at each index indicates the position at which the corresponding character appears in the captcha: 1 for position one, 2 for position two and so on, and 0 if the character is not present in the captcha.

Using this encoding, for the captcha “damble”, we have :

Left: Character domain, Middle: Encoded label, Right: Actual label
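
A small sketch of this encoding (the exact representation in the original code may differ):

```python
import string
import torch

CHARS = string.ascii_lowercase                 # the 26 possible characters

def encode_positions(captcha: str) -> torch.Tensor:
    """Length-26 vector: entry = 1-based position of that character, 0 if absent."""
    vec = torch.zeros(len(CHARS))
    for pos, ch in enumerate(captcha, start=1):
        vec[CHARS.index(ch)] = pos             # a repeated character overwrites its own entry
    return vec

print(encode_positions("damble"))              # a=2, b=4, d=1, e=6, l=5, m=3, rest 0
```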

But there is still one problem: a character can appear multiple times in the same captcha, and this encoding cannot represent that. We can overcome this problem by employing complete one-hot encoding.

Full one-hot encoding

In the character-by-character classification approach, the character at position i was represented as a vector of length 26. Encoding the whole captcha would lead to a 26 by 7 matrix. The columns of the matrix correspond to the one-hot encoded character at the given position. Flattening this encoding matrix leads to a one-dimensional vector of length 26*7=182.

Left: Encoding each position individually, Middle: Flattening it to a single vector, Right: Actual label
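
A sketch of the full one-hot encoding (how a 6-character captcha’s empty 7th position is handled isn’t spelled out in the post, so leaving that column all zero is my assumption):

```python
import string
import torch

CHARS = string.ascii_lowercase                 # 26 characters
N_POS = 7                                      # up to 7 positions

def encode_full_onehot(captcha: str) -> torch.Tensor:
    """Flattened 26 x 7 = 182-long multi-hot target vector."""
    mat = torch.zeros(len(CHARS), N_POS)       # one column per character position
    for pos, ch in enumerate(captcha):
        mat[CHARS.index(ch), pos] = 1.0        # one-hot column for this position
    return mat.flatten()                       # a 6-char captcha leaves column 7 all zero

print(encode_full_onehot("damble").shape)      # torch.Size([182])
```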

After some tweaks (applying weight decay and other regularization techniques), this model trains extremely well. After 70 iterations we reach 94% accuracy on the validation set.
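
Decoding a prediction is the inverse: reshape the 182 outputs back into a 26 x 7 grid and take the most likely character per column. The commented-out training call shows one plausible fastai setup; the loss function and hyperparameters there are assumptions, not the post’s exact settings:

```python
import string
import torch
import torch.nn as nn                          # used by the commented training call below

CHARS = string.ascii_lowercase
N_POS = 7

def decode_full_onehot(pred: torch.Tensor) -> str:
    """Invert the 182-long encoding: most likely character at each position."""
    mat = pred.view(len(CHARS), N_POS)         # back to a 26 x 7 grid
    return "".join(CHARS[int(mat[:, p].argmax())] for p in range(N_POS))

# One plausible fastai setup for the 182-output model (assumed, not verbatim):
# learn = cnn_learner(dls, resnet50, loss_func=nn.BCEWithLogitsLoss(),
#                     n_out=26 * N_POS, wd=0.1)
# learn.fit_one_cycle(70)

print(decode_full_onehot(torch.rand(26 * N_POS)))  # random logits -> some 7-char string
```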

Making a “Captcha-solver” web-app

Using Flask (a micro web framework written in Python), I developed a small web app so that this trained model can be used by anyone, regardless of their knowledge of Python or deep learning.

On the homepage, the user uploads an image of a captcha.

After the image is uploaded, the app loads the trained model in the back end, feeds the uploaded image to the model, and then prints the result as shown below:
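
The author’s app is linked below; here is a minimal sketch of the same flow with Flask and a fastai learner (the routes, the inline HTML form, and the model filename are my assumptions):

```python
import os
import tempfile

from flask import Flask, request
from fastai.vision.all import load_learner

app = Flask(__name__)
model = load_learner("captcha_model.pkl")      # exported fastai learner (assumed filename)

@app.route("/")
def home():
    # bare-bones upload form standing in for the real homepage template
    return ('<form method="post" action="/solve" enctype="multipart/form-data">'
            '<input type="file" name="captcha"> <input type="submit" value="Solve">'
            '</form>')

@app.route("/solve", methods=["POST"])
def solve():
    fd, tmp_path = tempfile.mkstemp(suffix=".png")
    os.close(fd)
    request.files["captcha"].save(tmp_path)    # persist the upload so fastai can read it
    pred, _, _ = model.predict(tmp_path)       # run the captcha through the model
    os.remove(tmp_path)
    return f"Predicted captcha: {pred}"

if __name__ == "__main__":
    app.run(debug=True)
```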

You can find working code for the web-app here.

This concludes my blog post about solving captchas with deep learning.

In closing, I would like to emphasize once more the power of deep learning and the underlying Universal Approximation Theorem:

A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly.

— Ian Goodfellow, Deep Learning Book

This is an incredible statement. If you accept that most classes of problems can be reduced to functions, this statement implies that a neural network can, in theory, solve any problem.

Maybe someday I’ll take the derivative of my up-votes and update my writing-style in the direction that maximizes views.

[1]: Oliver Müller. (June 8, 2019). Solving Captchas with DeepLearning
https://medium.com/@oneironaut.oml/solving-captchas-with-deeplearning-part-1-multi-label-classification-b9f745c3a599

[2]: Brendan Fortuner. (March 8, 2017). Can neural networks solve any problem?
https://towardsdatascience.com/can-neural-networks-really-learn-any-function-65e106617fc6

[3]: Michael Nielsen. (Dec 26, 2019). A visual proof that neural nets can compute any function
http://neuralnetworksanddeeplearning.com/chap4.html
