Solving Captchas with DeepLearning — Part 3: One model to solve it all

Oliver Müller
Jun 23, 2019 · 4 min read


This is the final part of my mini-series about solving captchas. The first part demonstrated, via multi-label classification, that deep learning is a viable way to approach this problem. The second part went a step further and essentially solved the task with multiple single-character classifiers. The notebook for this part is here.

The obvious drawback of using Part 2’s multiple-classifier approach is that we have to train and evaluate multiple models. It should be possible to train a single model to solve this task. But how would we frame this problem?

Whole text as label

The easiest way would be to use the whole captcha text as the label for the image. This would turn the task into a standard classification problem.

Using the full captcha text as label

However, this would make all captcha texts completely independent of each other: AUT3N would be just as different from AUT4N as it is from 74OMU. The approach would only work if we had multiple captchas for every possible solution text, and it ignores the fact that a captcha is made up of 5 individual characters. What else could we do?

Output position of character

In the dataset, only 19 different characters were present, so we could model a captcha as a vector of length 19. Each element of the vector corresponds to one of the possible characters, and its value indicates at which position that character appears in the captcha: 1 for position one, 2 for position two, and so on, with 0 meaning the character does not appear at all.

Left: Encoding vector, Middle: Encoded label, Right: Actual label

I think this would work quite well. The only problem is that it does not cover the entire range of possible captcha configurations: a character can appear multiple times, and that cannot be represented in this approach.
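To make the idea and its limitation concrete, here is a small sketch. This is not code from the notebook, and the character set below is a made-up placeholder for the 19 characters that occur in the dataset:

import numpy as np

chars = list("2345678ACDEHKMNPTUZ")  # placeholder for the 19 dataset characters
char_to_index = {c: i for i, c in enumerate(chars)}

def to_position_vector(captcha_text):
    """Encode a captcha as a length-19 vector: entry i holds the position (1-5)
    of character i in the captcha, or 0 if that character does not appear."""
    vec = np.zeros(len(chars), dtype=int)
    for position, char in enumerate(captcha_text, start=1):
        vec[char_to_index[char]] = position
    return vec

print(to_position_vector("AUT3N"))  # every character appears once: fine
print(to_position_vector("AUTAN"))  # 'A' appears twice: only its last position survives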

Full one-hot encoding

In the single-character classification approach, the character at position i was represented as a one-hot vector of length 19. Encoding the whole captcha this way leads to a 19 by 5 matrix, whose columns are the one-hot encoded characters at the corresponding positions:

Left: Encoding each position individually, Middle: Flattening the encoding matrix to a single vector, Right: Actual label

Flattening this encoding matrix leads to a one-dimensional vector of length 19*5=95. This approach can encode every possible captcha and can be used directly for training the CNN.

Doing the encoding is fairly easy. First, we’ll enumerate the 19 different characters.

# one dictionary to map each character to an index, and one to map back
encoding_dict = {l: e for e, l in enumerate(labels)}
decoding_dict = {e: l for l, e in encoding_dict.items()}

Next, we turn a given label into the 19 by 5 matrix:

def to_onehot(label):
    onehot = np.zeros((19, 5))
    for column, letter in enumerate(label):
        onehot[encoding_dict[letter], column] = 1
    return onehot.reshape(-1)

The reshape(-1) at the end flattens the matrix into the final vector of length 95.
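Going the other way, from a length-95 vector back to text, is just as short. The decoder below is my own sketch (not from the notebook), built on the decoding_dict defined above and written for NumPy arrays:

def to_text(onehot_vector):
    """Decode a length-95 vector back into the 5-character captcha text."""
    matrix = onehot_vector.reshape(19, 5)
    # the argmax of each column is the character index at that position
    return "".join(decoding_dict[i] for i in matrix.argmax(axis=0))

# round trip: to_text(to_onehot("AUT3N")) == "AUT3N"

The same decoding step can later be applied to the network's output to read off its prediction.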

Since I considered captcha solving a classification task, I first tried cross-entropy as the loss function: reshape the network's output to the 19 by 5 matrix, compute the cross-entropy for every column (i.e. every character), and take the mean.

Using cross-entropy loss for every position, then adding those losses up.

This worked very well at the beginning of training, when only the linear layers were trained. However, after unfreezing the model, convergence was really slow. So instead I made it a regression task with mean squared error (MSE) as the loss function.
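For reference, here is how the two loss variants could look in PyTorch. This is my own sketch rather than the notebook's code; it assumes the network outputs a batch of length-95 vectors and that outputs and targets are reshaped with the same row-major convention as to_onehot:

import torch.nn.functional as F

def per_position_cross_entropy(output, target):
    """First attempt: cross-entropy per captcha position, averaged."""
    output = output.view(-1, 19, 5)                    # (batch, characters, positions)
    target_idx = target.view(-1, 19, 5).argmax(dim=1)  # (batch, positions) class indices
    # cross_entropy treats dim 1 as the class dimension, so the 5 positions
    # are scored like 5 independent classification problems and averaged
    return F.cross_entropy(output, target_idx)

def flat_mse(output, target):
    """Second attempt, the one that trained well: plain MSE on the length-95 vectors."""
    return F.mse_loss(output, target)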

After some tweaks (most notably turning off dropout regularization), this model trained extremely well! After only 20 iterations I got to 99% accuracy on the validation set. Note that in the previous post I had to train and evaluate 5 networks to achieve the same.
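Accuracy here can be measured per captcha, counting a prediction as correct only if all 5 characters match. A small sketch of such a metric, again my own helper rather than code from the notebook:

def captcha_accuracy(output, target):
    """Fraction of captchas in the batch with all 5 characters predicted correctly."""
    pred_idx = output.view(-1, 19, 5).argmax(dim=1)  # (batch, 5) predicted characters
    true_idx = target.view(-1, 19, 5).argmax(dim=1)  # (batch, 5) true characters
    return (pred_idx == true_idx).all(dim=1).float().mean()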

This concludes the mini-series about solving captchas with deep learning: from the proof of concept in Part 1, through the single-character classifiers in Part 2, to a single model that solves the whole task here in Part 3.

It was a nice exercise to gain some familiarity with different ways of modeling a problem. If you have any questions, suggestions, or interesting ideas for which project to tackle next, please let me know!

