CAPTCHA exists as a mechanism to prevent bots from spamming APIs, scraping websites, and overall abusing internet privileges intended for use by humans. Since its conception, CAPTCHA has evolved from simple black-and-white, separated characters in a simple sans-serif font with little to no transformation (e.g. rotation), as displayed in Figure 1, to including multiple colors, noise, varying fonts, varying scales of characters, and complex messages/words, as seen in Figure 2. Text CAPTCHAs were initially difficult for bots to solve, but systems eventually evolved to include image CAPTCHAs, audio CAPTCHAs, and a multi-modal user interaction CAPTCHA called reCAPTCHA, which can all be observed in Figure 3.
While somewhat obvious, the greater goal of this project is to decipher text CAPTCHAs by predicting the characters hidden in the message. However, this by itself is a difficult task since there are 36⁴ = 1,679,616 possible CAPTCHAs, assuming each CAPTCHA contains 4 characters and each character can be one of 36 characters (A-Z0–9). As such, instead of evaluating a supervised model on a 36⁴-way classification task, the objective is simplified to character recognition. This requires additional preprocessing to segment the CAPTCHA into individual characters, but it turns the problem into a 36-way classification task instead of a 36⁴-way classification task.
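To make the size difference concrete, here is a quick sketch of the arithmetic. The constant names are only illustrative and do not come from the project code.

```python
import string

# Alphabet assumed in this post: uppercase letters plus digits.
ALPHABET = string.ascii_uppercase + string.digits
CAPTCHA_LENGTH = 4

# One class per possible 4-character string vs. one class per symbol.
whole_captcha_classes = len(ALPHABET) ** CAPTCHA_LENGTH
per_character_classes = len(ALPHABET)

print(whole_captcha_classes)  # 1679616
print(per_character_classes)  # 36
```

Classifying one character at a time shrinks the output layer from roughly 1.68M classes to 36, at the cost of needing a segmentation step first.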
At the time of writing this blog in 2020, text CAPTCHAs are relatively outdated. Despite this, they still exist, and there is value in using them as a toy problem for a computer vision side project to illustrate both traditional vision techniques (e.g. erosion, denoising, etc.) and deep learning techniques (i.e. CNNs). In this project, CAPTCHAs similar to the one displayed in Figure 2 will be utilized. These CAPTCHAs are generated with 4 characters at a 140x76 resolution through the use of a Python captcha library ( https://github.com/lepture/captcha/ ) and contain uppercase English letters (A-Z) and digits 0–9 for a total of 36 different characters.
This form of text CAPTCHA is difficult to decipher due to several different characteristics. First, the image contains two forms of noise: dot and line noise. Dot noise normally would be easy to manage, especially in the form of salt-and-pepper noise, but the dots in this image are circles of varying sizes, making it slightly more difficult. Meanwhile, the line noise adds an additional challenge as it intersects the characters we are trying to decipher, while at the same time being curvilinear rather than perfectly linear. Second, the image is in color with a non-white background. Binary thresholding takes care of this challenge relatively easily, but it remains an obstacle in the scenario where the background and characters are of similar color. Third, characters are overlapping. Fourth, characters are not at a fixed angle like in Figure 1. Fifth, characters are neither fixed width nor fixed height in size. Sixth, characters are not at the same scale as each other.
There are two steps to preprocessing the CAPTCHAs. First, noise is removed from the image to reduce bias learned by a trained network, in addition to helping segmentation in the next step. Second, characters are segmented in order to convert the problem into a character recognition task. As such, with N CAPTCHAs containing 4 characters each, there are 4N characters generated for our model’s character recognition dataset.
First, the RGB image is read in as a greyscale image to simplify any denoising procedures; all pixel values are in the range 0–255. Next, the image is binarized with a threshold of 230. As such, all pixel intensity values ≤ 230 are converted to intensity value 0 (black) and values > 230 are converted to 255 (white). This threshold value is purely empirical and works in pretty much every observable case after manually inspecting the images.
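The thresholding rule above is small enough to sketch with NumPy alone. This is an illustration of the rule as described, not the project's actual code; the `binarize` helper and the tiny synthetic patch are made up for the example.

```python
import numpy as np

THRESHOLD = 230  # empirical value quoted in the text

def binarize(gray: np.ndarray, threshold: int = THRESHOLD) -> np.ndarray:
    """Map pixels <= threshold to 0 (black) and pixels > threshold to 255 (white)."""
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

# Tiny synthetic greyscale patch standing in for a real CAPTCHA image.
gray = np.array([[255, 231, 230],
                 [120,   0, 245]], dtype=np.uint8)

binary = binarize(gray)
print(binary.tolist())  # [[255, 255, 0], [0, 0, 255]]
```

If OpenCV is available, `cv2.threshold(gray, 230, 255, cv2.THRESH_BINARY)` applies the same rule (values strictly above the threshold become 255, everything else becomes 0).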
Now, once the image is simplified to a basic form, the first real step of preprocessing is removing noise. As mentioned earlier, the two forms of noise present in the image are circular noise and horizontal, curvilinear line noise.
The image is first inverted so the letters are white and the background is black. Then, erosion is performed with a kernel of (2, 2) for 1 iteration to weaken both the circle and line noise. Generally, with larger images, and thicker circle/line noise, this kernel should be larger, but it works for the dataset in this task.
Then, median filters come in to weaken any additional lines. The first median filter is (5, 1), representing a vertical filter of length 5, which removes pretty much all line noise present in the image. Coincidentally, this also helps weaken some of the circle noise. Then, a horizontal filter of dimensions (1, 3) removes most of the circle noise. At this point, not only have we eroded the noise, but we’ve also eroded some of the actual character data necessary to keep our characters sound and intact. So, a dilation is performed for 1 iteration with filter size (2, 2) to make the letters larger. The key thing to note here is that a dilation brings the letters back to life more than it does any noise remaining in the image. However, because of the dilation, that small remaining noise might become a little more apparent throughout the image, so a final median filter of size (3, 3) is passed over the image to remove any final weak noise remaining.
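To see why a tall, narrow median filter wipes out thin horizontal lines but leaves thick strokes alone, here is a small NumPy-only sketch. The `median_filter` helper is a stand-in written for this example (the project may well use a library routine instead), and the 9x9 image is synthetic.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter(binary: np.ndarray, height: int, width: int) -> np.ndarray:
    """Median over a (height, width) window; borders are zero-padded (background)."""
    pad_y, pad_x = height // 2, width // 2
    padded = np.pad(binary, ((pad_y, pad_y), (pad_x, pad_x)), mode="constant")
    windows = sliding_window_view(padded, (height, width))
    return np.median(windows, axis=(-2, -1)).astype(binary.dtype)

# Inverted image: foreground is white (255), background is black (0).
image = np.zeros((9, 9), dtype=np.uint8)
image[1:8, 3:6] = 255   # thick character stroke, 7 pixels tall
image[4, :] = 255       # 1-pixel-thick horizontal line noise

cleaned = median_filter(image, 5, 1)  # vertical filter of length 5

print(cleaned[4, 0])  # 0   -> the line is gone away from the stroke
print(cleaned[4, 4])  # 255 -> the stroke survives
```

Each vertical window of 5 pixels sees at most 1 white pixel from the thin line, so the median is black. Inside the stroke, at least 3 of the 5 pixels are white, so the median stays white.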
While working through the project, empirical evaluation is of utmost importance to ensure all preprocessing and training procedures are carried out correctly. As such, after some qualitative evaluation of the denoised images, some additional noise occasionally popped up. Initially, I was not going to handle this, as it required additional work for not much more of an additional benefit, but it ended up causing challenges down the road for segmentation. As such, some final procedures are utilized to remove final noise. First, the Hough circle transform detects circle centers and their radii, which act as masks to remove any additional circle noise. Circles with radii between 0 and 2 are detected, with a minimum distance of 1 between circles. Then, after circles are removed, the image is eroded for 1 iteration with a (3, 3) kernel to remove any edges from the circle noise. A (5, 1) vertical median filter clears out additional horizontal noise. The second to last procedure is a dilation for 2 iterations with a (3, 3) filter to restore the image, which ends up creating puffy characters and introduces a bit more noise into the image. The final step is to erode the image for 1 iteration with a kernel of size (3, 3). Qualitatively, this removes pretty much all noise, circular or linear.
By this point, the image should be almost completely noise free, containing only remnant bits and pieces of noise that intersected with characters.
Once noise is removed, the individual characters are segmented out of the image in order to train a model on the character classification task.
First, connected components of an image are detected using OpenCV’s connectedComponents(…) function. Then, the watershed algorithm attempts to further segment overlapping characters.
Ideally, with all characters non-intersecting and no noise remaining in the image, this should return 4 components, indicating 4 characters. However, sometimes there are errors, caused either by overlapping characters (making segmentation close to impossible) or by noise being recognized as a character. The first scenario is difficult, but possible to tackle, while the second problem is quite feasible to solve. In a 140x76 image containing 4 characters, the characters can only be so large and so small before the CAPTCHA is completely unreadable. As such, it was empirically determined that a character mask containing fewer than 100 pixels is “noise”, so it is removed from any future preprocessing of the image. Likewise, when a mask contains ≥ 2200 pixels, it is deemed a “joint character”, which means the mask contains two characters intersecting each other. To handle this case, a naive approach is utilized: the mask is split directly in the middle, with the left sub-mask being one character and the right sub-mask being another character. This iterative approach of removing “noise” masks and sub-dividing “joint character” masks is performed for 10 iterations. By this point, if exactly 4 masks are not generated, the CAPTCHA is thrown out of the dataset, being deemed a poor example because all 4 character masks cannot be identified. Through statistical analysis, there is a consistent 5% error in the generated masks, meaning roughly 5% of the N generated CAPTCHAs are thrown out of the eventual dataset utilized for training and evaluation. By this point in the process, preprocessing 1M CAPTCHAs takes roughly 3 hours on a Google Colab VM.
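The filtering and splitting rules above can be sketched roughly as follows. This is my own minimal reconstruction on boolean masks, not the repository's code: the function names, the `rect` helper, the choice to split at the middle column of the mask's bounding box, and the early exit when nothing changes are all assumptions made for the example.

```python
import numpy as np

NOISE_MAX_PIXELS = 100    # masks with fewer pixels are treated as noise
JOINT_MIN_PIXELS = 2200   # masks with at least this many pixels hold two characters

def split_down_the_middle(mask):
    """Split a mask at the middle column of its bounding box."""
    cols = np.flatnonzero(mask.any(axis=0))
    middle = (cols[0] + cols[-1] + 1) // 2
    left, right = mask.copy(), mask.copy()
    left[:, middle:] = False
    right[:, :middle] = False
    return left, right

def refine_masks(masks, max_iterations=10):
    """Drop noise masks, split joint masks; return 4 masks or None to discard."""
    for _ in range(max_iterations):
        refined, changed = [], False
        for mask in masks:
            area = int(mask.sum())
            if area < NOISE_MAX_PIXELS:
                changed = True                      # noise: drop it
            elif area >= JOINT_MIN_PIXELS:
                refined.extend(split_down_the_middle(mask))
                changed = True
            else:
                refined.append(mask)
        masks = refined
        if not changed:                             # assumption: stop early once stable
            break
    return masks if len(masks) == 4 else None

def rect(top, left, height, width, shape=(76, 140)):
    """Hypothetical helper: a rectangular blob standing in for a character mask."""
    mask = np.zeros(shape, dtype=bool)
    mask[top:top + height, left:left + width] = True
    return mask

# Two clean characters, one speck of noise, and one pair of joined characters.
masks = [rect(10, 5, 30, 20), rect(30, 30, 5, 5),
         rect(10, 60, 50, 50), rect(20, 115, 30, 20)]
characters = refine_masks(masks)
print([int(m.sum()) for m in characters])  # [600, 1250, 1250, 600]
```

A CAPTCHA for which `refine_masks` returns `None` would be one of the roughly 5% that get thrown out.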
The final step of preprocessing and segmentation once characters are retrieved is to squarify them. This is a requirement since convolutional neural networks (CNNs) with fully connected layers require a fixed size input. As such, having characters with dimensions 45x67 and 76x65 would cause the model to throw size mismatch errors. To remedy this, background pixels are added to the already-segmented character image to force the characters to be 76x76, while also being centered in the image. This entire process is observed in Figure 5, showing the 4 major steps of preprocessing (original image, binarization, noise removal, and segmentation).
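A minimal sketch of the squarify step, assuming the character crop is a 2D greyscale array no larger than 76x76. The `squarify` name is hypothetical, and the white default background is an assumption based on the white padding mentioned later in this post.

```python
import numpy as np

TARGET_SIZE = 76

def squarify(char_img: np.ndarray, size: int = TARGET_SIZE, background: int = 255) -> np.ndarray:
    """Center a character crop on a size x size canvas filled with background pixels."""
    height, width = char_img.shape
    if height > size or width > size:
        raise ValueError("character crop is larger than the target canvas")
    canvas = np.full((size, size), background, dtype=char_img.dtype)
    top = (size - height) // 2
    left = (size - width) // 2
    canvas[top:top + height, left:left + width] = char_img
    return canvas

# A dark 67-row by 45-column crop standing in for a segmented character.
crop = np.zeros((67, 45), dtype=np.uint8)
squared = squarify(crop)
print(squared.shape)  # (76, 76)
```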
Now, after denoising and segmenting the CAPTCHAs, we have gone from N CAPTCHA images to 4N characters in the space A-Z0–9 (36 classes). It is now time for training a model to recognize these characters.
The model began as a pretrained AlexNet, but transitioned into a custom convolutional neural network (CNN). The network can be described as follows:
2D Conv: in = 1, out = 20, kernel = (5, 5), stride = 1, padding = 4
2D Max Pool: kernel = (2, 2), stride = 2, padding = 0, dilation = 1
2D Conv: in = 20, out = 50, kernel = (5, 5), stride = 1, padding = 4
2D Max Pool: kernel = (2, 2), stride = 2, padding = 0, dilation = 1
FC: in = 24200, out = 500, bias = true
FC: in = 500, out = 36, bias = true
Compared to recent vision literature, this deep learning architecture is relatively shallow, but it gets the job done. A PyTorch implementation of this model can be found in the code within the model.py file. There is nothing special about this model, but it is useful to keep it relatively simple as more advanced/recent deep learning architectures were tested (VGG, ResNet), and they actually performed worse than this model, indicating possible overfitting to the dataset. Additionally, it is easier from a practitioner's perspective to fix/modify the architecture as needed throughout the process.
NOTE: For convolutional layers, in = # of input channels, out = # of output channels. For fully connected layers (FC), in = # of input neurons, out = # of output neurons. Finally, for the last FC layer, out = 36 because there are 36 classes.
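If the 24200 in the first FC layer looks arbitrary, it falls out of the layer settings above applied to the 76x76 greyscale inputs used in this project. Here is a small sanity check using the standard output-size formula for convolution and pooling layers; the `conv_out` helper is just for illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

size = 76                                              # input is 76x76x1
size = conv_out(size, kernel=5, stride=1, padding=4)   # conv 1   -> 80
size = conv_out(size, kernel=2, stride=2)              # pool 1   -> 40
size = conv_out(size, kernel=5, stride=1, padding=4)   # conv 2   -> 44
size = conv_out(size, kernel=2, stride=2)              # pool 2   -> 22

flattened = 50 * size * size                           # 50 output channels
print(flattened)  # 24200
```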
Training for this model is a relatively straightforward process, much akin to a project dealing with MNIST classification. An Adam optimizer is used with cross-entropy loss (‘sum’ reduction, rather than ‘mean’ reduction). For the optimizer, a learning rate of 0.0001, weight decay of 0.98, and batch size of 32 were empirically found to work best. While I was training with 100 epochs, the model typically converges in less than 20 epochs.
The model used in the training phase was described above. It is important to note all images passed into the network are size 76x76x1 (third channel indicating greyscale).
Learning rate decay was experimented with, but ultimately omitted from the final training due to its lack of impact. However, batch size did play a role. It was found a batch size < 32 hurt performance significantly, while a batch size > 512 hurt accuracy only slightly.
Therefore, ultimately, a learning rate of 0.00005 was used to train the model with a batch size of 512. Batch size was not observed to make a significant impact in the range of 32 ≤ batch_size ≤ 512, so the higher batch size was chosen to speed up the training process.
Some of the most important design choices for training the model actually had to deal with the dataset. Both dataset size and partition size made decent impacts on the performance of the model. First trials were performed on a dataset of 1000 CAPTCHAs (~4000 characters). This was eventually increased to 10K, 100K, and 1M CAPTCHAs, generating roughly 40K, 400k, and 4M characters respectively. In practice, the number of characters in the dataset ends up being lower than the estimated number due to the aforementioned 5% CAPTCHA segmentation error rate from the Preprocessing section. The accuracies across these dataset sizes are illustrated in the Results section. Meanwhile, an initial train/validation/test split was carried out at a 60%/20%/20% rate respectively. However, one of the neat features of deep learning is the validation and test sizes do not need to maintain a certain ratio to the training set size to be considered “respectable” or “unbiased”. As such, once the dataset was increased to 1M CAPTCHAs, the splits transitioned into 80%/10%/10%, which still kept large, balanced partitions for the validation and test splits, while allowing the model to view more training images, thus increasing robustness in the future.
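For reference, an 80%/10%/10% split can be sketched in a few lines. The `split_dataset` helper and the fixed seed are assumptions for the example; the project may partition its data differently.

```python
import random

def split_dataset(items, train=0.8, val=0.1, seed=0):
    """Shuffle once, then cut into train/validation/test partitions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

# 1000 dummy sample ids standing in for segmented characters.
train_set, val_set, test_set = split_dataset(range(1000))
print(len(train_set), len(val_set), len(test_set))  # 800 100 100
```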
Finally, after all design choices were carried out, the model was trained. With the above hyperparameters, the training took around 10 hours on an NVIDIA K80 GPU (acquired on Google Colab).
This project, while primarily concerned with decoding CAPTCHAs, makes use of significant contribution from the character recognition sub-task. Both the main task and sub-task are evaluated in this section, but only the character recognition task is evaluated qualitatively.
The most important evaluation metric for this task is accuracy, which is quantitative. When designing a robust security system through CAPTCHA, metrics such as F1-score, precision, and recall are almost useless. Architects are only concerned with ensuring a low success rate for non-human users. In this experiment, a success for the assumptions of our dataset is when the system can identify all 4 characters in the CAPTCHA; any fewer indicates a failure. As such, computing accuracy is relatively straightforward. Intuitively, a high CAPTCHA decoding rate implies a high character recognition accuracy. So, accuracy is used for both tasks/datasets and results are displayed below.
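The two accuracies differ only in what counts as a "sample". A rough sketch follows, with made-up predictions and function names chosen for the example.

```python
def character_accuracy(predictions, targets):
    """Fraction of individual characters predicted correctly."""
    total = sum(len(t) for t in targets)
    correct = sum(pc == tc
                  for p, t in zip(predictions, targets)
                  for pc, tc in zip(p, t))
    return correct / total

def captcha_accuracy(predictions, targets):
    """Fraction of CAPTCHAs where all characters match exactly."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Hypothetical outputs: two CAPTCHAs are perfect, two have a single wrong character.
targets     = ["AB12", "O0XZ", "QW9E", "7K3M"]
predictions = ["AB12", "00XZ", "QW9E", "7K3N"]

print(character_accuracy(predictions, targets))  # 0.875
print(captcha_accuracy(predictions, targets))    # 0.5
```

A single wrong character sinks the whole CAPTCHA, which is why the CAPTCHA-level number is always at or below the character-level one.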
Before discussing the quantitative results, it is important to have a discussion about expected results. In the MNIST classification task, it is common to encounter models achieving >99% classification accuracy, but the difference between that task and this task is MNIST is 10-way classification (not 36-way), and the digits do not require preprocessing as they are carefully curated for the researcher beforehand. As such, before training the model, the character accuracy was expected to land anywhere from 80–100%. Additionally, given a model reaches X% accuracy on the character classification task, we can estimate the accuracy on the CAPTCHA task, since a CAPTCHA is solved only when all 4 of its characters are classified correctly. For example, if the model reaches 80% character classification accuracy, with each character being equally likely to be classified correctly and errors being independent of one another, we expect the CAPTCHA accuracy to be (0.8)⁴ = 0.4096 = 40.96%. All of the expected CAPTCHA accuracies are observed on the right column of Table 1.
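The estimate is a one-liner, sketched here for a few character accuracies. It assumes the 4 characters are classified independently with the same accuracy, which is a simplification.

```python
def expected_captcha_accuracy(char_accuracy: float, length: int = 4) -> float:
    """Probability of getting all `length` characters right, assuming independence."""
    return char_accuracy ** length

for p in (0.80, 0.90, 0.95, 0.99):
    print(f"{p:.2f} -> {expected_captcha_accuracy(p):.4f}")
# 0.80 -> 0.4096
# 0.90 -> 0.6561
# 0.95 -> 0.8145
# 0.99 -> 0.9606
```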
As such, a positive case is a character classification of 90%, which indicates a better-than-random-flip case of getting the entire CAPTCHA correct. However, the ideal case is a character accuracy of 95% or 99%, which leads to a confident CAPTCHA decoding model. But, enough with the theoretical discussion around expectation of accuracies. Let us get to the actual empirical results.
First, we observe the character recognition accuracies after training our model:
Then, we observe CAPTCHA success rates by utilizing the character classification model on each segmented character:
As observed in Table 2, dataset size plays a significant role in character classification accuracy, achieving almost 91% accuracy, which is a positive sign. However, when using that 91% accuracy model on full-blown CAPTCHAs (passed through the same preprocessing stage to segment the characters as the training set was), an accuracy of 44.41% is returned, which is about 20 percentage points below the expected accuracy from Table 1. This is an interesting result, not only due to the low accuracy in itself, but due to the accuracy being significantly lower than expected.
At first glance, initial assumptions are the dataset is imbalanced. However, after some analysis, this is deemed not to be the case. Plus, with a dataset of 1M CAPTCHAs and around 4M characters, it is assumed the characters are uniformly generated, so that is a second verification a class imbalance is not the problem. So, after that idea was shot down, analysis was performed on the accuracies of individual letters. As shown below, almost all the letters are at 90% accuracy or higher, with two exceptions, O and 0, both sharing low accuracy hovering around the 50–70% range. However, the reason for this scenario is simple as O and 0 are extremely similar in shape. The accuracies can be observed in Table 4.
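A per-class breakdown like this takes only a few lines to compute. The sketch below uses a handful of made-up labels, not the project's data, and the `per_class_accuracy` name is hypothetical.

```python
from collections import Counter

def per_class_accuracy(predicted, actual):
    """Accuracy for each true label: correct predictions / occurrences of that label."""
    totals, hits = Counter(), Counter()
    for p, a in zip(predicted, actual):
        totals[a] += 1
        hits[a] += int(p == a)
    return {label: hits[label] / totals[label] for label in totals}

# Toy labels: four letter O's, four digit 0's, four A's.
actual    = ["O", "O", "O", "O", "0", "0", "0", "0", "A", "A", "A", "A"]
predicted = ["O", "0", "O", "0", "0", "0", "O", "0", "A", "A", "A", "A"]

accuracies = per_class_accuracy(predicted, actual)
print(accuracies)  # {'O': 0.5, '0': 0.75, 'A': 1.0}
```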
After that analysis, I looked into the characters the model classified incorrectly. Table 5 contains the proportion of incorrectly-classified characters that belong to a specific class. This table is slightly informative, showing some characters are more likely than others to be misclassified. While an ‘O’ being the character misclassified at the highest rate is not surprising, the ‘0’ being 7th in error rate is a bit surprising, considering its low classification accuracy.
Despite these analyses, no definitive conclusions can be made why the CAPTCHA accuracy is so much lower than expected. However, despite this, the accuracy is still manageable, considering a 44% success rate is not the worst case in the world, and most text CAPTCHA mechanisms allow the user to generate a new CAPTCHA.
While quantitative results are objective and simple, qualitative results are always appreciated as well. It is not worth wasting your time showing the model predicting a label for a given character. Instead, t-SNE plots were created for the characters, seen in Figure 6 and Figure 7. In Figure 6, the 76x76 character images are flattened to produce 5776 image “features”/pixels, without being passed into a classifier. Figure 7 instead passes the images through the fully trained character classification model and extracts the 500 features produced by the first fully connected layer, which feed the final 36-way classification layer. Although performing t-SNE (an iterative, non-linear dimensionality reduction technique) is enough to produce some nice-looking plots, t-SNE is computationally expensive, with a runtime of O(NlogN) or O(N²) depending on the method. As such, image features are first reduced with PCA (a closed-form, linear dimensionality reduction technique) to 100 components, then those components are passed into the t-SNE algorithm to produce the visualizations below. This significantly reduces the amount of time required to run the t-SNE algorithm.
NOTE: These plots say “Digits”, but they instead should be “Characters”. The plots are indeed for all 36 classes, not a subset.
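For the curious, the PCA half of that pipeline can be sketched with NumPy's SVD. The random matrix below merely stands in for flattened 76x76 characters, and `pca_reduce` is a name made up for this example; the actual plots may have been produced with a library implementation.

```python
import numpy as np

def pca_reduce(features: np.ndarray, n_components: int) -> np.ndarray:
    """Project features onto their top principal components via SVD."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(0)
features = rng.random((200, 5776))   # 200 fake samples of 76 * 76 pixels each

reduced = pca_reduce(features, 100)
print(reduced.shape)  # (200, 100)
```

The `reduced` array is what would then be handed to a t-SNE implementation, for example `sklearn.manifold.TSNE(n_components=2).fit_transform(reduced)` if scikit-learn is installed.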
One key note worth discussing is the separability of the classes. The before and after plots are astounding. With the trained model, the intra-class variability decreases significantly, while the inter-class separability increases moderately. It is worth comparing this plot from a model with 91% classification accuracy to MNIST t-SNE plots (from a model with 99.2% accuracy on only 10 classes), which can be found here: https://github.com/kingsman142/mnist-classification . Clearly, the separability is not ideal, but for a 36 class visualization, I am pleased with the results. In a 3D plot, I assume there is greater inter-class separability, but that is out of the scope of this project.
NOTE: I was planning on creating legends for both plots displaying the color of each class, but it turned out to be too much work in matplotlib to be worth it, and it would make the plots look ugly, so I avoided the hassle.
There is an endless list of assumptions made in this project to make it slightly easier to tackle, and an attempt will be made to list as many as possible:
- Line noise is always the same width in all CAPTCHAs
- Line noise is always thinner than the characters’ stroke width
- Images are always 140x76
- Circle noise is always the same radius
- Only 4 characters are used in the CAPTCHA
- CAPTCHA dataset images all contain roughly the same style
- The style of the CAPTCHA dataset is not the most complex style found across all CAPTCHA-generating Python libraries or research papers
- Line noise is always horizontal in general, and never vertical
- Characters are not hollowed out
- CAPTCHA is not 3D
- No lowercase letters are used (would convert problem to 62 class problem)
- At most 2 characters are conjoined at once
- There is enough of a contrast between background and lettering that thresholding is effective in preprocessing
On the flip side of assumptions are challenges: properties of this particular dataset that made the task difficult. They can be found below:
- Characters are not fixed width in size
- Characters are not fixed height in size
- Characters are not in a fixed (x, y) location
- Characters are not at a fixed angle
- Images come in RGB and vary significantly in color
- Images have circular noise
- Images have curvilinear noise
- Circular noise is not salt-and-pepper-esque, which would make removing with a median filter extremely simple
- Line noise is not perfectly linear, so multiple combined gradient techniques and filters must be utilized to eradicate it as much as possible
- Characters are not at the same scale (some larger/smaller than others)
- CAPTCHAs only contain 4 characters
In-depth ablation studies might be appended to this report in the future, but they are not as important as the content of the report. As such, several questions worth asking are listed below as inspiration:
- Does thresholding help in the preprocessing step? Does the thresholding value impact preprocessed images?
- Does denoising help? Although the intuitive answer is yes, what if we instead just separated the characters naively? Would character classification accuracies dip significantly, assuming the dataset was large enough?
- Does splitting conjoined characters down the middle hurt performance compared to alternate, more complex techniques?
- Does white padding of the character (when squarifying the segmented characters) matter?
- What would happen if we balanced the digits dataset completely? It is pretty balanced as it is, but would be worth investigating to see if balancing hurts accuracy by 1–2%.
Throughout this report, the greater task of solving noisy text CAPTCHAs was tackled. While the greater goal was CAPTCHAs, a necessary sub-task of character classification was carried out, achieving ~91% accuracy. However, despite this impressive accuracy, the accuracy on the CAPTCHA task is 44.41%, which is about 20 percentage points lower than expected, given the character classification accuracy. Despite this, through qualitative evaluation, the model shows an improvement in intra-class variability and inter-class separability compared to the raw data itself, which is an important observation.
In the future, it is worth exploring more advanced preprocessing steps to segment characters more effectively, as well as performing data augmentation techniques and determining the root cause of the low CAPTCHA decoding accuracy.
All-in-all, I would say this project is a success. It contained a significant amount of code, requiring both traditional low-level computer vision techniques in addition to modern deep learning models. The greater goal had a sub-task which must be completed first, and satisfactory results were achieved on both tasks, leading to a practical application for the real world to solve noisy text CAPTCHAs.
The following technologies were utilized to build out the entire system in terms of preprocessing, training, and evaluation (versions are not provided):
Full code can be found here: https://github.com/kingsman142/captcha-solver .
A pretrained model can be found here: https://drive.google.com/open?id=16Vwha7uxy7coe9y-Nkh6skYPW3Kz8xZA . To make use of it, create a models/ directory in the project root directory and place the model in there, then follow the instructions in the repository above to evaluate the model.