
Cybersecurity, Machine Learning, Technology

Breaking CAPTCHA Using Machine Learning in 0.05 Seconds

A machine learning model breaks 33 text-based CAPTCHA schemes, including those used on highly visited websites. The approach is based on GANs.

Roberto Iriondo
Dec 19, 2018 · 6 min read

December 19, 2018, by Roberto Iriondo — Updated May 5, 2020

Everyone despises CAPTCHAs (every human, that is, since bots do not have emotions): those annoying images containing hard-to-read text that you have to type in before you can access or do “something” online.

CAPTCHAs (Completely Automated Public Turing tests to tell Computers and Humans Apart) were developed to prevent automated programs from misbehaving on the World Wide Web (filling out online forms, accessing restricted files, hitting a website an enormous number of times, and so on) by verifying that the end user is “human” and not a bot.

Several attacks on CAPTCHAs have been proposed in the past, but none has been as accurate and fast as the machine learning algorithm presented by a group of researchers from Lancaster University, Northwest University, and Peking University, shown below.

Figure 1: Overview of the approach. The researchers first use a small set of non-synthesized CAPTCHAs to train a CAPTCHA synthesizer. (1) The CAPTCHA synthesizer is then used to generate synthetic CAPTCHAs, (2) the synthetic CAPTCHAs are used to train a machine learning base solver, and (3) the base solver is refined into a fine-tuned solver for non-synthesized CAPTCHAs. | [1]

One well-known earlier attack on CAPTCHAs comes from Adrian Rosebrock. In his book “Deep Learning for Computer Vision with Python” [4], Adrian describes how he bypassed the CAPTCHA system on the E-ZPass New York website using deep learning, training his model on a large image dataset of CAPTCHA examples that he downloaded.

The main difference between Adrian’s solution and that of the researchers from Lancaster, Northwest, and Peking is that the researchers did not need to download a large dataset of images to break the CAPTCHA systems. Instead, they used a generative adversarial network (GAN) to create synthesized CAPTCHAs and combined them with a small dataset of real CAPTCHAs to build an extremely fast and accurate CAPTCHA solver.

Generative adversarial networks, introduced by Ian Goodfellow and colleagues [2], are deep neural network architectures made up of two neural networks that compete against each other in a zero-sum game [3] to synthesize superficially authentic samples. They are especially useful when the model does not have access to a large dataset.
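To make the zero-sum game concrete, here is a minimal sketch of the value function from [2], V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], which the discriminator tries to maximize and the generator tries to minimize. The function name and the discriminator outputs below are made up for illustration and are not taken from the CAPTCHA paper.

```python
import math


def gan_value(d_real, d_fake):
    """Sample estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator probabilities assigned to real samples
    d_fake: discriminator probabilities assigned to generated samples
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs, chosen only for illustration.
# A discriminator that separates real from fake well scores a high value.
confident = gan_value([0.9, 0.8, 0.95], [0.1, 0.2, 0.05])

# A discriminator that cannot tell them apart outputs 0.5 everywhere,
# which gives the equilibrium value -log(4) described in [2].
fooled = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(confident > fooled)  # the generator wants to push the value down
```

The generator "wins" as the value falls toward the fooled case, which is the point at which its samples look authentic to the discriminator.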

Figure 2: Targeted CAPTCHA security features. The examples were collected from the Baidu, Sina, Microsoft, and JD CAPTCHA schemes. | [1]

The researchers evaluated their approach on 33 text-based CAPTCHA schemes, 11 of which are currently used by 32 of the world’s most popular websites as ranked by Alexa, including schemes used by Google, Microsoft, eBay, Wikipedia, Baidu, and many others. The machine learning model used to attack these CAPTCHA systems needed only 500 non-synthesized CAPTCHAs, instead of the millions of examples called for by earlier attacks (such as Adrian’s).

Figure 3: List of text-based CAPTCHA schemes used to train and test the machine learning CAPTCHA solver. | [1]

Once the model was initialized with the CAPTCHA security features shown in Figure 2, it was used to generate a batch of synthetic CAPTCHAs, and the synthesizer was trained with the 500 real CAPTCHAs obtained from the various CAPTCHA schemes shown in Figure 3. The researchers used 20,000 CAPTCHAs to train the pre-processing model, along with 200,000 synthetic CAPTCHAs to train the base solver.
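A quick back-of-the-envelope calculation shows how small the real-data requirement is next to the synthetic data. The sample counts come from the paragraph above; the variable names are illustrative only.

```python
# Sample counts as reported above; the names are illustrative labels.
REAL_CAPTCHAS = 500                # non-synthesized CAPTCHAs
PREPROCESSING_CAPTCHAS = 20_000    # used to train the pre-processing model
BASE_SOLVER_SYNTHETIC = 200_000    # synthetic CAPTCHAs for the base solver

# Synthetic base-solver samples generated per real CAPTCHA collected.
synthetic_per_real = BASE_SOLVER_SYNTHETIC // REAL_CAPTCHAS

# Share of real CAPTCHAs among all the samples listed above.
total = REAL_CAPTCHAS + PREPROCESSING_CAPTCHAS + BASE_SOLVER_SYNTHETIC
real_share = REAL_CAPTCHAS / total

print(synthetic_per_real)      # 400
print(f"{real_share:.2%}")     # 0.23%
```

In other words, each real CAPTCHA is backed by 400 synthetic ones for the base solver alone, which is what removes the need to collect and label a large real dataset.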

The machine learning prototype was implemented in Python. The pre-processing model was built using the Pix2Pix framework, which is implemented in TensorFlow. The fine-tuned solver was coded using Keras. [1]

Figure 4: Real Google CAPTCHAs and the synthetic versions generated by the researchers’ CAPTCHA synthesizer | [1]

After the generative adversarial networks were trained using the synthesized and real CAPTCHA samples, the CAPTCHA solver was used to solve CAPTCHAs from highly visited websites such as Megaupload, Blizzard, Authorize.net, Captcha.net, Baidu, QQ, reCaptcha, Wikipedia, and others. Most of the sites’ CAPTCHAs were solved with a success rate above 80%, exceeding 95% on sites like Blizzard, Megaupload, and Authorize.net. The attack proved more accurate than all of the prior CAPTCHA-solving methods it was compared against, which relied on sizeable non-synthesized training datasets.
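For readers who want to see what a per-site success rate means in practice, the sketch below counts a CAPTCHA as solved only when the whole predicted string matches. That all-or-nothing criterion is an assumption made here for illustration, and the function name and sample strings are made up rather than taken from [1].

```python
def success_rate(predictions, ground_truth):
    """Fraction of CAPTCHAs whose full text was predicted exactly."""
    if not ground_truth:
        raise ValueError("need at least one CAPTCHA to evaluate")
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground truth differ in length")
    solved = sum(p == t for p, t in zip(predictions, ground_truth))
    return solved / len(ground_truth)


# Hypothetical solver output for five CAPTCHAs from one scheme.
truth = ["x7Kp", "m2Qd", "9fTz", "bR4w", "Lq8n"]
preds = ["x7Kp", "m2Qd", "9fTz", "bR4w", "Lq8m"]  # last one has a wrong character

rate = success_rate(preds, truth)
print(rate)  # 0.8
```

Under this criterion a single wrong character fails the whole CAPTCHA, so an 80% success rate is a stricter bar than 80% per-character accuracy.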

Figure 5: Comparison of the researchers’ CAPTCHA solver with four prior CAPTCHA-solving attack methods. | [1]

Beyond the improved accuracy, the researchers note in their paper that their approach is also more efficient and less expensive to implement than other proposed methodologies [1]. It is the first GAN-based solution for text-based CAPTCHAs, and because it is effective and inexpensive to implement, it also opens the door for attackers to use it.

Nevertheless, the approach has some limitations. One is CAPTCHAs with a variable number of characters: the current approach assumes a fixed number of characters, and the prototype breaks if that number changes. Another is the use of variable characters in the CAPTCHA. While the prototype can be trained to support this change, it currently does not as is.
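The sketch below illustrates why a fixed-length design is brittle. It shows one common way to encode labels for a fixed-length text solver (one one-hot slot per character position), which is not necessarily the encoding used in [1]. The alphabet, the length of four, and the function name are all hypothetical, and the second failure case assumes that “variable characters” means characters outside the set the solver was trained on.

```python
import string

# Hypothetical settings: a fixed CAPTCHA length and a fixed character set.
ALPHABET = string.ascii_lowercase + string.digits
NUM_CHARS = 4


def encode_label(text):
    """One-hot encode text into NUM_CHARS slots of len(ALPHABET) values each."""
    if len(text) != NUM_CHARS:
        raise ValueError(f"expected {NUM_CHARS} characters, got {len(text)}")
    slots = []
    for ch in text:
        row = [0] * len(ALPHABET)
        row[ALPHABET.index(ch)] = 1  # str.index raises ValueError for unknown characters
        slots.append(row)
    return slots


encoded = encode_label("ab12")
print(len(encoded), len(encoded[0]))  # 4 36

# Both a longer CAPTCHA and an unseen character break the encoding.
for text in ("abc12", "AB12"):
    try:
        encode_label(text)
    except ValueError as err:
        print(f"{text!r} rejected: {err}")
```

Supporting either case means changing the output layout or the alphabet and then retraining, which fits the note above that the prototype could be trained for it but does not handle it as is.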

It is crucial for highly visited websites to adopt more robust ways to protect their systems, such as bot-detection measures, cybersecurity diagnostics, and analytics, along with multiple layers of security that consider device location, device type, browser, and other signals, as they are now an even easier target to attack.

DISCLAIMER: The views expressed in this article are those of the author(s) and do not represent the views of Carnegie Mellon University, nor other companies (directly or indirectly) associated with the author(s). These writings are not intended to be final products, yet rather a reflection of current thinking, along with being a catalyst for discussion and improvement.

References:

[1] Yet Another Text Captcha Solver: A Generative Adversarial Network Based Approach | Guixin Ye, Zhanyong Tang, Dingyi Fang, Zhanxing Zhu, Yansong Feng, Pengfei Xu, Xiaojiang Chen, Zheng Wang | Lancaster University, Northwest University, Peking University | https://www.lancaster.ac.uk/staff/wangz3/publications/ccs18.pdf

[2] Generative Adversarial Networks | Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio | Department of Computer Science and Operations Research, University of Montreal | https://arxiv.org/pdf/1406.2661.pdf

[3] Zero-Sum Games | Game Theory | Stanford University | https://cs.stanford.edu/people/eroberts/courses/soco/projects/1998-99/game-theory/zero.html

[4] Deep Learning for Computer Vision with Python | Adrian Rosebrock | https://www.pyimagesearch.com/deep-learning-computer-vision-python-book/

[5] Gao, H., Tang, M., Liu, Y., Zhang, P., and Liu, X. Research on the security of Microsoft’s two-layer captcha. IEEE Transactions on Information Forensics & Security 12, 7 (2017), 1671–1685

[6] Gao, H., Wei, W., Wang, X., Liu, X., and Yan, J. The robustness of hollow captchas. In ACM Sig | https://www.lancaster.ac.uk/staff/yanj2/ccs13.pdf

[7] Mohamed, M., Sachdeva, N., Georgescu, M., Gao, S., Saxena, N., Zhang, C., Kumaraguru, P., Oorschot, P. C. V., and Chen, W. B. A three-way investigation of a game-captcha: automated attacks, relay attacks, and usability. In ACM Symposium on Information, Computer and Communications Security (2014), pp. 195–206

[8] Yan, J., and Ahmad, A. S. E. A low-cost attack on a Microsoft captcha. In ACM Conference on Computer and Communications | http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.295.9469&rep=rep1&type=pdf
