Decimate reCAPTCHA

Rajat Sharma
Intel Student Ambassadors
5 min read · Jul 23, 2019

reCAPTCHA is a free service provided by Google that protects websites from spam and abuse. It uses an advanced risk analysis engine and adaptive challenges to keep automated software from engaging in abusive activities on a site, while letting valid users pass through with ease. reCAPTCHA is built for security: armed with state-of-the-art technology, it stays at the forefront of spam and abuse-fighting trends and so provides an unparalleled view into abusive traffic on a site.

Figure 1: reCAPTCHA

Easy for People. Hard for Bots

Purposefully designed and actively aware:

· reCAPTCHA knows when to be easy on people and hard on bots. Hundreds of millions of CAPTCHAs are solved by people every day. reCAPTCHA makes positive use of this human effort by channeling the time spent solving CAPTCHAs into annotating images and building machine learning datasets. This in turn helps improve maps and solve hard AI problems. Google recently unveiled the latest version of reCAPTCHA. The goal of the new system is twofold: to minimize the effort for legitimate users, while requiring tasks that are more challenging for computers than text recognition.

· reCAPTCHA is driven by an “advanced risk analysis system” that evaluates requests and selects the difficulty of the captcha that will be returned. Users are required to click a checkbox, or to solve a challenge by identifying images with similar content. This is intended to allow only legitimate users to access your website.

· The reCAPTCHA service offered by Google is the most widely used captcha service, and it has been adopted by many popular websites to prevent automated bots from conducting nefarious activities.

Challenge Type:

There are several types of reCAPTCHA challenge that Google uses to check whether a user is a bot or a human. Approaches for breaking the text reCAPTCHA and the audio reCAPTCHA challenge can be found on the internet, but the most popular challenge, the image reCAPTCHA, has remained a major hurdle for data aggregation companies.

· Image reCAPTCHA: This new version is built around identifying images with similar content. The challenge contains a sample image and 9 candidate images, and the user is asked to select those that are similar to the sample. The challenge usually includes a keyword describing the content of the images that the user is required to select. The number of correct images varies between 2 and 4.

Figure 2: reCAPTCHA Challenge

How to break “Google reCAPTCHA”?

· As an artificial intelligence enthusiast, I always think about computers mimicking human beings. So some natural questions popped up in my mind: “Why do they use reCAPTCHA on this site?”, “Can I break this captcha using what I know of machine learning?”

As such, reCAPTCHA-based MFA has been a major hurdle when scraping sites. We propose a completely automated script that can handle a variety of challenges thrown by Google’s reCAPTCHA and pass the CAPTCHA barrier without human intervention. This can be achieved using state-of-the-art machine learning and neural network architectures.

Proposing the solution:

· Because we want to automate reCAPTCHA solving, we use machine learning techniques, specifically neural-network-based object detection algorithms, to solve the reCAPTCHA challenge on behalf of the user, thus providing a more user-friendly product. The Java agent that accesses the site using Selenium cannot get past sites that present a reCAPTCHA challenge. We can therefore have a script that scrapes the reCAPTCHA challenge shown in the Selenium browser.

· We can train an object detection model on the kinds of images that reCAPTCHA usually presents. With that model we can detect the objects that appear in the reCAPTCHA images. Once the correct images are identified, we can obtain the dimensions and coordinates of the places we need to click, and then click the correct images in the reCAPTCHA challenge through the Selenium API.
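
The step from detected objects to grid positions is plain geometry and can be sketched as below. This is a minimal sketch, not the script shown in the figures: it assumes the challenge is one square image split into a 3×3 grid and that the detector reports pixel bounding boxes as (x_min, y_min, x_max, y_max). The names `cell_for_box` and `cell_center` are hypothetical, introduced only for illustration.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def cell_for_box(box: Box, image_size: int, grid: int = 3) -> Tuple[int, int]:
    """Return the (row, col) of the grid cell that contains the box centre."""
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    cell = image_size / grid
    # Clamp so a box touching the right or bottom edge stays inside the grid.
    col = min(int(cx // cell), grid - 1)
    row = min(int(cy // cell), grid - 1)
    return row, col


def cell_center(row: int, col: int, image_size: int, grid: int = 3) -> Tuple[float, float]:
    """Return the pixel (x, y) at the centre of a grid cell."""
    cell = image_size / grid
    return (col + 0.5) * cell, (row + 0.5) * cell
```

Assigning a box to the cell that holds its centre is only one possible design choice; an object that spans several cells would need an overlap-based rule instead.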

Figure 3: Architecture of the model

Designing the Solution (Software):

· For the object detection model, we can use the Darknet neural network framework and YOLO (You Only Look Once), a real-time object detection system that is extremely fast and accurate. We can use the COCO dataset for training. The Intel Optimized TensorFlow framework can be used for training the Darknet YOLO model.

· For training and computation, the Intel AI DevCloud, powered by Intel Xeon Scalable processors, can be used. For the right application and use case, Intel AI DevCloud can provide a great performance bump over the host CPU, due to having 50+ cores and its own memory, interconnect, and operating system. We can adjust the weights and the detection threshold to make our model more accurate.

· We can write our script in Python 3.5.2. This script contains the code for scraping the reCAPTCHA image, running the trained model, getting the dimensions and coordinates of the images detected by the model, and clicking the correct images in the reCAPTCHA challenge.

· Other software includes Selenium, Chrome, and Java.
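
The detection threshold mentioned in the list above can be illustrated as a simple filter over detector output. This is a minimal sketch under stated assumptions: each detection is taken to be a (label, confidence, box) tuple, and the `filter_detections` name and the 0.5 default are hypothetical choices for illustration, not part of the Darknet API or the original script.

```python
from typing import List, Tuple

# Assumed detection layout: (label, confidence, (x_min, y_min, x_max, y_max)).
Detection = Tuple[str, float, Tuple[float, float, float, float]]


def filter_detections(
    detections: List[Detection], keyword: str, threshold: float = 0.5
) -> List[Detection]:
    """Keep detections whose label matches the keyword and whose
    confidence is at least the threshold."""
    wanted = keyword.strip().lower()
    return [d for d in detections if d[0].lower() == wanted and d[1] >= threshold]
```

Raising the threshold trades missed objects for fewer false positives, so in practice the value would be tuned on held-out images rather than fixed in advance.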

Figure 4: Python script
Figure 5: reCAPTCHA Challenge
Figure 6: Running the neural network object detection model
Figure 7: Result of object detection model on reCAPTCHA challenge
Figure 8: Clicking on the right images using Selenium API
Figure 9: I am not a robot

We are able to successfully break Google reCAPTCHA using a bot, without any human intervention. It seems that there is a mismatch between what site creators think a bot can do and what a bot can really do using neural networks.
