How To Solve CAPTCHA While Scraping The Web?
If you scrape data from large websites, you have almost certainly had to solve a captcha to prove you are human. If you are a web scraper, you probably already understand why cybersecurity professionals created them: bots were sending automated requests for endless amounts of data to their websites. As a result, even genuine users now have to cope with captchas, which appear in various forms. Even so, captchas can be bypassed, and showing how is the purpose of this article. Let’s take a closer look at what captchas are first.
What is a CAPTCHA?
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. It is a challenge-response test that websites use to determine whether a real user or a bot is attempting to access a page. The original CAPTCHA tests, which first appeared in the late 90s, consisted of distorted images containing random letters and numbers.
Although CAPTCHAs are meant to deter automated bots, they are themselves automated. They are triggered on specific parts of a site and automatically pass or fail each visitor.
How does a CAPTCHA work?
In classic CAPTCHAs, which are still used on some web properties today, users must identify letters. The letters are distorted so that bots will be unable to recognize them. Users must interpret the distorted text, type the correct letters into a form field, and submit the form to pass the test. Users are prompted to try again if the letters do not match. Login forms, account signup forms, online polls, and e-commerce checkout pages use such tests.
The idea is that a computer program such as a bot will have difficulty interpreting the distorted letters, whereas an individual who is used to seeing and interpreting letters in various contexts — different fonts, different handwriting, etc. — will usually be able to identify them.
In many cases, bots will be unable to do more than input some random letters, which makes it statistically unlikely that they will be able to pass the test. Therefore, bots fail the test and are prevented from interacting with the website or application, while humans are able to use it normally.
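To put a rough number on "statistically unlikely", here is a minimal sketch. It assumes, purely for illustration, a 6-character challenge drawn uniformly from 36 case-insensitive letters and digits; real CAPTCHAs vary in length and alphabet.

```python
def random_guess_probability(alphabet_size: int, length: int) -> float:
    """Chance that one uniformly random guess matches the challenge text."""
    return 1.0 / (alphabet_size ** length)


# Assumed parameters (not from any specific CAPTCHA): 26 letters + 10 digits, 6 characters.
p = random_guess_probability(36, 6)
print(f"{p:.2e}")  # 4.59e-10
```

Under these assumptions a blind guess succeeds about once in 2.2 billion attempts, which is why a bot that cannot read the text is effectively locked out.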
Advanced bots can now use machine learning to identify these distorted letters, so such CAPTCHA tests are being replaced by more complex ones. Google reCAPTCHA, for example, has developed a number of other tests to distinguish human users from bots.
Types of CAPTCHAs
Spammers and cybercriminals can use artificial intelligence to solve easy captcha challenges with their computer programs. Therefore, CAPTCHA tests have evolved and become more complex over time. CAPTCHAs today come in different shapes and forms:
Text-based CAPTCHA
This is the most common type of CAPTCHA you will see on websites. Users must enter the displayed word (or words) in order to pass the test, which usually consists of disjointed, blurred, elongated, or otherwise distorted text. The displayed text is usually obscured by a blurry, spotted, or coloured background, making things slightly more challenging.
Text CAPTCHAs have been criticized on accessibility grounds. Because the text is randomly generated and distorted, it is sometimes difficult to read, especially for people with visual impairments.
Image CAPTCHA
Users are shown several images and instructed to pick the ones containing a specified object. This type of CAPTCHA has been very effective because image recognition is easy for humans (possibly even easier than text recognition), while bots have traditionally struggled to recognize image patterns.
Google uses its Street View images and clever artificial intelligence to produce CAPTCHA images on the fly. That is why you so often find yourself clicking on street signs, lamp posts, and fire hydrants.
Audio-based CAPTCHA
A CAPTCHA should be solvable by as many people as possible, including those with visual impairments. Most text and image CAPTCHAs therefore offer an audio alternative, reached by clicking a speaker button. In an audio-assisted text CAPTCHA, a generated voice spells out the letters or numbers, or says words that begin with the required letters.
If the user clicks the audio (headphones) button on a visual CAPTCHA, they are given an audio challenge instead. To complete it, they must correctly enter the numbers they hear.
Besides the CAPTCHAs described above, there are several other kinds:
- Math solution: In order to proceed, users need to solve a simple math problem (e.g., 6+5).
- Word problem: In a word problem, users may rearrange letters, input a colour, or state the last word of a sentence.
- Social media sign-in: Users can sign in with their Google or Facebook accounts.
- Time-based: Users who exhibit bot-like behaviour (completing forms within a fraction of a second) are automatically blocked.
- reCAPTCHA v3: This new version of reCAPTCHA works behind the scenes, detecting bots and triggering actions without requiring user interaction.
What is reCAPTCHA?
Google offers reCAPTCHA as a free alternative to traditional CAPTCHAs. The reCAPTCHA technology was developed by researchers at Carnegie Mellon University and acquired by Google in 2009.
The reCAPTCHA test is more advanced than a typical CAPTCHA. Some reCAPTCHAs require users to type out text shown in images that computers are unable to read. In contrast to regular CAPTCHAs, reCAPTCHA uses real-world images: street addresses, text from books, old newspaper articles, etc.
As time has passed, Google has expanded the functionality of reCAPTCHA tests so that they no longer have to rely on the old method of identifying blurred or distorted text. Some of the other types of reCAPTCHA tests include:
- Image recognition
- Checkbox
- User behavior assessment (no user interaction at all)
What you need to know about bypassing captchas when scraping the web
You now have a clear understanding of what captchas and reCAPTCHA are, how they work, and when they’re triggered. Now let’s look at how captchas affect web scraping.
Because most scraping operations are carried out by automated bots, captchas can hinder scraping the web. Do not be discouraged, however. There are ways to circumvent captchas when scraping the web. Let us first dive into what you need to know before you scrape.
Excessive requests to the target website
The first step is to ensure that your web scraper or crawler does not send too many requests in a short period of time. Some websites state the limits they place on automated access in their terms and conditions, so it is important to read these before you begin scraping.
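One way to keep the request rate down is to enforce a minimum gap between requests. The sketch below is illustrative only: the `Throttle` class, its parameter names, and the delay values are assumptions, not part of any library, and the right delay depends on what the target site allows.

```python
import random
import time


class Throttle:
    """Enforces a minimum, optionally randomized, gap between requests."""

    def __init__(self, min_delay, jitter=0.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay  # seconds to leave between requests
        self.jitter = jitter        # extra random delay, up to this many seconds
        self._clock = clock
        self._sleep = sleep
        self._last = None           # time of the previous request, if any

    def wait(self):
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            target_gap = self.min_delay + random.uniform(0.0, self.jitter)
            remaining = target_gap - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

A scraper would create one instance, for example `throttle = Throttle(min_delay=2.0, jitter=1.0)`, and call `throttle.wait()` immediately before each request. The jitter keeps the timing from looking perfectly regular.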
HTTP headers
Whenever you connect to a website, your HTTP request headers send information about your device to the website. The site may use this information to adapt content to the specifications of your device and to track metrics. If it determines that a large number of requests originate from the same device, subsequent requests may be blocked.
So, if you developed the scraper yourself, you can change the header information each time it makes a request. To the target website, it will then appear that the requests come from multiple different devices.
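Varying the headers per request can be sketched with Python's standard `urllib.request`. The header profiles below are illustrative placeholders, and `HEADER_PROFILES` and `build_request` are hypothetical names; a real pool should contain current values copied from real browsers.

```python
import random
import urllib.request

# Illustrative header profiles (assumed values, not a maintained list).
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept-Language": "en-US,en;q=0.9",
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
                      "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
        "Accept-Language": "en-GB,en;q=0.8",
    },
]


def build_request(url, profiles=HEADER_PROFILES):
    """Build a request object that carries one randomly chosen header profile."""
    headers = dict(random.choice(profiles))
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen`. Keeping each profile internally consistent (a user agent paired with plausible companion headers) is usually safer than mixing values at random.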
IP address
You also need to make sure that the target website does not blacklist your IP address. If your scraper or crawler sends requests from the same address too often, the site will likely blacklist it. To avoid this, you can use a proxy server, which masks your IP address.
By rotating the headers and proxies (more on this in the next section) with a pool, you make your requests appear to come from multiple devices in different locations. Consequently, you should be able to scrape without interruption from captchas. Having said that, you must ensure that you are not detrimental to the website’s performance in any way.
However, you should be aware that proxy services cannot help you bypass captchas on registration, password change, checkout, or other forms. They can only help you avoid captchas that websites trigger when they detect bot-like traffic. In a later section, we will look at captcha solvers, which can handle captchas in such forms.
In addition to the key factors above, it is important to know about the following kinds of captcha when scraping the web with a bot:
- Honeypots: A honeypot is a form field or link embedded in the page’s HTML but hidden from human visitors with CSS. A bot that interacts with it therefore identifies itself as a bot. Before your bot fills in a field or follows a link, check that the element’s CSS properties leave it visible.
- Social media sign-in: Some websites require you to log in with a social media account such as Facebook. These are not very common, however, because most administrators know that many people will not sign up with their social media accounts.
- Time tracking: To determine whether you are a human or a bot, these captchas monitor how fast you perform an action, such as filling out a form.
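The honeypot check can be sketched with the standard library's `html.parser`. This is a simplified illustration that only looks at inline `style` attributes and the `hidden` attribute on `<input>` elements. Fields hidden through an external stylesheet would need a real browser to compute styles, and `HoneypotFinder` and `find_honeypots` are hypothetical names.

```python
from html.parser import HTMLParser

# Inline-style fragments that hide an element (spaces are stripped before matching).
HIDDEN_STYLES = ("display:none", "visibility:hidden")


class HoneypotFinder(HTMLParser):
    """Collects names of <input> fields hidden via inline CSS or the `hidden` attribute."""

    def __init__(self):
        super().__init__()
        self.suspect_fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden":
            return  # ordinary hidden inputs (e.g. CSRF tokens) are expected in forms
        style = (attr.get("style") or "").replace(" ", "").lower()
        if "hidden" in attr or any(s in style for s in HIDDEN_STYLES):
            self.suspect_fields.append(attr.get("name"))


def find_honeypots(html):
    """Return the names of input fields a bot should leave untouched."""
    finder = HoneypotFinder()
    finder.feed(html)
    return finder.suspect_fields


form = """
<form>
  <input name="email">
  <input name="website" style="display: none">
  <input type="hidden" name="csrf" value="abc">
</form>
"""
print(find_honeypots(form))  # ['website']
```

A scraper would skip any field this returns when filling in the form.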
Bypassing captchas for web scraping
Use rotating proxies and quality IP addresses
You should rotate proxies as you send requests to the target website, as discussed in the previous section. This is one way to avoid triggering captchas while scraping. For this to work, you need good-quality IP addresses, such as residential proxies.
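A round-robin proxy rotation can be sketched with `itertools.cycle` and `urllib.request.ProxyHandler`. The proxy addresses below are placeholders from a documentation-only IP range, and `next_opener` is a hypothetical helper; substitute the endpoints your proxy provider gives you.

```python
import itertools
import urllib.request

# Placeholder proxy addresses (assumed for illustration; they will not work as-is).
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (proxy, opener); the opener sends HTTP and HTTPS traffic via that proxy."""
    proxy = next(_proxy_cycle)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)
```

Each call hands back the next proxy in the pool, so a scraper would call `proxy, opener = next_opener()` before every request and then fetch the page with `opener.open(url, timeout=10)`.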
Rotate user agents
In order to scrape the web, you will need to disguise the user agent as that of a popular browser or of a supported bot, that is, a bot that websites recognize, such as a search engine bot.
Changing the user agent once will not suffice; you will need to prepare a list of user-agent strings and rotate through them. With this rotation, the target website will perceive the requests as coming from multiple devices when in fact one device is sending them all.
The best practice for this step is to maintain a database of real user agents. Also delete cookies when they are no longer needed.
Captcha Solving Services
Using a captcha solving service is a more straightforward and low-tech approach to solving a captcha. These services use artificial intelligence (AI), machine learning, and a combination of other technologies to solve captchas on your behalf. Death by Captcha and Anti-Captcha are two of the most prominent captcha solvers in use today.
Avoid direct links
If your scraper requests a URL directly every split second, the receiving side will become suspicious, and the target website may respond with a captcha.
To avoid such a scenario, you may wish to set the Referer header so that each request appears to come from another page. If you do so, you are less likely to be identified as a bot. Alternatively, you could have the bot visit other pages before visiting the desired link.
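Both ideas, setting the Referer header and visiting other pages first, can be combined in a small sketch. The function names and the example paths are assumptions for illustration; the requests are only built here, not sent.

```python
import urllib.request
from urllib.parse import urljoin


def request_with_referer(url, referer):
    """Build a request whose Referer header names the page we 'came from'."""
    return urllib.request.Request(url, headers={"Referer": referer})


def browse_path(base_url, paths):
    """Yield requests that walk through `paths` in order.

    Each request cites the previously visited page as its referer,
    starting from `base_url` (assumed to have been visited already).
    """
    previous = base_url
    for path in paths:
        target = urljoin(base_url, path)
        yield request_with_referer(target, previous)
        previous = target


# Hypothetical site layout: a category page, then the item we actually want.
for req in browse_path("https://example.com/", ["/category", "/category/item-42"]):
    print(req.full_url, "<-", req.get_header("Referer"))
```

Each yielded request can be passed to `urllib.request.urlopen`, ideally with a pause between requests so the visit pattern resembles ordinary browsing.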
Conclusion
Businesses need to scrape the web in order to gain insights and make data-driven business decisions. The information above explains what CAPTCHAs are, how they work, and some methods for bypassing them. You should be aware that not all CAPTCHAs have the same triggers and difficulty levels; these depend both on the website’s security and on your actions. Please keep this in mind, and good luck with your web scraping.