The Biggest Web Scraping Roadblocks and How to Avoid Them
Hello there. As you may already know, web scraping is very useful for both business and research purposes. The popularity of this kind of data extraction keeps growing, and websites are trying to keep up by developing countermeasures that block spiders from crawling their pages.
While web scraping, you might have had trouble getting the results you wanted because the website detected that you were using a bot. Is there a way to avoid this?
That’s what this article is about. I will tell you more about what spiders are used for and show you how to scrape the web without worrying about anti-bot measures.
So, fasten your seatbelt if you want to tag along!
Why use a web scraper
A web scraper is very useful when you want to extract information en masse in a short period of time. Many businesses use these tools on a daily basis, and they also serve research purposes, such as gathering training data for machine learning.
Here are a few examples of why web scrapers are so popular:
- Lead Generation
- Machine Learning
- Price Optimization
If you are interested in knowing more, you can look at my other articles where I present web crawlers’ use cases in more detail.
How websites identify and block web scrapers
Some of the use cases presented above are the reason why most websites don’t wish to be scraped. We can both agree that sharing information with your competition isn’t a good idea, so they want real users browsing their websites instead of spiders and bots.
Websites can identify web crawlers by tracking a browser’s activity, verifying the IP address, planting honeypots, adding CAPTCHAs, or limiting the request rate to prevent spam.
Let me explain to you more about how these web scraping countermeasures work and how you can bypass them to continue extracting data carefree.
Browser fingerprinting
Browser fingerprinting is a technique websites use to collect information about a user and tie their activity and attributes to a unique online “fingerprint”. The website runs scripts in your browser’s background to find out your device specifications, operating system, and browser settings. It can also detect your user agent, whether you are using an ad blocker, what language you browse in, your timezone, and more.
All these attributes are then knit together into a unique digital fingerprint that follows you around the web. This makes bots easier to detect, because changing your proxy, using incognito mode, or clearing your cookies and browser history won’t change the fingerprint.
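Part of what a site inspects arrives before any script even runs: the request headers themselves. A scraper should at least send a coherent, browser-like header set rather than a bare default one. Here is a minimal sketch using only the Python standard library; the header values are illustrative and should be kept in sync with a real, current browser.

```python
import urllib.request

# Example browser-like headers. The exact strings are illustrative;
# an inconsistent combination (e.g. a Chrome UA with Firefox-style
# Accept headers) is itself a fingerprinting giveaway.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def build_request(url: str) -> urllib.request.Request:
    """Attach the same consistent, browser-like header set to every request."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = build_request("https://example.com")
```

Consistency matters more than any single value here: the whole header set should look like it came from one real browser.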
How can you prevent browser fingerprinting from disturbing your web scraping? A good way is to play pretend. A headless browser behaves like a real browser, just without a user interface wrapping it. Driving headless Chrome through Selenium’s Chrome driver is a popular approach.
Websites nowadays have ways of detecting headless browsers, though, and even if your setup works on your local machine, it might not hold up at scale: running many browser instances at once consumes a lot of resources, especially RAM.
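To make the approach concrete, here is a minimal sketch of launching a less conspicuous headless Chrome session, assuming Selenium and Chrome are installed. The `stealth_flags` helper and its particular flag choices are my own illustration, not an exhaustive stealth recipe.

```python
def stealth_flags(user_agent: str) -> list[str]:
    """Chrome flags that make a headless session look closer to a real one."""
    return [
        "--headless=new",              # newer headless mode, closer to full Chrome
        "--window-size=1920,1080",     # report a plausible viewport size
        f"--user-agent={user_agent}",  # avoid the default 'HeadlessChrome' UA
    ]

def make_headless_driver(user_agent: str):
    # Local import so the flag helper works even without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for flag in stealth_flags(user_agent):
        options.add_argument(flag)
    return webdriver.Chrome(options=options)
```

Usage would be `driver = make_headless_driver("Mozilla/5.0 (...) Chrome/120.0.0.0 ...")`, then the usual `driver.get(url)` and `driver.quit()`.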
TLS fingerprinting
Transport Layer Security (TLS) is a security protocol that evolved from an earlier protocol called Secure Sockets Layer (SSL). HTTPS is the secure version of HTTP; the difference between the two is that HTTPS uses TLS (or, historically, SSL) encryption on top of the HTTP protocol.
This type of fingerprint is similar to the browser fingerprint presented above, but it identifies users through TLS instead. When a client connects to a server, the two exchange a series of messages. This procedure is called a “TLS handshake”, and if it succeeds, it establishes how the client and the server will communicate with each other.
You can check what fingerprint your browser generates with TLSfingerprint, and even see how popular it is among users. TLS fingerprints are built around these “TLS handshakes” by using a set of parameters such as Handshake version, Extensions, TLS versions, and more.
How can you change this fingerprint to scrape the web more stealthily? Well, replacing the TLS parameters isn’t as easy as it sounds. If you randomize them, the resulting fingerprint will be so rare that it will be flagged as fake right away. Your best bet is to use a tool that helps you modify the parameters consistently. Here’s a module that might help you with a Node.js-backed scraper.
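To see one of those handshake parameters up close, Python’s standard `ssl` module lets you inspect and narrow the cipher suites your client advertises. This is only a small illustration of the idea, not full fingerprint control: extensions, their ordering, and other ClientHello details need specialized tooling like the module mentioned above.

```python
import ssl

# Every cipher suite advertised in the ClientHello is part of your TLS
# fingerprint. Narrowing the list changes the fingerprint - but remember
# that a rare combination can itself look suspicious.
ctx = ssl.create_default_context()
default_suites = [c["name"] for c in ctx.get_ciphers()]

# Advertise only ECDHE + AES-GCM suites for TLS 1.2
# (TLS 1.3 suites are controlled separately and stay enabled).
ctx.set_ciphers("ECDHE+AESGCM")
narrowed_suites = [c["name"] for c in ctx.get_ciphers()]
```

A context configured this way can then be passed to `urllib.request.urlopen(url, context=ctx)` or any other stdlib TLS client.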
IP address blocks
As silly as it sounds, even IP addresses can have a criminal record. There are a few ways for a website to decide that an IP address is suspicious:
- If you are using a free proxy pool, chances are those proxies are already banned by the website. Either someone else has used them before, or the developers added them to the site’s blacklist beforehand. If you could find those proxies, so could they.
- IP addresses provided by data centers can also look suspicious because they share the same subnet block range, making them easy to detect.
- Some websites treat IP addresses from certain geographical locations as suspicious as well, or serve their content only to specific countries or regions. While this isn’t necessarily about suspicion, it can prevent you from accessing all the content you want.
Using residential IP addresses, which come from real consumer networks, is a good solution to these problems. They are legitimate IP addresses assigned by an Internet Service Provider, so they are far less likely to be blocked. Geo-restricted content won’t be a problem either, as a good proxy pool includes IP addresses from locations all over the world.
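A pool like that can be as simple as a mapping from country code to proxy endpoints, from which you pick per request. The addresses below are documentation-range placeholders, not real proxies, and the pool structure is just one possible design.

```python
import random

# Hypothetical residential proxy pool, keyed by country code.
# 203.0.113.0/24 and 198.51.100.0/24 are reserved documentation ranges.
PROXY_POOL = {
    "us": ["http://203.0.113.10:8080", "http://203.0.113.11:8080"],
    "de": ["http://198.51.100.20:8080"],
    "jp": ["http://198.51.100.30:8080"],
}

def pick_proxy(country: str) -> str:
    """Pick a random proxy from the requested country's pool, so
    geo-restricted content is fetched from an acceptable location."""
    return random.choice(PROXY_POOL[country])
```

The chosen address can then be wired into `urllib.request.ProxyHandler({"http": proxy, "https": proxy})` or whatever HTTP client you use.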
IP rate limiting
Rate limiting is a strategy adopted by websites to limit the number of requests made by the same IP address in a certain amount of time. If an IP address exceeds that number, it will be blocked from making requests for a while.
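To make the mechanism concrete, here is a minimal sliding-window limiter of the kind a site might run per IP. It is a sketch of the general technique, not any particular website’s implementation; timestamps are passed in explicitly so the behavior is deterministic.

```python
from collections import defaultdict, deque

class RateLimiter:
    """Allow at most `limit` requests per IP within a sliding `window` (seconds)."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: float) -> bool:
        q = self.hits[ip]
        while q and now - q[0] >= self.window:
            q.popleft()  # forget requests that slid out of the window
        if len(q) >= self.limit:
            return False  # over the limit: this request gets blocked
        q.append(now)
        return True
```

Seen from the scraper’s side, exceeding `limit` simply means every further request fails until enough of the window has elapsed.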
This type of bot countermeasure can be quite irritating when scraping the same website en masse, since it slows down your data gathering, but that doesn’t mean you cannot evade the discomfort.
One solution is to deliberately add delays between requests. A better one is to send the requests from different locations by routing them through a proxy pool: switching IP addresses between requests makes it hard for rate limiters to act.
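The two evasions combine naturally: cycle through a proxy pool and add a randomized pause between requests so the timing doesn’t look machine-regular. The proxies below are placeholders, and the base/jitter values are arbitrary starting points you would tune per site.

```python
import itertools
import random
import time

# Placeholder proxies (documentation range); swap in your own pool.
PROXIES = [
    "http://203.0.113.1:8080",
    "http://203.0.113.2:8080",
    "http://203.0.113.3:8080",
]
proxy_cycle = itertools.cycle(PROXIES)

def next_proxy() -> str:
    """Rotate to the next proxy so consecutive requests come from different IPs."""
    return next(proxy_cycle)

def polite_delay(base: float = 2.0, jitter: float = 1.5) -> float:
    """Sleep a randomized interval so request timing doesn't look robotic."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

A scraping loop would then call `next_proxy()` and `polite_delay()` once per request.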
CAPTCHAs
I am sure you have encountered a CAPTCHA verification while surfing the Internet. This anti-bot measure is commonly used by websites to confirm that an actual human is behind the screen.
CAPTCHAs are usually shown to suspicious IP addresses, so a quick fix is to retry the request through a different proxy. When that fails, a CAPTCHA-solving service such as 2Captcha or anti-captcha is the better option.
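The retry-with-a-fresh-proxy idea can be sketched as a small loop: detect a CAPTCHA wall in the response and move on to the next proxy. Both the detection heuristic and the caller-supplied `fetch` function here are hypothetical; real detection should be tuned to the target site.

```python
from typing import Optional

def looks_like_captcha(html: str) -> bool:
    """Crude heuristic: look for common challenge-page markers."""
    markers = ("g-recaptcha", "h-captcha", "cf-challenge")
    return any(m in html.lower() for m in markers)

def fetch_with_retries(fetch, url: str, proxies: list) -> Optional[str]:
    """Try each proxy in turn until one gets past the CAPTCHA wall.

    `fetch(url, proxy)` is a caller-supplied function returning the page HTML.
    """
    for proxy in proxies:
        html = fetch(url, proxy)
        if not looks_like_captcha(html):
            return html
    return None  # every proxy hit a CAPTCHA; fall back to a solving service
```

Returning `None` is the signal to escalate to a solving service instead of burning through more proxies.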
There are various types of CAPTCHAs. Some involve solving a simple math problem, recognizing words, or identifying objects in pictures. Google, for example, launched the now-popular reCAPTCHA in 2014, which tracks the user’s mouse movements and behavior. Checking a box may be a simple task for humans, but it tends to trip up bots, which are very methodical and click the box right in the middle. If this verification fails, reCAPTCHA can require another test, similar to the ones listed above.
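Because a dead-center click is such a tell, a scraper driving a real browser might offset its clicks randomly within an element instead. This tiny helper is purely illustrative of that one idea; it is no substitute for genuinely human-like behavior.

```python
import random

def human_click_point(x: int, y: int, width: int, height: int) -> tuple:
    """Return a click point randomly offset inside the middle half of an
    element's bounding box, instead of its exact geometric center."""
    cx = x + width // 2 + random.randint(-width // 4, width // 4)
    cy = y + height // 2 + random.randint(-height // 4, height // 4)
    return cx, cy
```

With Selenium, for instance, the offsets could feed an `ActionChains` move-and-click rather than a plain `element.click()`.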
Note that even if you get past this anti-bot measure by solving the CAPTCHAs or retrying with a different proxy, your data extraction can still be detected.
Final thoughts
I hope this article gave you good guidance for your web scraping project, and that you now know how websites secure themselves against your data extraction process.
There are many factors to take into account, and you need a bypass for each of them to scrape a website without being noticed and blocked. That can be quite time-consuming if you ask me, but it isn’t impossible with the right approach.
Have you ever thought of using third-party software, like an API, that handles all these troublesome matters for you? Have a look at WebScrapingAPI. It’s a trustworthy solution for developers who don’t have the time to build their own tools from scratch.