Scraping from 0 to hero (Part 5/5)

Ferran Parareda
6 min read · Nov 25, 2021

Anti Scraping

This series has focused on how to properly scrape a website without any limitations. But sometimes you are on the other side and want to keep scrapers out (as much as you can).

As a quick review, these are the most common ways scrapers get detected and blocked:

Blacklisting

Limiting requests per minute by IP address is one of the best and most common methods out there, thanks to its simplicity: it is easy to define and control this list on your server. If an IP exceeds the allowed number of requests per second, it is automatically blocked for a short period of time. If the IP keeps insisting, that period is extended a little more each time (exponentially).
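
As an illustration, here is a minimal Python sketch of such a rate limiter with an exponentially growing ban. The thresholds are made up for the example, and a real deployment would keep this state in something like Redis instead of process memory.

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60           # size of the sliding window
MAX_REQUESTS_PER_WINDOW = 60  # illustrative limit, not a recommendation
BASE_BAN_SECONDS = 600        # first ban lasts 10 minutes

recent_requests = defaultdict(deque)  # ip -> timestamps of recent requests
banned_until = {}                     # ip -> unix time when its ban expires
ban_count = defaultdict(int)          # ip -> how many times it has been banned

def is_allowed(ip: str) -> bool:
    now = time.time()

    # Still banned? Reject immediately.
    if banned_until.get(ip, 0) > now:
        return False

    # Keep only the timestamps inside the sliding window.
    window = recent_requests[ip]
    window.append(now)
    while window and window[0] < now - WINDOW_SECONDS:
        window.popleft()

    # Over the limit: ban, doubling the duration on every repeat offence.
    if len(window) > MAX_REQUESTS_PER_WINDOW:
        ban_count[ip] += 1
        banned_until[ip] = now + BASE_BAN_SECONDS * 2 ** (ban_count[ip] - 1)
        return False

    return True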

Following the same idea as before: someone with a browser configured in Chinese accesses an Argentinian website through an IP in Thailand. Is that normal? Maybe, but not very often. When this happens, the headers of the request will not match the location of the source IP address.

Needless to say, any request that arrives without headers set at all will be discarded.
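
A minimal sketch of this kind of header check, assuming a hypothetical country_of_ip helper that you would back with a GeoIP database; the rules are illustrative, and in practice you would score mismatches rather than block on a single one.

def country_of_ip(ip: str) -> str:
    # Hypothetical helper: a real implementation would query a GeoIP database.
    # Returning a fixed value keeps the sketch self-contained.
    return "ar"

def looks_suspicious(ip: str, headers: dict) -> bool:
    # Requests without basic headers are discarded outright.
    if not headers.get("User-Agent") or not headers.get("Accept-Language"):
        return True

    # Compare the declared language/region with the country the IP resolves to.
    # A zh-CN browser behind a Thai IP is not impossible, just unusual.
    language = headers["Accept-Language"].split(",")[0].split(";")[0].lower()
    if "-" in language and language.split("-")[1] != country_of_ip(ip).lower():
        return True

    return False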

Google Bot, is it really you?

Check that the bots entering your website are the real ones; it is easy to verify their IPs and DNS. You can find more information here. If a website realizes you are sending a lot of requests camouflaged as Googlebot, but not from the giant's IP ranges, it can block them outright. The same goes for Bing.
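
For example, this is the reverse-then-forward DNS check, sketched with Python's standard library: resolve the visitor's IP to a host name, make sure it belongs to Google, then resolve that name back and confirm it points to the same IP.

import socket

def is_real_googlebot(ip: str) -> bool:
    try:
        # 1) Reverse DNS: the host name must belong to Google.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # 2) Forward DNS: that host must resolve back to the same IP.
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        # Lookup failed: treat the visitor as unverified.
        return False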

Robots.txt

One of the self-explanatory measures that mitigates the overuse of scrapers a little is the information that the website provides to everyone through the robots.txt standard.

Clarification: robots.txt and sitemap.xml are different. Sitemap.xml lets the crawlers know what the URLs of the website are. Robots.txt gives the crawlers directives on how to crawl properly.

I recommend following the Google directives or the robots exclusion standard. Just keep in mind: the more you expose in your robots.txt, the worse. For instance, listing your sitemaps can be good for the search engines, but it is even better for the scrapers (because you hand them the entire website, URL by URL, on a silver platter).
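
For reference, this is how a well-behaved crawler honors those directives using Python's standard library (the domain, user agent and URL are placeholders):

from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://www.example.com/robots.txt")
robots.read()  # fetches and parses the file

if robots.can_fetch("MyScraper/1.0", "https://www.example.com/private/page"):
    print("Allowed by robots.txt")
else:
    print("Disallowed: a polite crawler should skip this URL")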

Captchas

Another countermeasure you can use is the captchas that you find on the big companies' websites (Google, Amazon or even Facebook). When they suspect something weird (really fast access from the same IP in a short period of time, or missing headers), the website automatically presents a puzzle to be solved as a captcha. This is a nightmare for scrapers, because it forces them into a much more complex scraping strategy.

Hidden Javascript or honeypots

When a person scrapes using a bot, the bot is usually as dumb as a stone: it only reads the HTML the server returns, not the final page rendered by Javascript. If the bot is a little smarter, and with the help of some tool (such as a headless browser), it can read the generated HTML as well.

Sometimes the website creates honeypots in order to check whether the visitor is a bot or not. Different strategies can be used, and each one can be measured and have its own KPI:

  • Go from one page to another: how many pages per minute is a single IP able to view? If the number is unusually high, maybe it's a bot.
  • Filling forms: How fast is this human filling the forms? If a “human” is able to fill a form and press submit in less than 1 second, it’s pretty strange, right?
  • Checking the size of the browser's window: this looks pretty stupid, doesn't it? But it's very useful, just like the resolution of the window. This information can be a good indicator too.
  • Fonts of the website: imagine you are a human. Humans really need a proper font to read; a bot does not. For a bot, it does not matter if the font is Comic Sans or Arial. Well, please, not Comic Sans. Never!
  • Speed of pressing buttons: if the speed is always exactly the same from the same IP (and faster than expected), it can be another good indicator.
  • Not able to set cookies: most robots do not set cookies. If cookies are never set, the "human" is probably not using a real browser.
  • Suspicious range of IPs: how often should a regular user connect from AWS, Google Cloud or Azure IPs? Almost never. That's why you should avoid one of the most common setups: deploying the crawlers on different cloud servers.
  • Combine images and text: text in images can be a 100% showstopper for scraping.
  • Use fake or dumb data in the HTML code (not visible from the human perspective): a website can introduce some hidden (non-rendered) HTML code, hidden for humans but visible for bots, in order to confuse the scrapers (this will not work for scrapers that use screen-scraping); see the sketch after this list.
  • Requesting the whole webpage or only part of it? If a user requests only some parts of the page, it is definitely a bot, because a human needs all the resources (images, CSS and so on) to see the page properly in his/her browser.
  • Use AJAX in order to get information: if a bot is not executing JS, how can it retrieve the data, if the website is loading information only by AJAX? This is a really good way to start hiding information from the less sophisticated bots (this will not work for scrapers that use screen-scraping)
  • Avoid exposing internal APIs or auto-incremental identifiers: if the scraper sees an identifier in the URL like www.page.com/resource/1234, it can guess that resource 1235 is another one. It's a good idea to use this in a honeypot. But if you are dumb enough to expose the identifier to the world, be aware of the consequences.
  • Create/use a tool to change the schema (or only names and classes) of the HTML: a scraper is using the text in HTML as a map to search the data to extract. If you constantly change the points in the map (as class names or the id of the elements), the scraper will get lost while getting the information
  • Change the content depending on the location or the header: can you imagine that the same website can give you different content if you navigate it from Utah (U.S.) or if you connect from Oporto (Portugal)? This can be a big problem for those scrapers that are trying to retrieve data using world-wide IP proxies
  • (For me, the best one) Change the data when you detect a bot: what is more useless for someone overusing your infrastructure just to retrieve data than receiving false information? Personally, it's the best way to say: back off from my website and don't come back again, stranger!
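
To illustrate the hidden-HTML honeypot from the list above, here is a minimal Flask sketch; the route name, the invisible link and the in-memory set are all made up for the example. Humans never see the trap link, so only bots that blindly follow every href end up flagged.

from flask import Flask, abort, request

app = Flask(__name__)
flagged_ips = set()  # in production this would live in Redis or a database

@app.route("/")
def index():
    # The trap link is present in the raw HTML but invisible to humans.
    return (
        "<html><body>"
        "<h1>Products</h1>"
        '<a href="/club-members" style="display:none">special offers</a>'
        "</body></html>"
    )

@app.route("/club-members")
def honeypot():
    # No human should ever land here: flag the IP and refuse to serve it.
    flagged_ips.add(request.remote_addr)
    abort(403)

@app.before_request
def block_flagged():
    if request.remote_addr in flagged_ips:
        abort(403)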

And probably the best way to avoid being scraped is to offer a good API (a minimal sketch follows the list below). Even if it is a paid one (a small amount of money, say $5/month), you will be happier, because you will be controlling:

  • Who is getting access to the resources
  • Which resources to provide
  • Less traffic on the website
  • Less effort controlling scrapers
  • Control over the requests
  • Some income
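
A minimal sketch of that idea, again with Flask; the keys, endpoint and data are invented for the example. Because every consumer identifies itself with a key, you know exactly who is pulling data, and you can rate-limit or bill per key instead of fighting anonymous scrapers on your HTML pages.

from flask import Flask, abort, jsonify, request

app = Flask(__name__)

API_KEYS = {"k-123": "customer-a", "k-456": "customer-b"}  # issued per customer

@app.route("/api/v1/products")
def products():
    key = request.headers.get("X-Api-Key")
    if key not in API_KEYS:
        abort(401)  # unknown consumer: no data
    # Per-key rate limiting, usage metering and billing would hook in here.
    return jsonify([{"id": 1, "name": "example product"}])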

To sum up!

This article is part of the series of articles about Scraping from 0 to hero:

Conclusions

Nothing else. I have covered almost everything about scraping, sometimes at a big-picture level and sometimes at a very low level (giving examples). I hope you now have a clearer idea, and that you can let your imagination run to scrape the information you need.

If you want to know more about this topic, you can comment below or reach me on Twitter or LinkedIn.
