Scraping from 0 to hero (Part 2/5)

Ferran Parareda
5 min read · Nov 4, 2021

Basic scraping

The main concept of scraping is really simple: create a small Scrapy script to get data from Wikipedia.org (see also their Terms of Use).

After running three simple commands to create the project and set it up, you can start to code. The commands are:

> scrapy startproject simple_project
> cd simple_project
> scrapy genspider wikipedia wikipedia.org

Once you have executed these commands, you can open PyCharm (or your IDE of choice) and start developing.

https://gitlab.com/fparareda/scrapy_simple_project/-/blob/master/simple_project/spiders/wikipedia.py

The spider itself is really simple (a minimal sketch follows this list):

  • Function __init__: sets up the initial properties of the spider. I always prefer to set the URL to start scraping from in the constructor (the __init__ function). This lets me pass extra arguments from the command line and, if I want, tweak the URLs a bit before the scraper starts.
  • Function parse: extracts the elements you want from the web page (HTML content such as the title or the excerpt text).
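
Here is a minimal sketch of what that spider might look like (the start URL and selectors are illustrative; the repository linked above contains the actual code):

    import scrapy


    class WikipediaSpider(scrapy.Spider):
        name = "wikipedia"
        allowed_domains = ["wikipedia.org"]

        def __init__(self, start_url="https://en.wikipedia.org/wiki/Web_scraping", *args, **kwargs):
            # Setting the start URL in the constructor lets you override it from
            # the command line: scrapy crawl wikipedia -a start_url=<other url>
            super().__init__(*args, **kwargs)
            self.start_urls = [start_url]

        def parse(self, response):
            # Extract a couple of simple elements from the page.
            yield {
                "title": response.css("h1::text").get(),
                "excerpt": response.xpath("//p/text()").get(),
            }

Running scrapy crawl wikipedia -o output.json from the project folder executes the spider and stores the scraped items in a JSON file.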

The rest of the project (https://gitlab.com/fparareda/scrapy_simple_project) contains the configuration files (a short sketch of items.py and pipelines.py follows this list):

  • settings.py, the structure of the scraper, where everything is defined (concurrent requests, delay between requests, User Agents, whether to obey robots.txt, etc.)
  • items.py, where the item is defined: the structure in which the information is stored iteratively during the whole scraping process
  • pipelines.py, the pipelines that process the information once it has been scraped
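
As an illustration (the field names and the whitespace cleanup are hypothetical, not the project's actual code), an item and a trivial pipeline could look like this:

    # items.py: the structure that holds the scraped data
    import scrapy


    class SimpleProjectItem(scrapy.Item):
        title = scrapy.Field()
        excerpt = scrapy.Field()


    # pipelines.py: called for every item yielded by the spider
    class SimpleProjectPipeline:
        def process_item(self, item, spider):
            # For example, normalize whitespace before the item is stored.
            if item.get("excerpt"):
                item["excerpt"] = item["excerpt"].strip()
            return item

Remember that a pipeline only runs if it is enabled in the ITEM_PIPELINES setting of settings.py.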

Get data

In the spider itself (called wikipedia.py in the project) you specify how to extract the information. To do it, you can use two kinds of selectors (as the Scrapy documentation describes); a quick example follows this list:

  • XPath is a language for selecting nodes in XML documents, which can also be used with HTML (the best way to learn XPath is practice plus StackOverflow).
  • CSS is the language used to apply styles to HTML documents; it defines selectors to associate those styles with specific HTML elements (exactly the same idea as the old jQuery library in JavaScript, and naming it makes me feel old).
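
Both kinds of selectors are used on the response object inside the spider. A small sketch, with illustrative selectors, extracting the same element both ways:

    def parse(self, response):
        # CSS selector: the text of the main heading
        title_css = response.css("h1#firstHeading::text").get()

        # XPath selector: the same element, written as a path through the document
        title_xpath = response.xpath('//h1[@id="firstHeading"]/text()').get()

        yield {"title": title_css or title_xpath}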

Login

Quite often, you need to log into a website to retrieve its information. For instance, many forums where you can get really good information require you to be logged in first. To do that, you can follow the login documentation to simulate the login in the bot.

If you want to do it in Python (with Scrapy), you can follow my example.
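
As a hedged sketch (the URLs, form field names and failure message are placeholders; the real ones depend on the target website), a login with Scrapy usually relies on FormRequest.from_response:

    import scrapy


    class ForumSpider(scrapy.Spider):
        name = "forum"
        # Placeholder: replace with the real login page of the target site.
        start_urls = ["https://example.com/login"]

        def parse(self, response):
            # Fill in the login form found on the page; the field names
            # depend on the website's HTML.
            return scrapy.FormRequest.from_response(
                response,
                formdata={"username": "my_user", "password": "my_password"},
                callback=self.after_login,
            )

        def after_login(self, response):
            # Very rough check: the exact failure text depends on the website.
            if b"authentication failed" in response.body.lower():
                self.logger.error("Login failed")
                return
            # From here on, the session cookies are kept automatically,
            # so you can request the pages that require authentication.
            yield scrapy.Request("https://example.com/private", callback=self.parse_private)

        def parse_private(self, response):
            yield {"title": response.css("title::text").get()}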

Once you have logged in properly, you can enjoy all the information!

Countermeasures (retries, delays and multithreading)

It is very common to use countermeasures to avoid being blocked by the defensive rules set up by webmasters, who often rely on pre-installed packages in their configurations to prevent data from being extracted automatically. That is why we need to take action on several fronts. I am going to review the most common ones:

  • Lack of headers (specifically User Agents): if you want to be identified as a human when you request the information, you need to add a header specifying that you are visiting the website from a specific computer, operating system and browser version. In Scrapy you can hardcode these values in settings.py, or you can use this middleware or this one. You can also set things like cookies, or other HTTP headers that simulate a redirection from Google or another search engine.
  • Multiple retries in a short time on the same website: out of the box, Scrapy only retries a failed request a couple of times. You can tune this in settings.py to retry at least 2 times, and combine it with delays so the retries do not hit the site in a burst.
  • Parallelization of requests (multithreading): a single Scrapy server can execute several requests per second, which means it can fire requests with no delay between them. The most common approach is to specify a delay between requests to mimic the behavior of a human visitor, since it is impossible for a person to open two pages at the same time from the same computer. The recommendation is to respect an average rate (sometimes specified in the website's robots.txt) or at least pick a random delay between two values (e.g. 0.25 and 5 seconds). The settings.py sketch after this list shows how headers, retries and delays can be configured.
  • Pattern behavior in the requests made by one IP: big companies (Amazon, LinkedIn, Twitter, etc.) use Artificial Intelligence (AI) to detect whether the behavior of the requests indicates that they come from the same origin (even if the IPs and headers are different).
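
A rough sketch of the relevant part of settings.py (the values are illustrative and should be tuned per target website):

    # Identify the crawler as a regular browser (headers / User Agent).
    USER_AGENT = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0 Safari/537.36"
    )
    DEFAULT_REQUEST_HEADERS = {
        "Accept-Language": "en",
        # Simulate arriving from a search engine.
        "Referer": "https://www.google.com/",
    }

    # Retries.
    RETRY_ENABLED = True
    RETRY_TIMES = 2

    # Delays and parallelism.
    DOWNLOAD_DELAY = 2                 # average delay between requests, in seconds
    RANDOMIZE_DOWNLOAD_DELAY = True    # the real delay varies around DOWNLOAD_DELAY
    CONCURRENT_REQUESTS_PER_DOMAIN = 1
    ROBOTSTXT_OBEY = True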

The fourth point, pattern detection, works by analyzing behavior such as:

  • Every day, at the same time, someone scrapes the same pages.
  • Out of the blue, the site receives a lot of requests from different IPs around the world, even though the website is only available in one language.
  • The speed and the pages that a group of IPs try to access are correlated (for instance, the website has a lot of pagination and it detects that a Chinese IP is requesting page 34 and, shortly afterwards and for the first time, a Russian IP is requesting page 35).

These behaviors get the offending IPs put in quarantine, so they can be analyzed further and, as a precaution, denied the information.

Minimum Scraper structure

Once you are ready to set up your first scraper, keep in mind that it has to run on some machine. There are two possibilities: local and external/cloud.

Local

When you execute your scraper locally, it is pretty easy to run and to control. Issues are easy to solve, because it is a manual process. You also have control over your IP, and you can use a local database, for instance.

It is free and pretty basic to set up. You can follow an easy tutorial on how to create a dummy scraper in Scrapy here.

Cloud servers

When you want to scale up, the best solution is to use a cloud service. By cloud service I mean Amazon Web Services (AWS), Google Cloud, Microsoft Azure, OVH or others.

Based on your needs, they differ in:

  • Price,
  • Scalability,
  • The ability to combine different services.

In my opinion, one of the best solutions (if you have a pretty good knowledge of how to set up a complex server) is to rent a server from Hetzner.

The easiest way to get started is to set up Scrapy in your local environment. Once the local Scrapy server is able to scrape dozens of websites, you can switch to cloud storage (SQL, NoSQL, disk or even S3), and you can deploy the local scraper to that same server or to another cloud service provider.

To be continued…

This article is part of the Scraping from 0 to hero series:

If you have any question, you can comment below or reach me on Twitter or LinkedIn.
