At nam.R, we gather publicly available, non-personal data and use our AI tools to build a Digital Twin of France. For that, we need a robust and scalable scraping infrastructure. Until recently we used Scrapy but, having reached its limits, we started developing our own solution, based on Celery.
Distributed web scraper: Scraper, Scraping, Scrapy
A scraper is a program whose goal is to extract data by automated means from a format not intended to be machine-readable, such as a screenshot or a formatted web page. Scripting languages are mostly used for scraping, and in Python the most well-known library is Scrapy.
Scrapy describes itself as “An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.” If your goal is exactly that, “extracting data from websites”, and nothing more, Scrapy may be your best option. In a few lines of code, you can have a working scraper.
You will reach its limits if you try to do something more complex, but you should not reinvent the wheel: if Scrapy works for you, you can even look at Frontera to scale your scrapers.
Why develop a custom solution?
Let’s use an example: you need to retrieve the nearest and best-rated sushi restaurants around millions of addresses, but the website you want to scrape this information from only lets you list every restaurant in a city. You will have to:
- Scrape the location and rating of every restaurant in the cities present in your list of addresses
- Compute the nearest and best-rated restaurants
- Download the complete information for the selected restaurants (we don’t want to download everything in the first step, as it would take too much time and disk space).
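The second step is essentially a nearest-neighbour search weighted by rating. A naive, pure-Python sketch of the idea looks like this; the field names and the scoring rule (distance minus rating) are invented for illustration:

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


def best_restaurant(address, restaurants):
    """Pick the restaurant minimising distance and maximising rating.

    `address` is a (lat, lon) pair; each restaurant is a dict with
    "lat", "lon" and "rating" keys -- a made-up schema for this sketch.
    """
    return min(
        restaurants,
        key=lambda r: haversine_km(address[0], address[1], r["lat"], r["lon"]) - r["rating"],
    )
```

Looping this over millions of addresses in Python is exactly the kind of computation that becomes the bottleneck, which is the point of the next paragraph.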
The first step can easily be achieved with Scrapy. You’ll hit the first obstacle when trying to do the computation. You could try to hack it into the middle of a Scrapy spider, but matching millions of addresses with thousands of restaurants through geospatial computations in Python is going to take a lot of time, and any error means you’ll lose everything, as scraped data does not persist between Scrapy runs. Worse, if you chose to use Frontera, you would also need to synchronize scraped data across multiple instances.
At this point, it seems easier to build a custom distributed web scraper from scratch than to hack your needs into existing products.
We need our scraper to be resilient: webpages are always changing, servers can crash or be overloaded, and you don’t want a random error to halt the whole scraping run. After the first run, we want to reuse the scraped data to speed up the following ones. Obviously, it needs to be scalable and easy to monitor, and finally we want its code base to be maintainable and fully tested.
The name, scrap.R, wasn’t hard to find as 90% of our projects’ names at nam.R are puns on the name of the company. We chose to base our scraping infrastructure on Celery, a Python library made to manage a distributed queue of tasks. It is mostly used to execute asynchronous tasks alongside a web server, like sending emails for example, but it’s a very powerful and versatile tool. We are using Celery to divide the scraping into smaller independent tasks that can be executed in parallel.
A single Celery instance is able to process millions of tasks a minute, meaning Celery won’t become a bottleneck. A PostgreSQL database is used to centralize the scraped data, meaning we cannot lose more than the result of a single failed task. Every scraper follows the same logic with 3 types of tasks: workers, downloaders and scrapers.
The workers are the entry point into the scraping process; their role is to send the initial requests. When scraping a paginated list, the worker may only query the first page at first, but on following runs it may use the previously scraped data to start at the last known page and send requests to directly update every known item.
The downloaders simply take requests and return responses, nothing fancy, but multiple downloaders exist to support multiple protocols (mainly HTTP(S) and (S)FTP).
The scrapers’ role is to read the response, extract the data we’re interested in and send more requests if needed. Following the example of a paginated list, a scraper would extract the items of the current page and send a request for the next page.
This infrastructure can support very basic as well as very complex logic. Another strength is the ability to leverage the power of a PostgreSQL database. Using PostGIS, a PostgreSQL extension made to handle geospatial data, we can solve the computation issue in the sushi restaurant example by computing the nearest/best-rated restaurants with a single SQL query.
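Roughly, such a query could look like the following; the table and column names are invented, and the `<->` operator is PostGIS’s index-assisted nearest-neighbour ordering:

```sql
-- Hypothetical schema: addresses(id, geom), restaurants(id, rating, geom).
-- For each address, pick the closest restaurant rated at least 4.
SELECT a.id AS address_id, r.id AS restaurant_id
FROM addresses a
CROSS JOIN LATERAL (
    SELECT id
    FROM restaurants
    WHERE rating >= 4
    ORDER BY restaurants.geom <-> a.geom  -- KNN ordering, uses the spatial index
    LIMIT 1
) r;
```

The database does in one indexed pass what a naive Python loop over millions of addresses would grind through for hours.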
The limited number of task types also helps with the maintainability of the code, as it enforces a specific structure. When we need to scrape a new source, we can quickly create new workers and scrapers as needed. We also keep code coverage at 100%, which can be bothersome at first but helps catch bugs early.
We administer our Celery cluster with a web-based interface named Flower. It’s useful both during development and in production to track failed tasks and retrieve their stack traces. Celery Beat is used to execute workers periodically, and the schedule can be accessed and edited via an API.
We are still actively developing scrap.R, adding new scrapers and improving performance and error handling. After scraping, we also use Celery to analyze scraped data and send it into our internal datasets database: the Data Library. Celery is a versatile tool that can be used for a variety of tasks; it fits the needs of a distributed web scraper well, and using a lower-level library, compared to Scrapy, lets us have more control.
Scraping is the first step in almost everything we do at nam.R, it is how we can continuously integrate up-to-date data into our machine learning algorithms to improve our Digital Twin of France.