Scrapy: Design

Ankit Lohani
5 min read · Jun 2, 2018


This article continues from the last one, where I gave a very brief introduction to the Twisted framework. I hope you are now familiar with the basics on which the Scrapy framework is built. Let us now talk a bit about the Scrapy architecture.

Unlike most of the tutorials available on the Internet, I won’t start directly by building crawlers for you. As Kent’s father once told him, “People fear what they don’t understand and hate what they can’t conquer.” I think we should first look at the basic architecture and structure our thoughts before we proceed further. Otherwise, Scrapy can seem very difficult to understand, and it might take a long time to realize its full potential.

Scrapy Architecture

Scrapy’s official documentation beautifully describes the Scrapy data flow with the following figure —

Figure 1: Data Flow

Let us collect the key terms from this figure — spiders, requests/responses, items, item pipeline, engine, scheduler, downloader and middlewares — 8 terms in total.

I will keep a Scrapy project structure in parallel with this figure so you can see how things actually flow in the project with respect to it. You can create a Scrapy project with the command —

scrapy startproject tutorial

This will create a project directory —

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
            sample_spider.py

Now, the first thing to keep in mind is that sample_spider.py is the main script, where we will write our crawler. All the other files contain supporting classes that assist the Scrapy engine while it crawls. You do not have to configure them; if you leave them alone, Scrapy falls back on the middlewares, pipelines, settings and items it has already defined internally. Configuring these four files overrides or extends that internal configuration for your particular project. So let us walk through the data flow I presented at the beginning, assuming that in a fresh Scrapy project we have simply written a class in sample_spider.py and left the other four scripts as they are.
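
For concreteness, here is a minimal sketch of what such a class in sample_spider.py might look like. The spider name matches the crawl command used below; the site, CSS selectors and item fields are placeholders I have assumed for illustration, not part of the original project.

# sample_spider.py : a minimal, illustrative spider (site and selectors are assumed)
import scrapy

class SampleSpider(scrapy.Spider):
    name = "sample_spider"  # used with: scrapy crawl sample_spider
    start_urls = ["http://quotes.toscrape.com/"]  # placeholder start URL

    def parse(self, response):
        # Yield items scraped from the HTML response
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        # Yield follow-up requests; these go back to the engine and the scheduler
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)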

Step 1: Our spider is ready. When we instruct the project to start crawling with scrapy crawl sample_spider, the core engine starts up with the default middlewares, settings, pipelines and items before the spider is opened and crawling begins (remember, we have not edited the other four files).

Step 2: The spider is opened and, based on the input URL(s), the engine starts getting requests from the spider. But a link is not visited directly; it is pumped to the scheduler first.

Step 3: As the scheduler keeps receiving URLs from the spider (if we started with only one URL, the scheduler receives just that one and passes it on), it starts pumping those requests to the downloader via the engine and the downloader middleware [input]. (We will understand later what this middleware does. For now, assume that it does nothing and just passes the requests it receives from the engine on to the downloader.)

Step 4: The downloader fetches the page and passes the response back to the engine via the downloader middleware [output] (again, assume it does nothing and just passes along the HTML it received). The engine then sends this response back to the spider through the spider middleware [input]. (Note that the initial request that came to the engine did not go through any spider middleware.)
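
For reference, the “do nothing” downloader middleware assumed in Steps 3 and 4 would look roughly like the sketch below. It only shows the two hooks Scrapy calls; you would also have to enable such a class under DOWNLOADER_MIDDLEWARES in settings.py for Scrapy to use it.

# middlewares.py : a pass-through downloader middleware, for illustration only
class PassThroughDownloaderMiddleware:

    def process_request(self, request, spider):
        # Engine -> downloader direction; returning None lets the
        # request continue to the downloader unchanged.
        return None

    def process_response(self, request, response, spider):
        # Downloader -> engine direction; returning the response
        # unchanged simply hands it back to the engine.
        return response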

Step 5: The spider processes the response and sends the items (whatever you scraped from the HTML response) and the new URLs (extracted from the response for further crawling) back to the engine through the spider middleware [output].

Step 6: This time the engine receives two kinds of things: items and requests. It sends the items to the item pipeline and the requests to the scheduler. One purpose of the item pipeline could be to store this data locally, while the requests go through Step 3 again.
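
To make Step 6 concrete, here is a sketch of a simple item pipeline that stores items locally. The JSON-lines file name is an arbitrary choice for illustration, and the class would need to be listed under ITEM_PIPELINES in settings.py to take effect.

# pipelines.py : a minimal pipeline that appends every item to a JSON-lines file
import json

class JsonWriterPipeline:

    def open_spider(self, spider):
        # Called once when the spider is opened (Steps 1 and 2)
        self.file = open("items.jl", "w", encoding="utf-8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the engine sends to the pipeline (Step 6)
        self.file.write(json.dumps(dict(item)) + "\n")
        return item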

I hope you now have a mental image of how data flows through the system. To further illustrate this flow, I am going to use the analogy of water pipes to explain some other crucial points that you might not notice otherwise. The length (L) of a pipe represents the service time taken, and the cross-sectional area (A) represents the throughput (operations or requests per unit time). Thus the volume of a pipe, V = L × A, is the total number of elements inside that part of the system at any moment (for example, if the downloader takes 0.5 s per request and sustains 32 requests per second, about 16 requests are in flight at once). In most cases we cannot control the service time, i.e. the length of the pipe (we cannot make the downloader work faster; it has to wait for the remote server to respond, and that takes as long as it takes).

Figure 2

The purpose of the above diagram is to locate the bottlenecks in our system. For simple projects, Scrapy takes care of everything, but complex projects, like “Buzzbang”, require a clear understanding of the internals. There are two key points we can take away from the diagram —

  1. We should write the spiders and the item pipeline carefully (I asked you to set the pipeline aside earlier, but this is where we decide what to do with the data, e.g. store it in a database) so that we do not introduce any blocking work there, such as making a request to a remote server and waiting synchronously for its response. Spiders and pipelines rarely do heavy processing; if they did, the server’s CPU might become the bottleneck, but in most cases, including ours, it is not.
  2. Of the 8 terms I introduced earlier, three (spiders, items and item pipelines) are covered by the previous point, two (the engine and the middlewares) are controllers of the pipeline, and the remaining three (the downloader, the scheduler and the requests/responses) are the probable bottlenecks. From the arguments presented in the figure, I hope it is clear that in a Scrapy system it is wise to place the bottleneck in the downloader, because we cannot remove the delay of downloading and processing requests; it will therefore determine how “full” the other pipes in the system are (the settings sketch after this list shows the knobs that control this).
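
In practice, “placing the bottleneck in the downloader” mostly comes down to a few concurrency settings that decide how full each pipe can get. These are ordinary Scrapy settings you can put in settings.py; the values below are illustrative examples I have picked, not recommendations.

# settings.py : knobs that control how "full" the pipes can get (values are examples)
CONCURRENT_REQUESTS = 32             # requests the downloader may have in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # per-domain cap, to stay polite to each server
DOWNLOAD_DELAY = 0.25                # seconds to wait between requests to the same site
CONCURRENT_ITEMS = 100               # items processed in parallel (per response) in the pipelines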

Well, that’s all for this article. We now have a picture of Scrapy’s internals and how they work together. In the next article, we will look at the Scrapy components that are most useful for building a proper project for most intermediate-level use cases.
