Scrapy: Background

Ankit Lohani
Jun 2, 2018

“The only way to know how strong you are is to keep testing your limits.” That is Jor-El to his son, and I think it holds for Scrapy as well. I have been working on my Google Summer of Code project, Buzzbang, for the past two weeks, and the first part I have to implement is web scraping. I have been reading a lot about it (docs, Stack Overflow, Reddit and every other source out there), and I can confidently conclude that Scrapy is the Superman of the scraping world. In this article, I want to share what I learnt about this Kryptonian kid.

Introduction to the Twisted Framework

In layman’s terms, Scrapy is a web crawling framework, but unlike most other tools out there, it is built on Twisted, an asynchronous networking framework for Python, and that is the secret behind its superpowers. Before going any further, I will take you through what I call the “Twisted Framework”. And even before that, let us take a quick look at how asynchronous systems work.

Asynchronous Programming

Let’s say a program has to complete three tasks. In a single-threaded synchronous model, the tasks are performed one at a time, in sequence. When a task starts, we can assume that all earlier tasks have finished without errors, with all their output available for use, which is a definite simplification in logic. Figure 1 represents this model.

Figure 1

In contrast, in a multi-threaded synchronous model each task is performed in a separate thread of control. The threads are managed by the OS and may run truly concurrently on multiple processors, or may be interleaved on a single processor. In other words, the details of execution are handled by the OS, and the program is written simply in terms of independent instruction streams which may run simultaneously. The catch is that communication and coordination between threads are difficult to reason about and to implement correctly. Figure 2 represents such systems.

Figure 2
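To make the threaded model a bit more concrete, here is a minimal sketch using Python’s standard threading module. The task function and its names are made up for illustration; the point is only that each task gets its own thread and that shared data needs explicit coordination (a lock here).

```python
import threading

results = []
results_lock = threading.Lock()  # coordination: protects the shared list


def task(name, steps):
    """Run one task in its own thread and record its result."""
    total = sum(range(steps))  # stand-in for real work
    with results_lock:
        results.append((name, total))


threads = [
    threading.Thread(target=task, args=(f"task-{i}", 1000 * i))
    for i in (1, 2, 3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread before reading the results

# The completion order depends on the OS scheduler, so sort before printing.
print(sorted(results))
```

Notice that we cannot say in which order the three tasks finish; that decision belongs to the OS.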

Some programs implement parallelism using multiple processes instead of multiple threads, but the overall picture is the same. The asynchronous model is different: the tasks are interleaved with one another, but in a single thread of control. Thus, only one task is executing at any given time, even on a multi-processor system. Figure 3 represents such a single-threaded asynchronous system.

Figure 3

The asynchronous model can finish sooner than the synchronous one mainly because, when some tasks are blocked or forced to wait (typically on I/O), the others keep executing and making progress. It allows us to make a call without waiting for that call to complete. But instead of using threads to accomplish this, it uses callbacks. Callbacks (for now, think of a callback as just another Python function) are typically cheaper than threads because there is no context-switching overhead, and we don’t pay the cost of periodically visiting threads to see if something has changed. Network clients and servers are typical examples where this model fits well, since they spend much of their time waiting for requests and responses. So a network server implementation is a prime candidate for the asynchronous model. Twisted is first and foremost a networking library. It is a highly abstracted system, which gives us tremendous leverage when we use it to solve such problems.
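The interleaving in Figure 3 can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: it uses generators and a round-robin loop rather than Twisted’s callbacks, and the names task and run_interleaved are hypothetical. Each yield marks a point where a real task would otherwise be stuck waiting.

```python
from collections import deque

log = []


def task(name, steps):
    """A task that does one small step of work, then gives up control."""
    for step in range(1, steps + 1):
        log.append(f"{name} step {step}")
        yield  # a real task would be waiting on I/O at this point


def run_interleaved(tasks):
    """Round-robin over the tasks in a single thread until all finish."""
    queue = deque(tasks)
    while queue:
        current = queue.popleft()
        try:
            next(current)          # run the task up to its next yield
            queue.append(current)  # not finished: schedule it again
        except StopIteration:
            pass                   # finished: drop it


run_interleaved([task("A", 2), task("B", 1), task("C", 2)])
print(log)
# ['A step 1', 'B step 1', 'C step 1', 'A step 2', 'C step 2']
```

Only one task runs at any moment, yet all three make progress without a single extra thread.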

Twisted Framework

Now that we have some idea of how asynchronous systems work, let’s delve deeper into Twisted. Twisted is an implementation of the Reactor pattern. In a regular program, the program makes things happen. In an event-driven system like Twisted, the program sits and waits for things (events) to happen, and then responds to them. Thus, we don’t write a typical main() that drives things; instead, we hand over control to something that watches for events. Whenever an event occurs, a callback function is called to handle our computation. Twisted relies heavily on callbacks for everything! It is worth noting that our callback code runs in the same thread as the Twisted loop, and as we can see in Figure 4, when our callbacks are running, the Twisted loop is not running, and vice versa. During a callback, the Twisted loop is effectively “blocked” on our code. So we should make sure our callback code doesn’t waste any time. In particular, we should avoid making blocking I/O calls in our callbacks. Otherwise, we would be defeating the whole point of using the reactor pattern in the first place.

Figure 4: What happens during a callback

The aim of this article is to make you familiar with the background of Scrapy and with a few basic concepts it builds on, so I am not walking through a working scraper here. We will take up real Scrapy examples when our scraper flies out to save its world.

The Twisted framework is composed of layers of abstractions: APIs and interfaces. The reactor is one such abstraction and probably the most important one, because everything that happens inside a Twisted program happens while the reactor loop is spinning. But this takes place at a lower level of abstraction, and the library takes care of it. Any Twisted program, including Scrapy, which is built on top of it, uses these APIs to get most of its work done. The code we write for Twisted programs has to be asynchronous as well and should not be mixed with blocking, synchronous code. Thus, we will have to use callbacks in our code (as you will see in the Scrapy examples) for performing tasks and handling errors.
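To get a feel for the reactor loop, here is a toy sketch. ToyReactor and its methods (on, emit, run, stop) are names I made up, and this is not how Twisted’s reactor is implemented: a real reactor waits on sockets and timers instead of a simple in-memory queue. What it does show is the shape of the pattern: we register callbacks, hand over control to the loop, and the loop is paused while each callback runs.

```python
from collections import deque


class ToyReactor:
    """A toy event loop that dispatches queued events to callbacks."""

    def __init__(self):
        self._handlers = {}     # event name -> callback
        self._events = deque()  # pending (event name, payload) pairs
        self._running = False

    def on(self, event, callback):
        self._handlers[event] = callback

    def emit(self, event, payload=None):
        self._events.append((event, payload))

    def stop(self):
        self._running = False

    def run(self):
        # A real reactor would block waiting for I/O; this toy version
        # simply exits once the queue is empty or stop() is called.
        self._running = True
        while self._running and self._events:
            event, payload = self._events.popleft()
            handler = self._handlers.get(event)
            if handler is not None:
                handler(payload)  # the loop is paused until this returns


loop = ToyReactor()
seen = []


def on_data(payload):
    seen.append(payload.upper())
    if payload == "bye":
        loop.emit("close")


def on_close(_payload):
    seen.append("closed")
    loop.stop()


loop.on("data", on_data)
loop.on("close", on_close)
loop.emit("data", "hello")
loop.emit("data", "bye")
loop.run()
print(seen)  # ['HELLO', 'BYE', 'closed']
```

If on_data did something slow, such as a blocking network call, no other event could be handled in the meantime, which is exactly why callbacks must return quickly.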

Callbacks & Errbacks

Let’s spend a little more time on callbacks, since we are going to use them a lot in our Scrapy programs. When we wrap code in a try/except block in a normal Python script, the interpreter runs the except block automatically whenever a matching exception is raised in the try block. The “error callback” (aka errback) in Twisted programming does not work like that. The errback is invoked by our own code, and we are in charge of making sure it runs when something goes wrong. We have to “raise” our asynchronous exception ourselves; otherwise our program will just run forever, blissfully unaware that anything is amiss, or, simply put, waiting for a callback that never comes. Thus, we have to make sure to handle every possible error case by invoking the errback with a Failure object.
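Here is a small sketch of that contract in plain Python. The function parse_price and the two handlers are hypothetical, and for simplicity the errback receives the bare exception, where Twisted code would wrap it in a Failure object. The thing to notice is that nothing calls the errback for us.

```python
def parse_price(text, callback, errback):
    """Parse a price string and report the outcome through callbacks."""
    try:
        value = float(text)
    except ValueError as exc:
        # We must invoke the errback ourselves; if we forgot this branch,
        # the caller would wait forever for a result that never comes.
        errback(exc)
    else:
        callback(value)


outcomes = []


def on_price(value):
    outcomes.append(("ok", value))


def on_error(error):
    outcomes.append(("error", type(error).__name__))


parse_price("19.99", on_price, on_error)
parse_price("n/a", on_price, on_error)
print(outcomes)  # [('ok', 19.99), ('error', 'ValueError')]
```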

The Deferred

Callbacks get tricky most of the time, and we have to take care not to call them at the wrong time. In a typical use case, the callback and the errback are mutually exclusive, and exactly one of them is invoked, exactly once. The Twisted developers created an abstraction called a Deferred to make programming with callbacks easier. A deferred contains a pair of callback chains, one for normal results and one for errors. A newly created deferred has two empty chains. We can populate the chains by adding callbacks and errbacks and then fire the deferred with either a normal result or an exception. Firing the deferred will invoke the appropriate callbacks or errbacks in the order they were added. Figure 5 illustrates a deferred instance with its callback/errback chains:

Figure 5: A Deferred
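The two chains in Figure 5 can be mimicked with a toy class. Please treat ToyDeferred as a simplified stand-in I wrote for illustration, not Twisted’s implementation: the method names add_callback, add_errback and fire are mine (Twisted’s Deferred uses addCallback, addErrback, callback and errback), and an exception instance passed to fire stands in for firing with an error.

```python
class ToyDeferred:
    """A simplified stand-in for a Deferred with two handler chains."""

    def __init__(self):
        self._chain = []  # (callback, errback) pairs, in the order added

    def add_callback(self, callback):
        self._chain.append((callback, None))
        return self

    def add_errback(self, errback):
        self._chain.append((None, errback))
        return self

    def fire(self, result):
        """Run the chains; an Exception instance selects the error chain."""
        for callback, errback in self._chain:
            handler = errback if isinstance(result, Exception) else callback
            if handler is None:
                continue  # this link has nothing for the current chain
            try:
                result = handler(result)
            except Exception as exc:
                result = exc  # a failing handler switches to the error chain
        return result  # returned only so the sketch is easy to inspect


def count_chars(body):
    return len(body)


def double(number):
    return number * 2


def report(error):
    return f"failed: {error}"


def build():
    return ToyDeferred().add_callback(count_chars).add_callback(double).add_errback(report)


ok_result = build().fire("abcd")                  # callbacks run in order
bad_result = build().fire(ValueError("timeout"))  # only the errback runs
print(ok_result, "|", bad_result)  # 8 | failed: timeout
```

Each handler receives the result of the previous one, so a chain reads like a small pipeline.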

Deferreds allow us to set up follow-up actions for something that will take some time to complete. This in turn frees Twisted to attend to other tasks and come back to execute the follow-up actions once the result is ready. You tell a deferred (think of it as an object that represents a promise of a future result) what to do when the data arrives by creating the deferred and then adding callbacks to it. Remember, a Deferred itself doesn’t use the reactor. Firing one involves nothing asynchronous: the callbacks run right then, even if no reactor is running. It really boils down to ordinary Python function calls.

Deferreds help us avoid one of the pitfalls we identified with callback programming. When we use a deferred to manage our callbacks, we simply can’t make the mistake of calling both the callback and the errback, or of invoking the callback multiple times! We can try, but the deferred will raise an exception right back at us instead of passing our mistake on to the callbacks themselves. Because a Deferred can only be fired once, its behaviour resembles the familiar semantics of try/except statements.
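The fire-once rule is easy to sketch on its own. OneShot and AlreadyFiredError below are hypothetical names for a toy guard; in Twisted the corresponding exception is AlreadyCalledError.

```python
class AlreadyFiredError(Exception):
    """Raised when a one-shot result is fired a second time."""


class OneShot:
    """Holds one callback and one errback and lets only one of them run."""

    def __init__(self, callback, errback):
        self._callback = callback
        self._errback = errback
        self._fired = False

    def _claim(self):
        if self._fired:
            raise AlreadyFiredError("already fired")
        self._fired = True

    def fire(self, result):
        self._claim()
        self._callback(result)

    def fire_error(self, error):
        self._claim()
        self._errback(error)


calls = []
shot = OneShot(calls.append, calls.append)
shot.fire("page body")
try:
    shot.fire_error(RuntimeError("too late"))  # second firing is rejected
except AlreadyFiredError:
    calls.append("second fire rejected")
print(calls)  # ['page body', 'second fire rejected']
```

The mistake surfaces at the place where we made it, and the errback never sees the bogus second result.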

Well, that’s a lot to digest for now. But we will come back to these concepts when I explain Scrapy in detail in the next article.
