Scrapy: Creating Crawlers

Ankit Lohani
6 min read · Jun 4, 2018


In the previous article, we understood the internals of Scrapy. Now that we have a better picture of what goes on behind the scenes, it is time to start learning the various aspects of Scrapy. I have come up with a concept map that lists the important concepts of Scrapy and the order in which one may go about learning them. You can see it in Figure 1. For the rest of this article, and the ones I will publish later, I will follow this concept map for the order of the content.

Figure 1 — Scrapy Concept Map

I have already shown the structure of a default Scrapy project directory. There are five files that are of interest to us — items.py, middlewares.py, pipelines.py, settings.py and sample_spider.py. The first three contain sample classes of their respective modules; they are not called anywhere, and we need to update them before use. The settings.py file contains some default settings that come into effect when we create a new project. If we change any setting in this file, it simply overrides the settings in the layers behind it (yes, Scrapy settings are in fact provided in five layers, which we will discuss later). Finally, we have the sample_spider.py file, where we will write the main crawling script.
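As a quick reminder, a fresh project (created with, say, scrapy startproject sample_scraper; the project name here is just illustrative) looks roughly like this, with the spider file added by us or via scrapy genspider:

sample_scraper/
    scrapy.cfg                 # deploy configuration
    sample_scraper/            # the project's Python module
        __init__.py
        items.py               # item definitions
        middlewares.py         # spider and downloader middlewares
        pipelines.py           # item pipelines
        settings.py            # project-level settings
        spiders/
            __init__.py
            sample_spider.py   # our spider lives here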

I won’t explain the basic parts in great detail because there are already tonnes of tutorials available out there. I would rather focus on how you can go about learning them and some of the tricky bits inside.

Spiders

Let’s recap what we learned in the last article about spiders. A spider makes the initial request to the engine. It receives a response to the request it sends, and from that response it extracts further links to be crawled and the items of interest. We will now see how these three things are written down in a class.

Figure 2 — Spider Concept Map

Scrapy has a template for writing generic spiders — scrapy.Spider. The other four spider classes are specialised versions of this one, made available for specific use cases and for our convenience. Let us understand the vanilla scrapy.Spider class that we are going to write in the sample_spider.py file. I will take a slightly complicated example to explain all the basic concepts in one go.

Problem Statement — I have to collect data on all biosamples available on this website. This URL is my start URL, and the response I get contains the HTML page that you see when you visit this link. As explained in the previous article, the two components of interest in this response are the samples themselves (items) and the URLs pointing to further pages that we want to crawl (requests). A small challenge here is that this HTML response does not contain the sample data we wish to collect. In fact, we have to visit each sample link on every page to collect the data. (Note: if you visit a sample, the data is present on that page in JSON-LD format and can be extracted easily with a function from the “extruct” library.) Let’s see how we do that —

import json
import scrapy
import logging
from extruct.jsonld import JsonLdExtractor

logger = logging.getLogger('paginationlogger')


class BiosamplesPaginationSpider(scrapy.Spider):
    name = "pagination"
    allowed_domains = ['ebi.ac.uk']
    start_urls = ['https://www.ebi.ac.uk/biosamples/samples?start=0', ]
    # the first request to the engine is made from here

    # the response received from the first request goes directly into this function
    def parse(self, response):
        base = 'https://www.ebi.ac.uk'
        # visit each sample & get the URL taking us to its data page
        urls = response.xpath("//a[@class='button readmore float-right']/@href").extract()
        for url in urls:
            request = scrapy.Request(base + url, callback=self.fn)
            yield request

        # process the next page (urljoin handles relative hrefs safely)
        next_page_url = response.xpath("//li[@class='pagination-next']//a/@href").extract_first()
        if next_page_url is not None:
            request = scrapy.Request(response.urljoin(next_page_url), callback=self.parse)
            yield request

    def fn(self, response):
        jslde = JsonLdExtractor()
        jsonld = jslde.extract(response.body)
        logger.info("Sample Extracted - %s", jsonld)

As I explained in the architecture article, this is how things happen behind the scenes, step by step: the spider makes the initial request to the engine using start_urls (a list) or start_requests (a method). If you provide start_urls, the response is automatically handled by the parse method. But if you use the start_requests method or build an individual scrapy.Request, the response is handled by the callback you attach to it (fn here). response.body contains the raw HTML, which can be parsed within the callback using Selectors. name is simply a class attribute that gives the spider a name so that Scrapy can identify it easily, and the allowed_domains list makes sure that the new requests parsed from response.body don’t take us outside this particular domain. Crawler class methods and settings will be discussed in further articles.
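For instance, here is a minimal sketch of the same spider using start_requests instead of start_urls, attaching the callback explicitly; nothing else about the spider changes:

import scrapy

class BiosamplesPaginationSpider(scrapy.Spider):
    name = "pagination"
    allowed_domains = ['ebi.ac.uk']

    # start_requests replaces start_urls and lets us attach a callback
    # (and headers, cookies, meta, etc.) to the very first request
    def start_requests(self):
        yield scrapy.Request(
            'https://www.ebi.ac.uk/biosamples/samples?start=0',
            callback=self.parse,
        )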

The other spiders are built on top of this one and bundle some of these features together to cater to different requirements smoothly.
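For example, the generic CrawlSpider bundles link extraction and link following into declarative rules. A rough sketch for the same site (the XPath restrictions are carried over from the spider above as an assumption, not something verified against this site) could look like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from extruct.jsonld import JsonLdExtractor

class BiosamplesCrawlSpider(CrawlSpider):
    name = "pagination_crawl"
    allowed_domains = ['ebi.ac.uk']
    start_urls = ['https://www.ebi.ac.uk/biosamples/samples?start=0']

    rules = (
        # follow pagination links; no callback means just keep crawling
        Rule(LinkExtractor(restrict_xpaths="//li[@class='pagination-next']")),
        # send each sample page to parse_item
        Rule(LinkExtractor(restrict_xpaths="//a[@class='button readmore float-right']"),
             callback='parse_item'),
    )

    def parse_item(self, response):
        jslde = JsonLdExtractor()
        yield {'jsonld': jslde.extract(response.body)}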

Items, Item loaders and Item pipeline

In the previous article, I talked about item pipelines — an inbuilt module in Scrapy that takes in your data and performs the required actions (like pushing it to a DB). The coolest thing about it is that it is woven asynchronously with the scheduler and downloader, so while those two modules are busy with their work, the item pipeline does its own. Now, how are these items taken to the pipelines? Scrapy has created a vessel for the items; as the engine finds out that a vessel has been filled, it carries it to the pipelines. These vessels are our “items”. Item loaders are simply a mechanism for filling these vessels. You should not worry much about this mechanism for now; you can populate your items directly in the spider. The first step is to create these items —

class BioschemasScraperItem(scrapy.Item):
    jsonld = scrapy.Field()

Put this class in the sample_spider.py file and use it directly, or follow the standard Scrapy way of declaring it in the items.py file and importing it in your spider. Next, instead of just logging the items, update your fn callback to fill the item vessels —

def fn(self, response):
    jslde = JsonLdExtractor()
    jsonld = jslde.extract(response.body)
    item = BioschemasScraperItem()
    item['jsonld'] = jsonld
    logger.info("Sample Extracted - %s", response.url)
    yield item
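To actually do something with these items once the engine hands them over, we also need a pipeline. Here is a minimal sketch of one that appends every item to a JSON-lines file (the file name and the module path below are illustrative assumptions); it goes in pipelines.py and must be enabled in settings.py:

import json

class BioschemasWriterPipeline(object):
    # called once when the spider is opened
    def open_spider(self, spider):
        self.file = open('samples.jl', 'w')

    # called once when the spider is closed
    def close_spider(self, spider):
        self.file.close()

    # called for every item yielded by the spider
    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

And in settings.py (the number decides the order in which pipelines run; lower runs first):

ITEM_PIPELINES = {
    'bioschemas_scraper.pipelines.BioschemasWriterPipeline': 300,
}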

Exceptions

There are six built-in exceptions that Scrapy provides (you can define your own, but these are enough for most use cases). We can raise them anywhere in the project — in middlewares, pipelines and spiders. One example I used in my project is raising a DropItem exception when the scraped HTML page has no JSON-LD data (who wants to push an empty document to a DB?). Simply import it at the beginning of the spider and use it as follows —

from scrapy.exceptions import DropItem

def fn(self, response):
    jslde = JsonLdExtractor()
    jsonld = jslde.extract(response.body)
    item = BioschemasScraperItem()
    item['jsonld'] = jsonld
    # drop the item if any of its fields came back empty
    for field, value in item.items():
        if not value:
            raise DropItem("Missing data!")
    logger.info("Sample Extracted - %s", response.url)
    yield item
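Another built-in exception worth knowing early on is CloseSpider, which stops the whole crawl gracefully when raised inside a callback. A small sketch (the page limit is purely an illustrative assumption):

import scrapy
from scrapy.exceptions import CloseSpider

class BiosamplesPaginationSpider(scrapy.Spider):
    name = "pagination"
    allowed_domains = ['ebi.ac.uk']
    start_urls = ['https://www.ebi.ac.uk/biosamples/samples?start=0']
    max_pages = 50   # illustrative limit
    pages_seen = 0

    def parse(self, response):
        self.pages_seen += 1
        if self.pages_seen > self.max_pages:
            # the engine finishes pending work and shuts the spider down
            raise CloseSpider("page limit reached")
        # ... extract items and follow links as before ...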

Essential Settings

I have talked about the settings that Scrapy provides us with. There are some basic settings that you need to know for now, while other settings will be discussed as and when required. There are five layers in which settings in a Scrapy project can be given; you can read about them here. Let’s spend some time analysing these settings. Here is a concept map that segregates the different types of settings; we will talk about the important and basic ones only.

Figure 3 — Basic settings concept map

I won’t explain these settings individually, as they are well covered in the documentation. However, this should give you some idea of the order in which to start looking at things in Scrapy and to include them one by one in your project to make it more stable and robust.
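As a starting point, a project-level settings.py covering some of these essentials might look like this (the values and the module path are illustrative assumptions, not recommendations):

BOT_NAME = 'bioschemas_scraper'

# be polite to the server
ROBOTSTXT_OBEY = True
USER_AGENT = 'bioschemas_scraper (contact: you@example.org)'
DOWNLOAD_DELAY = 1.0
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# let AutoThrottle adapt the crawl speed to the server's responses
AUTOTHROTTLE_ENABLED = True

# enable the pipeline from the previous section (lower number = runs earlier)
ITEM_PIPELINES = {
    'bioschemas_scraper.pipelines.BioschemasWriterPipeline': 300,
}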

We will now move on to some really great concepts of Scrapy, and we will take it slow from here so that we can grasp those topics well.

Until next time!
