Using Scrapy to create a generic and scalable crawling framework

A simple framework that can scale to crawling multiple websites without requiring regular code changes.

Prerequisites:
1. Scrapy
2. Scrapyd
3. Kafka

We’ll go through the process step-by-step to understand the underlying reasons behind doing things a certain way and build up to the final product.
Also, we’ll try to solve the bigger problem (crawling websites regularly), which will end up solving the smaller problems (scraping a single page, extracting links from a page, etc.) along the way.

Logical division:
Let’s think of crawling a website as a three-step process -
a) URL extraction: First we get those pages (links) of a website which contain the content we are trying to scrape.
b) Content scraping: Then we use the URLs found to extract data from those pages.
c) Post-extraction pipeline: Once we have some data, we do something with it (store it in a db, publish it on a channel, etc.)

URL extraction:
Create a spider using Scrapy’s built-in Spider and LinkExtractor to extract links. The only thing we’ll do differently here is to accept a root parameter at runtime, which is the URL of the source page we need to start extracting links from.
We also take a depth parameter for cases where we need to extract links at depth > 0 (this is optional and defaults to 0).
Apart from these, we can optionally accept values for the LinkExtractor parameters so that we can configure it at runtime.

from scrapy.spiders import Spider
from scrapy import Request
from scrapy.linkextractors import LinkExtractor


class UrlExtractor(Spider):
    name = 'url-extractor'
    start_urls = []

    def __init__(self, root=None, depth=0, *args, **kwargs):
        self.logger.info("[LE] Source: %s Depth: %s Kwargs: %s", root, depth, kwargs)
        self.source = root
        self.options = kwargs
        self.depth = depth
        UrlExtractor.start_urls.append(root)
        # normalise the runtime options before using them
        self.clean_options()
        UrlExtractor.allowed_domains = self.options.get('allow_domains')
        # configure the LinkExtractor with whatever options were passed at runtime
        self.le = LinkExtractor(allow=self.options.get('allow'), deny=self.options.get('deny'),
                                allow_domains=self.options.get('allow_domains'),
                                deny_domains=self.options.get('deny_domains'),
                                restrict_xpaths=self.options.get('restrict_xpaths'),
                                canonicalize=False,
                                unique=True, process_value=None, deny_extensions=None,
                                restrict_css=self.options.get('restrict_css'),
                                strip=True)
        super(UrlExtractor, self).__init__(*args, **kwargs)

    def start_requests(self, *args, **kwargs):
        yield Request(self.source, callback=self.parse_req)

    def parse_req(self, response):
        all_urls = []
        # keep following links until the configured depth is reached
        if int(response.meta['depth']) <= int(self.depth):
            all_urls = self.get_all_links(response)
            for url in all_urls:
                yield Request(url, callback=self.parse_req)
        # emit every extracted link as an item
        for url in all_urls:
            yield dict(link=url, meta=dict(source=self.source, depth=response.meta['depth']))

    def get_all_links(self, response):
        links = self.le.extract_links(response)
        return [link.url for link in links]

    def clean_options(self):
        # every LinkExtractor option arrives as a comma-separated string; turn it into a list
        allowed_options = ['allow', 'deny', 'allow_domains', 'deny_domains', 'restrict_xpaths', 'restrict_css']
        for key in allowed_options:
            if self.options.get(key, None) is None:
                self.options[key] = []
            else:
                self.options[key] = self.options.get(key).split(',')

This will allow us to extract links from a website at runtime.
For example, if you needed to extract only the links of articles (ignoring photo stories and photo galleries) from the IndianExpress website, you could just trigger this spider from your project as

scrapy crawl url-extractor -a root=http://indianexpress.com/ -a allow_domains="indianexpress.com" -a depth=0 -a allow="/article/"
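
If you just want to inspect the extracted links locally, you can also dump the yielded items to a file using Scrapy’s feed export flag (the file name here is only an example):

scrapy crawl url-extractor -a root=http://indianexpress.com/ -a allow_domains="indianexpress.com" -a depth=0 -a allow="/article/" -o links.json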

Content Scraping:
Now we’ll create a spider which can scrape the content of a given URL.
We want a scraper to which we can pass the config at runtime and get the scraped result back as a dictionary keyed by the labels we finally want.
For this example we’ll scrape data from the page using CSS selectors (you can use XPath, or even make this configurable at runtime).

import json
import re

import scrapy


class Scraper(scrapy.spiders.Spider):
    name = 'scraper'

    def __init__(self, page=None, config=None, mandatory=None, *args, **kwargs):
        self.page = page
        self.config = json.loads(config)
        self.mandatory_fields = mandatory.split(',')
        super(Scraper, self).__init__(*args, **kwargs)

    def start_requests(self):
        self.logger.info('Start url: %s' % self.page)
        yield scrapy.Request(url=self.page, callback=self.parse)

    def parse(self, response):
        item = dict(url=response.url)
        # iterate over all keys in config and extract a value for each of them
        for key in self.config:
            # extract the data for the key from the html response
            res = response.css(self.config[key]).extract()
            # if the label is any kind of url then make sure we have an absolute url instead of a relative one
            if bool(re.search('url', key.lower())):
                res = self.get_absolute_url(response, res)
            item[key] = ' '.join(elem for elem in res).strip()

        # ensure that all mandatory fields are present, else discard this scrape
        mandatory_fields_present = True
        for key in self.mandatory_fields:
            if not item.get(key):
                mandatory_fields_present = False

        if mandatory_fields_present:
            yield dict(data=item)

    @staticmethod
    def get_absolute_url(response, urls):
        final_url = []
        for url in urls:
            if not bool(re.match('^http', url)):
                final_url.append(response.urljoin(url))
            else:
                final_url.append(url)
        return final_url

Let’s test this on a Times of India (TOI) article.
When we run it with the article link and the extraction config like this:

scrapy crawl scraper -a page='https://timesofindia.indiatimes.com/city/delhi/2014-khirki-extn-raid-court-orders-aaps-somnath-bharti-to-stand-trial/articleshow/64810526.cms' -a config='{"title":".heading1 arttitle::text","tags":"meta[itemprop=\"keywords\"]::attr(content)","publishedTs":"meta[itemprop=\"datePublished\"]::attr(content)","titleImageUrl":"link[itemprop=\"thumbnailUrl\"]::attr(href)","body":".Normal::text","siteBreadCrumb":"span[itemprop=\"name\"]::text"}' -a mandatory='title'

We get a dictionary as output like this:

{
"title": "2014 Khirki Extension raid: Court orders AAP\u2019s Somnath Bharti to stand trial",
"url": "https://timesofindia.indiatimes.com/city/delhi/2014-khirki-extn-raid-court-orders-aaps-somnath-bharti-to-stand-trial/articleshow/64810526.cms",
"titleImageUrl": "https://static.toiimg.com/thumb/msid-64810525,width-1070,height-580,imgsize-1103101,resizemode-6,overlay-toi_sw,pt-32,y_pad-40/photo.jpg",
"tags": "Latest News,Live News,2014 Khirki Extension raid,Somnath Bharti,MLA,minister,Malviya Nagar,Khirki Extension,bharti,AAP,AAP\u2019s Somnath Bharti",
"publishedTs": "2018-07-01T07:40:47+05:30",
"siteBreadCrumb": "News City News Delhi News Politics",
"body": "NEW DELHI: In fresh trouble for former Delhi law and MLA , a Delhi court on Saturday asked him to face trial in connection with a \n \n\n \n The court brushed aside Bharti\u2019s claim of unfair police probe and ordered framing of charges against him and several other accused for the offences ranging from molestation, house trespass, criminal intimidation etc. Since some of these offences fall in the category of crimes against women, these are non-bailable.\n \n While framing charges additional chief metropolitan magistrate Samar Vishal in his order noted that \u201cBy no stretch of imagination it can be assumed that whatever offences are alleged to have been done by Bharti can be said to have been done in the discharge of his official duties. I am unable to understand what official duty prompted him to assault the helpless women of foreign origin at around 1am.\u201d\n \n Apart from , 16 others have been booked in the case after the Malviya Nagar allegedly barged into the homes of nine Ugandan nationals in , along with some followers, on the intervening night of January 15 and 16.\n \n The court ordered framing of charges against Bharti and others under Sections 147/149 (rioting), 354 (molestation), 354C (voyeurism), 342 (wrongful confinement), 506 (criminal intimidation), 143 (unlawful assembly), 509 (outraging a woman's modesty), 153A (promoting enmity between two groups or religions), 323 (assault), 452 (house trespass), 427 (criminal trespass) and 186 (obstructing public servant in discharge of public functions) of the IPC.\n \n In the order, the magistrate added that there was sufficient evidence that some of these women were beaten and were caused simple hurt punishable under Section 323 of the IPC. \u201cSome of these women have alleged that they were assaulted and their modesty was outraged, they were forced to urinate in front of the mob and therefore the mob has committed an offence under sections 354 and 354C of IPC.\u201d\n \n In his defence, Bharti had claimed he received a string of complaints from residents of the area that a drugs and prostitution ring was being run by the Ugandan nationals. However, investigations by police revealed that no drugs were recovered that night. During the incident, Bharti also had an altercation with the cops on the issue.\n \n In its chargesheet, the police cited around 41 prosecution witnesses, including nine African women, to buttress the charges levelled following investigations into the FIR lodged on January 19, 2014 against \u201cunknown accused\u201d on the court\u2019s direction and booked them for various charges."
}

Post extraction pipeline:

In your pipeline, you can choose to do anything with the extracted information.
You could store it in a database, do some post-processing or write it to a Kafka topic.
You just need to edit the pipelines.py file according to your needs.
For further details, see the Scrapy item pipeline documentation.
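
As an example, here is a minimal sketch of a pipeline that publishes every scraped item to a Kafka topic; the broker address, the topic name and the kafka-python dependency are assumptions for illustration, not part of the original project.

import json

from kafka import KafkaProducer  # assumes the kafka-python package is installed


class KafkaPublishPipeline(object):
    # hypothetical pipeline: pushes each item to a Kafka topic as JSON

    def __init__(self):
        # broker address and topic name are placeholders; change them for your setup
        self.producer = KafkaProducer(bootstrap_servers='localhost:9092',
                                      value_serializer=lambda v: json.dumps(v).encode('utf-8'))
        self.topic = 'scraped-items'

    def process_item(self, item, spider):
        self.producer.send(self.topic, dict(item))
        return item

    def close_spider(self, spider):
        self.producer.flush()

Remember to enable the pipeline through the ITEM_PIPELINES setting in settings.py so that Scrapy actually runs it.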

Why scrapyd?

Scrapyd lets us control our spiders through a JSON API.
This means that instead of triggering a spider from the command line as

scrapy crawl url-extractor -a root=http://indianexpress.com/ -a allow_domains="indianexpress.com" -a depth=0 -a allow="/article/"

you could make an HTTP call to the API (Scrapyd listens on port 6800 by default):

curl http://localhost:6800/schedule.json -d project=default -d spider=url-extractor -d root=http://indianexpress.com/ -d allow_domains="indianexpress.com" -d depth=0 -d allow="/article/"
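
Scrapyd also exposes endpoints for monitoring; for instance, you can list the pending, running and finished jobs of a project with

curl http://localhost:6800/listjobs.json?project=default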

Bringing it all together:

Now that you have all your spiders set up, you’ll need an orchestrator to bring it all together.
The orchestrator is a program that simply triggers URL extraction for a site -> reads the results produced by the extractor (from a Kafka topic) -> triggers the page-scraper for each of those links.

It could also centrally hold the url-extractor and page-scraper configs for every site, along with a scheduler.
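
Below is a minimal sketch of such an orchestrator, assuming the url-extractor’s pipeline publishes its links to a Kafka topic named extracted-links, Scrapyd runs on its default port, and kafka-python plus requests are installed; the topic name, the scraping config and the helper functions are illustrative placeholders, not part of the original code.

import json

import requests
from kafka import KafkaConsumer  # assumes the kafka-python package is installed

SCRAPYD_URL = 'http://localhost:6800/schedule.json'  # Scrapyd's default port

# placeholder per-site scraping config; in practice this would be stored per site
SCRAPER_CONFIG = {'title': 'h1::text', 'body': 'p::text'}


def schedule_url_extraction(root, allow_domains, allow):
    # trigger the url-extractor spider through Scrapyd's JSON API
    requests.post(SCRAPYD_URL, data={'project': 'default', 'spider': 'url-extractor',
                                     'root': root, 'allow_domains': allow_domains,
                                     'depth': 0, 'allow': allow})


def schedule_scraper(page):
    # trigger the scraper spider for a single extracted link
    requests.post(SCRAPYD_URL, data={'project': 'default', 'spider': 'scraper',
                                     'page': page, 'config': json.dumps(SCRAPER_CONFIG),
                                     'mandatory': 'title'})


if __name__ == '__main__':
    schedule_url_extraction('http://indianexpress.com/', 'indianexpress.com', '/article/')
    # read links published by the url-extractor and fan them out to the scraper
    consumer = KafkaConsumer('extracted-links', bootstrap_servers='localhost:9092',
                             value_deserializer=lambda m: json.loads(m.decode('utf-8')))
    for message in consumer:
        schedule_scraper(message.value['link'])

In practice the orchestrator would also deduplicate links and run on a schedule, but that part depends entirely on your setup.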

Why all the separation?

Keeping all these elements separate allows you to scale this model indefinitely.
You can now have multiple url-extractors and page-scrapers running simultaneously, and a single orchestrator should be enough to control an enormous amount of scraping.
Since all of these components share data through Kafka, the whole flow is asynchronous.
