Develop a web crawler in 90 seconds

Marvin Zhang
8 min read · Feb 5, 2020


Background

Web scraping is an interesting field: it lets you collect web data with automated spider programs and saves a great deal of manual work. Before good spider frameworks emerged, developers built crawlers out of plain HTTP requests plus web-page parsing, for example Python requests + BeautifulSoup. More advanced crawlers also integrate data storage modules such as MySQL or MongoDB. But because of the low development efficiency and lack of robustness, building a comprehensive, production-ready crawler this way may take a couple of hours. I call this kind of web crawler a Non-framework Crawler.
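
To make this concrete, below is a minimal sketch of a Non-framework Crawler using requests and BeautifulSoup. The URL and CSS selectors are placeholders for illustration, not a real site.

import requests
from bs4 import BeautifulSoup

# A minimal Non-framework Crawler: fetch one page and parse it by hand.
# The URL and selectors are placeholders for illustration only.
resp = requests.get("https://example.com/articles", timeout=10)
soup = BeautifulSoup(resp.text, "html.parser")

items = []
for node in soup.select(".article-item"):  # one node per list entry
    title = node.select_one("h2")
    link = node.select_one("a")
    items.append({
        "title": title.get_text(strip=True) if title else None,
        "url": link.get("href") if link else None,
    })

print(items)
# Storage (MySQL, MongoDB, ...), retries, pagination and scheduling
# all have to be written by hand, which is why this approach is slow.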

In 2011, Scrapy, a web crawler framework built on Twisted, came to public attention and soon earned a reputation as a second-to-none high-performance asynchronous crawler framework. Scrapy abstracts away several core modules, allowing developers to focus on data extraction rather than fiddly concerns such as data downloading, page parsing and task coordination. Developing a production-ready Scrapy spider takes only about ten minutes, or up to an hour for complicated requirements. There are also other good frameworks such as PySpider and Colly. I call these Framework Spiders. Framework Spiders have unlocked real productivity, and with some modification they are used in production environments to crawl at a large scale.
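
For comparison, a bare-bones Framework Spider in Scrapy looks roughly like the sketch below; the site, selectors and fields are made up for illustration.

import scrapy

class ArticleSpider(scrapy.Spider):
    # A bare-bones Framework Spider: Scrapy handles downloading,
    # scheduling and concurrency; we only describe what to extract.
    name = "articles"
    start_urls = ["https://example.com/articles"]  # placeholder site

    def parse(self, response):
        for node in response.css(".article-item"):  # placeholder selector
            yield {
                "title": node.css("h2::text").get(),
                "url": node.css("a::attr(href)").get(),
            }
        # Follow pagination if a "next" link exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)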

However, for anyone who needs to crawl hundreds of sites, Framework Spiders may not be enough, and spider development turns into repetitive manual work. For example, if developing one Framework Spider takes 20 minutes on average and a full-time spider developer works 8 hours a day, covering 1,000 sites requires 20,000 minutes, or 333 hours, or 42 workdays. We could of course hire 10 full-time spider developers, but it would still take them about 4 workdays to complete (as in the figure below).

This is still very inefficient. To overcome this efficiency problem, Configurable Spiders were created.

Introduction to Configurable Spiders

A Configurable Spider, as the name suggests, is a spider that crawls data according to configured crawling rules. It is a highly abstracted spider program: developers do not write any spider code. Instead, they only configure site URLs, data fields and data attributes in config files or a database, and a special spider program reads these configurations and crawls web data accordingly. Configurable Spiders abstract spider code into configuration, which streamlines the spider development process: to finish a spider, a developer only needs to write the corresponding configuration. Configurable Spiders therefore make it practical to build spiders at a large scale (as in the figure below).
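
Conceptually, the special spider program is just one generic crawler driven by a rule file. The toy sketch below (with a made-up rule format, not Crawlab's actual schema) shows the idea: to cover a new site, you add a rule instead of writing a new spider.

import requests
from bs4 import BeautifulSoup

# A toy crawling rule: start URL, list selector and data fields.
# This is a made-up format to illustrate the concept, not Crawlab's schema.
rule = {
    "start_url": "https://example.com/news",
    "list_selector": ".news-item",
    # field name -> (CSS selector, attribute to read, or None for text)
    "fields": {"title": ("h3", None), "url": ("a", "href")},
}

def crawl(rule):
    # One generic crawler serves every site that has a rule like the above.
    html = requests.get(rule["start_url"], timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for node in soup.select(rule["list_selector"]):
        item = {}
        for name, (selector, attr) in rule["fields"].items():
            el = node.select_one(selector)
            item[name] = (el.get(attr) if attr else el.get_text(strip=True)) if el else None
        yield item

print(list(crawl(rule)))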

This approach makes it feasible to crawl hundreds of sites. A proficient spider configurer can configure spiders for 1,000 news sites. This matters a lot to companies that need public sentiment monitoring: Configurable Spiders increase productivity, reduce the unit cost of working time and improve development efficiency, which in turn supports sentiment analysis and AI product development. Many companies develop their own Configurable Spiders (the names may differ, but they are essentially the same thing) and employ spider configurers who focus solely on configuring spiders.

There are not many open-source Configurable Spider frameworks on the market. An earlier one is Gerapy, developed by the spider guru Germey Cui, who works at Microsoft. It is a crawler admin platform that can generate Scrapy project files from configured rules. A newer Configurable Spider framework is Crawlab, which is primarily a highly flexible spider admin platform; it released its Configurable Spider in v0.4.0. There is also Ferret, a framework written in Golang, which is quite interesting because it makes developing spiders as easy as writing SQL statements. There are some other commercial products, but according to user feedback they do not seem professional enough for production use.

Configurable Spiders are possible mainly because most web crawlers follow simple patterns: typically a combination of list pages and detail pages (as in the figure below), or list pages alone. There are of course more complicated general-purpose spiders, but many of them can also be implemented through rule configuration.

Crawlab Configurable Spiders

What we are introducing today is the Configurable Spider in Crawlab. The author briefly introduced Crawlab's main functionalities in an earlier article, but the Configurable Spider had not yet been developed at that time. In this article, we focus on putting Crawlab's Configurable Spiders into practice. If you are not familiar with them, please refer to the documentation (Chinese).

Configurable Spiders in Practice

All example spiders were configured by the author with the Configurable Spider on the Crawlab Official Demo Platform, covering domains including news, finance, auto, books, video, search engines and developer communities. We introduce a few of them below. All examples are available on the Official Demo Platform, and you can sign up to check them out.

Baidu (search “Crawlab”)

URL: http://crawlab.cn/demo#/spiders/5e27d055b8f9c90019f42a83

Configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: http://www.baidu.com/s?wd=crawlab
start_stage: list
stages:
- name: list
  is_list: true
  list_css: ""
  list_xpath: //*[contains(@class, "c-container")]
  page_css: ""
  page_xpath: //*[@id="page"]//a[@class="n"][last()]
  page_attr: href
  fields:
  - name: title
    css: ""
    xpath: .//h3/a
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: ""
    xpath: .//h3/a
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: ""
    xpath: .//*[@class="c-abstract"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
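
The list stage above corresponds roughly to the hand-written Scrapy spider sketched below. This is only an approximation of what the engine does with the configuration, not the code Crawlab actually generates.

import scrapy

class BaiduSpider(scrapy.Spider):
    # Rough hand-written equivalent of the Spiderfile above;
    # an approximation, not the code Crawlab actually generates.
    name = "baidu_crawlab"
    start_urls = ["http://www.baidu.com/s?wd=crawlab"]
    custom_settings = {
        "ROBOTSTXT_OBEY": False,
        # USER_AGENT would be set to the same desktop Chrome UA as in the Spiderfile.
    }

    def parse(self, response):
        # Stage "list": one item per search result container.
        for row in response.xpath('//*[contains(@class, "c-container")]'):
            yield {
                "title": "".join(row.xpath(".//h3/a//text()").getall()).strip(),
                "url": row.xpath(".//h3/a/@href").get(),
                "abstract": "".join(row.xpath('.//*[@class="c-abstract"]//text()').getall()).strip(),
            }
        # Pagination: follow the next-page link defined by page_xpath/page_attr.
        next_page = response.xpath('//*[@id="page"]//a[@class="n"][last()]/@href').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)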

Results

SegmentFault (Newest Articles)

URL: http://crawlab.cn/demo#/spiders/5e27d116b8f9c90019f42a87

Configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://segmentfault.com/newest
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .news-list > .news-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: h4.news__item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .news-img
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: .article-excerpt
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

Results

Amazon China (search “phone”)

URL: http://crawlab.cn/demo#/spiders/5e27e157b8f9c90019f42afb

Configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_2
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .s-result-item
  list_xpath: ""
  page_css: .a-last > a
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: span.a-text-normal
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .a-link-normal
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: price
    css: ""
    xpath: .//*[@class="a-price-whole"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: price_fraction
    css: ""
    xpath: .//*[@class="a-price-fraction"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: img
    css: .s-image-square-aspect > img
    xpath: ""
    attr: src
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

Results

V2ex

URL: http://crawlab.cn/demo#/spiders/5e27dd67b8f9c90019f42ad9

Configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://v2ex.com/
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .cell.item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: a.topic-link
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: a.topic-link
    xpath: ""
    attr: href
    next_stage: detail
    remark: ""
  - name: replies
    css: .count_livid
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="markdown_body"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  AUTOTHROTTLE_ENABLED: "true"
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
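
This Spiderfile has two stages: the url field's next_stage: detail sends each topic link to a detail stage that extracts the post body. In plain Scrapy, that corresponds roughly to the sketch below; again, this is an approximation, not the code Crawlab actually generates.

import scrapy

class V2exSpider(scrapy.Spider):
    # Rough hand-written equivalent of the two-stage Spiderfile above;
    # an approximation, not the code Crawlab actually generates.
    name = "v2ex"
    start_urls = ["https://v2ex.com/"]
    custom_settings = {"AUTOTHROTTLE_ENABLED": True, "ROBOTSTXT_OBEY": False}

    def parse(self, response):
        # Stage "list": one item per topic row.
        for cell in response.css(".cell.item"):
            item = {
                "title": cell.css("a.topic-link::text").get(),
                "url": cell.css("a.topic-link::attr(href)").get(),
                "replies": cell.css(".count_livid::text").get(),
            }
            # next_stage: detail -> follow the url field to the detail page.
            if item["url"]:
                yield response.follow(item["url"], callback=self.parse_detail,
                                      meta={"item": item})

    def parse_detail(self, response):
        # Stage "detail": add the post content to the item from the list stage.
        item = response.meta["item"]
        item["content"] = "".join(
            response.xpath('//*[@class="markdown_body"]//text()').getall()
        ).strip()
        yield item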

Results

36kr

URL: http://crawlab.cn/demo#/spiders/5e27ec82b8f9c90019f42b59

Configuration

Spiderfile

version: 0.4.4
engine: scrapy
start_url: https://36kr.com/information/web_news
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .kr-flow-article-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: .article-item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: body
    xpath: ""
    attr: href
    next_stage: detail
    remark: ""
  - name: abstract
    css: body
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: author
    css: .kr-flow-bar-author
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: time
    css: .kr-flow-bar-time
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="common-width content articleDetailContent kr-rich-text-wrapper"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36

Results

Summary

The Crawlab Configurable Spider is very convenient and lets developers configure the spiders they need quickly. Configuring the 11 example spiders took less than 40 minutes in total (including time spent debugging against anti-crawl measures), and some of the simpler spiders took only 1–2 minutes each. No code was written; all the work was done on the web interface. Furthermore, Crawlab Configurable Spiders can be configured not only on the web interface but also through a Spiderfile in YAML format; in fact, every configuration maps to a Spiderfile. The Crawlab Configurable Spider is based on Scrapy, so it supports most of Scrapy's features, and you can use the settings section to extend spider settings such as USER_AGENT, ROBOTSTXT_OBEY, and so on.

Why choose the Crawlab Configurable Spider as the primary option? Because these spiders are not only configurable but also share all of Crawlab's core features, such as task coordination, cron jobs, log management and notifications. Going forward, the Crawlab dev team plans to keep improving Configurable Spiders to support more features, such as dynamic content, more engines and an implementation of CrawlSpider.

Community

If you find Crawlab helpful in your daily development or for your company, please add the author's WeChat account “tikazyq1” with the note “Crawlab”, and the author will add you to the discussion group. You are welcome to star the project on Github; if you have any questions, feel free to open issues there, and contributions are also very welcome.

Note: the examples in the figures are displayed in Chinese, but you can still go to the Demo (English is available) to view them in detail. Please feel free to contact me on Github ;)
