Develop a web crawler in 90 seconds
Background
Web scraping is an interesting craft: it lets you collect web data with automated spider programs, eliminating a great deal of manual work. Before good spider frameworks emerged, developers built crawlers out of simple HTTP requests plus web-page parsing, e.g. Python's requests plus BeautifulSoup. More advanced crawlers also integrate data-storage modules such as MySQL or MongoDB. But because of the development inefficiency and lack of robustness, building a comprehensive, production-ready crawler this way can take several hours. I call this kind of web crawler a Non-framework Crawler.
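For illustration, a minimal non-framework crawler in this style might look like the sketch below. The URL and CSS selectors (`.news-item`, etc.) are hypothetical placeholders, not taken from a real site, and parsing is demonstrated on an inline sample so the snippet is self-contained.

```python
import requests
from bs4 import BeautifulSoup

def parse_list(html):
    """Extract title/url pairs from a hypothetical news-list page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": a.get_text(strip=True), "url": a["href"]}
        for a in soup.select(".news-item h2 a")  # hypothetical selector
    ]

def crawl(url):
    """One plain HTTP request plus parsing: no framework involved."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_list(resp.text)

# The parser works on any HTML string:
sample = '<div class="news-item"><h2><a href="/a/1">Hello</a></h2></div>'
items = parse_list(sample)  # [{'title': 'Hello', 'url': '/a/1'}]
```

Every concern beyond this (retries, concurrency, deduplication, storage) has to be hand-rolled, which is exactly where the inefficiency comes from.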
In 2011, Scrapy, a Twisted-based web crawler framework, came to public attention and soon became regarded as a second-to-none high-performance asynchronous crawler framework. Scrapy abstracts away several core modules, allowing developers to focus on data extraction rather than fussy concerns such as downloading, page parsing, and task coordination. Developing a production-ready Scrapy spider takes about ten minutes, or up to an hour for complicated requirements. There are other good frameworks as well, such as PySpider and Colly. I call these Framework Spiders. Framework Spiders have unleashed productivity, and with some modification they are used by companies in production environments to crawl at large scale.
However, for those who need to crawl hundreds of sites, Framework Spiders may not be enough, and spider development turns into sheer manual labor. For example, if developing one Framework Spider takes 20 minutes on average and a full-time spider developer works 8 hours a day, covering 1,000 sites would require 20,000 minutes, i.e. about 333 hours, or 42 workdays. We could instead employ 10 full-time spider developers, but even then it would take about 4 workdays to complete (as in the figure below).
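The arithmetic behind those numbers:

```python
# Back-of-the-envelope estimate from the paragraph above.
sites = 1000
minutes_per_spider = 20
hours_per_workday = 8

total_minutes = sites * minutes_per_spider       # 20,000 minutes
total_hours = total_minutes / 60                 # about 333 hours
workdays = total_hours / hours_per_workday       # about 42 workdays
workdays_with_10_devs = workdays / 10            # about 4 workdays
```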
This is still very inefficient. Configurable Spiders were created to overcome this efficiency problem.
Introduction of Configurable Spiders
A Configurable Spider, as the name suggests, is a spider that crawls data according to configured crawling rules. It is a highly abstracted spider program: developers don't write any spider code. Instead, they configure site URLs, data fields, and data attributes in config files or a database, which a dedicated spider program reads in order to crawl the web data accordingly. Configurable Spiders abstract spider code into configuration, which streamlines the development process: a developer only needs to write the corresponding config to complete a spider. This makes Configurable Spiders well suited to writing spiders at large scale (as in the figure below).
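As a toy sketch of the idea (this is not Crawlab's implementation), the snippet below keeps the extraction rules in a plain dict and feeds them to a generic parser built on the standard library's `html.parser`; swapping the dict re-targets the same code at a different site.

```python
from html.parser import HTMLParser

# Hypothetical crawling rules, mirroring the spirit of a config file:
# each <article> is a list item; inside it, <h2> holds the title text
# and <a href> holds the link.
RULES = {"item_tag": "article", "title_tag": "h2", "link_tag": "a"}

class ConfigDrivenExtractor(HTMLParser):
    """Toy config-driven list-page extractor (illustration only)."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.items = []
        self._current = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == self.rules["item_tag"]:
            self._current = {"title": "", "url": ""}
        elif self._current is not None:
            if tag == self.rules["link_tag"]:
                self._current["url"] = dict(attrs).get("href", "")
            elif tag == self.rules["title_tag"]:
                self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._current["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == self.rules["title_tag"]:
            self._in_title = False
        elif tag == self.rules["item_tag"] and self._current is not None:
            self.items.append(self._current)
            self._current = None

parser = ConfigDrivenExtractor(RULES)
parser.feed("""
<div class="list">
  <article><h2>Post one</h2><a href="/post/1">read</a></article>
  <article><h2>Post two</h2><a href="/post/2">read</a></article>
</div>
""")
# parser.items now holds one dict per list item.
```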
This approach makes crawling hundreds of sites feasible: a proficient spider configurer can configure spiders for as many as 1,000 news sites. This matters greatly to companies that require public-sentiment monitoring, because Configurable Spiders increase productivity, reduce unit labor cost, and improve development efficiency, all of which benefit sentiment analysis and AI product development. Many companies are developing their own Configurable Spiders (the names differ, but they are essentially the same thing) and employing spider configurers who focus solely on configuring spiders.
There are not many open-source Configurable Spider frameworks on the market. An earlier one is Gerapy, developed by the spider guru Germy Cui, who works at Microsoft. It is a crawler admin platform that can generate Scrapy project files from config rules. A newer Configurable Spider framework is Crawlab, which is primarily a highly flexible spider admin platform; Crawlab released its Configurable Spider in v0.4.0. There is also Ferret, a Golang-based framework, which is quite interesting in that it makes developing spiders as easy as writing SQL statements. There are some commercial products as well, but according to user feedback they do not seem mature enough for production use.
Configurable Spiders emerged mainly because most web crawling follows simple patterns: essentially a combination of list pages and detail pages (as in the figure below), or list pages alone. There are certainly more complicated general-purpose spiders, but those too can be implemented through rule configuration.
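The list-plus-detail pattern boils down to a two-stage crawl. In the sketch below, `PAGES` is fake data standing in for real HTTP responses:

```python
# Fake pages standing in for HTTP responses; keys are URLs.
PAGES = {
    "/list": {"items": [{"title": "A", "url": "/a"},
                        {"title": "B", "url": "/b"}]},
    "/a": {"content": "article A"},
    "/b": {"content": "article B"},
}

def fetch(url):
    """Stand-in for an HTTP client."""
    return PAGES[url]

def crawl_site(list_url):
    records = []
    for item in fetch(list_url)["items"]:    # stage 1: the list page
        detail = fetch(item["url"])          # stage 2: each detail page
        records.append({**item, "content": detail["content"]})
    return records
```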
Crawlab Configurable Spiders
What we are introducing today is the Configurable Spider in Crawlab. The author briefly introduced Crawlab's main functionality in an earlier article, but the Configurable Spider had not yet been developed at that time. This article focuses on the practice of Crawlab Configurable Spiders. If you are not familiar with them, please refer to the documentation (Chinese).
Configurable Spiders Practice
All example spiders below were configured by the author through the Configurable Spider on the Crawlab Official Demo Platform, covering domains including news, finance, auto, books, video, search engines, and developer communities. We will walk through a few of them. All examples are available on the Official Demo Platform, and you can sign up to check them out.
Baidu (search “Crawlab”)
URL: http://crawlab.cn/demo#/spiders/5e27d055b8f9c90019f42a83
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: http://www.baidu.com/s?wd=crawlab
start_stage: list
stages:
- name: list
  is_list: true
  list_css: ""
  list_xpath: //*[contains(@class, "c-container")]
  page_css: ""
  page_xpath: //*[@id="page"]//a[@class="n"][last()]
  page_attr: href
  fields:
  - name: title
    css: ""
    xpath: .//h3/a
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: ""
    xpath: .//h3/a
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: ""
    xpath: .//*[@class="c-abstract"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
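As a rough mental model (not Crawlab's actual implementation), the field XPaths in the list stage above are applied to each matched list item. The snippet below replays them with the standard library's limited XPath support on a simplified, well-formed stand-in for one result block:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed stand-in for one Baidu result block.
item = ET.fromstring("""
<div class="c-container">
  <h3><a href="https://example.com/1">Some result title</a></h3>
  <div class="c-abstract">Some result abstract.</div>
</div>
""")

link = item.find(".//h3/a")       # xpath of the title and url fields
title = link.text                 # title: the element's text
url = link.get("href")            # url: attr "href"
abstract = item.find(".//*[@class='c-abstract']").text
```

Crawlab itself evaluates these XPaths through Scrapy's selectors, which accept the full syntax, including the `contains(@class, ...)` form used in `list_xpath`, which ElementTree cannot handle.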
Results
SegmentFault (Newest Articles)
URL: http://crawlab.cn/demo#/spiders/5e27d116b8f9c90019f42a87
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://segmentfault.com/newest
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .news-list > .news-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: h4.news__item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .news-img
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: .article-excerpt
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Results
Amazon China (search “phone”)
URL: http://crawlab.cn/demo#/spiders/5e27e157b8f9c90019f42afb
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_2
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .s-result-item
  list_xpath: ""
  page_css: .a-last > a
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: span.a-text-normal
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .a-link-normal
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: price
    css: ""
    xpath: .//*[@class="a-price-whole"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: price_fraction
    css: ""
    xpath: .//*[@class="a-price-fraction"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: img
    css: .s-image-square-aspect > img
    xpath: ""
    attr: src
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Results
V2ex
URL: http://crawlab.cn/demo#/spiders/5e27dd67b8f9c90019f42ad9
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://v2ex.com/
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .cell.item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: a.topic-link
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: a.topic-link
    xpath: ""
    attr: href
    next_stage: detail
    remark: ""
  - name: replies
    css: .count_livid
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="markdown_body"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  AUTOTHROTTLE_ENABLED: "true"
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Results
36kr
URL: http://crawlab.cn/demo#/spiders/5e27ec82b8f9c90019f42b59
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://36kr.com/information/web_news
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .kr-flow-article-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: .article-item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: body
    xpath: ""
    attr: href
    next_stage: detail
    remark: ""
  - name: abstract
    css: body
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: author
    css: .kr-flow-bar-author
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: time
    css: .kr-flow-bar-time
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="common-width content articleDetailContent kr-rich-text-wrapper"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Results
Summary
The Crawlab Configurable Spider is very convenient, allowing developers to quickly configure the spiders they need. Configuring the 11 spiders above took less than 40 minutes (including anti-crawl debugging time), and some of the easy ones took only 1–2 minutes. No code was written; all the work was done on the interface. Furthermore, Crawlab Configurable Spiders can be configured not only on the web interface but also in a Spiderfile in YAML format; in fact, every config can be mapped to a Spiderfile. Since the Crawlab Configurable Spider is based on Scrapy, it supports most of Scrapy's features, and you can extend spider settings such as USER_AGENT and ROBOTSTXT_OBEY through the settings section.
Why should we choose the Crawlab Configurable Spider as the primary option? Because it is not only configurable but also shares all the core features of Crawlab, such as task coordination, cron jobs, log management, and notifications. Looking ahead, the Crawlab dev team plans to improve Configurable Spiders to support more features, such as dynamic content, more engines, and a CrawlSpider equivalent.
Reference
- Github: https://github.com/crawlab-team/crawlab
- Demo: http://crawlab.cn/demo
- Documentation: http://docs.crawlab.cn/
If you find Crawlab helpful in your daily development or at your company, please add the author's WeChat account “tikazyq1” with the note “Crawlab”, and the author will add you to the discussion group. You are welcome to star the project on GitHub. If you have any questions, feel free to submit issues on GitHub; contributions are also welcome.
Note: the examples in the figures are displayed in Chinese, but you can still go to the Demo (English available) to view them in detail. Please feel free to contact me on GitHub ;)