Develop a web crawler in 90 seconds
Background
Web scraping is an interesting craft: it lets you collect web data with automated spider programs, eliminating a great deal of manual work. Before good spider frameworks emerged, developers built crawlers out of simple HTTP requests plus web-page parsing, e.g. Python's requests plus BeautifulSoup. More advanced crawlers also integrate data-storage modules such as MySQL or MongoDB. But because of the development inefficiency and lack of robustness, building a comprehensive, production-ready crawler this way can take several hours. I call this kind of web crawler a Non-framework Crawler.
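For illustration, a minimal non-framework crawler in this style might look like the sketch below. The URL and CSS selectors (`.news-item`, etc.) are hypothetical placeholders, not taken from a real site, and parsing is demonstrated on an inline sample so the snippet is self-contained.

```python
import requests
from bs4 import BeautifulSoup

def parse_list(html):
    """Extract title/url pairs from a hypothetical news-list page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"title": a.get_text(strip=True), "url": a["href"]}
        for a in soup.select(".news-item h2 a")  # hypothetical selector
    ]

def crawl(url):
    """One plain HTTP request plus parsing: no framework involved."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return parse_list(resp.text)

# The parser works on any HTML string:
sample = '<div class="news-item"><h2><a href="/a/1">Hello</a></h2></div>'
items = parse_list(sample)  # [{'title': 'Hello', 'url': '/a/1'}]
```

Every concern beyond this (retries, concurrency, deduplication, storage) has to be hand-rolled, which is exactly where the inefficiency comes from.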
In 2011, Scrapy, a Twisted-based web crawler framework, came to public attention and soon became regarded as a second-to-none high-performance asynchronous crawler framework. Scrapy abstracts away several core modules, allowing developers to focus on data extraction rather than fussy concerns such as downloading, page parsing, and task coordination. Developing a production-ready Scrapy spider takes about ten minutes, or up to an hour for complicated requirements. There are other good frameworks as well, such as PySpider and Colly. I call these Framework Spiders. Framework Spiders have unleashed productivity, and with some modification they are used by companies in production environments to crawl at large scale.
However, for those who need to crawl hundreds of sites, Framework Spiders may not be enough, and spider development turns into sheer manual labor. For example, if developing one Framework Spider takes 20 minutes on average and a full-time spider developer works 8 hours a day, covering 1,000 sites would require 20,000 minutes, i.e. about 333 hours, or 42 workdays. We could instead employ 10 full-time spider developers, but even then it would take about 4 workdays to complete (as in the figure below).
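The arithmetic behind those numbers:

```python
# Back-of-the-envelope estimate from the paragraph above.
sites = 1000
minutes_per_spider = 20
hours_per_workday = 8

total_minutes = sites * minutes_per_spider       # 20,000 minutes
total_hours = total_minutes / 60                 # about 333 hours
workdays = total_hours / hours_per_workday       # about 42 workdays
workdays_with_10_devs = workdays / 10            # about 4 workdays
```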
This is still very inefficient. Configurable Spiders were created to overcome this efficiency problem.
Introduction of Configurable Spiders
A Configurable Spider, as the name suggests, is a spider that crawls data according to configured crawling rules. It is a highly abstracted spider program: developers don't write any spider code. Instead, they configure site URLs, data fields, and data attributes in config files or a database, which a dedicated spider program reads in order to crawl the web data accordingly. Configurable Spiders abstract spider code into configuration, which streamlines the development process: a developer only needs to write the corresponding config to complete a spider. This makes Configurable Spiders well suited to writing spiders at large scale (as in the figure below).
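As a toy sketch of the idea (this is not Crawlab's implementation), the snippet below keeps the extraction rules in a plain dict and feeds them to a generic parser built on the standard library's `html.parser`; swapping the dict re-targets the same code at a different site.

```python
from html.parser import HTMLParser

# Hypothetical crawling rules, mirroring the spirit of a config file:
# each <article> is a list item; inside it, <h2> holds the title text
# and <a href> holds the link.
RULES = {"item_tag": "article", "title_tag": "h2", "link_tag": "a"}

class ConfigDrivenExtractor(HTMLParser):
    """Toy config-driven list-page extractor (illustration only)."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.items = []
        self._current = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == self.rules["item_tag"]:
            self._current = {"title": "", "url": ""}
        elif self._current is not None:
            if tag == self.rules["link_tag"]:
                self._current["url"] = dict(attrs).get("href", "")
            elif tag == self.rules["title_tag"]:
                self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._current["title"] += data.strip()

    def handle_endtag(self, tag):
        if tag == self.rules["title_tag"]:
            self._in_title = False
        elif tag == self.rules["item_tag"] and self._current is not None:
            self.items.append(self._current)
            self._current = None

parser = ConfigDrivenExtractor(RULES)
parser.feed("""
<div class="list">
  <article><h2>Post one</h2><a href="/post/1">read</a></article>
  <article><h2>Post two</h2><a href="/post/2">read</a></article>
</div>
""")
# parser.items now holds one dict per list item.
```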
This approach makes crawling hundreds of sites feasible: a proficient spider configurer can configure spiders for as many as 1,000 news sites. This matters greatly to companies that require public-sentiment monitoring, because Configurable Spiders increase productivity, reduce unit labor cost, and improve development efficiency, all of which benefit sentiment analysis and AI product development. Many companies are developing their own Configurable Spiders (the names differ, but they are essentially the same thing) and employing spider configurers who focus solely on configuring spiders.
There are not many open-source Configurable Spider frameworks on the market. An earlier one is Gerapy, developed by the spider guru Germy Cui, who works at Microsoft. It is a crawler admin platform that can generate Scrapy project files from config rules. A newer Configurable Spider framework is Crawlab, which is primarily a highly flexible spider admin platform; Crawlab released its Configurable Spider in v0.4.0. There is also Ferret, a Golang-based framework, which is quite interesting in that it makes developing spiders as easy as writing SQL statements. There are some commercial products as well, but according to user feedback they do not seem mature enough for production use.
Configurable Spiders emerged mainly because most web crawling follows simple patterns: essentially a combination of list pages and detail pages (as in the figure below), or list pages alone. There are certainly more complicated general-purpose spiders, but those too can be implemented through rule configuration.
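The list-plus-detail pattern boils down to a two-stage crawl. In the sketch below, `PAGES` is fake data standing in for real HTTP responses:

```python
# Fake pages standing in for HTTP responses; keys are URLs.
PAGES = {
    "/list": {"items": [{"title": "A", "url": "/a"},
                        {"title": "B", "url": "/b"}]},
    "/a": {"content": "article A"},
    "/b": {"content": "article B"},
}

def fetch(url):
    """Stand-in for an HTTP client."""
    return PAGES[url]

def crawl_site(list_url):
    records = []
    for item in fetch(list_url)["items"]:    # stage 1: the list page
        detail = fetch(item["url"])          # stage 2: each detail page
        records.append({**item, "content": detail["content"]})
    return records
```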
Crawlab Configurable Spiders
What we are introducing today is the Configurable Spider in Crawlab. The author briefly introduced Crawlab's main functionality in an earlier article, but the Configurable Spider had not yet been developed at that time. This article focuses on the practice of Crawlab Configurable Spiders. If you are not familiar with them, please refer to the documentation (Chinese).
Configurable Spiders Practice
All example spiders below were configured by the author through the Configurable Spider on the Crawlab Official Demo Platform, covering domains including news, finance, auto, books, video, search engines, and developer communities. We will walk through a few of them. All examples are available on the Official Demo Platform, and you can sign up to check them out.
Baidu (search “Crawlab”)
URL: http://crawlab.cn/demo#/spiders/5e27d055b8f9c90019f42a83
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: http://www.baidu.com/s?wd=crawlab
start_stage: list
stages:
- name: list
  is_list: true
  list_css: ""
  list_xpath: //*[contains(@class, "c-container")]
  page_css: ""
  page_xpath: //*[@id="page"]//a[@class="n"][last()]
  page_attr: href
  fields:
  - name: title
    css: ""
    xpath: .//h3/a
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: ""
    xpath: .//h3/a
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: ""
    xpath: .//*[@class="c-abstract"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
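As a rough mental model (not Crawlab's actual implementation), the field XPaths in the list stage above are applied to each matched list item. The snippet below replays them with the standard library's limited XPath support on a simplified, well-formed stand-in for one result block:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed stand-in for one Baidu result block.
item = ET.fromstring("""
<div class="c-container">
  <h3><a href="https://example.com/1">Some result title</a></h3>
  <div class="c-abstract">Some result abstract.</div>
</div>
""")

link = item.find(".//h3/a")       # xpath of the title and url fields
title = link.text                 # title: the element's text
url = link.get("href")            # url: attr "href"
abstract = item.find(".//*[@class='c-abstract']").text
```

Crawlab itself evaluates these XPaths through Scrapy's selectors, which accept the full syntax, including the `contains(@class, ...)` form used in `list_xpath`, which ElementTree cannot handle.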
Results
SegmentFault (Newest Articles)
URL: http://crawlab.cn/demo#/spiders/5e27d116b8f9c90019f42a87
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://segmentfault.com/newest
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .news-list > .news-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: h4.news__item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .news-img
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: abstract
    css: .article-excerpt
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Results
Amazon China (search “phone”)
URL: http://crawlab.cn/demo#/spiders/5e27e157b8f9c90019f42afb
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://www.amazon.cn/s?k=%E6%89%8B%E6%9C%BA&__mk_zh_CN=%E4%BA%9A%E9%A9%AC%E9%80%8A%E7%BD%91%E7%AB%99&ref=nb_sb_noss_2
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .s-result-item
  list_xpath: ""
  page_css: .a-last > a
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: span.a-text-normal
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: .a-link-normal
    xpath: ""
    attr: href
    next_stage: ""
    remark: ""
  - name: price
    css: ""
    xpath: .//*[@class="a-price-whole"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: price_fraction
    css: ""
    xpath: .//*[@class="a-price-fraction"]
    attr: ""
    next_stage: ""
    remark: ""
  - name: img
    css: .s-image-square-aspect > img
    xpath: ""
    attr: src
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Results
V2ex
URL: http://crawlab.cn/demo#/spiders/5e27dd67b8f9c90019f42ad9
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://v2ex.com/
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .cell.item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: href
  fields:
  - name: title
    css: a.topic-link
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: a.topic-link
    xpath: ""
    attr: href
    next_stage: detail
    remark: ""
  - name: replies
    css: .count_livid
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="markdown_body"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  AUTOTHROTTLE_ENABLED: "true"
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36
Results
36kr
URL: http://crawlab.cn/demo#/spiders/5e27ec82b8f9c90019f42b59
Configuration
Spiderfile
version: 0.4.4
engine: scrapy
start_url: https://36kr.com/information/web_news
start_stage: list
stages:
- name: list
  is_list: true
  list_css: .kr-flow-article-item
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: title
    css: .article-item-title
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: url
    css: body
    xpath: ""
    attr: href
    next_stage: detail
    remark: ""
  - name: abstract
    css: body
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: author
    css: .kr-flow-bar-author
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
  - name: time
    css: .kr-flow-bar-time
    xpath: ""
    attr: ""
    next_stage: ""
    remark: ""
- name: detail
  is_list: false
  list_css: ""
  list_xpath: ""
  page_css: ""
  page_xpath: ""
  page_attr: ""
  fields:
  - name: content
    css: ""
    xpath: .//*[@class="common-width content articleDetailContent kr-rich-text-wrapper"]
    attr: ""
    next_stage: ""
    remark: ""
settings:
  ROBOTSTXT_OBEY: "false"
  USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36
Results
Summary
The Crawlab Configurable Spider is very convenient, allowing developers to quickly configure the spiders they need. Configuring the 11 spiders above took less than 40 minutes (including anti-crawl debugging time), and some of the easy ones took only 1–2 minutes. No code was written; all the work was done on the interface. Furthermore, Crawlab Configurable Spiders can be configured not only on the web interface but also in a Spiderfile in YAML format; in fact, every config can be mapped to a Spiderfile. Since the Crawlab Configurable Spider is based on Scrapy, it supports most of Scrapy's features, and you can extend spider settings such as USER_AGENT and ROBOTSTXT_OBEY through the settings section.
Why should we choose the Crawlab Configurable Spider as the primary option? Because it is not only configurable but also shares all the core features of Crawlab, such as task coordination, cron jobs, log management, and notifications. Looking ahead, the Crawlab dev team plans to improve Configurable Spiders to support more features, such as dynamic content, more engines, and a CrawlSpider equivalent.
Reference
- Github: https://github.com/crawlab-team/crawlab
- Demo: http://crawlab.cn/demo
- Documentation: http://docs.crawlab.cn/
If you find Crawlab helpful in your daily development or at your company, please add the author's WeChat account “tikazyq1” with the note “Crawlab”, and the author will add you to the discussion group. You are welcome to star the project on GitHub. If you have any questions, feel free to submit issues on GitHub; contributions are also welcome.
Note: the examples in the figures are displayed in Chinese, but you can still go to the Demo (English available) to view them in detail. Please feel free to contact me on GitHub ;)