[Crawler Specialization] A First Look at Scrapy: Crawling 150,000 Online Medical Consultation Q&As as an Example

Nero Un Chi Hin 阮智軒

36 min read · Oct 28, 2023

(Updated 2024-02-27) HuggingFace dataset added

Preface / Why?

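A course assignment in graduate school required everyone to crawl more than 10,000,000 web pages, and my earlier experience with bs4 + selenium was not enough for that. After some research I found Scrapy, an open-source crawling framework.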

Article outline

What is Scrapy
Scrapy's features
Main text

What is it?

Scrapy

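Scrapy is a crawling framework for Python with 46,481 stars on GitHub. It offers modules for everything from data mining and monitoring to automated testing. Compared with other common scraping tools such as BeautifulSoup and Selenium, Scrapy is a complete framework that produces good results without many extra packages (in front-end terms, it is like the difference between Vue and Angular XD). The tools can still be combined for even better results.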

Scrapy's features

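Because Scrapy is designed for large-scale crawling, it is asynchronous, which means it can handle a large number of Requests at the same time. You can also enable AutoThrottle so that the crawl does not run so fast that its abnormal traffic gets it blocked.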
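
As a rough illustration of the AutoThrottle point, the fragment below shows how the extension might be switched on in settings.py. The setting names are Scrapy's own; the numeric values are illustrative assumptions, not tuned recommendations from this project.

```python
# settings.py (sketch): enable the AutoThrottle extension.
# The numeric values below are assumptions chosen for illustration only.
AUTOTHROTTLE_ENABLED = True            # turn the extension on
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound on the delay when latency is high
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average number of parallel requests per remote site
AUTOTHROTTLE_DEBUG = False             # set to True to log each throttling decision
```

With these set, Scrapy adjusts the delay between requests based on how quickly the target site responds, instead of using one fixed delay.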

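Next, the different Scrapy components:

Scrapy Engine: the core module of Scrapy; it controls how Requests and data are passed between the components.
Spider: the crawler code; this is where you define what to crawl.
Scheduler: queues the Requests issued by the Spider and decides the order in which they run.
Item Pipeline: processes and stores the data scraped by the Spider.
Downloader: downloads the HTML of the target site for each Request that the Spider issued and the Scheduler queued.
Middleware: hooks that run custom logic as Requests and Responses pass between the Engine and the Downloader or Spider.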

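How does it work?

After the target site is defined in the Spider, the Spider first sends a Request to the Engine.
The Engine receives the Request and passes it to the Scheduler to be queued.
The Scheduler schedules the task and sends the next Request back to the Engine.
The Engine sends the Request to the Downloader to download the site's HTML.
Once the download succeeds, the Downloader returns a Response to the Engine.
The Engine passes the Response to the Spider for parsing.
The Spider extracts data from the HTML in the Response and sends it to the Item Pipeline for processing.
The Item Pipeline processes the extracted items and stores the data.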
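
The flow above can be sketched as a toy, synchronous loop in plain Python. This is not how Scrapy is implemented (the real engine is asynchronous); every name here (FAKE_WEB, downloader, spider_parse, item_pipeline, engine) is hypothetical and only mirrors the roles of the components.

```python
from collections import deque

# Stand-in for the network: URL -> HTML (hypothetical pages).
FAKE_WEB = {
    "http://example.com/a": "<title>Page A</title>",
    "http://example.com/b": "<title>Page B</title>",
}

def downloader(url):
    """Downloader: fetch the page and return a response-like dict."""
    return {"url": url, "body": FAKE_WEB[url]}

def spider_parse(response):
    """Spider: turn a response into an item."""
    title = response["body"].replace("<title>", "").replace("</title>", "")
    return {"url": response["url"], "title": title}

def item_pipeline(item, storage):
    """Item Pipeline: post-process the item, then store it."""
    item["title"] = item["title"].strip().upper()
    storage.append(item)

def engine(start_urls):
    """Engine: move requests and data between the other components."""
    scheduler = deque(start_urls)      # Scheduler: a FIFO queue of pending requests
    storage = []
    while scheduler:
        url = scheduler.popleft()      # Scheduler hands the next request to the Engine
        response = downloader(url)     # Engine asks the Downloader to fetch it
        item = spider_parse(response)  # Engine passes the response to the Spider
        item_pipeline(item, storage)   # Spider output goes to the Item Pipeline
    return storage

results = engine(["http://example.com/a", "http://example.com/b"])
print(results)
```

Each function stands in for one component, so the order of calls inside the while loop follows the same order as the steps listed above.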

Get Started

Before using Scrapy, you need to install it.

pip install scrapy

Next, create a new Scrapy project with the scrapy startproject command.

scrapy startproject ${your_project_name}

The command generates the project folder, whose files match the Scrapy components described above. The structure is:

├── ${your_project_name}
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

In settings.py you can set Scrapy parameters such as the user agent and the download delay.

USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
DOWNLOAD_DELAY = 0.5

Creating a new spider is also simple: add a .py file to the spiders folder. start_urls lists the targets to crawl, and the parse function uses a Selector and an Item to parse the page and extract data.

import scrapy
from scrapy.selector import Selector
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # wrap the response in a Selector so it can be queried with XPath
        selector = Selector(response)
        item = MyItem()
        item['title'] = selector.xpath('//title/text()').extract_first()

        # yield inside parse so the item is handed to the Item Pipeline
        yield item

Scrapy's built-in command does the same thing:

scrapy genspider ${spider_name} ${target_domain_address}

Finally, run the spider with the scrapy crawl command.

scrapy crawl ${your_spider_name}

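Example

This example crawls Taiwan e-Hospital (臺灣 e 院), an online medical consultation platform provided by the Ministry of Health and Welfare. Medical professionals answer questions online and guide patients to the right care, which saves patients the time spent figuring out referrals and avoids wasting medical resources on the wrong clinic visits. With this crawler we can collect the hospital's department information, the doctors' answers, and the related article information.

Crawl target

Our targets are the Q&A entries under each department. The links follow a fixed format, which makes this kind of site convenient to crawl.

All department links are available from the consultation department menu in the upper right corner. The F12 developer tools show that they share the same class, w3-col l4 m6 s6, so the following code extracts the department list: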

department_list = response.css('div.w3-col.l4.m6.s6 a::attr(href)').extract()

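Article links are similarly structured. Take this Q&A as an example: https://sp1.hso.mohw.gov.tw/doctor/All/ShowDetail.php?q_no=201839&SortBy=q_no&PageNo=1 Here q_no is the article number, and PageNo indicates that the article sits on the first page of its department listing. Our goal is to crawl the article link, article number, article title, article content, doctor's department, doctor's name, and doctor's reply.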
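
As a small sketch of that URL structure, the standard library can pull q_no and PageNo out of the example link. The variable names are only for illustration.

```python
from urllib.parse import urlparse, parse_qs

url = ("https://sp1.hso.mohw.gov.tw/doctor/All/ShowDetail.php"
       "?q_no=201839&SortBy=q_no&PageNo=1")

# parse_qs maps each query parameter to a list of its values
query = parse_qs(urlparse(url).query)
article_no = int(query["q_no"][0])
page_no = int(query["PageNo"][0])
print(article_no, page_no)
```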

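Crawl structure

Hands-on

We add a spider named tweh to the spiders folder. It is divided into the following parts:

tweh.py

Spider initialization and basic configuration:
The spider's name is set to tweh, and crawling is restricted to the domain sp1.hso.mohw.gov.tw.
start_urls is the Taiwan e-Hospital home page.
The __init__ method sets the starting page number to 1.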

import scrapy
from ..items import TaiwanEHospitalsItem
from tqdm import tqdm
from datetime import datetime


class TwehSpider(scrapy.Spider):

    name = 'tweh'
    allowed_domains = ['sp1.hso.mohw.gov.tw']

    start_urls = ['https://sp1.hso.mohw.gov.tw/doctor/Index1.php']

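    def __init__(self, *args, **kwargs):
        # let scrapy.Spider finish its own setup before adding our attribute
        super().__init__(*args, **kwargs)
        self.start_page_number = 1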

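Parsing the home page (the parse function):

Updates the count of crawled sites.
Writes the crawled count to the log when the callback runs at the top of an hour.
Uses a CSS Selector to extract the list of all departments from the home page, and generates a new Request for each department to fetch that department's history page.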

    def parse(self, response):

        self.crawler.stats.inc_value('scraped_count')
        
        # log the running total when parse happens to run at minute 0 of an hour
        if datetime.now().minute == 0:
            scraped_count = self.crawler.stats.get_value('scraped_count', 0)
            self.logger.info(f'Crawled {scraped_count} websites so far')


        # example: each department appears in the HTML as <div class="w3-col l4 m6 s6"><a href="/doctor/All/index.php?d_class=內科" target="_top">內 科</a></div>

        department_list = response.css(
            'div.w3-col.l4.m6.s6 a::attr(href)').extract()

        history_page_param = '/doctor/All/history.php?UrlClass='

        for department in tqdm(department_list):
            department_name = department.split('=')[1]

            # next I want to extract the history of the department, from /doctor/All/history.php?UrlClass= {{ department }}
            # yield scrapy.Request(url=response.urljoin(history_page + department_name), callback=self.parse_history)

            history_page_number = scrapy.Request(url=response.urljoin(
                history_page_param + department_name), callback=self.parse_history_page_option_value)

            yield history_page_number
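
The check datetime.now().minute == 0 in parse only fires if the callback happens to run during the first minute of an hour. One possible alternative, sketched here under the assumption that you want at most one log line per hour, is to remember when the last line was written. HourlyLogGate is a hypothetical helper, not part of Scrapy or of the original project.

```python
from datetime import datetime, timedelta

class HourlyLogGate:
    """Hypothetical helper: should_log() returns True at most once per interval."""

    def __init__(self, interval=timedelta(hours=1), start=None):
        self.interval = interval
        self.last_logged = start or datetime.now()

    def should_log(self, now=None):
        now = now or datetime.now()
        if now - self.last_logged >= self.interval:
            self.last_logged = now  # restart the interval from this log line
            return True
        return False

gate = HourlyLogGate(start=datetime(2023, 10, 28, 9, 30))
print(gate.should_log(datetime(2023, 10, 28, 10, 0)))   # 30 minutes elapsed
print(gate.should_log(datetime(2023, 10, 28, 10, 30)))  # 60 minutes elapsed
print(gate.should_log(datetime(2023, 10, 28, 10, 45)))  # 15 minutes since the last log
```

In a spider, such a gate could be created in __init__ and consulted in every callback, so the log no longer depends on which minute a callback runs in.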

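Parsing the history page option values (the parse_history_page_option_value function):

Extracts all page numbers from the history page.
If no page number is found, requests the current page again and parses its articles.
If page numbers are found, generates a new Request for every page from page 1 to the last page and parses the articles on each one.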

    def parse_history_page_option_value(self, response):

        # example XPath: //*[@id="PageNo"]/option[number]

        history_page_number = response.css(
            '#PageNo option::attr(value)').extract()

        if len(history_page_number) == 0:
            # same URL as this response, so bypass the duplicate filter or the request is dropped
            yield scrapy.Request(url=response.url, callback=self.parse_history_article, dont_filter=True)
        else:
            # start from 1 to the last page
            last_history_page_number = history_page_number[-1]

            for page_number in range(self.start_page_number, int(last_history_page_number) + 1):
                yield scrapy.Request(url=response.urljoin(response.url + '&SortBy=q_no&PageNo=' + str(page_number)), callback=self.parse_history_article)
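
The paging logic can be tried without Scrapy. The sketch below builds the same listing URLs from a list of option values; build_page_urls is a hypothetical helper and the option values are made-up sample data.

```python
def build_page_urls(base_url, option_values, start_page=1):
    """Return one listing URL per page, from start_page to the last option value."""
    if not option_values:  # no <option> found: only the current page exists
        return [base_url]
    last_page = int(option_values[-1])
    return [
        f"{base_url}&SortBy=q_no&PageNo={page}"
        for page in range(start_page, last_page + 1)
    ]

base = "https://sp1.hso.mohw.gov.tw/doctor/All/history.php?UrlClass=%E5%85%A7%E7%A7%91"
urls = build_page_urls(base, ["1", "2", "3"])
print(urls[-1])
```

Raising start_page would be one way to resume a crawl partway through a department, which is presumably why the spider keeps start_page_number as an attribute.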

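Parsing the history articles (the parse_history_article function):

This method creates a TaiwanEHospitalsItem to hold the data.
It then extracts the URLs of all articles on each history page and generates a new Request for each article to fetch its details.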

    def parse_history_article(self, response):

        # this listing page links to articles such as https://sp1.hso.mohw.gov.tw/doctor/All/ShowDetail.php?q_no=193985&SortBy=q_no&PageNo=1
        # the xpath is //*[@id="sidebar-content"]/div/div/div[1]/form/table/tbody/tr[1]/td[7]/a
        
        article_list = response.css('form table tbody tr td a::attr(href)').extract()

        # e.g. https://sp1.hso.mohw.gov.tw/doctor/All/history.php?UrlClass=%E9%AB%94%E9%81%A9%E8%83%BD&SortBy=q_no&PageNo=13, department is %E9%AB%94%E9%81%A9%E8%83%BD
        article_department = response.url.split('=')[1].split('&')[0]

        for article in article_list:

            # create a fresh item per article so requests do not share one mutable item
            items = TaiwanEHospitalsItem()
            items['article_department'] = article_department

            yield scrapy.Request(url=response.urljoin(article), callback=self.parse_article, meta={'item': items})
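
The department is taken from the URL by splitting on '=' and '&'. A slightly more defensive variant, sketched here with the standard library, reads the UrlClass parameter by name. Unlike the spider, which leaves decoding to the pipeline, parse_qs also percent-decodes the value. department_from_url is a hypothetical helper.

```python
from urllib.parse import urlparse, parse_qs

def department_from_url(url):
    """Read the UrlClass query parameter; parse_qs also percent-decodes it."""
    return parse_qs(urlparse(url).query)["UrlClass"][0]

url = ("https://sp1.hso.mohw.gov.tw/doctor/All/history.php"
       "?UrlClass=%E9%AB%94%E9%81%A9%E8%83%BD&SortBy=q_no&PageNo=13")
print(department_from_url(url))
```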

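Parsing an article (the parse_article function):

Gets the item from the Request.
Extracts the article link, article number, article title, article content, doctor's department, doctor's name, and doctor's reply, and saves them in the item.
Finally, yields the item so the data can be saved locally or sent to an online database.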

    def parse_article(self, response):

        # this page holds the article details, e.g. https://sp1.hso.mohw.gov.tw/doctor/All/ShowDetail.php?q_no=193985&SortBy=q_no&PageNo=1

        items = response.meta['item']
        
        items['crawl_time'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        
        items['article_url'] = response.url

        # article_no xpath like /html/body/div/div/div/div/div[2]/div[1]/div[1]/h2
        items['article_no'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[1]/div[1]/h2/text()').extract()

        # article_name xpath like /html/body/div/div/div/div/div[2]/div[1]/div[2]/h2
        items['article_name'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[1]/div[2]/h2/text()').extract()

        # article_doctor xpath like /html/body/div/div/div/div/div[2]/div[3]/div[1]/div[2]/div
        items['article_doctor'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[3]/div[1]/div[2]/div/text()').extract()

        # article_content xpath like /html/body/div/div/div/div/div[2]/div[2]/div[2]/div
        items['article_content'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[2]/div[2]/div/text()').extract()

        # article_answer xpath like /html/body/div/div/div/div/div[2]/div[3]/div[2]/div
        items['article_answer'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[3]/div[2]/div/text()').extract()

        yield items

Complete Spider code

import scrapy
from ..items import TaiwanEHospitalsItem
from tqdm import tqdm
from datetime import datetime


class TwehSpider(scrapy.Spider):

    name = 'tweh'
    allowed_domains = ['sp1.hso.mohw.gov.tw']

    start_urls = ['https://sp1.hso.mohw.gov.tw/doctor/Index1.php']

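    def __init__(self, *args, **kwargs):
        # let scrapy.Spider finish its own setup before adding our attribute
        super().__init__(*args, **kwargs)
        self.start_page_number = 1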

    def parse(self, response):

        self.crawler.stats.inc_value('scraped_count')
        
        # log the running total when parse happens to run at minute 0 of an hour
        if datetime.now().minute == 0:
            scraped_count = self.crawler.stats.get_value('scraped_count', 0)
            self.logger.info(f'Crawled {scraped_count} websites so far')


        # example: each department appears in the HTML as <div class="w3-col l4 m6 s6"><a href="/doctor/All/index.php?d_class=內科" target="_top">內 科</a></div>

        department_list = response.css(
            'div.w3-col.l4.m6.s6 a::attr(href)').extract()

        history_page_param = '/doctor/All/history.php?UrlClass='

        for department in tqdm(department_list):
            department_name = department.split('=')[1]

            # next I want to extract the history of the department, from /doctor/All/history.php?UrlClass= {{ department }}
            # yield scrapy.Request(url=response.urljoin(history_page + department_name), callback=self.parse_history)

            history_page_number = scrapy.Request(url=response.urljoin(
                history_page_param + department_name), callback=self.parse_history_page_option_value)

            yield history_page_number

    def parse_history_page_option_value(self, response):

        # example XPath: //*[@id="PageNo"]/option[number]

        history_page_number = response.css(
            '#PageNo option::attr(value)').extract()

        if len(history_page_number) == 0:
            # same URL as this response, so bypass the duplicate filter or the request is dropped
            yield scrapy.Request(url=response.url, callback=self.parse_history_article, dont_filter=True)
        else:
            # start from 1 to the last page
            last_history_page_number = history_page_number[-1]

            for page_number in range(self.start_page_number, int(last_history_page_number) + 1):
                yield scrapy.Request(url=response.urljoin(response.url + '&SortBy=q_no&PageNo=' + str(page_number)), callback=self.parse_history_article)

    def parse_history_article(self, response):

        # this listing page links to articles such as https://sp1.hso.mohw.gov.tw/doctor/All/ShowDetail.php?q_no=193985&SortBy=q_no&PageNo=1
        # the xpath is //*[@id="sidebar-content"]/div/div/div[1]/form/table/tbody/tr[1]/td[7]/a
        article_list = response.css(
            'form table tbody tr td a::attr(href)').extract()

        # e.g. https://sp1.hso.mohw.gov.tw/doctor/All/history.php?UrlClass=%E9%AB%94%E9%81%A9%E8%83%BD&SortBy=q_no&PageNo=13, department is %E9%AB%94%E9%81%A9%E8%83%BD
        article_department = response.url.split('=')[1].split('&')[0]

        for article in article_list:

            # create a fresh item per article so requests do not share one mutable item
            items = TaiwanEHospitalsItem()
            items['article_department'] = article_department

            yield scrapy.Request(url=response.urljoin(article), callback=self.parse_article, meta={'item': items})

    def parse_article(self, response):

        # this page holds the article details, e.g. https://sp1.hso.mohw.gov.tw/doctor/All/ShowDetail.php?q_no=193985&SortBy=q_no&PageNo=1

        items = response.meta['item']
        
        items['crawl_time'] = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        
        items['article_url'] = response.url

        # article_no xpath like /html/body/div/div/div/div/div[2]/div[1]/div[1]/h2
        items['article_no'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[1]/div[1]/h2/text()').extract()

        # article_name xpath like /html/body/div/div/div/div/div[2]/div[1]/div[2]/h2
        items['article_name'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[1]/div[2]/h2/text()').extract()

        # article_doctor xpath like /html/body/div/div/div/div/div[2]/div[3]/div[1]/div[2]/div
        items['article_doctor'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[3]/div[1]/div[2]/div/text()').extract()

        # article_content xpath like /html/body/div/div/div/div/div[2]/div[2]/div[2]/div
        items['article_content'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[2]/div[2]/div/text()').extract()

        # article_answer xpath like /html/body/div/div/div/div/div[2]/div[3]/div[2]/div
        items['article_answer'] = response.xpath(
            '/html/body/div/div/div/div/div[2]/div[3]/div[2]/div/text()').extract()

        yield items

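pipelines.py

With the spider in place, the next step is to define how the content is processed. In Scrapy, pipelines.py corresponds to the Item Pipeline in the architecture diagram and handles the data that the crawler brings back.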

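DuplicateUrlPipeline:

Creates a set, urls_seen, to record URLs already seen.
If article_url is already in the set, the item is dropped; otherwise article_url is added to the set.

SkipItemPipeline:

If the crawled url ends with .pdf, .doc, or .docx, the item for that URL is dropped.

SkipEmailPipeline:

If the crawled url starts with mailto:, the item for that URL is dropped.

TaiwanEHospitalsPipeline:

Defines a clean_text function that cleans text by removing unwanted characters.
Cleans the contents of article_answer, article_content, and article_name.
Extracts the doctor's name from the article_doctor field.
URL-decodes the value of the article_department field.
Removes the # character from the article_no field.
Hashes the value of the article_url field with MD5.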
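
Several of these steps can be tried on their own with the standard library. The sketch below combines text cleaning, URL decoding, MD5 hashing, and duplicate detection in a plain function. The process function and the sample record are hypothetical and only mirror the steps described above, not the actual pipeline classes.

```python
from hashlib import md5
from urllib.parse import unquote

def clean_text(parts):
    """Join extracted text nodes and drop whitespace-like characters."""
    text = "".join(parts)
    for junk in ("\r", "\n", "\t", " ", "\u3000", "\xa0", "\\"):
        text = text.replace(junk, "")
    return text

seen_hashes = set()

def process(raw):
    """Clean one raw record; return None when its URL was already seen."""
    url_hash = md5(raw["article_url"].encode("utf-8")).hexdigest()
    if url_hash in seen_hashes:
        return None
    seen_hashes.add(url_hash)
    return {
        "article_url": url_hash,
        "article_no": "".join(raw["article_no"]).replace("#", ""),
        "article_department": unquote(raw["article_department"]),
        "article_content": clean_text(raw["article_content"]),
    }

# Made-up sample record shaped like the spider's output.
raw = {
    "article_url": "https://sp1.hso.mohw.gov.tw/doctor/All/ShowDetail.php?q_no=201839",
    "article_no": ["#201839"],
    "article_department": "%E4%B8%AD%E9%86%AB%E7%A7%91",
    "article_content": ["\r\n  Hello ", "\u3000world\t"],
}
first = process(raw)
second = process(raw)  # same URL again, so it is treated as a duplicate
print(first["article_no"], first["article_department"], first["article_content"])
print(second)
```

Note that clean_text removes ordinary spaces as well, which suits Chinese text but would glue English words together.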

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem
from urllib.parse import unquote
import re
from .items import TaiwanEHospitalsItem
from hashlib import md5



class DuplicateUrlPipeline:
    def __init__(self):
        self.urls_seen = set()

    def process_item(self, item, spider):
        if item['article_url'] in self.urls_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.urls_seen.add(item['article_url'])
            return item


class SkipItemPipeline:
    '''
    Skip item if the url ends with .pdf, .doc, or .docx
    '''

    def process_item(self, item, spider):
        # TaiwanEHospitalsItem has no 'url' field, so read it defensively
        url = ItemAdapter(item).get('url') or ''
        if url.endswith(('.pdf', '.doc', '.docx')):
            raise DropItem("Skip item found: %s" % item)
        return item


class SkipEmailPipeline:
    '''
    Skip item if the url starts with mailto:
    '''

    def process_item(self, item, spider):
        # TaiwanEHospitalsItem has no 'url' field, so read it defensively
        url = ItemAdapter(item).get('url') or ''
        if url.startswith('mailto:'):
            raise DropItem("Skip item found: %s" % item)
        return item


class TaiwanEHospitalsPipeline:

    def process_item(self, item, spider):

        def clean_text(text):
            cleaned_text = ''.join(text)
            cleaned_text = cleaned_text.replace('\r', '').replace('\n', '').replace(
                ' ', '').replace('\\', '').replace('\u3000', '').replace('\xa0', '').replace('\t', '')
            return cleaned_text

        # Clean article_answer, article_content and article_name
        # (clean_text joins a list of text nodes and also accepts a plain string)
        for field in ('article_answer', 'article_content', 'article_name'):
            item[field] = clean_text(item[field])

        # extract the doctor name from article_doctor
        item['article_doctor'] = ''.join(item['article_doctor'])
        # keep the joined text unchanged when the pattern does not match
        matches = re.findall(r'／([^，]+),', item['article_doctor'])
        if matches:
            item['article_doctor'] = matches[0]

            if '／' in item['article_doctor']:
                item['article_doctor'] = item['article_doctor'].split('／')[0]

        # decode article_department, example %E4%B8%AD%E9%86%AB%E7%A7%91 => 中醫科
        item['article_department'] = unquote(item['article_department'])

        # remove the article_no #, example #123456 => 123456, the article_no type is list
        item['article_no'] = ''.join(item['article_no']).replace('#', '')

        # hash the url with md5 to get a fixed-length identifier
        item['article_url'] = md5(
            item['article_url'].encode('utf-8')).hexdigest()

        return item

items.py

items.py corresponds to the item in the architecture diagram's Item Pipeline. It defines the fields that hold the scraped data.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class TaiwanEHospitalsItem(scrapy.Item):
    # fields filled in by the tweh spider
    crawl_time = scrapy.Field()
    article_department = scrapy.Field()
    article_doctor = scrapy.Field()
    article_url = scrapy.Field()
    article_no = scrapy.Field()
    article_name = scrapy.Field()
    article_content = scrapy.Field()
    article_answer = scrapy.Field()

settings.py

Finally, the settings. The classes in pipelines.py only take effect after they are added to ITEM_PIPELINES.

# Scrapy settings for millions_crawler project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import datetime

from dotenv import load_dotenv  # provided by the python-dotenv package

# load the .env file
load_dotenv()

BOT_NAME = "millions_crawler"

# SCHEDULER = "scrapy_redis.scheduler.Scheduler"

SPIDER_MODULES = ["millions_crawler.spiders"]


# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 100


# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   "millions_crawler.pipelines.TaiwanEHospitalsPipeline": 300,
   "millions_crawler.pipelines.DuplicateUrlPipeline": 350,
   "millions_crawler.pipelines.SkipItemPipeline": 350,
   "millions_crawler.pipelines.SkipEmailPipeline": 600,

}


# Set settings whose default value is deprecated to a future-proof value
REQUEST_FINGERPRINTER_IMPLEMENTATION = "2.7"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
FEED_EXPORT_ENCODING = "utf-8"
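
The numbers in ITEM_PIPELINES are priorities: Scrapy passes each item through the pipelines from the lowest value to the highest, and values are conventionally kept in the 0 to 1000 range. A quick way to see the resulting order for the configuration above is to sort the dictionary by value. The run_order variable is only for illustration.

```python
ITEM_PIPELINES = {
    "millions_crawler.pipelines.TaiwanEHospitalsPipeline": 300,
    "millions_crawler.pipelines.DuplicateUrlPipeline": 350,
    "millions_crawler.pipelines.SkipItemPipeline": 350,
    "millions_crawler.pipelines.SkipEmailPipeline": 600,
}

# Sort by priority; lower numbers run first.
run_order = [
    path.rsplit(".", 1)[1]
    for path, _ in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]
print(run_order)
```

Two pipelines share the value 350 here, so it is safer not to rely on their relative order. Since TaiwanEHospitalsPipeline runs first, DuplicateUrlPipeline compares the MD5-hashed article_url rather than the raw URL.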

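Results

In the end we crawled 150,000 articles in 1.3 hours. I later stored the crawled data in MongoDB; see this repo for details.

(Updated 2024-02-27)

I kept forgetting to update XD I eventually put the dataset on HuggingFace, feel free to use it.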

NeroUCH/online-health-chating · Datasets at Hugging Face

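Summary

This is a technical note that I put off for more than half a year XD

As a graduate student in an ultrasound lab, I am certainly less specialized in NLP or crawling than labs that focus on those topics. That is why, in the second semester of my second master's year, I took a one-semester seminar course in another lab covering Web Mining, NLP, and related techniques.

The instructor, Professor 盧文祥, is an NLP / Web Mining / ChatBot expert at the NCKU Department of Computer Science and Information Engineering (成大資工). Their lab is also on the 8th floor of the department, so they are practically neighbors XD (the ultrasound lab is on the 8th floor too).

Before Scrapy I only used BeautifulSoup and Requests plus other tools to crawl data, and frankly both the coding style and the reusability were poor. Because the data was planned for later LLM training, I would be crawling several online medical consultation sites at once, so I wanted to try a new framework and see whether it was more efficient. It proved itself: crawling several different sites later yielded 827,798 records in under a day, with under three days of total development time. It is really useful and easy to pick up!

Although the assignment goals sounded scary every time, such as crawling millions of sites or building a search engine by hand, this course gave me the chance to learn much more about web mining and related topics.

I am also sharing the note map from my class notes (and a plug for Heptabase, a knowledge management app built by a Taiwanese team; I have used it from graduate coursework and research through to my current consulting work analyzing problems. Highly recommended!)

Closing murmur

Thank you for reading this far. If the article helped you, please give it a clap!
Feel free to reach out and share your thoughts and ideas.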
LinkedIn: https://www.linkedin.com/in/nerouch/
GitHub: https://github.com/NeroHin