[Python Web Scraping] Installing and Getting Started with Scrapy, Part 3

Sean Yeh

Published in

Python Everywhere -from Beginner to Advanced

15 min read · Jun 17, 2022

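Scrapy is an open-source framework that defines a complete crawling workflow and set of modules. It lets us fetch a website's HTML pages quickly and easily, extract data from them, store that data, and parse it further.

In the previous part we wrote the advertimes.py file in the crawler project my_spider_1. The spider in that file scrapes a single target page and saves the results to CSV and JSON files.

Here we take the program a step further: starting from that single page, we follow each of its hyperlinks and scrape the inner pages they lead to.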

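From a single page to its linked inner pages

So far we have built a spider for a single target page. Next, we want the spider to follow the hyperlinks and scrape the content of the inner pages.

Absolute and relative paths

Before writing the crawling code, we need to understand how paths on a website work. A web path can be written in two ways: as an absolute path or as a relative path.

Scrapy supports both when we build the path to an inner page.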
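As a small illustration of the difference, the sketch below uses Python's standard-library urllib.parse to tell the two kinds of path apart. The helper name is_absolute and the sample paths are made up for this example; they are not taken from the Advertimes site.

```python
from urllib.parse import urlparse


def is_absolute(url: str) -> bool:
    """Return True when the URL carries both a scheme and a host."""
    parts = urlparse(url)
    return bool(parts.scheme and parts.netloc)


# Hypothetical sample values, for illustration only.
samples = [
    "https://www.advertimes.com/global/",  # absolute: scheme + host + path
    "/global/example-article/",            # relative to the site root
    "example-article/",                    # relative to the current page
    "www.advertimes.com/global/",          # no scheme, so NOT absolute
]

for url in samples:
    kind = "absolute" if is_absolute(url) else "relative"
    print(f"{url!r:45} -> {kind}")
```

Note the last sample: a domain without a scheme is treated as a relative path, which matters for the string-concatenation approach.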

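To focus on the path question, we simplify the code for now: we keep only the link variable, which holds the link to the inner page, and comment out the other variables.

Absolute paths

There are two ways to use an absolute path in Scrapy. The first is string concatenation: join the site's domain and the inner-page link into one absolute URL.

# f-string concatenation

As shown below, we use an f-string to join the site address https://www.advertimes.com with link and assign the result to the variable absolute_url. In code: absolute_url = f'https://www.advertimes.com{link}'. The scheme (https://) must be included, because scrapy.Request raises an error for a URL that has no scheme. This approach also assumes that link is a path starting with /.

After assigning the joined URL to absolute_url, we yield a request for it ( yield scrapy.Request(url=absolute_url) ).

import scrapy


class AdvertimesSpider(scrapy.Spider):
    name = 'advertimes'
    allowed_domains = ['www.advertimes.com']
    start_urls = ['https://www.advertimes.com/global/']

    def parse(self, response):
        articles = response.xpath('//ul[@class="article-list"]/li')
        for article in articles:
            # Link to the inner page
            link = article.xpath(".//a/@href").get()
            # Build the absolute URL; scrapy.Request requires the scheme
            absolute_url = f'https://www.advertimes.com{link}'

            yield scrapy.Request(url=absolute_url)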
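String concatenation only works when link has the shape we expect. The sketch below is one hedged way to guard against surprises. BASE_URL and to_absolute are hypothetical names, and the assumption that the site uses root-relative links (starting with /) is ours, not something confirmed by the page.

```python
from typing import Optional

BASE_URL = "https://www.advertimes.com"  # assumed: scheme + domain, no trailing slash


def to_absolute(link: Optional[str]) -> Optional[str]:
    """Build an absolute URL by concatenation, skipping links we cannot handle."""
    if not link:
        # .get() returns None when the XPath matches nothing.
        return None
    if link.startswith(("http://", "https://")):
        # Already absolute: concatenating would corrupt it.
        return link
    if link.startswith("/"):
        # Root-relative path: safe to append to the domain.
        return f"{BASE_URL}{link}"
    # Page-relative paths need the current page URL, which concatenation ignores.
    return None


print(to_absolute("/global/example/"))
print(to_absolute("https://www.advertimes.com/global/example/"))
print(to_absolute(None))
```

Returning None for the cases it cannot handle keeps the helper honest: the caller can skip those links instead of requesting a malformed URL.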

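For more on f-strings, see the following article:

Python's String Formatting Syntax: .format() and f-strings

Handling strings is an unavoidable part of programming, and Python has a variety of built-in tools for it. That article covers Python's string-formatting syntaxes.

medium.com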

# Using response.urljoin

The second way is to call the response.urljoin method to build the path and assign it to absolute_url:

import scrapy


class AdvertimesSpider(scrapy.Spider):
    name = 'advertimes'
    allowed_domains = ['www.advertimes.com']
    start_urls = ['https://www.advertimes.com/global/']

    def parse(self, response):
        articles = response.xpath('//ul[@class="article-list"]/li')
        for article in articles:
            # Link to the inner page
            link = article.xpath(".//a/@href").get()

            # Resolve the link against the URL of the current page
            absolute_url = response.urljoin(link)
            yield scrapy.Request(url=absolute_url)

As before, we then yield the request ( yield scrapy.Request(url=absolute_url) ).
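Roughly speaking, response.urljoin(link) resolves link against the URL of the page that was just downloaded, much as the standard library's urllib.parse.urljoin does. The standalone sketch below uses the standard-library function so that it runs without Scrapy; the link values are invented examples.

```python
from urllib.parse import urljoin

page_url = "https://www.advertimes.com/global/"  # the page being parsed

# Invented hrefs covering the common shapes a link can take.
links = [
    "/global/example/",                   # root-relative
    "example/",                           # relative to the current page
    "../other/",                          # parent directory
    "https://www.advertimes.com/about/",  # already absolute
]

resolved = [urljoin(page_url, link) for link in links]

for link, url in zip(links, resolved):
    print(f"{link!r:40} -> {url}")
```

Unlike plain concatenation, this handles every one of these shapes, which is why it is usually the safer of the two absolute-path approaches.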

Relative paths

To work with relative paths in Scrapy, use the response.follow() method. It handles crawling to linked pages and returns a Request object.

Because the first argument of response.follow() accepts a relative path, it is enough to yield response.follow(url=link) to reach the inner page. We no longer need response.urljoin, which we used above to build the absolute URL.

import scrapy


class AdvertimesSpider(scrapy.Spider):
    name = 'advertimes'
    allowed_domains = ['www.advertimes.com']
    start_urls = ['https://www.advertimes.com/global/']

    def parse(self, response):
        articles = response.xpath('//ul[@class="article-list"]/li')

        for article in articles:
            # Link to the inner page
            link = article.xpath(".//a/@href").get()

            # A relative URL is accepted as-is
            yield response.follow(url=link)

Compared with the absolute-path approaches, the relative-path approach is more concise, so we use it in the rest of this article.

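Crawling the linked inner pages

We want the spider to follow the <a> links on the page and scrape the content behind them. For that we need the second parameter of response.follow(): callback.

This parameter specifies which function parses the page data once the page referenced by the first argument (the <a> link) has been fetched.

Here we use a method named parse_page to parse the linked page. It scrapes the inner page and extracts the pieces of information we need.

yield response.follow(url=link, callback=self.parse_page)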
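The key detail is that self.parse_page is passed without parentheses: we hand over the method itself, and it is called later, once the linked page has been downloaded. The toy model below imitates that hand-off in plain Python. FakeRequest, FakeSpider and run are invented stand-ins and do not reflect how Scrapy is implemented internally.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List
from urllib.parse import urljoin


@dataclass
class FakeRequest:
    """Stand-in for a Scrapy Request: a URL plus the function to call later."""
    url: str
    callback: Callable[[str], Dict]


class FakeSpider:
    def parse(self, page_url: str, links: List[str]) -> Iterator[FakeRequest]:
        for link in links:
            # No parentheses after parse_page: we pass the method, we do not call it.
            yield FakeRequest(url=urljoin(page_url, link), callback=self.parse_page)

    def parse_page(self, url: str) -> Dict:
        # A real callback would receive the downloaded response instead of a URL.
        return {"parsed": url}


def run(spider: FakeSpider) -> List[Dict]:
    """Pretend engine: 'download' each request, then invoke its callback."""
    requests = spider.parse("https://www.advertimes.com/global/", ["/a/", "/b/"])
    return [request.callback(request.url) for request in requests]


print(run(FakeSpider()))
```

Writing callback=self.parse_page() by mistake would call the method immediately and pass its return value instead of the method.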

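Next, we write this method.

The parse_page method

Before writing parse_page, we need to look at the structure of the inner page that the <a> link leads to.

# Observation

Inspecting the page's HTML shows that the data we need is all inside a div element with the class entry-txt ( <div class="entry-txt"> ). Inside it are <div> elements with the class lead (two of them match, and we only need the first) and several <p> elements.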

<div class="entry-txt">
  <div class="lead">
    自動運転や….
  </div>
  …
  <p></p>
  …
  <p></p>
  <h2>メイン会場からタクシーの列が消えた</h2>
  <p></p>
  <p></p>
  <p></p>
  <p></p>
</div>

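# Converting to XPath and variables

As before, we can turn these observations into XPath expressions.

First, write the XPath for the div element that wraps the data we need, pass it to response.xpath(), and assign the result to the articles variable. The expression must select the element itself, without a trailing /text(), so that we can keep querying inside it.

articles = response.xpath('//div[@class="entry-txt"]')

Next, to get the text of each <p> paragraph in the inner page, convert it to XPath in the same way and assign the result to the paragraph variable. Note that this XPath is relative: it starts from article, an element of articles, which is why it begins with a dot.

paragraph = article.xpath('.//p/text()')

# Extracting values in a for loop

Then use a for loop to pull the data out of the inner page. In this example the paragraph part consists of many <p> elements, so we use .getall(). For the summary (lead) we only want the first class="lead" element, so we use .get():

for article in articles:
    lead = article.xpath('.//div[@class="lead"][1]/text()').get()
    paragraph = article.xpath('.//p/text()').getall()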
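To see the difference between taking the first match and taking all matches, the sketch below runs the same kind of query on a trimmed-down copy of the structure above. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without Scrapy: find() plays the role of .get() and findall() the role of .getall(). ElementTree supports only a subset of XPath (there is no text() step), and the sample markup and text are invented.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample that mirrors the shape of the inner page.
SAMPLE = """
<body>
  <div class="entry-txt">
    <div class="lead">Summary of the article.</div>
    <p>First paragraph.</p>
    <h2>Subheading</h2>
    <p>Second paragraph.</p>
    <div class="lead">A second lead we do not need.</div>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

for article in root.findall('.//div[@class="entry-txt"]'):
    # First match only, like .get(): one element or None.
    lead_node = article.find('.//div[@class="lead"]')
    lead = lead_node.text if lead_node is not None else None

    # Every match, like .getall(): always a list, possibly empty.
    paragraphs = [p.text for p in article.findall(".//p")]

    print(lead)
    print(paragraphs)
```

The second lead element is ignored because only the first match is taken, while both paragraphs end up in the list.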

Finally, yield the extracted lead and paragraph values:

yield {
    'lead': lead,
    'paragraph': paragraph
}

# Putting the code together

Combining the pieces above, the finished parse_page method looks like this:

def parse_page(self, response):
    articles = response.xpath('//div[@class="entry-txt"]')

    for article in articles:
        lead = article.xpath('.//div[@class="lead"][1]/text()').get()
        paragraph = article.xpath('.//p/text()').getall()

        yield {
            'lead': lead,
            'paragraph': paragraph
        }
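The values yielded here are raw text nodes: lead may be None or padded with whitespace, and paragraph is a list that can contain whitespace-only strings. If a single tidy text field is preferred, a post-processing step along these lines is one option. clean_item and the body key are hypothetical names introduced for this sketch, and the sample values are invented.

```python
from typing import Dict, List, Optional


def clean_item(lead: Optional[str], paragraph: List[str]) -> Dict[str, str]:
    """Strip whitespace, drop empty paragraphs, and join the rest into one string."""
    cleaned = [text.strip() for text in paragraph if text and text.strip()]
    return {
        "lead": (lead or "").strip(),
        "body": "\n".join(cleaned),
    }


raw_lead = "\n    Summary of the article.\n  "
raw_paragraph = ["First paragraph.", "   ", "\nSecond paragraph.\n"]

print(clean_item(raw_lead, raw_paragraph))
```

Whether to clean inside the spider or in a later step is a design choice; keeping the raw list, as the article does, preserves the paragraph boundaries.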

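# Crawl command and output

Run the crawl command and export the results to a JSON file.

$ scrapy crawl advertimes -o advertimes.json

The resulting JSON file contains the keys lead and paragraph, each with its corresponding value.

Checking the contents of the file confirms that the inner-page data has been written to it.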
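One quick way to check the export is to load it with the standard json module. The sketch below works on an inline sample so that it runs on its own; the sample values are invented, and summarize is a hypothetical helper. For the real file, open advertimes.json and read it with json.load instead.

```python
import json
from typing import Dict, List

# Invented sample in the same shape as the exported feed: a JSON array of items.
SAMPLE_FEED = """
[
  {"lead": "Summary one.", "paragraph": ["First.", "Second."]},
  {"lead": null, "paragraph": []}
]
"""


def summarize(items: List[Dict]) -> List[str]:
    """Describe each item: whether it has a lead and how many paragraphs it holds."""
    lines = []
    for index, item in enumerate(items, start=1):
        has_lead = "yes" if item.get("lead") else "no"
        count = len(item.get("paragraph") or [])
        lines.append(f"item {index}: lead={has_lead}, paragraphs={count}")
    return lines


items = json.loads(SAMPLE_FEED)
for line in summarize(items):
    print(line)
```

Items with no lead or no paragraphs are worth a second look, since they may point to inner pages whose structure differs from the one observed earlier.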

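Merging listing-page and inner-page data

Everything looks fine so far, but one problem remains. The JSON file contains the inner-page data (lead and paragraph) but not the title from the listing page that linked to it. The scraped records therefore have a body but no heading, and it is hard to tell what each one is about.

Next, we modify the program to bring the listing-page data in. The changes are as follows:

# In the parse method

<1> Add meta to response.follow

First, add a meta argument to the response.follow call after yield. The data we need is the title from the listing page, so we put title into meta ( meta={'title': title} ).

yield response.follow(url=link, callback=self.parse_page, meta={'title': title})

<2> Add the title variable to the method

The title variable holds the title shown on the page. This is code we wrote earlier for the single-page spider, so we only need to uncomment it (remove the # symbol).

def parse(self, response):
    articles = response.xpath('//ul[@class="article-list"]/li')
    for article in articles:
        # Link to the inner page
        link = article.xpath(".//a/@href").get()
        title = article.xpath('.//a/div[@class="article-list-txt"]/h3/text()').get()
        # content = article.xpath('.//a/div[@class="article-list-txt"]/p/text()').get()
        # update_date = article.xpath('.//a/div[@class="article-list-txt"]//span[@class="update-date"]/text()').get()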

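# In the parse_page method

<1> Add the title variable to the method

The parse_page method, which scrapes the inner page, also needs a title variable, and its value comes from meta. We read meta through response.request, using the key title.

title = response.request.meta['title']

<2> Add title to the yielded item

Finally, add title to the dictionary that parse_page yields ( 'title': title, ):

yield {
    'title': title,
    'lead': lead,
    'paragraph': paragraph
}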
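To make the data flow concrete, here is a toy model in plain Python of how a value attached in parse reaches parse_page. FakeRequest, FakeResponse and the sample data are invented for illustration and are not Scrapy's real classes; only the meta idea is borrowed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class FakeRequest:
    url: str
    callback: Callable
    meta: Dict = field(default_factory=dict)


@dataclass
class FakeResponse:
    request: FakeRequest  # the response keeps a reference to its request
    lead: str


def parse(listing: List[Dict]) -> Iterator[FakeRequest]:
    for entry in listing:
        # Attach the listing-page title to the request for the inner page.
        yield FakeRequest(entry["link"], parse_page, meta={"title": entry["title"]})


def parse_page(response: FakeResponse) -> Dict:
    # Read the title back from the request that produced this response.
    title = response.request.meta["title"]
    return {"title": title, "lead": response.lead}


listing = [{"link": "/a/", "title": "Article A"}, {"link": "/b/", "title": "Article B"}]
results = []
for request in parse(listing):
    # Pretend download: build a response that remembers its request.
    response = FakeResponse(request=request, lead=f"lead of {request.url}")
    results.append(request.callback(response))

print(results)
```

In Scrapy itself, response.meta is a shortcut for response.request.meta, and the Scrapy documentation recommends the cb_kwargs argument for passing your own data to a callback.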

# Putting the code together

Combining all the parts above gives the complete code:

import scrapy


class AdvertimesSpider(scrapy.Spider):
    name = 'advertimes02'
    allowed_domains = ['www.advertimes.com']
    start_urls = ['https://www.advertimes.com/global/']

    def parse(self, response):
        articles = response.xpath('//ul[@class="article-list"]/li')
        for article in articles:
            # Link to the inner page
            link = article.xpath(".//a/@href").get()
            title = article.xpath('.//a/div[@class="article-list-txt"]/h3/text()').get()
            # content = article.xpath('.//a/div[@class="article-list-txt"]/p/text()').get()
            # update_date = article.xpath('.//a/div[@class="article-list-txt"]//span[@class="update-date"]/text()').get()

            # Follow the relative URL and pass the title along in meta
            yield response.follow(url=link, callback=self.parse_page, meta={'title': title})

    def parse_page(self, response):
        # Title carried over from the listing page
        title = response.request.meta['title']
        articles = response.xpath('//div[@class="entry-txt"]')

        for article in articles:
            lead = article.xpath('.//div[@class="lead"][1]/text()').get()
            paragraph = article.xpath('.//p/text()').getall()

            yield {
                'title': title,
                'lead': lead,
                'paragraph': paragraph
            }

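# Crawl command and output

Run the crawl command again and export the results to JSON. The spider name passed to crawl must match the name attribute in the code above (advertimes02). Because -o appends to an existing file, which would leave the JSON invalid, delete the earlier advertimes.json before running the command.

$ scrapy crawl advertimes02 -o advertimes.json

This time the exported JSON file contains, in addition to lead and paragraph, the title that came from the listing page.

That covers how to scrape from a single page through to its linked inner pages with the Scrapy framework. Details such as how to crawl the next page of results and how to set the user-agent are left for the next part.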

[Python Web Scraping] Installing and Getting Started with Scrapy, Part 4

medium.com
