Hands-on: Scraping Web Article Data with the Nokogiri Gem in a Rails Project

Published in

Lynn’s dev blog

Jun 12, 2020

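I came across the Nokogiri gem while building a feature that scrapes article data from the web. At first I used a third-party service, Extractor API, to fetch the data.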

You send Extractor API a request and it returns JSON containing the article's fields, such as url, title, author, text, html, and clean_html. This greatly reduces the time spent parsing HTML DOM elements yourself.
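As a rough illustration of working with such a response, here is a minimal sketch that parses a JSON body with Ruby's standard `json` library. The field names come from the list above, but the sample values, the `SAMPLE_RESPONSE` constant, and the `extract_fields` helper are all made up for this example. It does not call the real service.

```ruby
require 'json'

# Hypothetical response body: field names are from the article,
# values are invented for illustration only.
SAMPLE_RESPONSE = <<~BODY
  {
    "url": "https://example.com/posts/1",
    "title": "Sample title",
    "author": "Jane Doe",
    "text": "Plain text of the article.",
    "html": "<html><body><p>Plain text of the article.</p></body></html>",
    "clean_html": "<p>Plain text of the article.</p>"
  }
BODY

# Parse the JSON and keep only the fields we care about.
def extract_fields(body)
  data = JSON.parse(body)
  {
    url: data['url'],
    title: data['title'],
    author: data['author'],
    clean_html: data['clean_html'] # may be nil for some sites
  }
end

article = extract_fields(SAMPLE_RESPONSE)
puts article[:title]
```

Because `JSON.parse` maps a JSON `null` to Ruby `nil`, a missing `clean_html` simply shows up as `nil` in the resulting hash.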

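It does have limitations, though. Some sites block requests from Extractor API, and on sites built with certain front-end frameworks the clean_html field cannot be extracted (the null in the screenshot).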

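Weighing the project schedule against the 80/20 rule: Extractor API currently returns data correctly for about 80% of the sites we sampled, so for the remaining 20% we decided to write custom parsing for specific sites.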
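One way to picture this split is a small dispatcher: use the API's clean_html when it is present, otherwise look up a site-specific parser by host name. The sketch below is only an assumption about how that could be wired up. `CUSTOM_PARSERS`, `article_body`, and the host `blog.example.com` are hypothetical, and the regex-based parser is a stand-in for a real Nokogiri-based one.

```ruby
require 'uri'

# Hypothetical registry: host name => custom parser.
# Each parser takes raw HTML and returns the article body HTML.
# The regex here is a placeholder for a real site-specific parser.
CUSTOM_PARSERS = {
  'blog.example.com' => ->(html) { html[%r{<article>(.*?)</article>}m, 1] }
}.freeze

# Prefer the API's clean_html; fall back to a site-specific parser
# when it is missing. Returns nil when neither source is available.
def article_body(url:, clean_html:, raw_html:)
  return clean_html unless clean_html.nil? || clean_html.strip.empty?

  parser = CUSTOM_PARSERS[URI.parse(url).host]
  parser&.call(raw_html)
end

raw = '<html><body><article><p>Hi</p></article></body></html>'
puts article_body(url: 'https://blog.example.com/a', clean_html: nil, raw_html: raw)
```

Keeping the parsers in a hash keyed by host means that supporting another site is a one-line addition rather than a new branch.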

Searching for the keywords "html string to html rails" turned up the Ruby gem Nokogiri as a way to parse the HTML.

Nokogiri Gem

parse HTML with Ruby

Step 1: Add the gem to your Gemfile and run bundle install

gem 'nokogiri', '~> 1.6', '>= 1.6.8'

Step 2: Require Nokogiri and open-uri in your project

require 'nokogiri'
require 'open-uri'

Step 3: Start extracting the DOM elements you need

Note: you can open the rails console and check directly whether the scraped data comes back in the format you want.

Using this article as an example: select DOM elements by specific attributes, then convert the selected nodes to a string.

require 'nokogiri'
require 'open-uri'

url = 'https://5xruby.tw/posts/amazing-tips-to-profiling-the-running-ruby-container/'

# Open the URL and parse the page with Nokogiri.
# URI.open is used because Kernel#open no longer accepts URLs on Ruby 3.0+.
page = Nokogiri::HTML(URI.open(url))
# Article title
title = page.xpath('//title').text
# og:image
ogimage_address = page.xpath('/html/head/meta[@property="og:image"]/@content').text
# Article summary (meta tag structure differs from site to site, so adjust per site)
short_description = page.xpath('/html/head/meta[@name="description"]/@content').text
# Article body as an HTML string
result = page.xpath("//div[@class='post-main-content mb-3 mb-md-5']").to_s
# If you only need plain text, use .text instead
result_text = page.xpath("//div[@class='post-main-content mb-3 mb-md-5']").text

With that, the parsed HTML can be saved to the database and handed off to the view for display!


Written by 涓 / Lynn Chang