Scraping Dcard and Plotting Daily Post Counts

Published in 自己的R筆記 (My Own R Notes), Nov 8, 2019

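Start by examining how the site structures its data. Open a Dcard forum with the browser's inspector: as you keep scrolling down the page, the Network tab (XHR filter) keeps loading requests named posts. Opening one shows a JSON page holding 30 posts per request. The first 30 posts of the forum are not among these requests, though; they start from posts 31 to 60. Three things stand out in the URL of such a JSON page: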

https://www.dcard.tw/_api/forums/relationship/posts?popular=false&limit=30&before=232421868

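popular=false or true: true sorts the posts by popularity rather than by time. To get all posts in chronological order, use popular=false.
limit=30: the number of posts returned per request. 30 is the default; I tried changing it to other numbers and the request still works.
before=232421868: this number is a post id, and it is exactly the id of the last post in the previous JSON page. So after fetching one page, you get older posts by setting before to the last id of the page you just fetched. The very first page is simply the URL with no id at the end: https://www.dcard.tw/_api/forums/relationship/posts?popular=false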
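
The URL pattern described above can be sketched as a small helper. Note that build_posts_url() is a hypothetical name introduced here for illustration; it is not part of the original code or of any Dcard client. It only assembles the query string and sends no request.

```r
# Hypothetical helper: assemble a Dcard posts URL from the three query
# parameters discussed above. It builds a string only; no request is sent.
build_posts_url <- function(forum, before = NULL, limit = 30, popular = FALSE) {
  url <- paste0(
    "https://www.dcard.tw/_api/forums/", forum, "/posts",
    "?popular=", tolower(as.character(popular)),
    "&limit=", limit
  )
  # `before` is left out for the first (newest) page
  if (!is.null(before)) {
    url <- paste0(url, "&before=", before)
  }
  url
}

first_page <- build_posts_url("relationship")
next_page  <- build_posts_url("relationship", before = 232421868)
print(first_page)
print(next_page)
```

Keeping the URL construction in one place makes it easy to change the forum name or the page size later without touching the scraping loop.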

With the structure understood, we can start writing the R code.

library(tidyverse)
library(httr)
library(jsonlite)
library(lubridate)
options(stringsAsFactors = FALSE)

The first step is to load the packages we will use (install them first if you have not already).

url = "https://www.dcard.tw/_api/forums/relationship/posts?popular=false"
# request the first page, read the body as text, parse the JSON, drop unneeded columns
res2 = GET(url) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON() %>%
  select(-anonymousSchool, -anonymousDepartment, -pinned, -meta, -mediaMeta, -layout, -withImages, -withVideos, -media, -reportReasonText)

fix_url2 = "https://www.dcard.tw/_api/forums/relationship/posts?popular=false"
last_id2 = last(res2$id) # id of the last post on this page
print(last_id2) # print it to check

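Next, fetch the first page. For a JSON endpoint, use GET() to request the page, content("text") to extract the body as text, and fromJSON() to convert it into an R data structure (usually a data frame or a list). The resulting data frame has many columns that are not needed here, so select() drops them and keeps only the useful ones.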
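
To see what fromJSON() returns without touching the network, here is a minimal sketch that parses a small hand-written JSON string. The field names id, title, createdAt and pinned appear in the post's own code, but the values are made up for illustration and the real response has many more fields.

```r
library(jsonlite)

# Made-up sample shaped like one page of the API response (values are illustrative)
sample_json <- '[
  {"id": 232421900, "title": "first",  "createdAt": "2019-11-08T04:53:17.195Z", "pinned": false},
  {"id": 232421868, "title": "second", "createdAt": "2019-11-08T04:50:02.004Z", "pinned": false}
]'

page <- fromJSON(sample_json)                   # a JSON array of objects becomes a data frame
page <- page[, setdiff(names(page), "pinned")]  # drop a column we do not need

str(page)
last_id <- page$id[nrow(page)]  # the value to pass as `before` for the next page
print(last_id)
```

Each JSON object becomes one row and each key becomes one column, which is why the id of the last row can be read straight off the data frame.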

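for (i in 1:50) {
  # build the URL of the next (older) page and print it to track progress
  url2 = paste0(fix_url2, "&limit=100&before=", last_id2) %>%
    print()
  resdata = GET(url2) %>%
    content("text", encoding = "UTF-8") %>%
    fromJSON() %>%
    select(-anonymousSchool, -anonymousDepartment, -pinned, -meta, -mediaMeta, -layout, -withImages, -withVideos, -media, -reportReasonText)
  last_id2 = last(resdata$id) # oldest id on this page, used by the next request
  res2 = bind_rows(res2, resdata) # append this page to the rows collected so far
  message(nrow(resdata)) # number of posts fetched in this request
}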

The loop does the actual scraping. The key function is bind_rows(), which stacks two data frames row-wise into one; it comes up a lot when scraping.
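
The paging logic can be tried offline with a stand-in for the API. In this sketch fake_posts and fetch_page() are hypothetical names invented for illustration; fetch_page() mimics the before parameter by returning only posts with a smaller id. The sketch also stops early once a page comes back empty, which the original loop does not do.

```r
library(dplyr)

# Stand-in for the real API: 250 fake posts, newest (largest id) first
fake_posts <- data.frame(
  id = 250:1,
  title = paste("post", 250:1),
  stringsAsFactors = FALSE
)

# Hypothetical fetcher: returns up to `limit` posts whose id is smaller than `before`
fetch_page <- function(before = Inf, limit = 100) {
  older <- fake_posts[fake_posts$id < before, , drop = FALSE]
  head(older, limit)
}

all_posts <- fetch_page()          # first page, no `before`
last_id <- last(all_posts$id)

for (i in 1:50) {
  page <- fetch_page(before = last_id)
  if (nrow(page) == 0) break       # nothing older left: stop early
  all_posts <- bind_rows(all_posts, page)
  last_id <- last(page$id)
}

message(nrow(all_posts), " posts collected; oldest id = ", last_id)
```

Because each request asks only for posts older than the last id seen, consecutive pages never overlap and no de-duplication is needed.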

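time = res2 %>%
  select(id, title, excerpt, createdAt, updatedAt) %>%
  # replace the "T" and "Z" characters with spaces
  mutate(createdAt = str_replace_all(createdAt, "[TZ]", " ")) %>%
  # parse the cleaned column; the "Z" suffix means the timestamps are in UTC
  mutate(createdAt = as.POSIXct(createdAt, tz = "UTC"))
time$createdAt[2] # print one value to check that it is now a date-time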

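Because the goal is a plot of posts per day, the createdAt column has to be converted from character to a date-time type so it can be used for plotting. str_replace_all() first replaces the "T" and "Z" characters in the original timestamp (2019-11-08T04:53:17.195Z) with spaces. Note that this uses str_replace_all() rather than str_replace(): str_replace() only replaces the first match, whereas str_replace_all() replaces every match, which is handy when the characters to remove are not next to each other.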

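Then as.POSIXct() converts the cleaned string to a date-time. The date-time conversion functions such as as.POSIXct() are fairly intricate; it took me some experimenting to understand them.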
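
As a variation on the conversion described above, the timestamp can also be parsed in one step by giving as.POSIXct() an explicit format and time zone. This is a minimal base-R sketch with made-up sample values; the choice of Asia/Taipei for the local calendar day is an assumption on my part, since the post does not say which time zone the daily counts should follow.

```r
raw <- c("2019-11-08T04:53:17.195Z", "2019-11-07T23:10:00.000Z")

# The trailing "Z" means UTC, so parse with tz = "UTC" and an explicit format
utc_time <- as.POSIXct(raw, format = "%Y-%m-%dT%H:%M:%OS", tz = "UTC")

# Calendar day in Taiwan local time (UTC+8), one possible basis for a per-day count
local_day <- as.Date(utc_time, tz = "Asia/Taipei")

print(format(utc_time, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
print(local_day)
```

The second sample shows why the time zone matters: 23:10 UTC on November 7 is already the morning of November 8 in Taiwan, so the post is counted on a different day depending on the zone used.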

ggplot(time, aes(x = createdAt)) +
  geom_histogram(aes(fill = ..count..)) +
  scale_x_datetime(name = "per day", 
                   date_breaks = "1 day", 
                   date_labels = "%m/%d") +
  scale_y_continuous(name = "Count",
                     breaks = seq(0, 350, by = 50)) +
  ggtitle("Frequency histogram of Dcard posts") +
  scale_fill_gradient("Count", low = "grey", high = "red") +
  theme_bw() +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1),
        axis.text.y = element_text(size = 7))

Finally, draw the plot with ggplot2. Several pieces are worth noting:

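geom_histogram(aes(fill = ..count..))
A histogram takes no y from the data; the bar height is the count that ggplot2 computes, which can be written explicitly as geom_histogram(aes(y = ..count..)). Mapping that same computed count to fill instead of y colors each bar by its count. (Recent ggplot2 versions write this as after_stat(count) and warn about the ..count.. notation.)
scale_y_continuous(name = "Count", breaks = seq(0, 350, by = 50))
Names the y axis and sets its tick marks: seq(0, 350, by = 50) places a tick every 50 units from 0 to 350. It sets the tick positions only, not the axis limits.
scale_x_datetime(name = "per day", date_breaks = "1 day", date_labels = "%m/%d")
Similar to the previous item, but specifically for a date-time x or y. date_breaks sets the tick spacing on the x axis, for example "1 day" or "1 hour", and date_labels sets how the time is displayed on the axis; "%m/%d" here means month/day.
(See the strptime documentation for more format codes.)
scale_fill_gradient()
This function shows differences in quantity well: the redder the bar, the higher the count.
theme_bw()
By default the plot background is grey; theme_bw() makes it white. Other options include theme_classic (no background color or grid lines) and theme_dark (dark background). Note that theme_bw() must come before theme(), otherwise it overrides the font sizes and layout set in theme().
theme()
Inside it, axis.text.x = element_text() and axis.text.y = element_text() set the font size, rotation angle, and justification of the x and y axis labels.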
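
One caveat: geom_histogram() uses 30 bins by default, so its bars do not necessarily line up with calendar days. As an alternative sketch, the posts can be counted per day first and drawn with geom_col(). The timestamps below are made up for illustration, and the column names day and n_posts are my own; the scales and theme calls are the ones discussed above, with scale_x_date() in place of scale_x_datetime() because the x values are dates.

```r
library(dplyr)
library(ggplot2)

# Made-up timestamps (UTC) standing in for the scraped createdAt column
posts <- data.frame(
  createdAt = as.POSIXct(
    c("2019-11-06 10:00:00", "2019-11-06 15:30:00", "2019-11-07 09:00:00",
      "2019-11-07 12:00:00", "2019-11-07 20:45:00", "2019-11-08 01:15:00"),
    tz = "UTC"
  )
)

# One row per calendar day with the number of posts on that day
daily <- posts %>%
  mutate(day = as.Date(createdAt, tz = "UTC")) %>%
  count(day, name = "n_posts")
print(daily)

# Bars are exact per-day counts; theme_bw() comes before theme() so the
# axis-text settings are not overwritten by the complete theme
p <- ggplot(daily, aes(x = day, y = n_posts, fill = n_posts)) +
  geom_col() +
  scale_x_date(name = "per day", date_breaks = "1 day", date_labels = "%m/%d") +
  scale_fill_gradient("Count", low = "grey", high = "red") +
  theme_bw() +
  theme(axis.text.x = element_text(size = 7, angle = 45, hjust = 1))
print(p)
```

Counting first also leaves a small table of daily totals that can be inspected or exported independently of the plot.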

And finally, the finished plot looks like this:


Written by Ken.Y