Exploring the GitHub Activity of Mainstream Deep Learning Frameworks

Automated Web Data Extraction (Scraping) and Visualization

Published in 數聚點文摘 · 13 min read · Nov 30, 2017

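If you want to start learning the theory and implementation of deep learning, where should you begin among the many competing frameworks, such as TensorFlow, PyTorch, Caffe, or CNTK? Let's use Selenium with Python, together with visualization, to explore how active the mainstream frameworks are on GitHub!

The first chapter of the book TensorFlow 實戰 Google 深度學習框架, an introduction to deep learning, contains a passage in which the author compares GitHub statistics for different deep learning frameworks over the period from 2016-10-15 to 2016-11-15, including the numbers of issues, stars, pull requests, and forks. Because TensorFlow was far ahead of the other frameworks on all four metrics, the author chose TensorFlow as the subject of the book. I would like to use automated web data extraction (scraping) and visualization techniques to carry out a survey similar to the author's, and in doing so offer a simple example to beginners who are interested in Python.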

All of the code and figures used in this article can be found in this Notebook.

Photo by Beantin webbkommunikation on VisualHunt / CC BY-SA

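Introduction to Selenium

Selenium is a tool created for browser automation. It lets a program drive a browser directly to simulate a user's interactions with a website. In the process it actually runs a real browser, such as Chrome, Edge, Firefox, or Safari, to fill in forms and click buttons, and thereby obtains the site's live content. Selenium can currently be driven from both Python and R.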

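Creating a Virtual Environment

Create a virtual environment named seleniumpy. Taking Anaconda 3 on macOS as an example:

conda create -n seleniumpy python=3.6

Activate the virtual environment and install the required modules and packages in it: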

source activate seleniumpy
conda install -c conda-forge selenium numpy pandas matplotlib

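Once installation is complete, you can use conda list to inspect the modules and packages installed in this virtual environment:

selenium, numpy, pandas, matplotlib, and their dependencies are all installed

Finally, register this virtual environment as a Jupyter Notebook kernel so that you can open a notebook that runs in this specific environment: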

pip install ipykernel
python -m ipykernel install --user --name seleniumpy --display-name "seleniumpy"

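Automated Web Data Extraction (Scraping)

The data extraction workflow is as follows:

Go to GitHub
Clear the Search GitHub field
Type the name of a deep learning framework into Search GitHub
Press Enter to start the search
Click the Commits, Issues, Wikis, and Users tabs and record the count shown on each

Repeat these steps until the information for every deep learning framework entered has been collected.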

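Next, take stock of the Selenium with Python methods these steps will use:

Chrome(): open the Chrome browser with Selenium
get(): go to GitHub
find_element_by_css_selector(): locate the search field
send_keys(): type the name of a deep learning framework
send_keys(Keys.RETURN): press Enter to start the search

Selenium with Python provides several methods for locating page elements, such as find_element_by_id(), find_element(s)_by_name(), and find_element(s)_by_xpath(). I prefer to locate elements with CSS selectors or XPath, and two great Chrome extensions help with this: SelectorGadget and XPath Helper. Both are very easy to use.

Let's work through a simple exercise: go to GitHub, search for TensorFlow, click the Commits tab, and finally return the number of commits.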

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Note: these calls use the Selenium 3 API; Selenium 4 replaced
# find_element_by_*() with find_element(By.CSS_SELECTOR, ...).
search_css = ".js-site-search-focus"
result_css = ".pb-3 h3"
nav_css = ".UnderlineNav-item"
driver = webdriver.Chrome()
driver.get("https://github.com/")
search_input = driver.find_element_by_css_selector(search_css) # locate the search field
search_input.clear() # clear the search field
search_input.send_keys('TensorFlow') # search for TensorFlow
search_input.send_keys(Keys.RETURN) # press Enter
navs = driver.find_elements_by_css_selector(nav_css) # collect the links of the navigation tabs
driver.get(navs[2].get_attribute('href')) # go to the Commits tab
commits = driver.find_element_by_css_selector(result_css).text
print(commits)

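I wonder whether anyone else finds it therapeutic to watch Selenium drive the browser and interact with the website. After that, all that remains is to clean up the search result a little: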

splitted_commits = commits.split(" ")
commits_str = splitted_commits[0].replace(",", "")
commits_int = int(commits_str)
print(commits_int)

Next, wrap what this simple exercise does into a function, get_github_activities():
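
The function's source is not reproduced in the text here, so the following is only a minimal sketch of what such a function might look like, not the author's implementation. It reuses the selectors and Selenium 3 calls from the exercise above. The helper parse_count(), the tab_indexes parameter, and the tab positions other than Commits (index 2) are assumptions, as are the output column names apart from "keyword", which the plotting code relies on. GitHub's page structure may have changed since the article was written, so the selectors may no longer match.

```python
import re


def parse_count(result_text):
    """Return the first integer found in a result heading such as '12,345 commits'."""
    match = re.search(r"\d[\d,]*", result_text)
    if match is None:
        raise ValueError(f"no count found in {result_text!r}")
    return int(match.group().replace(",", ""))


def get_github_activities(keywords, tab_indexes=None):
    """Search GitHub for each keyword and collect the count shown on each tab."""
    # Imported lazily so parse_count() can be used without a browser set up.
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys

    # ASSUMPTION: positions of the tabs in the navigation bar. Only
    # index 2 (Commits) is taken from the exercise; the rest are guesses.
    if tab_indexes is None:
        tab_indexes = {"commits": 2, "issues": 3, "wikis": 4, "users": 5}

    search_css = ".js-site-search-focus"
    result_css = ".pb-3 h3"
    nav_css = ".UnderlineNav-item"

    rows = []
    driver = webdriver.Chrome()
    try:
        for keyword in keywords:
            driver.get("https://github.com/")
            search_input = driver.find_element_by_css_selector(search_css)
            search_input.clear()
            search_input.send_keys(keyword)
            search_input.send_keys(Keys.RETURN)
            # Read every tab link before navigating, because leaving the
            # page invalidates the element references.
            navs = driver.find_elements_by_css_selector(nav_css)
            hrefs = {name: navs[i].get_attribute("href")
                     for name, i in tab_indexes.items()}
            row = {"keyword": keyword}
            for name, href in hrefs.items():
                driver.get(href)
                heading = driver.find_element_by_css_selector(result_css).text
                row[name] = parse_count(heading)
            rows.append(row)
    finally:
        driver.quit()  # close the browser even if a lookup fails
    return pd.DataFrame(rows, columns=["keyword", *tab_indexes])
```

The sketch uses no explicit waits, so on a slow connection a lookup may run before the page has finished loading. The parsing step is kept in its own function so that it can be checked without opening a browser.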

Finally, we arbitrarily pick Caffe, Deeplearning4j, CNTK, PyTorch, MXNet, TensorFlow, and Keras as inputs to the function:

dl_activities = get_github_activities(['Caffe', 'Deeplearning4j', 'CNTK', 'Pytorch', 'MXNet', 'TensorFlow', 'Keras'])
dl_activities

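Static Visualization

A bar chart is well suited to comparing rankings across categories. Once the data is organized into a data frame, plotting is very convenient, and by default the bars are drawn side by side (dodged):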

import matplotlib.pyplot as plt

dl_activities.plot.bar(x = "keyword", title = "GitHub Activities")
plt.show()

To draw the bars stacked instead, just add the stacked = True parameter:

dl_activities.plot.bar(x = "keyword", title = "GitHub Activities", stacked = True)
plt.show()

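Adding the subplots = True and layout = (2, 2) parameters spreads the bar charts across a 2x2 grid, so the activity level for each of the four activity types is clear at a glance: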

dl_activities.plot.bar(x = "keyword", subplots = True, layout = (2, 2), legend=False)
plt.tight_layout()
plt.show()

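To switch to horizontal bar charts, use the barh() method instead. However, because the sharex parameter defaults to True, the panels with small values share the same X-axis scale as the panels with large values, which produces the situation in the figure below: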

dl_activities.plot.barh(x = "keyword", subplots = True, layout = (2, 2), legend=False)
plt.tight_layout()
plt.show()

Fixing this is also quite simple: add the sharex = False and sharey = True parameters:

dl_activities.plot.barh(x = "keyword", subplots = True, layout = (2, 2), legend=False, sharex = False, sharey = True)
plt.tight_layout()
plt.show()

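Dynamic Visualization

Let's take this opportunity to practice with the plotly package and build an interactive bar chart with a dropdown menu. Go back to the Terminal, activate the seleniumpy virtual environment, and install plotly in it: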

source activate seleniumpy
pip install plotly

When drawing plotly figures in a Jupyter Notebook, be sure to run plotly.offline.init_notebook_mode() at the very beginning. Then link the four bar charts to the dropdown menu:
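
The plotting code itself is not reproduced in the text here. As a rough sketch of one way to wire four bar charts to a dropdown, the function below builds a plain Plotly figure dictionary with one bar trace per metric and an updatemenus entry whose buttons toggle trace visibility. The function name build_dropdown_bar_figure, the metric names, and the sample numbers are hypothetical. The numbers are placeholders, not real GitHub counts.

```python
def build_dropdown_bar_figure(keywords, counts_by_metric):
    """Build a Plotly figure dict: one bar trace per metric plus a dropdown."""
    metrics = list(counts_by_metric)
    traces = [
        {
            "type": "bar",
            "x": list(keywords),
            "y": list(counts_by_metric[metric]),
            "name": metric,
            "visible": i == 0,  # only the first metric is shown initially
        }
        for i, metric in enumerate(metrics)
    ]
    buttons = [
        {
            "label": metric,
            "method": "update",
            # First dict updates the traces, second dict updates the layout.
            "args": [
                {"visible": [m == metric for m in metrics]},
                {"title": f"GitHub Activities: {metric}"},
            ],
        }
        for metric in metrics
    ]
    layout = {
        "title": f"GitHub Activities: {metrics[0]}",
        "updatemenus": [
            {"buttons": buttons, "direction": "down", "showactive": True}
        ],
    }
    return {"data": traces, "layout": layout}


# Placeholder numbers for illustration only, not real GitHub counts.
fig = build_dropdown_bar_figure(
    ["FrameworkA", "FrameworkB"],
    {"commits": [10, 20], "issues": [3, 4], "wikis": [1, 2], "users": [5, 6]},
)
```

In a notebook, after calling plotly.offline.init_notebook_mode(), the resulting dictionary can be passed to plotly.offline.iplot(fig). With the scraped data frame, the arguments would be the keyword column and a dictionary mapping each metric name to its column, assuming the data frame has one column per metric.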
