NLTK Beginner's Guide (1): An Easy-to-Use Natural Language Toolkit (Exploration)

Published in PyLadies Taiwan

14 min read · Jun 4, 2018

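What is NLTK?

NLTK stands for Natural Language Toolkit, a Python-based toolbox for natural language processing. The official documentation is very approachable; the main resource is the online book Natural Language Processing with Python, whose chapters are shown in the figure below:

Outline of Natural Language Processing with Python

This exploration post draws on the examples in Chapter 1, Language Processing and Python. If you prefer a printed reference, the O’Reilly edition is also available.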

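Environment setup

This post uses Python 3, NLTK 3, and jieba. The main package to install is NLTK:

# Install NLTK
pip install nltk

# In Python, open the NLTK downloader; the interface below appears
import nltk
nltk.download()

The exploration post relies on nltk.book, so install whatever you need. I installed all-corpora and the popular package. The figure below shows all-corpora being downloaded, which takes a while.

When the download finishes, type q to quit. nltk.download() then returns True, confirming that the downloader has exited.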

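NLTK basics

Because NLTK is a toolbox built for natural language processing, it gives easy access to the texts downloaded in the previous step, which we can use to practice the Chapter 1 examples. nltk.book provides nine texts:

# First, import the sample texts that NLTK provides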
from nltk.book import *
# Result
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

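1. Searching for a word: showing the contexts in which it appears

book.concordance()

Besides giving a first look at what a word means in a text, this function is also useful for word lookup. Suppose we want to answer the question "how long did people live?" from text3: The Book of Genesis. We can search for lived to see which passages it appears in (each line shows only a limited number of characters) and read the meaning from its context: Adam lived 130 years, Seth 105 years, Enos 90 years, and so on. Note that matching is case-insensitive, for both the keyword you type and the results returned.

# Search for a word
text3.concordance("lived")
text3.concordance("LIVED")

# Result (truncated; only the first few lines are shown)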
Displaying 25 of 38 matches:
ay when they were created . And Adam lived an hundred and thirty years , and be
nd thirty yea and he died . And Seth lived an hundred and five years , and bega
welve years : and he died . And Enos lived ninety years , and begat Cainan : An
ive years : and he died . And Cainan lived seventy years and begat Mahalaleel :
years : and he died . And Mahalaleel lived sixty and five years , and begat Jar
and five yea and he died . And Jared lived an hundred sixty and two years , and
...
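To make the idea concrete, here is a minimal pure-Python sketch of a case-insensitive keyword-in-context search. It is not NLTK's implementation: the function name concordance, the width parameter (counted in tokens rather than characters), and the toy token list are all assumptions made for illustration.

```python
def concordance(tokens, word, width=3):
    """Return every occurrence of `word` with `width` tokens of context per side.

    Matching ignores case, mirroring the behaviour described above.
    """
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Toy tokens made up for this sketch (not the real Genesis text).
tokens = ["And", "Adam", "lived", "an", "hundred", "and", "thirty", "years", ".",
          "And", "Enos", "Lived", "ninety", "years", "."]

for line in concordance(tokens, "LIVED", width=2):
    print(line)
```

Both the lowercase and the capitalised occurrence are found, because the comparison is done on lowercased tokens while the original spelling is kept in the output.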

2. Finding similar words

book.similar(), book.common_contexts()

Words are treated as similar when they appear in the same contexts. Suppose we look up monstrous in text1. It occurs in frames such as the ___ pictures and a ___ size; any other words that occur in the same frames in text1 (for example true, contemptible, christian) are considered similar to it.

# Similar words
text1.similar("monstrous")

# Result
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless

# Go back and inspect the shared context
text1.common_contexts(["monstrous","abundant"])

# Result
most_and
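The sketch below illustrates the underlying idea with plain Python: record the (previous word, next word) pair around every token, then look for words that share pairs. It is a simplification under my own assumptions and not how NLTK scores similarity internally; the helper names and the toy sentence are hypothetical.

```python
from collections import defaultdict


def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    lowered = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts


def similar(tokens, word):
    """Rank other words by the number of contexts they share with `word`."""
    contexts = contexts_by_word(tokens)
    key = word.lower()
    target = contexts[key]
    scores = {
        other: len(target & ctx)
        for other, ctx in contexts.items()
        if other != key and target & ctx
    }
    return sorted(scores, key=lambda w: (-scores[w], w))


def common_contexts(tokens, words):
    """Return the contexts shared by all `words`, written as 'left_right'."""
    contexts = contexts_by_word(tokens)
    shared = set.intersection(*(contexts[w.lower()] for w in words))
    return sorted(f"{left}_{right}" for left, right in shared)


# Toy sentence made up for this sketch (not taken from Moby Dick).
tokens = ("a monstrous size and a curious size , the most monstrous and "
          "the most abundant and the true size").split()

print(similar(tokens, "monstrous"))
print(common_contexts(tokens, ["monstrous", "abundant"]))
```

In this toy data, curious shares the frame a ___ size with monstrous, and abundant shares most ___ and, so both are reported as similar.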

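3. Lexical diversity

len(set(book)) / len(book)

Dividing the number of distinct tokens by the total number of tokens gives a value that can be used to compare how rich the vocabulary of different texts is. Take text4 as an example. set(text4) returns all distinct tokens in text4, such as 1812, 1815, Amendment, Abandonment, Above, Accept, Americans, and so on. len(set(text4)) is the number of distinct tokens (9,754) and len(text4) is the total number of tokens (145,735); dividing the two gives roughly 0.067. If you run the code, you will see that many of the sorted distinct tokens are years, and that much of the vocabulary relates to law-making.

# Distinct tokens
set(text4)

# Distinct tokens, sorted
sorted(set(text4))

# Define a lexical diversity function
def lexical_diversity(text):
    return len(set(text)) / len(text)

lexical_diversity(text4)

# Result (approximate, computed from the counts quoted above: 9,754 / 145,735)
0.0669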
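One thing to keep in mind is that set() treats The and the, or a word and a punctuation mark, as different tokens. The sketch below compares the raw ratio with a normalized one that lowercases tokens and keeps only alphabetic ones. The function normalized_diversity and the toy token list are my own additions for illustration.

```python
def lexical_diversity(tokens):
    """Distinct tokens divided by total tokens (0.0 for an empty text)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def normalized_diversity(tokens):
    """Same ratio after lowercasing and dropping non-alphabetic tokens."""
    words = [t.lower() for t in tokens if t.isalpha()]
    return len(set(words)) / len(words) if words else 0.0


# Toy tokens made up for this sketch.
tokens = ["The", "people", "and", "the", "Government", ",", "the", "people", "."]

print(lexical_diversity(tokens))     # 7 distinct tokens out of 9
print(normalized_diversity(tokens))  # 4 distinct words out of 7
```

Which variant is appropriate depends on the question being asked; when comparing texts, the same normalization should be applied to all of them.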

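4. Lexical dispersion plot

book.dispersion_plot()

Continuing with text4, suppose we want to see where words related to "establishing American democracy" occur across the whole text, that is, whether particular words show up near the beginning, the middle, or the end:

# Build the lexical dispersion plot for the text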
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America", "liberty", "constitution"])

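Some patterns stand out: citizens appears throughout the text but is concentrated in the first half, while America only becomes frequent in the second half. Since text4 is the Inaugural Address Corpus, with the addresses joined end to end, the horizontal axis also reflects how word use shifts from the earlier addresses to the later ones. To me this resembles the way legal texts are structured: the general principles about citizens and rights are stated first, and what America will do on that foundation comes afterward.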
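What the plot draws is simply the token position of every occurrence of each word. The following sketch computes those positions by hand and summarizes how many fall in the first half of the text. The helper names and the toy token list are assumptions for illustration, and matching here is exact (case-sensitive).

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


def share_in_first_half(positions, total):
    """Fraction of occurrences that fall in the first half of the text."""
    if not positions:
        return None
    return sum(1 for p in positions if p < total / 2) / len(positions)


# Toy tokens made up for this sketch (not the Inaugural Address Corpus).
tokens = ["citizens", "of", "a", "free", "nation", "citizens", "and", "duties",
          "America", "stands", "for", "freedom", "America"]

offsets = word_offsets(tokens, ["citizens", "America", "liberty"])
for word, positions in offsets.items():
    print(word, positions, share_in_first_half(positions, len(tokens)))
```

A word with no occurrences, such as liberty in this toy list, gets an empty position list, which corresponds to an empty row in the plot.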

5. Combining texts

sent1 + sent2

This example joins two short texts. sent1 and sent2 both come from the nltk.book import above, so they can be used directly for practice. Any number of texts can be joined the same way, which is quite intuitive.

# Contents of sent1 and sent2
sent1
> ['Call', 'me', 'Ishmael', '.']

sent2
> ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']

# Combine the texts
sent1 + sent2

# Result
['Call', 'me', 'Ishmael', '.', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']

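NLTK works with Chinese text too

In an earlier post, "Exploring Text Topics with jieba and gensim: Lyrics Analysis of Mayday's LIFE Tour (I)", I used jieba for Chinese word segmentation and feature extraction. The examples below continue with the same Chinese lyrics (for the file format, see mayday.txt and single.txt) and combine jieba with the NLTK basics introduced above.

1. Searching for a word

mayday.txt holds the set list of Mayday's LIFE tour (人生無限公司), 33 song lyrics in total. jieba.analyse.extract_tags extracts keywords from Chinese text; it is based on TF-IDF, which is explained in more detail in the lyrics analysis post.

# Search for a word
import jieba.analyse
import nltk

with open("data/mayday.txt", encoding="utf-8") as f1:
    for line in f1:
        lyrics = nltk.text.Text(jieba.analyse.extract_tags(line))
        lyrics.concordance("我們")

# Result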
Displaying 1 of 1 matches:
我們 不曾 哪裡 相識 那個 場景 出現 相遇 每秒 那一刻 我會 充滿 如果 
Displaying 1 of 1 matches:
我們 好好 變老 時間 喧囂 最後的 知道 大人 失散多年 寫成 想養 每個 場
Displaying 1 of 1 matches:
我們 人生 快樂 無論 一個 期待 然後呢 回憶 也許 走著 並肩 追尋了 親愛
....
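For readers unfamiliar with TF-IDF, here is a minimal sketch of ranking keywords by that score on pre-segmented toy documents. It is not jieba's implementation: as far as I know, jieba.analyse.extract_tags relies on a precomputed IDF table rather than on the documents passed in, whereas this sketch derives IDF from the toy documents themselves. The function name tfidf_keywords and the sample data are hypothetical.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=2):
    """For each tokenized document, return its top_k tokens ranked by TF-IDF."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each token once per document

    results = []
    for doc in documents:
        counts = Counter(doc)
        scores = {
            token: (count / len(doc)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        }
        # Highest score first; ties are broken by the token itself.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Toy, already-segmented "lyrics" made up for this sketch.
documents = [
    ["我們", "相遇", "我們", "場景"],
    ["我們", "變老", "時間", "時間"],
    ["人生", "快樂", "我們", "人生"],
]

for keywords in tfidf_keywords(documents):
    print(keywords)
```

Because 我們 occurs in every toy document, its IDF here is zero and it never ranks as a keyword; with a corpus-wide IDF table the outcome can differ. Also note that a keyword list is ordered by weight rather than by position in the song, so the words shown around 我們 in the concordance output above are neighbouring keywords rather than the original lyric context.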

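2. Similar words

Still using mayday.txt: jieba.cut performs Chinese word segmentation, and nltk.text.Text turns the tokens into a format NLTK can work with. lyrics.similar("我們") shows that words similar to 我們 ("we") include 我 ("I"), 青春 ("youth"), 身邊 ("by one's side"), and others. Looking further, 我 and 我們 both occur in the frame 讓_不 ("let ___ not"), which is why they are judged to be similar.

# Similar words
import jieba
import nltk

raw = open("data/mayday.txt", encoding="utf-8").read()
lyrics = nltk.text.Text(jieba.cut(raw))
lyrics.similar("我們")
lyrics.common_contexts(["我們","我"])

# Result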
similar: 
我 青春 身邊 完美 難題 倔強 地方 街 路 陷阱 空氣 眼中 憤青 剝開 破壞
common_contexts: 
讓_不

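3. Lexical diversity

single.txt contains the lyrics of a single song (Mayday's 戀愛 ing). Using the function introduced earlier, lexical diversity can be computed for each song. For 戀愛 ing, len(set(single)) is the number of distinct tokens (76) and len(single) is the total number of tokens (244); dividing the two gives a lexical diversity of 0.31.

# Reuse the lexical diversity function from the basics section
raw = open("data/single.txt", encoding="utf-8").read()
single = nltk.text.Text(jieba.cut(raw))
lexical_diversity(single)

# Result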
0.3114754098360656

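4. Lexical dispersion plot

Still using the single-song file single.txt, we plot the chorus words 戀愛, ing, happy, and love with single.dispersion_plot. The plot shows the order in which these key words occur and how they are distributed across the song.

# Display Chinese characters
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.rcParams['font.sans-serif'] = 'SimHei'

# Lexical dispersion plot
single.dispersion_plot(["love","戀愛","ing","happy"])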

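NLTK draws its plots with matplotlib, so by default the Chinese labels on the vertical axis on the left are rendered as garbled characters. There are two ways to fix this:

Solution 1

Set plt.rcParams['font.sans-serif'] = 'SimHei' every time you plot. This assumes you have downloaded the SimHei font and placed SimHei.ttf in matplotlib's font folder (mpl-data/fonts/ttf) so that it can be found when plotting.

Solution 2

Set the parameters once in matplotlibrc, so no extra code is needed each time. The steps are:

1. Download the SimHei font and place SimHei.ttf in matplotlib's font folder (mpl-data/fonts/ttf).
2. Open the matplotlibrc file (located in the matplotlib/mpl-data directory; matplotlib.matplotlib_fname() returns the path of the file currently in use) and change the following three settings:

font.family : sans-serif
font.sans-serif : SimHei, Bitstream Vera Sans, Lucida Grande, Verdana, Geneva, Lucid, Arial, Helvetica, Avant Garde, sans-serif
axes.unicode_minus : False # fixes the minus sign (-) being displayed as a box

3. Run _rebuild() to rebuild the font cache, after which Chinese is displayed correctly. Note that _rebuild is a private matplotlib function and may not exist in newer releases; in that case, deleting the font cache in the directory returned by matplotlib.get_cachedir() and restarting Python should have the same effect.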

from matplotlib.font_manager import _rebuild
_rebuild()

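Summary

NLTK and jieba are among the most commonly used packages for natural language tasks, where the basic building blocks are words, part-of-speech tags, syntax, and so on. Here are a few personal takeaways:

In terms of features: NLTK is backed by a large community and plenty of resources, which makes it well suited for practicing NLP. Its ecosystem is also relatively mature (built-in texts, and even ready-made support for collecting Twitter text).
In terms of use cases: use NLTK for English and jieba for Chinese, and combine them when needed so that each covers the other's gaps. The reason is that NLTK's open-source community is mostly non-Chinese-speaking (not limited to English; other languages are covered too), whereas jieba was designed around Chinese word segmentation, so jieba still segments Chinese more accurately. NLTK also offers tokenization (tokenize) and WordNet, a lexical database covering word senses and parts of speech; these will be covered in the next post.

The code for this post is available on GitHub; see 01_nltk_practice.ipynb to practice hands-on.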



Written by Youngmi huang