NLTK Beginner's Guide (2): From the Outside In, from Corpus to Word Tokenization (Getting Started)

Youngmi huang

Published in PyLadies Taiwan · 18 min read · Jul 30, 2018

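This is the second post in the "NLTK Beginner's Guide" series. It shows how to get started with the corpora that NLTK provides, covering: looking up text ids and text categories in a corpus → finding specific words → word and sentence tokenization → handling stopwords.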

Part 1: NLTK Beginner's Guide (1): An Easy-to-Use Natural Language Toolbox (Exploration)

Introduction to NLTK

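NLTK stands for Natural Language Toolkit, a Python-based toolbox for natural language processing. Its official documentation is very approachable; the main resource is the online book Natural Language Processing with Python, whose chapters are outlined in the figure below: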

Outline of Natural Language Processing with Python

This post draws its examples from Chapter 2, Accessing Text Corpora and Lexical Resources. If you prefer a printed reference, the O'Reilly edition is also available.

Peeling the corpus layer by layer: you will find that a text is made up of "words" large and small

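A corpus is like an onion: to reach the smallest unit, the individual word, you have to peel it layer by layer from the outside in. A natural language processing pipeline works the same way, and peeling in a different order or with a different method changes what the resulting words can represent as features. The exercises in Chapter 2 mostly walk through the path from corpus down to texts.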

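NLTK ships with many corpora. In the previous post we downloaded all-corpora during installation; if you have not installed all-corpora, you can simply download the corpora you need with nltk.download() as you work through the examples. Let's begin with the first step, the corpus!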

corpus: using the corpora provided by NLTK

The examples use four corpora: gutenberg, brown, reuters, and inaugural.

1. Looking up the text ids in a corpus

corpus.fileids()

# fileids() lists the texts available in the corpus
from nltk.corpus import gutenberg
gutenberg.fileids()
# Result
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

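Project Gutenberg was the first platform to offer free e-books online. According to its official website, Project Gutenberg already hosts more than 57,000 free e-books; the NLTK package includes only a small selection of them (listed above).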

2. Raw content, word lists, and sentence lists

corpus.raw(fileids), corpus.words(fileids), corpus.sents(fileids)

# Take the first text in the gutenberg corpus as an example
gutenberg.raw('austen-emma.txt')
gutenberg.words('austen-emma.txt')
gutenberg.sents('austen-emma.txt')

# Result: raw content (partial)
'Emma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.'

# Result: word list (partial)
['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.']

# Result: sentence list (partial)
[['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.'], 
['She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', "'", 's', 'marriage', ',', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', '.']]

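At first glance, the printed results of words and sents look very similar. The difference lies in the data structure: words returns a single flat list containing every token in the text, whereas sents returns a list of lists, with each inner list holding the tokens of one sentence. The two only look alike when the excerpt covers a single sentence.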
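To make the structural difference concrete, here is a minimal plain-Python sketch. The two tokenized sentences are made up for illustration and are not read from NLTK; the point is only that flattening a sents-style nested list gives a words-style flat list.

```python
# Hypothetical mini-corpus: two sentences, already tokenized (not loaded from NLTK).
sents = [
    ['Emma', 'Woodhouse', 'was', 'rich', '.'],
    ['She', 'was', 'the', 'youngest', '.'],
]

# Flattening the nested sentence list yields one flat list of tokens,
# which is the shape that words() returns.
words = [token for sent in sents for token in sent]

print(len(sents))    # number of sentences
print(len(words))    # number of tokens across all sentences
print(sents[0][:3])  # first three tokens of the first sentence
print(words[:3])     # first three tokens of the whole text
```

Indexing shows the difference: sents[0] is a whole sentence (a list), while words[0] is a single token (a string).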

Depending on what your NLP task requires, you can also count the words or the sentences:

# Count characters, words, and sentences
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))   # length of the raw content, in characters
    num_words = len(gutenberg.words(fileid)) # length of the word list
    num_sents = len(gutenberg.sents(fileid)) # length of the sentence list
    print(num_chars, num_words, num_sents, fileid)

# First 3 results: raw content length, word count, sentence count
887071 192427 7752 austen-emma.txt
466292 98171 3747 austen-persuasion.txt
673022 141576 4999 austen-sense.txt

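To compare texts by average word length, average sentence length, and lexical diversity (total number of words divided by the number of distinct words), you can use the following calculation: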

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))   # length of the raw content, in characters
    num_words = len(gutenberg.words(fileid)) # length of the word list
    num_sents = len(gutenberg.sents(fileid)) # length of the sentence list
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))  # distinct words, case-insensitive
    # Average word length, average sentence length, lexical diversity
    print(round(num_chars/num_words),
          round(num_words/num_sents),
          round(num_words/num_vocab), fileid)

# First 3 results
5 25 26 austen-emma.txt
5 26 17 austen-persuasion.txt
5 28 22 austen-sense.txt
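The three ratios can be checked by hand on a tiny input. The sketch below is a plain-Python illustration under stated assumptions: text_stats is a hypothetical helper name, and the two-sentence text is made up rather than taken from gutenberg.

```python
def text_stats(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity)."""
    num_chars = len(raw)                          # counts every character, spaces included
    num_words = len(words)                        # tokens, punctuation included
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})   # distinct tokens, case-insensitive
    return (num_chars / num_words,
            num_words / num_sents,
            num_words / num_vocab)

# Hypothetical two-sentence text, tokenized by hand.
raw = "The cat sat. The dog sat."
sents = [['The', 'cat', 'sat', '.'], ['The', 'dog', 'sat', '.']]
words = [w for s in sents for w in s]

avg_word_len, avg_sent_len, diversity = text_stats(raw, words, sents)
print(avg_word_len, avg_sent_len, diversity)
```

Because len(raw) also counts spaces while the word list also counts punctuation tokens, the "average word length" is a rough figure rather than an exact letter count.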

3. Categories of the texts in a corpus

corpus.categories()

# Take the brown corpus as an example
from nltk.corpus import brown
brown.categories()
# Result
['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

# Listing the text ids shows that brown contains a text with the id cc01
# Look up the categories of cc01
brown.categories("cc01")
# Result
['reviews']

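The texts in the brown corpus fall into 15 categories (adventure, belles_lettres, and so on). Having found through the text id lookup described earlier that one of the texts is cc01, we can query its categories and see that cc01 belongs to "reviews".

The Brown Corpus was the first million-word electronic corpus of English, compiled at Brown University in 1961. It contains text from 500 sources, which are categorized by genre, for example adventure, news, and reviews.

A single text can also belong to more than one category. In the reuters corpus, for example, test/14833 is categorized as both "palm-oil" and "veg-oil" (vegetable oil).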

# Step 1: look up the text ids
from nltk.corpus import reuters
reuters.fileids()
# Result (first 3)
['test/14826', 'test/14828', 'test/14829']

# Look up the categories
reuters.categories('test/14833')
# Result
['palm-oil', 'veg-oil']

reuters is the Reuters Corpus, which contains 10,788 news documents classified into 90 categories, such as housing, income, and tea.

4. Finding specific words in a corpus

# Retrieve the words of the texts in the reviews category of the brown corpus
brown.words(categories='reviews')
# Result (partial)
['It', 'is', 'not', 'news', 'that', 'Nathan', 'Milstein', 'is', 'a', 'wizard', 'of', 'the', 'violin', '.']

5. Combining categories with specific words

(1) Using FreqDist() to count specific words within a single category:

# Count modal verbs in the reviews category
import nltk

reviews_text = brown.words(categories='reviews')
fdist = nltk.FreqDist(w.lower() for w in reviews_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')
# Result
can: 45 could: 40 may: 47 might: 26 must: 19 will: 61
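The counting idea can also be tried without downloading any corpus. In NLTK 3, FreqDist is built on Python's collections.Counter, so the sketch below uses Counter directly as a stand-in; the token list is hypothetical and is not taken from brown.

```python
from collections import Counter

# Hypothetical token list standing in for brown.words(categories='reviews').
tokens = ['It', 'may', 'rain', ',', 'but', 'we', 'will', 'go', '.',
          'Will', 'you', 'come', '?', 'You', 'must', '.']

# Lowercase before counting so that 'Will' and 'will' are counted together.
counts = Counter(w.lower() for w in tokens)

modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    # A Counter returns 0 for keys it has never seen, so absent modals are safe to look up.
    print(m + ':', counts[m], end=' ')
print()
```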

(2) Using ConditionalFreqDist() to count specific words across several categories

It takes a list of pairs of the form (condition, event):

# Count modal verbs across several categories
cfd = nltk.ConditionalFreqDist((genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['reviews', 'hobbies', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)
# Result
          can  could  may  might  must  will 
reviews    45    40    45    26    19    58 
hobbies   268    58   131    22    83   264 
romance    74   193    11    51    45    43 
  humor    16    30     8     8     9    13

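Here the condition is the category (for example reviews) and the event is each word that occurs in the brown texts of that category; tabulate() presents the counts as a matrix. The reviews counts for may and will differ slightly from those in (1) because the words are not lowercased here.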
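The (condition, event) pairing can be sketched in plain Python as one counter per condition. This is only an illustration of the idea, not NLTK's implementation, and the pairs below are hypothetical rather than drawn from brown.

```python
from collections import Counter, defaultdict

# Hypothetical (genre, word) pairs; in the article they are generated from the brown corpus.
pairs = [
    ('reviews', 'can'), ('reviews', 'will'), ('reviews', 'will'),
    ('romance', 'could'), ('romance', 'could'), ('romance', 'can'),
]

# One Counter per condition: cfd[condition][event] is the number of times the pair occurred.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

# Print a small table, one row per condition and one column per sample.
modals = ['can', 'could', 'will']
print(' ' * 8 + ''.join(f'{m:>6}' for m in modals))
for genre in ['reviews', 'romance']:
    print(f'{genre:>8}' + ''.join(f'{cfd[genre][m]:>6}' for m in modals))
```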

NLTK's word tokenization feature: tokenize

Normally we can do simple word tokenization with split(). Here is a simple example:

# English example
sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
# Using split
sentence.split()
# Result
['At',
 'eight',
 "o'clock",
 'on',
 'Thursday',
 'morning',
 '...',
 'Arthur',
 "didn't",
 'feel',
 'very',
 'good.']

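Notice anything? The result contains tokens such as didn't and good., which were not split at the English contraction or at the trailing period. The reason is that split() only uses whitespace as its delimiter. NLTK's word_tokenize() addresses this problem.

# Tokenize with word_tokenize
nltk.word_tokenize(sentence)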

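Comparing the two approaches: split() leaves didn't and good. as they are, whereas word_tokenize() separates the contraction (into did and n't) and turns the trailing period into its own token.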
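To see why whitespace alone is not enough, here is a minimal sketch that puts split() next to a small regular-expression tokenizer. simple_tokenize is a hypothetical helper written for this illustration; it is not the algorithm behind word_tokenize and, unlike word_tokenize, it keeps a contraction such as didn't as one token. The sentence is the second half of the example above.

```python
import re

sentence = "Arthur didn't feel very good."

def simple_tokenize(text):
    # Match either a word (optionally with internal apostrophes, e.g. didn't)
    # or a single punctuation character that is neither a word character nor whitespace.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(sentence.split())           # whitespace only: the period stays glued to 'good.'
print(simple_tokenize(sentence))  # the period becomes its own token
```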

word_tokenize handles word tokenization, while sent_tokenize splits a text into sentences. Take the inaugural corpus as an example:

# Step 1: look up the text ids
from nltk.corpus import inaugural
inaugural.fileids()
# Result (last 3)
['2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']

# Load the raw content of a text and split it into sentences
speech = inaugural.raw('2009-Obama.txt')
nltk.sent_tokenize(speech)
# Result (partial)
['My fellow citizens:\n\nI stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors.',
 'I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.',
 'Forty-four Americans have now taken the presidential oath.']
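For intuition, sentence splitting can be approximated by breaking the text after sentence-ending punctuation. The sketch below is a deliberately naive illustration, with naive_sent_split as a hypothetical helper name and made-up input text. sent_tokenize relies on a trained Punkt model instead, which is designed to cope with cases such as abbreviations that this naive version gets wrong.

```python
import re

def naive_sent_split(raw):
    # Split at any run of whitespace that directly follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', raw.strip())

# Works on simple text...
print(naive_sent_split("It rained. We stayed home! Why not?"))

# ...but wrongly treats the abbreviation 'Dr.' as the end of a sentence.
print(naive_sent_split("Dr. Smith arrived. He sat down."))
```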

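inaugural is a corpus of the inaugural addresses of U.S. presidents. The files are named "year-name", there are 56 texts in total, and the most recent one included is Obama's 2009 address.

How has the vocabulary of presidential addresses changed over time? Counting the words that start with citizen or america in each address shows that from 1789 to 1905 the addresses still leaned mainly on citizen, whereas after 1905 presidents tend to mention "america" more often than "citizen":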

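# Conditional frequency distribution with a filter
import matplotlib.pyplot as plt

cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])                   # fileid[:4] extracts the year
    for fileid in inaugural.fileids()      # every inaugural address
    for w in inaugural.words(fileid)       # every word in that address
    for target in ['america', 'citizen']   # the prefixes to look for
    if w.lower().startswith(target))       # words such as American's are counted too
plt.figure(figsize=(20,10))
cfd.plot()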
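The filtering logic can be traced on a toy input. In the sketch below, the speeches dictionary is a hypothetical stand-in for the inaugural corpus (the tokens are made up), and a dictionary of counters replaces ConditionalFreqDist; only the year slice and the prefix test mirror the code above.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the inaugural corpus: fileid -> list of tokens.
speeches = {
    '1789-Washington.txt': ['Fellow', 'Citizens', 'of', 'the', 'Senate'],
    '2009-Obama.txt': ['America', ',', 'Americans', 'and', 'citizens'],
}

cfd = defaultdict(Counter)
for fileid, tokens in speeches.items():
    year = fileid[:4]                          # the first four characters are the year
    for w in tokens:
        for target in ['america', 'citizen']:
            # Prefix match on the lowercased token, so 'Americans' counts for 'america'.
            if w.lower().startswith(target):
                cfd[target][year] += 1

for target in sorted(cfd):
    print(target, dict(cfd[target]))
```

The condition is the target prefix and the event is the year, which is why the plot in the article shows one line per prefix across the years.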

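Using stopwords

NLTK's stopwords corpus supports 21 languages, although English remains the main focus. To see the word lists, go to the path where NLTK was originally downloaded and open the corpora/stopwords folder.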

# Load NLTK's English stopwords
from nltk.corpus import stopwords
print(stopwords.words('english'))
# Result (partial)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', ...]
# The English stopword list consists mainly of pronouns and prepositions, with only 179 stopwords in total.

Next, using the inaugural address corpus above, we compute the fraction of words that are not stopwords as a sample application:

# Define a function: the fraction of words in the inaugural corpus that are not stopwords
def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content)/len(text)

content_fraction(inaugural.words())
# Result
0.5228599855902837
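The same calculation can be followed by hand on a short token list. This sketch assumes a deliberately tiny, hypothetical stopword set in place of NLTK's English list, and stores it in a set so that each membership test is fast.

```python
# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer).
STOPWORDS = {'the', 'of', 'and', 'a', 'to', 'in', 'is'}

def content_fraction(tokens, stopwords=STOPWORDS):
    """Return the fraction of tokens that are not stopwords (case-insensitive)."""
    content = [w for w in tokens if w.lower() not in stopwords]
    return len(content) / len(tokens)

# Made-up example sentence, tokenized by hand.
tokens = ['The', 'state', 'of', 'the', 'union', 'is', 'strong', '.']

# 'state', 'union', 'strong' and '.' survive the filter: 4 of 8 tokens.
print(content_fraction(tokens))
```

Note that punctuation tokens such as '.' are not stopwords, so they count as content here; filtering them out first would lower the fraction.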

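Removing stopwords is the first step of word processing. How well NLTK's stopword lists are maintained varies from language to language, so it is best to try them out before relying on them. For Chinese, you need to define your own stopwords, or you may find it more convenient to bring in jieba for word segmentation.

So, is the onion fully peeled?

Summary

There are many ways to process words: a word's meaning, its frequency, and its position all carry important information about what a text means. The next post continues with word processing, covering hypernym and hyponym relations between words, synonyms, grammatical handling (verb tense, singular and plural nouns), and WordNet, a powerful lexical network. In other words, it will show how to peel the onion even more neatly!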
