English Natural Language Processing in Python: Installing and Testing Stanford NLP

Pei Lee
15 min read · May 20, 2017

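Stanford NLP is a well-known natural language processing toolkit that supports many languages, including Chinese. It is written in Java, however, so using it from Python takes some extra work.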

My environment: Windows 7/10 + Python 3.6

I. Installation

  1. Install Java and the Python NLTK package: I installed Java 1.8 and NLTK 3.2.1.
  2. Download Stanford NLP: I used version 3.7.0. Four packages are needed:
     - Stanford Named Entity Recognizer (NER)
     - Stanford Log-linear Part-Of-Speech Tagger
     - Stanford Parser
     - Stanford Word Segmenter
  3. Create the folders: create a StanfordNLP folder, then create a "jars" folder inside it.
  4. Unzip the downloads and move the following files into StanfordNLP\jars\:
     - from stanford-ner-2016-10-31: stanford-ner.jar
     - from stanford-postagger-full-2016-10-31: stanford-postagger.jar
     - from stanford-parser-full-2016-10-31: slf4j-api.jar, stanford-parser-3.7.0-models.jar, stanford-parser.jar
     - from stanford-segmenter-2016-10-31: stanford-segmenter-3.7.0.jar

5. Move the POS tagger's models folder into StanfordNLP: cut the models folder out of stanford-postagger-full-2016-10-31 and paste it under StanfordNLP\.

6. Move the NER classifier files into StanfordNLP: cut all the files (GZ and PROP files) in stanford-ner-2016-10-31\classifiers and paste them into StanfordNLP\models\. There are 47 files in total.
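The folder layout from steps 3 to 6 can be sanity-checked before any NLTK code is run. The following is only a minimal sketch: it assumes the jar names listed in step 4 (with slf4j-api stored as slf4j-api.jar), and the helpers `missing_jars` and `has_models` are hypothetical names, not part of NLTK or Stanford NLP.

```python
from pathlib import Path

# Jar names taken from step 4; adjust them if you downloaded another version.
EXPECTED_JARS = [
    "stanford-ner.jar",
    "stanford-postagger.jar",
    "stanford-parser.jar",
    "stanford-parser-3.7.0-models.jar",
    "stanford-segmenter-3.7.0.jar",
    "slf4j-api.jar",
]


def missing_jars(root):
    """Return the expected jar names that are not present under <root>/jars."""
    jars_dir = Path(root) / "jars"
    return [name for name in EXPECTED_JARS if not (jars_dir / name).is_file()]


def has_models(root):
    """Return True if <root>/models exists and contains at least one file."""
    models_dir = Path(root) / "models"
    return models_dir.is_dir() and any(p.is_file() for p in models_dir.iterdir())
```

Calling `missing_jars` with the path to your own StanfordNLP folder should return an empty list once every jar is in place.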

II. Testing

  1. First, import everything we are going to test:
import os
from nltk.tokenize import StanfordTokenizer
from nltk.tag import StanfordNERTagger
from nltk.tag import StanfordPOSTagger
from nltk.parse.stanford import StanfordParser
from nltk import Tree
from nltk.parse.stanford import StanfordDependencyParser

2. Set the environment variables:

os.environ["JAVA_HOME"] = "C:/Program Files/Java/jdk1.8.0_111" 
os.environ["CLASSPATH"] = "C:(my path)/StanfordNLP/StanfordNLP/jars"
os.environ["STANFORD_MODELS"] = "C:(my path)/StanfordNLP/StanfordNLP/models"
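
Since the CLASSPATH and STANFORD_MODELS values share the same parent folder, the three assignments can be derived from a single root path. This is just a sketch under that assumption; `configure_stanford_env` is a hypothetical helper name, and the paths you pass in are your own.

```python
import os


def configure_stanford_env(root, java_home):
    """Set the three environment variables from step 2 using one root folder."""
    root = root.rstrip("/\\")  # tolerate a trailing slash or backslash
    settings = {
        "JAVA_HOME": java_home,
        "CLASSPATH": root + "/jars",
        "STANFORD_MODELS": root + "/models",
    }
    os.environ.update(settings)
    return settings
```

Returning the dictionary makes it easy to print the values and confirm they point where you expect before creating any tagger or parser.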

3. Tokenization:

tokenizer = StanfordTokenizer()
sent = "Everyone is entitled to all the rights and freedoms set forth in this Declaration, without distinction of any kind, such as race, colour, sex, language, religion, political or other opinion, national or social origin, property, birth or other status. From UN"
print(tokenizer.tokenize(sent))

Output:

['Everyone', 'is', 'entitled', 'to', 'all', 'the', 'rights', 'and', 'freedoms', 'set', 'forth', 'in', 'this', 'Declaration', ',', 'without', 'distinction', 'of', 'any', 'kind', ',', 'such', 'as', 'race', ',', 'colour', ',', 'sex', ',', 'language', ',', 'religion', ',', 'political', 'or', 'other', 'opinion', ',', 'national', 'or', 'social', 'origin', ',', 'property', ',', 'birth', 'or', 'other', 'status', '.', 'From', 'UN']

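4. Part-of-speech tagging:

eng_tagger = StanfordPOSTagger('english-bidirectional-distsim.tagger')
print(eng_tagger.tag(sent.split()))

Output: each token with its part-of-speech tag. Because the input here is sent.split(), punctuation stays attached to the preceding word (for example 'Declaration,').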

[('Everyone', 'NN'), ('is', 'VBZ'), ('entitled', 'VBN'), ('to', 'TO'), ('all', 'PDT'), ('the', 'DT'), ('rights', 'NNS'), ('and', 'CC'), ('freedoms', 'NNS'), ('set', 'VBN'), ('forth', 'RB'), ('in', 'IN'), ('this', 'DT'), ('Declaration,', 'NN'), ('without', 'IN'), ('distinction', 'NN'), ('of', 'IN'), ('any', 'DT'), ('kind,', 'NN'), ('such', 'JJ'), ('as', 'IN'), ('race,', 'FW'), ('colour,', 'FW'), ('sex,', 'FW'), ('language,', 'FW'), ('religion,', 'FW'), ('political', 'JJ'), ('or', 'CC'), ('other', 'JJ'), ('opinion,', 'NN'), ('national', 'JJ'), ('or', 'CC'), ('social', 'JJ'), ('origin,', 'FW'), ('property,', 'FW'), ('birth', 'NN'), ('or', 'CC'), ('other', 'JJ'), ('status.', 'NN'), ('From', 'IN'), ('UN', 'NNP')]
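
Tokens such as 'Declaration,' and 'status.' appear in the tagged output because `sent.split()` splits on whitespace only. The sketch below illustrates the difference in plain Python; the regular expression is a rough stand-in used for illustration and is not the Stanford tokenizer's algorithm.

```python
import re

text = "without distinction of any kind, such as race, colour, sex."

# Whitespace splitting keeps punctuation glued to the preceding word.
whitespace_tokens = text.split()

# Rough stand-in for a real tokenizer: runs of word characters and single
# punctuation marks become separate tokens.
simple_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(simple_tokens)
```

Passing the token list from step 3 (`tokenizer.tokenize(sent)`) to `tag()` instead of `sent.split()` is one way to give the tagger input with punctuation already separated.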

5. Named entity recognition:

eng_tagger = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
print(eng_tagger.tag(sent.split()))

Output:

[('Everyone', 'O'), ('is', 'O'), ('entitled', 'O'), ('to', 'O'), ('all', 'O'), ('the', 'O'), ('rights', 'O'), ('and', 'O'), ('freedoms', 'O'), ('set', 'O'), ('forth', 'O'), ('in', 'O'), ('this', 'O'), ('Declaration,', 'O'), ('without', 'O'), ('distinction', 'O'), ('of', 'O'), ('any', 'O'), ('kind,', 'O'), ('such', 'O'), ('as', 'O'), ('race,', 'O'), ('colour,', 'O'), ('sex,', 'O'), ('language,', 'O'), ('religion,', 'O'), ('political', 'O'), ('or', 'O'), ('other', 'O'), ('opinion,', 'O'), ('national', 'O'), ('or', 'O'), ('social', 'O'), ('origin,', 'O'), ('property,', 'O'), ('birth', 'O'), ('or', 'O'), ('other', 'O'), ('status.', 'O'), ('From', 'O'), ('UN', 'ORGANIZATION')]

Named entity recognition is mainly used to identify names of people, places, organizations and the like, which is why 'UN' is labeled 'ORGANIZATION'.
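
Most tokens are tagged 'O' (outside any entity), so a common follow-up is to keep only the labeled spans. Here is a minimal sketch that works on the list of (token, tag) pairs shown above; `extract_entities` is a hypothetical helper, not an NLTK function.

```python
def extract_entities(tagged_tokens, outside_label="O"):
    """Merge consecutive tokens that share the same non-O tag into one entity."""
    entities = []
    current_words, current_label = [], None
    for word, label in tagged_tokens:
        if label != outside_label and label == current_label:
            current_words.append(word)  # same entity continues
            continue
        if current_words:  # the previous entity has ended
            entities.append((" ".join(current_words), current_label))
        if label == outside_label:
            current_words, current_label = [], None
        else:
            current_words, current_label = [word], label
    if current_words:  # flush an entity that ends the sentence
        entities.append((" ".join(current_words), current_label))
    return entities


tagged = [("From", "O"), ("UN", "ORGANIZATION")]
print(extract_entities(tagged))
```

One limitation of this simple grouping: two different entities of the same type that sit directly next to each other are merged into one.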

6. Syntactic parsing:

eng_parser = StanfordParser()
res = list(eng_parser.parse(sent.split()))
print(res)

Output:

[Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NN', ['Everyone'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('ADJP', [Tree('VBN', ['entitled']), Tree('PP', [Tree('TO', ['to']), Tree('NP', [Tree('NP', [Tree('PDT', ['all']), Tree('DT', ['the']), Tree('NNS', ['rights']), Tree('CC', ['and']), Tree('NNS', ['freedoms'])]), Tree('VP', [Tree('VBN', ['set']), Tree('ADVP', [Tree('RB', ['forth']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['this']), Tree('ADJP', [Tree('JJ', ['Declaration,']), Tree('PP', [Tree('IN', ['without']), Tree('NP', [Tree('NP', [Tree('NN', ['distinction'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('DT', ['any']), Tree('NN', ['kind,'])]), Tree('PP', [Tree('JJ', ['such']), Tree('IN', ['as']), Tree('NP', [Tree('NNP', ['race,']), Tree('NNP', ['colour,']), Tree('NNP', ['sex,']), Tree('NNP', ['language,']), Tree('NNP', ['religion,']), Tree('JJ', ['political']), Tree('CC', ['or']), Tree('JJ', ['other']), Tree('NN', ['opinion,'])])])])])])])]), Tree('ADJP', [Tree('JJ', ['national']), Tree('CC', ['or']), Tree('JJ', ['social'])]), Tree('JJ', ['origin,']), Tree('NN', ['property,']), Tree('NN', ['birth']), Tree('CC', ['or']), Tree('JJ', ['other']), Tree('NN', ['status.'])])])]), Tree('PP', [Tree('IN', ['From']), Tree('NP', [Tree('NNP', ['UN'])])])])])])])])])])]

This output is ugly and hard to read without a picture, so let's draw it as a tree:

res[0].draw()

Output: (the figure is too large, so only part of it is shown)
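
When a drawing window is not convenient, an indented text outline is another way to read the tree. The sketch below assumes only that each node exposes `label()` and can be iterated over its children, with plain strings as leaves, which is how NLTK's `Tree` behaves. `outline` and `DemoTree` are hypothetical names; `DemoTree` is a tiny stand-in so the example runs without NLTK or Java.

```python
def outline(node, depth=0):
    """Return indented text lines for a tree whose nodes expose label()."""
    if isinstance(node, str):  # leaf: the word itself
        return ["  " * depth + node]
    lines = ["  " * depth + node.label()]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


class DemoTree(list):
    """Tiny stand-in for nltk.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


demo = DemoTree("NP", [DemoTree("DT", ["the"]), DemoTree("NNS", ["rights"])])
print("\n".join(outline(demo)))
```

With a real parse, the same function could be called as `outline(res[0])`.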

7. Dependency parsing:

eng_parser = StanfordDependencyParser()
res = list(eng_parser.parse(sent.split()))
for row in res[0].triples():
    print(row)

Output:

(('entitled', 'VBN'), 'nsubjpass', ('Everyone', 'NN'))
(('entitled', 'VBN'), 'auxpass', ('is', 'VBZ'))
(('entitled', 'VBN'), 'nmod', ('rights', 'NNS'))
(('rights', 'NNS'), 'case', ('to', 'TO'))
(('rights', 'NNS'), 'dep', ('all', 'PDT'))
(('rights', 'NNS'), 'det', ('the', 'DT'))
(('rights', 'NNS'), 'cc', ('and', 'CC'))
(('rights', 'NNS'), 'conj', ('freedoms', 'NNS'))
(('rights', 'NNS'), 'acl', ('set', 'VBN'))
(('set', 'VBN'), 'advmod', ('forth', 'RB'))
(('forth', 'RB'), 'nmod', ('birth', 'NN'))
(('birth', 'NN'), 'case', ('in', 'IN'))
(('birth', 'NN'), 'det', ('this', 'DT'))
(('birth', 'NN'), 'amod', ('Declaration,', 'JJ'))
(('Declaration,', 'JJ'), 'nmod', ('distinction', 'NN'))
(('distinction', 'NN'), 'case', ('without', 'IN'))
(('distinction', 'NN'), 'nmod', ('kind,', 'NN'))
(('kind,', 'NN'), 'case', ('of', 'IN'))
(('kind,', 'NN'), 'det', ('any', 'DT'))
(('kind,', 'NN'), 'nmod', ('political', 'JJ'))
(('political', 'JJ'), 'case', ('such', 'JJ'))
(('such', 'JJ'), 'mwe', ('as', 'IN'))
(('political', 'JJ'), 'amod', ('race,', 'NNP'))
(('political', 'JJ'), 'amod', ('colour,', 'NNP'))
(('political', 'JJ'), 'amod', ('sex,', 'NNP'))
(('political', 'JJ'), 'amod', ('language,', 'NNP'))
(('political', 'JJ'), 'amod', ('religion,', 'NNP'))
(('political', 'JJ'), 'cc', ('or', 'CC'))
(('political', 'JJ'), 'conj', ('opinion,', 'NN'))
(('opinion,', 'NN'), 'amod', ('other', 'JJ'))
(('birth', 'NN'), 'amod', ('national', 'JJ'))
(('national', 'JJ'), 'cc', ('or', 'CC'))
(('national', 'JJ'), 'conj', ('social', 'JJ'))
(('birth', 'NN'), 'amod', ('origin,', 'JJ'))
(('birth', 'NN'), 'compound', ('property,', 'NN'))
(('birth', 'NN'), 'cc', ('or', 'CC'))
(('birth', 'NN'), 'conj', ('status.', 'NN'))
(('status.', 'NN'), 'amod', ('other', 'JJ'))
(('set', 'VBN'), 'nmod', ('UN', 'NNP'))
(('UN', 'NNP'), 'case', ('From', 'IN'))

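Each row has the format ((word A, POS tag of A), dependency relation between A and B, (word B, POS tag of B)). The dependency relations between words are described in detail in the official documentation.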
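
Because every row follows that three-part format, the triples are easy to regroup with ordinary Python. As a minimal sketch, the hypothetical helper `dependents_by_head` below collects, for each head word, its relations and dependents, using a few rows copied from the output above.

```python
from collections import defaultdict


def dependents_by_head(triples):
    """Map each head word to a list of (relation, dependent word) pairs."""
    grouped = defaultdict(list)
    for (head, _head_tag), relation, (dependent, _dep_tag) in triples:
        grouped[head].append((relation, dependent))
    return dict(grouped)


# A few rows copied from the dependency output above.
triples = [
    (("entitled", "VBN"), "nsubjpass", ("Everyone", "NN")),
    (("entitled", "VBN"), "auxpass", ("is", "VBZ")),
    (("rights", "NNS"), "det", ("the", "DT")),
]
for head, deps in dependents_by_head(triples).items():
    print(head, deps)
```

Note that this groups by the word string alone, so a word that occurs twice in the sentence (such as 'or' or 'other' here) would have its entries merged.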

That covers the basic installation and feature tests for Stanford NLP.
