Named Entity Recognition with NLTK and SpaCy

NER is used in many fields of Natural Language Processing (NLP)

Frederick Lee

10 min read · Jul 2, 2019

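Translated from an article published online.
Source: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

Named entity recognition (NER) is often the first step in information extraction. It seeks to locate named entities in text and classify them into predefined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, and percentages. NER is used in many fields of Natural Language Processing (NLP), and it can help answer many real-world questions, such as:

Which companies were mentioned in the news article?
Were specific products mentioned in complaints or reviews?
Does the tweet contain the name of a person? Does the tweet contain this person's location?

This article describes how to build a named entity recognizer with NLTK and SpaCy to identify the names of things, such as persons, organizations, or locations, in raw text. Let's get started!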

NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

Information Extraction

I took a sentence from The New York Times: "European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices."

ex = 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

Then we apply word tokenization and part-of-speech tagging to the sentence.

def preprocess(sent):
    sent = nltk.word_tokenize(sent)
    sent = nltk.pos_tag(sent)
    return sent

Running it on our sentence gives the following result:

sent = preprocess(ex)
sent

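We get a list of tuples containing the individual words in the sentence and their associated part-of-speech tags.

Now we will implement noun phrase chunking to identify named entities, using a regular expression made up of rules that indicate how the sentence should be chunked.

Our chunk pattern consists of one rule: a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.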

pattern = 'NP: {<DT>?<JJ>*<NN>}'
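To make the rule concrete, here is a minimal pure-Python sketch of how such a tag pattern can be matched: the POS tags are joined into one string and an ordinary regular expression is run over it. This only approximates what a regular-expression chunker does. The function name `chunk_noun_phrases` and the hand-tagged sample tokens are assumptions made for illustration and are not part of NLTK.

```python
import re

# Same rule as the chunk pattern: optional DT, any number of JJ, then one NN.
NP_RULE = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")


def chunk_noun_phrases(tagged):
    """Return the word groups whose POS tags match NP_RULE."""
    tag_string = ""
    start_to_index = {}  # character offset where a tag starts -> token index
    end_to_index = {}    # character offset where a tag ends -> token index
    for i, (_word, tag) in enumerate(tagged):
        start_to_index[len(tag_string)] = i
        tag_string += f"<{tag}>"
        end_to_index[len(tag_string)] = i

    chunks = []
    for match in NP_RULE.finditer(tag_string):
        first = start_to_index[match.start()]
        last = end_to_index[match.end()]
        chunks.append([word for word, _tag in tagged[first:last + 1]])
    return chunks


# Hand-tagged sample (hypothetical input, not real tagger output).
sample = [
    ("the", "DT"), ("mobile", "JJ"), ("phone", "NN"), ("market", "NN"),
    ("and", "CC"), ("ordered", "VBD"), ("the", "DT"), ("company", "NN"),
]
noun_phrases = chunk_noun_phrases(sample)
print(noun_phrases)
```

Note that `<NN>` matches only the exact tag NN, so plural nouns (NNS) and proper nouns (NNP) are not chunked by this rule.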

Chunking (identifying and labeling noun phrases)

Using this pattern, we create a chunk parser and test it on our sentence.

cp = nltk.RegexpParser(pattern)
cs = cp.parse(sent)
print(cs)

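The output can be read as a tree or a hierarchy, with S as the first level, denoting the sentence. We can also display it graphically.

IOB tags have become the standard way to represent chunk structures in files, and we will use this format as well.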

from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

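In this representation, there is one token per line, each with its part-of-speech tag and its IOB chunk tag. Based on a training corpus in this format, we can construct a tagger that can be used to label new sentences, and use the nltk.chunk.conlltags2tree() function to convert the tag sequences back into a chunk tree.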
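As a rough illustration of how IOB triples encode chunk boundaries, the sketch below groups (word, POS tag, IOB tag) triples back into labeled phrases using plain Python. It is a simplified stand-in for the idea behind converting tags into chunks, not NLTK's implementation. The helper name `iob_to_chunks` and the sample triples are hypothetical and were written by hand.

```python
def iob_to_chunks(iob_tagged):
    """Group (word, pos, iob) triples into (label, phrase) pairs."""
    chunks = []        # each entry: [label, [words]]
    open_label = None  # label of the chunk currently being extended
    for word, _pos, iob in iob_tagged:
        if iob.startswith("B-"):
            # "B-" always opens a new chunk.
            open_label = iob[2:]
            chunks.append([open_label, [word]])
        elif iob.startswith("I-") and iob[2:] == open_label:
            # "I-" continues the chunk opened by the preceding "B-".
            chunks[-1][1].append(word)
        else:
            # "O" (or a stray "I-" with no open chunk) closes any open chunk.
            open_label = None
    return [(label, " ".join(words)) for label, words in chunks]


# Hand-written sample triples (assumed for illustration).
sample_iob = [
    ("fined", "VBD", "O"), ("Google", "NNP", "O"),
    ("a", "DT", "B-NP"), ("record", "NN", "I-NP"), ("$", "$", "O"),
    ("the", "DT", "B-NP"), ("mobile", "JJ", "I-NP"), ("phone", "NN", "I-NP"),
    ("market", "NN", "B-NP"),
]
chunks = iob_to_chunks(sample_iob)
print(chunks)
```

The "B-" prefix is what lets two adjacent chunks, such as "the mobile phone" and "market", stay separate even though nothing lies between them.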

With the function nltk.ne_chunk(), we can recognize named entities using a classifier; the classifier adds category labels such as PERSON, ORGANIZATION, and GPE.

from nltk.chunk import ne_chunk

ne_tree = ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)

Google is recognized as a person. That is quite disappointing, don't you think?
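To check the assigned labels without reading the printed tree, one option is to walk the tree and collect each labeled subtree. The sketch below assumes NLTK is installed. It uses a small hand-built tree instead of real classifier output so that it runs without downloading any models, and the helper name `extract_entities` is an assumption made for illustration.

```python
from nltk.tree import Tree


def extract_entities(ne_tree):
    """Collect (entity text, label) pairs from the top level of a chunk tree."""
    entities = []
    for node in ne_tree:
        # Labeled entities are subtrees; ordinary tokens are (word, tag) tuples.
        if isinstance(node, Tree):
            text = " ".join(word for word, _tag in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built tree (hypothetical, shaped like the output of a named entity chunker).
sample_tree = Tree("S", [
    Tree("GPE", [("European", "JJ")]),
    ("authorities", "NNS"),
    ("fined", "VBD"),
    Tree("PERSON", [("Google", "NNP")]),
])
entities = extract_entities(sample_tree)
print(entities)
```

Passing the real `ne_tree` to the same function should list every entity the classifier found together with its label.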

SpaCy

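SpaCy's named entity recognition has been trained on the OntoNotes 5 corpus, and it supports entity types such as PERSON, NORP, ORG, GPE, DATE, and MONEY.

Entity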

import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

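We use the same sentence: "European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices."

One of the nice things about SpaCy is that we only need to apply nlp once; the entire background pipeline runs and returns the processed objects.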

doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
pprint([(X.text, X.label_) for X in doc.ents])

European is NORP (nationalities or religious or political groups), Google is an organization, $5.1 billion is a monetary value, and Wednesday is a date object. They are all correct.

Token

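In the example above, we worked at the entity level. In the following example, we demonstrate token-level entity annotation, using the BILUO tagging scheme to describe the entity boundaries (the ent_iob_ attribute printed here reports the simpler IOB codes).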

pprint([(X, X.ent_iob_, X.ent_type_) for X in doc])

"B" means the token begins an entity, "I" means it is inside an entity, "O" means it is outside an entity, and "" means no entity tag is set.
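To show how the B/I/O codes relate to the BILUO scheme, here is a minimal sketch that converts a sequence of IOB tags into BILUO tags, where "L" marks the last token of a multi-token entity and "U" marks a single-token entity. The function name `iob_to_biluo` is hypothetical, and the sample tags were combined by hand from IOB codes and entity types; this is not SpaCy's own conversion code.

```python
def iob_to_biluo(tags):
    """Convert tags like 'B-ORG', 'I-ORG', 'O' into the BILUO scheme."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is an I- tag with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            biluo.append(("B-" if continues else "U-") + label)
        else:
            biluo.append(("I-" if continues else "L-") + label)
    return biluo


# Hand-written tags for: European authorities fined Google a record $ 5.1 billion on Wednesday
sample_tags = ["B-NORP", "O", "O", "B-ORG", "O", "O",
               "B-MONEY", "I-MONEY", "I-MONEY", "O", "B-DATE"]
biluo_tags = iob_to_biluo(sample_tags)
print(biluo_tags)
```

Single-token entities such as Google become "U-ORG", while the three-token money amount becomes "B-MONEY", "I-MONEY", "L-MONEY".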

Extracting named entities from an article

Now let's get serious with SpaCy and extract named entities from a New York Times article, "F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired."

from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    # Download the page and parse it (the 'html5lib' parser must be installed).
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    # Drop elements that hold no article text.
    for script in soup(["script", "style", 'aside']):
        script.extract()
    # Collapse runs of newlines and tabs into single spaces.
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

ny_bb = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=first-column-region&region=top-news&WT.nav=top-news')
article = nlp(ny_bb)
len(article.ents)

188
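The cleanup steps in url_to_string (drop script, style, and aside elements, then collapse newlines and tabs) can be tried offline on a small HTML string. The sketch below swaps in the standard library's `html.parser` in place of requests and BeautifulSoup, so it is a different technique that only approximates the same result. The names `TextExtractor` and `html_to_string` and the sample HTML are assumptions made for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text while skipping the contents of script, style and aside."""

    SKIP = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_string(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Same whitespace cleanup as url_to_string: newlines and tabs become spaces.
    return " ".join(re.split(r"[\n\t]+", "".join(parser.parts)))


# Hypothetical page fragment, not the real article.
sample_html = (
    "<html><head><style>p {color: red}</style>"
    "<script>var x = 1;</script></head>"
    "<body><p>Peter Strzok\nwas fired.</p><aside>Related links</aside></body></html>"
)
clean_text = html_to_string(sample_html)
print(clean_text)
```

The resulting string can be passed to nlp() in the same way as the text returned by url_to_string.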

There are 188 entities in the article, and they are represented by 10 unique labels:

labels = [x.label_ for x in article.ents]
Counter(labels)

The following are the three most frequent entity texts.

items = [x.text for x in article.ents]
Counter(items).most_common(3)

Figure 11
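Both counting steps rely on `collections.Counter`. The sketch below repeats them on a short hand-made list of (text, label) pairs, so the behavior can be checked without loading a model. The sample pairs are hypothetical and are not the entities SpaCy found in the article.

```python
from collections import Counter

# Hand-made (entity text, label) pairs standing in for article.ents.
sample_ents = [
    ("Strzok", "PERSON"), ("F.B.I.", "ORG"), ("Trump", "PERSON"),
    ("Strzok", "PERSON"), ("Monday", "DATE"), ("Trump", "PERSON"),
    ("Strzok", "PERSON"),
]

# How many entities carry each label.
label_counts = Counter(label for _text, label in sample_ents)

# Which entity texts occur most often.
top_items = Counter(text for text, _label in sample_ents).most_common(2)

print(label_counts)
print(top_items)
```

`most_common(n)` returns (item, count) pairs sorted from the most frequent to the least frequent.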

Let's pick one sentence at random to learn more.

sentences = [x for x in article.sents]
print(sentences[20])

Figure 12

Let's run displacy.render to generate the raw markup.

displacy.render(nlp(str(sentences[20])), jupyter=True, style='ent')

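One entity is misclassified here: FBI. It's a hard one, isn't it?

Using spaCy's built-in displaCy visualizer, here is what the above sentence and its dependencies look like: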

displacy.render(nlp(str(sentences[20])), style='dep', jupyter = True, options = {'distance': 120})

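Next, going word by word, we extract the part-of-speech tag and the lemma of each token in this sentence, skipping stop words and punctuation.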

[(x.orth_, x.pos_, x.lemma_)
 for x in nlp(str(sentences[20]))
 if not x.is_stop and x.pos_ != 'PUNCT']

dict([(str(x), x.label_) for x in nlp(str(sentences[20])).ents])

Named entity extraction is correct except for "FBI".

print([(x, x.ent_iob_, x.ent_type_) for x in sentences[20]])

Finally, we visualize the entities of the entire article.

Try it yourself. It was fun! Source code can be found on GitHub.