Introduction to Named Entity Recognition

A tool that invariably comes in handy for Natural Language Processing tasks

Introduction

In this article we will learn what Named Entity Recognition (NER) is. We will discuss some of its use cases and then evaluate a few standard Python libraries that let us get started quickly and solve the problem at hand.

In the next series of articles we will get under the hood of this class of algorithms, get more sophisticated, and create our own NER tagger from scratch.

So, let’s begin this journey.

What is Named Entity Recognition?

Named Entity Recognition, also known as entity extraction, classifies the named entities present in a text into pre-defined categories such as “individuals”, “companies”, “places”, “organizations”, “cities”, “dates”, “product terminologies”, etc. It adds a wealth of semantic knowledge to your content and helps you quickly understand the subject of any given text.

A Few Use Cases of Named Entity Recognition

  • Classifying content for news providers

Named Entity Recognition can automatically scan entire articles and reveal the major people, organizations, and places discussed in them. Knowing the relevant tags for each article helps in automatically categorizing the articles into defined hierarchies and enables smooth content discovery.

https://www.paralleldots.com/named-entity-recognition
  • Efficient Search Algorithms
Efficient search across the brands

Let’s suppose you are designing an internal search algorithm for an online publisher that has millions of articles. If for every search query the algorithm ends up searching all the words in millions of articles, the process will take a lot of time. Instead, if Named Entity Recognition can be run once on all the articles and the relevant entities (tags) associated with each of those articles are stored separately, this could speed up the search process considerably. With this approach, a search term will be matched with only the small list of entities discussed in each article leading to faster search execution.
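The idea above can be sketched as an inverted index from entity to article ids. This is only a minimal illustration, not a real publisher's search system: the article ids and entity tags below are made up, and in practice the tags would come from running an NER tagger once over every article.

```python
from collections import defaultdict


def build_entity_index(article_entities):
    """Map each lower-cased entity to the set of article ids that mention it."""
    index = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            index[entity.lower()].add(article_id)
    return index


def search(index, query):
    """Return the sorted ids of articles tagged with the query entity."""
    return sorted(index.get(query.lower(), set()))


# Hypothetical output of an NER pass that was run once over three articles.
article_entities = {
    "a1": ["Theresa May", "European Union"],
    "a2": ["Apple", "Nikkei"],
    "a3": ["European Union", "Apple"],
}

index = build_entity_index(article_entities)
print(search(index, "apple"))           # ['a2', 'a3']
print(search(index, "european union"))  # ['a1', 'a3']
```

A query now costs one dictionary lookup instead of a scan over the full text of every article, which is where the speed-up comes from.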

  • Customer Support
Customer support on Twitter

Say you handle the customer support department of an electronics store with multiple branches worldwide, and you go through a number of mentions in your customers’ feedback, like this one for instance:

Now, if you pass it through the Named Entity Recognition API, it pulls out the entities Bandra (location) and Fitbit (product). These can then be used to categorize the complaint and assign it to the relevant department within the organization.
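As a rough sketch of that last step, the extracted entities could drive a simple lookup table. The department names and the routing table are hypothetical, and the entity labels are written by hand here rather than taken from a real API response.

```python
# Hypothetical routing table: entity label -> department that handles it.
ROUTES = {
    "PRODUCT": "product-support",
    "LOCATION": "branch-operations",
}


def route_complaint(entities, default="general-support"):
    """Return the departments to notify for a list of (text, label) entities."""
    departments = []
    for _text, label in entities:
        department = ROUTES.get(label, default)
        if department not in departments:
            departments.append(department)
    return departments or [default]


# The two entities mentioned above, labelled by hand for this example.
entities = [("Bandra", "LOCATION"), ("Fitbit", "PRODUCT")]
print(route_complaint(entities))  # ['branch-operations', 'product-support']
```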

https://www.paralleldots.com/named-entity-recognition

Standard Libraries for Named Entity Recognition

I will discuss three standard libraries that are widely used in Python to perform NER. I am sure there are many more, and I would encourage readers to add them in the comment section.

  1. Stanford NER
  2. spaCy
  3. NLTK

Stanford NER

Stanford NER is a Java implementation of a Named Entity Recognizer. Stanford NER is also known as CRFClassifier. The software provides a general implementation of (arbitrary order) linear chain Conditional Random Field (CRF) sequence models. That is, by training your own models on labeled data, you can actually use this code to build sequence models for NER or any other task.

NLTK (Natural Language Toolkit) is a great Python package that provides a set of natural language corpora and APIs for a wide variety of NLP algorithms. NLTK also ships with a wrapper around the Stanford NER implementation.

With this background, let’s use Stanford NER:

  • Install the NLTK library:
pip install nltk
  • Download the Stanford NER library:

Go to https://nlp.stanford.edu/software/CRF-NER.html#Download

and download the latest version. I am using Stanford Named Entity Recognizer version 3.9.2.

This gives a zip file called “stanford-ner-2018-10-16.zip”, which needs to be unzipped. I renamed the unzipped folder to stanford_ner and placed it in the home folder.

The following Python code performs NER on some given text. The code is placed in the “bsuvro” folder, so that I can use a relative path to access the NER tagger engine (stanford-ner-3.9.2.jar) and the NER model trained on the English corpus (classifiers/english.muc.7class.distsim.crf.ser.gz). I am using the 7class model, which outputs seven named entity types: Location, Person, Organization, Money, Percent, Date, and Time.

You can also use:

  • english.all.3class.distsim.crf.ser.gz: Location, Person and Organization
  • english.conll.4class.distsim.crf.ser.gz: Location, Person, Organization and Misc
Stanford Named Entity Recognition
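A minimal sketch of such a script, using NLTK's StanfordNERTagger wrapper, could look like the following. It is not necessarily the exact code behind the output shown in this article. It assumes that Java is installed, that NLTK's `punkt` tokenizer data has been downloaded, and that the unzipped folder sits at `~/stanford_ner`; the helper names `stanford_paths` and `stanford_ner` are made up for this example.

```python
from pathlib import Path

# Assumed layout: the unzipped download was renamed to "stanford_ner" and
# placed in the home folder, as described above.
NER_DIR = Path.home() / "stanford_ner"


def stanford_paths(ner_dir=NER_DIR):
    """Return (model_path, jar_path) for the 7-class English model."""
    model = ner_dir / "classifiers" / "english.muc.7class.distsim.crf.ser.gz"
    jar = ner_dir / "stanford-ner-3.9.2.jar"
    return str(model), str(jar)


def stanford_ner(text, ner_dir=NER_DIR):
    """Tokenize `text` and return a list of (token, tag) pairs."""
    # Imported here so the path helper above works without NLTK installed.
    from nltk.tag import StanfordNERTagger
    from nltk.tokenize import word_tokenize

    model, jar = stanford_paths(ner_dir)
    tagger = StanfordNERTagger(model, jar, encoding="utf-8")
    return tagger.tag(word_tokenize(text))
```

Calling `stanford_ner(some_text)` returns one `(token, tag)` pair per token. To switch to the 3class or 4class model, change the classifier file name in `stanford_paths`.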

The output of the above code is shown below, and you can see how the words are tagged as named entities. Note that “O” marks a token that is not tagged as any entity; it can be read as “Other”.

Output of the Stanford NER tagger
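Because the tagger labels one token at a time, a multi-word name such as a person's full name comes back as several consecutive pairs. A small post-processing step can merge them and drop the “O” tokens. The pairs below are written by hand in the shape the tagger returns and are not real tagger output.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge consecutive tokens that share a non-"O" tag into one entity."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in group), tag))
    return entities


# Hand-written (token, tag) pairs, for illustration only.
tagged = [
    ("Theresa", "PERSON"), ("May", "PERSON"), ("spoke", "O"),
    ("in", "O"), ("London", "LOCATION"), ("on", "O"), ("Monday", "DATE"),
]
print(group_entities(tagged))
# [('Theresa May', 'PERSON'), ('London', 'LOCATION'), ('Monday', 'DATE')]
```

One limitation of this simple grouping: two different entities of the same type that sit directly next to each other would be merged into one.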

Now, let’s move to the next library called spaCy.

spaCy

spaCy NER

spaCy is known as an industrial-strength natural language processing library for Python. It is written in Python and Cython, a superset of the Python programming language that compiles to C and gives C-like performance.

Although I would like to go into detail about spaCy, as it has a lot of interesting NLP modules, I will focus here on NER tagging. I will definitely have a separate series on exploring spaCy.

  • Install the spaCy library and download the English model (the “en” shortcut works only in spaCy 2; spaCy 3 requires the full package name):
pip install spacy
python -m spacy download en_core_web_sm
spaCy NER
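A minimal sketch of NER with spaCy could look like this. The model name `en_core_web_sm` is an assumption (spaCy 2 installs also accept the `en` shortcut), and `extract_entities` and `group_by_label` are helper names made up for this example.

```python
from collections import defaultdict


def extract_entities(text, model="en_core_web_sm"):
    """Return (entity text, label) pairs found by a spaCy pipeline."""
    # Imported here so the grouping helper below works without spaCy installed.
    import spacy

    nlp = spacy.load(model)
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


def group_by_label(entities):
    """Group (text, label) pairs into {label: [texts]}, dropping repeats."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)
```

`group_by_label(extract_entities(some_text))` gives a quick overview of which people, organizations, places, and dates a text mentions. `spacy.explain(label)` returns a short description of any label you do not recognize.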

The output of the above code:

Output from spaCy NER

The model supports the following entity types:

https://spacy.io/api/annotation#pos-tagging

NLTK

NLTK NER

NLTK (Natural Language Toolkit) is a Python package that provides a set of natural language corpora and APIs for a wide variety of NLP algorithms.

To perform Named Entity Recognition using NLTK, we work in three stages:

  1. Word Tokenization
  2. Parts of Speech (POS) tagging
  3. Named Entity Recognition
pip install nltk

Now, let’s perform the first two stages here:

import nltk
print('NLTK version: %s' % (nltk.__version__))
from nltk import word_tokenize, pos_tag, ne_chunk
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')

Note that we need to download some standard corpora and models from NLTK to perform part-of-speech tagging and named entity recognition, which is what the nltk.download calls in the above code do.

article = '''
Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword, while a
sharp drop in oil prices and political risks in Europe pushed the dollar to 16-month highs as investors dumped
riskier assets. MSCI’s broadest index of Asia-Pacific shares outside Japan dropped 1.7 percent to a 1-1/2
week trough, with Australian shares sinking 1.6 percent. Japan’s Nikkei dived 3.1 percent led by losses in
electric machinery makers and suppliers of Apple’s iphone parts. Sterling fell to $1.286 after three straight
sessions of losses took it to the lowest since Nov.1 as there were still considerable unresolved issues with the
European Union over Brexit, British Prime Minister Theresa May said on Monday.'''
def fn_preprocess(art):
    # Stage 1: split the text into word tokens.
    art = nltk.word_tokenize(art)
    # Stage 2: tag each token with its part of speech.
    art = nltk.pos_tag(art)
    return art
art_processed = fn_preprocess(article)
art_processed
Snapshot of Output (POS tagging) from the above code

To understand what each tag means, please refer to the list below:

CC coordinating conjunction
CD cardinal digit
DT determiner
EX existential there (like: “there is” … think of it like “there exists”)
FW foreign word
IN preposition/subordinating conjunction
JJ adjective ‘big’
JJR adjective, comparative ‘bigger’
JJS adjective, superlative ‘biggest’
LS list marker 1)
MD modal could, will
NN noun, singular ‘desk’
NNS noun plural ‘desks’
NNP proper noun, singular ‘Harrison’
NNPS proper noun, plural ‘Americans’
PDT predeterminer ‘all the kids’
POS possessive ending parent’s
PRP personal pronoun I, he, she
PRP$ possessive pronoun my, his, hers
RB adverb very, silently
RBR adverb, comparative better
RBS adverb, superlative best
RP particle give up
TO to go ‘to’ the store
UH interjection
VB verb, base form take
VBD verb, past tense took
VBG verb, gerund/present participle taking
VBN verb, past participle taken
VBP verb, sing. present, non-3rd person take
VBZ verb, 3rd person sing. present takes
WDT wh-determiner which
WP wh-pronoun who, what
WP$ possessive wh-pronoun whose
WRB wh-adverb where, when

Once we have done the part-of-speech tagging, we perform a process called chunking. Text chunking, also called shallow parsing, typically follows POS tagging and adds more structure to the sentence. The result is a grouping of words into “chunks”.

So, let’s perform chunking on our article, which we have already POS tagged.

Our target here is to NER tag only the nouns.

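# ne_chunk returns a tree in which named entities are labelled subtrees.
results = ne_chunk(art_processed)
# Keep only the lines of the printed tree that contain a noun tag.
for x in str(results).split('\n'):
    if '/NN' in x:
        print(x)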

The snapshot of the output is as follows:

Snapshot of the output from the above code

The output looks decent but not great. Let’s take up a slightly more complex task.

Say we want to implement noun phrase chunking to identify named entities.
Our chunk pattern consists of one rule: a noun phrase, NP, should be formed whenever the chunker finds an optional determiner, DT, followed by any number of adjectives, JJ, and then a noun, NN.
pattern = 'NP: {<DT>?<JJ>*<NN>}'
cp = nltk.RegexpParser(pattern)
cs = cp.parse(art_processed)
print(cs)

The output of the above chunking is below:

Snapshot from the output from above

The output can be read as a tree with “S”, the sentence, as the first level. It can also be viewed in a more convenient format called IOB tags (Inside, Outside, Beginning):

from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)
The snapshot of the output from the above code

Here, each token in the output is one line carrying its part-of-speech tag and its IOB chunk tag. Since each entry is a tuple, you can extract the IOB tags simply like this:

for word, pos, ner in iob_tagged:
    print(word, pos, ner)
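To turn the IOB tags back into readable phrases, the B- and I- tagged tokens can be stitched together. The sketch below is plain Python and works on a hand-written sample in the same (word, POS, IOB) shape. The helper name `chunks_from_iob` is made up for this example.

```python
def chunks_from_iob(iob_triples, chunk_type="NP"):
    """Collect the words of each B-/I- tagged chunk into a phrase."""
    phrases, current = [], []
    for word, _pos, tag in iob_triples:
        if tag == f"B-{chunk_type}":
            # A new chunk starts; close the previous one if it is still open.
            if current:
                phrases.append(" ".join(current))
            current = [word]
        elif tag == f"I-{chunk_type}" and current:
            current.append(word)
        else:
            # Outside any chunk; close the open chunk, if there is one.
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written triples in the (word, POS, IOB) shape of tree2conlltags output.
sample = [
    ("a", "DT", "B-NP"), ("sharp", "JJ", "I-NP"), ("drop", "NN", "I-NP"),
    ("in", "IN", "O"), ("oil", "NN", "B-NP"), ("prices", "NNS", "O"),
]
print(chunks_from_iob(sample))  # ['a sharp drop', 'oil']
```

Note that “prices” stays outside the chunk because the pattern only accepts a singular noun, NN, and not NNS.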

The entire code for the NLTK NER process:

NER using NLTK

What’s next?

So, we have just learnt what Named Entity Recognition tagging is and how to use it to solve generic problems using APIs.

The natural progression from here would be to accomplish three things:

  1. Build your own NER tagger and also explore languages other than English.
  2. Build more sophisticated NER models (let’s say using Deep Learning) and evaluate how much better they perform.
  3. Take a task you encounter daily that deals with natural language, figure out a problem you want to solve, and then use everything you have learnt about NER to solve it.

I will be working along these lines and will try to share my learning in upcoming posts on NER. You can contribute as well; please tell me how you would like to do that in the comment section.

https://www.askideas.com/the-pursuit-of-knowledge-is-more-valuable-than-its-possession/

Happy learning :)

Sources