A Quick Primer on Named Entity Recognition

John Naujoks
Published in The Startup
5 min read · Oct 15, 2019

What is it?

If you go back to primary school, you may remember having to diagram sentences. You break down a sentence into the different parts of speech each word represents. It may end up looking something like the image above.

Part-of-speech tagging is a Natural Language Processing (NLP) task that predicts the grammatical role of each word to help determine meaning. A proper understanding of the sentence components and how the words fit together helps us gather information from a body of text or classify it. But let’s look at these sentences:

Apples make an incredible pie.

Apple makes an incredible phone.

Though both are nouns, a person can read these sentences and easily tell the difference. The first sentence is talking about the fruit and the second about the company. This is where named entity recognition (NER) comes in. It is an information extraction task, often run alongside part-of-speech tagging, that seeks to identify things like people, organizations, and places. Essentially, anything that could be considered a specific object could be seen as an entity. Calling out these entities gives us a different way to interact with text. With NER, you could parse a sentence like this:

Example of displaCy

What can you use it for?

So if we can use systems to label words in a corpus that refer to specific entities, not just particular parts of speech, we have some good uses for them. Here are just a few examples:

  • Classification: Maybe you want to help auto-generate tags or categories for content purely based on the text without manual labeling.
  • Content recommendations: Taking the classification one step further, you can recommend content based on it referring to the same entities.
  • Comment analysis: Perhaps you have collected feedback and done some sentiment analysis to see how things are going. NER could help you dig further into whether specific named entities are affecting that feedback.
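The content recommendation idea above can be sketched in a few lines. This is only a rough illustration, assuming you have already collected a set of entities per article with NER. The article IDs, the entity sets, and the `jaccard` and `recommend` helpers are all hypothetical names made up for this example.

```python
# Hypothetical data: entity sets you might collect per article with NER.
# The article IDs and entities below are made up for illustration.
ARTICLE_ENTITIES = {
    "midsommar-review": {"Midsommar", "Ari Aster", "A24"},
    "hereditary-review": {"Hereditary", "Ari Aster", "A24", "Toni Collette"},
    "iphone-launch": {"Apple", "iPhone", "Tim Cook"},
}


def jaccard(a, b):
    """Share of entities two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def recommend(article_id, entities_by_article, top_n=2):
    """Rank the other articles by entity overlap with `article_id`."""
    target = entities_by_article[article_id]
    scores = [
        (other_id, jaccard(target, ents))
        for other_id, ents in entities_by_article.items()
        if other_id != article_id
    ]
    # Highest overlap first; drop articles with nothing in common.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(other_id, score) for other_id, score in scores if score > 0][:top_n]


print(recommend("midsommar-review", ARTICLE_ENTITIES))
```

Jaccard overlap is just one simple choice here. Anything that rewards shared entities would serve the same purpose.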

Beyond these broad general ideas, NER has potential for many applications that seek to simplify text.

NER with spaCy

One of the better packages for NLP tasks with Python is spaCy. It has a robust set of features to take you from tokenization to modeling in just a few lines of code. We will be able to use spaCy for both part-of-speech tagging and named entity recognition. To start, we are going to import spaCy and load a pre-trained model. The model we are using is one created by the spaCy team. It is a multi-task CNN model that was trained on OntoNotes (a large, manually annotated corpus of text).

import spacy
nlp = spacy.load("en_core_web_sm")

Next, we take the text we are evaluating and feed it through the model. For funzies, I am using the first few sentences of a review for the movie Midsommar. Let’s feed it in and see what one sentence looks like when tagged with parts-of-speech:

doc = nlp(
"""
In case it wasn’t already clear, Midsommar confirms that American horror specialist Ari Aster is one of the slyer jokers to arrive in the movie business for a while.
"""
)
# Print each token with its coarse POS tag, fine-grained tag, and dependency label
for token in doc:
    print(token.text, token.pos_, token.tag_, token.dep_)

Result:

In(ADP) case(NOUN) it(PRON) was(VERB) n’t(ADV) already(ADV) clear(ADJ) ,(PUNCT) Midsommar(PROPN) confirms(VERB) that(ADP)
American(ADJ) horror(NOUN) specialist(NOUN) Ari(PROPN) Aster(PROPN) is(VERB) one(NUM) of(ADP) the(DET) slyer(NOUN) jokers(NOUN) to(PART) arrive(VERB) in(ADP) the(DET) movie(NOUN) business(NOUN) for(ADP) a(DET) while(NOUN) .(PUNCT)

It may not be the prettiest to read, but you can see that we were able to get each word and symbol labeled. The parts of speech are there, but what if we want to track, across reviews for this movie, how many times the director is mentioned, or the actors, or other films? These could give an interesting look at what was said. NER results from our model are stored in the .ents attribute of our processed text. Here is a look at what was found in our text:

# Print each detected entity with its label
for ent in doc.ents:
    print(ent.text, ent.label_)

Result:

Midsommar GPE # GPE: Countries, cities, states
American NORP # NORP: Nationalities or religious or political groups.
Ari Aster PERSON # PERSON: People, including fictional
one CARDINAL # CARDINAL: Numerals that do not fall under another type.
Displayed using displaCy

Nice! It seemed to pick up on the parts that are important to us, but there is one noticeable issue. Midsommar, the title of the movie being discussed, was tagged as type ‘GPE’, which is for ‘Countries, cities, states’. The model recognized it as an entity from context, but not as the type we were hoping for. We can manually adjust it to WORK_OF_ART (Titles of books, songs, etc.) in one quick line, using the token’s index in the doc (Midsommar is token 9, because the leading newline in our text counts as token 0):

doc[9].ent_type_ = 'WORK_OF_ART'

But let’s get very clever: maybe we want to flag a word or phrase that the model just won’t pick up. Imagine a review for the movie The Thing; I doubt the title would always, if ever, get picked up. In the current doc, let’s say we want to consider “movie business” a named entity. We can set that manually with a Span and, if we want to, even make our own entity tag for it, which we can call “INDUSTRY”. We create the Span using the token indices and then add it to doc.ents. Here is how that would be done:

from spacy.tokens import Span
# Tokens 27 and 28 are "movie" and "business"; the end index (29) is exclusive
md_ent = Span(doc, 27, 29, 'INDUSTRY')
doc.ents = list(doc.ents) + [md_ent]
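Counting token indices by hand is easy to get wrong, so here is a small sketch of how the start and end indices for a Span could be looked up instead. It is plain Python and not part of spaCy: `find_phrase` is a hypothetical helper, and the token list is an assumption that mirrors the tokenization shown in the output earlier (with the leading newline as token 0).

```python
def find_phrase(tokens, phrase):
    """Return (start, end) token indices of the first match of `phrase`, or None.

    `end` is exclusive, matching how Span(doc, start, end) slices the doc.
    """
    n = len(phrase)
    if n == 0:
        return None
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            return start, start + n
    return None


# Assumed tokenization of the review sentence, following the output above.
# With spaCy you would build this list as [token.text for token in doc].
# The triple-quoted string starts with a newline, which becomes token 0.
tokens = [
    "\n", "In", "case", "it", "was", "n\u2019t", "already", "clear", ",",
    "Midsommar", "confirms", "that", "American", "horror", "specialist",
    "Ari", "Aster", "is", "one", "of", "the", "slyer", "jokers", "to",
    "arrive", "in", "the", "movie", "business", "for", "a", "while", ".", "\n",
]

print(find_phrase(tokens, ["Midsommar"]))          # (9, 10)
print(find_phrase(tokens, ["movie", "business"]))  # (27, 29)
```

The returned pair can be passed straight into Span(doc, start, end, 'INDUSTRY'). For matching many phrases at once, spaCy's own rule-based matching tools are likely a better fit than a hand-rolled loop like this.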

There you have it! This is just the starting point, so dig in and see what NER can do for your data.
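As one possible next step, the earlier idea of tracking how often the director or other films are mentioned across reviews could look roughly like this. It is a sketch only: `count_mentions` is a hypothetical helper, and the entity lists are hand-written stand-ins for what you would collect from doc.ents on each review.

```python
from collections import Counter


def count_mentions(entities, labels=("PERSON", "WORK_OF_ART")):
    """Count how often each (text, label) pair appears, for the given labels."""
    return Counter(
        (text, label) for text, label in entities if label in labels
    )


# Hypothetical entities from three reviews, hand-written for illustration.
# With spaCy, each inner list would come from
# [(ent.text, ent.label_) for ent in nlp(review).ents].
reviews_entities = [
    [("Midsommar", "WORK_OF_ART"), ("Ari Aster", "PERSON"), ("American", "NORP")],
    [("Ari Aster", "PERSON"), ("Florence Pugh", "PERSON")],
    [("Midsommar", "WORK_OF_ART"), ("Hereditary", "WORK_OF_ART"), ("Ari Aster", "PERSON")],
]

# Accumulate counts across all reviews.
totals = Counter()
for entities in reviews_entities:
    totals.update(count_mentions(entities))

for (text, label), count in totals.most_common():
    print(text, label, count)
```

Keeping the label next to the text means an entity tagged two different ways (like Midsommar as GPE versus WORK_OF_ART) shows up as two separate rows, which makes mislabels easy to spot.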
