Named Entity Recognition (NER) with spaCy
Named Entity Recognition (NER) is an important facet of Natural Language Processing (NLP). With NER we can automatically extract entity information (relevant nouns such as people, places and organisations) from natural language, helping us derive more meaning from the text.
NER can be used to build recommendation systems, quickly extract relevant information from large bodies of text, improve customer support and even catalogue text content.
There are pre-trained models available from NLTK and spaCy for many NLP problems, including Named Entity Recognition. In this article, we will go through a gentle introduction on how to perform NER with spaCy.
It goes without saying that you will need to set up spaCy on your machine first. Please follow the instructions provided here to install spaCy on your machine.
spaCy supports multiple languages to varying levels; for a full list of models, visit this page.
spaCy supports the following entity types for models trained on the OntoNotes 5 corpus: PERSON, NORP, FAC, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LAW, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.
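If you forget what one of these labels stands for, spaCy ships a lookup via `spacy.explain` that returns a short description for a label (no trained model is needed for this). A quick sketch:

```python
import spacy

# Look up human-readable descriptions for a few entity labels
descriptions = {label: spacy.explain(label) for label in ("PERSON", "ORG", "GPE", "DATE")}
for label, desc in descriptions.items():
    print(label, "->", desc)
```

For example, `spacy.explain("GPE")` returns "Countries, cities, states".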
Let’s take a look at an example. We load the "en_core_web_lg" model for NER. The model is an English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. It assigns word vectors, context-specific token vectors, POS tags, dependency parses and named entities.
import spacy

nlp = spacy.load("en_core_web_lg")
Passing text to the model:
doc = nlp("Manchester United Football Club is a professional football club based in Manchester, England, established in 1878")
The model returns a spacy.tokens.doc.Doc object which you can iterate over. Since we did not define a custom pipeline for our model object, it performed all NLP operations supported by the model.
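To see what "iterate over a Doc" means, here is a minimal sketch using a blank English pipeline, so no model download is required (the article's examples use en_core_web_lg): a Doc behaves like a sequence of Token objects.

```python
import spacy

# A blank English pipeline is enough to demonstrate tokenisation;
# it has no trained components, just the tokenizer.
nlp_blank = spacy.blank("en")
doc = nlp_blank("Manchester United is based in Manchester.")
tokens = [token.text for token in doc]
print(tokens)
```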
Now, let's iterate through the named entities returned by the model:

for ent in doc.ents:
    print(ent.text, ent.label_)
Extracting named entities from a news article
For this example, we will be using an awesome library called newspaper to scrape a news article and perform NER on the content. The newspaper library provides a lot of functionality out of the box, like the ability to summarise an article, in addition to supporting non-English languages.
from newspaper import Article
import spacy

nlp = spacy.load("en_core_web_lg")
url = "https://techcrunch.com/2020/09/16/ios-14-is-now-available-to-download/"

article = Article(url)
article.download()  # the article must be downloaded before parsing
article.parse()

doc = nlp(article.text)
for ent in doc.ents:
    print(ent.text, ent.label_, ent.start_char, ent.end_char)
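A common follow-up to extracting entities from an article is ranking the most frequent ones. A sketch with made-up (text, label) pairs standing in for the (ent.text, ent.label_) values the loop above prints; the real values depend on the article scraped:

```python
from collections import Counter

# Hypothetical entity tuples; in practice build this list from doc.ents
entities = [("Apple", "ORG"), ("iOS 14", "PRODUCT"), ("Apple", "ORG"), ("Tim Cook", "PERSON")]
most_common = Counter(text for text, label in entities).most_common(2)
print(most_common)
```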
spaCy also provides a handy visualisation library called displacy to visualise named entities in a text. You can use displacy like so:
doc = nlp("Manchester United was founded in 1878 as Newton Heath in Manchester, England")
spacy.displacy.render(doc, style="ent")
We can even use it to visualise the dependency parse tree for a text, like so:
doc = nlp("Manchester United was founded in 1878 in Manchester, England")
spacy.displacy.render(doc, style="dep")
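Outside a Jupyter notebook, displacy.render returns raw HTML that you can save to a file and open in a browser. A minimal sketch using a blank pipeline with a hand-marked entity, so it runs without a trained model (the article's examples use en_core_web_lg; the file name is arbitrary):

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

nlp_blank = spacy.blank("en")
doc = nlp_blank("Manchester United was founded in Manchester")
# Manually mark "Manchester United" (tokens 0-2) as an ORG entity
doc.ents = [Span(doc, 0, 2, label="ORG")]

html = displacy.render(doc, style="ent")
with open("entities.html", "w", encoding="utf8") as f:
    f.write(html)
```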