NLP — Text Preprocessing — Practical: Named Entity Recognition (NER) — Part 5.2

Chandu Aki
The Deep Hub
Published in
6 min read · Mar 10, 2024

In the previous article, “NER Part 1,” we delved into the theoretical understanding of NER. In today’s article, we will explore its practical application.

Several tools are available for Named Entity Recognition (NER) extraction, each with its strengths and use cases. Here are some popular tools for NER extraction:

SpaCy:

  • SpaCy is an open-source natural language processing library designed for efficiency and ease of use. It provides pre-trained models for NER across multiple languages.
  • Link: SpaCy

NLTK (Natural Language Toolkit):

  • NLTK is a powerful library for working with human language data. It provides tools for various tasks, including NER, and is widely used in academia and industry.
  • Link: NLTK

Stanford NER:

  • Stanford NER is a Java-based toolkit provided by the Stanford Natural Language Processing Group. It offers pre-trained models for NER in multiple languages.
  • Link: Stanford NER

AllenNLP:

  • AllenNLP is a library built on top of PyTorch for natural language processing. It provides pre-trained models for various NLP tasks, including NER.
  • Link: AllenNLP

BERT-based Models (e.g., Hugging Face Transformers):

  • BERT (Bidirectional Encoder Representations from Transformers) and its variations have been applied to NER tasks. Hugging Face Transformers provide easy-to-use interfaces for such models.
  • Link: Hugging Face Transformers

OpenNLP:

  • Apache OpenNLP is a machine learning-based toolkit for natural language processing. It includes tools for NER, part-of-speech tagging, and more.
  • Link: OpenNLP

GATE (General Architecture for Text Engineering):

  • GATE is an open-source software framework for natural language processing and text mining. It provides components for NER and other text processing tasks.
  • Link: GATE

MITIE (MIT Information Extraction):

  • MITIE is a library for information extraction developed by the MIT Computer Science and Artificial Intelligence Laboratory. It includes tools for NER.
  • Link: MITIE

These tools vary in terms of ease of use, language support, and the types of pre-trained models they offer. The choice of a specific tool depends on the requirements of your project and the resources available.

NER using SpaCy

SpaCy provides several pre-trained models for various languages and use cases. Selecting the right model depends on your specific NLP task and the language you’re working with. Here’s a beginner-friendly guide to SpaCy’s pre-trained models:

1. English Models:

en_core_web_sm (Small):

  • Use Case: Quick and lightweight processing.
  • When to Choose: Ideal for experimentation, testing, and when resource constraints are a concern. Suitable for basic NLP tasks.
  • Usage:
import spacy 
nlp = spacy.load("en_core_web_sm")

en_core_web_md (Medium):

  • Use Case: Balanced performance for various NLP tasks.
  • When to Choose: Good choice for a wide range of applications, including named entity recognition, part-of-speech tagging, and dependency parsing.
  • Usage:
import spacy 
nlp = spacy.load("en_core_web_md")

en_core_web_lg (Large):

  • Use Case: High-performance processing with extensive vocabulary and word vectors.
  • When to Choose: Recommended for advanced NLP tasks, especially when dealing with large amounts of text and complex language structures.
  • Usage:
import spacy 
nlp = spacy.load("en_core_web_lg")

2. Multi-language Models:

xx_ent_wiki_sm (Small):

  • Use Case: Basic processing for multiple languages.
  • When to Choose: Suitable for multilingual applications where a lightweight model is sufficient.
  • Usage:
import spacy 
nlp = spacy.load("xx_ent_wiki_sm")

xx_ent_wiki_md (Medium):

  • Use Case: Balanced performance for multilingual applications.
  • When to Choose: Provides a good compromise between performance and resource requirements when working with multiple languages.
  • Usage:
import spacy 
nlp = spacy.load("xx_ent_wiki_md")

3. Custom Models:

Creating a Custom Model:

  • Use Case: Tailor the model to your specific domain or language requirements.
  • When to Choose: When working with specialized text, such as legal documents or domain-specific jargon. Requires training on a labeled dataset.
  • Usage:
import spacy 
from spacy.lang.en import English

# Create a blank English model
nlp_custom = English()

# Add the Named Entity Recognition component (spaCy v3 API)
ner = nlp_custom.add_pipe("ner")

# Train the model on a labeled dataset

Choosing Based on Task:

  • Named Entity Recognition (NER): Choose a model with the prefix en_core_web (e.g., en_core_web_sm) or xx_ent_wiki based on language preference and resource requirements.
  • Other NLP Tasks (e.g., Tokenization, POS Tagging): Models like en_core_web_md or en_core_web_lg provide comprehensive support for various NLP tasks.
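The decision rules above can be sketched as a small helper. Note that the function and its arguments are illustrative, not part of spaCy's API; spaCy itself only needs the final model name string passed to spacy.load():

```python
# Hypothetical model-selection helper based on the guidance above.
def pick_spacy_model(language: str = "en", size: str = "sm") -> str:
    """Return a spaCy pre-trained model name for the given language/size."""
    if language == "en":
        if size not in {"sm", "md", "lg"}:
            raise ValueError("size must be 'sm', 'md', or 'lg'")
        return f"en_core_web_{size}"
    # Fall back to the small multilingual entity model for other languages
    return "xx_ent_wiki_sm"

print(pick_spacy_model("en", "md"))   # en_core_web_md
print(pick_spacy_model("de"))         # xx_ent_wiki_sm
```

The returned string would then be loaded with spacy.load(pick_spacy_model(...)), assuming the model has been downloaded first.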

Quick Start with NER:

Using a Sample Pre-trained Model:

import spacy

# Load the English NER model
nlp = spacy.load("en_core_web_sm")

# Process text and extract named entities
text = "Alan Turing is the father of Natural Language Processing."
doc = nlp(text)

# Access named entities and their labels
for ent in doc.ents:
    print(f"{ent.text} --> {ent.label_}")

Output:

Alan Turing --> PERSON
Natural Language Processing --> ORG
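A common follow-up step after extraction is grouping entities by label. This sketch uses plain (text, label) tuples standing in for the (ent.text, ent.label_) pairs from doc.ents, so it runs without the model installed:

```python
from collections import defaultdict

def group_entities(pairs):
    """Group (entity_text, label) pairs into a {label: [texts]} dict."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

# Stand-ins for the (ent.text, ent.label_) pairs yielded by doc.ents
pairs = [("Alan Turing", "PERSON"), ("Natural Language Processing", "ORG")]
print(group_entities(pairs))
# {'PERSON': ['Alan Turing'], 'ORG': ['Natural Language Processing']}
```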

SpaCy NER Summary:

  • Consider Model Size vs. Performance: Choose a model size based on the trade-off between performance and resource constraints.
  • Multilingual Support: Use xx_ent_wiki models for multilingual applications.
  • Custom Models: Consider creating a custom model for domain-specific tasks.

Selecting the right SpaCy pre-trained model is a crucial step that impacts the efficiency and accuracy of your NLP applications. Consider the task at hand, available resources, and language requirements to make an informed decision.

Stanford NER

Stanford NER provides pre-trained models for different entity classification tasks. Let’s consider a sentence and use three different models:

  • 3-class (Person, Organization, Location),
  • 4-class (adding Miscellaneous),
  • 7-class (including Date, Time, Percentage, and Money).

Supported Languages: While initially designed for English, Stanford NER has been adapted for use with other languages, including Chinese, German, Spanish, and more.

Consider the example sentence: “Stanford University, founded in 1885, is located in California. John Smith works there as a professor.”

Stanford NER 3-Class Model:

from nltk.tag import StanfordNERTagger
import nltk

# Set the path to the Stanford NER JAR file and 3-class model
jar_path = 'stanford-ner.jar'
model_path = 'english.all.3class.distsim.crf.ser.gz'

# Create the NER tagger for 3-class model
ner_tagger_3class = StanfordNERTagger(model_path, jar_path)

# Tokenize and tag entities for 3-class model
text = "Stanford University, founded in 1885, is located in California. John Smith works there as a professor."
tokens = nltk.word_tokenize(text)
entities_3class = ner_tagger_3class.tag(tokens)
print(entities_3class)

Output (3-Class):

[('Stanford', 'ORGANIZATION'), ('University', 'ORGANIZATION'), (',', 'O'), ('founded', 'O'), ('in', 'O'), ('1885', 'O'), (',', 'O'), ('is', 'O'), ('located', 'O'), ('in', 'O'), ('California', 'LOCATION'), ('.', 'O'), ('John', 'PERSON'), ('Smith', 'PERSON'), ('works', 'O'), ('there', 'O'), ('as', 'O'), ('a', 'O'), ('professor', 'O'), ('.', 'O')]

What the “O” label signifies:

  • O (Outside): This label is assigned to tokens that do not represent the beginning, inside, or end of any recognized entity. Essentially, it denotes words that are not part of the identified named entities.
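Because Stanford NER returns one label per token, reconstructing multi-word entities means merging consecutive tokens that share the same non-“O” label. A minimal sketch in plain Python:

```python
def merge_entities(tagged):
    """Merge runs of tokens with the same non-'O' label into entities."""
    entities, current, label = [], [], None
    for token, tag in tagged:
        if tag != "O" and tag == label:
            current.append(token)  # continue the current entity
        else:
            if current:  # close the entity built so far
                entities.append((" ".join(current), label))
            current, label = ([token], tag) if tag != "O" else ([], None)
    if current:  # flush the last entity
        entities.append((" ".join(current), label))
    return entities

tagged = [("Stanford", "ORGANIZATION"), ("University", "ORGANIZATION"),
          (",", "O"), ("California", "LOCATION"),
          ("John", "PERSON"), ("Smith", "PERSON")]
print(merge_entities(tagged))
# [('Stanford University', 'ORGANIZATION'), ('California', 'LOCATION'), ('John Smith', 'PERSON')]
```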

Stanford NER 4-Class Model:

# Set the path to the Stanford NER JAR file and 4-class model
model_path_4class = 'english.conll.4class.distsim.crf.ser.gz'

# Create the NER tagger for 4-class model
ner_tagger_4class = StanfordNERTagger(model_path_4class, jar_path)

# Tokenize and tag entities for 4-class model
entities_4class = ner_tagger_4class.tag(tokens)
print(entities_4class)

Output (4-Class):

[('Stanford', 'ORGANIZATION'), ('University', 'ORGANIZATION'), (',', 'O'), ('founded', 'O'), ('in', 'O'), ('1885', 'O'), (',', 'O'), ('is', 'O'), ('located', 'O'), ('in', 'O'), ('California', 'LOCATION'), ('.', 'O'), ('John', 'PERSON'), ('Smith', 'PERSON'), ('works', 'O'), ('there', 'O'), ('as', 'O'), ('a', 'O'), ('professor', 'O'), ('.', 'O')]

Stanford NER 7-Class Model:

# Set the path to the Stanford NER JAR file and 7-class model
model_path_7class = 'english.muc.7class.distsim.crf.ser.gz'

# Create the NER tagger for 7-class model
ner_tagger_7class = StanfordNERTagger(model_path_7class, jar_path)

# Tokenize and tag entities for 7-class model
entities_7class = ner_tagger_7class.tag(tokens)
print(entities_7class)

Output (7-Class):

[('Stanford', 'ORGANIZATION'), ('University', 'ORGANIZATION'), (',', 'O'), ('founded', 'O'), ('in', 'O'), ('1885', 'DATE'), (',', 'O'), ('is', 'O'), ('located', 'O'), ('in', 'O'), ('California', 'LOCATION'), ('.', 'O'), ('John', 'PERSON'), ('Smith', 'PERSON'), ('works', 'O'), ('there', 'O'), ('as', 'O'), ('a', 'O'), ('professor', 'O'), ('.', 'O')]

Explanation:

  • In the 3-class model, entities are classified into Person, Organization, and Location.
  • The 4-class model adds a Miscellaneous (MISC) class for entities that do not fit the other three; the ‘O’ label still marks tokens outside any entity.
  • The 7-class model includes additional classes for Date, Time, Percentage, and Money.
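The label sets of the three models can be written out explicitly. The sets below are assembled from the class lists described above (exact label strings are what the pre-trained models emit):

```python
# Entity label sets of the three pre-trained English Stanford NER models
STANFORD_NER_LABELS = {
    "3class": {"PERSON", "ORGANIZATION", "LOCATION"},
    "4class": {"PERSON", "ORGANIZATION", "LOCATION", "MISC"},
    "7class": {"PERSON", "ORGANIZATION", "LOCATION",
               "DATE", "TIME", "PERCENT", "MONEY"},
}

# The 3-class labels are contained in both larger models;
# MISC is unique to the 4-class model.
assert STANFORD_NER_LABELS["3class"] <= STANFORD_NER_LABELS["4class"]
assert STANFORD_NER_LABELS["3class"] <= STANFORD_NER_LABELS["7class"]
```

This explains why the 3-class and 4-class outputs for the example sentence are identical: the sentence contains no MISC entity, and only the 7-class model can tag “1885” as a DATE.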

Summary

  1. Key Functionality: NER identifies and classifies entities (e.g., persons, organizations, locations) in text, providing valuable insights into unstructured data.
  2. Practical Usage: Stanford NER and SpaCy are prominent tools for practical NER applications, offering pre-trained models for various languages.
  3. Stanford NER Models: Stanford NER provides models like the 3-class, 4-class, and 7-class, each serving specific entity recognition needs.
  4. SpaCy for NER: SpaCy, with models like en_core_web_sm, provides an intuitive interface for efficient and accurate NER tasks.
  5. Tool Flexibility: Depending on use cases, toolkits like NLTK can also be utilized for NER, offering flexibility and adaptability.
  6. Information Extraction: NER supports information extraction by pinpointing entities in text, enhancing applications like question answering and knowledge extraction.
  7. Wide Application: NER is widely applied in diverse domains, including information retrieval, chatbots, and sentiment analysis, where entity recognition is critical.
  8. Enhancing Understanding: By recognizing entities, NER contributes to a deeper understanding of unstructured text, enabling meaningful insights for downstream applications.

The choice between Stanford NER, SpaCy, NLTK, or other toolkits depends on specific use cases, language support, and model performance requirements. NER tools continue to evolve, with updates, new models, and advancements enhancing their capabilities for accurate entity recognition.


Aspiring Data Scientist | Dynamic Data Analyst | Sales Analytics Expert | AI & ML, NLP, Generative AI Enthusiast