Journey of Named Entity Recognition

From Rules to Prompts: up on ease of use, down on debuggability

Yogesh Haribhau Kulkarni (PhD)
Analytics Vidhya
8 min read · Apr 27, 2024


Photo by Markus Spiske on Unsplash

In today’s world, where data is abundant and comes in many forms, extracting meaningful information from unstructured text has become paramount. Information Extraction (IE) is the field that deals with the automatic extraction of structured information from unstructured sources, such as natural language text.

Components

  1. Named Entity Recognition (NER): NER is the process of identifying and classifying named entities such as people, organizations, locations, dates, and other predefined categories within unstructured text. It is a crucial component of IE as it lays the foundation for extracting structured information from raw text data.
  2. Relation Extraction: This component focuses on identifying and extracting semantic relationships between the named entities recognized by NER. These relationships can be of various types, such as person-organization affiliations, location-organization associations, or cause-effect relationships between events.
  3. Co-reference Resolution: In natural language, multiple mentions or references may refer to the same entity. Co-reference resolution aims to identify and link these references, ensuring that all mentions of an entity are correctly associated and resolved to the same underlying entity.

Of these components, this article focuses on Named Entity Recognition (NER).

Applications

  • Knowledge Base Construction: IE plays a vital role in automatically constructing and populating large-scale knowledge bases by extracting structured information from unstructured sources like websites, documents, and social media data. These knowledge bases can be used for question-answering systems, recommendation engines, and other applications that require access to structured knowledge.
  • Question Answering Systems: By extracting relevant information from text sources and organizing it into structured formats, IE enables the development of advanced question-answering systems that can provide accurate and concise responses to natural language queries.
  • Business Intelligence: IE techniques can be applied to extract valuable insights from various business documents, reports, and data sources. This extracted information can be used for competitive analysis, market research, risk assessment, and other business intelligence applications.
  • Fraud Detection: IE can be employed to analyze and extract relevant information from documents, emails, and other communication channels, aiding in the detection of fraudulent activities, such as identity theft, financial fraud, or insider trading.
  • Biomedical Research: In the biomedical domain, IE is crucial for extracting relevant information from scientific literature, clinical trial reports, and patient records. This information can be used for drug discovery, identifying potential adverse effects, or understanding disease mechanisms.
  • Social Media Monitoring: IE techniques can be applied to social media data to extract information about public sentiment, trending topics, brand mentions, and other valuable insights for marketing, reputation management, and social listening purposes.

Named Entity Recognition (NER)

NER is a crucial component of IE that identifies and classifies entities in text into predefined categories such as persons, organizations, locations, dates, and more. For example, in the sentence “Steve Jobs co-founded Apple Inc. in Cupertino, California,” NER would identify “Steve Jobs” as a person, “Apple Inc.” as an organization, and “Cupertino” and “California” as locations.

import spacy
# Load the pre-trained NER model
nlp = spacy.load("en_core_web_sm")
# Text to process
text = "Steve Jobs co-founded Apple Inc. in Cupertino, California."
# Perform NER
doc = nlp(text)
# Print the entities
for ent in doc.ents:
    print(ent.text, ent.label_)

Output

Steve Jobs PERSON
Apple Inc. ORG
Cupertino GPE
California GPE
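For a quick visual check, spaCy’s built-in displacy visualizer highlights the recognized spans in context; a minimal sketch, reusing the doc from above:

from spacy import displacy

# Highlight entity spans inline (renders in a notebook; use displacy.serve in a script)
displacy.render(doc, style="ent")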

Custom NER

While pre-trained Named Entity Recognition (NER) models, often trained on general-purpose datasets, can be useful for many common use cases, there are situations where these models may not perform optimally or fail to recognize domain-specific or specialized entities. In such scenarios, it becomes necessary to train a custom NER model tailored to the specific domain or entity types of interest. Training a custom NER model requires annotated data, where human experts have manually labeled the entities in the text corpus.
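In spaCy, for instance, such annotations are supplied as character offsets into the raw text. Below is a minimal, illustrative update step for a blank pipeline; the sentence, the offsets, and the SKILL label are made-up assumptions for this sketch:

import spacy
from spacy.training import Example

# One annotated sentence: entity spans as (start_char, end_char, label);
# the sentence, offsets, and SKILL label are illustrative assumptions
TRAIN_DATA = [
    ("We need a Python developer in Pune.",
     {"entities": [(10, 16, "SKILL"), (30, 34, "GPE")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("SKILL")
ner.add_label("GPE")

optimizer = nlp.initialize()
for text, annotations in TRAIN_DATA:
    example = Example.from_dict(nlp.make_doc(text), annotations)
    nlp.update([example], sgd=optimizer)  # one gradient step; loop over epochs in practice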

The most widely used annotation format for NER is the IOB (Inside, Outside, Beginning) format, also known as the BIO (Beginning, Inside, Outside) format. In this format, each token (word or symbol) in the text is assigned a tag that indicates its relationship to the named entities present in the text. The tags are structured as follows:

  • “O” (Outside) tag is assigned to tokens that are not part of any named entity.
  • “B-LABEL” (Beginning) tag is assigned to the first token of a named entity, where “LABEL” represents the entity type or category (e.g., B-PER for a person entity, B-ORG for an organization entity).
  • “I-LABEL” (Inside) tag is assigned to the remaining tokens within a named entity, following the token tagged with the “B-LABEL” tag.

This IOB/BIO annotation format allows the model to learn the boundaries and types of named entities during the training process. By providing a corpus of text annotated in this format, the custom NER model can learn to recognize and classify the specific entities relevant to the domain or use case, potentially improving performance compared to using a general-purpose pre-trained model.

Example IOB data

Steve B-PER
Jobs I-PER
co-founded O
Apple B-ORG
Inc. I-ORG
in O
Cupertino B-LOC
, I-LOC
California I-LOC
. O
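As an aside, spaCy exposes these tags directly on tokens (ent_iob_ and ent_type_), so IOB-style annotations can be generated from an existing pipeline; a small sketch, reusing the doc from the earlier example:

# Derive IOB tags from a processed spaCy doc:
# token.ent_iob_ is "B", "I", or "O"; token.ent_type_ holds the entity label
for token in doc:
    tag = "O" if token.ent_iob_ == "O" else token.ent_iob_ + "-" + token.ent_type_
    print(token.text, tag)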

This annotated data is then used to train machine learning models, such as Conditional Random Fields (CRFs) or deep learning architectures like Long Short-Term Memory (LSTM) networks or Transformers. During training, the model learns to map the input text sequences to the corresponding IOB tags, effectively learning to recognize and classify named entities.

After training, the custom NER model can be evaluated on a held-out test set, and its performance can be measured using metrics like precision, recall, and F1-score. If the model performs well, it can be deployed for extracting named entities from new, unseen text data within the specific domain or use case.
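For entity-level precision, recall, and F1, the seqeval library (an assumed dependency here) scores IOB tag sequences directly; a minimal sketch with made-up gold and predicted tags:

from seqeval.metrics import classification_report, f1_score

# Gold and predicted IOB sequences for two toy sentences (illustrative only)
y_true = [["B-PER", "I-PER", "O", "B-ORG", "I-ORG"], ["O", "B-LOC"]]
y_pred = [["B-PER", "I-PER", "O", "B-ORG", "O"], ["O", "B-LOC"]]

print(f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))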

Use Case: Job Postings

Let’s consider a use case of extracting relevant information from job postings, such as job titles, companies, locations, and required skills.

Rule-based Approach

The rule-based approach for NER involves defining regular expressions that capture patterns and rules specific to the named entities of interest. Python’s built-in re library provides powerful tools for working with regular expressions, allowing developers to craft intricate patterns to match and extract named entities from text data.

For example, to extract organization names from text, one could define a regular expression pattern like r'\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*(?:\s+(?:Inc|Corp|LLC|Ltd)\.?)?' to match runs of capitalized words, optionally followed by a legal entity designation like "Inc." or "LLC".

import re

text = "We are seeking an experienced Software Engineer to join our team at Google in Mountain View, California. Skills required: Python, Java, and Machine Learning."

# Define regular expressions (anchored on cue words and capitalization
# in this particular posting format)
job_title_pattern = r'([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\s+to\s'  # capitalized phrase before " to "
company_pattern = r'\bat\s+([A-Z]\w*(?:\s+[A-Z]\w*)*)\s+in\b'   # capitalized phrase between "at" and "in"
location_pattern = r'\bin\s+([A-Z]\w*(?:[\s,]+[A-Z]\w*)*)\.'    # capitalized phrase between "in" and the period
skills_pattern = r'Skills required:\s*([\w\s,]+)'               # everything after the marker

# Extract entities using regular expressions (no re.I flag, since
# capitalization is part of the patterns)
job_title = re.search(job_title_pattern, text).group(1)
company = re.search(company_pattern, text).group(1)
location = re.search(location_pattern, text).group(1)
skills = re.search(skills_pattern, text).group(1)

print("Job Title:", job_title)
print("Company:", company)
print("Location:", location)
print("Skills:", skills)

Output:

Job Title: Software Engineer
Company: Google
Location: Mountain View, California
Skills: Python, Java, and Machine Learning

Machine Learning Approach

Conditional Random Fields (CRF) is a popular machine learning algorithm for sequence labeling tasks like NER. The pycrfsuite library provides an efficient implementation of CRF.

import pycrfsuite

# Feature extractor: string features for token i of sent,
# where sent is a list of (word, label) pairs
def word2features(sent, i):
    word = sent[i][0]
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word.isupper=%s' % word.isupper(),
        'word.isdigit=%s' % word.isdigit(),
    ]
    if i > 0:
        word1 = sent[i-1][0]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.isupper=%s' % word1.isupper(),
        ])
    else:
        features.append('BOS')  # beginning of sentence

    if i < len(sent) - 1:
        word1 = sent[i+1][0]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.isupper=%s' % word1.isupper(),
        ])
    else:
        features.append('EOS')  # end of sentence

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

# Load training data
train_sents = [...]  # list of sentences, each a list of (word, IOB label) pairs
X_train = [sent2features(s) for s in train_sents]
y_train = [[label for _, label in s] for s in train_sents]

# Train CRF model
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

trainer.train('job_posting_ner.model')

# Load test data
test_sents = [...]  # list of test sentences in the same format
X_test = [sent2features(s) for s in test_sents]

# Predict NER labels
tagger = pycrfsuite.Tagger()
tagger.open('job_posting_ner.model')

y_pred = [tagger.tag(xseq) for xseq in X_test]

Deep Learning Approach

Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) well-suited for sequence labeling tasks. Let’s implement a simple LSTM model for NER using PyTorch.

import torch
import torch.nn as nn

class NERModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_layers, num_tags):
        super(NERModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, num_tags)  # *2 for the two LSTM directions

    def forward(self, x):
        embedded = self.embedding(x)       # (batch, seq_len, embedding_dim)
        lstm_out, _ = self.lstm(embedded)  # (batch, seq_len, hidden_dim * 2)
        output = self.fc(lstm_out)         # (batch, seq_len, num_tags)
        return output

# Load training data: (token-index tensor, tag-index tensor) pairs,
# both shaped (1, seq_len), built from word_to_idx / tag_to_idx vocabularies
train_data = ...

# Define model parameters
vocab_size = len(word_to_idx)
embedding_dim = 300
hidden_dim = 128
num_layers = 2
num_tags = len(tag_to_idx)
num_epochs = 10  # illustrative value

# Create the model
model = NERModel(vocab_size, embedding_dim, hidden_dim, num_layers, num_tags)

# Train the model
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for sentence, labels in train_data:
        # Forward pass
        output = model(sentence)
        loss = loss_fn(output.view(-1, num_tags), labels.view(-1))

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
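Once trained, tagging a new sentence is an argmax over the per-token scores; a brief sketch, where sentence is a (1, seq_len) tensor of token indices and idx_to_tag is an assumed inverse mapping of tag_to_idx:

# Inference: pick the highest-scoring tag for each token
model.eval()
with torch.no_grad():
    scores = model(sentence)                     # (1, seq_len, num_tags)
    pred_ids = scores.argmax(dim=-1).squeeze(0)  # (seq_len,)
pred_tags = [idx_to_tag[i.item()] for i in pred_ids]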

Generative AI Approach

With the rise of large language models like GPT-3, few-shot learning has become a powerful technique for various NLP tasks, including NER. This approach involves providing the model with a few examples of the task and then prompting it to generalize to new instances.

import openai

# Set up OpenAI API key
openai.api_key = "YOUR_API_KEY"

# Define example job posting
example_job_posting = """
We are seeking an experienced Software Engineer to join our team at Google in Mountain View, California. Skills required: Python, Java, and Machine Learning.
"""

# Define prompt with few-shot examples
prompt = """
Extract the following entities from the given job posting text:

Job Title
Company
Location
Required Skills

Examples:

Text: We are looking for a talented Data Scientist to work at Microsoft in Redmond, Washington. Required skills include SQL, Python, and Machine Learning.
Job Title: Data Scientist
Company: Microsoft
Location: Redmond, Washington
Required Skills: SQL, Python, Machine Learning

Text: A leading AI company is hiring a Machine Learning Engineer in San Francisco, CA. Skills needed: PyTorch, TensorFlow, and Deep Learning.
Job Title: Machine Learning Engineer
Company: A leading AI company
Location: San Francisco, CA
Required Skills: PyTorch, TensorFlow, Deep Learning

Text: """ + example_job_posting + """
Job Title:
Company:
Location:
Required Skills:
"""

# Send the prompt to the OpenAI API
# (legacy Completions endpoint of the pre-1.0 openai SDK with the since-retired
# text-davinci-003 model; newer SDK versions use client.chat.completions.create)
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=200,
    temperature=0.3,
    n=1,
    stop=None,
)

# Print the generated output
print(response.choices[0].text.strip())
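Because the model returns free text, a small post-processing step is usually needed to recover structured fields; one possible sketch, assuming the completion follows the "Field: value" format of the few-shot examples:

# Parse the "Field: value" lines of the completion into a dict
entities = {}
for line in response.choices[0].text.strip().splitlines():
    if ":" in line:
        key, value = line.split(":", 1)
        entities[key.strip()] = value.strip()

print(entities)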

Conclusions

Over the years, Information Extraction and Named Entity Recognition have become increasingly accessible to developers and researchers. Rule-based approaches using regular expressions provide a straightforward way to extract entities, but they can be brittle and difficult to maintain for complex use cases.

Machine learning techniques, such as Conditional Random Fields (CRF) and deep learning models like LSTMs, offer more flexibility and generalization capabilities. However, they require labeled training data and can be computationally expensive to train and deploy.

Recently, the emergence of large language models and few-shot learning has opened up new possibilities for NER, allowing developers to leverage pre-trained models without the need for extensive training data or complex model architectures.

Ultimately, the choice of approach depends on factors such as the availability of labeled data, the complexity of the use case, computational resources, and the required level of interpretability and debuggability. In many scenarios, a hybrid approach that combines the strengths of different techniques may be the most effective solution.

As the field of Information Extraction and Named Entity Recognition continues to evolve, we can expect to see more powerful and user-friendly tools and frameworks that democratize these technologies and enable their widespread adoption across various domains.



PhD in Geometric Modeling | Google Developer Expert (Machine Learning) | Top Writer 3x (Medium) | More at https://www.linkedin.com/in/yogeshkulkarni/