Named Entity Recognition (NER) Using the Pre-Trained bert-base-NER Model in Hugging Face

Yuan An, PhD
6 min read · Oct 5, 2023


This lesson is part of a series of short tutorials about using Hugging Face. The table of contents is here.

In this lesson, we will learn how to extract four types of named entities from text using a pre-trained BERT model for the named entity recognition (NER) task. The four types of entities are: location (LOC), person (PER), organization (ORG), and miscellaneous (MISC).

Named Entity Recognition (NER) is a subtask of information extraction that classifies named entities into predefined categories. Here is an illustrative example. Given the following sentence,

“Apple Inc. plans to open a new store in San Francisco by January 2024. Tim Cook, the CEO, announced the news yesterday.”

We can label the entity segments in this sentence as follows:

  • Apple Inc. — [ORG]: this is an organization.
  • San Francisco — [LOC]: this is a location.
  • Tim Cook — [PER]: this is a person.

Other segments are labeled with [O], indicating that they are not part of any named entity.

An NER model will be able to (1) identify segments and (2) label each segment with a pre-defined tag.

You may notice that a segment of text may consist of multiple tokens. In the context of Named Entity Recognition (NER), tags are defined using the B-I-O (Begin-Inside-Outside) scheme. These prefixes help differentiate the start of an entity, its continuation, and tokens that are not part of any entity. For example, instead of a single tag [ORG] for labeling all tokens in the name of an organization, we will have tags such as [B-ORG] and [I-ORG].

Here’s a breakdown of the B-I-O scheme:

  1. B- (Begin): This prefix indicates the beginning of a named entity. If an entity is only one word long, you would still use the B- prefix. For example, for the entity “San Francisco”, “San” would be labeled [B-LOC].
  2. I- (Inside): This prefix denotes that the token is inside an entity but is not the first token of that entity. Continuing with the “San Francisco” example, “Francisco” would be labeled [I-LOC].
  3. O (Outside): Tokens that aren’t part of any named entity are labeled with the O tag.
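
Putting the scheme together on the example sentence, the entity words receive the following tags, while every other word is tagged O:

Apple → B-ORG
Inc. → I-ORG
San → B-LOC
Francisco → I-LOC
Tim → B-PER
Cook → I-PER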

Load the bert-base-NER Model

We begin by loading the bert-base-NER model from Hugging Face.

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Load the Tokenizer for the bert-base-NER Model

We load the tokenizer for the same bert-base-NER model.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
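
As a quick optional check (a minimal sketch; the exact split depends on the model's vocabulary), we can look at how the tokenizer breaks a sentence into subword tokens. This will matter later when we align predictions with the original words.

# Peek at how the tokenizer splits text into (possibly subword) tokens.
print(tokenizer.tokenize("Apple Inc. plans to open a new store in San Francisco."))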

Create an Instance of Pipeline

We create an instance of pipeline() with the model and tokenizer.

from transformers import pipeline

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

Extract Four Types of Entities: Location (LOC), Organizations (ORG), Person (PER), and Miscellaneous (MISC)

Let us prepare a text and apply the model to extract named entities.

text = "Apple Inc. plans to open a new store in San Francisco by January 2024. Tim Cook, the CEO, announced the news yesterday."

ner_results = nlp(text)
print(ner_results)

We got the following results:

[{'entity': 'B-ORG', 'score': 0.9996086, 'index': 1, 'word': 'Apple', 'start': 0, 'end': 5},
{'entity': 'I-ORG', 'score': 0.99942136, 'index': 2, 'word': 'Inc', 'start': 6, 'end': 9},
{'entity': 'B-LOC', 'score': 0.99934715, 'index': 11, 'word': 'San', 'start': 40, 'end': 43},
{'entity': 'I-LOC', 'score': 0.99942625, 'index': 12, 'word': 'Francisco', 'start': 44, 'end': 53},
{'entity': 'B-PER', 'score': 0.9997869, 'index': 18, 'word': 'Tim', 'start': 71, 'end': 74},
{'entity': 'I-PER', 'score': 0.99977297, 'index': 19, 'word': 'Cook', 'start': 75, 'end': 79}]

You may notice that the model returns the recognized entities with B-/I- prefixed tags. We can group consecutive tokens into their respective categories, such as [LOC], [PER], or [ORG]. The code for grouping tokens with prefixed tags is available in the notebook (see link below).
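
The notebook's grouping code is not reproduced here, but as a minimal sketch (assuming a reasonably recent version of transformers), the pipeline itself can merge subword pieces into whole entities when we pass the aggregation_strategy argument:

# A sketch: let the pipeline group B-/I- pieces into whole entity spans.
grouped_nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

grouped_results = grouped_nlp(text)
print(grouped_results)
# Each item now carries an 'entity_group' (e.g., 'ORG'), the merged 'word',
# an aggregated 'score', and the character 'start'/'end' offsets.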

In the rest of this lesson, we will load the CoNLL2003 dataset and evaluate the model on it.

Load the CoNLL2003 Dataset

from datasets import load_dataset

conll = load_dataset("conll2003")

The dataset is split into train, validation, and test sets. Each record has the following features: [‘id’, ‘tokens’, ‘pos_tags’, ‘chunk_tags’, ‘ner_tags’]. A sentence is split into word tokens stored in ‘tokens’, and the corresponding NER tag ids are stored in ‘ner_tags’.
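
To get a feel for the data (a quick sketch using the fields described above), we can print the splits and inspect one training record:

# Inspect the splits and one record.
print(conll)  # shows the train/validation/test splits and their sizes

sample = conll["train"][0]
print(sample["tokens"])    # list of word tokens for one sentence
print(sample["ner_tags"])  # parallel list of integer tag ids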

We will conduct the following steps to evaluate the model on the CoNLL2003 dataset:

  • Get a list of tag names defined in the dataset.
  • Apply the model to the ‘tokens’ of an instance.
  • Extract the predicted tags for each token in the instance.
  • Retrieve the list of true tags for the instance.
  • Apply seqeval to the predicted tags and true tags for evaluation.

We first illustrate the steps using a single instance, which is the record with index 12 in the test dataset.

Set the Example

Let us assign the example to the instance with index 12 in the test dataset.

example = conll['test'][12]

Get a List of Tag Names in the Dataset

We extract the list of B-I-O tag names from the CoNLL2003 dataset. The list is the same across the train, validation, and test splits, so we can read it from any of them.

tag_names = conll["test"].features["ner_tags"].feature.names

The list of tag names is:

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

Apply the Model to the ‘tokens’ of the Example

We can directly apply the pipeline to the ‘tokens’ field of the example. The pipeline() will handle the tokenization and classification.

ner_results = nlp(example['tokens'])

Extract the Predicted Tags for Each Token in the Example

From the classification results, we extract the predicted tag for each token. Because we pass a list of word tokens, the pipeline returns one list of predictions per word. The tokenizer may split a word into several subword tokens, so we align the prediction with the original word by taking the predicted tag of its first subword token. Words for which the pipeline returns no prediction are assigned ‘O’.

predictions = []
for result in ner_results:
    # result holds the predictions for one original word token
    if len(result) == 0:
        # no entity predicted for this word
        predictions.append('O')
    else:
        # use the tag of the first subword token
        predictions.append(result[0]['entity'])

Extract the True Tags for the ‘tokens’ in the Example

We can map the ids in ‘ner_tags’ to their corresponding tag names.

true_tags = [tag_names[i] for i in example['ner_tags']]

How Good is the NER Result?

Now, we can check the quality of the NER result by counting the proportion of true named entities that the model predicted correctly. For this example, the result is 2/3 ≈ 66.7%.
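
For a closer look, we can print each word with its true and predicted tag side by side (a small sketch using the variables defined above):

# Compare the true and predicted tag for every word in the example.
for token, true_tag, pred_tag in zip(example['tokens'], true_tags, predictions):
    marker = "" if true_tag == pred_tag else "  <-- mismatch"
    print(f"{token:15} true: {true_tag:8} predicted: {pred_tag}{marker}")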

To evaluate the model’s performance over the entire dataset, we will leverage the evaluate library and its seqeval metric from Hugging Face.

Put It All Together for NER on the Entire CoNLL2003 Test Set

We put the above code together and run NER over the entire test split.

from tqdm import tqdm

# the test split of CoNLL2003
test = conll['test']

true_tags_list = []
predicted_tags_list = []

for atest in tqdm(test, desc=str(len(test))):

    # add the true tags of this instance to the references
    true_tags_list.append([tag_names[id] for id in atest['ner_tags']])

    # recognize named entities in the instance's tokens
    test_ner_results = nlp(atest['tokens'])

    # extract the predicted tag for each token
    predicted_tags = []
    for result in test_ner_results:
        if len(result) == 0:
            predicted_tags.append('O')
        else:
            predicted_tags.append(result[0]['entity'])

    predicted_tags_list.append(predicted_tags)

Check that the Predictions Match the True Tags

Let us ensure that the length of the list of predicted tags is the same as the length of the list of true tags for each instance.

flag = True
for idx, apredi in enumerate(predicted_tags_list):
    if len(apredi) != len(true_tags_list[idx]):
        flag = False
        print(idx, ":", False)
if flag:
    print(True)

Apply seqeval to the Predictions and True Tags for Evaluation

First, install the evaluate and seqeval packages and import evaluate.

!pip install evaluate seqeval

import evaluate

seqeval = evaluate.load("seqeval")

Next, we evaluate the model’s performance by comparing the list of lists of predicted tags to the list of lists of true tags. Yes, predicted_tags_list is a list of lists; similarly, true_tags_list is a list of lists, with one inner list per test instance.

results = seqeval.compute(predictions=predicted_tags_list, references=true_tags_list)

print("precision:", results["overall_precision"]),
print("recall:", results["overall_recall"]),
print("f1:", results["overall_f1"]),
print("accuracy:", results["overall_accuracy"])

We got the results:

precision: 0.3569140074330386
recall: 0.49309490084985835
f1: 0.4140956062746264
accuracy: 0.9066867664477226
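
Besides the overall scores, the dictionary returned by seqeval should also contain a per-entity-type breakdown (keys such as 'PER', 'LOC', 'ORG', and 'MISC', each with its own precision, recall, f1, and entity count). A small sketch to print it:

# Print the per-entity-type scores returned by seqeval.
for key, value in results.items():
    if not key.startswith("overall_"):
        # value is a dict with 'precision', 'recall', 'f1', and 'number'
        print(key, value)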

The colab notebook is available here:

Yuan An, PhD

Faculty member in the College of Computing and Informatics at Drexel University; doing research in NLP, Machine Learning, Ontology, Knowledge Graph, and Embeddings.