Loading and Preparing a Dataset for Fine-Tuning the Pre-Trained bert-base-NER Model in Hugging Face for Named Entity Recognition (NER)

Yuan An, PhD
3 min read · Oct 8, 2023


This is a series of short tutorials about using Hugging Face. The table of contents is here.

In this lesson, we will load and prepare a dataset for fine-tuning the pre-trained bert-base-NER model for Named Entity Recognition (NER).

In Lesson 2.1, we applied a pre-trained model, bert-base-NER, to extract 4 types of pre-defined entities. In many applications, we need to extract different types of entities. To do so, we will fine-tune the pre-trained model on a dataset that defines different named entities to be extracted.

In this lesson, we begin with preparing a dataset for a fine-tuning process.

The WNUT 2017 Dataset

The Workshop on Noisy and User-generated Text (WNUT) focuses on Natural Language Processing applied to noisy user-generated text. The WNUT 2017 shared task provided data for identifying unusual, previously-unseen entities in the context of emerging discussions. We will use the WNUT 2017 dataset to fine-tune the bert-base-NER model for different entity types.

Load the WNUT 2017 Dataset

Let us begin by loading the WNUT 2017 dataset from the Hugging Face Datasets library:

from datasets import load_dataset

wnut = load_dataset("wnut_17")

The dataset has been split into train, validation, and test sets. Each instance is a dictionary with three keys: ‘id’, ‘tokens’, and ‘ner_tags’.
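As a quick check, we can print the splits and look at the first training instance (a minimal sketch; the fields it contains are ‘id’, ‘tokens’, and ‘ner_tags’):

print(wnut)
print(wnut["train"][0])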

List the Tag Names in the WNUT 2017 Dataset

Each number in ‘ner_tags’ represents an entity tag. We can convert the numbers to their tag names, which are listed as follows:

  • O
  • B-corporation
  • I-corporation
  • B-creative-work
  • I-creative-work
  • B-group
  • I-group
  • B-location
  • I-location
  • B-person
  • I-person
  • B-product
  • I-product

There are 6 entity types in total, plus the tag ‘O’ for tokens that are not part of any entity. The 6 entity types are: Corporation, Creative-Work, Group, Location, Person, and Product. The ‘B-’ prefix marks the first token of an entity, and the ‘I-’ prefix marks a token inside an entity.
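The tag names can also be retrieved programmatically from the dataset's features; a minimal sketch:

label_list = wnut["train"].features["ner_tags"].feature.names
print(label_list)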

Tokenize the Words into Subwords with the bert-base-NER Tokenizer

To fine-tune the bert-base-NER model, we need to load the bert-base-NER tokenizer to preprocess the tokens. Each instance in the dataset has a ‘tokens’ field, so the sentence appears to be tokenized already. However, it has only been split into words, not into the subwords that BERT expects. We therefore set ‘is_split_into_words’ to True so the tokenizer treats the input as pre-split words and tokenizes them into subwords.
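The tokenizer and the example record are not shown in the snippet below, so here is a minimal sketch, assuming the dslim/bert-base-NER checkpoint from Lesson 2.1 and taking the first training instance as ‘rec’:

from transformers import AutoTokenizer

# Assumption: the dslim/bert-base-NER checkpoint used in Lesson 2.1
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")

# A single record from the dataset to illustrate the tokenization
rec = wnut["train"][0]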

tokenized_result = tokenizer(rec["tokens"], is_split_into_words=True)
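To see the effect of the special tokens and the subword splitting, we can print the resulting tokens (assuming the loaded tokenizer is a fast tokenizer, which provides the tokens() method):

print(tokenized_result.tokens())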

Assign the Given Tags to Tokens after Tokenization

After we apply the tokenizer to the input, we need to assign the given NER tags to the resulting tokens. However, the tokenization process adds two special tokens, [CLS] and [SEP], to the tokenized result, and it may also split a single word into several subwords. The special and subword tokens cause a mismatch between the tokenized result and the given tags in the dataset, so we need to realign the subword tokens with the given tags before fine-tuning.

We apply the following steps for the assignment:

  • First, we map each subword token to its corresponding word. The word_ids method of the tokenized result returns, for each token, the index of the word it came from.
  • Second, we assign the special tag -100 to the special tokens [CLS] and [SEP]. The value -100 is ignored by PyTorch’s cross-entropy loss, so these positions do not contribute to training.
  • Third, for a word that was split into multiple subword tokens, we assign the original tag only to the first token and the special tag -100 to the rest of the subword tokens.

In the notebook (see link below), we create a function to realign the subword tokens with the given tags. We also use an example to illustrate the steps.
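A sketch of such a function, following the three steps above (the actual implementation in the notebook may differ in its details):

def tokenize_and_align_tags(examples):
    # Tokenize the pre-split words into subwords
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # map each token to its word index
        previous_word_id = None
        labels = []
        for word_id in word_ids:
            if word_id is None:
                labels.append(-100)           # special tokens such as [CLS] and [SEP]
            elif word_id != previous_word_id:
                labels.append(tags[word_id])  # first subword keeps the original tag
            else:
                labels.append(-100)           # remaining subwords of the same word
            previous_word_id = word_id
        all_labels.append(labels)
    tokenized_inputs["labels"] = all_labels
    return tokenized_inputs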

To apply the preprocessing function over the entire dataset, we use the Hugging Face Datasets map function. To speed it up, we set ‘batched=True’ so that multiple elements of the dataset are processed at once.

tokenized_wnut = wnut.map(tokenize_and_align_tags, batched=True)

Create a Data Collator

Finally, we create batches of examples using DataCollatorForTokenClassification, which dynamically pads both the sentences and their labels to the longest length in a batch during collation.

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Now, we have prepared the WNUT 2017 dataset. In the next lesson, we will use the dataset to fine-tune the bert-base-NER model to recognize the 6 named entities: Corporation, Creative-Work, Group, Location, Person, and Product.

The colab notebook is available here:
