NER

Build your NER data from scratch and learn the details of the NER model.

Pelin Balci
11 min read · Aug 15, 2023

Before we start, please take a look at my entire code on my GitHub: https://github.com/pelinbalci/LLM_Notebooks/blob/main/NER.ipynb. I prepared this code with the help of the Hugging Face page; you may find the link in the References.

We will dive deep into these steps:

  • Data Process & DatasetDict
  • Tokenization
  • Model & Training Arguments & Trainer
  • Create a Pipeline for inference

In some parts of this article, I wrote down the questions that came to my mind and then answered them. So feel free to leave a comment if you have any further questions! ╰(*°▽°*)╯

Data Process

We can directly use prepared datasets for NER, or we can create the data from scratch. I’ve created an Excel file that has 4 columns: Sentence_ID, Words, Labels (the original labels such as B-Sub), and ner_tags.

Let’s look at the data:

import pandas as pd

# Load the Excel file into a DataFrame
df = pd.read_excel('nerdata.xlsx')
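
If you don’t have the Excel file at hand, here is a minimal sketch that builds an equivalent DataFrame directly in code (the column names match the ones used in the snippets below; the values are the first sentence of my data):

import pandas as pd

# Build the same structure as nerdata.xlsx for a single sentence
df = pd.DataFrame({
    "Sentence_ID": [1, 1, 1],
    "Words": ["I", "Love", "you"],
    "Labels": ["B-Sub", "B-Verb", "B-Obj"],
    "ner_tags": [1, 3, 5],
})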

💄 What are the B, I, and O prefixes?

  • B: Beginning of an entity
  • I: Inside an entity (a continuation of the previous token’s entity)
  • O: Outside; the word does not belong to any entity

For example, in sentence 2 below, “You and Me” is labeled B-Sub, O, I-Sub: “You” starts the subject span, “and” is outside it, and “Me” continues it.

You can skip this part:)

I’d like to stop here for a second. What is your method of learning? Induction or deduction? This is a very important issue because inductive learners learn piece by piece and then put these pieces together. But if you are a deductivist like me, this path always seems more challenging. And, unfortunately, in most places, you can only find a tiny piece of the story. Therefore, I will explain where we want to get to in terms of data preparation in a few sentences. Inductivists, you can skip here :)

(☞゚ヮ゚)☞ Into which format do we need to turn this data frame?

  • Our initial aim is to turn each sentence into a list of words, for example: [‘I’, ‘Love’, ‘you’]. Your Excel file may look different (e.g., each sentence in a single row), but you can still use some regex functions to turn it into a list of words (see the small sketch after this list).
  • We need the original labels: B-Sub, I-Sub, O, etc.
  • The ner_tags are important; we will use them directly. Remember that these are not embedding vectors, just an integer label id for each word.
  • We then prepare a list of dictionaries and turn it into the DatasetDict format.
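
Here is a minimal sketch of that regex idea, assuming each row of your Excel file holds a whole sentence as a single string:

import re

sentence = "You and Me are going to the mall today"
words = re.findall(r"\w+", sentence)
print(words)  # ['You', 'and', 'Me', 'are', 'going', 'to', 'the', 'mall', 'today']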

Ok. Let’s continue!

We will create a dictionary:

# Create a dict for dataset
raw_data_dict = {}
for idx in list(set(df.Sentence_ID.values)):
    sentence = df[df.Sentence_ID == idx]
    raw_data_dict[idx] = {}
    raw_data_dict[idx]['words'] = list(sentence.Words.values)
    raw_data_dict[idx]['original_labels'] = list(sentence.Labels.values)
    raw_data_dict[idx]['ner_tags'] = list(sentence.ner_tags.values)
print('raw_data: ', raw_data_dict)
# Output
raw_data: {
1: {'words': ['I', 'Love', 'you'],
'original_labels': ['B-Sub', 'B-Verb', 'B-Obj'],
'ner_tags': [1, 3, 5]},
2: {'words': ['You ', 'and', 'Me', 'are ', 'going ', 'to ', 'the ', 'mall', 'today'],
'original_labels': ['B-Sub', 'O', 'I-Sub', 'B-Verb', 'I-Verb', 'O', 'O', 'B-Obj', 'I-Obj'],
'ner_tags': [1, 0, 2, 3, 4, 0, 0, 5, 6]},
3: {'words': ['When', 'what', 'how', 'are ', 'some ', 'of', 'the ', 'question', 'words', 'in ', 'English'],
'original_labels': ['B-Sub', 'I-Sub', 'I-Sub', 'B-Verb', 'O', 'O', 'O', 'B-Obj', 'I-Obj', 'O', 'I-Obj'],
'ner_tags': [1, 2, 2, 3, 0, 0, 0, 5, 6, 0, 6]},
4: {'words': ['Jane', 'and', 'I ', 'will', 'go ', 'to ', 'the ', 'cinema', 'today'], 'original_labels': ['B-Sub', 'O',
'I-Sub', 'B-Verb', 'I-Verb', 'O', 'O', 'B-Obj', 'I-Obj'],
'ner_tags': [1, 0, 2, 3, 4, 0, 0, 5, 6]},
5: {'words': ['Here', 'is ', 'a ', 'new', 'thought', 'I ', 'do ', 'not', 'like ', 'to ', 'learn', 'spanish'],
'original_labels': ['O', 'B-Verb', 'O', 'B-Obj', 'I-Obj', 'B-Sub', 'B-Verb', 'O', 'I-Verb', 'O', 'I-Verb', 'I-Obj'],
'ner_tags': [0, 3, 0, 5, 6, 1, 3, 0, 4, 0, 4, 6]},
6: {'words': ['She', 'is ', 'lovely'],
'original_labels': ['B-Sub', 'B-Verb', 'B-Obj'],
'ner_tags': [1, 3, 5]}
}

Let’s convert it to a list of dictionaries. (You can directly create this part from the data frame.)

# Convert raw_data to a list of dictionaries
data_list = []
for idx, data in raw_data_dict.items():
    data_list.append({
        'id': idx,
        'words': data['words'],
        'ner_tags': data['ner_tags'],
        'pos_tags': [],    # Placeholder, as our data doesn't have pos_tags
        'chunk_tags': []   # Placeholder, as our data doesn't have chunk_tags
    })
[
{'id': 1,
'words': ['I', 'Love', 'you'],
'ner_tags': [1, 3, 5],
'pos_tags': [],
'chunk_tags': []},
{'id': 2,
'words': ['You ', 'and', 'Me', 'are ', 'going ', 'to ', 'the ', 'mall', 'today'],
'ner_tags': [1, 0, 2, 3, 4, 0, 0, 5, 6],
'pos_tags': [],
'chunk_tags': []},
{'id': 3,
'words': ['When', 'what', 'how', 'are ', 'some ', 'of', 'the ', 'question', 'words', 'in ', 'English'],
'ner_tags': [1, 2, 2, 3, 0, 0, 0, 5, 6, 0, 6],
'pos_tags': [],
'chunk_tags': []},
{'id': 4,
'words': ['Jane', 'and', 'I ', 'will', 'go ', 'to ', 'the ', 'cinema', 'today'],
'ner_tags': [1, 0, 2, 3, 4, 0, 0, 5, 6],
'pos_tags': [],
'chunk_tags': []},
{'id': 5,
'words': ['Here', 'is ', 'a ', 'new', 'thought', 'I ', 'do ', 'not', 'like ', 'to ', 'learn', 'spanish'],
'ner_tags': [0, 3, 0, 5, 6, 1, 3, 0, 4, 0, 4, 6],
'pos_tags': [],
'chunk_tags': []},
{'id': 6,
'words': ['She', 'is ', 'lovely'],
'ner_tags': [1, 3, 5],
'pos_tags': [],
'chunk_tags': []}]

Then we will convert it to Hugging Face Dataset format. We will use DatasetDict.

from datasets import Dataset, DatasetDict

# Convert the list to a Hugging Face Dataset
train_dataset = Dataset.from_dict({k: [d[k] for d in data_list] for k in data_list[0]})

# Create a DatasetDict
raw_data = DatasetDict({"train": train_dataset})
print("DatasetDict: ", raw_data)
DatasetDict:  DatasetDict({
    train: Dataset({
        features: ['id', 'words', 'ner_tags', 'pos_tags', 'chunk_tags'],
        num_rows: 6
    })
})

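Since we only created a “train” split, there is no validation set for the Trainer later on. If you have enough data, a quick way to carve one out is the train_test_split method of a Dataset; here is a minimal sketch with an 80/20 split:

# Optional: split the single dataset into train and validation splits (80/20 here)
splits = train_dataset.train_test_split(test_size=0.2, seed=42)
raw_data = DatasetDict({"train": splits["train"], "validation": splits["test"]})
print(raw_data)
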
Labels

These dictionaries will be used in training:

# Get labels
label_ids = list(set(df.Labels))
label2id = {label: id for id, label in enumerate(label_ids)}
id2label = {id: label for label, id in label2id.items()}
label_ids:
['B-Verb', 'I-Sub', 'O', 'B-Obj', 'B-Sub', 'I-Verb', 'I-Obj']

label2id:
{'B-Verb': 0, 'I-Sub': 1, 'O': 2, 'B-Obj': 3, 'B-Sub': 4, 'I-Verb': 5, 'I-Obj': 6}

id2label:
{0: 'B-Verb', 1: 'I-Sub', 2: 'O', 3: 'B-Obj', 4: 'B-Sub', 5: 'I-Verb', 6: 'I-Obj'}
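
A small caveat here (my own note): set() does not guarantee a stable order, so the mapping printed above does not match the ner_tags already stored in the Excel file, where 0 is O, 1/2 are B-Sub/I-Sub, 3/4 are B-Verb/I-Verb, and 5/6 are B-Obj/I-Obj (compare with the raw_data output above). The align_labels_with_tokens function below also relies on this convention (B tags get odd ids, I tags get even ids). It is safer to define the mapping explicitly, for example:

# Define the mapping explicitly so it matches the ner_tags in the Excel file
# (0 = O, odd ids = B-*, even ids = I-*) instead of relying on set() order
label_ids = ["O", "B-Sub", "I-Sub", "B-Verb", "I-Verb", "B-Obj", "I-Obj"]
id2label = {i: label for i, label in enumerate(label_ids)}
label2id = {label: i for i, label in id2label.items()}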

Tokenizer

We will use the “bert-base-cased” model and these two functions (from the Hugging Face page listed in the References):

from transformers import AutoTokenizer


def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)

    return new_labels


def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["words"], truncation=True, padding=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs


tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Example Usage
inputs = tokenizer(raw_data["train"][0]["words"], is_split_into_words=True)
ner_tags = raw_data["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
example_tokenizeddata = align_labels_with_tokens(ner_tags, word_ids)

# Apply to all
tokenized_datasets = raw_data.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_data["train"].column_names
)

Let’s examine the example usage. We call the tokenizer and give it these words:

raw_data["train"][0]["words"] ==> ['I', 'Love', 'you']
raw_data["train"][0]["ner_tags"] ==> [1, 3, 5]

The tokenizer returns an object that contains the following features:

  • input_ids: [101, 146, 2185, 1128, 102] → Each integer in this list corresponds to a specific token; 101 represents [CLS] (the beginning-of-sequence token) and 102 represents [SEP] (the separator token).
  • token_type_ids: [0, 0, 0, 0, 0] → All values are 0, indicating that all tokens belong to the same sequence. For two-sequence inputs, tokens from the first sequence would have 0, and tokens from the second sequence would have 1.
  • attention_mask: [1, 1, 1, 1, 1] → 1 indicates that the corresponding token should be attended to by the model, and 0 indicates that the token is padding and should be ignored. Here, all tokens are actual tokens and not padding.
  • tokens: [‘[CLS]’, ‘I’, ‘Love’, ‘you’, ‘[SEP]’]
  • word_ids: [None, 0, 1, 2, None]
  • words: [None, 0, 1, 2, None] (the same information as word_ids)

The align_labels_with_tokens function creates the new labels. How?

We use the ner_tags and add -100 as the label for the special tokens at the beginning and end of the sequence ([CLS] and [SEP]). This gives us new_labels, which we store in the inputs (the tokenizer object).

  • labels: [-100,1,3,5,-100]

Remember, we added padding=True. This pads every sequence in the batch to the same length; the longest tokenized sentence in our data is 15 tokens, so the final output looks like this:

  • input_ids: [101, 146, 2185, 1128, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  • token_type_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  • attention_mask: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  • tokens: [‘[CLS]’, ‘I’, ‘Love’, ‘you’, ‘[SEP]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’, ‘[PAD]’]
  • word_ids: [None, 0, 1, 2, None, None, None, None, None, None, None, None, None, None, None]
  • words: [None, 0, 1, 2, None, None, None, None, None, None, None, None, None, None, None]
  • labels: [-100, 1, 3, 5, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
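
As a quick sanity check (my own addition, using the inputs and example_tokenizeddata variables from the example usage above), you can print each token next to its aligned label id:

# -100 marks positions that are ignored in the loss ([CLS], [SEP], padding)
for token, label_id in zip(inputs.tokens(), example_tokenizeddata):
    print(f"{token:>8} -> {label_id}")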

Model & Training Arguments & Trainer

Please take a look at the code below:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    pretrained_model_name_or_path="bert-base-cased",
    num_labels=len(label_ids),
    id2label=id2label,
    label2id=label2id,
)

from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",  # note: this expects an eval_dataset in the Trainer; use "no" if you don't pass one
    save_strategy="epoch",
    learning_rate=0.001,
    num_train_epochs=5,
    weight_decay=0.01,
    # push_to_hub=True,  # uncomment if you would like to push to the Hugging Face Hub
)

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    # eval_dataset=tokenized_datasets["validation"],  # I don't have validation data :( but you should split your data into train, validation, and test
    # data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model("./models_23")

I think you have many questions. There is too much code here! Don’t worry, we will cover them all.

🎄 What is a data collator? Do I need it?

DataCollatorForTokenClassification is a class from the transformers library that is specifically designed for token classification tasks. It takes a tokenizer as an argument and uses it to handle the padding of the sequences (from ChatGPT). It is described in the same way on the Hugging Face page: “Here our labels should be padded the exact same way as the inputs so that they stay the same size, using -100 as a value so that the corresponding predictions are ignored in the loss computation. This is all done by a DataCollatorForTokenClassification.”

Since we already add padding in the tokenize_and_align_labels function, I don’t think it is necessary here; indeed, I don’t use the data collator. Note that the tokenize_and_align_labels function on the Hugging Face page doesn’t use padding.

You can define a data collator with:

from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

Here is an example batch from the data collator:

batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
print(batch["labels"])

# Output:
tensor([[-100, 1, 3, 5, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100],
[-100, 1, 0, 2, 3, 4, 0, 0, 5, 6, -100, -100,
-100, -100, -100]])

The data collator is responsible for preparing batches of data during training or evaluation. The default data collator handles the padding of the input sequences so that all sequences in a batch have the same length, and it converts the data into PyTorch tensors (or TensorFlow tensors, depending on the backend you are using). So, do you need it? The answer is no; you can train the model without a data collator. The tokenize_and_align_labels function already preprocesses the data: it tokenizes the input sequences and aligns the entity labels with the tokens, so the Trainer can use this preprocessed data directly, without an explicit data_collator. (mixed answer from me and ChatGPT)
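
If you do want dynamic padding instead, a minimal sketch would be to drop padding=True from tokenize_and_align_labels and hand the collator to the Trainer (same variable names as above):

from transformers import DataCollatorForTokenClassification, Trainer

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    data_collator=data_collator,  # pads input_ids, attention_mask, and labels per batch
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)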

♟ What is the right way to choose a pre-trained model?

It depends on various factors:

  • Consider the language: choose a model that has been pre-trained on data in the language you are working with.
  • Check model performance: Hugging Face model cards often include performance metrics.
  • Evaluate model size and speed: larger models usually perform better but are slower.
  • Test models: use a sample dataset and compare the performance of different models (see the sketch after this list).
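
Here is a rough sketch of the last point (my own illustration, not part of the original notebook): fine-tune the same setup with a few candidate checkpoints and compare their evaluation metrics on your own data.

# Compare a few candidate checkpoints with the same training setup
for checkpoint in ["bert-base-cased", "distilbert-base-cased"]:
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=len(label_ids), id2label=id2label, label2id=label2id
    )
    # Re-run tokenize_and_align_labels with this tokenizer, build a Trainer as above,
    # call trainer.train(), and record trainer.evaluate() for each checkpoint.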

🎭 What are the Differences between Models?

This answer was prepared by ChatGPT:

Here are some of the popular NER models available in the Hugging Face Model Hub, along with a brief description of their differences:

BERT (Bidirectional Encoder Representations from Transformers):

Example: bert-base-cased

Description: BERT is a transformer-based model that has been pre-trained using a large corpus of text. It is then fine-tuned for the NER task. BERT is known for its strong performance on a wide variety of NLP tasks.

Differences: BERT models are available in various sizes (base, large) and can be cased or uncased. Cased models preserve the case of the input text, while uncased models do not.

DistilBERT:

Example: distilbert-base-cased

Description: DistilBERT is a smaller, faster, and lighter version of BERT. It is trained by distilling knowledge from a full-sized BERT model, but retains most of the original model’s performance.

Differences: It is significantly faster and requires less memory compared to BERT, making it more suitable for deployment on resource-constrained devices.

RoBERTa:

Example: roberta-base

Description: RoBERTa is a variant of BERT that is trained with more data and a modified pre-training procedure. It generally outperforms BERT on a range of NLP tasks.

Differences: RoBERTa removes BERT’s next-sentence prediction objective, uses dynamic masking, and is trained on much larger datasets.

ELECTRA:

Example: dbmdz/electra-large-discriminator-finetuned-conll03-english

Description: ELECTRA is a transformer model that, like BERT, is pre-trained using a large corpus of text. However, ELECTRA uses a different pre-training task that is more sample-efficient.

Differences: ELECTRA is trained using a discriminative task (distinguishing real tokens from fake ones) rather than a generative task, which tends to make it more efficient than models like BERT.

XLM-RoBERTa:

Example: xlm-roberta-base

Description: XLM-RoBERTa is a multilingual version of RoBERTa, pre-trained on text in about 100 languages. It is designed for multilingual NLP tasks.

Differences: XLM-RoBERTa is trained on a large multilingual corpus, making it suitable for NER tasks in multiple languages.

Longformer:

Example: allenai/longformer-base-4096

Description: Longformer is designed to process long documents, which traditional transformer models like BERT struggle with due to memory constraints.

Differences: Longformer uses a sliding window attention mechanism, allowing it to efficiently process documents with thousands of tokens.

🎯 What are save_strategy and evaluation_strategy?

Answer (adapted from ChatGPT):

save_strategy controls how often the model and its configuration are saved during training.

Possible values are:

'no': Do not save checkpoints during training.

'epoch': Save a checkpoint at the end of each epoch.

'steps': Save a checkpoint every save_steps steps during training.

evaluation_strategy controls how often the model is evaluated on the evaluation dataset during training.

Possible values are:

'no': Do not evaluate the model during training.

'epoch': Evaluate the model at the end of each epoch.

'steps': Evaluate the model every eval_steps steps during training.

🎻 How can we write compute_metrics?

We need to calculate precision, recall, and F1 scores. Here is sample code for computing the metrics:

from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
import numpy as np


def compute_metrics(p):
    predictions = np.argmax(p.predictions, axis=2)
    true_labels = p.label_ids

    # Remove ignored index (special tokens)
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, true_labels)
    ]
    true_labels = [
        [id2label[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, true_labels)
    ]

    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "classification_report": classification_report(true_labels, true_predictions),
    }
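
To see the input format seqeval expects, here is a tiny standalone check (my own example): both arguments are lists of label-string lists, one inner list per sentence.

from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-Sub", "B-Verb", "B-Obj"]]
y_pred = [["B-Sub", "B-Verb", "O"]]  # the object entity is missed
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))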

Create a Pipeline for inference

Now that we have trained the model, it is time to use it. Hugging Face has a beautiful module for this: pipeline.

from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("./models_23")

# Single prediction
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")
single_test_text = ["I", "think", "you"]
print(nlp(single_test_text))

You call the tokenizer, load your trained model, and build the pipeline. Note that although we prepared our training data as lists of words, the pipeline treats a Python list as a batch of separate texts: [“I”, “think”, “you”] is tagged as three independent one-word inputs, which is why you get three separate results below (see also the sketch further down).

Here is a sample output:

[[{'entity_group': 'Verb',
   'score': 0.24876964,
   'word': 'I',
   'start': 0,
   'end': 1}],
 [{'entity_group': 'Sub',
   'score': 0.27963677,
   'word': 'think',
   'start': 0,
   'end': 5}],
 [{'entity_group': 'Verb',
   'score': 0.24145265,
   'word': 'you',
   'start': 0,
   'end': 3}]]

As you see, these are all wrong :) I trained the model with only 6 sentences and only a few epochs. It would have been much better with more data and more training!
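
One last note: because the pipeline treats a Python list as a batch of separate texts, you can also pass a whole sentence as a single string. A small sketch using a sentence from the training data:

# Tag a full sentence; the pipeline handles the tokenization of the string itself
print(nlp("Jane and I will go to the cinema today"))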

Happy learning! ✨

References:
