NLP-Custom Named Entity Recognition

Sarang Mete
4 min read · Nov 1, 2022


spaCy and transformers: Complete Repo


Named Entity Recognition (NER) is one of the core NLP tasks. I won't go into the details of what NER is.

We’ll discuss how to create models to identify custom entities.

Many open-source libraries provide NER with standard entities like NAME, PLACE, TIME, ORG, etc.

However, if our domain requires entities beyond the standard ones, we need to train our own NER model to identify them.

Here, I cover two methods of building custom NER models:

  1. Custom NER using spaCy
  2. Custom NER using the transformers library

Dataset — Complete Notebook

The data was collected using my project news_api and annotated with doccano.

doccano's JSONL export is directly usable for spaCy training.

spaCy format:

{"text": "Tesla did not immediately respond with a comment.", "label": [[0, 6, "CUSTOM_ORG"]]}

# Custom entities: CUSTOM_ORG, CUSTOM_PERSON, CUSTOM_PLACE, CUSTOM_ROLE
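For reference, here is a minimal sketch of loading the doccano JSONL export into the training_data list used later (the file name is an assumption; doccano exports one JSON object per line with text and label keys):

import json

training_data = []
with open("doccano_export.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        # rename "label" to "entities" to match the training loop below
        training_data.append({"text": record["text"], "entities": record["label"]})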

Transformer format:

However, we need to convert the data to BILOU format for transformers. (Other schemes such as IOB are also accepted by transformer models.)

Example: tokens = ['I', 'am', 'Elon'], ner_tags = ['O', 'O', 'U-PERSON']

There is no need to annotate the data again: we can convert the data used for spaCy training to BILOU format using spaCy itself, as sketched after the tag list below.

Custom entities: B-CUSTOM_ORG, I-CUSTOM_ORG, L-CUSTOM_ORG, U-CUSTOM_ORG, B-CUSTOM_PERSON, I-CUSTOM_PERSON, L-CUSTOM_PERSON, U-CUSTOM_PERSON, B-CUSTOM_ROLE, I-CUSTOM_ROLE, L-CUSTOM_ROLE, U-CUSTOM_ROLE, B-CUSTOM_PLACE, I-CUSTOM_PLACE, L-CUSTOM_PLACE, U-CUSTOM_PLACE
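A minimal sketch of that conversion, assuming the training_data list from above; spaCy's offsets_to_biluo_tags maps character offsets to one BILOU tag per token (misaligned spans come back as "-" and may need cleaning):

import spacy
from spacy.training import offsets_to_biluo_tags

nlp = spacy.blank("en")
bilou_data = []
for sample in training_data:
    doc = nlp.make_doc(sample["text"])
    # convert (start, end, label) character offsets to per-token BILOU tags
    tags = offsets_to_biluo_tags(doc, sample["entities"])
    bilou_data.append({"tokens": [t.text for t in doc], "ner_tags": tags})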

Model building:

spaCy Model — Complete Notebook

1. Convert the annotated data to Doc objects, since spaCy trains on a serialized DocBin of Doc objects:
# Convert training data
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm

blank_model = spacy.blank("en")
doc_bin = DocBin()

for sample in tqdm(training_data):
    text = sample["text"]
    labels = sample["entities"]
    doc = blank_model.make_doc(text)
    ents = []
    for start, end, label in labels:
        # contract the span to token boundaries if offsets are misaligned
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skip")  # annotation could not be aligned to tokens
        else:
            ents.append(span)
    # drop overlapping spans, keeping the longest ones
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)
doc_bin.to_disk("train.spacy")

2. Update the configuration files

Run:
python -m spacy init fill-config base_config.cfg config.cfg
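After fill-config, config.cfg contains a [paths] section whose train and dev values we override on the command line in the next step. A representative excerpt (these are spaCy's defaults, not values from the repo):

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null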

3. Run the training command (note that the same file is passed as both the train and dev set here; for a real evaluation, use a held-out dev set):

python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

4. Inference output
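A minimal sketch of running inference, assuming spaCy saved the best pipeline to ./model-best under the --output directory (the sample text is from the dataset):

import spacy

nlp = spacy.load("model-best")
doc = nlp("Joerg Steinbach, the regional economy minister of Brandenburg, where Tesla has its factory near Berlin")
for ent in doc.ents:
    print(ent.text, ent.label_)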

Joerg Steinbach CUSTOM_PERSON , the regional economy minister CUSTOM_ROLE of Brandenburg CUSTOM_ORG , where Tesla CUSTOM_ORG has its factory near Berlin CUSTOM_PLACE

Transformers Model — Complete Notebook

1. Create a torch Dataset

Adding the special tokens [CLS] and [SEP] and subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may be split into two subwords. You will need to realign the tokens and labels by:

1. Mapping all tokens to their corresponding word with the word_ids method.

2. Assigning the label -100 to the special tokens [CLS] and [SEP] so the PyTorch loss function ignores them.

3. Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

https://huggingface.co/docs/transformers/tasks/token_classification#preprocess

import torch
from transformers import AutoTokenizer

class Dataset(torch.utils.data.Dataset):
    def __init__(self, X, y, max_length, id_to_label):
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
        self.X = X
        self.y = y
        self.max_length = max_length
        self.id_to_label = id_to_label

    def __getitem__(self, idx):
        return self.tokenize_and_align_labels(self.X[idx], self.y[idx])

    def __len__(self):
        return len(self.X)

    def tokenize_and_align_labels(self, x_el, y_el):
        tokenized_inputs = self.tokenizer(
            x_el,
            truncation=True,
            padding="max_length",
            is_split_into_words=True,
            max_length=self.max_length,
        )
        data = {key: torch.tensor(val) for key, val in tokenized_inputs.items()}
        ner_tags = y_el
        word_ids = tokenized_inputs.word_ids(batch_index=0)  # map each token to its word
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # special tokens ([CLS], [SEP]) and padding
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # only label the first token of a word
                label_ids.append(ner_tags[word_idx])
            else:  # remaining subwords are ignored by the loss
                label_ids.append(-100)
            previous_word_idx = word_idx
        data["labels"] = torch.tensor(label_ids)
        return data

train_dataset = Dataset(X, y, max_length, id_to_label)

2. Create a custom Trainer to handle imbalanced data

The labels are imbalanced. We could use a weighted sampler to oversample the rare classes, but let's use a different strategy instead: class weights that give more weight to the weak classes during training (a sketch of the computation follows).

For the majority class: class_weight = 1 - (class count / total samples)

For the other classes: class_weight = 1 - (class count / (total samples - majority class count))
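A minimal sketch of that computation, assuming label_ids is the flat list of integer tags over the whole training set and that the majority class is the "O" tag (the variable names are illustrative):

from collections import Counter

counts = Counter(label_ids)  # label id -> frequency
total = sum(counts.values())
majority_id, majority_count = counts.most_common(1)[0]

class_weights = []
for label_id in sorted(counts):
    if label_id == majority_id:
        w = 1 - counts[label_id] / total
    else:
        w = 1 - counts[label_id] / (total - majority_count)
    class_weights.append(w)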

# Override compute_loss so the Trainer accounts for class imbalance
# (the majority class is "O", i.e. not an entity).
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # class weights computed with the formulas above; 17 labels,
        # with the majority "O" class last
        class_weights = torch.tensor([
            0.9857723577235772, 0.8943089430894309, 0.8191056910569106,
            0.9654471544715447, 0.983739837398374, 0.9796747967479675,
            0.9939024390243902, 0.9024390243902439, 0.9654471544715447,
            0.9857723577235772, 0.983739837398374, 0.8191056910569106,
            0.9349593495934959, 0.9065040650406504, 0.9857723577235772,
            0.8943089430894309, 0.06310119276644865,
        ])
        # the weight tensor must live on the same device as the logits
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=class_weights.to(logits.device), reduction="mean"
        )
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

trainer = CustomTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,
)
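With the custom loss in place, training runs exactly as with the stock Trainer (note that, as in the spaCy run, the training set doubles as the eval set here):

trainer.train()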

3. Map each token back to its original word and build a word-to-entity map for inference, as sketched below.
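A minimal sketch of that post-processing, assuming the trained model, its tokenizer, and the id_to_label mapping from earlier (the variable names and sample words are illustrative):

import torch

words = ["I", "am", "Elon"]
inputs = tokenizer(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)[0].tolist()

word_ids = inputs.word_ids(batch_index=0)
word_to_entity = {}
for token_idx, word_idx in enumerate(word_ids):
    if word_idx is None:
        continue  # skip special tokens
    tag = id_to_label[pred_ids[token_idx]]
    # keep the first (word-initial) prediction per word, drop "O"
    if word_idx not in word_to_entity and tag != "O":
        word_to_entity[words[word_idx]] = tag

print(word_to_entity)  # e.g. {"Elon": "U-CUSTOM_PERSON"}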


I've created a complete end-to-end project covering custom NER model building through deployment. The project is production-ready. You can refer to it here.

The main challenges I’ve solved in this project:

  1. Reuse the data annotated in doccano for both spaCy and transformers, with no need to annotate twice.
  2. Handle imbalanced data for NER.
  3. Align BILOU labels with the transformer model's subword encodings.

If you liked the article or have any suggestions/comments, please share them below!

Let’s connect and discuss on LinkedIn
