NLP-Custom Named Entity Recognition
Spacy and transformers : Complete Repo
Named Entity Recognition (NER) is one of the core NLP tasks. I won’t go into the details of what NER is.
We’ll discuss how to create models that identify custom entities.
Many open source libraries provide NER for standard entities such as NAME, PLACE, TIME and ORG.
However, if the domain calls for entities other than the standard ones, we need to train our own NER model to identify them.
Here, I cover two methods for custom NER:
- Custom NER using spaCy
- Custom NER using the transformers library
Dataset — Complete Notebook
The data was collected with my news_api project and annotated with doccano.
The annotations exported by doccano can be used for spaCy training.
spaCy format:
{"text": "Tesla did not immediately respond with a comment.", "label": [[0, 6, "CUSTOM_ORG"]]}
Custom entities: CUSTOM_ORG, CUSTOM_PERSON, CUSTOM_PLACE, CUSTOM_ROLE
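As a rough sketch, one line of a doccano export could be turned into a training record like this. It assumes a JSONL export in which each line carries `text` and `label` keys, as in the example above. The helper name `doccano_to_sample` and the output key `entities` are illustrative choices, not part of doccano or spaCy.

```python
import json


def doccano_to_sample(jsonl_line):
    """Turn one doccano JSONL line into a dict with 'text' and 'entities'."""
    record = json.loads(jsonl_line)
    # Each label is [start_char, end_char, entity_type]
    entities = [(start, end, tag) for start, end, tag in record["label"]]
    return {"text": record["text"], "entities": entities}


line = (
    '{"text": "Tesla did not immediately respond with a comment.", '
    '"label": [[0, 6, "CUSTOM_ORG"]]}'
)
sample = doccano_to_sample(line)
print(sample["entities"])
```

Applying the helper to every line of the export file would give a list of such records.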
Transformer format:
However, we need to convert that to BILOU format for transformers. (Other tagging schemes, such as IOB, can also be used with transformers models.)
Example: tokens = ['I', 'am', 'Elon'], ner_tags = ['O', 'O', 'U-PERSON']
There is no need to annotate again for this.
We’ll convert the data used for spaCy training to BILOU format using the spaCy library itself.
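To make the tagging scheme concrete, here is a library-free sketch of the conversion from character offsets to BILOU tags. It is a stand-in for illustration only: it splits on whitespace, which is a simplifying assumption, and the function name `biluo_tags` is hypothetical. In spaCy itself, `offsets_to_biluo_tags` from `spacy.training` does this job on a tokenized Doc.

```python
import re


def biluo_tags(text, entities):
    """Tag whitespace-separated tokens with BILOU labels from character offsets."""
    # Token boundaries as (start, end) character offsets
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of the tokens that lie fully inside the entity span
        inside = [i for i, (s, e) in enumerate(tokens)
                  if s >= ent_start and e <= ent_end]
        if not inside:
            continue
        if len(inside) == 1:
            tags[inside[0]] = f"U-{label}"      # Unit: single-token entity
        else:
            tags[inside[0]] = f"B-{label}"      # Begin
            for i in inside[1:-1]:
                tags[i] = f"I-{label}"          # Inside
            tags[inside[-1]] = f"L-{label}"     # Last
    words = [text[s:e] for s, e in tokens]
    return words, tags


words, tags = biluo_tags("I am Elon", [(5, 9, "PERSON")])
print(list(zip(words, tags)))
```

A two-token entity gets a `B-` tag followed by an `L-` tag, and longer entities get `I-` tags in between.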
Custom entities: B-CUSTOM_ORG, I-CUSTOM_ORG, L-CUSTOM_ORG, U-CUSTOM_ORG, B-CUSTOM_PERSON, I-CUSTOM_PERSON, L-CUSTOM_PERSON, U-CUSTOM_PERSON, B-CUSTOM_ROLE, I-CUSTOM_ROLE, L-CUSTOM_ROLE, U-CUSTOM_ROLE, B-CUSTOM_PLACE, I-CUSTOM_PLACE, L-CUSTOM_PLACE, U-CUSTOM_PLACE
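The tag list does not need to be typed by hand. A small sketch that builds it from the four entity types and adds the "O" tag, giving 17 labels in total. The ordering (with "O" last) is an assumption for illustration; whatever order is chosen must match the model config and any per-class weights.

```python
entity_types = ["CUSTOM_ORG", "CUSTOM_PERSON", "CUSTOM_ROLE", "CUSTOM_PLACE"]

# Four BILU-prefixed tags per entity type, plus "O" for non-entity tokens
labels = [f"{prefix}-{ent}" for ent in entity_types for prefix in "BILU"] + ["O"]

label_to_id = {label: i for i, label in enumerate(labels)}
id_to_label = {i: label for label, i in label_to_id.items()}

print(len(labels))
print(id_to_label[0], id_to_label[16])
```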
Model building:
spaCy Model - Complete Notebook
1. Convert the annotated data to Doc objects, because the spaCy training command only accepts serialized Doc objects (a .spacy file)
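# Convert training data
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans
from tqdm import tqdm

blank_model = spacy.blank("en")  # assumption: English text; only the tokenizer is needed
doc_bin = DocBin()

for sample in tqdm(training_data):
    text = sample['text']
    labels = sample['entities']
    doc = blank_model.make_doc(text)
    ents = []
    for start, end, label in labels:
        # "contract" shrinks offsets that do not fall on token boundaries
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skip")
        else:
            ents.append(span)
    # doc.ents cannot hold overlapping spans, so drop overlaps first
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)

doc_bin.to_disk("train.spacy")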
2. Update the configuration files
Run:
python -m spacy init fill-config base_config.cfg config.cfg
3. Run the training command
python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy
4. Inference output
Joerg Steinbach CUSTOM_PERSON , the regional economy minister CUSTOM_ROLE of Brandenburg CUSTOM_ORG , where Tesla CUSTOM_ORG has its factory near Berlin CUSTOM_PLACE
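An inline view similar to the one above can be produced from character-level entity spans. The sketch below assumes the spans are non-overlapping. The helper name `render_entities` is hypothetical, and the spans are written by hand here rather than predicted by a model.

```python
def render_entities(text, spans):
    """Insert each entity label right after the text it covers.

    spans: non-overlapping (start_char, end_char, label) tuples.
    """
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:end])  # text up to the end of the entity
        out.append(f" {label}")       # label placed right after the entity
        cursor = end
    out.append(text[cursor:])         # remaining text after the last entity
    return "".join(out)


# With a trained pipeline the spans could be collected like this (not run here):
#   nlp = spacy.load("model-best")  # directory written by `spacy train --output ./`
#   doc = nlp(text)
#   spans = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
text = "Tesla has its factory near Berlin"
spans = [(0, 5, "CUSTOM_ORG"), (27, 33, "CUSTOM_PLACE")]
print(render_entities(text, spans))
```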
Transformers Model — Complete Notebook
1. Create torch Dataset
Adding the special tokens [CLS] and [SEP] and subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may be split into two subwords. You will need to realign the tokens and labels by:
1. Mapping all tokens to their corresponding word with the word_ids method.
2. Assigning the label -100 to the special tokens [CLS] and [SEP] so the PyTorch loss function ignores them.
3. Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.
https://huggingface.co/docs/transformers/tasks/token_classification#preprocess
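A small worked example may help before the full Dataset class. It applies the three rules to a hand-written `word_ids` list, so it runs without a tokenizer. The subword split and the label ids are hypothetical values chosen for illustration, and `align_labels` is an illustrative helper name.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Spread word-level labels over subword tokens.

    Special tokens and non-first subtokens get ignore_index.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:                  # special token such as [CLS] or [SEP]
            aligned.append(ignore_index)
        elif word_id != previous:            # first subtoken of a word
            aligned.append(word_labels[word_id])
        else:                                # later subtoken of the same word
            aligned.append(ignore_index)
        previous = word_id
    return aligned


# Hypothetical tokenization of ["I", "am", "Elon"] where "Elon" splits in two:
# [CLS] i am el ##on [SEP]
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [16, 16, 7]  # hypothetical ids for O, O, U-CUSTOM_PERSON
print(align_labels(word_ids, word_labels))
```

Only the first subtoken of "Elon" keeps the entity label, so the loss is computed once per word.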
import torch
from transformers import AutoTokenizer

class Dataset(torch.utils.data.Dataset):
    def __init__(self, X, y, max_length, id_to_label):
        self.tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
        self.X = X  # list of token lists, one per sentence
        self.y = y  # list of integer label-id lists, aligned with X
        self.max_length = max_length
        self.id_to_label = id_to_label

    def __getitem__(self, idx):
        return self.tokenize_and_align_labels(self.X[idx], self.y[idx])

    def __len__(self):
        return len(self.X)

    def tokenize_and_align_labels(self, x_el, y_el):
        tokenized_inputs = self.tokenizer(x_el, truncation=True, padding="max_length",
                                          is_split_into_words=True, max_length=self.max_length)
        data = {key: torch.tensor(val) for key, val in tokenized_inputs.items()}
        ner_tags = y_el
        word_ids = tokenized_inputs.word_ids(batch_index=0)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:  # Set the special tokens to -100.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(ner_tags[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        data["labels"] = torch.tensor(label_ids)
        return data

train_dataset = Dataset(X, y, max_length, id_to_label)
2. Create a custom Trainer to handle imbalanced data
The labels are imbalanced. We could use a weighted sampler (such as PyTorch’s WeightedRandomSampler) to oversample the rare classes, but let’s use a different strategy instead.
We’ll create class weights that give more weight to the rare classes during training.
For the majority class: class_weight = 1 - (class count / total samples)
For the other classes: class_weight = 1 - (class count / (total samples - majority class count))
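The two formulas can be sketched in a few lines of Python. The function name `compute_class_weights` is hypothetical, and the toy label counts are made up for illustration. It assumes `label_ids` is a flat list of the label ids of all training tokens, with the -100 placeholders already removed.

```python
from collections import Counter


def compute_class_weights(label_ids, num_labels):
    """Return one weight per label id, following the two formulas above."""
    counts = Counter(label_ids)
    total = len(label_ids)
    majority_id, majority_count = counts.most_common(1)[0]
    rest = total - majority_count  # samples outside the majority class
    weights = []
    for label_id in range(num_labels):
        count = counts.get(label_id, 0)
        if label_id == majority_id:
            weights.append(1 - count / total)
        elif rest:
            weights.append(1 - count / rest)
        else:
            weights.append(1.0)  # no minority samples at all
    return weights


# Toy data: label 2 plays the role of the majority "O" class
label_ids = [2] * 90 + [0] * 4 + [1] * 6
weights = compute_class_weights(label_ids, num_labels=3)
print(weights)
```

The resulting list can be passed to `torch.tensor` and used as the `weight` argument of `torch.nn.CrossEntropyLoss`, provided the order of the weights follows the label ids.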
# Override compute_loss to inform the Trainer about class imbalance (the majority class is "O", i.e. not an entity)
import torch
from transformers import Trainer

class CustomTrainer(Trainer):
    # **kwargs absorbs extra arguments that newer transformers versions pass to compute_loss
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.get("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss (suppose one has 17 labels with different weights)
        #class_weights = torch.tensor([3.0, 3.0, 3.0,3.0, 3.0, 3.0,3.0, 3.0, 3.0,3.0, 3.0, 3.0,3.0, 3.0, 3.0,3.0, 0.2])
        # keep the weights on the same device as the logits
        class_weights = torch.tensor([0.9857723577235772, 0.8943089430894309, 0.8191056910569106, 0.9654471544715447, 0.983739837398374,
                                      0.9796747967479675, 0.9939024390243902, 0.9024390243902439, 0.9654471544715447, 0.9857723577235772, 0.983739837398374, 0.8191056910569106,
                                      0.9349593495934959, 0.9065040650406504, 0.9857723577235772, 0.8943089430894309, 0.06310119276644865],
                                     device=logits.device)
        loss_fct = torch.nn.CrossEntropyLoss(weight=class_weights, reduction='mean')
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

trainer = CustomTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,
)
3. Map each token to its original word and create a map from word to entity type for inference
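A minimal sketch of this step, under a few assumptions: `word_ids` comes from the tokenizer output for one sentence, `token_label_ids` holds one predicted label id per token (for example the argmax over the logits), and the prediction of the first subtoken decides the tag of the whole word. The function name `words_to_entities` and the sample values are hypothetical. For brevity it does not check that the entity type stays the same between the B, I and L tags.

```python
def words_to_entities(words, word_ids, token_label_ids, id_to_label):
    """Collapse token-level predictions into (entity_text, entity_type) pairs."""
    # Keep the prediction of the first subtoken of each word
    word_tags = {}
    for word_id, label_id in zip(word_ids, token_label_ids):
        if word_id is not None and word_id not in word_tags:
            word_tags[word_id] = id_to_label[label_id]

    # Merge BILOU tags into entities
    entities = []
    current = []
    for idx, word in enumerate(words):
        tag = word_tags.get(idx, "O")  # truncated words fall back to "O"
        if tag == "O":
            current = []
            continue
        prefix, ent_type = tag.split("-", 1)
        if prefix == "U":
            entities.append((word, ent_type))
            current = []
        elif prefix == "B":
            current = [word]
        elif prefix == "I" and current:
            current.append(word)
        elif prefix == "L" and current:
            current.append(word)
            entities.append((" ".join(current), ent_type))
            current = []
    return entities


# Hypothetical values: "Joerg" and "Steinbach" are each split into two subwords
words = ["Joerg", "Steinbach", "visited", "Berlin"]
id_to_label = {0: "O", 1: "B-CUSTOM_PERSON", 2: "L-CUSTOM_PERSON", 3: "U-CUSTOM_PLACE"}
word_ids = [None, 0, 0, 1, 1, 2, 3, None]
token_label_ids = [0, 1, 0, 2, 2, 0, 3, 0]
print(words_to_entities(words, word_ids, token_label_ids, id_to_label))
```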
I’ve created a complete end-to-end project that covers custom NER from model building to deployment. The project is production ready. You can refer to it here.
The main challenges I’ve solved in this project:
- Using data annotated in doccano for both spaCy and transformers, with no need to annotate the data again.
- Handling imbalanced data for NER.
- Aligning BILOU labels with the transformers model encodings.
If you liked the article or have any suggestions/comments, please share them below!
Let’s connect and discuss on LinkedIn