Named Entity Recognition (NER) using BERT NLP — Tamil Model

Ambareesh Kumar
8 min read · Dec 17, 2022


Named entity recognition (NER), sometimes referred to as entity segmentation, retrieval, or identification, is the task of identifying and categorizing key information (entities) in text. It is a sub-task of Information Extraction: named entities are extracted automatically by finding them in a given text (recognition) and assigning each one a type (classification).

Example:

For example, in a sentence mentioning India, Mr. Narendra Modi and the G20 summit, NER both recognizes and classifies the entities: India as LOC, Mr. Narendra Modi as PER and G20 as ORG.

LOC — Location

PER — Person

ORG — Organisation

Basic Steps In NER:

Sentence Segmentation:

Raw text first goes through the Sentence Segmentation stage. This is the process of dividing a string of written language into its component sentences, for example by splitting where punctuation marks such as periods are identified. The purpose of segmenting is to mark sentence boundaries.
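A minimal sketch of sentence segmentation, using NLTK purely for illustration (NLTK is not part of this project's code):

import nltk
nltk.download("punkt")  # pretrained sentence tokenizer models
from nltk.tokenize import sent_tokenize

text = "Hello everyone. Welcome to the World. Let us build a Tamil NER model."
print(sent_tokenize(text))
# ['Hello everyone.', 'Welcome to the World.', 'Let us build a Tamil NER model.']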

Tokenization:

Tokenization is the process of splitting a string or text into a list of tokens. Tokens can be words, numbers or punctuation marks.

Example: "Hello everyone. Welcome to the World."
The tokens for the given sentence will be ['Hello', 'everyone', 'Welcome', 'to', 'the', 'World'].
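A minimal sketch with NLTK's word tokenizer (again, only for illustration; note that in practice punctuation marks usually become tokens of their own):

import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

print(word_tokenize("Hello everyone. Welcome to the World."))
# ['Hello', 'everyone', '.', 'Welcome', 'to', 'the', 'World', '.']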

Parts of Speech Tagging (POS):

Part-of-Speech (POS) tagging is the process of assigning each word in a text a corresponding part-of-speech tag according to its context and definition. It describes the characteristic structure of lexical terms within a sentence or text, which means the POS tags can be used for making assumptions about semantics.

Example:
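The POS example in the original post is an image; a small NLTK sketch gives the same idea (NLTK is used here only for illustration and is not part of this project's code):

import nltk
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
from nltk import word_tokenize, pos_tag

print(pos_tag(word_tokenize("Narendra Modi addressed the G20 summit.")))
# prints a list of (word, tag) tuples, e.g. ('Modi', 'NNP'), ('addressed', 'VBD'), ...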

Entity Detection:

Lastly comes the Entity Detection stage. This is the process of identifying key elements in the text and classifying them into predefined categories. It is the crucial step of NER, since it fulfils the overall purpose of NER. The image below shows an example result when named entities are detected in a sentence.

Example:
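The original example is shown as an image; as a plain-text illustration of what entity detection produces, the output can be viewed as a list of (token, tag) pairs in the BIO scheme (the sentence below is assumed, built around the entities from the earlier example):

# Hypothetical entity-detection output for an assumed sentence
tagged = [
    ("Mr.", "B-PER"), ("Narendra", "I-PER"), ("Modi", "I-PER"),
    ("represented", "O"), ("India", "B-LOC"), ("at", "O"),
    ("the", "O"), ("G20", "B-ORG"), ("summit", "O"), (".", "O"),
]
for token, tag in tagged:
    print(token, tag)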

BERT:

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based architecture.

Transformers:

Google introduced the Transformer architecture in the paper "Attention Is All You Need". The Transformer uses a self-attention mechanism that is well suited to understanding language. The need for attention can be understood with a simple example: say "I went to Horsley Hills this summer and it was pretty well developed considering the last time I was there". The last word "there" refers to Horsley Hills, but to understand this, the model needs to remember the earlier parts of the sentence. To achieve this, the attention mechanism decides at each step which other parts of the input sequence are important. Simply put: "You need context!". The Transformer has an encoder-decoder architecture, built from modules that include feed-forward and attention layers. The image below is from the original paper.
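The core of the mechanism is scaled dot-product attention: each position scores every other position and takes a weighted sum of their values. Below is a minimal PyTorch sketch of that formula (an illustration only, not the multi-head implementation used inside BERT):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # score how relevant every position is to every other position
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # attention weights sum to 1 per query
    return weights @ V                   # weighted sum of the values

# toy example: a sequence of 4 tokens with 8-dimensional representations
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])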

In general, language models read the input sequence in one direction: either left to right or right to left. This kind of one-directional training works well when the goal is to predict/generate the next word. BERT, however, uses bidirectional training to gain a deeper understanding of linguistic context; it is sometimes called "non-directional" because it considers both the previous and the following tokens at the same time. BERT applies the Transformer's bidirectional training to language modelling, learning textual representations. Note that BERT is only an encoder; it has no decoder. The encoder is responsible for reading and processing the text input, while a decoder would be responsible for producing a prediction for the task.
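As a small sketch of this, the multilingual BERT encoder used later in this post simply turns a sentence into one contextual vector per token (the Tamil phrase below is just a sample input):

from transformers import AutoTokenizer, AutoModel
import torch

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

inputs = tok("வணக்கம் உலகம்", return_tensors="pt")  # "Hello world" in Tamil
with torch.no_grad():
    outputs = encoder(**inputs)
# one 768-dimensional contextual vector per token (including [CLS] and [SEP])
print(outputs.last_hidden_state.shape)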

BERT Training:

Now let's consider the main question: how does BERT achieve bidirectional training?

It uses two methods: MLM (Masked LM) and NSP (Next Sentence Prediction)

MLM (Masked Language Modeling):

Within a sequence, we randomly mask some words by replacing them with [MASK]. In the paper, 15% of the tokens were masked. The model is trained to predict these masked words using the context of the remaining words.

Example:

“I love cycling in the spring season” -> I love cycling in the [MASK] season.

The problem here is that the pretrained model sees [MASK] tokens during pretraining, but when we fine-tune it and pass real input, no [MASK] tokens are present. To reduce this mismatch, of the 15% of tokens selected for masking: 80% are replaced with the [MASK] token, 10% are replaced with a random token, and the remaining 10% are left as they are.
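A rough sketch of that 80/10/10 rule (illustrative only; the real implementation works on token ids from the vocabulary, not on whole words):

import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    masked = []
    for tok in tokens:
        if random.random() < mask_prob:              # token selected for masking
            r = random.random()
            if r < 0.8:
                masked.append("[MASK]")              # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep unchanged
        else:
            masked.append(tok)
    return masked

print(mask_tokens("I love cycling in the spring season".split(),
                  vocab=["summer", "winter", "river", "mountain"]))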

Next sentence prediction ( NSP):

To understand the relationship between two sentences, BERT is also trained with NSP. The model receives pairs of sentences as input and is trained to predict whether the second sentence actually follows the first or not. During training, the two cases are fed in a 50/50 split. The assumption is that a randomly chosen second sentence is contextually unrelated to the first.
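As a rough sketch (not taken from the original BERT code) of how such 50/50 pairs can be constructed:

import random

# Illustrative only: build NSP training pairs from a list of sentences.
sentences = ["He went to the store.", "He bought milk.", "The sky was clear.", "Cricket is popular in India."]

def make_nsp_pair(i):
    if random.random() < 0.5:
        # positive pair: the true next sentence
        return sentences[i], sentences[i + 1], "IsNext"
    # negative pair: any sentence other than this one and its true next sentence
    candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
    return sentences[i], random.choice(candidates), "NotNext"

print(make_nsp_pair(0))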

Project DataSet:

To build a model that accomplishes this task, we first need a dataset. For Tamil, we use the WikiANN dataset, which is readily available through the HuggingFace datasets module. It covers many languages, with words labelled as location (LOC), organization (ORG) and person (PER). See the dataset card for more information.

Building model:

Installing dependencies:

!pip install datasets
!pip install tokenizers
!pip install transformers

Load data:

from datasets import load_dataset
dataset = load_dataset("wikiann", "ta")
label_names = dataset["train"].features["ner_tags"].feature.names
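Before preprocessing, it helps to look at one record. Each WikiANN sample has a list of tokens and a parallel list of integer ner_tags (the exact label order can be confirmed on the dataset card):

print(dataset)      # train / validation / test splits
print(label_names)  # e.g. ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC']
sample = dataset["train"][0]
for token, tag in zip(sample["tokens"], sample["ner_tags"]):
    print(token, label_names[tag])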

Preprocessing data:

Tokenizing dataset and adjusting the labels:

  • The encode method returns the keys (input_ids, token_type_ids, attention_mask) that BERT requires.
  • Using map adds the new keys to the existing splits of the HF dataset, which eliminates the need to create a new dataset.
  • The labels need to be adjusted because a word like "Steveharvey" will be split into "Steve" and "##harvey" while there is only one original label ("B-PER") for the whole word; after alignment every subword piece gets a label (in this implementation continuation pieces inherit the word's label, and special tokens get -100 so they are ignored by the loss).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def tokenize_adjust_labels(all_samples_per_split):
    tokenized_samples = tokenizer.batch_encode_plus(all_samples_per_split["tokens"], is_split_into_words=True)
    total_adjusted_labels = []
    print(len(tokenized_samples["input_ids"]))
    for k in range(0, len(tokenized_samples["input_ids"])):
        prev_wid = -1
        word_ids_list = tokenized_samples.word_ids(batch_index=k)  # maps each subword piece to its word index
        existing_label_ids = all_samples_per_split["ner_tags"][k]
        i = -1
        adjusted_label_ids = []
        for wid in word_ids_list:
            if wid is None:
                # special tokens ([CLS], [SEP]) get -100 so they are ignored by the loss
                adjusted_label_ids.append(-100)
            elif wid != prev_wid:
                # first piece of a new word: take the next original label
                i = i + 1
                adjusted_label_ids.append(existing_label_ids[i])
                prev_wid = wid
            else:
                # continuation piece of the same word: keep that word's label
                label_name = label_names[existing_label_ids[i]]
                adjusted_label_ids.append(existing_label_ids[i])
        total_adjusted_labels.append(adjusted_label_ids)
    tokenized_samples["labels"] = total_adjusted_labels
    return tokenized_samples

tokenized_dataset = dataset.map(tokenize_adjust_labels, batched=True)
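An optional sanity check on a single sample shows the subword pieces next to their adjusted labels:

sample = tokenized_dataset["train"][0]
for token_id, label_id in zip(sample["input_ids"], sample["labels"]):
    label = "IGNORED" if label_id == -100 else label_names[label_id]
    print(tokenizer.decode([token_id]), label)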

Pad the samples per split:

  • Within each split, every sample is now a list of token ids.
  • Two samples may not have the same length, so padding is needed.
  • This will be used by the Trainer API; it is the equivalent of PyTorch's collate_fn (a quick usage check is shown after the snippet below).
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer)
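To see what the collator does, you can pad a couple of samples by hand; the extra string columns are dropped first, since the collator only expects model inputs (this check is optional):

features = [
    {k: tokenized_dataset["train"][i][k] for k in ("input_ids", "attention_mask", "labels")}
    for i in range(2)
]
batch = data_collator(features)
# shorter samples are padded; their extra label positions are set to -100
print(batch["input_ids"].shape, batch["labels"].shape)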

Integrating WANDB:

Weights & Biases (wandb) is a highly efficient platform that helps streamline tracking of model training, dataset versioning, hyperparameter optimization and visualization. It makes it easy to track the parameters of each experiment and how the losses change during each run, which speeds up debugging.

!pip install wandb
!pip install seqeval
import wandb
wandb.login()
# You can find your API key in your browser here: https://wandb.ai/authorize
# paste your API key when prompted

Loading and preparing the model for training:

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
import numpy as np
from datasets import load_metric

metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    flattened_results = {
        "overall_precision": results["overall_precision"],
        "overall_recall": results["overall_recall"],
        "overall_f1": results["overall_f1"],
        "overall_accuracy": results["overall_accuracy"],
    }
    for k in results.keys():
        if k not in flattened_results.keys():
            flattened_results[k + "_f1"] = results[k]["f1"]

    return flattened_results

model = AutoModelForTokenClassification.from_pretrained("bert-base-multilingual-cased", num_labels=len(label_names))

training_args = TrainingArguments(
    output_dir="./fine_tune_bert_output",
    evaluation_strategy="steps",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=7,
    weight_decay=0.01,
    logging_steps=1000,
    report_to="wandb",
    run_name="ep_10_tokenized_11",
    save_strategy="no",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
wandb.finish()

Performance on Test:

predictions, labels, _ = trainer.predict(tokenized_dataset["test"])
predictions = np.argmax(predictions, axis=2)
# Here the labels array contains lots of -100 values even though the corresponding labels in
# tokenized_dataset do not: during the DataCollator padding step, padding tokens are added and
# assigned the label -100 so that they get ignored in the evaluation

# Remove ignored index (special tokens)
true_predictions = [
    [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_names[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results
print(true_labels)

Saving model:

model.save_pretrained("NER_TAM")
saved_model = AutoModelForTokenClassification.from_pretrained("NER_TAM")
saved_model
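One small addition that is not in the snippet above but is worth making: saving the tokenizer into the same folder, so the directory can later be reloaded (or pushed to the Hub) on its own.

tokenizer.save_pretrained("NER_TAM")
# later: AutoTokenizer.from_pretrained("NER_TAM")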

Testing Tamil Sentence With Saved Model:

import torch

random_sentence_from_internet = ["இந்தியாவின் வெளியுறவுத்துறை அமைச்சர் திரு.ஜெய் சங்கர், ரஷ்யா-உக்ரைன் மோதலில் இந்தியாவின் நிலைப்பாட்டை தெளிவாக எடுத்துரைத்தார்."]
inputs = tokenizer(random_sentence_from_internet, is_split_into_words=True, return_tensors="pt")
print(inputs)
output = saved_model(**inputs)
predictions = torch.nn.functional.softmax(output.logits, dim=-1)
predictions = predictions.detach().numpy()
predictions = np.argmax(predictions, axis=2)
print(predictions)
pred_names = [label_names[p] for p in predictions[0]]
for index, id in enumerate(inputs["input_ids"][0]):
    print("\nID: ", id, "Decoded ID: ", tokenizer.decode(id), "\tPred: ", pred_names[index])

Link to my NER-Tamil model:

https://huggingface.co/Ambareeshkumar/BERT-Tamil

Output of Tamil NER:

Steps to upload model to Hugging Face:

Step 1: Initialize git in the model folder → git init

Step 2: git config --global user.email "youremail"

Step 3: git config --global user.name "yourname"

Step 4: git lfs install

Step 5: Move to the model directory → cd "Model_folder"

Step 6: git add .

Step 7: git commit -m "uploading file"

Step 8: git push (the folder needs a remote pointing to your Hugging Face model repository)

Step 9: finetuned_model.push_to_hub("my-awesome-model")
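Alternatively, and this is essentially what Step 9 does, the model and tokenizer can be pushed programmatically with the Hub API after logging in, which avoids most of the manual git steps:

from huggingface_hub import notebook_login
notebook_login()  # or run `huggingface-cli login` from a terminal

saved_model.push_to_hub("BERT-Tamil")
tokenizer.push_to_hub("BERT-Tamil")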
