Named Entity Recognition (NER) for Turkish with BERT

Kenan Fayoumi · Analytics Vidhya · Jun 25, 2020

Image credit: https://www.codemotion.com/magazine/dev-hub/machine-learning-dev/bert-how-google-changed-nlp-and-how-to-benefit-from-this/

Introduction

Researchers at Google AI released the paper “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” two years ago. Since then, it has gained a lot of popularity in the machine learning / natural language processing world. BERT was built on top of several successful and promising lines of work that have recently been popular in NLP, including, but not limited to, Seq2Seq architectures, the Transformer (from the “Attention Is All You Need” paper), ELMo, ULMFiT, and unsupervised language modeling. At the time of its release, BERT showed state-of-the-art performance on a wide variety of NLP tasks. In this article, we will fine-tune a pre-trained Turkish BERT model on a Turkish Named Entity Recognition (NER) dataset, using the popular HuggingFace pre-trained transformers library for the fine-tuning stage. We will also implement a known solution to BERT’s maximum sequence length problem by building overlapping sub-sequences.

In this article, I’m assuming that readers already have background knowledge on the following subjects:

  1. Named Entity Recognition (NER).
  2. Bidirectional Encoder Representations from Transformers (BERT).
  3. HuggingFace (transformers) Python library.

Focus of this article:

  1. Utilizing the HuggingFace Trainer class to easily fine-tune a BERT model for the NER task (applicable to most transformer models, not just BERT).
  2. Handling sequences longer than BERT’s MAX_LEN = 512.

HuggingFace Trainer Class:

The transformers library’s new Trainer class provides an easy way to fine-tune transformer models for known tasks such as CoNLL NER. Here are the other supported tasks. This class takes care of the training/evaluation loops, logging, model saving, etc., which makes switching to other transformer models very easy. For this purpose we will use another class, NerDataset, which handles loading and tokenization of the data.

Preprocessing

To be able to use the Trainer class for the NER task, we have to transform our data into CoNLL format as follows:

CoNLL file format

where each line has two columns, “TOKEN LABEL”, separated by a white space, and different sentences are separated by an empty line. To do this, we first create a list of tuples where each tuple is (sentence_id, token, label), then use these to initialize a Pandas DataFrame where each row represents one (sentence_id, token, label) tuple. Unfortunately, the dataset I’m using is private, so I cannot share it; I’ll just point out the format of data needed to run this code. train_docs and test_docs are lists of strings. We perform simple white-space tokenization with doc.split(); if your data is already tokenized, you can skip this step (lines 6 and 11). train_labels and test_labels are lists of lists of token-level IOB tags/labels.

print(train_docs)
###OUTPUT###
["This is one sentence",
 "here's another one"]

print(train_labels)
###OUTPUT###
[['O','O','O','O'],
 ['O','O','O']]

# check the first 10 rows
print(test_df.head(10))
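For reference, here is a minimal sketch (not the original gist) of how such a DataFrame can be built from the docs and labels above; the column names are my own choice:

import pandas as pd

def build_conll_df(docs, labels):
    # one row per token: (sentence_id, token, label)
    rows = []
    for sent_id, (doc, doc_labels) in enumerate(zip(docs, labels)):
        tokens = doc.split()  # simple white-space tokenization
        for token, label in zip(tokens, doc_labels):
            rows.append((sent_id, token, label))
    return pd.DataFrame(rows, columns=["sentence_id", "token", "label"])

train_df = build_conll_df(train_docs, train_labels)
test_df = build_conll_df(test_docs, test_labels)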

We also need a list that contains all possible labels, and a mapping from integer ids to labels (label_map).

print(labels)
###OUTPUT###
['O', 'B_PERSON', 'I_PERSON', 'B_LOCATION', 'I_LOCATION', 'B_ORGANIZATION', 'I_ORGANIZATION']

print(label_map)
###OUTPUT###
{0: 'O', 1: 'B_PERSON', 2: 'I_PERSON', 3: 'B_LOCATION', 4: 'I_LOCATION', 5: 'B_ORGANIZATION', 6: 'I_ORGANIZATION'}
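One way to build these (a sketch, assuming the label inventory is collected from the training labels; the ordering shown above, with 'O' first, may instead come from a predefined list):

labels = list(dict.fromkeys(l for doc_labels in train_labels for l in doc_labels))
label_map = {i: label for i, label in enumerate(labels)}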

NerDataset expects a .txt file for each of the train/test/dev sets. So our next step is to write these CoNLL-formatted files into a directory (“data/”) where we will keep our training and testing .txt files:
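A minimal sketch of this step (not the exact gist), assuming the DataFrame columns used above:

import os

def write_conll_file(df, path):
    # one "TOKEN LABEL" pair per line, sentences separated by an empty line
    with open(path, "w", encoding="utf-8") as f:
        for _, sentence in df.groupby("sentence_id"):
            for _, row in sentence.iterrows():
                f.write(f"{row['token']} {row['label']}\n")
            f.write("\n")

os.makedirs("data", exist_ok=True)
write_conll_file(train_df, "data/train.txt")
write_conll_file(test_df, "data/test.txt")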

NerDataset

Our Trainer object will expect the training input through an NerDataset object. We have already prepared the train.txt file, so now we need to provide the BERT tokenizer and specify the needed parameters (data directory, maximum sequence length, etc.). First, we must download utils_ner.py, which contains the definition of NerDataset.

!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/token-classification/utils_ner.py

This script must be on your Python path so you can simply import it. Now we must specify the model parameters; these are needed to initialize the BERT tokenizer and the model. We will keep the BERT model arguments in two separate Python dictionaries, one for BERT model parameters and the other for data-related parameters. The model we’re using is a cased base BERT model (BERTurk) pre-trained on a Turkish corpus of 35GB and roughly 4.4 billion tokens.
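As a sketch, the two dictionaries might look like this (the keys and values below are illustrative; data_dir and max_seq_length should match your setup):

model_args = {
    "model_name_or_path": "dbmdz/bert-base-turkish-cased",  # BERTurk (cased, base)
    "cache_dir": None,
}
data_args = {
    "data_dir": "data",
    "max_seq_length": 512,
    "overwrite_cache": True,
}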

Next, we initialize our Config and Tokenizer. The Config is used to instantiate a BERT model according to the specified arguments (e.g. number of classes) and defines the model architecture. The Tokenizer is in charge of preparing the inputs for the model. Using Auto classes here (AutoConfig, AutoTokenizer) facilitates switching to other transformer models (XLNet, RoBERTa, etc.) by just providing the right model_name_or_path argument. Since our sentences are already tokenized and labeled, retokenization by the BERT tokenizer can cause label alignment issues. We skip basic white-space tokenization by passing do_basic_tokenize=False to the tokenizer, which will then only perform WordPiece tokenization. Here, I’m also defining AutoModelForTokenClassification, which is basically our BERT model with a classification head on top for token classification.
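A sketch of this initialization, assuming the dictionaries and label_map defined above:

from transformers import AutoConfig, AutoTokenizer, AutoModelForTokenClassification

config = AutoConfig.from_pretrained(
    model_args["model_name_or_path"],
    num_labels=len(labels),
    id2label=label_map,
    label2id={label: i for i, label in label_map.items()},
)
tokenizer = AutoTokenizer.from_pretrained(
    model_args["model_name_or_path"],
    do_basic_tokenize=False,  # data is already white-space tokenized
)
model = AutoModelForTokenClassification.from_pretrained(
    model_args["model_name_or_path"],
    config=config,
)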

Now we’re ready to create our training NerDataset object. Split is used to specify the mode of the dataset we’re creating; it has three states: train, test, and dev. By specifying the mode, the object will automatically fetch the right file in data_dir (train.txt in this case).
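A sketch of creating the training dataset (the exact constructor arguments are defined in utils_ner.py, so check that file if they differ):

from utils_ner import NerDataset, Split

train_dataset = NerDataset(
    data_dir=data_args["data_dir"],
    tokenizer=tokenizer,
    labels=labels,
    model_type=config.model_type,
    max_seq_length=data_args["max_seq_length"],
    mode=Split.train,
)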

Trainer

Now we are almost ready to create our Trainer and start the training process. But first, we need to specify our training parameters. Trainer expects the training parameters through a TrainingArguments object. We will create a JSON file that holds all our training parameters, then use HfArgumentParser to parse this file and load the arguments into a TrainingArguments object.
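For example, a sketch with a few illustrative values (the file name and parameter values here are my own choices):

import json
from transformers import HfArgumentParser, TrainingArguments

train_args = {
    "output_dir": "berturk-ner",        # required argument
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "save_steps": 5000,
}
with open("train_args.json", "w") as f:
    json.dump(train_args, f)

parser = HfArgumentParser(TrainingArguments)
training_args = parser.parse_json_file("train_args.json")[0]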

Here I only specified some basic parameters (output_dir is a required argument); for the other parameters (learning rate, weight decay, etc.) I’ve simply used their default values. Here’s the full list of possible training arguments and their default values, which give you more control over logging and saving model checkpoints. Finally, we can create our Trainer object and start the training process simply by calling the .train() method.
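A minimal sketch of that last step, using the objects built above:

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()  # optionally write the fine-tuned model to output_dir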

Using Colab’s GPU, training takes around 1 hour (roughly 20 minutes per epoch).

Now we can evaluate our model’s performance on the testing dataset. We’ll start by creating an NerDataset for the testing data. Then we can obtain the last-layer outputs (logits) for our inputs; these have the shape (batch_size, seq_len, num_of_labels), i.e. the class scores for all testing examples. We then use argmax(axis=2) to get the label with the highest score for each token.
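A rough sketch of this step, assuming test_dataset is an NerDataset built with mode=Split.test and label_map maps ids to label strings:

import numpy as np

output = trainer.predict(test_dataset)
preds = np.argmax(output.predictions, axis=2)   # (num_examples, seq_len)

preds_list = []
for i in range(preds.shape[0]):
    example_preds = []
    for j in range(preds.shape[1]):
        # positions labelled -100 are padding/subword pieces and are skipped
        if output.label_ids[i, j] != -100:
            example_preds.append(label_map[preds[i, j]])
    preds_list.append(example_preds)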

Now, to evaluate these predictions, we can use the seqeval library to calculate precision, recall and F1-score. First we install it:

!pip install seqeval

Then we need to get our real labels in the same shape as our predictions, (num_of_examples, seq_len). We could use our original data, which is already in that shape (before the CoNLL transformation), but I thought it might be useful to actually reverse the CoNLL format (our test DataFrame) back into a list of label lists of shape (num_of_examples, seq_len).
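A sketch of rebuilding the gold labels from the CoNLL-style DataFrame, assuming the "sentence_id" and "label" columns used earlier:

test_labels = (
    test_df.groupby("sentence_id")["label"]
    .apply(list)
    .tolist()
)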

And now we can evaluate our model by calculating the F1-score.
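Using seqeval’s f1_score on the test_labels and preds_list built above (the same call appears again at the end of the article):

from seqeval.metrics import f1_score, classification_report

print("F1-score: {:.1%}".format(f1_score(test_labels, preds_list)))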

F1-score: 95.2%

We can also see class-level performance by using seqeval’s classification_report function. We can observe that the classes with low support (I_ORGANIZATION, I_LOCATION) actually have the lowest F1-scores.

Handling Long Sequences with BERT

One limitation of the BERT model is its maximum length constraint: you can’t process a sequence longer than 512 tokens. In our testing set, there are actually 8 sequences that haven’t been fully processed (they were cropped). One thing to note here is that even a sequence of at most 512 tokens (before WordPiece tokenization) most probably won’t be fully processed, because during WordPiece tokenization many words are split into word pieces, for example:

ORIGINAL: "Why isn't my text tokenizing"
WordPiece TOKENIZED: ['why', 'isn', "'", 't', 'my', 'text', 'token', '##izing']

So keeping this in mind, one way to check whether a sequence will be fully processed is to tokenize it and check whether the length of the tokenized sequence is less than or equal to the maximum length used to initialize BERT.

# simple example
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased", do_basic_tokenize=False)
list_of_tokens = tokenizer.tokenize("Why isn't my text tokenizing")
print(list_of_tokens)
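As a rough check (keeping in mind that BERT also adds the special [CLS] and [SEP] tokens, which take two more positions):

MAX_LEN = 512
fits = len(list_of_tokens) <= MAX_LEN - 2  # leave room for [CLS] and [SEP]
print(fits)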

In the case of our predictions, we can compare the length of each example’s gold labels to the length of its predicted labels. The predicted labels won’t exceed the maximum of 512, so when the two lengths mismatch, that example hasn’t been fully processed.
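A quick sketch of finding those examples:

# examples whose gold label sequence is longer than the predicted one were truncated
truncated_idx = [
    i for i, (gold, pred) in enumerate(zip(test_labels, preds_list))
    if len(gold) != len(pred)
]
print(truncated_idx)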

To overcome this problem, we can create overlapping sequences, where a long sequence is split into two shorter, overlapping sub-sequences. We need the overlap to provide context for both sides of the split. Then we can create a DataFrame, repeat the steps we’ve done earlier, and make predictions easily.
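A minimal sketch of the splitting idea (not the original gist; the overlap size is an illustrative choice):

OVERLAP = 50  # assumed overlap size, purely illustrative

def split_long_sequence(tokens, labels):
    # split one long example into two overlapping halves
    mid = len(tokens) // 2
    first = (tokens[:mid + OVERLAP], labels[:mid + OVERLAP])
    second = (tokens[mid - OVERLAP:], labels[mid - OVERLAP:])
    return first, second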

Next, we have to combine (reconstruct) the predicted sequence of labels at the original length from the overlapped ones. I have added comments to the code as much as I can; I hope it’s clear enough. After reconstructing the sequences, we replace the old (not fully processed) predicted labels with the ones we reconstructed. This way, all real labels and predicted ones will match in length (example-wise).
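A sketch of the reconstruction, consistent with the split sketch above (using the same OVERLAP constant): keep the first half’s predictions up to the midpoint and the second half’s predictions from the end of its overlap onward.

def merge_predictions(first_preds, second_preds, original_len):
    mid = original_len // 2
    # first half covers tokens [0, mid); second half covers [mid, original_len)
    merged = list(first_preds[:mid]) + list(second_preds[OVERLAP:])
    return merged[:original_len]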

We can then calculate the F1-score again to check for any improvement:

print("F1-score: {:.1%}".format(f1_score(test_labels, preds_list)))###OUTPUT###
F1-score: 95.4%
