NER with BERT in Action

Bill Huang
5 min read · Jul 30, 2019


Intro

Hello friends, this is the first post in my series "NLP in Action". In these posts, I will share how to do NLP tasks with some SOTA techniques in a "code-first" way, an idea inspired by fast.ai.

I am also looking forward to your feedback and suggestions.
My series "NLP in Action" will contain:

About NER

Named Entity Recognition (NER) is a common NLP task. The purpose of NER is to tag the words in a sentence with a set of predefined tags, in order to extract the important information in the sentence.
Here is an example:

E.g.
Sentence: “Taylor Swift will launch her new album in Apple Music.”

NER result:“Taylor[B-PER] Swift[I-PER] will[O] launch[O] her[O] new[O] album[O] in[O] Apple[B-ORG] Music[I-ORG].[O]”

PS:
[O] means the token is not part of any entity
[B-PER]/[I-PER] mark the beginning/inside of a person name
[B-ORG]/[I-ORG] mark the beginning/inside of an organization name

In NER, each token in the sentence gets tagged with a label, and the label tells us the specific meaning of that token.
Through NER, we can therefore analyze a sentence in more detail and extract its important information.

NER Approaches

There are 2 popular approaches to doing NER:
- Multi-class Classification-based
- CRF based

Multi-class Classification-based
As introduced above, in an NER task each token in the sentence gets a label, so we can treat the NER process as multi-class classification and use text classification methods to label each token.

In this post, I will use this method to do NER.

CRF based
A Conditional Random Field (CRF) is a probabilistic graphical model. When doing NER with a CRF, the model labels each token while taking its context into account: it predicts sequences of labels for the sequence of sentence tokens and picks the most reasonable one.

CRF is a very popular method for the NER task, and I am looking forward to sharing it in a later post.

About BERT

Bidirectional Encoder Representations from Transformers (BERT) is a language model that comes from a Google paper.
Learning from the experience of the ELMo and GPT pre-trained models, BERT applies the bidirectional training of the Transformer to language modeling. This new method gives the model a deeper sense of language context, and BERT achieved state-of-the-art results on a wide variety of NLP tasks, including Question Answering (SQuAD v1.1) and Natural Language Inference (MNLI), among others.
Using BERT for a specific task is very straightforward: we download a Google pre-trained BERT model first, then fine-tune it to fit the needs of the downstream task. BERT is essentially transfer learning for NLP.

In this post, I will show how to use BERT to do NER.

In Action

As the saying goes, "No water, no swimming, no sailing, no boating." It is better to get our hands on the code so that we can get a clearer understanding of NER.

Here, I will use the excellent transformers library developed by Hugging Face. It contains state-of-the-art pre-trained models for Natural Language Processing (NLP) like BERT, GPT, XLNet, etc.

The process of doing NER with BERT contains 4 steps:
1. Load data
2. Convert the data into training embeddings
3. Train model
4. Evaluate model performance

All the code is shown in a Jupyter notebook here.

And I will give a brief introduction of each step.

1. Load data
In order to do NER, we need a dataset in which each word of every sentence is tagged with a label.
First, we load the dataset with pandas:
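Here is a minimal sketch of this step. I assume the widely used Kaggle "Entity Annotated Corpus" layout (a ner_dataset.csv file with "Sentence #", "Word" and "Tag" columns); adapt the file name and column names to your own dataset:

```python
import pandas as pd

# Assumed layout: one row per word, with "Sentence #", "Word" and "Tag" columns.
df = pd.read_csv("ner_dataset.csv", encoding="latin1")
df = df.ffill()   # propagate the sentence id down to every row of that sentence
df.head()
```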

Then we take a look at the data and analyze the tag distribution:
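For example, we can count how many tokens carry each tag (again assuming the column names above):

```python
# Label distribution and dataset size.
print(df["Tag"].value_counts())
print("number of sentences:", df["Sentence #"].nunique())
```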

Since we will treat the NER process as multi-class classification, we need to put the training data into "token-label" form, so we parse the sentences from the dataset:

This gives us, for each sentence, a list of tokens and a matching list of labels:
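A sketch of that grouping step, still under the same column-layout assumption:

```python
# Group the flat table by sentence id, so each sentence becomes a list of tokens
# plus a matching list of labels.
grouped = df.groupby("Sentence #").apply(
    lambda s: (s["Word"].tolist(), s["Tag"].tolist())
)
sentences = [tokens for tokens, _ in grouped]
labels = [tags for _, tags in grouped]

print(sentences[0])
print(labels[0])
```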

2. Convert the data into training embeddings
After we get the data, we need to turn the text into 3 kinds of embeddings:
- token embedding
- mask word embedding
- segmentation embedding (optional)

Token embedding
To make the token embedding, we need to map each word token to its id:
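Something like the following works, building on the sentences and labels lists from step 1; the maximum length and the padding strategy are my own assumptions:

```python
from transformers import BertTokenizer

MAX_LEN = 75   # assumed maximum sentence length

# Load the *cased* tokenizer, because case matters for NER (see the training step).
tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)

# Map each word token to its vocabulary id and pad every sentence to MAX_LEN.
# (A more careful version would run tokenizer.tokenize() first so that
# out-of-vocabulary words are split into WordPiece sub-tokens.)
input_ids = []
for sent in sentences:
    ids = tokenizer.convert_tokens_to_ids(sent)[:MAX_LEN]
    input_ids.append(ids + [0] * (MAX_LEN - len(ids)))   # 0 is the [PAD] id

# The labels need integer ids too.
tag_values = sorted(set(df["Tag"].values))
tag2idx = {t: i for i, t in enumerate(tag_values)}
label_ids = []
for tags in labels:
    ids = [tag2idx[t] for t in tags][:MAX_LEN]
    label_ids.append(ids + [tag2idx["O"]] * (MAX_LEN - len(ids)))
```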

Mask word embedding
To make the mask word embedding, we use 1 to indicate a real token and 0 to indicate a padding token:
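For example, building on the padded input_ids above:

```python
# 1.0 marks a real token, 0.0 marks a padding position.
attention_masks = [[float(i != 0) for i in ids] for ids in input_ids]
```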

Segmentation embedding (optional)
To make the segmentation embedding, we just set all the tokens to 0:
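A sketch, in case you want to build it explicitly:

```python
# Every token belongs to a single sentence, so all segment ids are 0.
segment_ids = [[0] * MAX_LEN for _ in input_ids]
```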

According to BERT usage, the segmentation embedding has no effect on the model in the NER task, so we do not actually need to make a segmentation embedding for each sentence.

3. Train model
When using transfer learning like BERT, the process of training a new model on downstream data is called "fine-tuning": all we need to do is choose one of the BERT pre-trained models and use our own data to update the model's parameters to fit our downstream NLP task.
For English, BERT offers 2 kinds of models: a cased model and an uncased model.
The cased model keeps the case of the word tokens, while the uncased model lowercases all word tokens. For the NER task, the case of a word token is important: when we talk about "Apple" we usually mean the company, while "apple" usually means the fruit.
So we will choose the BERT cased model for fine-tuning. Be careful: when we preprocess the data in the earlier steps, we also need to load the BERT cased tokenizer and leave the case of the word tokens unchanged.
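Below is a minimal fine-tuning sketch with PyTorch and transformers' BertForTokenClassification; the batch size, learning rate, number of epochs and the 70/30 split are my own assumptions, not fixed settings:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import TensorDataset, DataLoader, random_split
from transformers import BertForTokenClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Cased pre-trained model with a token-classification head sized to our tag set.
model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(tag2idx)
).to(device)

# Wrap the embeddings built above and hold out 30% of the data for evaluation.
dataset = TensorDataset(
    torch.tensor(input_ids),
    torch.tensor(attention_masks),
    torch.tensor(label_ids),
)
train_size = int(0.7 * len(dataset))
train_data, test_data = random_split(dataset, [train_size, len(dataset) - train_size])
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)

optimizer = AdamW(model.parameters(), lr=3e-5)

# Standard fine-tuning loop: the model returns the loss when labels are passed in.
model.train()
for epoch in range(3):
    for ids, masks, tags in train_loader:
        optimizer.zero_grad()
        outputs = model(ids.to(device),
                        attention_mask=masks.to(device),
                        labels=tags.to(device))
        loss = outputs[0]
        loss.backward()
        optimizer.step()
```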

4. Evaluate model performance
After training a new model for NER, we want to know how good it is, so we evaluate the model on new data.
The evaluation data can be split off when we set up the training data batches earlier; it is recommended to hold out 30% of the data as test data for performance validation.
Unlike a normal text classification task, we use the F1 score as the benchmark for the NER task.
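A sketch of the evaluation loop, using the seqeval library for entity-level F1 (my choice of tooling, not the only option):

```python
from seqeval.metrics import classification_report, f1_score

# idx2tag inverts the tag2idx mapping built earlier.
idx2tag = {i: t for t, i in tag2idx.items()}
test_loader = DataLoader(test_data, batch_size=32)

model.eval()
true_tags, pred_tags = [], []
with torch.no_grad():
    for ids, masks, tags in test_loader:
        logits = model(ids.to(device), attention_mask=masks.to(device))[0]
        preds = logits.argmax(dim=-1).cpu()
        for mask, tag_row, pred_row in zip(masks, tags, preds):
            length = int(mask.sum())   # skip padding positions
            true_tags.append([idx2tag[int(t)] for t in tag_row[:length]])
            pred_tags.append([idx2tag[int(p)] for p in pred_row[:length]])

print("F1 score:", f1_score(true_tags, pred_tags))
print(classification_report(true_tags, pred_tags))
```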
After evaluation on the test data, the result may look like this:

NER evaluation result

Summary

NER is an NLP task that labels each token in a sentence; with those labels, we can understand the meaning of the text much better. To do NER, we can treat the process as multi-class classification and use BERT, a SOTA pre-trained model, to easily fine-tune a model for the NER downstream task.

Reference

While writing this post, I learned from and was inspired by these articles, thank you ^-^

1.Named Entity Recognition (NER) Meeting Industry’s Requirement by Applying state-of-the-art Deep
2.Named Entity Recognition With Bert
