How to Build a Custom NER Model with Context-Based Word Embeddings in Vernacular Languages

Akash Singh
Published in Saarthi.ai
Apr 29, 2019

Today, we have a humongous volume of unstructured data from sources like social media, news articles, conversational logs, etc. This data is full of rich information, but that information is hard to extract, which is one of the great challenges of NLP (Natural Language Processing).

This is where Named Entity Recognition comes into play. In this article, we’re going to get hands-on with NER (Named Entity Recognition). We are going to implement a custom NER model using the AllenNLP framework, on a Hindi dataset of “Weather” utterances tagged with entities.

Getting familiar with Named Entity Recognition (NER)

NER is a sequence-tagging task, where we try to fetch the contextual meaning of words by using word embeddings. We use an NER model for information extraction, to classify named entities from unstructured text into pre-defined categories. Named entities are real-world objects such as a person’s name, a location, a landmark, etc.

NER plays a key role in information extraction from documents (e.g. emails), conversational data, etc. In fact, the two major components of a conversational bot’s NLU are Intent Classification and Entity Extraction. NER in conversational agents is used for entity recognition and information extraction, to retrieve pieces of information like dates, locations, email addresses, phone numbers, etc.

Keeping the context in mind is an integral part of various NLP tasks. In NER, knowledge of the context is really important, and it cannot be captured by traditional word embeddings such as GloVe, fastText, or Word2Vec. These embeddings assign only one representation per word, while in reality the same word can have different meanings depending on where and in what context it is used. Now, we are going to see how contextualized word embeddings from an ELMo model give us this contextual information.

ELMo (Embeddings from Language Models)

ELMo architecture

ELMo understands both the meaning of the words and the context in which they are found, as opposed to GloVe embeddings, which only capture the meaning of the words and are unaware of the context.

ELMo assigns embeddings to words based on the contexts in which they are used, capturing both the word’s meaning in that context and other contextual information.

Instead of using a fixed embedding for each word as in GloVe, ELMo looks at the entire sentence before assigning each word an embedding. It uses a bi-directional LSTM, trained on a language-modelling objective, to create those word embeddings.

ELMo was a significant step towards pre-training in the context of Natural Language Processing (NLP). The ELMo LSTM can be trained on a massive dataset in any language to make a custom language model, and then be re-used as a component in other models that are tasked with NLU (Natural Language Understanding).

ELMo gains its prowess from being trained to predict the next word in a sequence of words, a task called language modelling, which is essential for NLU (Natural Language Understanding). This is convenient because we have vast amounts of text data that such a model can learn from, without the need for labels.

Contextualized word embeddings can give words different embeddings based on the meaning they carry in the context of the sentence. (Source: The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) by Jay Alammar)

The ELMo paper introduces deep contextualized word embeddings that model both complex characteristics of word use, like syntax and semantics, and how these vary across linguistic contexts. The paper illustrates how these pre-trained contextual embeddings can be added to existing models to significantly improve the state of the art across challenging NLP problems. For a deeper understanding, read the paper.
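
To make this concrete, here is a minimal sketch of how pre-trained ELMo weights can be loaded and queried with AllenNLP’s elmo module (the same library we use for training below); the file paths are placeholders for the Hindi options and weights files.

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "path/to/hindi_elmo_options.json"  # placeholder paths
weight_file = "path/to/hindi_elmo_weights.hdf5"

# one mixed (scalar-weighted) representation per token
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

sentences = [["क्या", "कल", "बारिश", "होगी"]]
character_ids = batch_to_ids(sentences)  # shape: (batch, num_tokens, 50)
output = elmo(character_ids)

# output["elmo_representations"][0] has shape (batch, num_tokens, dim),
# where dim is 1024 for the standard ELMo configuration; the same word
# gets a different vector when it appears in a different context.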

Training an NER model in AllenNLP

Folks at AllenNLP came out with the concept of contextualized word embeddings, ELMo, in their paper Deep contextualized word representations. Now, we are going to use the AllenNLP framework for our NER task.

Installation

  1. Create a conda environment with Python 3.6.
conda create -n allennlp python=3.6

2. Activate the environment.

source activate allennlp

3. Install AllenNLP from source.

git clone https://github.com/allenai/allennlp.git
cd allennlp
pip install -r requirements.txt
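
Depending on your setup, the allennlp command-line tool may not be on your path after installing only the requirements; installing the package itself from the cloned repo in editable mode (a standard pip step, not specific to this tutorial) should take care of that.

pip install --editable .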

To train a model in the AllenNLP framework, we need to implement a DatasetReader for reading the data, and to pick a model of our choice, which in this case is the CrfTagger model.

  1. Reading the data with DatasetReader 🤓

The data we have is BILOU-tagged and is in CoNLL format. CoNLL-formatted data has one word per line, with its tag separated by a space, and consecutive sentences separated by a blank line. In the BILOU tagging scheme, B stands for Beginning, I for Inside, L for Last, O for Outside, and U for Unit (a single-token entity).

क्या O
कल B-date
दोपहर L-date
में O
धूप U-weather_type
आएगी O

जयपुर U-location
में O
मौसम O
की O
स्थिति O
क्या O
है O

A DatasetReader reads a file and converts it to a collection of Instances. Here we are going to use the conll2003 dataset reader, tweaking it a little as per our requirements: we only have NER-tagged data, while that reader expects a CoNLL-2003 dataset, which has POS and chunk tags in addition to NER tags.

“A DatasetReader reads data from some location and constructs a Dataset. All parameters necessary to read the data apart from the file path should be passed to the constructor of the DatasetReader.” - AllenNLP documentation.

We have a weather dataset with three entity types: location, weather_type, and date. We need to implement the following two methods to read the data and build tokenized instances.

read()

The read() method loads the data. With AllenNLP, you can set the path for the data files (the path to a JSON file, for example). We read every text and every label from the dataset and wrap them with text_to_instance(), as shown below.

text_to_instance()

This method “does whatever tokenization or processing is necessary to go from textual input to an Instance”.
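
A minimal sketch of such a reader is shown below. It is adapted from AllenNLP’s conll2003 reader for our two-column (word, tag) format; the class name and the registration name weather_ner are illustrative choices, not part of AllenNLP itself. Note that in AllenNLP we override the internal _read() method, which the public read() method calls.

from typing import Dict, Iterator, List

from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.instance import Instance
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token


@DatasetReader.register("weather_ner")
class WeatherNerDatasetReader(DatasetReader):
    def __init__(self,
                 token_indexers: Dict[str, TokenIndexer] = None,
                 lazy: bool = False) -> None:
        super().__init__(lazy)
        self._token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        # wrap the tokens (and, at training time, the tags) into an Instance
        text_field = TextField(tokens, self._token_indexers)
        fields = {"tokens": text_field}
        if tags is not None:
            fields["tags"] = SequenceLabelField(labels=tags, sequence_field=text_field)
        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path, "r", encoding="utf-8") as data_file:
            words, tags = [], []
            for line in data_file:
                line = line.strip()
                if not line:  # a blank line marks a sentence boundary
                    if words:
                        yield self.text_to_instance([Token(w) for w in words], tags)
                        words, tags = [], []
                else:
                    word, tag = line.split()
                    words.append(word)
                    tags.append(tag)
            if words:  # last sentence, if the file has no trailing blank line
                yield self.text_to_instance([Token(w) for w in words], tags)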

2. Now let’s move on to the model part 🧠

We are going to use the CrfTagger model provided in the AllenNLP framework; we can use the model as it is. The CrfTagger encodes a sequence of text with a Seq2SeqEncoder, then uses a Conditional Random Field to predict a tag for each token in the sequence.

Here we have a Bi-LSTM + CRF model.

Although the Bi-LSTM can capture the contextual meaning of words, we still need to model the dependencies between the tags themselves. This matters for NER, where you never want (for example) a “start of a date” tag followed by an “inside a location” tag. Here the Conditional Random Field comes to the rescue and captures the dependencies between tags.

The “linear-chain” conditional random field we’ll implement has a num_tags x num_tags matrix of transition costs, where transitions[i, j] represents the likelihood of transitioning from the j-th tag to the i-th tag.

In addition to whatever tags we're trying to predict, we'll have special "start" and "end" tags that we'll add before and after each sentence, in order to capture the "transition" between sentences.

In addition, the CRF accepts an optional set of constraints that disallow “invalid” transitions (where “invalid” depends on what you’re trying to model). For example, our NER data has distinct tags that represent the beginning, middle, and end of each entity type. We would not like to allow a “beginning of a date entity” tag to be followed by an “end of location entity” tag, as explained below.

If we look at our data, in the first sentence we can combine कल (B-date) and दोपहर (L-date) to get the date कल दोपहर. However, कल (B-date) cannot be combined with धूप (U-weather_type): a B-date tag must be followed by a tag inside the same date entity, such as दोपहर (L-date). This is achieved with the help of the CRF constraints, as the sketch after the data below illustrates.

क्या O
कल B-date
दोपहर L-date
में O
धूप U-weather_type
आएगी O

जयपुर U-location
में O
मौसम O
की O
स्थिति O
क्या O
है O
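
In AllenNLP these constraints can be derived automatically from the tagging scheme with the allowed_transitions helper (the library refers to the BILOU scheme as “BIOUL”; the label ids below are illustrative):

from allennlp.modules.conditional_random_field import allowed_transitions

labels = {0: "O", 1: "B-date", 2: "I-date", 3: "L-date",
          4: "U-weather_type", 5: "U-location"}
constraints = allowed_transitions("BIOUL", labels)

# constraints is a list of permitted (from_tag_id, to_tag_id) pairs:
# B-date -> L-date is allowed, while B-date -> U-weather_type is not,
# so the CRF will never decode such an invalid tag sequence.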

For a deeper understanding of the model, you can check the resources listed at the end of this article.

Creating a Config File 🤔

We need a config file to specify everything required to train the model: the paths of the train and validation data, the fastText embeddings, and the ELMo weights and options files. The rest of the fields are self-explanatory.

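A minimal sketch of such a crf_tagger config is shown below. All paths and hyperparameters here are illustrative; the encoder input_size of 1324 assumes 300-dimensional fastText vectors concatenated with 1024-dimensional ELMo representations, and the dataset_reader type refers to the reader sketched earlier.

{
  "dataset_reader": {
    "type": "weather_ner",
    "token_indexers": {
      "tokens": { "type": "single_id" },
      "elmo": { "type": "elmo_characters" }
    }
  },
  "train_data_path": "path/to/train.txt",
  "validation_data_path": "path/to/val.txt",
  "model": {
    "type": "crf_tagger",
    "label_encoding": "BIOUL",
    "constrain_crf_decoding": true,
    "text_field_embedder": {
      "token_embedders": {
        "tokens": {
          "type": "embedding",
          "embedding_dim": 300,
          "pretrained_file": "path/to/fasttext_hindi.vec",
          "trainable": true
        },
        "elmo": {
          "type": "elmo_token_embedder",
          "options_file": "path/to/hindi_elmo_options.json",
          "weight_file": "path/to/hindi_elmo_weights.hdf5",
          "do_layer_norm": false,
          "dropout": 0.0
        }
      }
    },
    "encoder": {
      "type": "lstm",
      "input_size": 1324,
      "hidden_size": 200,
      "num_layers": 2,
      "dropout": 0.5,
      "bidirectional": true
    }
  },
  "iterator": { "type": "basic", "batch_size": 32 },
  "trainer": {
    "optimizer": { "type": "adam", "lr": 0.001 },
    "num_epochs": 40,
    "patience": 10,
    "cuda_device": -1
  }
}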

Finally, we can start training 🤯🤯

Now, we are ready to train the NER model. For this, we need a few files.

  1. Training and validation data.
  2. The Hindi ELMo weights and options files. We can use an ELMo model trained on Hindi Wikipedia data; we are providing you with Hindi ELMo embeddings trained on the Wikipedia dump.
  3. fastText Hindi embeddings.

You can find all the files here.

While training, we track the accuracy after each epoch, and the best weights get saved. We can start training by calling the allennlp “train” command, passing it the model config file and a model output path.

$ allennlp train path/to/config/file -s path/to/output/folder
Metrics after training completion

Here you can see the different metrics we get after training. You can now play with the dataset to get good results on the validation data provided here. Happy training! 🎸🎸

Prediction

Once training is complete, we can make predictions by calling the “predict” command, passing it the saved model path and a test file, as shown below.

allennlp predict \
path/to/model.tar.gz \
path/to/test.txt

When you run this, you get the logits and the tags for each input; all the words are shown with their predicted tags. You can also do this programmatically, as sketched below.
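
For example, here is a minimal sketch of loading the trained archive in Python with AllenNLP’s generic sentence-tagger predictor (note that this predictor tokenizes with spaCy’s English model by default, so for Hindi you may prefer to write a small custom predictor):

from allennlp.predictors.predictor import Predictor

# make sure the module defining the custom dataset reader has been imported
# (on the command line you would pass --include-package for the same reason)
predictor = Predictor.from_path("path/to/output/folder/model.tar.gz",
                                predictor_name="sentence-tagger")

result = predictor.predict(sentence="जयपुर में मौसम की स्थिति क्या है")
print(result["tags"])  # the predicted BILOU tag for each token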

Resources for deeper understanding

  1. Bi-LSTM + CRF with character embeddings for NER and POS
  2. The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
  3. Bi-LSTM + CRF with pytorch
  4. AllenNLP Documentation

Conclusion

I hope you were able to get a comprehensive grasp of how to implement NER with contextualized embeddings (ELMo) using AllenNLP, and to get hands-on with the same.

If you enjoyed this article, help us spread the word among aspiring NLP developers. Follow us, share, and do give us a clap.
