Named Entity Recognition using Deep Learning (ELMo Embedding + Bi-LSTM)

Subham Sarkar
Jun 13 · 7 min read

Introduction :

Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organisations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

It adds a wealth of semantic knowledge to your content and helps you to promptly understand the subject of any given text.

Applications :

A few applications of NER include extracting important named entities from legal, financial, and medical documents, classifying content for news providers, and improving search algorithms.

Approaches to tackle this problem:

  1. Machine Learning Approach: treating the problem as multi-class classification with the named entities as our labels. The problem here is that identifying and labelling named entities, especially in longer sentences, requires a thorough understanding of the context of the sentence and of the sequence of word labels in it, which this method ignores, so it cannot capture the essence of the entire sentence.
  2. Deep Learning Approach: a model well suited to this problem is the Long Short-Term Memory (LSTM) network; specifically, we will use a Bi-directional LSTM for our setup. A Bi-directional LSTM is a combination of two LSTMs: one runs forward, from left to right, and one runs backward, from right to left, thus capturing the entire essence/context of the sentence. For NER this matters because the context of a label covers both past and future words in the sequence, so we need to take both past and future information into account.

Embedding Layer: ELMo (Embeddings from Language Models): ELMo is a deep contextualised word representation that models both:

(1) complex characteristics of word use (e.g., syntax and semantics), and

(2) how these uses vary across linguistic contexts (i.e., polysemy). Example: although the term ‘Apple’ is common, ELMo will give it different embeddings in its two senses (the fruit and the organisation) thanks to this contextual logic.

We also need not worry about out-of-vocabulary (OOV) tokens in the training data, since ELMo builds its representations from characters and can generate an embedding for any unseen word.

These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.


Let’s see how we can approach this problem :

  1. Data Acquisition: We are going to use a dataset from Kaggle. Please go through the data to learn more about the different tags used.

We have 47,958 sentences in our dataset, 35,179 different words, 42 different POS tags and 17 different named entities (Tags).

In this article we will build two different models, for predicting the Tag and the POS respectively.

2. Next we will use a class that converts every sentence with its named entities (tags) into a list of tuples [(word, named entity), …]
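As a minimal sketch of that conversion step: assuming rows shaped like the Kaggle NER dataset (sentence id, word, POS, tag), a simple grouping function can do the same job as the class (the function and variable names here are illustrative, not the article’s):

```python
from itertools import groupby

def rows_to_sentences(rows):
    """Group annotated rows into one [(word, pos, tag), ...] list per sentence."""
    sentences = []
    for _, grp in groupby(rows, key=lambda r: r[0]):  # rows assumed sorted by sentence id
        sentences.append([(w, p, t) for _, w, p, t in grp])
    return sentences

rows = [
    ("Sentence: 1", "London", "NNP", "B-geo"),
    ("Sentence: 1", "is", "VBZ", "O"),
    ("Sentence: 1", "calm", "JJ", "O"),
    ("Sentence: 2", "John", "NNP", "B-per"),
    ("Sentence: 2", "left", "VBD", "O"),
]
sentences = rows_to_sentences(rows)
print(sentences[0])  # [('London', 'NNP', 'B-geo'), ('is', 'VBZ', 'O'), ('calm', 'JJ', 'O')]
```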

3. Let’s have a look at the distribution of the sentence lengths in the dataset. The longest sentence has 140 words, and almost all of the sentences have fewer than 60 words. But due to a hardware crunch we will use a smaller length, i.e. 50 words, which can be processed easily.
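Once the cap of 50 words is fixed, padding can be sketched as below. Because ELMo consumes raw token strings, shorter sentences are padded with a literal placeholder token rather than an index; the `__PAD__` token name is an assumption, not necessarily the article’s choice:

```python
MAX_LEN = 50          # cap chosen from the sentence-length distribution
PAD_TOKEN = "__PAD__"  # placeholder token; the exact name is an assumption

def pad_words(words, max_len=MAX_LEN, pad=PAD_TOKEN):
    """Truncate or right-pad a token list to exactly max_len entries."""
    return (words[:max_len] + [pad] * max_len)[:max_len]

print(pad_words(["Obama", "visited", "Paris"])[:5])
# ['Obama', 'visited', 'Paris', '__PAD__', '__PAD__']
```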

4. Let’s create word-to-index and index-to-word mappings, which are necessary for converting words to indices before training and back again after prediction.
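A minimal sketch of those two mappings, using a tiny stand-in vocabulary in place of the dataset’s 35,179 words:

```python
words = ["London", "is", "calm", "John", "left"]  # stand-in vocabulary

word2idx = {w: i for i, w in enumerate(words)}    # word -> integer index
idx2word = {i: w for w, i in word2idx.items()}    # inverse mapping for decoding

# Round trip: encode before training, decode after prediction.
print(idx2word[word2idx["London"]])  # London
```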

5. From the list of tuples generated earlier, now we will build the independent and dependent variable structure.

  • Independent variable / Words corpus :
  • The same applies for the named entities (the dependent variable), but this time we need to map our labels to numbers.
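Putting both halves of step 5 together, here is a sketch of building X (padded raw token strings, since ELMo embeds strings itself) and y (padded tag indices). The tag set, the `__PAD__` token, and padding the labels with "O" are all assumptions for illustration:

```python
import numpy as np

MAX_LEN = 50
tags = ["O", "B-geo", "B-per"]          # stand-in for the 17 tags
tag2idx = {t: i for i, t in enumerate(tags)}

sentences = [[("London", "B-geo"), ("is", "O")],
             [("John", "B-per"), ("left", "O")]]

# Independent variable: raw word strings, right-padded to MAX_LEN.
X = [[w for w, t in s] + ["__PAD__"] * (MAX_LEN - len(s)) for s in sentences]
# Dependent variable: tag indices, padded with the "O" tag (an assumption).
y = np.array([[tag2idx[t] for w, t in s]
              + [tag2idx["O"]] * (MAX_LEN - len(s)) for s in sentences])

print(len(X[0]), y.shape)  # 50 (2, 50)
```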

6. Train/Test Split (90:10):
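The 90:10 split is a one-liner with scikit-learn; the random seed and the stand-in data here are illustrative:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))  # stand-ins for the padded sentences
y = list(range(100))  # and their aligned tag sequences

# test_size=0.1 gives the 90:10 split; the pairing of X and y is preserved.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2019)
print(len(X_tr), len(X_te))  # 90 10
```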

7. Batch Training: Since we have 32 as the batch size, the network must be fed in chunks whose sizes are all multiples of 32.
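One simple way to enforce this, sketched here, is to drop the tail of the data so its length divides evenly by the batch size (the helper name is my own):

```python
BATCH_SIZE = 32

def trim_to_batch_multiple(data, batch_size=BATCH_SIZE):
    """Drop the tail so the number of examples is an exact multiple of batch_size."""
    usable = (len(data) // batch_size) * batch_size
    return data[:usable]

X_tr = trim_to_batch_multiple(list(range(1000)))  # toy stand-in for the train set
print(len(X_tr), len(X_tr) % 32)  # 992 0
```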

8. Loading the ELMo Embedding Layer: We will import TensorFlow Hub (a library for the publication, discovery, and consumption of reusable parts of machine-learning models) to load the ELMo embedding, create a function so that we can use it in the form of a layer, and start building our Keras network.

Please downgrade your TensorFlow package to 1.x to use this code. If you want to perform the same in TF 2 or greater, you have to use hub.load(url) and then create a KerasLayer(…, trainable=True).

9. Designing our Neural Network:

  • Embedding layer (ELMo): We will specify the maximum length (50) of the padded sequences. After the network is trained, the embedding layer will transform each token into a vector of n dimensions.
  • Bidirectional LSTM: the Bidirectional wrapper takes a recurrent layer (e.g. the first LSTM layer) as an argument. This layer takes the output from the previous embedding layer.
  • We will use 2 Bi-LSTM layers, with a residual connection back to the first Bi-LSTM’s output.
  • TimeDistributed Layer: We are dealing with a many-to-many RNN architecture, where we expect an output for every input time-step. For example, in the sequence (a1 →b1, a2 →b2 … an →bn), a and b are the inputs and outputs of each step. The TimeDistributed(Dense) layer applies the same Dense (fully connected) operation to the output at every time-step, resulting in one prediction per token.
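The design above can be sketched in tf.keras as follows. To keep the sketch standalone, a trainable Embedding stands in for the ELMo layer, and the layer sizes (128-dim embedding, 512 LSTM units, 0.2 dropout) are assumptions rather than the article’s exact values:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, N_WORDS, N_TAGS = 50, 35179, 17

inp = layers.Input(shape=(MAX_LEN,))
# Stand-in for ELMo: a trainable 128-dim embedding over the word indices.
emb = layers.Embedding(input_dim=N_WORDS, output_dim=128)(inp)

# Two stacked Bi-LSTM layers...
x = layers.Bidirectional(layers.LSTM(units=512, return_sequences=True,
                                     dropout=0.2, recurrent_dropout=0.2))(emb)
x_rnn = layers.Bidirectional(layers.LSTM(units=512, return_sequences=True,
                                         dropout=0.2, recurrent_dropout=0.2))(x)
# ...with a residual connection summing the first Bi-LSTM's output back in.
x = layers.add([x, x_rnn])

# TimeDistributed Dense: one softmax over the tag set per time-step.
out = layers.TimeDistributed(layers.Dense(N_TAGS, activation="softmax"))(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
print(model.output_shape)  # (None, 50, 17)
```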

10. Training: We ran this for only 1 epoch since it was taking a lot of time, but the results are already impressive.

11. Batch prediction, using the index-to-tag mapping to convert the predicted indices back to tag names.
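The decoding step can be sketched with a toy batch of model outputs; the tag set here is a three-tag stand-in for the dataset’s 17:

```python
import numpy as np

tags = ["O", "B-geo", "B-per"]      # stand-in tag set
idx2tag = dict(enumerate(tags))     # index-to-tag mapping

# Toy model output for one sentence of two tokens: (batch, max_len, n_tags).
probs = np.array([[[0.1, 0.8, 0.1],
                   [0.9, 0.05, 0.05]]])

pred_idx = probs.argmax(axis=-1)    # most probable tag index per token
pred_tags = [[idx2tag[i] for i in row] for row in pred_idx]
print(pred_tags)  # [['B-geo', 'O']]
```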

12. Evaluation Metric: In the case of NER, we might be dealing with important financial, medical, or legal documents, and precise identification of named entities in those documents determines the success of the model. In other words, false positives and false negatives have a business cost in an NER task. Therefore, our main metric to evaluate our models will be the F1 score, because we need a balance between precision and recall.

  • We were able to get an F1-score of 81.2%, which is pretty good; the micro, macro and average F1 scores are good as well. If you train this for more epochs you will very likely get better results.
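As a sketch of how such scores are computed, scikit-learn’s f1_score over flattened token labels gives micro and macro averages (a toy two-class example below; note that entity-level NER evaluation, e.g. with a library like seqeval, is stricter than this token-level view):

```python
from sklearn.metrics import f1_score

# Flattened gold and predicted tag ids for a toy two-class case.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(f1_score(y_true, y_pred, average="micro"))  # 0.75
print(f1_score(y_true, y_pred, average="macro"))
```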

13. Comparing our results with spaCy: We can see our model was able to detect every tag correctly, even after a single epoch.

Our Model Results

14. Parts-of-Speech Tagging/Prediction: Since we also had parts of speech (POS) in our dataset, we can build a similar model for predicting them as well. I have implemented that too and trained it for 1 epoch, and the results were again impressive.

  • We were able to get an F1-score of 97.1%, which is very good; the micro, macro and average F1 scores are good as well.

Comparing our results with spaCy: We can see our model was able to detect every tag correctly, even after a single epoch.

Our Model results

Thanks for reading this blog. If you liked it please clap, follow and share.

Where can you find my code?

Github :

References :