Extracting information from unstructured documents
Named Entity Recognition with BiLSTM-CNNs
Deep Learning-based entity extraction
Introduction
Among the wide range of Natural Language Processing tasks, Named Entity Recognition (NER) has always attracted a strong stream of research work. Named Entity Recognition consists of extracting structured pieces of information, such as a Name, a Location, a Company, a University name, an Activity sector, etc., from chunks of unstructured text.
Business and industry applications are numerous: extracting relevant information from resumes (Names, Skills, College, Hobbies), speeding up document processing in CRMs, building company profiles based on annual reports, etc. Most advanced chatbots integrate NER pipelines in their AI engines.
From simple regular expressions to cutting-edge Deep Learning implementations, this task can be tackled with various levels of complexity. For instance, the 1.0 NER algorithm simply consists of using a dictionary of known terms to search for and extract the matching occurrences.
This article presents the implementation and training of a model inspired by a former state-of-the-art model for Named Entity Recognition: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs, 2016.
To tackle this challenge, we used a set of annotated documents drawn from public annual reports of large companies. The goal is to find 4 types of entities in these documents:
- Locations (countries, cities, districts, addresses, etc.)
- Company names
- Group names
- Activity sectors.
Below is an example of a training sample, with its ground truth entities.
Data
Raw data
We used a dataset with 89 public annual reports. These reports represent approximately 54,000 sentences. Entity occurrences are detailed below.
As in most NLP applications, textual data is processed as a sequence of words. Learning the NER extraction on a whole document would be inefficient, as the sequence would be extremely long (thousands of words). Hence, each sentence is processed separately, so that our dataset contains 54,000 samples. This choice does not limit the model's learning capability, because the context needed to understand an entity is usually limited to the sentence itself.
Moreover, treating the sentences independently brings better generalisation: sentences from different documents are mixed and gathered into the same learning batches.
This large dataset is split into a training set (70%), a validation set (20%) and a test set (10%).
Features
The model we implemented extracts information using 4 word-wise features:
- Feature 1: The word. This is the most common way to tackle NLP challenges. Words are preprocessed and mapped to their id in a vocabulary, then embedded as continuous vectors.
["One", "sentence", "with", "5", "WORDS"]
- Feature 2: The word's text case. We extract the case information from each word: lowercase, uppercase, titlecase, numeric, partially_numeric, etc. This feature is useful as some entities are highly correlated with case information. For instance, acronyms will be uppercase and are unlikely to carry activity sector information, while company names are likely to be titlecase.
[titlecase, lowercase, lowercase, numeric, uppercase]
- Feature 3: The word's Part-Of-Speech tag. Grammatical data definitely helps getting information on the entities: for instance, a company name is very unlikely to be a verb, an activity sector is likely to consist mainly of common nouns, etc. We used the NLTK library to extract this information.
[CardinalDigit, Noun, Preposition, CardinalDigit, Noun plural]
- Feature 4: The word's characters. Each word is processed as the sequence of its characters. This is helpful to capture natural language patterns: for instance, common language patterns might be more frequent in activity sectors than in location names. Moreover, this allows better generalisation for OOV (Out-Of-Vocabulary) words, as we can still extract information from their characters.
[[O,n,e], [s,e,n,t,e,n,c,e], [w,i,t,h], [5], [W,O,R,D,S]]
Each of these features is a sequence: each word of the original sentence maps to a time step in the 4 features. Thus, one training sample is represented as 4 sequences of features.
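To make this concrete, below is a minimal sketch of how these 4 features could be extracted for one sentence. The lookup dictionaries (word2id, case2id, pos2id, char2id) are hypothetical and would be built beforehand from the training corpus.

import nltk  # the POS tagger requires the "averaged_perceptron_tagger" resource

def casing(token):
    # Coarse case categories; the real feature set may be slightly richer.
    if token.isdigit():
        return "numeric"
    if any(c.isdigit() for c in token):
        return "partially_numeric"
    if token.isupper():
        return "uppercase"
    if token.istitle():
        return "titlecase"
    return "lowercase"

def featurize(tokens, word2id, case2id, pos2id, char2id):
    pos_tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {
        "words": [word2id.get(t.lower(), word2id["<UNKNOWN>"]) for t in tokens],  # feature 1
        "casing": [case2id[casing(t)] for t in tokens],                           # feature 2
        "pos": [pos2id.get(tag, 0) for tag in pos_tags],                          # feature 3
        "chars": [[char2id.get(c, 0) for c in t] for t in tokens],                # feature 4
    }

# featurize(["One", "sentence", "with", "5", "WORDS"], word2id, case2id, pos2id, char2id)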
Labels
The NER task aims at predicting an entity type for each word of the sentence. Each token can be assigned either to the O category (the word is not an entity we are looking for), or to one of the four entity types: Location, Group, Company, Activity. A label is then simply a sequence of tags, such as this one:
[O, O, O, Company, O, Group, O, O, O, Location, Location, Location]
Preprocessing
Some preprocessing operations are applied to clean and normalize the text inputs:
- After extracting the case information for feature 2, the text is lowercased before creating the embedding representations.
- Tokenization: we used the WordPunctTokenizer from NLTK.
- DIGIT: a significant number of tokens are numeric, such as dates or amounts. In order to reduce the size of the vocabulary, a usual preprocessing operation consists of replacing digit characters with the special token "<DIGIT>". For instance, the year "2019" is replaced by "<DIGIT><DIGIT><DIGIT><DIGIT>". This way, we keep only the formatting information from these tokens, not their value. This operation reduced the number of numeric tokens from 11,000 to only 1,000.
We didn't remove stop words, because they can be critical for understanding entities in a sentence: for example, locations are very likely to be preceded by a preposition such as "in", "to", etc.
What's more, we didn't apply lemmatisation or stemming, as they might strip away useful information.
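As an illustration, here is a minimal sketch of such a preprocessing pass, reusing the casing() helper sketched earlier; the function names are ours.

import re
from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

def preprocess(sentence):
    tokens = tokenizer.tokenize(sentence)
    cases = [casing(t) for t in tokens]  # case information extracted before lowercasing
    normalized = [re.sub(r"\d", "<DIGIT>", t.lower()) for t in tokens]
    return normalized, cases

# preprocess("Revenue grew in 2019.")
# -> (['revenue', 'grew', 'in', '<DIGIT><DIGIT><DIGIT><DIGIT>', '.'],
#     ['titlecase', 'lowercase', 'lowercase', 'numeric', 'lowercase'])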
To wrap up, below is an overview of 5 training samples as they are fed into the network.
Filtering
After a few rounds of training, we also decided to filter out some samples according to the following criteria:
- Sentences that are too long, which bring both learning and memory challenges (very long sentences require large amounts of padding in the tensors)
- Some "empty" sentences that don't contain any entity. 51% of the sentences (~28,000 sentences) contains only "O" tags. Although teaching the model to recognize empty sentences is important, the 51% proportion might unbalance the training phase. Hence, 80% of the empty sentences of the training set are dropped.
Vocabulary management
Correctly managing the size of our vocabulary is crucial, as it highly correlates with the number of trainable weights. Embedding weights are a major part of the network’s training effort. Three strategies can be applied to choose a vocabulary:
- Building a vocabulary from the training set documents.
Approximately 33,000 tokens compose the training set. This vocabulary can be cut down to a smaller size: removing infrequent words from the vocabulary ensures better generalisation, because the model gets used to seeing unknown words. OOV words are replaced by the special token <UNKNOWN>. Like the other words, this token has a representation to be learnt in the embedding matrix.
This method handles the dataset's specifics perfectly, but might struggle at inference time with OOV words.
- Using a ready-to-use vocabulary.
When using pre-trained embeddings, common word vocabularies are available. As with the previous method, a maximum vocabulary size can be defined so as to keep a sufficient volume of OOV words in the training corpus. This method generalizes quite well to unknown texts, as most common words have a representation. However, the corpus specifics are not covered, for instance the DIGIT-based tokens carrying information about percentages, dates, etc.
- Hybrid method combining pre-trained and custom vocabulary
This is the approach we developed. In order to take advantage of both methods, we built a hybrid vocabulary from the pre-trained GloVe vocabulary and our dataset's words. GloVe words ensure a generic language comprehension of the texts, while the corpus-specific words capture the specifics of the corpus, for instance DIGIT-based tokens, acronyms, and some specific vocabulary designating companies (e.g. Company XYZ Ltd.).
The way we crossed these two vocabularies is illustrated below.
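One possible way to build such a hybrid vocabulary is sketched below; the function name, the size limit and the frequency threshold are illustrative assumptions.

from collections import Counter

def build_hybrid_vocab(glove_words, corpus_tokens, glove_size=30000, min_count=2):
    # Special tokens first, then the most frequent GloVe words,
    # then the corpus-specific tokens that GloVe does not cover.
    vocab = ["<PAD>", "<UNKNOWN>"] + list(glove_words)[:glove_size]
    known = set(vocab)
    counts = Counter(corpus_tokens)
    vocab += [w for w, c in counts.most_common() if c >= min_count and w not in known]
    return {word: idx for idx, word in enumerate(vocab)}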
Modeling
The following elements detail an implementation inspired by the Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs, 2016 network. An overview of the architecture is detailed below.
Features Embedding
The sample features are first embedded into a continuous space (Embedding layers). Except for the character-wise feature, the input data is a sequence of ids used as indexes for a look-up operation. Each timestep of these sequences is represented as a continuous vector. These continuous vectors are stored in embedding matrices, whose weights are trainable.
- For words (feature 1), we use the GloVe Embedding (dimension 50) as pre-trained weights for the 30k GloVe words. For the 5k specific words, a random vector is initialized by sampling a uniform distribution. All the vectors are trainable, and will be updated during training. The final embedding matrix hence contains 35,000 vectors of dimension 50.
- For the casing and POS inputs (features 2 & 3), the possible values are very limited: there are 5 possible casings for the words, and around 30 POS tags in the NLTK package. These values are still represented as continuous vectors in embedding spaces. We used 3 dimensions for the casing embedding, and 5 dimensions for the POS embedding. This replaces the usual One-Hot-Encoding of categorical features, while bringing more meaning to the feature's values.
Char CNN
Last but not least, the character-wise input (feature 4) requires a specific treatment. Indeed, as sentences are represented through 2 nested sequences (words & chars), they cannot be embedded easily.
In order to compute a continuous representation of the same nature, a CNN is applied: each character is embedded in a character embedding matrix, of dimension 10. Then, a 1D Convolution layer processes the sequence of embedded char vectors, followed by a MaxPooling operation. This way, each word gets a vector representation (whatever its length).
Of course, both the character embedding weights and the CNN filters are trainable. We set up filters of width 3: an odd number helps keep some symmetry when searching for character patterns.
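Below is a sketch of this character-level branch in Keras. The character embedding dimension (10) and the filter width (3) follow the text; the padding lengths, the character vocabulary size and the number of filters (chosen so that each word ends up as a 10-dimensional vector, matching the concatenation arithmetic of the next section) are assumptions.

from tensorflow.keras import layers

max_sentence_len = 100   # assumed padding length for sentences
max_word_len = 30        # assumed padding length for words
char_vocab_size = 100    # assumed character vocabulary size

# (batch, sentence, word) integer character ids
char_input = layers.Input(shape=(max_sentence_len, max_word_len), dtype="int32", name="chars")
# (batch, sentence, word, 10): each character becomes a 10-dimensional vector
char_emb = layers.Embedding(char_vocab_size, 10)(char_input)
# Width-3 convolution scanning each word's character sequence
char_conv = layers.TimeDistributed(
    layers.Conv1D(filters=10, kernel_size=3, padding="same", activation="relu"))(char_emb)
# Max-pooling over the characters: one fixed-size vector per word
char_vec = layers.TimeDistributed(layers.GlobalMaxPooling1D())(char_conv)  # (batch, sentence, 10)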
Concatenate
Finally, the 4 continuous representations are concatenated so that each timestep is represented with one vector, of size:
50 (feature 1) + 3 (feature 2) + 5 (feature 3) + 10 (feature 4) = 68
Training samples are hence fed to the recurrent layer as a tensor of shape (sentence_length, 68).
Bi-LSTM
The training sample is now represented by a sequence of vectors. A bidirectional LSTM network (dimension 275) is applied to this sequence of vectors. As the label is a sequence of tags, we keep each LSTM cell's hidden state as output. 2 LSTM layers are used.
Dense Layer
At each timestep, a Dense layer (32 units) extracts information from the LSTM hidden states. Weights are shared across timesteps. A Softmax activation over the 5 possible NER tags computes the final probability distribution.
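Putting the pieces together, here is a sketch of the full architecture in Keras, reusing char_input and char_vec from the character-CNN sketch above. The dimensions (50/3/5/10 embeddings, 275-unit Bi-LSTMs, 32-unit Dense, 5 output tags) follow the text; the casing and POS vocabulary sizes, the intermediate ReLU activation and the placeholder GloVe matrix are assumptions.

import numpy as np
from tensorflow.keras import Model, layers, initializers

# Placeholder for the real pre-trained matrix: 30k GloVe vectors plus 5k
# uniformly-initialized corpus-specific vectors, all of dimension 50.
glove_matrix = np.random.uniform(-0.25, 0.25, size=(35000, 50))

word_input = layers.Input(shape=(max_sentence_len,), dtype="int32", name="words")
case_input = layers.Input(shape=(max_sentence_len,), dtype="int32", name="casing")
pos_input = layers.Input(shape=(max_sentence_len,), dtype="int32", name="pos")

word_emb = layers.Embedding(35000, 50,
                            embeddings_initializer=initializers.Constant(glove_matrix),
                            trainable=True)(word_input)
case_emb = layers.Embedding(6, 3)(case_input)   # 5 casings + padding (assumed)
pos_emb = layers.Embedding(40, 5)(pos_input)    # ~30 POS tags + padding (assumed)

x = layers.Concatenate()([word_emb, case_emb, pos_emb, char_vec])           # (sentence, 68)
x = layers.Bidirectional(layers.LSTM(275, return_sequences=True))(x)
x = layers.Bidirectional(layers.LSTM(275, return_sequences=True))(x)
x = layers.TimeDistributed(layers.Dense(32, activation="relu"))(x)          # shared across timesteps
outputs = layers.TimeDistributed(layers.Dense(5, activation="softmax"))(x)  # O + 4 entity types

model = Model([word_input, case_input, pos_input, char_input], outputs)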
Training
Batch generation
Training this model requires some adjustments on the batch generation process.
First, to avoid too much disparity in sentence lengths within a batch, sentences are bucketed by length (Khomenko et al., 2017). The idea is to gather sentences of similar length in the same batch, to avoid inefficient padding values.
Second, the feature sequences need to be stacked, so that only one tensor enters the first layer. The char feature has an additional tensor depth: the Keras input layer is therefore fed with two separate inputs, to avoid useless depth on features 1, 2 and 3.
Moreover, in order to further improve the network's generalization, UNKNOWN tokens are randomly used as replacements for sentence words. The idea is to get the model to regularly see unknown words and take advantage of the surrounding context.
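A minimal sketch of these two adjustments (bucketing by length and random UNKNOWN replacement) is given below; the replacement rate is an assumption.

import random

def bucketed_batches(samples, batch_size=16):
    # samples: list of (features, labels) pairs; features["words"] is the token id list.
    # Sorting by length keeps similarly-sized sentences together, then the batches
    # themselves are shuffled so that the training order stays random.
    ordered = sorted(samples, key=lambda s: len(s[0]["words"]))
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.shuffle(batches)
    return batches

def random_unknown(word_ids, unknown_id, rate=0.05):
    # Occasionally hide a word so the model learns to rely on the surrounding context.
    return [unknown_id if random.random() < rate else w for w in word_ids]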
Loss function
An averaged categorical cross entropy is used: at each timestep, a categorical cross entropy value is computed. The values are then averaged to remove the effect of sequence length; otherwise long sentences would have much higher costs than short sentences.
Padding tokens are represented as [0, 0, 0, 0, 0], so that they don’t trigger any quantity in the loss calculation.
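A sketch of this loss in TensorFlow is given below, assuming one-hot labels of shape (batch, sentence_length, 5) with all-zero vectors on padding steps.

import tensorflow as tf

def averaged_categorical_cross_entropy(y_true, y_pred):
    # Per-timestep cross entropy; all-zero padding labels contribute exactly 0.
    cross_entropy = -tf.reduce_sum(y_true * tf.math.log(y_pred + 1e-8), axis=-1)
    # Number of real (non-padding) tokens in each sentence
    n_tokens = tf.maximum(tf.reduce_sum(y_true, axis=[-1, -2]), 1.0)
    # Average over the real tokens so that long sentences don't dominate the cost
    return tf.reduce_mean(tf.reduce_sum(cross_entropy, axis=-1) / n_tokens)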
Metrics
A proper metric is needed to assess the model's performance. For this multi-class classification, a class-wise F1 score is relevant. This score is computed at the token level: we consider one separate classification for each token of the dataset.
The following confusion matrix stores the model's performance for the previous sample.
Hence, we can simply accumulate each sample's prediction results into the same confusion matrix, and compute Precision, Recall and F1 scores.
What's more, this method also allows computing a score for each label category, which is very valuable for tuning the model's parameters.
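Below is a sketch of this token-level evaluation: a single confusion matrix accumulated over all samples, from which per-class precision, recall and F1 are derived. The tag-to-index mapping is illustrative.

import numpy as np

TAGS = {"O": 0, "Location": 1, "Group": 2, "Company": 3, "Activity": 4}

def accumulate(confusion, true_tags, pred_tags):
    # Add one sample's token-level predictions to the shared confusion matrix.
    for t, p in zip(true_tags, pred_tags):
        confusion[TAGS[t], TAGS[p]] += 1
    return confusion

def per_class_f1(confusion):
    scores = {}
    for tag, i in TAGS.items():
        tp = confusion[i, i]
        precision = tp / max(confusion[:, i].sum(), 1)
        recall = tp / max(confusion[i, :].sum(), 1)
        scores[tag] = 2 * precision * recall / max(precision + recall, 1e-8)
    return scores

confusion = np.zeros((len(TAGS), len(TAGS)), dtype=int)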
Training
We ran the training with the following hyperparameters:
- Batch size: 16
- Learning rate: 0.001
- Optimizer: Adam
- Number of iterations: 10,000
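With these hyperparameters, compiling the model sketched above could look as follows (reusing the model and loss function from the previous sketches):

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001),
              loss=averaged_categorical_cross_entropy)
# model.fit(...) is then run for roughly 10,000 iterations with batches of 16.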
Results
Hereafter are the first training curves we obtained. The model quickly overfits (from epoch 10). Although the model isn't learning much, the strong convergence of the training curve suggests a correct implementation.
These first runs reached a validation F1 score of approximately 78%.
To tackle the overfitting issue, various implementation choices were made:
- Using a pre-trained embedding. At first, the embedding was randomly initialized (sampled from uniform distributions). Unfortunately, the time needed to learn a correct representation is too long, and the model overfits even before having learnt correct word vectors. What's more, unknown words of the validation set get poor representations. We then decided to use GloVe word representations (dimension 50) to start from correct representations of most words in our corpus.
- Dropout. We added dropout operations in the Dense layers. Dropout is also added on recurrent connections of the LSTM, with a much smaller rate.
- Reduce the number of weights. Overfitting occurs mainly because the model is too complex. We tried to reduce the model's complexity by reducing the LSTM dimensions and the vocabulary size.
After updating these elements, the final convergence curves we obtained are much better. The model generalizes well.
The final model's performance on the test set is displayed below. The final model reaches a validation F1 score of 82.3%, a strong improvement over the first trainings.
We can compute an average F1 score weighted by the classes' support. Excluding the "O" category, which would unbalance the result, our model achieves a very good test F1 score of 76.4%.
Some examples
Let's take a glance at a couple of extractions. We developed, along with the AI models, a user interface to access the model's predictions on a text sample.
The model manages to extract many entities of different types and complexities. In particular, performance on activities is very satisfactory, even though they contain simple and common words: indeed, "production" alone wouldn't be tagged as an Activity, while "electricity production" is definitely an Activity sector.
Next steps
The next steps would consist in improving the model towards current state-of-the-art methods, by implementing for instance:
- Further model fine-tuning (embedding & layer dimensions)
- Attention mechanisms
- OpenAI Transformers
- Contextual Language Models (ULMFit, ELMO, BERT, etc.)
- Conditional Random Fields
References
Similar research work is underway at ILLUIN Technology. Some of it is already shared on our blog page.
Our team is constantly growing. Interested? We're hiring!
https://www.illuin.tech