Back-Translation for Named Entity Recognition

Here we present insights from our Harvard capstone project investigating methods for improving named entity recognition (NER). Although our efforts did not yield a new state of the art in NER, we describe the approaches we tried and what we learned from them.

The original question of our project was whether we could incorporate information from a knowledge base such as Wikidata to improve performance on NER. We explore several methods for constructing type-specific vocabularies from the knowledge base and show that compiling and cleaning this data is far from trivial. We then explore several methods of using these vocabularies to train an NER classifier on Wikipedia articles in a weakly supervised way. We demonstrate the challenges of incorporating non-contextual information in a setting where context is key. Lastly, we show how ideas from low-resource neural machine translation can improve the generalizability of NER classification.

Data Processing

For this project we used three sources of data:

1. NER benchmarks

These datasets are text corpora in which each word carries a tag describing it. We use two NER datasets: CoNLL-2003 and WNUT-2017.

The CoNLL dataset is a corpus of sentences in which each word is labelled as a location (LOC), an organization (ORG), a person (PER) or a miscellaneous entity (MISC). All words that don’t fall into one of these categories are labelled O. Although the LOC, ORG and PER categories have some structure, the MISC category gathers a variety of words that have few features in common.

The WNUT dataset has a similar structure but a different label set, which includes location (LOC), organization (ORG) and person (PER).

2. Knowledge Graph:

As we want to augment our datasets, we use a knowledge graph, Wikidata, which stores structured data about the entities described on Wikipedia. The knowledge is represented as a set of nodes, which stand for entities (Jeff Bezos, New York, Apple, …), linked through several types of relationships (subclass of, is owned by, instance of, …).

3. Unlabelled Corpus:

Finally, we also use a subset of the text of Wikipedia, called WikiText-2. This corpus contains 600 Wikipedia articles and totals about 2M tokens.


Our objective is to improve the performance of models on NER tasks, so we evaluate our models on the standard NER benchmarks, CoNLL and WNUT. To train these models, however, we build vocabularies and use them to label unlabelled text. Namely, we use vocabularies built from CoNLL and Wikidata to provide labels for the WikiText-2 corpus.

A vocabulary is the list of words associated with one entity type; we build one vocabulary per type and use them to train models. To generate the synthetic dataset from WikiText-2, each word is labelled with the entity type of every vocabulary it belongs to.

This labelling procedure differs from regular NER datasets in two ways:

  • As each word can appear in more than one vocabulary, a word can have multiple labels
  • The labelling is done independently of the context

We will discuss the impact of these two features later on.
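The labelling rule above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `label_tokens`, the toy vocabularies, and the choice to give the tag "O" to words found in no vocabulary are ours, not details taken from the project.

```python
from typing import Dict, List, Set


def label_tokens(tokens: List[str], vocabularies: Dict[str, Set[str]]) -> List[Set[str]]:
    """Give each token every entity type whose vocabulary contains it."""
    labels = []
    for token in tokens:
        types = {etype for etype, vocab in vocabularies.items() if token in vocab}
        # Assumption: a token found in no vocabulary is tagged "O".
        labels.append(types or {"O"})
    return labels


# Toy vocabularies, for illustration only (not the ones built in the project).
toy_vocabularies = {
    "LOC": {"Washington", "Paris"},
    "PER": {"Washington", "Bezos"},
    "ORG": {"Apple"},
}

sentence = "Washington visited Apple in Paris".split()
sentence_labels = label_tokens(sentence, toy_vocabularies)
for token, types in zip(sentence, sentence_labels):
    print(token, sorted(types))
```

Note that "Washington" receives both LOC and PER, and that it would receive the same pair of labels in any other sentence: these are exactly the two differences listed above.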

Vocabulary 1, CoNLL:

The first vocabulary we use is the one induced by the CoNLL tagging. The CoNLL dataset provides a contextual label for each word appearing in the corpus. We collect all the different labels associated with each word and define the vocabularies from them.

Using this procedure, we obtain the following intersection matrix, where ‘O’ denotes the words that carry no entity label in CoNLL.
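As an illustration of how such vocabularies and their intersection matrix could be computed, here is a toy sketch. It is not the project's code or data: the helper names and the tiny corpus are hypothetical, and we assume the BIO prefixes of the CoNLL tags (B-, I-) have already been stripped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

TAGS = ["LOC", "ORG", "PER", "MISC", "O"]


def build_vocabularies(tagged_tokens: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Collect, for each tag, the set of words that ever carry that tag."""
    vocabularies: Dict[str, Set[str]] = defaultdict(set)
    for word, tag in tagged_tokens:
        vocabularies[tag].add(word)
    return dict(vocabularies)


def intersection_matrix(vocabularies: Dict[str, Set[str]], tags: List[str]) -> List[List[int]]:
    """Entry [i][j] counts the words shared by the vocabularies of tags i and j."""
    empty: Set[str] = set()
    return [
        [len(vocabularies.get(a, empty) & vocabularies.get(b, empty)) for b in tags]
        for a in tags
    ]


# Hypothetical (word, tag) pairs standing in for the CoNLL training set.
toy_corpus = [
    ("Washington", "LOC"), ("Washington", "PER"), ("Apple", "ORG"),
    ("apple", "O"), ("Paris", "LOC"), ("German", "MISC"), ("the", "O"),
]

toy_vocabularies = build_vocabularies(toy_corpus)
toy_matrix = intersection_matrix(toy_vocabularies, TAGS)
for tag, row in zip(TAGS, toy_matrix):
    print(f"{tag:>4}", row)
```

The diagonal gives the size of each vocabulary, and the off-diagonal entries show how many words are ambiguous between two types.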

Vocabulary 2, Wikidata:

The first vocabulary provides accurate labels, but it is relatively small. To address this, we leverage the large knowledge graph we have access to: we explore Wikidata using heuristics to select all the entities (nodes) that pertain to a given category (ORG, PER, etc.).

For example, in order to gather all the ORG entities, we look for all the synonyms of ‘organization’, then select all the instances of these synonyms. Namely, we look for all the instances of ‘association’, ‘company’, ‘political group’, etc.

We proceed similarly for PER and LOC, but take a slightly different approach for MISC. As the MISC category is somewhat specific to the CoNLL dataset and has no real structure, we proceed in two steps:

  • For each word in the MISC vocabulary from CoNLL, we look up in Wikidata the class it is an instance of, by following the ‘instance of’ relationship to the parent node.
  • Given this list of classes, we look at all their instances in Wikidata to create the MISC vocabulary for this dataset
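The two steps above can be sketched on a tiny in-memory graph. Everything here is a stand-in: the `toy_instance_of` mapping replaces real Wikidata lookups, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Set


def classes_of(words: Iterable[str], instance_of: Dict[str, Set[str]]) -> Set[str]:
    """Step 1: collect the classes that the seed words are instances of."""
    classes: Set[str] = set()
    for word in words:
        classes |= instance_of.get(word, set())
    return classes


def instances_of(classes: Set[str], instance_of: Dict[str, Set[str]]) -> Set[str]:
    """Step 2: collect every entity that is an instance of one of those classes."""
    return {entity for entity, parents in instance_of.items() if parents & classes}


# Toy 'instance of' edges (entity -> classes), standing in for Wikidata.
toy_instance_of = {
    "German": {"language"},
    "French": {"language"},
    "Olympics": {"sports event"},
    "World Cup": {"sports event"},
    "Paris": {"city"},
}

conll_misc_seed = {"German", "Olympics"}
misc_classes = classes_of(conll_misc_seed, toy_instance_of)
wikidata_misc = instances_of(misc_classes, toy_instance_of)
print(sorted(misc_classes))
print(sorted(wikidata_misc))
```

Even in this toy case, two seed words pull in every instance of their classes, which hints at why the real procedure produces so many candidates.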

This method yields a very large number of candidates for the vocabulary and leads to many false positives, but it was the way we found to bring some structure to the blurry miscellaneous category.

Applying the previous steps, we are finally able to build this intersection matrix. As noted above, there is a very large number of words in the MISC category.

To mitigate this imbalance, we proceeded incrementally, removing some classes from our vocabulary definition. We alternated between measuring model performance and a fine-grained analysis of the model predictions to progressively refine the selection process for our vocabularies.


Our main baseline for NER classification simply applies the training set labels to the test set. If more than one label exists for a given word, we select one of the available tags at random. If no tag exists, we assign one at random. The scores of this baseline on the CoNLL benchmark are:
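A minimal sketch of such a lookup baseline is given below. The function names, the toy training pairs, and the choice to sample uniformly among candidate tags are our assumptions; the original implementation may differ.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

ALL_TAGS = ["LOC", "ORG", "PER", "MISC", "O"]


def fit_lookup(train_pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Remember every tag each word received in the training set."""
    tags_by_word: Dict[str, Set[str]] = defaultdict(set)
    for word, tag in train_pairs:
        tags_by_word[word].add(tag)
    return dict(tags_by_word)


def predict(tokens: List[str], tags_by_word: Dict[str, Set[str]],
            rng: random.Random) -> List[str]:
    """Pick a random known tag per word, or a random tag for unseen words."""
    predictions = []
    for token in tokens:
        # Sorting makes the random choice reproducible for a given seed.
        candidates = sorted(tags_by_word.get(token, ALL_TAGS))
        predictions.append(rng.choice(candidates))
    return predictions


# Hypothetical training pairs standing in for the CoNLL training set.
toy_train = [("Paris", "LOC"), ("Apple", "ORG"), ("Apple", "O"), ("the", "O")]
lookup = fit_lookup(toy_train)
toy_predictions = predict(["the", "Paris", "Apple", "Zorg"], lookup, random.Random(0))
print(toy_predictions)
```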

F1 Score Micro = 0.8415

F1 Score Macro = 0.5655

F1 Score Weighted = 0.8835
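For readers unfamiliar with the three averages, the sketch below computes them by hand on a toy example. We assume token-level scoring with every tag, including O, counted as a class; the post does not state which convention was used, so this shows the definitions rather than reproducing the numbers above.

```python
from collections import Counter
from typing import Dict, List


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def f1_averages(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, predicted in zip(y_true, y_pred):
        if gold == predicted:
            tp[gold] += 1
        else:
            fp[predicted] += 1
            fn[gold] += 1
    labels = sorted(set(y_true) | set(y_pred))
    per_label = {label: f1_from_counts(tp[label], fp[label], fn[label]) for label in labels}
    support = Counter(y_true)
    return {
        # Micro: pool the counts of all labels, then compute one F1.
        "micro": f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values())),
        # Macro: unweighted mean of the per-label F1 scores.
        "macro": sum(per_label.values()) / len(labels),
        # Weighted: mean of per-label F1 weighted by gold label frequency.
        "weighted": sum(per_label[label] * support[label] for label in labels) / len(y_true),
    }


toy_true = ["O", "O", "O", "O", "LOC", "PER"]
toy_pred = ["O", "O", "O", "O", "LOC", "LOC"]
toy_scores = f1_averages(toy_true, toy_pred)
print({name: round(value, 4) for name, value in toy_scores.items()})
```

In the toy example the frequent O class is predicted perfectly while PER is missed entirely, so the macro score falls well below the micro and weighted scores. A similar effect is one plausible reading of the gap between the macro score and the other two above.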

Modeling Approaches

Below we explain the modeling approaches we took to try to improve upon the state of the art for named entity recognition.

Knowledge Augmented Language Model based NER (KALM)

For our first experiment (which, to avoid any suspense, did not work out for us), we adopt the model proposed in Liu et al. (2019)[1]. At a high level, the model optimizes the objective function for a masked language model, and in the process learns the NER tag of a word as a latent variable. Specifically, we use a bidirectional LSTM-based architecture to generate contextual embeddings for the left and right contexts of each token in a sentence, with the WikiText-2[3] training set as the source of sentences. We use those embeddings to predict a probability distribution over the five CoNLL types (PER, LOC, ORG, MISC, O). Then, for each of the five types, we predict a distribution over the words in that type’s vocabulary (we mined these type-specific words from Wikidata, as explained in the Data Processing section). Finally, since we do not have ground-truth NER tags for the tokens, we marginalize over the NER type.

As indicated above, this approach did not work out for us: we achieved an average F1 score of just 0.21, which is worse than random. Error analysis showed that the tags we mined from Wikidata with the methods described above did not make sense. For instance, because ‘Of’ is also the name of a city in Turkey, our vocabulary building method assigned it the tag LOC, whereas it should have been assigned ‘O’, which matches its use in most English sentences.

The figure below shows this architecture.

KALM Architecture
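The marginalization over NER types can be written as P(word | context) = Σ_t P(t | context) · P(word | t, context). The sketch below evaluates this sum on hand-picked toy probabilities; in the real model both distributions come from the LSTM, and all names and numbers here are illustrative assumptions.

```python
import math
from typing import Dict


def marginal_word_probability(
    word: str,
    type_probs: Dict[str, float],
    word_probs_by_type: Dict[str, Dict[str, float]],
) -> float:
    """Sum over types t of P(t | context) * P(word | t, context)."""
    return sum(
        type_probs[t] * word_probs_by_type[t].get(word, 0.0)
        for t in type_probs
    )


# Toy distributions for a single context (three types only, for brevity).
toy_type_probs = {"LOC": 0.6, "PER": 0.3, "O": 0.1}
toy_word_probs = {
    "LOC": {"Paris": 0.5, "Of": 0.5},
    "PER": {"Paris": 0.2, "Bezos": 0.8},
    "O": {"of": 0.9, "the": 0.1},
}

p_paris = marginal_word_probability("Paris", toy_type_probs, toy_word_probs)
# The language modeling loss for this position is the negative log-likelihood.
loss = -math.log(p_paris)
print(round(p_paris, 4), round(loss, 4))
```

Because the tag is never observed, the only training signal is how well the mixture predicts the word. A word such as "Of" sitting in the LOC vocabulary therefore pushes probability mass toward the wrong type.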

Multi-Label Classification with BERT

As an alternative approach that utilizes non-contextual labels, we train a simple multi-label classifier on top of BERT contextual embeddings on the WikiText-2 text corpus. This closely mirrors the way BERT is often used for NER today: typically, a token-level classifier is placed on top of BERT embeddings and trained on an NER dataset such as CoNLL.

To adapt this to our vocabularies, we train on a generic unlabeled text corpus and predict, for each word, whether it is found in each of our pre-compiled vocabularies. Although these labels are inherently non-contextual, we hoped that by freezing the BERT weights, the contextual representations would enable the model to generalize. Unfortunately, this method was unsuccessful. When we generate labels from the training corpus, we get an F1 score of 0.401. When we employ the much larger, but messier, labels compiled from our knowledge base, this method does not even reach our basic baseline, getting an F1 of 0.131. Clearly, we needed to look for other approaches.
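To make the training target concrete, the sketch below builds the multi-hot vector for a word and evaluates a sigmoid cross-entropy loss on it. The encoder is left out: the logits stand in for the output of a linear layer over frozen BERT embeddings. The names, toy vocabularies, and choice of loss are our assumptions rather than details from the project.

```python
import math
from typing import Dict, List, Set

TYPES = ["LOC", "ORG", "PER", "MISC"]


def multi_hot(token: str, vocabularies: Dict[str, Set[str]]) -> List[float]:
    """One binary target per entity type: is the token in that vocabulary?"""
    return [1.0 if token in vocabularies.get(t, set()) else 0.0 for t in TYPES]


def bce_with_logits(logits: List[float], targets: List[float]) -> float:
    """Mean binary cross-entropy over types, in a numerically stable form."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Toy vocabularies, for illustration only.
toy_vocabularies = {"LOC": {"Washington", "Paris"}, "PER": {"Washington"}, "ORG": {"Apple"}}

target = multi_hot("Washington", toy_vocabularies)
# Hypothetical classifier outputs for one occurrence of "Washington".
example_logits = [2.0, -1.0, 0.5, -3.0]
example_loss = bce_with_logits(example_logits, target)
print(target, round(example_loss, 4))
```

The target for "Washington" is identical wherever the word appears, so the loss rewards predicting from word identity alone.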

Translation Approach

After conducting the aforementioned experiments and some error analysis, we realized that the key issues facing us were: (1) a lack of context in the NER labels we train on (e.g. with the multi-label approach, each word has the same label irrespective of the sentence it appears in), and (2) noise in the NER labels due to limitations of our label mining methods (e.g. “Of” can be assigned the label LOC because it is the name of a city in Turkey).

Given the limited amount of properly labeled data available to us (in the form of the CoNLL-2003 dataset), we looked to the neural machine translation world for inspiration. There, a technique called back-translation[2] is used for low-resource language pairs: a model trained on the small parallel corpus translates monolingual target-language text back into the source language, and the resulting synthetic sentence pairs are added to the training data. The figure below illustrates how this works for a low-resource language pair like Nepali-English, where the task is to translate a sentence in Nepali to English.

We adopted this technique for our NER labeling task. The steps are described below.

We evaluate the two models, Model-1 (trained on CoNLL-2003) and Model-2 (trained on WikiText-2 with labels assigned by Model-1), on two test sets: (1) the CoNLL-2003 test set and (2) the WNUT-2017 test set. On the CoNLL test set, Model-1 had an F1 score of 0.91 and Model-2 lagged behind at 0.83. This was not a complete surprise: the sentences in the CoNLL test set come from a domain similar to that of the CoNLL training set, so we would expect Model-1 to perform better on it. We believe evaluation on the WNUT-2017 test set is a fairer comparison, and there Model-2 outperformed Model-1 by roughly 0.02 to 0.04 F1 points on the tags LOC (Model-1 0.55, Model-2 0.574), PER (Model-1 0.641, Model-2 0.677), and ORG (Model-1 0.200, Model-2 0.227).
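The Model-1 / Model-2 pipeline can be summarized in code. The sketch below shows only the data flow: the tagger is a trivial most-frequent-tag lookup standing in for the actual NER models, and every name and example sentence is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

Sentences = List[List[str]]


class MostFrequentTagModel:
    """Stand-in tagger: predicts each word's most frequent training tag."""

    def __init__(self, default_tag: str = "O") -> None:
        self.default_tag = default_tag
        self.tag_by_word: Dict[str, str] = {}

    def fit(self, sentences: Sentences, tag_sequences: Sentences) -> "MostFrequentTagModel":
        counts: Dict[str, Counter] = defaultdict(Counter)
        for tokens, tags in zip(sentences, tag_sequences):
            for token, tag in zip(tokens, tags):
                counts[token][tag] += 1
        self.tag_by_word = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        return self

    def predict(self, tokens: List[str]) -> List[str]:
        return [self.tag_by_word.get(t, self.default_tag) for t in tokens]


def pseudo_label_and_retrain(
    labeled: Tuple[Sentences, Sentences],
    unlabeled: Sentences,
    make_model: Callable[[], MostFrequentTagModel],
) -> Tuple[MostFrequentTagModel, MostFrequentTagModel]:
    # Model-1: trained on the small, properly labeled corpus (CoNLL in the post).
    model_1 = make_model().fit(*labeled)
    # Use Model-1 to tag the unlabeled corpus (WikiText-2 in the post).
    pseudo_tags = [model_1.predict(sentence) for sentence in unlabeled]
    # Model-2: trained only on the pseudo-labeled corpus.
    model_2 = make_model().fit(unlabeled, pseudo_tags)
    return model_1, model_2


toy_labeled = ([["Paris", "is", "nice"]], [["LOC", "O", "O"]])
toy_unlabeled = [["Paris", "is", "big"]]
model_1, model_2 = pseudo_label_and_retrain(toy_labeled, toy_unlabeled, MostFrequentTagModel)
print(model_2.predict(["Paris", "is", "big"]))
```

A lookup model cannot learn anything beyond what Model-1 tells it, so this toy shows no gain. The potential benefit comes from a contextual Model-2 seeing Model-1's labels in many new sentences from a different domain.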


Our conclusions from this project are threefold. First, the Knowledge-Augmented Language Model[1] performs poorly when used with large knowledge base vocabularies. Second, multi-label classification on top of BERT also does poorly, since it encourages the model to ignore context. Third, back-translation-style methods do well on CoNLL and generalize better to WNUT-2017 (Emerging Entities).

Following these conclusions, our main takeaways are: constructing sets of non-contextual NER tags (vocabularies) from knowledge bases is extremely challenging; over-reliance on non-contextual labels encourages the model to ignore context and degrades NER performance; and back-translation approaches do not improve in-sample results but may improve generalization from CoNLL to WNUT.


[1] Liu, Angli, Du, Jingfei, and Stoyanov, Veselin. Knowledge-Augmented Language Model and its Application to Unsupervised Named-Entity Recognition. 2019.

[2] Sennrich, Rico, Haddow, Barry, and Birch, Alexandra. Improving neural machine translation models with monolingual data. CoRR, abs/1511.06709, 2015.

[3] WikiText-2.


Paxton Maeder-York was previously a product manager at Auris Health. He has a B.S. in BME, an M.S. in CSE, and an M.B.A. from Harvard University.
