Zero-Shot Named Entity Recognition Using Question Answering

Zachariah Zhang
8 min read · Oct 16, 2022

TLDR

  • Named Entity Recognition (NER) is an important task for many natural language understanding systems.
  • Collecting labeled data is a significant barrier to applying these approaches in industry.
  • By treating NER as a question answering problem, we can collect noisy training data that yields good performance without the need for many annotations.
  • Significantly better results can be achieved when combining this approach with a small amount of human-labeled training data.

Introduction

Named Entity Recognition (NER) is a subtask of NLP in which we attempt to identify entities of interest within a document, such as people, locations, and companies. An example of the NER task is shown below, in which we would like to identify the people, places, and organizations present in the document.

NER example (credit: https://www.aiimi.com/insights/aiimi-labs-on-named-entity-recognition)

This task is a fundamental component of many natural language understanding tasks such as:

  • Identifying diseases within medical documents
  • Parsing key pieces of information from search queries
  • Building a knowledge base from unstructured documents

While many companies could leverage this approach, the need for labeled training data is a significant barrier to entry for the ML engineers developing these models in industry.

In this review, we will look at the zero-shot NER task to see how we can create domain-specific and high-performing NER models without the need for labeled data.

Data Challenges

Let’s first consider a hypothetical scenario to better understand some of the difficulties of taking a NER project from zero to one. Suppose we are tasked with building a customer service chatbot for Apple that helps people with problems related to their Apple products. One piece of this chatbot may need to identify different elements from a report, such as the type of device (iPhone, iPad, iMac, …), the observed issue (screen won’t turn on, low battery life, overheating, …), as well as other relevant information.

The machine learning team proposes to solve this problem using NER, and we need to collect labeled data to reliably train and evaluate such a model. This data collection process requires setting up an annotation tool, training human annotators, and multiple rounds of refinement before the real data collection can even begin. All of this requires a team member to drive and monitor the process across multiple sprints. Halfway through the data collection process, the product manager adds a requirement for tagging the cause of the issue as an additional entity. This requires that annotators go back and revise previous annotations to produce the final dataset.

One of the many challenges with the adoption of NER in industry is that building a labeled dataset can be costly and time consuming. Collecting data using human annotators is not only expensive but, more importantly, requires a significant time investment: the engineering time needed to define annotation guidelines, set up annotation tooling, and answer questions about the task, as well as the time it takes to annotate.

In addition, the taxonomy of entities that we would like to extract is likely to keep evolving. As our product evolves, we risk rolling our boulder of data labeling up the hill only for it to roll back down when we are asked to refactor the entity ontology.

In an industry setting, this can present a “chicken and egg” problem: ML leadership won’t want to make a large investment without knowing the value of the use case, but the value can’t be proved without data. This is especially true for start-ups, where speed to market is critical in the prioritization of work.

If we had a quick way to collect noisily labeled data, we would be able to build a POC while circumventing the most time-consuming and expensive part of the development process.

Zero-Shot NER

There are many different approaches to zero/few-shot NER surveyed in Few-Shot Named Entity Recognition: A Comprehensive Study. Some of these rely on transfer learning from open-domain models. These approaches can be a bit inflexible, and the mileage you get from them likely depends on how closely the source task matches the target task. Other approaches collect noisily labeled data using heuristics that match entities from a list. These can require a significant amount of manual wrangling to glean signal from noise, in addition to presupposing that we have access to a dictionary of entities, which may not always be the case.

In this review, I will present an alternative approach to NER data collection: synthesizing data with a question answering model. By posing NER as a question answering task, we can extract entities in a way that generalizes to new entities and entity types. These ask-to-generate approaches have been shown to perform very well in both zero-shot and few-shot settings. In addition, they provide a natural way to extend to new entity types.

In this review, we will cover:

  • A brief overview of extractive question answering
  • Common models for NER
  • Zero-shot NER model training using generated data

Special thanks to Jinhyuk Lee for answering my questions and providing some of the citations presented here. He has some interesting work, and I would encourage everyone to check it out.

Extractive Question Answering

Extractive question answering is the task of, given a question and context, identifying a relevant answer span from within the context (example below).

https://www.tensorflow.org/lite/models/modify/model_maker/question_answer

A fairly popular approach to this problem is to use a BERT model, feeding it the question and context concatenated together. A more detailed explanation is beyond the scope of this article; see the post below.

https://medium.com/analytics-vidhya/question-answering-system-with-bert-ebe1130f8def
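To make this concrete, here is a minimal sketch of extractive QA using the Hugging Face transformers pipeline. The model name is just one publicly available SQuAD-style checkpoint; any extractive QA model would work in its place.

```python
from transformers import pipeline

# Any extractive QA checkpoint works here; this SQuAD2-tuned model is one
# publicly available choice.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

context = (
    "Led Zeppelin were an English rock band formed in London in 1968. "
    "The group consisted of Jimmy Page, Robert Plant, John Paul Jones, "
    "and John Bonham."
)

result = qa(question="Who formed Led Zeppelin?", context=context)
# The answer is always a span copied from the context, with a score.
print(result["answer"], result["score"])
```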

Question Prompting for Zero/Few Shot NER

In Learning Dense Representations of Phrases at Scale, the authors show how a generic question answering model can be applied to zero-shot entity recognition and relation detection tasks. They do so by prompting the model with a question that is likely to identify the entities of interest. Each query is provided in the form of “{subject entity} [SEP] {relation}” and the answer is the object entity.

Below I have shown an example of how this can be applied to extracting people and locations on the Led Zeppelin Wikipedia page. By posing our task as a natural language question, we are able to identify the members of the band through the “which people?” query.

Question answering based NER applied to Wikipedia article

This exploits many of the advances within the extractive question answering literature. Importantly, it is trivial to adapt this approach to new entity types. If we wanted to add a “music genre” entity, we would just need to add an additional question prompt.
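Here is a rough sketch of how this might look in practice with an off-the-shelf extractive QA model: one question per entity type, with the top-k candidate spans filtered by score. The prompts, threshold, and model choice are all illustrative assumptions on my part, not the exact setup from the paper.

```python
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

# One natural-language prompt per entity type. Extending the tagger to a
# new type (e.g. music genre) is just one more dictionary entry.
ENTITY_PROMPTS = {
    "PERSON": "Which people?",
    "LOCATION": "Which places?",
    "GENRE": "Which music genres?",  # hypothetical added entity type
}

def extract_entities(context: str, threshold: float = 0.1):
    entities = []
    for entity_type, question in ENTITY_PROMPTS.items():
        # top_k > 1 makes the pipeline return a list of candidate spans.
        for cand in qa(question=question, context=context, top_k=5):
            if cand["score"] >= threshold:
                entities.append((cand["answer"], entity_type, cand["score"]))
    return entities
```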

This model can then be further fine-tuned on labeled data by maximizing the probability of each annotated entity. The authors show strong results with this approach, achieving state-of-the-art performance after fine-tuning on 5k and 10k examples respectively.

In Question Answering Infused Pre-training of General-Purpose Contextualized Representations, the authors expand on this work with the addition of:

  • A large dataset of 80 million noisy (question, answer, context) triplets generated with a BART model.
  • First training a more powerful teacher model and then distilling it into the final model, as shown in Cross-Architecture Knowledge Distillation.

They far exceed other approaches on NER benchmarks in both zero-shot and few-shot settings. In their few-shot experiments, they fine-tune the trained model using only 5 examples per entity type. Code and models for reproducing the results are available at https://github.com/facebookresearch/quip
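The distillation step boils down to the standard teacher-student objective: train the student to match the teacher’s softened output distribution. Below is a generic sketch of that loss in PyTorch. The exact QuIP training recipe (what is distilled, temperatures, auxiliary losses) differs, so treat this as an illustration of the idea only.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KD loss: KL divergence between the teacher's softened
    distribution and the student's. Scaling by T^2 keeps gradient
    magnitudes comparable across temperatures."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(
        student_log_probs, teacher_probs, reduction="batchmean"
    ) * (t * t)
```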

Building a NER Dataset from Scratch

Simple Questions Generate Named Entity Recognition Datasets presents a pipeline that uses the models discussed above to build a NER model from scratch with question-answering-generated data. Their GeNER approach is illustrated below.

Fig1 from Simple Questions Generate Named Entity Recognition Datasets

This pipeline comprises four steps:

1. Query Formulation

First, we need a natural language question that expresses our NER need (e.g., “What diseases?”).

2. Retrieval

For each question prompt, the authors apply an open-domain QA model to retrieve candidate entities across a corpus. They use DensePhrases to search over all Wikipedia articles, generating a set of candidate sentences and entities.
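A sketch of the retrieval step, following the usage shown in the DensePhrases README. The checkpoint name and index path are placeholders that depend on which artifacts you download from the princeton-nlp/DensePhrases repo, and the exact arguments may differ across versions.

```python
from densephrases import DensePhrases

# Checkpoint and phrase-index paths are placeholders; see the
# princeton-nlp/DensePhrases repo for the actual artifacts to download.
model = DensePhrases(
    load_dir="princeton-nlp/densephrases-multi-query-multi",
    dump_dir="/path/to/densephrases-multi_wiki-20181220/dump",
)

# Retrieve candidate entity phrases for our NER question across Wikipedia.
phrases = model.search("What diseases?", retrieval_unit="phrase", top_k=10)

# Retrieving at the sentence level also gives us the surrounding
# sentences to use as training examples.
sentences = model.search("What diseases?", retrieval_unit="sentence", top_k=10)
```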

3. Dictionary Matching

As one might imagine, the initial data is likely very noisy. The authors apply a set of normalization rules to clean some of the noise from the data: for example, removing punctuation and trimming whitespace, and normalizing leading articles, since some entity types are annotated with a leading “the” while others are not.
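The paper defines its own rule set; below is a toy sketch of what such normalization rules might look like. The specific rules here (punctuation stripping, whitespace trimming, optional leading-article handling) are illustrative, not GeNER’s exact list.

```python
import string

def normalize_entity(span: str, keep_leading_the: bool = False) -> str:
    """Toy normalization for a candidate entity span: trim whitespace,
    strip surrounding punctuation, and optionally drop a leading 'the'
    for entity types annotated without it."""
    span = span.strip().strip(string.punctuation + " ")
    if not keep_leading_the and span.lower().startswith("the "):
        span = span[len("the "):]
    return span

print(normalize_entity("the Parkinson's disease,"))  # "Parkinson's disease"
```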

4. Self-training

A standard BERT NER model is trained on this generated data.
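To train a standard BERT tagger, the matched (sentence, entity span) pairs first need to be converted to token-level tags, typically in BIO format. A minimal sketch of that conversion follows; the token-index span convention and label names are my own assumptions.

```python
def to_bio_labels(tokens, entities):
    """Convert (start, end, type) token spans (end exclusive) into
    BIO tags aligned with the token list."""
    labels = ["O"] * len(tokens)
    for start, end, ent_type in entities:
        labels[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"
    return labels

tokens = ["Aspirin", "is", "used", "to", "treat", "headaches", "."]
print(to_bio_labels(tokens, [(0, 1, "DRUG"), (5, 6, "DISEASE")]))
# ['B-DRUG', 'O', 'O', 'O', 'O', 'B-DISEASE', 'O']
```

From there, any standard token-classification setup (e.g. a BERT encoder with a per-token softmax over these labels) can be trained on the generated examples.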

Results

The authors apply their approach (GeNER) to a variety of NER tasks using only open-domain Wikipedia data for training. They show results that are fairly competitive with fully supervised methods, without using either labeled data or an entity ontology.

Conclusion

NER is a fundamental task within NLP that is useful across a wide range of applications. However, acquiring labeled data represents a significant barrier to entry for many companies that might want to apply it to their products.

In this review, we showed how to perform zero-shot NER by re-posing the NER task as an extractive question answering problem. This allows us to use off-the-shelf QA models to collect noisily labeled training data in a way that is both automatic and adaptable to new entities and entity types. Models trained on this noisily labeled data are competitive with fully supervised approaches, especially when combined with a small amount of “gold” labeled data.
