Efficient Extractive Question Answering on CPU using QUIP

Zachariah Zhang
Oct 24, 2022



TLDR — Extractive question answering is an important task for providing a good user experience in many applications. The popular Retriever-Reader framework for QA using BERT can be difficult to scale as it requires the re-processing of candidate documents in the context of a question in real-time. By using phrase embeddings, we can process questions and context independently which drastically reduces runtime demands. In a limited experiment, I found QUIP to be 4x faster than a comparable QA model on CPU.

Extractive Question Answering

Extractive question answering is the task of identifying a subsequence of text from a document that is relevant to a given natural language question. Most people are likely familiar with the highlighted answer snippets in Google search results as an example of this. It is an important feature for a good user experience in any domain where we want to search over a corpus of long documents.

Typical Question Answering Models

The Retriever-Reader framework for open-domain QA (figure source: https://lilianweng.github.io/lil-log/2020/10/29/open-domain-question-answering.html#ICT-loss)

The majority of extractive QA literature focuses on the Retriever-Reader framework.

Retriever — The role of the retriever model is to efficiently narrow a large corpus of documents to a small number of candidate documents that are likely to contain the answer.

This is typically implemented as a bi-encoder model, which encodes questions and documents into embeddings and models relevance as the cosine similarity between them.

Bi-encoder vs. cross-encoder architectures (figure source: https://www.sbert.net/examples/applications/cross-encoder/README.html)
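To make the retriever concrete, here is a minimal sketch of bi-encoder retrieval using the sentence-transformers library. The checkpoint name is just a common off-the-shelf choice for illustration, not a model used in any of the papers discussed here.

```python
from sentence_transformers import SentenceTransformer, util

# Minimal bi-encoder retrieval sketch; the checkpoint is an illustrative choice.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

question = "who was the lead singer of led zeppelin?"
documents = [
    "Led Zeppelin were an English rock band formed in London in 1968 ...",
    "Queen are a British rock band formed in London in 1970 ...",
]

# Document embeddings depend only on the documents, so they would normally be
# computed once at indexing time and stored.
doc_embs = model.encode(documents, convert_to_tensor=True)
q_emb = model.encode(question, convert_to_tensor=True)

# Relevance is modeled as cosine similarity between question and document embeddings.
scores = util.cos_sim(q_emb, doc_embs)[0]
print(scores)
```

Because the document embeddings do not depend on the question, they can be computed once at indexing time and reused for every query.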

Reader — The reader extracts the answer span from a candidate document given the question and document as input. It is important to note here that these models are a function of the question and the context together.
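For contrast, here is an equally minimal reader sketch using the Hugging Face question-answering pipeline, with the same deepset/roberta-large-squad2 checkpoint that appears in the speed comparison later in this post. Note that the question and context are fed to the model together, so nothing on the document side can be pre-computed.

```python
from transformers import pipeline

# Cross-encoder reader: question and context are jointly encoded at query time.
reader = pipeline("question-answering", model="deepset/roberta-large-squad2")

result = reader(
    question="who was the lead singer of led zeppelin?",
    context="Led Zeppelin were an English rock band formed in London in 1968. "
            "The group comprised vocalist Robert Plant, guitarist Jimmy Page, ...",
)
print(result["answer"], result["score"])
```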

Computational Challenges with Retriever-Reader Models

An issue with the Retriever-Reader framework is that the Reader model can create a significant computational bottleneck. Retrieval is very fast during query time because we can compute and store representations for all documents in the corpus during indexing time.

Reading comprehension models, in contrast, require a forward pass through a BERT model not just for the question but for every candidate document from which we want to extract an answer. This means we may have to process thousands of tokens with BERT in real time.

This can be mitigated with GPUs, but they are expensive and, for many companies, not available in production. In this review, I introduce phrase embeddings as an alternative approach to extractive QA that poses answer extraction as an embedding nearest-neighbor problem.

Phrase Embeddings for Efficient Question Answering

Phrase Embeddings

Phrase embeddings are vector representations of a specific span of text in the context of a longer document. The idea was introduced in the Dense-Sparse Phrase Index (DenSPI) for doing efficient QA at scale. Rather than storing a single embedding per document, we store many embeddings, each corresponding to a different span of text within the document. A question can then be scored against a candidate answer phrase using the cosine similarity of their embeddings.

To search within a document, we simply find the phrase embedding most similar to the question embedding using approximate nearest neighbor search (sketched below).
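The snippet below is a rough sketch of that lookup: it builds a FAISS index over phrase embeddings and retrieves the spans closest to a question embedding. The shapes are arbitrary and the embeddings are random stand-ins for pre-computed phrase embeddings from a model like DenSPI or QUIP.

```python
import faiss
import numpy as np

# Sketch of phrase-level nearest-neighbor search; random stand-in embeddings.
dim, num_phrases = 768, 50_000

phrase_embs = np.random.rand(num_phrases, dim).astype("float32")
faiss.normalize_L2(phrase_embs)        # so inner product == cosine similarity

index = faiss.IndexFlatIP(dim)         # exact search; swap for an ANN index at scale
index.add(phrase_embs)

question_emb = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(question_emb)

scores, phrase_ids = index.search(question_emb, 5)
print(phrase_ids[0])                   # ids of the top-scoring candidate answer spans
```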

Phrase embedding question answering allows us to do all the computation on documents during indexing time which dramatically reduces the number of tokens BERT processes during runtime.

Several approaches in the ML literature have achieved good results with this idea, including DenSPI and DensePhrases. Next, we will take a deeper dive into QUIP, a more recent model that achieves impressive results on a variety of QA tasks.

QUIP: Question Answering Infused Pre-training of General-Purpose Contextualized Representations

QUIP (Question Answering Infused Pre-training of General-Purpose Contextualized Representations) currently represents the state of the art in phrase-embedding extractive question answering.

Model Architecture

QUIP first generates start and end embeddings for a question using two different heads on top of the [CLS] token. The most likely start and end tokens are predicted by taking the inner product of these embeddings with the embedding of each token in a given context. The span that maximizes p(start = i) * p(end = j) is chosen as the answer span. Note that the embeddings for the question are independent of the embeddings for the context, meaning we can pre-compute the context embeddings for every document in our corpus. This is the critical difference between QUIP and standard BERT QA models, and it is what allows QUIP to be applied efficiently.

Visualization of QUIP model architecture
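To make the scoring step concrete, here is a small sketch of QUIP-style span selection, assuming the per-token context embeddings were pre-computed at indexing time. Random vectors stand in for real model outputs, and the simple loop is an illustration rather than the paper's implementation.

```python
import numpy as np

# Illustrative sketch of QUIP-style span selection (not the official implementation).
rng = np.random.default_rng(0)
dim, num_tokens, max_span_len = 768, 120, 10

context_token_embs = rng.normal(size=(num_tokens, dim))  # pre-computed per-token embeddings
q_start_emb = rng.normal(size=dim)                       # from the question's start head
q_end_emb = rng.normal(size=dim)                         # from the question's end head

# Inner-product scores for every token being the start / end of the answer.
start_scores = context_token_embs @ q_start_emb
end_scores = context_token_embs @ q_end_emb

# Pick the span (i, j) with i <= j < i + max_span_len that maximizes the summed
# scores (equivalent to maximizing p(start=i) * p(end=j) in log space).
best_score, best_span = -np.inf, (0, 0)
for i in range(num_tokens):
    j_max = min(num_tokens, i + max_span_len)
    j = i + int(np.argmax(end_scores[i:j_max]))
    score = start_scores[i] + end_scores[j]
    if score > best_score:
        best_score, best_span = score, (i, j)

print("Answer span token indices:", best_span)
```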

Generating Data

Another important contribution of the paper is showing how huge amounts of noisily labeled data can be generated to increase model robustness. The authors train a BART seq2seq model on a wide range of question-answering datasets and use it to generate (context, question, answer) triplets.

Example of using BART for generating noisy labels for model training

The authors generate 10 questions from each of 2 million passages in each of 4 domains (books, Wikipedia, Common Crawl, and stories), yielding a pre-training dataset with a total of 80 million (question, answer, context) triples.
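Purely as an illustration of the shape of this generation loop, the sketch below uses the generic facebook/bart-large checkpoint as a placeholder. The paper trains its own question-generation model, which this checkpoint does not reproduce, so the text generated here would not be meaningful.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical sketch of noisy QA-pair generation; "facebook/bart-large" is only a
# placeholder for the paper's fine-tuned question-generation model.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")

passage = ("Led Zeppelin were an English rock band formed in London in 1968. "
           "The group comprised vocalist Robert Plant, guitarist Jimmy Page, ...")

inputs = tokenizer(passage, return_tensors="pt", truncation=True)
outputs = model.generate(
    **inputs,
    num_return_sequences=10,   # the paper generates 10 questions per passage
    do_sample=True,
    max_new_tokens=64,
)
questions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```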

Results

In addition to a wide range of other applications, the authors show some fairly compelling results on the SQuAD question-answering benchmark.

Practical Discussion of Storage

A significant downside of these approaches is that we need to store an embedding for each token in the corpus. A relatively modest corpus of tens of thousands of documents can contain millions of tokens, whose embeddings can easily exhaust available memory. Here I offer some practical tips for users looking to apply this approach in industry.

  • Compression Techniques — There are many techniques and tools for storing embeddings using fewer bits, including product quantization (see the sketch after this list). In my experience, embeddings can often be compressed more than 10x with little loss in performance.
  • Heuristic-Based Filtering — We can choose to store or not store embeddings based on a set of heuristic rules. For example, we may only consider answers at natural text boundaries such as periods or line breaks.
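As a rough illustration of the compression point above, here is how phrase embeddings could be compressed with product quantization in FAISS. The dimensions and quantizer settings are arbitrary assumptions: 768-dimensional float32 embeddings take 3,072 bytes each, while 96 sub-quantizers at 8 bits each store 96 bytes, roughly a 32x reduction.

```python
import faiss
import numpy as np

# Sketch of compressing phrase embeddings with product quantization (PQ).
dim, num_phrases = 768, 100_000
phrase_embs = np.random.rand(num_phrases, dim).astype("float32")

index = faiss.IndexPQ(dim, 96, 8)   # 96 sub-quantizers, 8 bits each -> 96 bytes/vector
index.train(phrase_embs)            # learn the PQ codebooks
index.add(phrase_embs)              # store compressed codes instead of raw floats

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 5)
print(ids[0])
```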

QUIP Example

Let’s take a look at how this model can be applied to real data using the model published as part of the QUIP paper (https://github.com/facebookresearch/quip).

Question Answering Example

Let’s first see the model itself being applied to our running example.

Question: “who was the lead singer of led zeppelin?”

Context: Led Zeppelin were an English rock band formed in London in 1968. The group comprised vocalist Robert Plant, guitarist Jimmy Page, bassist/keyboardist John Paul Jones, and drummer John Bonham. With a heavy, guitar-driven sound, they are cited as one of the progenitors of hard rock and heavy metal, although their style drew from a variety of influences, including blues and folk music. Led Zeppelin have been credited as significantly impacting the nature of the music industry, particularly in the development of album-oriented rock (AOR) and stadium rock.

Predicted Answer: ‘Robert Plant’

Runtime Speed Comparison

I compare the runtime speed of QUIP with a more traditional cross-encoder reading comprehension model, deepset/roberta-large-squad2, which uses a base encoder of the same size as QUIP's. I limit both models to 4 torch threads, as is common practice in real applications, and run inference on CPU. For QUIP, I pre-compute the embeddings for the context, as we would at indexing time.
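This was a quick, informal measurement rather than a rigorous benchmark. The sketch below shows the shape of the comparison: the cross-encoder re-encodes the question and context together, while the QUIP-style path only needs to encode the short question and score it against pre-computed context embeddings. Here roberta-large is used purely as a size-matched stand-in for the QUIP encoder, and random arrays stand in for the pre-computed embeddings, so the printed numbers will not match mine exactly.

```python
import time

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

torch.set_num_threads(4)   # limit CPU threads, as described above

question = "who was the lead singer of led zeppelin?"
context = ("Led Zeppelin were an English rock band formed in London in 1968. "
           "The group comprised vocalist Robert Plant, guitarist Jimmy Page, ...")

# (1) Cross-encoder reader: re-encodes question + context together at query time.
reader = pipeline("question-answering", model="deepset/roberta-large-squad2", device=-1)
t0 = time.perf_counter()
reader(question=question, context=context)
print(f"reader: {1000 * (time.perf_counter() - t0):.0f} ms")

# (2) QUIP-style stand-in: encode only the question, then score it against
# pre-computed per-token context embeddings (random arrays used here).
tok = AutoTokenizer.from_pretrained("roberta-large")
enc = AutoModel.from_pretrained("roberta-large")
ctx_embs = np.random.rand(512, 1024).astype("float32")
t0 = time.perf_counter()
with torch.no_grad():
    q_emb = enc(**tok(question, return_tensors="pt")).last_hidden_state[0, 0].numpy()
scores = ctx_embs @ q_emb
print(f"question encoding + span scoring: {1000 * (time.perf_counter() - t0):.0f} ms")
```

On my machine, the comparison came out as: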

QUIP — 104ms
Roberta Large — 419ms

We observe more than a 4x speedup when using QUIP over a standard model. This is, of course, an extremely limited experiment on a single example; it does not account for other runtime considerations such as quantization, nor for the cost of storing pre-computed embeddings for every token in a corpus. However, it does illustrate the potential of these models to offer a significant speedup by allowing us to pre-compute representations of documents.

Conclusion

There seems to be a gap between ML research and industry in extractive question answering: many published models are difficult to run under realistic industry conditions. Many Retriever-Reader style approaches need to process candidate documents in the context of a question at query time, which can drastically slow down inference.

Phrase embedding approaches are an alternative in which we pre-compute representations of context at indexing time and model answer relevance through cosine similarity at query time. While there is much less work in this area, these approaches have proven competitive with their more computationally expensive counterparts.

