A Primer on Open-Domain Question Answering (ODQA) — Part 1

Neeraj Varshney
10 min read · Jul 23, 2022


The Question Answering (QA) task requires developing systems that can answer questions posed by humans in natural language. In Open-Domain Question Answering (ODQA), questions can be about nearly any topic and answering them relies on world knowledge. The challenge in ODQA is that the context containing information relevant to the question is not provided. This is in contrast to the standard reading comprehension task (such as SQuAD), in which a passage containing the answer span is given along with the question. ODQA is therefore more realistic and requires the system to retrieve text relevant to the question. This text can be retrieved from unstructured sources (such as web documents, books, news articles, and Wikipedia), structured sources (such as tables, graphs, and knowledge bases), or other modalities (such as images and videos). ODQA typically focuses on factoid questions, which have short and concise answers, as opposed to long-form or non-factoid questions.

Examples of Open-domain questions:

  • Where was Barack Obama born?
    Answer: Honolulu
  • Where does the energy in a nuclear explosion come from?
    Answer: high-speed nuclear reactions
  • How many teams played in the 2018 FIFA world cup?
    Answer: 32

Interesting Applications of ODQA:

  • Answer box in Google Search
Figure 1: Answer box in Google Search.
  • Answering user questions in digital assistants such as Alexa and Siri.

Check out my paper on open-domain QA: Can Open-Domain QA Reader Utilize External Knowledge Efficiently like Humans?

Preliminaries:

  • Stanford Question Answering Dataset (SQuAD):
    SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding passage.
    SQuAD has 87k training and 10k evaluation data instances.
    — Each instance consists of a passage, a question, and an answer span from the passage.
  • Using BERT for Extractive Question Answering
Figure 2: Using BERT for Extractive Question Answering.

— For Extractive QA, BERT needs to highlight a “span” of text containing the answer, i.e., it simply needs to predict which token marks the start of the answer and which token marks the end.
— Each training instance consists of a passage, a question, and an answer span. The model learns to predict the answer span (its start and end tokens) using two separate dense layers on top of BERT: for each passage token, the first layer predicts the probability of that token being the start of the answer span and the second predicts the probability of it being the end. The vector representation of each token from BERT is fed to these two dense layers to compute the probabilities. During training, the log-likelihood of the correct start and end positions is optimized independently.
— At inference time, the overall prediction probability of a span is calculated by taking the average of the start probability of its first token and the end probability of its last token.
— Finally, a softmax is taken over the scores of all candidate spans, and the span with the highest probability is output as the answer.
— Interested readers can refer to this article for a detailed walkthrough of the training process and can also check out this Colab notebook. A minimal code sketch of the span-prediction setup follows.
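
As a minimal illustration of this span-prediction setup (not the exact code from any of the papers or the linked notebook), the sketch below uses the Hugging Face transformers library with a SQuAD-fine-tuned BERT checkpoint; the checkpoint name is an assumed example, and any SQuAD-tuned extractive QA model would work.

```python
# Minimal sketch of extractive QA with BERT: predict the start and end tokens of
# the answer span within a given passage.
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_name = "deepset/bert-base-cased-squad2"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

question = "Where was Barack Obama born?"
passage = "Barack Obama was born in Honolulu, Hawaii."

inputs = tokenizer(question, passage, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # start_logits and end_logits: one score per token

# Take the most likely start and end positions and decode the span between them.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start : end + 1])
print(answer)  # expected: "Honolulu, Hawaii" or a sub-span of it
```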

  • Approaches for ODQA:
    — Two-stage Retriever-Reader approaches: relevant documents are retrieved in the first stage, and then a reader model extracts the answer from the retrieved documents.
    — Retriever-Generator approaches: retrieved documents are fed to a generative model that produces the answer.
    — Retrieval-Free (Generator) approaches: the model answers directly from knowledge stored in its parameters, without any retrieval.
  • Evaluation Metrics (a small code sketch of these metrics follows this list):
    — For the retrieval system: Top-K Recall, the fraction of questions for which the correct answer appears in any of the top K retrieved documents/segments.
    — For the reader system: EM (exact string match with the correct answer) and F1 score (the harmonic mean of precision and recall at the token level).
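
A minimal sketch of these metrics, assuming SQuAD-style string answers (the official SQuAD evaluation script additionally strips articles such as “a”/“the” before comparison, which is omitted here):

```python
import string
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase, drop punctuation, and split on whitespace.
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return text.split()

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)  # overlapping tokens
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def top_k_recall(gold: str, retrieved_docs: list[str], k: int) -> float:
    # 1.0 if the answer string appears in any of the top-k retrieved documents.
    return float(any(gold.lower() in doc.lower() for doc in retrieved_docs[:k]))

print(exact_match("Honolulu, Hawaii", "Honolulu"))         # 0.0
print(round(f1_score("Honolulu, Hawaii", "Honolulu"), 2))  # 0.67
```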

In this part of the article series, we’ll discuss four papers that are important on the road to understanding more recent approaches (covered later in this series).

Paper 1: Reading Wikipedia to Answer Open-Domain Questions (Chen et al., ACL 2017)

Key Contributions:

  • Splitting the Open-domain QA system into Retriever and Reader stages and using Wikipedia to retrieve relevant documents.
  • Using SQuAD data for evaluating ODQA systems: SQuAD provides the relevant passage along with the question; in ODQA, however, the relevant documents need to be retrieved. They use SQuAD for evaluating ODQA systems by not allowing access to the given paragraphs of the evaluation instances, thus forcing the system to retrieve relevant documents.
  • Proposed to use Distant Supervision to compile more training data (in addition to SQuAD training data) from other QA datasets.

Summary:

  • Uses Wikipedia as the knowledge source for open-domain questions, under the assumption that the answer to a factoid question appears as a text span in some Wikipedia article. For each page, they extract only the plain text, i.e., all structured data such as lists and figures is removed.
    Total number of articles: ~5M.
  • Proposed System: DrQA (pronounced as DoctorQA) consists of two components: (1) Document Retriever (for finding relevant articles) and (2) Document Reader (for extracting answer spans from retrieved documents).
Figure 3: DrQA system for open-domain question answering.
  • The proposed document retriever uses bigram hashing and TF-IDF matching; the document reader is a multi-layer RNN model.
    Interested readers can find details about these methods in the original paper. We skip them in this article as better methods have since been developed (such as DPR for retrieval and BERT/T5 for the reader). A rough sketch of a TF-IDF-style retriever is shown after this list.
  • Evaluation of the proposed method is done using the SQuAD data. The SQuAD dataset contains (question, answer, paragraph) triplets. The document reader model is trained using the SQuAD training data (question, answer, paragraph). However, at evaluation time, the paragraphs retrieved by the retriever model are given as context to the reader model (instead of directly giving the corresponding passage that contains the answer span), and performance is calculated based on the model’s ability to find the correct answer from the retrieved documents. Therefore, the overall performance depends both on the recall of the retriever and on the ability of the reader to extract the answer from the retrieved documents.
  • The proposed retriever model achieves a good top-5 recall score, i.e., the top 5 retrieved documents often contain the correct answer. Furthermore, the reader model that utilizes the retrieved candidates of the retriever model also achieves good EM and F1 performance.
  • They also explore using Distant Supervision to compile more data (in addition to SQuAD) for training the document reader.
    — SQuAD has (question, answer, passage) triplets, but other QA datasets such as CuratedTREC, WebQuestions, and WikiMovies contain only (question, answer) pairs and therefore cannot be directly used to train the reader model.
    — In distant supervision, they automatically create (question, answer, passage) triplets from the (question, answer) pairs of these datasets. To achieve this, they first run their document retriever on the question to retrieve the top 5 Wikipedia articles, and then filter out paragraphs from these articles using heuristics, e.g., discarding paragraphs in which the answer span does not appear or in which a named entity mentioned in the question is missing. Table 1 shows examples of retrieved paragraphs.
    — Creating more training data using this approach helps the reader model achieve higher performance.
Table 1: Example training data compiled using distant supervision.
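
DrQA’s retriever combines TF-IDF weighting with hashed bigram features. As a rough, simplified sketch of the same idea (unigrams plus bigrams with TF-IDF, no hashing), one could use scikit-learn as below; the document list is a toy stand-in for the ~5M Wikipedia articles.

```python
# Simplified sketch of a TF-IDF (unigram + bigram) document retriever in the
# spirit of DrQA's Document Retriever (which additionally hashes bigram features).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [  # toy stand-in for Wikipedia articles
    "Barack Obama was born in Honolulu, Hawaii.",
    "The 2018 FIFA World Cup was contested by 32 national teams.",
    "Nuclear explosions release energy from high-speed nuclear reactions.",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the top-k documents ranked by TF-IDF cosine similarity."""
    question_vector = vectorizer.transform([question])
    scores = cosine_similarity(question_vector, doc_vectors)[0]
    top_indices = scores.argsort()[::-1][:k]
    return [documents[i] for i in top_indices]

print(retrieve("Where was Barack Obama born?"))
```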

Paper 2: End-to-End Open-Domain Question Answering with BERTserini (Yang et al., NAACL 2019)

Summary:

  • Proposed a Retriever-Reader model that uses the Anserini IR toolkit as the retriever and a BERT model as the reader.
  • They explore retrieving the following granularities of text:
    Article: 5.08M Wikipedia articles (same as Paper 1)
    Paragraph: articles are segmented into 29.5M paragraphs
    Sentence: paragraphs are segmented into 79.5M sentences
  • k segments are retrieved (of one of the above granularities) and the BERT reader is applied to them.
  • Another difference from Paper 1 is that, at inference time, the retrieval score (given by Anserini) is also used along with the reader score (BERT’s prediction score for a span to be the answer) to calculate the overall prediction score for a span:

    S = (1 − μ) · S_Anserini + μ · S_BERT

where μ is a hyperparameter that weights the two scores. A toy sketch of this aggregation follows Figure 4.

Figure 4: Architecture of BERTserini.
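
A toy sketch of this aggregation, with made-up candidate spans and scores that are assumed to already be on comparable scales (the paper interpolates the raw Anserini and BERT scores with the hyperparameter μ):

```python
# Toy sketch of BERTserini-style score aggregation: the final score of a span
# interpolates the retrieval score of its segment with the reader's span score.
MU = 0.5  # hyperparameter weighting reader vs. retriever scores

# (candidate answer, retriever score for its segment, reader score for the span)
# Illustrative numbers only, placed on comparable scales for readability.
candidates = [
    ("Honolulu", 0.82, 0.91),
    ("Hawaii", 0.80, 0.85),
    ("Chicago", 0.88, 0.20),
]

def overall_score(retriever_score: float, reader_score: float) -> float:
    return (1 - MU) * retriever_score + MU * reader_score

best_answer, _, _ = max(candidates, key=lambda c: overall_score(c[1], c[2]))
print(best_answer)  # "Honolulu"
```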

Paper 3: Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering (Wang et al., EMNLP 2019)

Summary:

  • Multi-Passage BERT: Using BERT when there are multiple passages in the context
    — The BERT model described in the Preliminaries section works well when there is only one passage per question. However, in ODQA, multiple passages are retrieved and need to be used by the reader.
    — In this scenario, BERT can make predictions for each passage independently, but normalizing span probabilities within each passage separately makes the answer scores incomparable across passages.
    — To tackle this issue, all passages are still processed independently, but with one important difference: to allow comparison and aggregation of results from different passages, the final softmax layer over the answer spans of a single passage is removed (Clark and Gardner, 2018).
    — This is done so that the prediction probabilities do not get normalized at the passage level, i.e., normalization is done globally over the spans from all passages of a question (a toy sketch follows this list).
  • Also explores Passage Ranker that reranks all the retrieved passages, and selects a list of high-quality passages for the multi-passage BERT model.
    — Training the passage ranker: the passage ranker is another BERT model, similar to multi-passage BERT except that at the output layer it predicts a single score for each passage based on the vector representation of the first token, [CLS].
    — The training data for this ranker can be collected using distant supervision (1 for the passages that contain the answer and 0 for the ones that do not contain the answer).
    — They apply softmax over all passage scores corresponding to the same question, and train to maximize the log-likelihood of passages containing the correct answers.
    — Finally, the overall score of a span is calculated by multiplying the passage ranker’s score with the multi-passage BERT probability for that span.
  • They also explore the impact of passage granularity i.e. how the articles should be split — passages, fixed-length windows, or sentences.
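
A toy sketch of the difference between per-passage and global normalization of span scores (the numbers are made up; in the actual model the scores come from BERT’s start/end predictions):

```python
# Toy sketch: per-passage vs. global normalization of candidate span scores.
import torch

# Raw span scores (e.g., start_logit + end_logit) for the top candidate spans
# in each of three retrieved passages for the same question.
span_scores_per_passage = [
    torch.tensor([4.2, 1.3]),  # passage 1
    torch.tensor([2.8, 2.5]),  # passage 2
    torch.tensor([5.0, 0.7]),  # passage 3
]

# Per-passage softmax: probabilities are only comparable within a passage.
per_passage_probs = [torch.softmax(s, dim=0) for s in span_scores_per_passage]

# Global softmax (multi-passage BERT): one normalization over all candidate
# spans from all passages, so probabilities are comparable across passages.
all_scores = torch.cat(span_scores_per_passage)
global_probs = torch.softmax(all_scores, dim=0)

print(per_passage_probs)
print(global_probs)
```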

Experiments and Results:

  • They use Elasticsearch with the BM25 algorithm as their retriever.
  • During training, they use top-10 passages for each question plus all passages (within the top-100 list) containing correct answers.
  • During inference, they use top-30 passages for each question.
  • Result: Effect of passage granularity — They split each article into non-overlapping passages based on a fixed length {50, 100, 200}. They found that, compared to single-sentence passages, leveraging fixed-length passages works better, and passages of 100 words work best.
  • Result: Effect of sliding window — Splitting articles into non-overlapping passages may force some near-boundary answer spans to lose useful context. To deal with this issue, they split articles into overlapping passages using a sliding window. They set the window size to 100 words and the stride to 50 words (half the window size). This also brings some performance improvement (a sketch of this splitting appears after this list).
  • Result: Effect of passage ranker — The retriever returns top-100 passages for each question. Then, the passage ranker is employed to rerank these 100 passages. Finally, multi-passage BERT takes top-30 reranked passages as input to extract the final answer. Furthermore, using passage score on these top-30 reranked passages gives further improvement.
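
A minimal sketch of the sliding-window splitting described above (100-word windows with a 50-word stride); splitting on whitespace is a simplification:

```python
# Split an article into overlapping fixed-length passages with a sliding window.
def split_into_passages(article: str, window: int = 100, stride: int = 50) -> list[str]:
    words = article.split()
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start : start + window]))
        if start + window >= len(words):
            break  # this window already covers the end of the article
    return passages

toy_article = "word " * 230  # toy article of 230 identical words
print([len(p.split()) for p in split_into_passages(toy_article)])
# [100, 100, 100, 80] -- consecutive passages overlap by 50 words
```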

Paper 4: How Much Knowledge Can You Pack Into the Parameters of a Language Model? (Roberts et al., EMNLP 2020)

Summary:

  • Explored the “closed-book” question answering setting, i.e., at inference time the system doesn’t have access to any external context or knowledge. It needs to leverage the knowledge stored in its parameters (acquired during pre-training and fine-tuning). A small sketch of this setting appears after Figure 5.
  • The other setting where the model can retrieve knowledge from external sources is called open-book QA.
  • They experiment with T5 models and show that this approach performs competitively with open-domain systems that explicitly retrieve documents from an external knowledge source when answering questions.
  • They also explore using salient span masking (SSM) as the pre-training objective. This approach first uses BERT to mine sentences that contain salient spans (named entities and dates) from Wikipedia. The question-answering model is then pre-trained to reconstruct masked-out spans from these sentences, which could help the model “focus on problems that require world knowledge”. They use the same SSM data and objective to continue pre-training the T5 checkpoints for 100,000 additional steps before fine-tuning for question answering, and show that this further improves performance.
  • Colab notebook to try out this model.
Figure 5: T5 is pre-trained to fill in dropped-out spans of text (denoted by <M>) from documents in a large, unstructured text corpus. They fine-tune T5 to answer questions without inputting any additional information or context. This forces T5 to answer questions based on “knowledge” that it internalized during pre-training.
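
As a minimal illustration of closed-book QA (not the authors’ exact evaluation code), the sketch below feeds a question to a T5 model with no accompanying context; the checkpoint name is an assumption here and refers to one of the publicly released closed-book (SSM) models.

```python
# Minimal sketch of closed-book QA: the question is the entire input, so the
# answer must come from knowledge stored in the model's parameters.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/t5-small-ssm-nq"  # assumed example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

question = "Where was Barack Obama born?"
inputs = tokenizer(question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # e.g. "Honolulu"
```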

References

Part 1

  1. Reading Wikipedia to Answer Open-Domain Questions (Chen et al., ACL 2017).
  2. End-to-End Open-Domain Question Answering with BERTserini (Yang et al., NAACL 2019).
  3. Multi-passage BERT: A Globally Normalized BERT Model for Open-domain Question Answering (Wang et al., EMNLP 2019).
  4. How Much Knowledge Can You Pack Into the Parameters of a Language Model? (Roberts et al., EMNLP 2020).

In Part 2 (ETA: Aug 15), we’ll cover the following papers:

  1. R3: Reinforced Ranker-Reader for Open-Domain Question Answering (Wang et al., AAAI 2018)
  2. Ranking Paragraphs for Improving Answer Recall in Open-Domain Question Answering (Lee et al., EMNLP 2018)
  3. A BERT Baseline for the Natural Questions (Alberti et al., 2019)
  4. Passage Re-ranking with BERT (Nogueira & Cho, arXiv:1901.04085, 2019)
  5. Adaptive Document Retrieval for Deep Question Answering (Kratzwald & Feuerriegel, EMNLP 2018)
  6. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT (Khattab & Zaharia, SIGIR 2020)
  7. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering (Izacard & Grave, EACL 2021)
  8. Dense Passage Retrieval for Open-Domain Question Answering (Karpukhin et al., EMNLP 2020)


