Robustifying Multi-hop QA through Pseudo-Evidentiality Training

Published in SNU AIIS Blog · 10 min read · Mar 26, 2022

By Sue Hyun Park

Every day we search for answers to questions. Search engines like Google are the standard tool, and more recently virtual assistants such as AI speakers have come in handy. These systems implement question answering (QA) mechanisms that process natural language questions and construct answers by querying a collection of natural language documents.

The multi-hop question answering (QA) task is gaining importance because complex questions require connecting information from several texts. An answer is deduced only after capturing multiple relevant facts, each serving as a piece of evidence.

Recent multi-hop QA models are trained for answerability, that is, to predict the correct answer whenever the answer exists in the given texts. However, this focus on producing an answer raises a reasoning shortcut problem. Previous works point out that such models exploit disconnected reasoning: they selectively assess and combine pieces of information that are in fact far from real evidence. The predicted answer is reached through flawed reasoning, almost by “cheating”!

Let’s say we ask a QA model which country gained independence when World War II ended. Assume the model has no knowledge base and the passage being searched is a single sentence that contains the answer “Korea”. As shown below, even though the passage lacks any information about when WWII ended, the model simply figures out that the answer should be a country name and predicts that “Korea” in the passage is the right answer.

An example of a reasoning shortcut. If the model were truly following the reasoning process, it should have recognized that it is unable to answer.

A multi-hop QA model can guess the answer but fail to understand the underlying reasoning process.

To address this, we propose to supervise evidentiality by training the QA model to recognize whether its answer is supported by evidence. The model itself learns to trace the logical link between a given question and the right answer by discovering influential sentences. This post first explains the multi-hop task setting on which our QA model is tested. Then we introduce our novel method for generating training examples without human annotation and for increasing the robustness of our model.

Our Multi-Hop QA Task Description

We follow the distractor setting in HotpotQA, a dataset comprising 112k questions that require finding and reasoning over multiple supporting documents to answer.

Each question has a candidate set of 10 paragraphs:

  • 2 positive paragraphs P+, where the supporting facts for reasoning are scattered across the two paragraphs to prevent single-hop reasoning.
  • 8 negative paragraphs P- containing no evidence, i.e., the distractors.

The task is to aggregate relevant facts from the candidate set and predict a contiguous answer span. For evaluation, the predicted answer span is compared with the ground-truth answer span.
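To make the setting concrete, here is a minimal sketch of what one distractor-setting instance looks like. The field names follow the publicly released HotpotQA JSON format, but the contents are toy values invented for illustration.

```python
# A minimal sketch of one HotpotQA distractor-setting instance.
# Field names follow the released HotpotQA format; the contents are toy values.
example = {
    "question": "Which country gained independence when World War II ended?",
    "answer": "Korea",
    # 10 candidate paragraphs: 2 positive (evidence scattered across them) + 8 distractors.
    # Each entry is [title, list_of_sentences].
    "context": [
        ["Korea", ["Korea regained independence following World War II.", "..."]],
        ["World War II", ["World War II ended in 1945.", "..."]],
        # ... 8 distractor paragraphs containing no evidence ...
    ],
    # Gold evidence annotations: [paragraph_title, sentence_index] pairs.
    "supporting_facts": [["Korea", 0], ["World War II", 0]],
}
```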

Generating Examples for Supervision

We build four different types of passages to train our QA model for both answerability and evidentiality.

For predicting the correct answer,

  • answer-positive set 𝔸+ is a set of passages with both the answer and complete evidence.
  • answer-negative set 𝔸- is a set of passages with neither the answer nor evidence.

For detecting a reasoning chain assuming a correct answer exists,

  • evidence-positive set 𝔼+ is a set of passages “expected” to have all pieces of evidence that contribute to an explainable answer.
  • evidence-negative set 𝔼- is a set of passages with the answer but no evidence. If a model deduces the correct answer from this set, a reasoning shortcut has taken place.
Overview of our proposed supervision

Generating sets for answerability is simple. For the answer-positive set, we concatenate two positive paragraphs P+; for the answer-negative set, we concatenate the negatives P-.

However, constructing examples to supervise evidentiality without human-made annotation requires an additional setup. We first define the evidentiality label V_E, using the following notation:

  • V_E: a label of evidentiality, indicating whether the chain of evidence for answering A to question Q is sufficient in passage D.
  • (Q, A, D): a triple of question Q, answer A, and passage D.
  • E_*: the set of ground-truth evidence sentences needed to infer answer A.
  • S_*: the sentence containing answer A.
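With this notation, the four training sets can be roughly characterized as follows. This is our paraphrase of the conditions, not a verbatim reproduction of the paper's table.

```latex
% Rough characterization of the four generated sets (our paraphrase).
\[
\begin{aligned}
\mathbb{A}^{+} &:\ A \in D  && \text{(answer present; built from } P^{+}\text{)} \\
\mathbb{A}^{-} &:\ A \notin D && \text{(answer absent; built from } P^{-}\text{)} \\
\mathbb{E}^{+} &:\ A \in D \ \text{and}\ E_{*} \subseteq D  && (V_E = 1) \\
\mathbb{E}^{-} &:\ A \in D \ \text{and}\ E_{*} \not\subseteq D && (V_E = 0)
\end{aligned}
\]
```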

Generating Evidence-Negative Set

As mentioned above, instead of human annotations, we generate “pseudo-evidentiality” annotations to characterize each training set. First, for the evidence-negative set 𝔼-, we modify the answer sentence S_* and unanswerable passages from the negative paragraphs P- to generate examples with three types of contexts.

Types of evidence-negative examples
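As a rough illustration of how such a context might be assembled (a sketch under our own assumptions, not the paper's exact recipe), one option is to surround the answer sentence with distractor text, so that the answer is present but no evidence chain is:

```python
import random

def make_evidence_negative(answer_sentence, negative_paragraphs, num_distractors=8):
    """One illustrative way to build an evidence-negative context: the passage
    contains the answer sentence S_* but is otherwise filled with distractor
    sentences from the negative paragraphs P-, so no evidence chain exists.
    (A sketch under our own assumptions, not the paper's exact recipe.)"""
    distractors = [s for paragraph in negative_paragraphs for s in paragraph]
    random.shuffle(distractors)
    context = distractors[:num_distractors]
    # Place the answer sentence at a random position among the distractors.
    context.insert(random.randrange(len(context) + 1), answer_sentence)
    return " ".join(context)
```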

Generating “Pseudo” Evidence-Positive Set

Second, for the evidence-positive set 𝔼+, we let our trained model find evidential sentences by itself. Here is the idea: given a passage that contains the correct answer A, each sentence’s influence on predicting A is measured through the model’s answer confidence. For example, if the model reaches 70% answer confidence when a particular sentence is available, that sentence gives it a 70% chance (“confidence”) of producing the correct answer. A sentence that yields high answer confidence can therefore be interpreted as causally salient, i.e., evidential in our case.

However, in multi-hop QA the evidence consists of multiple sentences. Hence we observe confidence predictions over a group, i.e., an aggregated set of evidence candidates. Also, to capture causation, we use counterfactual changes in answer confidence with and without candidate evidence sentences. The sentence with the greatest confidence change is causally salient and is confirmed as part of the evidence set.

The process above is carried out by our proposed Interpreter module. Our QA model is first trained on the three types of examples generated above (𝔸+, 𝔸-, 𝔼-) and then moves on to interpreting salient sentences. First, we initialize the set E, which will store salient sentences, with the answer A. Next, the Interpreter performs iterative insertion and erasure of each sentence S_i in the passage. It calculates the answer confidence in the following two cases (formalized right after the list):

  • observational case: S_i is inserted into evidence set E
  • counterfactual case: S_i and sentences in set E are all removed from passage D.
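Written out, with Conf(A | X) denoting the model's confidence in answer A given context X (our paraphrase of the two cases above), the score used to rank sentences is the gap:

```latex
% Counterfactual saliency of a candidate sentence S_i given the current evidence set E.
\[
\Delta(S_i) \;=\;
\underbrace{\mathrm{Conf}\bigl(A \mid E \cup \{S_i\}\bigr)}_{\text{observational case}}
\;-\;
\underbrace{\mathrm{Conf}\bigl(A \mid D \setminus (E \cup \{S_i\})\bigr)}_{\text{counterfactual case}}
\]
```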

The Interpreter adds the sentence with the maximum answer-confidence change between the two cases to set E and repeats this cycle until 5 evidential sentences are found. The final sentences in E form a pseudo evidence-positive example, and these examples make up 𝔼+.
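A compact sketch of this greedy loop is below. `answer_confidence` is a hypothetical helper standing in for the trained QA model's confidence in the gold answer, and E starts from the sentence containing the answer, a simplification of "initializing E with A" above.

```python
def interpret_evidence(passage_sentences, answer_sentence, answer_confidence, k=5):
    """Greedy sketch of the Interpreter loop described above.
    `answer_confidence(sentences)` is a hypothetical wrapper around the trained
    QA model that returns its confidence in the gold answer given those sentences."""
    evidence = [answer_sentence]
    candidates = [s for s in passage_sentences if s != answer_sentence]

    while len(evidence) < k and candidates:
        best_sentence, best_delta = None, float("-inf")
        for s in candidates:
            # Observational case: S_i is inserted into the evidence set E.
            observational = answer_confidence(evidence + [s])
            # Counterfactual case: S_i and the sentences in E are removed from D.
            remaining = [t for t in passage_sentences
                         if t not in evidence and t != s]
            counterfactual = answer_confidence(remaining)
            delta = observational - counterfactual
            if delta > best_delta:
                best_sentence, best_delta = s, delta
        evidence.append(best_sentence)
        candidates.remove(best_sentence)

    return evidence  # sentences forming one pseudo evidence-positive example
```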

Again, note that we name the generated result “pseudo” because the machine’s interpretation may not be perfect and does not guarantee 100% recall of evidentiality.

Learning Answerability & Evidentiality

Our base QA model adopts an existing architecture built on RoBERTa and is supervised for both the answer span and answerability.

Since the base model is reported to take reasoning shortcuts, we additionally supervise evidentiality with two objectives for an unbiased model:

  • (O1): QA model should not be overconfident on passages with no evidence (i.e., on 𝔼-).
  • (O2): QA model should be confident on passages with both answer & evidence (i.e., on 𝔼+).

To pursue (O2), we train the base model on 𝔼+. But we must initially train without 𝔼+ to feed knowledge into the model, which only then becomes able to reliably extract 𝔼+ through the Interpreter module. Afterwards, we retrain the model with all generated examples. The figure below gives an overview of the complete training process.

Training QA model for evidentiality under the supervision of Interpreter

How do we realize (O1) in this process? Suppressing overconfidence calls for a regularization term. However, it has been reported that suppressing confidence on biased data has the side effect of lowering confidence on unbiased data as well. Similarly, in our case, keeping confidence low on 𝔼- undesirably drags down confidence on 𝔼+. Our solution: if the positive correlation between the two is the problem, decorrelating the two distributions on 𝔼- and 𝔼+ should fix it. We therefore deliberately train a biased model and decorrelate the target model from it. It is like teaching a child not to follow others’ biased opinions but to judge for themselves.

Our model contains two predictors that analyze a hidden state h for which the existence of evidence is unknown. Predictor f is trained to correctly learn the target distribution P. On the evidence-negative set 𝔼-, f fails to capture any evidence, so it concludes that the probability of finding evidence is uniform over positions. Predictor g, on the other hand, outputs a biased answer distribution P̂ on 𝔼-, because it is trained to be overconfident even when there is no evidence. We then regularize the distribution P by maximizing the Kullback-Leibler (KL) divergence from P̂, where this loss is optimized only on 𝔼-. In short, our unbiased model:

  • has a predictor trained not to emulate the biased answer span on set 𝔼- → (O1)
  • while maintaining the target answer span for set 𝔼+ → (O2)
Our QA predictor learns a decorrelated feature on biased examples.
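A minimal PyTorch-style sketch of this regularizer might look as follows. The tensor shapes, the detach on g's output, and the way the term is weighted are our assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def decorrelation_loss(logits_f, logits_g):
    """Sketch of the (O1) regularizer on evidence-negative examples: push the
    target predictor f's answer-span distribution P away from the biased
    predictor g's distribution P-hat by maximizing their KL divergence.
    logits_f, logits_g: [batch, seq_len] span logits from predictors f and g."""
    log_p = F.log_softmax(logits_f, dim=-1)              # log P, from predictor f
    p_hat = F.softmax(logits_g, dim=-1).detach()         # P-hat, from biased predictor g (no grad)
    kl = F.kl_div(log_p, p_hat, reduction="batchmean")   # KL(P-hat || P) in PyTorch's convention
    return -kl  # minimizing -KL maximizes the divergence between P and P-hat

# Applied only on the evidence-negative set, e.g. (weighting is an assumption):
# loss = qa_loss + lambda_reg * decorrelation_loss(logits_f_on_Eneg, logits_g_on_Eneg)
```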

Passage Selection at Inference Time

The goal of our multi-hop QA task is to find answerable passages that contain both the answer and evidence. While we can access ground-truth answerability in the training set, we need to identify the answerability of a (Q, D) pair at inference time. We take two directions to obtain answerable passages from which the model will predict an answer:

  1. Paragraph Pair Selection, specific to HotpotQA’s distractor setting: each question has a set of 10 paragraphs, so consider all possible pairs of paragraphs. Only one pair contains both positive paragraphs P+. Using the base model’s structure, we let the model select the pair with the highest estimated answerability (denoted paired-paragraph); see the sketch after this list. The answer predicted from the selected pair is likely to be evidential.
  2. Supervised Evidence Selector trained on pseudo-labels: We first follow a prior method to extract candidate sentences. Then we train a binary classifier that identifies whether each sentence is evidence-positive or evidence-negative. At inference time, we select the top 5 evidential sentences (denoted selected-evidences) and feed them into our QA model.
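For the first strategy, the selection step itself is simple to sketch; here `answerability_score` is a hypothetical wrapper returning the base model's answerability estimate for a question paired with a concatenated passage.

```python
from itertools import combinations

def select_paragraph_pair(question, paragraphs, answerability_score):
    """Sketch of Paragraph Pair Selection in the distractor setting: score every
    pair of the 10 candidate paragraphs (45 pairs) and keep the pair with the
    highest estimated answerability. `answerability_score(question, passage)`
    is a hypothetical wrapper around the base model's answerability output."""
    best_pair, best_score = None, float("-inf")
    for p1, p2 in combinations(paragraphs, 2):
        passage = " ".join(p1) + " " + " ".join(p2)  # concatenate the two paragraphs
        score = answerability_score(question, passage)
        if score > best_score:
            best_pair, best_score = (p1, p2), score
    return best_pair  # the "paired-paragraph" used to predict the answer
```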

Additionally, to show the robustness of our model, we construct a challenge test set by excluding examples on which a QA model can easily take shortcuts. To detect such “easy” examples, we build single-paragraph inputs, none of which contain the complete evidence under HotpotQA’s distractor setting. If a QA model predicts the correct answer from such a single paragraph, a reasoning shortcut has occurred, so we remove those examples from HotpotQA.
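A rough sketch of this filtering step, reusing the HotpotQA-style example structure from earlier; `predict_answer` and `f1_score` are placeholder helpers for the single-paragraph baseline model and the standard F1 metric.

```python
def build_challenge_set(examples, predict_answer, f1_score):
    """Sketch of the challenge-set filter: drop any example that a
    single-paragraph QA model can already answer (F1 > 0) from some single
    paragraph, since such a prediction must rely on a reasoning shortcut."""
    challenge_set = []
    for ex in examples:
        shortcut_possible = any(
            f1_score(predict_answer(ex["question"], " ".join(sentences)), ex["answer"]) > 0
            for _, sentences in ex["context"]  # each entry: [title, sentences]
        )
        if not shortcut_possible:
            challenge_set.append(ex)
    return challenge_set
```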

Experiments

We evaluate our method on three criteria: effectiveness on the multi-hop QA task, effectiveness of our Interpreter’s pseudo-evidentiality labels, and the ability to avoid reasoning shortcuts on general data. Our implementation of the QA model follows RoBERTa. We extract the evidence-positive set after training for 3 epochs and then retrain the model for 3 epochs.

Multi-hop QA Effectiveness

On both evaluation sets, our model outperforms the baseline. The SOTA model also gains accuracy when combined with our evidentiality training approach (C-II). Moreover, F1 scores improve most when tested on the selected-evidences produced by our method (O-III), which shows that our elimination of irrelevant sentences works well even without annotation.

Our ablation study further supports that training with pseudo-evidentiality labels and biased features increases QA performance.

The comparison of the proposed models on the original set and challenge set.
The ablation study on our full model.

Baseline model (B): single-paragraph QA model
Competitor (C): the state-of-the-art model that uses external knowledge of reasoning paths and a graph-based retriever

Evaluation set:
- Original set: HotpotQA
- Challenge set: the original set with instances where the baseline model predicts the answer without right reasoning (F1 > 0) removed

Effectiveness of Pseudo-Evidentiality Annotation by Interpreter

The Interpreter outperforms the baseline in terms of F1 and recall, which is significant because identifying all pieces of evidence is critical for multi-hop reasoning and for determining answerability. However, the model without the loss term R̂, i.e., without the biased layer g (variant (b)), selects more accurate pieces of evidence. Training layer g on biased features thus involves a trade-off: it slightly degrades the Interpreter’s evidence-selection performance while improving overall QA performance.

The comparison of the proposed models for evidence selection.

Baseline model: retrieval-based AIR, which performs unsupervised evidence selection like our Interpreter.
Accumulative-based interpreter on our QA model: our QA model, but with the Interpreter replaced by an existing approach that only performs insertion (not erasure) when generating pseudo evidence-positive sets.

Generalization: Avoid Reasoning Shortcuts even in Unseen Data

We observe the confidence distributions of the models on the evidence-positive and evidence-negative sets, with confidence scores sorted in ascending order. The colored area in the figure indicates how dominant the confidence distribution is. Revisiting our objectives:

  • (O1): The area on the evidence-negative set 𝔼- should be small (lower confidence dominant)
  • (O2): The area on the evidence-positive set 𝔼+ should be large (higher confidence dominant)

Our full model (c) greatly enlarges the area on 𝔼+ while keeping the area on 𝔼- significantly smaller than the baseline’s. Compared to (b), we again spot the side effect of using the biased layer g: the area on 𝔼- is slightly bigger. Still, the gain on 𝔼+ is prominent, and overall performance is best with the full implementation of our proposed methods.

Confidence Analysis: confidence scores of three models in ascending order, on 𝔼+ (light color) and 𝔼- (dark color). (a) The base model trained on single paragraphs. (b) Our model with R. (c) Our full model with R̂. (d) Comparison of the three models on 𝔼+.

Conclusion

Today’s AI works marvels on predetermined datasets, but it falls short of fully capturing the intricacies of the real world. In QA especially, service-level issues such as implausible answers and dataset bias persist. These are partly by-products of disconnected reasoning.

Multi-hop QA models can learn evidentiality as well as answerability to avoid taking these reasoning shortcuts. For supervising evidentiality, instead of relying on rather expensive human annotations, we implement a bidirectional method for the QA model to discover evidence on its own. Through regularization, we also ensure that the model is not overconfident when evidence is insufficient. Our approach proves to be both effective and generalizable.

Thorough reasoning and inference are integral to Explainable AI. We continue our mission to make question answering more intelligent and explainable.

Acknowledgment

This blog post is based on the following paper:

  • Kyungjae Lee, Seung-won Hwang, Sang-eun Han, Dohyeon Lee, Robustifying Multi-hop QA through Pseudo-Evidentiality Training, ACL 2021. (arXiv)

We would like to thank Kyungjae Lee for providing valuable insights to this blog post.

This post was originally published on our Notion blog on July 12, 2021.


AIIS is an intercollegiate institution of Seoul National University, committed to integrating and supporting AI-related research at Seoul National University.