E14 : Dense X Retrieval

Praveen Thenraj
Research Papers Summarized

--

Breaking a retrieval corpus into smaller pieces that are minimal, self-contained, contextualised, and contain distinct facts improves the quality of retrieval and downstream QA tasks

Paper Name : Dense X Retrieval: What Retrieval Granularity Should We Use?

Paper URL : https://arxiv.org/pdf/2312.06648.pdf

Authors : Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, Dong Yu


Problem Statement :

  • The default retrieval unit in the information retrieval process is mostly the passage and sometimes the sentence.
  • Passages usually contain more information than is needed for a specific question, so they suffer from a lack of clarity and from indirect references to the entities contained in the question.
  • Using sentences as the retrieval unit faces similar problems to passage-level retrieval and, in addition, suffers from a lack of context.

Solution :

  • Breaking a retrieval corpus into small, distinct facts: self-contained, contextualised pieces of text termed propositions.
  • Using the proposition, a finer-grained retrieval unit, increases the density of useful information each retrieved unit contains.
  • Denser useful information in turn yields better retrieval results (an illustrative example follows below).
Figure: A paragraph is broken into propositions, which are then retrieved by a retriever for a given question. The retrieved propositions are used by the QA model to extract the final response.
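
As an illustration of the idea (the passage and propositions below are our own example, not taken from the paper), note how each proposition is minimal and restates the entities it depends on, so it can stand alone at retrieval time:

```python
# Illustrative example (ours, not from the paper) of decomposing a
# passage into propositions. Co-references are resolved: each
# proposition names "Neil Armstrong" and "Apollo 11" explicitly
# instead of relying on pronouns like "he" or "the mission".
passage = (
    "Neil Armstrong commanded Apollo 11. He became the first person "
    "to walk on the Moon when the mission landed in July 1969."
)

propositions = [
    "Neil Armstrong commanded Apollo 11.",
    "Neil Armstrong was the first person to walk on the Moon.",
    "Apollo 11 landed on the Moon in July 1969.",
]
```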

Approach :

  • A Propositionizer model is used to decompose passages into propositions. The model is trained through a distillation process.
  • The teacher model is prompted to decompose passages into propositions with a 1-shot demonstration.
Figure: Prompt to decompose a paragraph into propositions
  • The prompt should instruct the model to split long passages into simple, short sentences that each contain a meaningful fact, and to resolve entity co-references within the individual propositions (a sketch of such a prompt appears after this list).
  • The teacher model is used to generate a seed training set of 42k passages and their corresponding propositions.
  • The generated seed data is used to train the student Propositionizer model.
  • The trained Propositionizer model is used to decompose the English Wikipedia dataset (6 million pages) into 41 million passages (of 100 words each), 114 million sentences, and 257 million propositions.
  • On average, each passage contains roughly 6 propositions and each sentence roughly 2 propositions.
  • Supervised and unsupervised dense retrievers are used to validate the retrieval process with passages, sentences, and propositions as the retrieval unit on 5 QA datasets.
  • The retrieved passages/sentences/propositions are then fed, along with the query, to a reader model that extracts the answer to the query.
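
A hedged sketch of what the 1-shot decomposition prompt could look like; the instruction wording and the demonstration pair below are our assumptions, not the paper's exact GPT-4 prompt:

```python
# Sketch of a 1-shot prompt for decomposing passages into propositions.
# The instruction text and the demonstration are illustrative assumptions;
# the paper's actual prompt may differ.
DECOMPOSE_PROMPT = """\
Decompose the passage into simple, self-contained propositions.
1. Express each distinct fact as a short sentence.
2. Resolve co-references: replace pronouns with the entities they refer to.
3. Return the propositions as a JSON list of strings.

Passage:
Marie Curie won the Nobel Prize in Physics in 1903. She later won a
second Nobel Prize, in Chemistry, in 1911.

Propositions:
["Marie Curie won the Nobel Prize in Physics in 1903.",
 "Marie Curie won the Nobel Prize in Chemistry in 1911."]

Passage:
{passage}

Propositions:
"""

def build_decompose_prompt(passage: str) -> str:
    # Fill the template with the passage to be decomposed.
    return DECOMPOSE_PROMPT.format(passage=passage)
```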

Experimental Setup :

  • Teacher model - GPT-4
  • Propositionizer model (student model) - Flan-T5-Large (780M)
  • Reader model - UnifiedQA-v2 (T5-large size)
  • Retrieval Corpus - English Wikipedia dataset
  • Open Domain QA Datasets — Natural Questions (NQ), TriviaQA (TQ), Web Questions (WebQ), SQuAD, Entity Questions (EQ)
  • Retrievers :
    Unsupervised retrievers - SimCSE, Contriever
    Supervised retrievers - DPR, ANCE (fine-tuned on NQ, TQ, WebQ, SQuAD), TAS-B, GTR
  • The approach was evaluated on two tasks — passage (information) retrieval and downstream open domain QA
  • The passage retrieval task was evaluated at different retrieval granularities — passage, sentence and proposition level. The metrics used to measure performance were Recall@5 and Recall@20 (see the evaluation sketch after this list)
  • The downstream open domain QA task was evaluated using the different retrieval units (passage/sentence/proposition) as context along with the query from the open domain QA datasets. The metric used to measure performance was Exact Match (EM) at l = 100 and l = 500
  • ‘l’ here denotes the first 100 or 500 words of the retrieved passages, sentences or propositions that are fed to the reader model to extract the answer to a question.
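
A minimal sketch of the two evaluation settings, assuming the retrieved units come back ranked and the gold answers are given as strings; the function names and the SQuAD-style answer normalisation are our own:

```python
# Illustrative sketch of Recall@k and EM@l; details are assumptions.
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, squeeze whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def recall_at_k(ranked_units: list[str], answers: list[str], k: int) -> float:
    """Answer recall: 1.0 if any of the top-k units contains a gold answer."""
    top_k_text = normalize(" ".join(ranked_units[:k]))
    return float(any(normalize(a) in top_k_text for a in answers))

def reader_context(ranked_units: list[str], l: int) -> str:
    """Concatenate ranked units, keep only the first l words (l = 100 or 500)."""
    words = " ".join(ranked_units).split()
    return " ".join(words[:l])

def exact_match(prediction: str, answers: list[str]) -> float:
    """EM: 1.0 if the reader's prediction matches a gold answer exactly."""
    return float(any(normalize(prediction) == normalize(a) for a in answers))
```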

Observations :

Passage Retrieval Performance

  • On average, proposition-level retrieval consistently outperformed passage- and sentence-level retrieval across all five QA datasets when using both unsupervised and supervised retrievers.
  • The unsupervised retrievers SimCSE and Contriever exhibit average improvements in Recall@5 of 35% and 22.5% respectively across the five QA datasets.
  • The supervised retrievers DPR, ANCE, TAS-B and GTR exhibit average improvements in Recall@5 of 4.5%, 3.2%, 2.4% and 4.2% respectively across the five QA datasets.
  • The improvement from using propositions as the retrieval unit was smaller for supervised retrievers than for unsupervised ones, because the supervised retrievers had already been fine-tuned on 4 of the 5 QA datasets (NQ, TriviaQA, WebQ, SQuAD) as part of their training.
Figure: Retrieval task performance using unsupervised and supervised retrievers

Passage Retrieval - Cross Task Generalization

  • An evaluation was done to understand how well proposition-level granularity generalises across tasks. For each question in the Entity Questions dataset, the target entity was identified and used to fetch the top-1000 passages via BM25 search; the number of occurrences of that entity in the top-1000 passages was used to estimate the entity's popularity (a sketch of this estimate follows the figure below).
  • The results show that proposition-level retrieval granularity improved retrieval scores even for less common (long-tail) entities.
Figure: Document retrieval performance vs. entity popularity
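
Under those assumptions, the popularity estimate could be sketched as follows, using the rank_bm25 package; the tokenisation and counting details are our own:

```python
# Sketch of the entity-popularity estimate; details are illustrative.
from rank_bm25 import BM25Okapi

def entity_popularity(entity: str, corpus: list[str]) -> int:
    """Fetch the top-1000 passages for the entity with BM25, then count
    how many times the entity string occurs across them."""
    tokenized_corpus = [p.lower().split() for p in corpus]
    bm25 = BM25Okapi(tokenized_corpus)
    top_passages = bm25.get_top_n(entity.lower().split(), corpus, n=1000)
    return sum(p.lower().count(entity.lower()) for p in top_passages)
```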

Downstream QA Performance

  • On average, using proposition-level granularity as context with l = 100 or 500 words to the reader model outperformed passage- and sentence-level granularity at the same ‘l’ setting. The Exact Match metric (EM@100 and EM@500) was used to measure performance.
  • The unsupervised retrievers SimCSE and Contriever, when using proposition-level granularity, exhibit average improvements of 50% and 55% in EM@100 compared to passage-level granularity.
  • The supervised retrievers DPR, ANCE, TAS-B and GTR, when using proposition-level granularity, show average improvements of 26%, 19%, 22% and 26% in EM@100 compared to passage-level granularity.
Figure: Downstream QA performance using unsupervised and supervised retrievers

Downstream QA - Propositions Retain Rich Information

  • The results show that proposition-level granularity, with the reader input budget set to 100–200 words, yields larger performance improvements in downstream QA over passages and sentences than when the budget exceeds 400 words.
  • This is hypothesised to be because propositions are minimal, self-contained and contextualised, and contain concise facts.
  • The performance of all retrieval granularities converges as the word budget increases, because with more words included all the relevant information is eventually covered regardless of the unit.
  • A budget of 100–200 words corresponds to roughly 10 propositions, which carry richer question-relevant information than the same budget spent on about 5 sentences or 2 passages (see the arithmetic sketch below).
  • Hence proposition-level granularity helps in downstream QA applications where compute and budget constraints make long passages impractical as reader-model contexts.
Figure: Comparison of recall scores across different word budgets for downstream QA tasks
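
The arithmetic behind that comparison follows from the corpus statistics in the Approach section; the per-unit word lengths below are approximations derived from those counts:

```python
# Back-of-the-envelope arithmetic from the corpus statistics above
# (41M passages of ~100 words, 114M sentences, 257M propositions).
words_per_passage = 100
words_per_sentence = 100 * 41e6 / 114e6      # ~36 words per sentence
words_per_proposition = 100 * 41e6 / 257e6   # ~16 words per proposition

budget = 200  # word budget fed to the reader model
print(budget / words_per_passage)      # ~2 passages
print(budget / words_per_sentence)     # ~5.6 sentences
print(budget / words_per_proposition)  # ~12.5 propositions (roughly 10 above)
```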

Limitations :

  • In spite of propositions containing precise contextual information, they lack the ability to handle questions that require multi-hop reasoning.

Conclusion :

  • At a time when RAG has become the go-to solution for most LLM applications, this approach highlights the need to examine the granularity of the retrieval units, which in turn affects the quality of the retrieved results.
  • While the retrieval unit most of the time remains the passage, using propositions as the retrieval unit can be a good starting point, given that they can help mitigate the input token limits and context stuffing of LLMs in a RAG system.
  • Given their concise, self-contained and contextualised nature, propositions can help retrieve more precise and relevant answers even for long-tail entity questions, which aids generalisation to a wide variety of tasks.
