Promptriever: The First Zero-Shot Promptable Instruction-Trained Retriever Model

SACHIN KUMAR
6 min read · Sep 19, 2024


In this paper[1], the authors present Promptriever, the first retrieval model that can be prompted with natural-language prompts like a language model (LM).

Key contributions:

  • State-of-the-art bi-encoder performance on instruction-following retrieval tasks (+14.3 p-MRR, +3.1 nDCG/MAP on FollowIR), with performance comparable to SoTA cross-encoders.
  • Improved robustness to query length/phrasing compared to RepLLaMA with a 44% decrease in variance on BEIR across instructions and a +12.9 improvement on InstructIR’s Robustness@10 metric.
  • Reliable zero-shot improvements in retrieval performance solely by prompting (e.g., adding “Think carefully when assigning relevance and I will give you a tip”), enabling prompt engineering and auto-prompting methods.

The figure below illustrates the capabilities of retrieval models:

  • Standard retrieval models find semantic similarity to the input query, typically matching using query keywords and phrases.
  • Current instructable retrievers prepend a dataset prefix that generically describes the task and is also used in training.
  • The newly proposed promptable retrievers can handle complex instructions, including detailed relevance definitions, as well as zero-shot prompting techniques that act as a form of zero-shot hyperparameter optimization, similar to prompting LMs (a minimal sketch of query-time prompting follows this list)
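To make the query-time prompting idea concrete, here is a minimal sketch of how a promptable bi-encoder can be used: the natural-language prompt is simply appended to the query before encoding, and passages are ranked by embedding similarity. The model id, the plain concatenation format, and the example texts are illustrative assumptions, not the paper's released checkpoint or exact input template.

```python
# Minimal sketch: prompting a bi-encoder retriever at query time.
# Assumptions: a SentenceTransformer-compatible checkpoint (model id below is a
# placeholder) and plain "query + prompt" concatenation; the released
# Promptriever checkpoints may use a different input template.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("your-org/your-promptable-retriever")  # hypothetical id

query = "what causes rainbow colors"
prompt = ("A relevant document explains the physics of light refraction. "
          "Think carefully when assigning relevance and I will give you a tip.")

passages = [
    "Rainbows form when sunlight is refracted and dispersed in water droplets.",
    "The pot of gold at the end of the rainbow is a popular legend.",
]

q_emb = model.encode(query + " " + prompt)   # prompt is appended to the query only
p_embs = model.encode(passages)

scores = util.cos_sim(q_emb, p_embs)[0]
for passage, score in sorted(zip(passages, scores), key=lambda x: -float(x[1])):
    print(f"{float(score):.3f}  {passage}")
```

Because the prompt only changes the query-side text, the passage index never needs to be rebuilt, which is what makes this kind of zero-shot prompt engineering cheap for a bi-encoder.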

Promptriever Data Generation

  • started with the MS MARCO dataset
  • used the tevatron-msmarco-aug version which includes hard-negatives and was used to train RepLLaMA
  • This training set includes roughly 491k queries and provides a positive passage and 30 hard negatives for each.
  • Further augmented this set with instructions in a two-part process: (1) instruction generation from the initial queries and (2) instruction-negative mining. The process is outlined in the figure below, and a sketch of what a resulting training record looks like follows this list
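The released augmented data is linked at the end of this post; below is a minimal sketch of inspecting it with the Hugging Face datasets library, assuming a Tevatron-style schema (query, instruction, positive passages, hard negatives). The split and column names are guesses, so check the dataset card for the actual layout.

```python
# Minimal sketch: peeking at the instruction-augmented MS MARCO data.
# Assumptions: the split name and column names are illustrative; consult the
# dataset card of samaya-ai/msmarco-w-instructions for the real schema.
from datasets import load_dataset

ds = load_dataset("samaya-ai/msmarco-w-instructions", split="train", streaming=True)
example = next(iter(ds))
print(example.keys())

# Conceptually, one training instance carries:
# {
#   "query": "...",              # original MS MARCO query
#   "instruction": "...",        # Llama-3-generated instruction (varying length/style)
#   "positive_passages": [...],  # query-positive AND instruction-positive
#   "negative_passages": [...],  # hard negatives, incl. instruction negatives
# }
```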

i) Instruction Generation

  • Llama 3 was used to generate instructions with:

(1) varying length formats (from short one-sentence instructions up to two-paragraph instructions)

(2) differing “styles”, which can either be a persona of the person giving the query, negation of some aspect, or generic background information

a) Maintaining original MS MARCO positive relevance

  • provided both the query and the positive passage to the LM when generating instructions, and requested that the more specific instruction keep the passage relevant
  • used FollowIR-7B, a cross-encoder capable of making nuanced relevance judgments on (query, instruction, passage) instances, to validate this
  • FollowIR-7B marked roughly 15% of the generated instructions as making the original positive passage no longer relevant; in those cases, the original positive passage was substituted with one generated in the next stage (a rough sketch of the instruction-generation step follows this list)
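As a rough illustration of the generation step, here is a sketch of how instructions of varying length and style could be produced with an instruction-tuned Llama 3 chat model. The prompt template, the length/style buckets, and the decoding settings are my own illustrative choices, not the paper's actual prompts.

```python
# Minimal sketch of instruction generation (step 1). The prompt wording,
# length/style buckets, and sampling settings are assumptions for illustration.
import random
from transformers import pipeline

generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

LENGTHS = ["one short sentence", "a short paragraph", "two detailed paragraphs"]
STYLES = ["a persona of the person asking the query", "a negation of some aspect",
          "generic background information"]

def generate_instruction(query: str, positive_passage: str) -> str:
    length, style = random.choice(LENGTHS), random.choice(STYLES)
    messages = [{
        "role": "user",
        "content": (
            f"Query: {query}\n"
            f"Passage that must stay relevant: {positive_passage}\n\n"
            f"Write a retrieval instruction of {length}, written as {style}, "
            f"that narrows what counts as relevant while keeping the passage above relevant."
        ),
    }]
    # Recent transformers versions accept chat-style messages directly.
    out = generator(messages, max_new_tokens=256, do_sample=True)
    return out[0]["generated_text"][-1]["content"]
```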

ii) Instruction Negative Mining

  • developed a complementary data augmentation, termed instruction negatives, that encourages models to pay attention to the instruction: the passage is query-positive but instruction-negative, i.e., adding the instruction decreases the passage’s relevance
  • To achieve low training loss, the model must learn to condition on both query and instruction
  • used gpt-4o-2024-05-13 to generate the instruction-negative passages, generating one query-positive/instruction-positive passage and three query-positive/instruction-negative passages per (query, instruction) pair
  • over-generate candidates and then filter them post-hoc because initial testing revealed that (on average) only two out of three generated passages were correctly query-positive/instruction-negative
  • again used FollowIR-7B as the filter by: 1) checking that the generated instruction negatives are actually instruction-negative (discarding them if not) and 2) checking that the generated instruction-positive is actually relevant (discarding it if not); a sketch of this filter follows this list
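The over-generate-then-filter step reduces to a relevance check per (query, instruction, passage) triple. Below is a hedged sketch of that filter with the judge abstracted into a callable; FollowIR-7B's actual prompt format and decoding are not reproduced here.

```python
# Minimal sketch of the post-hoc filter in step 2. The `judge` callable stands
# in for FollowIR-7B; its real prompt template and decoding are not shown.
from typing import Callable, List, Optional, Tuple

Judge = Callable[[str, str, str], bool]  # (query, instruction, passage) -> relevant?

def filter_candidates(query: str,
                      instruction: str,
                      inst_positive: str,
                      inst_negatives: List[str],
                      judge: Judge) -> Tuple[Optional[str], List[str]]:
    # Check 1: a generated "instruction negative" must really become irrelevant
    # once the instruction is taken into account; discard it otherwise.
    kept_negatives = [p for p in inst_negatives if not judge(query, instruction, p)]
    # Check 2: the generated instruction-positive must actually be relevant
    # under the instruction; discard it otherwise.
    kept_positive = inst_positive if judge(query, instruction, inst_positive) else None
    return kept_positive, kept_negatives
```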

a) Filtering validation

  • tasked 4 human annotators with the filtration task: for a given (query, instruction, generated passage) triplet, is the passage relevant or not to the query+instruction
  • For this task, average human-human agreement was 75% (N=32), whereas the average human-model agreement was 84% (N=64), confirming that FollowIR-7B acts as a sufficiently high-quality filter

Promptriever Training

  • Promptriever was trained following the RepLLaMA recipe, on the MS MARCO data as well as the new instruction data generated by Llama 3 and GPT-4o
  • used the same learning rate and other hyperparameter details as the original RepLLaMA for a fair comparison
  • used all valid instruction negatives in training and sampled the remainder of the hard negatives from the dataset used to train RepLLaMA (a sketch of the training objective follows this list)
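Training itself follows the standard bi-encoder contrastive recipe: a softmax (InfoNCE) loss over one positive and many negatives per query, with the instruction concatenated to the query text. Below is a minimal sketch of that objective; the encoder details (LLaMA backbone, EOS-token pooling, LoRA) follow RepLLaMA and are abstracted away, and the temperature value is an assumption.

```python
# Minimal sketch of the contrastive (InfoNCE) objective with instruction
# negatives mixed into the hard negatives. Encoder details are abstracted away;
# the temperature is illustrative, not the paper's exact hyperparameter.
import torch
import torch.nn.functional as F

def contrastive_loss(q_emb: torch.Tensor,     # [B, d]    encoded "query + instruction"
                     pos_emb: torch.Tensor,   # [B, d]    positive passages
                     neg_emb: torch.Tensor,   # [B, K, d] hard + instruction negatives
                     temperature: float = 0.01) -> torch.Tensor:
    pos_scores = (q_emb * pos_emb).sum(-1, keepdim=True)     # [B, 1]
    neg_scores = torch.einsum("bd,bkd->bk", q_emb, neg_emb)  # [B, K]
    logits = torch.cat([pos_scores, neg_scores], dim=-1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # the positive is always index 0
```

The instruction negatives enter simply as extra hard negatives for their (query, instruction) pair, which is what forces the model to condition on the instruction rather than the query alone.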

Evaluation Results

i) Instruction Following

  • Table below presents the results for the FollowIR and InstructIR datasets
  • As per the results, Promptriever is the highest-performing dense retriever, improving over RepLLaMA by +14.3 p-MRR (-3.1 → +11.2) and +3.1 in nDCG/MAP
  • While cross-encoders (as expected) perform best due to their significant compute advantage, Promptriever achieves comparable scores as a much more efficient bi-encoder model
  • the model’s strong performance versus the RepLLaMA baseline illustrates that the instruction data is highly effective for dense retrievers

ii) Standard Retrieval

  • benchmarked Promptriever on two standard retrieval tasks without instructions: both in-domain (MS MARCO) and out-of-domain (BEIR).
  • Table below shows MS MARCO (in-domain) performance
  • Promptriever performs comparably to RepLLaMA on in-domain tasks despite additionally having stronger instruction following performance

iii) Retrieval with Prompts

  • a common approach to improving LMs on out-of-domain data is to include a textual prompt at test time, even if the prompt is somewhat generic, e.g., “think step by step” or “I’ll give you a tip.”
  • applied this approach to IR by exploring whether particular prompts reliably induce improved retrieval performance in Promptriever (a sketch of such a prompt sweep follows this list)
  • Table below shows Out of domain performance on BEIR (nDCG@10).
  • Promptriever, using the best prompt, brings significant gains in BEIR average performance (+1.4 nDCG@10; gains versus no prompt on 12/13 datasets and a tie on the last)
  • prompts fail to bring any gains to the RepLLaMA or BM25 models with -0.1 and -5.0 nDCG deltas respectively
  • Thus, it can be concluded that prompting is effective for Promptriever but not for retrieval models using standard training
  • Table below examines the sensitivity of all models to the prompts
  • Promptriever’s variance across prompts is significantly lower than that of RepLLaMA (by 44%) and BM25 (by 77%), the latter of which swings widely due to keyword matching
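In practice this kind of prompt study is a small sweep: append each candidate prompt to every query, re-run retrieval, and compare mean nDCG@10 against the no-prompt baseline. Below is a hedged sketch, assuming a placeholder search(query, k) function and graded qrels are already available from your own retrieval pipeline.

```python
# Minimal sketch of a zero-shot prompt sweep for retrieval.
# Assumptions: `search(query, k)` returns ranked doc ids and `qrels[qid][doc_id]`
# holds graded relevance labels; both are placeholders for your own pipeline.
import math
import statistics

PROMPTS = [
    "",  # no-prompt baseline
    "Think carefully when assigning relevance and I will give you a tip.",
    "Only passages that fully answer the query are relevant.",
]

def ndcg_at_10(ranked_ids, rels):
    dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:10]))
    ideal = sorted(rels.values(), reverse=True)[:10]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def sweep(queries, qrels, search):
    results = {}
    for prompt in PROMPTS:
        scores = [ndcg_at_10(search(f"{q} {prompt}".strip(), k=10), qrels[qid])
                  for qid, q in queries.items()]
        results[prompt] = statistics.mean(scores)
    return results  # compare each prompt's mean nDCG@10 against the "" baseline
```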

Ablation Study of Model

Table below shows Ablations for instruction following on the FollowIR and InstructIR datasets

i) Factors helping model performance

a) training with instructions

  • results of training with just the instructions and no instruction negatives (w/ Instructions) show a strong gain in p-MRR (+6.6) and a further gain in standard retrieval (+1) over the Swap ablation

b) training with instruction-negatives

  • Adding the instruction negatives on top of w/Instructions gives another large gain in p-MRR (+3.1 over w/Instructions) and a small boost in standard retrieval scores (+0.6 nDCG/MAP)

c) training additionally on MS MARCO beyond the Promptriever training set

  • Promptriever (Joint), the final model, combines all the MS MARCO and instruction data, which leads to another large jump in p-MRR (+2.4) as it sees more data (and instructions) in training, i.e., 2x as much

ii) Generality of the method when used with different LM backbones

  • Table below shows Comparison of different backbone models on the same Promptriever recipe across MS MARCO datasets (DL19, DL20, and Dev), BEIR, InstructIR, and FollowIR
  • the original RepLLaMA used Llama 2 as a backbone, and, up to this point, Promptriever has also used Llama 2 as a backbone for a fair comparison
  • results show that other backbones provide comparable performance, indicating the generality of the method

Conclusion

  • presented the first zero-shot promptable retriever, Promptriever, trained on a new instruction-based retrieval dataset built from MS MARCO
  • Experiments show that Promptriever not only performs well on the standard retrieval task, but also follows instructions more effectively than prior work, adapting its notion of relevance per query.

Paper: https://arxiv.org/abs/2409.11136

Code: https://github.com/orionw/promptriever

Augmented Dataset used: https://huggingface.co/datasets/samaya-ai/msmarco-w-instructions

References:

  1. Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models by Weller et al. arXiv:2409.11136
