Paper Review: Are LLMs Good at Search?

David Ken
Thomson Reuters Labs
12 min read · Jul 11, 2024
The sliding window approach in this paper reminded me of sorting algorithms in Computer Science (image of insertion sort algorithm)

This is the first article in a series of paper reviews stemming from our internal Labs Reading Groups. In this issue, we will discuss the paper “Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents” by Weiwei Sun et al., recipient of an EMNLP 2023 Outstanding Paper Award.[1] [Link to Arxiv] [Link to Code]

tldr: Using a method called permutation generation, the authors show that LLMs excel at re-ranking search results, beating other fine-tuned models even in zero-shot settings. To overcome the inference and token cost of LLMs, they propose a knowledge distillation method called permutation distillation to train smaller, more efficient models, which exhibit performance on par with existing fine-tuned models after training on a subset of the training data.

Background and Central Questions

Most successful methods in information retrieval (IR) rely on a two-stage process (a minimal sketch of this pipeline follows):

  • Retrieval: A search engine retrieves a set of candidate passages based on a query. This can be dense (embedding-based) or sparse (keyword-based), or some combination of the two.
  • Re-ranking: A re-ranking model scores and re-orders the candidate passages to present the most relevant ones to the user.
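Here is that sketch, assuming the open-source rank_bm25 package for stage one and a hypothetical rerank(query, candidates) function standing in for whichever model is used in stage two:

```python
# Stage 1 + Stage 2 in miniature. `rerank` is a hypothetical placeholder for
# any re-ranking model (cross-encoder, LLM, etc.); rank_bm25 is a real package.
from rank_bm25 import BM25Okapi

corpus = [
    "BM25 is a sparse, keyword-based retrieval function.",
    "Dense retrieval encodes queries and passages as embeddings.",
    "Re-ranking models re-order a small set of candidate passages.",
]
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

query = "how does re-ranking work"
# Stage 1: retrieve a candidate set (cheap, recall-oriented).
candidates = bm25.get_top_n(query.lower().split(), corpus, n=3)
# Stage 2: re-rank the candidates (expensive, precision-oriented).
results = rerank(query, candidates)  # hypothetical re-ranking model
```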

This paper focuses on re-ranking, which traditionally requires fine-tuning models on large human-annotated datasets. The downsides are the significant human effort needed to label data and the weak generalization that results in domains with limited labeled data. Because of this, there is growing interest in leveraging the zero-shot reasoning capabilities of large language models (LLMs) for re-ranking tasks.

LLMs have been extensively used in the Retrieval Augmented Generation (RAG) framework, where they are used in the content generation step to make sense of the retrieved passages and generate coherent responses/summaries. However, their effectiveness in the “retrieval” portion, i.e. as re-ranking agents, has not been thoroughly investigated because it is challenging to apply LLMs in this context due to major differences between the ranking objective and LLM pre-training objectives. Additionally, the inference and token cost of LLMs can be prohibitive in real-time search scenarios.

The authors explore these challenges with two questions:

  • How effective are LLMs at re-ranking search results compared to fine-tuned models?
  • Can we distill the knowledge of LLMs into smaller models for efficient inference?

Results at a glance

Figure 1: Benchmark results — average nDCG @ 10

Figure 1 shows a comparison between BM25, monoT5-3B, ChatGPT, and GPT-4 on four benchmark datasets:

  • TREC (DL19 and DL20)[2]
  • BEIR (8 tasks: Covid, NFCorpus, Touche, DBPedia, SciFact, Signal, News, and Robust04)[3]
  • Mr. TyDi (a multilingual retrieval benchmark built on TyDi QA; the first 100 samples of the test set for each language)[4]
  • NovelEval (a new, hand-annotated dataset of 21 questions introduced in this paper)[1]

These results were obtained using the permutation generation method (discussed below) and show that GPT-4 outperforms monoT5–3B across all datasets. ChatGPT, a smaller model than GPT-4, also performs well in a zero-shot setting, indicating that LLMs can be effective re-ranking agents.
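All of the results in this review are reported as nDCG@k (normalized discounted cumulative gain at cutoff k), which rewards placing highly relevant passages near the top of the ranking. For reference, here is a minimal sketch of the metric; the relevance labels are purely illustrative:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (0-based ranks)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded relevance labels of passages in the order the re-ranker returned them.
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=10))
```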

Figure 2: Student model nDCG@10 vs # of parameters and # of training samples

Figure 2 shows the performance of variations of the distilled student model compared to fine-tuned models. The student models (green line) perform on par with or even better than fine-tuned models after training on a subset of the training data, demonstrating the effectiveness of the proposed knowledge distillation method (discussed below). The same model (DeBERTa-Large) fine-tuned directly on the MS MARCO dataset (grey line) performs significantly worse than the permutation-distilled student model. Interestingly, increasing the number of parameters helps more than increasing the number of training samples, and performance saturates after just a few thousand queries (although this still amounts to a lot of pairwise examples, as discussed in the training objectives section).

Prompting an LLM for re-ranking

Figure 3: Prompting an LLM for re-ranking

Figure 3 shows the three methods for prompting an LLM to perform re-ranking. The first two, query generation (a) and relevance generation (b), work by asking the LLM to:

  • a) write a relevant query given a passage[5]
  • b) assess whether a given query-passage pair is relevant[6]

The drawback to these approaches is that they’re not directly applicable to ranking; whether a given query-passage pair is relevant does not help with ranking the passage against other passages. To obtain rankings, these approaches would rely on log-probability scores from the model in order to assess the relevance of each passage to the query by using the model confidence scores as proxy ranking scores. Unfortunately, since many LLMs are hidden behind APIs, it is not possible to access these scores directly.
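To illustrate, here is a hedged sketch of how these pointwise approaches turn log-probabilities into ranking scores. The sequence_logprob(prompt, continuation) helper is hypothetical; it stands in for an API that returns the model's log-probability of the continuation given the prompt, which is exactly what many hosted LLMs do not expose:

```python
def query_generation_score(query, passage):
    # a) Query generation: score a passage by how likely the model is to
    #    generate the user's query from it.
    prompt = f"Passage: {passage}\nPlease write a question based on this passage.\nQuestion:"
    return sequence_logprob(prompt, query)  # hypothetical log-probability helper

def relevance_generation_score(query, passage):
    # b) Relevance generation: score a passage by the model's confidence in
    #    answering "Yes" to a relevance question about the pair.
    prompt = (f"Passage: {passage}\nQuery: {query}\n"
              "Does the passage answer the query? Answer Yes or No.\nAnswer:")
    return sequence_logprob(prompt, " Yes")  # hypothetical log-probability helper

# Candidates are then sorted by these proxy scores to produce a ranking.
```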

To solve these issues, the paper introduces a novel prompting method called permutation generation, which has the following steps (see the sketch after this list):

  • Retrieve the top-k candidate passages using BM25
  • Using the prompt in part c) of Figure 3, ask the LLM to rank the candidate passages in groups of w, where w is the window size, starting from the back of the list
  • Slide the window from back to front using a step size of s, where s is equal to w/2, so that the top half of the passages ranked by the LLM in the previous window is included in the next window
  • Repeat this process until the window reaches the front of the list
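Here is a minimal sketch of that sliding-window loop. The rank_window(query, passages) helper is a hypothetical stand-in for a single LLM call with the Figure 3c prompt, including parsing the returned permutation back into an ordered list:

```python
def sliding_window_rerank(query, passages, window=20, step=10):
    """Re-rank `passages` (already ordered by BM25) by sliding an
    LLM-ranked window from the back of the list to the front."""
    ranked = list(passages)
    end = len(ranked)
    while True:
        start = max(0, end - window)
        # `rank_window` is a hypothetical one-call LLM ranker (Figure 3c prompt).
        ranked[start:end] = rank_window(query, ranked[start:end])
        if start == 0:   # the window has reached the front of the list
            return ranked
        # With step = window / 2, the top half of this window is carried
        # forward and re-considered in the next (earlier) window.
        end -= step
```

Passages promoted by the LLM in a later window therefore get another chance to move up in the next window, which is how a relevant passage buried deep in the BM25 list can bubble toward the top.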

A key benefit of this method is that it allows the LLM to generate rankings directly, without relying on log-probability scores, and the windowing strategy helps to overcome the token limit of LLMs by breaking the ranking task into smaller subtasks.

However, one drawback of this approach is that there is a chance for the model to produce inconsistent results from window to window. Figure 9 shows how often this occurred in Davinci-003, GPT-3.5-turbo and GPT-4. Another drawback is that this approach increases the number of API calls, the token cost, and the inference time.

Tuning — Sliding window hyperparameters

Figure 4: Sliding window hyperparameter tuning on TREC-DL19 using GPT-3.5-turbo-16k

The authors experimented with different window sizes and step sizes to find the optimal configuration. They tested these hyperparameters using GPT-3.5-turbo-16k on the TREC-DL19 dataset and selected a window size of 20 and a step size of 10 because that configuration achieved the highest nDCG@10 score. Interestingly, the w=40, s=20 configuration performed better on nDCG@5 and nDCG@1.
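One practical consequence of these hyperparameters is the number of LLM calls per query. Under the sliding-window procedure above, a back-of-the-envelope count (my own sketch, not from the paper) looks like this:

```python
import math

def num_llm_calls(k, window, step):
    """Number of sliding-window LLM calls needed to re-rank k candidates."""
    if k <= window:
        return 1
    return 1 + math.ceil((k - window) / step)

print(num_llm_calls(100, window=20, step=10))  # 9 calls with shorter prompts
print(num_llm_calls(100, window=40, step=20))  # 4 calls with longer prompts
```

So the larger window roughly halves the number of API calls, at the cost of packing more passages into each prompt.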

Knowledge distillation for efficient inference

Using LLMs to perform re-ranking is prohibitively expensive in terms of inference time and token cost, and is a bit like using a sledgehammer to crack a nut. To address this, the authors propose a knowledge distillation method called permutation distillation to train smaller, more efficient models that can perform re-ranking tasks with comparable performance to fine-tuned models. This method is effective even with black-box models like ChatGPT, which do not expose their internal log-probability scores.

Approach at a glance:

  • Sample N (10k) queries from the MS MARCO dataset
  • Retrieve M (20) candidate passages for each query using BM25
  • Use permutation generation to produce a ranked list of passages with ChatGPT (the teacher model)
  • Train the student model by minimizing a pairwise loss between the student's scores and the teacher's ranking
Overview of the training objective for the permutation distillation method

Above is an overview of the training objective for the permutation distillation method. The student model is trained to minimize the RankNet loss,[7] a pairwise loss between its output scores and the teacher model's ranking of the candidate passages. The pairwise examples are created by taking all pairwise combinations of passages ranked by the teacher model and keeping only the pairs where the first passage is ranked higher than the second. This results in a total of M(M-1)/2 training examples per query, where M is the number of candidate passages. So even though the authors sample only 10k queries from the MS MARCO dataset, they end up with 1.9 million pairwise training examples for the student model (for M = 20).
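As a concrete illustration, here is a minimal, framework-free sketch of the pairwise RankNet objective described above. In practice the student scores would come from a cross-encoder such as DeBERTa-Large and the loss would be minimized with a deep-learning framework; the averaging over pairs is my own simplification:

```python
import math

def ranknet_loss(student_scores, teacher_ranks):
    """Pairwise loss: whenever the teacher ranks passage i above passage j,
    penalize the student for scoring j at or above i."""
    loss, pairs = 0.0, 0
    m = len(student_scores)
    for i in range(m):
        for j in range(m):
            if teacher_ranks[i] < teacher_ranks[j]:  # teacher prefers i over j
                loss += math.log(1 + math.exp(student_scores[j] - student_scores[i]))
                pairs += 1
    return loss / max(pairs, 1)

# Four candidate passages: the teacher (ChatGPT) ranked passage 1 best (rank 0).
# With M = 4 there are M(M-1)/2 = 6 pairs contributing to the loss.
print(ranknet_loss(student_scores=[0.1, 0.9, -0.2, 0.4],
                   teacher_ranks=[2, 0, 3, 1]))
```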

Additional Experiments, Ablations, and Details

This paper includes a lot of additional experiments and ablations to validate the proposed methods and to understand the underlying mechanisms better, and we’ll touch on a few of the interesting ones here.

Experiment — Fine-tuned vs. knowledge-distilled models: student surpasses teacher

Figure 5: Fine-tuned models compared against permutation-distilled student models from ChatGPT

In Figure 5, the authors show the performance of BM25, ChatGPT, fine-tuned monoT5-3B, fine-tuned DeBERTa-Large, and fine-tuned LLaMA-7B models compared against the permutation-distilled student models trained on ChatGPT ranks. The results show that the student models surpass the teacher model (ChatGPT) on DL19, DL20, and the average BEIR benchmarks, and perform on par with the fine-tuned monoT5-3B model while beating both the fine-tuned DeBERTa-Large and LLaMA-7B models. This demonstrates the effectiveness of the permutation distillation method for training smaller, more robust, and more efficient models for re-ranking tasks. Interestingly, the 7B-parameter LLaMA model does not significantly outperform the 435M-parameter DeBERTa-Large model. The authors make their distilled student model artifacts and training code available for download in their codebase: [https://github.com/sunnweiwei/RankGPT]

Experiment — LLM comparison

Figure 6: LLM comparison on TREC-DL19

Figure 6 shows a comparison among different available LLM services, including Google Bard, GPT-4, Anthropic Claude, Cohere Rerank, and more. The authors apply their permutation generation method to these models against the TREC-DL19 benchmark dataset. The results show that GPT-4 slightly outperforms the other models on nDCG@5 and nDCG@10, but not on nDCG@1, where Google Bard and Anthropic Claude-instant-1 perform better and Cohere rerank performs similarly well.

Experiment — LLM prompting comparison (OpenAI models)

Figure 7: Instruction and LLM comparison

In Figure 7, the authors compare various OpenAI LLM endpoints (GPT-3.5, GPT-4, Davinci-003, and Curie-001) using the different instruction strategies introduced previously (i.e. query generation, relevance generation, and permutation generation). The results show that the permutation generation (PG) method consistently outperforms the other two methods across all LLMs, and that it delivers an outsized boost on nDCG@1. The authors conjecture that “LLMs gain a more comprehensive understanding of the query and passages by reading multiple passages with potentially complementary information, thus improving the model’s ranking ability” as a rationale for the performance of PG.

One other interesting observation from these results is that GPT-3.5 and GPT-4 are comparable in performance on nDCG@1, even though GPT-4 is far better at nDCG@5 and nDCG@10 (boxed in Figure 7). Also, Davinci-003 is much worse than GPT-3.5 even though the two models are similar in size, which the authors attribute to the observation that Davinci-003 is more prone to producing inconsistent results, i.e. missing passages, during re-ranking.

Ablation — Sensitivity to initial passage order

Figure 8: Ablation — sensitivity to initial passage order on TREC-DL19

In Figure 8, the authors investigate the sensitivity of the permutation generation method to the initial passage order. They find that performance is very sensitive to the initial ordering produced by BM25: it drops significantly when the initial passage order is randomized, indicating that the quality of the initial ordering is crucial to the success of permutation generation. Interestingly, random order (1) actually performs worse than reverse order (2). In the reverse-order case, the model has to carry the most relevant passages from the back of the list to the front without dropping them between sliding windows, which conceptually makes the task much harder, so it is not clear why the random order fares even worse; this may warrant further study.

LLM Error Analysis

Figure 9: LLM Error Analysis

In Figure 9, the authors quantify the frequency of inconsistent or erroneous outputs from the LLM in the following categories (a sketch of the RBO metric follows this list):

  • Repetition: the same passage identifier is duplicated in the ranking
  • Missing: one or more passages are missing from the ranking
  • Rejection: the model rejects the prompt and refuses to rank the passages
  • RBO: rank-biased overlap, a measure of the similarity between two rankings; how consistent is the model’s ranking from window to window?
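For readers unfamiliar with RBO, here is a minimal sketch of a truncated rank-biased overlap computation (the extrapolated variant from Webber et al. is more involved); it is my own illustration, not the authors' code:

```python
def rbo(list_a, list_b, p=0.9):
    """Truncated rank-biased overlap between two rankings: prefix agreement
    at each depth d, weighted geometrically by p**(d-1)."""
    depth = min(len(list_a), len(list_b))
    seen_a, seen_b = set(), set()
    score = 0.0
    for d in range(1, depth + 1):
        seen_a.add(list_a[d - 1])
        seen_b.add(list_b[d - 1])
        score += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * score

# Two windows' rankings of the same passage identifiers; 1.0 means identical.
print(rbo([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))
```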

The results show that missing passage(s) are the most common error for Davinci-003 and GPT-3.5-turbo, while rejection is the most common error for GPT-4. Overall, GPT-4 had slightly higher RBO than GPT-3.5-turbo, making it the most consistent and least error-prone model among the three. However, due to cost considerations, the authors generated their training data using relevance judgements from ChatGPT.

Thoughts and Takeaways

My colleagues and other Applied Scientists here at TR Labs enjoyed reading this paper and had a lively discussion about its merits and how we could potentially apply some of the methods to our own work. Here are some of the key takeaways and thoughts from our discussion:

  • In a scientific environment where we’re faced with the challenge of working with black-box LLMs, this paper provides some compelling evidence (and effective strategies) for distilling such black-box LLMs into smaller, more practical models that can be used in real-time search scenarios without the need for model scores.

At Labs, we are exploring permutation distillation in our effort to improve passage re-ranking, and we are adding permutation generation as a new method to our toolkit for using LLMs to generate training data for re-ranking models.

  • In addition to the challenge of working without model scores, another issue with conducting research on LLMs is the lack of transparency in their training data. The authors introduced a new benchmark dataset, NovelEval, in part because of concerns that there could be potential data leakage in the LLM training data, which is important to keep in mind when evaluating outputs from LLMs.

At Labs, we are constantly designing and evaluating new grading tasks to estimate the quality of our retrieval systems, and we are leaning on our subject-matter experts to provide high-quality annotations for these tasks.

  • While the paper shows impressive results and is fairly detailed, there appears to be room for further optimization of the student model. For example, one could train the student model on LLM-ranks over additional datasets beyond the sampled queries from MS MARCO, or explore data augmentation approaches to improve the robustness of the student model.

At Labs, we are experimenting with other, more domain-specific, queries, and exploring additional hyperparameters, such as thresholding the training examples by rank.

  • Finally, and most importantly, while the authors choose to train the student model on a well-known dataset (MS MARCO) for the basis of comparison, these results suggest that it might be effective to use LLMs as teacher models for any set of queries, opening up possibilities to use LLMs for boosting IR performance on specific use cases, non-English language, or even specialized domains.

At Labs, we are excited to incorporate these techniques into our own research and development efforts, and we are looking forward to seeing how they might improve our existing systems.

💬 What were your thoughts about this paper? Let us know in the comments!

[1] Sun, W., et al. (2023). Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. ArXiv, abs/2304.09542.

[2] Craswell, N., et al. (2021). Overview of the TREC 2020 Deep Learning Track. ArXiv, abs/2102.07662.

[3] Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models. NeurIPS 2021.

[4] Zhang, X., et al. (2021). Mr. TyDi: A Multilingual Benchmark for Dense Retrieval. MRL 2021.

[5] Sachan, D. S., et al. (2022). Improving passage retrieval with zero-shot question generation. EMNLP 2022.

[6] Liang, P., et al. (2022). Holistic evaluation of language models. ArXiv, abs/2211.09110.

[7] Burges, C. J. C., et al. (2005). Learning to Rank using Gradient Descent. ICML 2005.

Further Reading/References on Knowledge Distillation of LLMs

Distilling reasoning ability

[8] Fu, Y., et al. (2023). Specializing smaller language models towards multi-step reasoning. ICML 2023.

[9] Magister, L. C., et al. (2023). Teaching small language models to reason. ACL 2023.

Self-instruct

[10] Wang, Y., et al. (2023). Self-Instruct: Aligning language models with self-generated instructions. ACL 2023.

[11] Taori, R., et al. (2023). Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca

Using LLM generation to improve retrieval systems

[12] Sachan, D. S., et al. (2023). Questions are all you need to train a dense passage retriever. TACL 2023.

[13] Shi, W., et al. (2023). REPLUG: Retrieval-augmented black-box language models. ArXiv, abs/2301.12652.
