Understanding Retrieval Pitfalls: Challenges Faced by Retrieval Augmented Generation (RAG) models

Improving the performance and application of Large Language Models

7 min read · Feb 27, 2024

Image generated with Google’s Gemini, 24 February 2024.

Author

Amanda Kau (ORCID: 0009-0004-4949-9284)

Introduction

Large language models (LLMs) like GPT-4, the engine of products like ChatGPT, have taken centre stage in recent years due to their astonishing capabilities. Yet, they are far from perfect. Many of us have since learnt, perhaps when asking ChatGPT a question or employing it to write our reports, that LLMs can hallucinate. This happens when the LLM expresses false information so eloquently that we might be fooled by it. This major flaw has spurred the popularity of Retrieval-Augmented Generation (RAG) techniques as a way to ground an LLM’s responses in external knowledge.

To start off, this article gives a brief overview of the key concept behind RAG. It then reviews several issues in the retrieval step of RAG: when and what should be retrieved, the quantity of retrieved documents, the effects of data quality, and RAG applied to different domains. For each challenge, strategies proposed by the research community are briefly introduced.

A Brief Introduction To RAG

Using RAG is very much like performing Google searches while writing a report. When the LLM attempts to generate a sentence without the requisite knowledge, it needs to reference an external source to gain that knowledge. This is particularly common in tasks like open domain question answering (ODQA), where the LLM may be asked about a specific topic that requires in-depth knowledge. Although LLMs are trained on extensive amounts of diverse data, it is unreasonable to expect the LLM to have internalised every potential answer and to deliver the response that an expert might provide.

This is where RAG comes in handy. RAG comprises two components: the retriever and the generator. The retriever gathers a set of supporting documents from a given source based on an input query and passes them on to the generator (the LLM). This helps the LLM reduce hallucinations and craft better-informed, more relevant responses.
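The retriever and generator split can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the retriever here ranks documents by bag-of-words cosine similarity rather than with a trained dense retriever, the function names (`retrieve`, `build_prompt`) and the three-sentence corpus are made up for the example, and the generator is represented only by the prompt that would be handed to an LLM.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    """Toy retriever: return the k documents most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents):
    """Assemble the text that would be passed to the generator (the LLM)."""
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical mini-corpus, for illustration only.
corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Photosynthesis converts light energy into chemical energy.",
    "Paris is the capital city of France.",
]
question = "Where is the Eiffel Tower located?"
top_docs = retrieve(question, corpus, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)
```

In a real system the similarity function would typically be replaced by a learned retriever and the prompt would be sent to an LLM; the division of labour stays the same.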

Addressing Retrieval Problems and Strategies

1. When and What To Retrieve

Many retrieval-augmented language models utilise a single retrieval process, which is particularly restrictive if the model is tasked with generating a long passage of text. It is akin to writing a report whilst only being permitted to reference documents once at the start. Conversely, models that attempt multiple retrievals might do so at fixed intervals. This seems illogical when we draw parallels to our report writing scenario: it is like doing an online search after every two sentences we write. When the LLM is routinely bombarded with additional information it does not require, it might become disoriented and return incoherent or irrelevant responses.

Some attempts to address this problem have been made, such as the Forward-Looking Active REtrieval (FLARE) augmented generation method. The premise of FLARE is simple: retrieve supporting information only when the LLM signals low confidence, which suggests that it lacks the required knowledge.

The subsequent challenge revolves around what information to retrieve for the upcoming sentence. Imagine you were asked to find evidence for a friend who is formulating an argument, but you must provide the evidence without knowing what their next sentence will be. FLARE targets this by prompting the LLM to generate a temporary candidate for the next sentence and using it to retrieve documents. In essence, you ask the friend what they might want to say next before doing your search.
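The loop described above can be sketched as follows. This is a simplified illustration of the idea rather than the FLARE authors' implementation: `toy_draft_next` and `toy_search` are hypothetical stand-ins for an LLM call and a search call, the token probabilities are made up, and the 0.5 confidence threshold is an arbitrary assumption.

```python
def active_generate(question, draft_next, search, max_sentences=3, threshold=0.5):
    """Generate sentence by sentence, retrieving only when confidence is low.

    draft_next(question, answer_so_far, docs) -> (sentence, token_probabilities)
    search(query) -> list of documents
    """
    answer = []
    retrievals = 0
    for _ in range(max_sentences):
        # Step 1: draft a temporary next sentence without any retrieved documents.
        sentence, probs = draft_next(question, answer, [])
        if not sentence:
            break
        # Step 2: if any token is low-confidence, use the draft as the search
        # query and regenerate the sentence with the retrieved documents.
        if min(probs) < threshold:
            docs = search(sentence)
            retrievals += 1
            sentence, _ = draft_next(question, answer, docs)
        answer.append(sentence)
    return answer, retrievals


def toy_draft_next(question, answer_so_far, docs):
    """Hypothetical LLM stub: canned sentences with made-up probabilities."""
    if len(answer_so_far) == 0:
        return "Australia is a country in the Southern Hemisphere.", [0.9, 0.95, 0.9]
    if len(answer_so_far) == 1:
        if docs:
            return "Its capital is Canberra.", [0.9, 0.9, 0.95]
        return "Its capital is Sydney.", [0.9, 0.8, 0.3]  # unsure about the last token
    return "", []  # nothing more to say


def toy_search(query):
    """Hypothetical retriever stub returning a fixed document."""
    return ["Canberra is the capital of Australia."]


answer, retrievals = active_generate("Tell me about Australia.", toy_draft_next, toy_search)
print(answer, retrievals)
```

The first sentence is accepted as drafted, whereas the second triggers a single retrieval because one of its tokens falls below the threshold.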

2. Quantity of Retrieved Documents

The architectures of some RAG models confine them to retrieving a certain number of supporting documents. Picture writing a report whilst only having access to a fixed number of documents — your viewpoints and content would be largely based on that limited pool of knowledge. If that set of documents were replaced, your argument might undergo a drastic shift as your supporting information might change, resulting in a lack of consistency in responses. LLMs are vulnerable in a similar way and are sensitive to the quality of retrieved documents. That is, if retrieved documents prove irrelevant to the context, the LLM’s output suffers.

One attempt to mitigate this involves clustering the training data as in the case of MemGM, a memory-augmented generative model. This notion of memory-augmentation means the model internalises the characteristics of clusters of responses and uses them to aid response generation. To do this, the data, which consists of query and response example pairs, is grouped according to the similarity of queries. Subsequently, when MemGM retrieves information from this database, the cluster average representing the cluster’s characteristics is returned instead of a set of documents. This approach dilutes individual responses, hence decreasing the LLM’s sensitivity to individual documents. This also serves as a means to provide generalised support from a large number of documents to the LLM.
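To make the idea of returning a cluster average more concrete, here is a loose sketch. It is not the MemGM architecture: it assumes queries and responses have already been embedded as vectors (the two-dimensional numbers below are made up), it uses a simple greedy clustering rule with an arbitrary `radius`, and the function names are hypothetical.

```python
import math


def mean(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def greedy_cluster(pairs, radius):
    """Group (query_vec, response_vec) pairs by query similarity.

    A pair joins the first cluster whose query centroid lies within
    `radius`; otherwise it opens a new cluster.
    """
    clusters = []
    for pair in pairs:
        for cluster in clusters:
            centroid = mean([q for q, _ in cluster])
            if math.dist(pair[0], centroid) <= radius:
                cluster.append(pair)
                break
        else:  # no existing cluster was close enough
            clusters.append([pair])
    return clusters


def build_memory(clusters):
    """One memory slot per cluster: (average query, average response)."""
    return [(mean([q for q, _ in c]), mean([r for _, r in c])) for c in clusters]


def read_memory(memory, query_vec):
    """Return the averaged response of the cluster nearest to the query."""
    _, value = min(memory, key=lambda slot: math.dist(slot[0], query_vec))
    return value


# Made-up embeddings of (query, response) example pairs.
pairs = [
    ([0.0, 0.0], [1.0, 3.0]),
    ([0.0, 1.0], [3.0, 1.0]),
    ([10.0, 10.0], [8.0, 8.0]),
    ([10.0, 11.0], [6.0, 8.0]),
]
memory = build_memory(greedy_cluster(pairs, radius=2.0))
support = read_memory(memory, [0.5, 0.5])
print(support)
```

Because the returned vector is an average over the whole cluster, swapping out any single example pair shifts it only slightly, which is the reduced sensitivity described above.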

3. Quality of Retrieved Documents

Data Recency: The knowledge encapsulated within LLMs is frozen at the time they were trained, preventing them from providing up-to-the-minute information about the world. In fact, without corrective measures, LLMs become increasingly outdated and irrelevant over time. Nonetheless, they are still expected to interact with the ever-changing world, so data recency is of critical importance.

One proposed solution is to harness the power of the Internet. Not only does it contain the most current information, but years of refinement have enabled Internet search engines to properly rank results, navigate online safety and privacy concerns, and more. By allowing RAG models to access the Internet dynamically, LLMs would have access to up-to-date information with which to craft accurate responses.

Data Relevance: Another factor influencing quality is the relevance of the retrieved results to the context of the conversation or question posed to the LLM. Ideally, irrelevant retrieved data should not harm the LLM’s performance, but this is not the case. Instead, irrelevant data tends to mislead the LLM into either providing inaccurate answers or losing sight of the initial context. This error compounds in multi-hop question answering scenarios, where the LLM is misled further with each successive question.

A simple natural language inference (NLI) model can be employed alongside the LLM to circumvent this issue. Simply put, the NLI model assesses the relevance of the retrieved information to the conversational context. If the retrieved documents are deemed irrelevant, the prompt is given directly to the LLM without additional documents to avoid confusing it. Additionally, if training the LLM is an available option, studies have shown that even with a relatively small dataset of 1,000 examples, LLMs can be trained to disregard irrelevant documents.
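The gating logic described here might look roughly like the sketch below. Note the assumptions: `toy_score` is a crude word-overlap heuristic standing in for a real NLI model, `gated_prompt` and the 0.5 threshold are hypothetical, and the filter is applied per document, which is one of several reasonable ways to read the approach.

```python
import re


def gated_prompt(question, documents, score, threshold=0.5):
    """Keep only documents the scorer deems relevant.

    If nothing passes the threshold, return the bare question so the LLM
    answers from its own knowledge instead of being misled.
    """
    kept = [doc for doc in documents if score(doc, question) >= threshold]
    if not kept:
        return question
    context = "\n".join(kept)
    return f"{context}\n\n{question}"


def toy_score(document, question):
    """Stand-in for an NLI model (NOT one): the fraction of question
    words that also appear in the document."""
    q_words = set(re.findall(r"[a-z0-9]+", question.lower()))
    d_words = set(re.findall(r"[a-z0-9]+", document.lower()))
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


question = "Who wrote Hamlet?"
relevant = "William Shakespeare wrote Hamlet around 1600."
irrelevant = "The stock market closed higher today."

with_context = gated_prompt(question, [relevant, irrelevant], toy_score)
without_context = gated_prompt(question, [irrelevant], toy_score)
print(with_context)
print(without_context)
```

With a real NLI model, `score` would return something like an entailment probability, but the fallback to the unaugmented prompt would work in the same way.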

4. Problems Generalising to Specific Domains

The final challenge discussed in this article occurs when we want the LLM to have expertise in specific domains, like healthcare or finance. As mentioned in the introduction to RAG, it contains a retriever component. The retriever in the original RAG model was trained on Wikipedia-based datasets. Therefore, RAG performs well when provided with domain-specific data that adheres to the Wikipedia article format, but struggles with other data formats. A simple example is financial news, which is usually succinct, taking the form of news flashes or tweets, and does not provide additional context because the reader is assumed to possess the necessary background knowledge. This further disadvantages the LLM, as it will not have the necessary context to understand the content presented to it.

The RAG model can be further fine-tuned on domain-specific data to address this issue. In fact, a group of researchers proposed RAG-end2end, which extends the original RAG by jointly training all of its components for a specific domain. However, this is very computationally expensive and infeasible for someone who lacks access to multiple GPUs. Moreover, for broad topics like finance, it is key to incorporate macroeconomic and other contextual information to give the LLM a well-rounded view of the situation, which remains an area of research.

Conclusion

In summary, this article has offered insight into some challenges that may arise when employing RAG and some solutions proposed by the research community. It highlights the importance of understanding the data sources and retrieval steps used in RAG, so that the retrieval step benefits the LLM rather than harms its performance. In particular, one should pay attention to when and what documents are retrieved, as well as the quantity and quality of retrieved documents. Most importantly, retrieved information should be relevant to the context and desired domain for the generation of optimal responses.

References

Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., & Neubig, G. (2023). Active Retrieval Augmented Generation (Version 2). arXiv. https://doi.org/10.48550/ARXIV.2305.06983
Tian, Z., Bi, W., Li, X., & Zhang, N. L. (2019). Learning to Abstract for Memory-augmented Conversational Response Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics. https://doi.org/10.18653/v1/p19-1371
Komeili, M., Shuster, K., & Weston, J. (2021). Internet-Augmented Dialogue Generation (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2107.07566
Yoran, O., Wolfson, T., Ram, O., & Berant, J. (2023). Making Retrieval-Augmented Language Models Robust to Irrelevant Context (Version 1). arXiv. https://doi.org/10.48550/ARXIV.2310.01558
Siriwardhana, S., Weerasekera, R., Wen, E., Kaluarachchi, T., Rana, R., & Nanayakkara, S. (2023). Improving the Domain Adaptation of Retrieval Augmented Generation (RAG) Models for Open Domain Question Answering. In Transactions of the Association for Computational Linguistics (Vol. 11, pp. 1–17). MIT Press. https://doi.org/10.1162/tacl_a_00530
Zhang, B., Yang, H., Zhou, T., Ali Babar, M., & Liu, X.-Y. (2023). Enhancing Financial Sentiment Analysis via Retrieval Augmented Large Language Models. In ICAIF ’23: 4th ACM International Conference on AI in Finance. ACM. https://doi.org/10.1145/3604237.3626866