Introduction to Retrieval Augmented Generation (RAG) and Application Demos
RAG Background
Today I will provide an introduction to Retrieval Augmented Generation (RAG) and demonstrate some applications. You can access the materials for this talk in my GitHub repository [2].
In the repository, you will find a PDF with the slides for my presentation. Later, we will do some hands-on experiments to apply RAG. All the code and data are available in the GitHub repo for you to follow along.
Let’s start with an overview of RAG. RAG is a powerful technique for enhancing large language models. In my view, we should focus on how to best apply large language models, and RAG is one of the most effective approaches, especially for developers.
Large language models have some intrinsic limitations:
- They can provide misleading or hallucinated information because they lack external knowledge.
- They rely on potentially outdated information, since their training data has a cutoff date; for example, GPT-3's training data predates 2021.
- They lack depth and specificity on niche topics outside their training data.
- Training and fine-tuning LLMs is computationally expensive and infeasible for many organizations.
- They cannot show where their knowledge comes from, nor ensure privacy compliance when sensitive data is provided.
RAG can significantly improve the accuracy and relevance of generated content. It first retrieves relevant information from an external database or documents before generating text. [1]
Here’s an example: imagine a user asks “How do you evaluate the fact that OpenAI’s CEO Sam Altman went through a sudden dismissal by the board and was rehired by the company in just three days, resembling a real-life version of Game of Thrones in terms of power dynamics?”
ChatGPT alone would not be able to answer this properly, since the event happened after its training cutoff. With RAG, we would first retrieve relevant documents and extract key snippets such as “Sam Altman returns to OpenAI as CEO, silicon valley drama resembles a comedy”, “the drama concludes? Sam Altman to return as CEO of OpenAI, board to undergo restructuring”, and “the personnel turmoil at OpenAI comes to an end. who won and who lost?” These three snippets are combined into the prompt to provide context for the question, and the large language model can then generate a coherent answer based on the retrieved information.
RAG Timeline and Techniques
Looking at the history, RAG originated in academia, where retrieval was explored at three stages: pre-training, fine-tuning, and inference. More recent, practice-oriented techniques focus on inference-time retrieval. Before 2022, only a few RAG techniques had been proposed; since 2023, however, we have seen a boom of new RAG techniques.
RAG improves the precision and relevance of LLM outputs by first retrieving relevant information from an external knowledge source before generating a response. The classic RAG process, also known as Naive RAG, includes three basic steps (a toy sketch follows the list):
- Indexing: Documents are split into shorter texts (“chunks”) and indexed in a vector database using an encoder model.
- Retrieval: Relevant chunks are found based on similarity between the question and chunks.
- Generation: The LLM generates an answer conditioned on the retrieved context.
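To make these three steps concrete, here is a toy, self-contained sketch. The word-count “embedding”, the hard-coded chunks, and the stubbed prompt are purely illustrative assumptions; a real pipeline would use an encoder model, a vector database, and an actual LLM call.

```python
# Toy illustration of the Naive RAG steps (indexing, retrieval, generation).
# The bag-of-words "embedding" is a stand-in for a real encoder model.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing: split documents into chunks and encode each chunk.
chunks = [
    "Sam Altman returns to OpenAI as CEO after the board reverses its decision.",
    "The OpenAI board will undergo restructuring after the leadership turmoil.",
    "RAG retrieves external documents before generating an answer.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# 2. Retrieval: rank chunks by similarity to the question and keep the top-k.
question = "Why did Sam Altman return to OpenAI as CEO?"
q_vec = embed(question)
top_chunks = [c for c, v in sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]]

# 3. Generation: wrap the retrieved context and the question into a prompt
#    for the LLM (the actual model call is omitted here).
prompt = "Answer using the context below.\n\nContext:\n" + "\n".join(top_chunks) + f"\n\nQuestion: {question}"
print(prompt)
```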
The Advanced RAG paradigm involves additional processing in Pre-Retrieval and Post-Retrieval.
- Before retrieval, methods such as query rewriting, routing, and query expansion can be used to bridge the semantic gap between questions and document chunks.
- After retrieval, re-ranking the retrieved documents can mitigate the “Lost in the Middle” phenomenon, or the context can be filtered and compressed to shorten the window length; a small re-ranking sketch follows this list.
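As an illustration of the post-retrieval step, the sketch below re-ranks candidate passages with a cross-encoder from the sentence-transformers library. The model name, query, and passages are illustrative assumptions, not part of the original demo.

```python
# Post-retrieval re-ranking with a cross-encoder: score each (query, passage)
# pair jointly and put the most relevant passages first, so good context is
# not "lost in the middle" of the prompt.
from sentence_transformers import CrossEncoder

query = "Why did Sam Altman return to OpenAI as CEO?"
candidates = [
    "Sam Altman returns to OpenAI as CEO, board to undergo restructuring.",
    "RAG retrieves external documents before generating an answer.",
    "The personnel turmoil at OpenAI comes to an end.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, passage) for passage in candidates])
reranked = [passage for _, passage in sorted(zip(scores, candidates), reverse=True)]
print(reranked)
```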
Modular RAG has also been introduced. Structurally, it is freer and more flexible, introducing more specific functional modules, such as query search engines and the fusion of multiple answers. Technologically, it integrates retrieval with fine-tuning, reinforcement learning, and other techniques. In terms of process, the RAG modules are designed and orchestrated in different ways, resulting in various RAG patterns.
To build a good RAG system, three critical questions need to be considered: What to retrieve? When to retrieve? How to use the retrieved content?
- Augmentation Sources: unstructured data such as text paragraphs, phrases, or individual words; structured data such as indexed documents, triples, or subgraphs; or content generated by the LLMs themselves.
- Augmentation Stages: retrieval can be performed during the pre-training, fine-tuning, or inference stage.
- Augmentation Process: early RAG performed a single, one-off retrieval, but iterative retrieval, recursive retrieval, and adaptive retrieval methods, where the LLM decides the timing of retrieval on its own, have gradually emerged as RAG developed.
The figure below gives a more detailed taxonomy of RAG, covering the augmentation stage (pre-training, fine-tuning, inference), the augmentation source (unstructured data, structured data, LLM-generated content), and the augmentation process (one-off retrieval, iterative retrieval, adaptive retrieval, recursive retrieval).
The figure below shows RAG-related terminology and the corresponding reference papers.
RAG features
A good way to understand RAG is through comparison. RAG gives the model an external knowledge source for customized information retrieval, which makes it well suited for specific queries. Here is an analogy: RAG is like handing a student a textbook for an open-book exam, whereas fine-tuning is like a student gradually acquiring and internalizing knowledge over time, which makes it better suited for mimicking specific structures, styles, or formats.
Based on the need for external knowledge and the degree of model customization, RAG and fine-tuning each have appropriate applications. RAG requires little model adaptation but extensive external knowledge, while fine-tuning adapts the model significantly yet needs less external data. In most cases, combining RAG, fine-tuning, and prompt engineering yields the best results.
RAG evaluation
After implementing RAG, thorough evaluation is critical. It typically relies on three quality scores (context relevance, answer faithfulness, and answer relevance) and examines four key capabilities: robustness to noise, refusal ability (knowing when not to answer), integration of information, and counterfactual robustness. Standardized benchmarks like RGB and RECALL, as well as automated evaluation tools like RAGAS, ARES, and TruLens, are available to evaluate RAG systems.
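As a hedged example of tool-based evaluation, the sketch below scores a single made-up record with RAGAS. The column names and metric imports follow an early-2024 version of the library and may differ in newer releases, and an OpenAI API key is assumed because RAGAS uses an LLM as the judge.

```python
# Minimal RAGAS evaluation sketch (early-2024-era API; names may have changed).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One hand-made record: the question, the generated answer, and the retrieved
# contexts. Real evaluations use many such records from your own RAG system.
data = {
    "question": ["Who presented the RAG introduction?"],
    "answer": ["Henry presented the introduction to RAG."],
    "contexts": [["Henry gave a talk introducing RAG at the GDG Hong Kong study group."]],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness, answer_relevancy])
print(result)
```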
RAG future
While powerful, RAG faces some challenges. Simply enlarging the context window does not necessarily improve performance. Making retrieval robust and filtering out low-quality content is difficult, and incorrectly retrieved content may poison the final answer. Balancing RAG and fine-tuning can be tricky, and it is unclear whether larger models always improve RAG. The role of the LLM in the retrieval process needs to be explored further. Productionizing RAG at scale and protecting sensitive data are other concerns, and expanding RAG to handle images, audio, and video remains an open problem.
But RAG shows promise for question answering, recommendation systems, information extraction, and report generation. Mature RAG technology stacks such as LangChain and LlamaIndex are booming, and the market is also seeing the emergence of more targeted RAG tools, both customized and simplified. Hence, the ecosystem will continue to expand with new tools tailored for RAG.
RAG practice
So far I have provided a high-level overview of RAG. Next, I will demonstrate some hands-on RAG experiments so you can apply these techniques in your own projects. I have three Python scripts that showcase different RAG pipelines built with LlamaIndex [3].
1. Basic RAG pipeline
2. Sentence-window RAG pipeline
3. Auto-merging RAG pipeline
The basic RAG pipeline augments a large language model with an existing database. Queries first retrieve relevant context from the database before generating an answer.
We chunk documents into smaller passages of 64 tokens with a 2-token overlap. The passages are encoded into vectors and indexed in a vector database. Given a query, we search for the most similar passages, wrap them into a prompt with the question, and send this to the language model to generate an answer.
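Below is a minimal sketch of this basic pipeline, assuming the legacy (pre-0.10) LlamaIndex API, the sample file from the repo, and an OpenAI API key already loaded into the environment; import paths and defaults may differ in other LlamaIndex versions, and the query string is just an example.

```python
# Basic RAG pipeline sketch with LlamaIndex (legacy, pre-0.10 style imports).
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext

# Load the sample document from the repo.
documents = SimpleDirectoryReader(input_files=["data/Henry.txt"]).load_data()

# Chunk into ~64-token passages with a 2-token overlap, as described above.
service_context = ServiceContext.from_defaults(chunk_size=64, chunk_overlap=2)

# Encode the chunks and build the vector index.
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Retrieve the most similar passages, wrap them into a prompt, and generate.
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("Who is Henry?")
print(response)

# Inspect the original source of the retrieved passages.
for source in response.source_nodes:
    print(source.node.get_text())
```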
The Sentence-window retrieval pipeline is useful when we need more context. Instead of token chunks, we segment documents into sentences. We retrieve the most similar sentence as well as the previous and next sentences to form a context window. The context windows are re-ranked and provided to the language model.
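Here is a corresponding sketch, again assuming the legacy LlamaIndex API; the SentenceWindowNodeParser keeps a symmetric window (two sentences on each side in this illustrative configuration), and the re-ranker model name is an assumption rather than the one used in the original scripts.

```python
# Sentence-window retrieval sketch with LlamaIndex (legacy-style imports).
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.indices.postprocessor import (
    MetadataReplacementPostProcessor,
    SentenceTransformerRerank,
)

documents = SimpleDirectoryReader(input_files=["data/Henry.txt"]).load_data()

# Split into single sentences, storing the neighbouring sentences as metadata.
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,  # sentences kept on each side of the retrieved sentence
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
service_context = ServiceContext.from_defaults(node_parser=node_parser)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine(
    similarity_top_k=6,
    node_postprocessors=[
        # Replace each retrieved sentence with its full context window ...
        MetadataReplacementPostProcessor(target_metadata_key="window"),
        # ... then re-rank the windows and keep the best two.
        SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base"),
    ],
)
print(query_engine.query("Who is Henry?"))
```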
The auto-merging retrieval pipeline creates a hierarchy of chunks for retrieval. Small 16-token passages are linked to their parent 64-token passages, which are in turn linked to 256-token passages. If enough small passages under the same parent are retrieved, they are merged into the parent chunk. The final chunks are re-ranked before being passed to the language model. This allows dynamically sized context.
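The sketch below shows the same idea with LlamaIndex's AutoMergingRetriever and the 16/64/256-token hierarchy mentioned above; it again assumes the legacy API, the repo's sample file, and an illustrative re-ranker model.

```python
# Auto-merging retrieval sketch with LlamaIndex (legacy-style imports).
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.retrievers import AutoMergingRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.postprocessor import SentenceTransformerRerank

documents = SimpleDirectoryReader(input_files=["data/Henry.txt"]).load_data()

# Build a 256 -> 64 -> 16 token hierarchy of chunks (largest first).
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[256, 64, 16])
nodes = node_parser.get_nodes_from_documents(documents)
leaf_nodes = get_leaf_nodes(nodes)

# The docstore keeps every node so parents can be looked up during merging;
# only the 16-token leaf nodes are embedded and indexed.
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)
index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)

# If enough retrieved leaves share a parent, they are merged into the parent chunk.
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=12), storage_context, verbose=True
)
query_engine = RetrieverQueryEngine.from_args(
    retriever,
    node_postprocessors=[SentenceTransformerRerank(top_n=2, model="BAAI/bge-reranker-base")],
)
print(query_engine.query("Who is Henry?"))
```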
To use the code, first create a Python environment from the NLP.yml file under the python_env folder. Add your OpenAI API key to the openAI.env file under the common folder. The sample data is Henry.txt under the data folder, but you can provide your own documents.
The basic pipeline chunks the document, indexes it in a vector database, takes a query, retrieves similar passages, wraps them into a prompt and sends it to the language model. We can see the original source of the retrieved passages.
The sentence-window pipeline performs retrieval at the sentence level, expanding each retrieved sentence with the two previous sentences and the next sentence. Re-ranking is demonstrated to select the most relevant windows.
The auto-merging pipeline builds a hierarchy of passages from 16 to 256 tokens, merging passages into larger chunks as needed. It provides longer context while maintaining precision.
The code is designed to be plug-and-play. You can take it and apply RAG to your own documents and use cases after configuring the API key and virtual environment. Please try it out and let me know if you have any other questions!
References
[1] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J. and Wang, H., 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.
[2] https://github.com/HenryHengLUO/Retrieval-Augmented-Generation-Intro-Project
[3] https://www.llamaindex.ai/
[4] https://learn.deeplearning.ai/building-evaluating-advanced-rag
Appendix
This article was modified and rewritten from Henry’s presentation at the GDG Hong Kong AI/ML Study Group on January 27, 2024. To wrap up, all attendees remembered a funny line from the demo: "Henry is the most pretty boy in Hong Kong."
I would like to thank Thomas and Kin for organizing this wonderful event.