Stories by OpenRAG on Medium

Beyond Retrieval: Ushering in a New Era of Synergy Between RAG and Reasoning

OpenRAG — Fri, 02 May 2025 18:00:42 GMT

Breakthroughs in the reasoning abilities of large language models are redefining the boundaries of Retrieval-Augmented Generation (RAG) technology. How can we enable models not only to “retrieve knowledge” but also to “think like humans”? This question has sparked research into the deep integration of RAG and reasoning capabilities.

Timeline of studies on RAG-reasoning synergy

Recently, our team (led by Haofen Wang at Tongji University and Yun Xiong at Fudan University) released “Synergizing RAG and Reasoning: A Systematic Review”. It is the first work to systematically examine how to integrate RAG with reasoning, offering new insights into applying RAG to knowledge-intensive and complex tasks.

In the article, we focus on recent advances in deeply integrating retrieval and reasoning capabilities. Specifically, we aim to address the following key questions:

1. Why combine RAG with reasoning, and what new potential emerges from this synergy?

2. What are the collaborative patterns between RAG and reasoning? Are they pre-defined workflows or dynamic strategies, and how is the reasoning process initiated?

3. How can we realize a deep synergy between RAG and reasoning? What are the key techniques and optimization strategies involved?

4. What are the evaluation challenges and costs of such systems, and how can we balance performance with resource consumption?

5. In practical applications, how should “RAG+Reasoning” solutions be tailored to specific scenarios, and what are our recommendations and considerations?

Taxonomy synthesizing RAG and Reasoning

First, What is Reasoning?

Before diving into the synergy, it is important to clarify what we mean by “reasoning” in an AI system, especially as distinct from simple one-step “inference”. In the review, we define reasoning as “a structured, multi-step process that dynamically breaks down complex problems, generates intermediate hypotheses, and iteratively refines solutions using logic and evidence”. In other words, a reasoning-capable AI does not jump straight from a question to an answer. Instead, it breaks the problem into sub-problems, draws interim conclusions for each, and uses those to eventually reach a final answer — much like how a human would tackle a difficult question step by step. This is in contrast to a single-step inference where an AI simply produces an answer from the input in one go without that stepwise thought process.

To visualize this, imagine trying to solve a puzzle or a math problem. Rather than guessing the final answer outright, you would solve smaller parts of the puzzle and piece those solutions together. Similarly, a reasoning-enabled AI addresses parts of the question one by one and builds up the answer. This approach tends to be much more reliable for complex tasks than jumping directly to a conclusion.

What’s New in RAG+Reasoning?

Integrating reasoning into RAG yields several improvements over using retrieval alone. We identified five key upgrades:

Advantages of Combining RAG with Reasoning

1. Ambiguous Semantic Matching → Logic-Driven Targeted Retrieval

Traditional RAG uses superficial semantic matching and can misinterpret queries. With reasoning, the system understands the query’s intent more deeply and retrieves exactly the information it needs, rather than just related chunks. For example, if asked “How can we reduce post-operative infection risk in diabetic patients?”, a reasoning-enhanced system might deduce it should look for evidence on “blood glucose control thresholds” and “antibiotic guidelines”, rather than naively matching keywords like “diabetes post-op care”.

2. Simple Information Aggregation → Coherent Context Construction

Basic RAG might dump all retrieved chunks into the prompt, risking confusion or contradiction. A reasoning-enhanced system filters and connects evidence, checks for gaps or inconsistencies, and builds a coherent context for the model. This ensures the LLM works with information that’s logically consistent.

3. Simple and Single-Turn QA → Systemic Decision Support

Traditional RAG performs well in factual QA but struggles with multi-step problems or decisions. Adding reasoning allows the system to break down complex tasks, perform intermediate calculations, and handle multiple constraints systematically, rather than trying to answer in one go. For example, imagine an engineering query that asks for a construction plan meeting several safety and cost requirements. A reasoning-augmented RAG can iterate through these requirements, perhaps retrieve different pieces of domain knowledge for each, and then assemble a plan that accounts for all of them.

4. Indiscriminate Retrieval → Intelligent Resource Allocation

A vanilla RAG often retrieves a fixed number of documents regardless of query complexity. A reasoning-aware system can adjust: it might use minimal or no retrieval for a simple question, and do multiple rounds for a complex one. This adaptive strategy means less wasted time and computational cost on easy queries, and more focus on hard ones.
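As a rough illustration of this adaptive allocation, the sketch below maps a query to a retrieval budget. The cue-word heuristic and the names `estimate_complexity` and `retrieval_rounds` are assumptions made for this example; a practical system would more likely use a trained classifier or the LLM's own judgment to estimate complexity.

```python
# Hypothetical cue words that hint at multi-step information needs.
CUE_WORDS = ("compare", "why", "how", "and", "versus", "impact")


def estimate_complexity(query: str) -> int:
    """Toy heuristic: count cue words in the query (illustrative only)."""
    words = [w.strip("?,.").lower() for w in query.split()]
    return sum(1 for w in words if w in CUE_WORDS)


def retrieval_rounds(query: str, max_rounds: int = 3) -> int:
    """Return 0 rounds for trivial queries and up to max_rounds for complex ones."""
    return min(estimate_complexity(query), max_rounds)


simple = retrieval_rounds("What is 2 + 2?")
hard = retrieval_rounds("Compare how inflation and interest rates impact housing prices")
print(simple, hard)
```

The point of the design is only that the retrieval budget becomes a function of the query instead of a constant.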

5. Passive Knowledge Tool → Proactive Cognitive Assistant

Regular RAG is reactive: it only responds and never asks. With reasoning integrated, the system becomes proactive. It can ask the user clarifying questions when the query is ambiguous and anticipate follow-ups. For example, if you are researching a topic, it might suggest related information or warn of potential pitfalls (like contradictory evidence) before you even ask. In effect, it turns the system into a more interactive assistant that engages in dialogue to reach a goal.

These potential enhancements show how combining retrieval with reasoning makes the system more intelligent and user-centric, going beyond just fetching facts to actually understanding and addressing the user’s needs.

Why is RAG+Reasoning Needed?

Given the enhancements above, it’s clear why combining RAG with reasoning is a good idea: each component compensates for the other’s weaknesses. Let’s spell this out more explicitly.

The purpose of the synergy between RAG and reasoning

• Limits of RAG alone: RAG lacks a deep understanding of complex intents. Its retrieval strategies are static and struggle to adapt to multi-step reasoning, which makes it hard to balance accuracy and speed, and they handle multimodal and dynamic knowledge poorly.

• Challenges of reasoning alone: The reasoning space is vast, making it easy to fall into local optima; there is a lack of effective external knowledge verification mechanisms and intermediate state supervision, resulting in poor transparency and high computational resource consumption.

The idea of combining RAG with reasoning is to let each cover the other’s blind spots. We emphasize two synergy purposes:

• Reasoning-Augmented Retrieval (RAR): Use reasoning to make retrieval smarter and more iterative. The system can clarify the query and fetch information step by step. For instance, it might break a query into parts, search for each part separately, or reformulate the question if the first try wasn’t clear. Reasoning helps the system decide what to search for next. The result is that it finds the right facts more reliably and uses fewer unnecessary steps.

• Retrieval-Augmented Reasoning (ReAR): Use retrieval to make reasoning more grounded and accurate. Here, even when the model is reasoning through a problem, it continually pulls in real-world evidence to support each step. This keeps the model from hallucinating or missing knowledge. With the relevant facts at hand during its chain-of-thought, the AI’s reasoning stays on track and the final answers are much more trustworthy, which is crucial for tackling really hard problems where pure reasoning might falter.

What are the Patterns of Synergy?

If we agree that RAG and reasoning belong together, the next question is how to combine them. There are two common ways to organize the interplay between retrieval and reasoning: pre-defined workflow and dynamic workflow. Each approach has advantages in different situations.

Patterns of Synergy between RAG and Reasoning

Pre-defined workflow: This is a structured, deterministic integration. In a pre-defined workflow, the system follows a fixed sequence of steps, interleaving retrieval and reasoning in a predetermined order. You can think of it like a pipeline with a clear architecture: for example, “Step 1: break query into sub-queries → Step 2: retrieve documents for each sub-query → Step 3: reason over the collected info → Step 4: generate answer.” All the stages and their order are set ahead of time. This kind of workflow emphasizes controllability and transparency. This works well if you know the task structure in advance and can split the problem into sub-tasks.

Specifically, the pre-defined workflow pattern can be further divided into the following three types:

• Pre-retrieval reasoning: Decompose complex problems before retrieval (e.g., business rule extraction in PlanRAG).
• Post-retrieval reasoning: Perform logical verification and knowledge integration on retrieved results (e.g., conflict resolution mechanism in ActiveRAG).
• Hybrid mode: Iterative “retrieve–reason–retrieve again” cycles (e.g., generation feedback optimization in ITER-RETGEN).
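The fixed four-step pipeline described above (decompose, retrieve, reason, generate) can be sketched as follows. This is a minimal illustration, not the design of any cited system: `predefined_pipeline`, the toy corpus, and the stub functions are all hypothetical stand-ins for real LLM and retriever calls.

```python
from typing import Callable, Dict, List


def predefined_pipeline(
    query: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    reason: Callable[[str, List[str]], str],
) -> str:
    """Run the stages in a fixed order that never depends on intermediate results."""
    sub_queries = decompose(query)       # Step 1: pre-retrieval reasoning
    evidence: List[str] = []
    for sub_query in sub_queries:        # Step 2: retrieve per sub-query
        evidence.extend(retrieve(sub_query))
    return reason(query, evidence)       # Steps 3-4: reason over evidence, answer


# Toy stand-ins (assumptions for this example only).
TOY_CORPUS: Dict[str, str] = {
    "safety requirements": "Doc S",
    "cost limits": "Doc C",
}


def toy_decompose(query: str) -> List[str]:
    return [part.strip() for part in query.split(";")]


def toy_retrieve(sub_query: str) -> List[str]:
    return [TOY_CORPUS[sub_query]] if sub_query in TOY_CORPUS else []


def toy_reason(query: str, evidence: List[str]) -> str:
    return " ".join(evidence) if evidence else "No evidence found."


print(predefined_pipeline("safety requirements; cost limits",
                          toy_decompose, toy_retrieve, toy_reason))
```

Because the control flow is fixed, each stage can be logged and tested in isolation, which is where the controllability and transparency of this pattern come from.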

Dynamic workflow: This is a flexible, adaptive integration. In a dynamic workflow, the sequence of retrieval and reasoning steps is not fixed in advance; instead, the system decides on the fly what to do next based on the current context. Think of this like an agent that observes its own intermediate reasoning state and can say, “I need to fetch another piece of data now,” or “I should perform a verification at this point,” or even “I’ve done enough, time to finalize the answer.” There is a kind of real-time decision engine in the loop. This approach relies heavily on the LLM’s ability to self-monitor and control its actions, often implemented via special trigger tokens or internal policies. This is powerful for tackling very complex or unpredictable queries, but it is also more complex to design and can be harder to predict or debug.

Specifically, the dynamic workflow pattern can be further divided into the following three types:

• Proactive-driven: The LLM actively triggers retrieval (e.g., real-time API calls in Agentic Reasoning).
• Reflective-driven: The system adjusts strategies based on self-assessment of intermediate results (e.g., confidence threshold control in Self-RAG).
• Feedback-driven: External reward models guide optimization (e.g., multi-dimensional process supervision in RAG-Gym).
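To make the contrast with a fixed pipeline concrete, here is a minimal sketch of a dynamic loop in which the model's own output decides whether to retrieve again or stop. The trigger strings `[RETRIEVE]` and `[FINAL]`, the function names, and the scripted model are assumptions for illustration; they are not the actual tokens or interfaces used by Self-RAG or any other cited system.

```python
from typing import Callable, List

RETRIEVE_TAG = "[RETRIEVE]"  # hypothetical trigger token
FINAL_TAG = "[FINAL]"        # hypothetical stop token


def dynamic_loop(
    query: str,
    step: Callable[[str, List[str]], str],
    retrieve: Callable[[str], List[str]],
    max_steps: int = 5,
) -> str:
    """Let each model output choose the next action, within a step budget."""
    evidence: List[str] = []
    for _ in range(max_steps):
        output = step(query, evidence)
        if output.startswith(RETRIEVE_TAG):
            search_query = output[len(RETRIEVE_TAG):].strip()
            evidence.extend(retrieve(search_query))
        elif output.startswith(FINAL_TAG):
            return output[len(FINAL_TAG):].strip()
    return "No answer within step budget."


def scripted_model(query: str, evidence: List[str]) -> str:
    """Stand-in for an LLM: ask for retrieval once, then answer."""
    if not evidence:
        return f"{RETRIEVE_TAG} {query}"
    return f"{FINAL_TAG} answered using {len(evidence)} document(s)"


def toy_retrieve(search_query: str) -> List[str]:
    return [f"document about {search_query}"]


answer = dynamic_loop("glucose thresholds", scripted_model, toy_retrieve)
print(answer)
```

The `max_steps` budget is one simple guard against the unpredictability mentioned above: a model that never emits the stop token still terminates.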

In practice, both patterns have their place. Pre-defined workflows excel when the problem type is well understood and consistency is key. Dynamic workflows shine in open-ended scenarios where the system needs maximum flexibility. Often, a solution combines elements of both, for instance mostly following a dynamic approach but with some structured guidance for critical steps, or vice versa. The right choice depends on the use case and the desired balance between reliability and adaptability.

How to Implement and Optimize “RAG+Reasoning”?

Building a well-oiled “RAG+Reasoning” system involves two parts: embedding reasoning capabilities into the pipeline and optimizing the combined process for performance.

Implementation and optimization of the synergy between RAG and Reasoning

For the reasoning process, researchers have explored methods such as:

• Chain-of-Thought (CoT): This refers to prompting or training the LLM to generate explicit step-by-step reasoning before giving a final answer. By structuring the model’s internal thought process into a logical sequence, it enables multi-step problem solving in a transparent way.

• Special Token Prediction: Another clever technique is training the model to output special “signal” tokens that act as commands, telling the system to perform some action (like retrieval). This method allows fine-grained, interpretable control within a single unified model output. It’s like giving the model a remote control over the retrieval mechanism through its own text output. Many recent systems (like Self-RAG, SmartRAG, etc.) use this approach, effectively turning the LLM into an agent that writes its own toolkit instructions mid-generation.

• Search-Driven Reasoning: This set of techniques explicitly marries search algorithms with the reasoning process. The idea is to use structured search strategies (like tree search or even Monte Carlo Tree Search) to explore different reasoning paths or hypotheses in a systematic way. This approach can be especially powerful for tasks like complex question answering or planning, where one might need to consider many possible routes.

• Reasoning on Graph: In many situations, knowledge is better represented as a graph (nodes and relationships) than as unstructured text. Graph-based reasoning methods leverage this by explicitly constructing or utilizing knowledge graphs during the reasoning process. For example, if trying to answer a question that requires understanding relationships between several entities, the system can traverse a knowledge graph to find the links. This approach enables multi-hop reasoning with clear semantics, as the graph structure ensures that each hop is a well-defined relation (like “X is a subtype of Y” or “A causes B”).

• External Solver: Sometimes the “reasoning” needed isn’t purely logical deduction but rather a precise computational problem, such as solving an equation, running an optimization, or executing code. In these cases, an effective strategy is to integrate specialized external solvers into the workflow. For example, the model might delegate a math calculation to a calculator, then use the result in its reasoning. This extends the system’s capabilities beyond what the LLM alone can do.
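Of the methods listed above, reasoning on a graph is perhaps the easiest to show in a few lines. The sketch below finds a chain of typed relations between two entities with a breadth-first search. The toy graph, the relation names, and the function `find_relation_path` are hypothetical; real systems would typically combine such traversal with LLM-guided pruning over a much larger knowledge graph.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]         # (relation, neighbor)
Triple = Tuple[str, str, str]  # (head, relation, tail)


def find_relation_path(
    graph: Dict[str, List[Edge]], start: str, goal: str, max_hops: int = 3
) -> Optional[List[Triple]]:
    """Breadth-first search returning the shortest chain of triples, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) >= max_hops:
            continue  # do not expand beyond the hop budget
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


# Toy graph (an assumption for this example).
toy_graph = {
    "A": [("causes", "B")],
    "B": [("is_subtype_of", "C")],
}

for head, relation, tail in find_relation_path(toy_graph, "A", "C"):
    print(head, relation, tail)
```

Each element of the returned path is an explicit relation, so the answer comes with a readable multi-hop justification.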

On the optimization side, key strategies include:

Prompt-Based Optimization: Prompt-based optimization boosts RAG and reasoning performance by using well-crafted language prompts to structure tasks, standardize results, and enable interactive adaptability — without changing model parameters. By decomposing reasoning into clear steps and leveraging role assignments, explicit output formats, and interactive tokens or instructions, prompts improve interpretability, reduce hallucinations, and allow dynamic adjustments during execution. This strategy enhances consistency, coherence, and external knowledge integration, outperforming traditional methods in complex scenarios. Its general, lightweight, and non-intrusive nature makes prompt-based optimization an effective foundation and mainstream approach for LLM reasoning, enabling robust task control and laying the groundwork for future hybrid methods.

Tuning-Based Optimization: This approach updates the model’s parameters through training (fine-tuning) so that it better handles the intertwined retrieval-reasoning task. Methods like CoRAG and DeepRAG fine-tune models for better retrieval paths and reasoning steps, while systems such as MCTS-KBQA and Self-RAG target structured outputs by training for precise control tokens and executable tool instructions. Other frameworks like O1-Embedder and Open-RAG use mixed fine-tuning and specialized modules for enhanced semantics and multi-hop capabilities. Collaborative strategies, typified by AdaptiveRAG and CR-Planner, use lightweight classifiers and critics to dynamically adjust retrieval strategies and assess reasoning quality. Overall, these approaches optimize models for efficient, accurate retrieval-reasoning, boosting performance across various complex tasks.

Reinforcement Learning (RL)-Based Optimization: RL has become crucial for enhancing RAG and reasoning tasks, providing flexible reward mechanisms that help language models balance knowledge retrieval and logical reasoning. RL optimization typically follows either outcome-based reward modeling (ORM), which focuses on final answer quality (e.g., R1-Searcher and KBQA-O1), or process-based reward modeling (PRM), which supervises intermediate reasoning steps (e.g., LeReT and ReARTeR). Recently, methods like GRPO, an efficient Proximal Policy Optimization (PPO) variant, have improved retrieval quality and enabled retry mechanisms, simulating persistent search behavior in LLMs.

Hybrid approaches combine ORM and PRM via composite rewards, as seen in SmartRAG and RAG-Gym, optimizing both overall output and stepwise reasoning, thereby reducing retrieval costs while maintaining performance. Academic research often explores such RL-based methods using small-scale LLMs like Qwen and Llama. Overall, RL frameworks in RAG promote both global strategy optimization and local robustness, paving the way for future advances in multi-agent, offline RL, and more granular reward structures in open-domain reasoning.
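A composite reward of the kind described above can be sketched as a weighted sum of an outcome score and a process score, minus a retrieval penalty. The function `composite_reward`, its weights, and the linear penalty are assumptions chosen for illustration; they are not the reward formulas used by SmartRAG, RAG-Gym, or any other cited method.

```python
from typing import Sequence


def composite_reward(
    outcome_score: float,
    step_scores: Sequence[float],
    num_retrievals: int,
    process_weight: float = 0.5,   # assumed weight on the process (PRM-style) term
    retrieval_cost: float = 0.05,  # assumed penalty per retrieval call
) -> float:
    """Combine final-answer quality, mean step quality, and a retrieval penalty."""
    process_score = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return (
        outcome_score
        + process_weight * process_score
        - retrieval_cost * num_retrievals
    )


# A correct answer with one weak intermediate step and two retrieval calls.
print(composite_reward(outcome_score=1.0, step_scores=[1.0, 0.5], num_retrievals=2))
```

With a penalty term like this, a policy that reaches the same answer with fewer retrievals receives a higher reward, which is one way to encourage lower retrieval cost at equal performance.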

By mixing the right techniques with smart optimization, we can build systems that not only reason and retrieve, but do so efficiently and coherently. The next question is: how do we know it is working well? That’s where evaluation comes in.

Downstream Evaluation Tasks

So far, the progress in RAG+Reasoning is often measured on tasks like multi-hop question answering, where the AI must gather and connect information from multiple sources to answer a question. Systems that combine retrieval and reasoning have been pushing the state-of-the-art on benchmarks such as HotpotQA and 2WikiMultihopQA, which test exactly those multi-step reasoning capabilities.

However, standard benchmarks only scratch the surface of what we want these systems to do. They usually check if the final answer is correct, but they do not evaluate how the AI arrived there. Did it reason properly through each step? Did it use evidence correctly? There is a growing sense that we need new ways to evaluate reasoning chains and ensure an AI’s process is sound, not just the end result.

The current downstream tasks and datasets related to the combination of RAG and Reasoning

On the horizon are more ambitious tasks that really showcase why RAG+Reasoning is valuable. We categorize some of the representative evaluation scenarios as follows:

• Deep Research Tasks: These are extremely complex, open-ended information-seeking tasks, the kind you might associate with doing research rather than answering a single question. For example, the system might be asked to investigate a scientific hypothesis, analyze a historical event with multiple sources, or compile a report on a broad topic. RAG+Reasoning systems have shown significant improvements on these tasks, as they can manage the breadth and depth required better than either alone.

• PhD(Expert)-Level Complex Reasoning Tasks: These tasks are designed to mimic what a human expert (like a PhD-level person) might do, often within a specialized domain. They require advanced domain knowledge plus complex reasoning. For instance, a medical diagnosis task where the AI must reason through symptoms and tests, or a legal analysis task where it must interpret laws and precedents. These tasks demand not only retrieving the relevant domain knowledge (e.g. clinical guidelines, legal clauses) but also applying it with rigorous logical consistency and depth of reasoning.

To support the evaluation of the above, researchers have also developed benchmarks and datasets that capture these challenges. For example, WildSeek is a dataset mentioned in our review which is built from multi-domain “deep research” queries requiring multi-hop reasoning and tool use. Another work introduced a PhD-level exam dataset covering finance, medicine, and law, to test expert-level reasoning under time constraints. These new benchmarks go far beyond trivia QA; they are explicitly constructed to require the kind of retrieval+reasoning synergy we’ve been discussing.

Looking forward, several challenges remain. One is managing dynamic, real-time information, making sure the AI can reason with fresh data that changes frequently (like news or live sensor data). Another is incorporating deep domain expertise safely and effectively, so the AI can handle specialized fields like medicine or law with the necessary rigor. As our reasoning chains get longer and more complex, issues of traceability and debugging become critical; we need better ways to watch the AI’s thought process and intervene if it goes astray. Finally, as these systems become more autonomous, ensuring safety and alignment with human values will be an ongoing concern. Solving these will be key to the next generation of robust RAG+Reasoning systems.

Hidden Costs: The “Invisible Tax” of Reasoning

We find that integrating reasoning into RAG systems comes with an overlooked cost: an “invisible tax” on resources and speed. A basic LLM is quick and efficient but limited to its training data; adding RAG expands its knowledge but introduces extra processing, storage, and token overhead. Adding multi-step reasoning further increases latency, token use, and complexity, while introducing new security and reliability risks. In fact, each extra reasoning step can make computational load and token usage skyrocket, while repeated retrievals yield diminishing returns. Because these hidden costs are easy to underestimate, we advocate for fine-grained Cost Models to quantify trade-offs and guide design choices.

From LLM to RAG and then to RAG+Reasoning, performance improvement comes with additional cost.

Non-Linear Growth of Computational Resources. RAG+Reasoning splits retrieval and reasoning into stages, causing superlinear growth in computational demand. Each extra reasoning or retrieval step greatly increases complexity. Methods like MCTS or multi-step planning drive up GPU runtime and memory compared to linear approaches. Though they improve accuracy, resource needs escalate sharply with model size and task complexity, presenting scalability challenges.

Implicit Token Inflation. Multi-step reasoning inflates token usage through intermediate thoughts, documents, and feedback. Active learning and chain-based methods produce more tokens by combining various results and exploring paths. These demands grow quickly with complex tasks, deep or broad reasoning, and long outputs, raising both API costs and memory use.

Marginal Decline in Retrieval Efficiency. Dynamic retrieval aids precision but loses efficiency as tasks become complex. While adaptive retrieval helps on simple tasks, complex ones require repeated iterations, increasing overhead. Advanced retrievals improve quality but cost more, and optimizations can’t eliminate extra training and runtime overhead, challenging the balance between accuracy and efficiency.

Cost quadrant diagram of retrieval and reasoning requirements

Toward a Cost Model Framework. Fine-grained cost models are needed to balance accuracy and efficiency, since current single-task metrics miss the joint impact of computation, tokens, and retrieval. Without such models, true tradeoffs in reasoning frameworks go unmeasured, as accuracy gains may be offset by rapid cost and latency increases. Detailed cost modeling is key to realistically assessing reasoning-based approaches.
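As a very rough sketch of what such a cost model could track, the example below accounts for tokens, retrieval calls, and latency per step, and simulates a chain in which every step re-reads all earlier thoughts and retrieved documents. All names and numbers (`StepCost`, `simulate_chain`, the price and latency constants) are placeholder assumptions, not measurements from the review.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StepCost:
    prompt_tokens: int
    output_tokens: int
    retrieval_calls: int


def simulate_chain(
    n_steps: int, base_prompt: int = 200, thought_tokens: int = 100, doc_tokens: int = 300
) -> List[StepCost]:
    """Each step sees the full history, so the prompt grows with every step."""
    steps: List[StepCost] = []
    context = base_prompt
    for _ in range(n_steps):
        steps.append(StepCost(context, thought_tokens, retrieval_calls=1))
        context += thought_tokens + doc_tokens  # history carried into the next step
    return steps


def total_cost(
    steps: List[StepCost],
    price_per_1k_tokens: float = 0.002,     # placeholder price
    latency_per_retrieval_s: float = 0.3,   # placeholder latency
    latency_per_1k_output_s: float = 1.0,   # placeholder decoding speed
) -> Dict[str, float]:
    tokens = sum(s.prompt_tokens + s.output_tokens for s in steps)
    calls = sum(s.retrieval_calls for s in steps)
    output = sum(s.output_tokens for s in steps)
    return {
        "tokens": tokens,
        "dollars": tokens / 1000 * price_per_1k_tokens,
        "latency_s": calls * latency_per_retrieval_s + output / 1000 * latency_per_1k_output_s,
    }


for n in (1, 2, 4):
    print(n, total_cost(simulate_chain(n))["tokens"])
```

Under these assumptions, four steps consume 3,600 tokens against 300 for a single step, a twelvefold increase for four times the steps, which illustrates the superlinear growth and token inflation discussed above.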

Practical Guide on RAG+Reasoning

Finally, to help practitioners and researchers apply the findings of our review, we compiled a Practical Guide with recommendations on tailoring “RAG + Reasoning” solutions to different scenarios.

Practical guide to synergizing RAG and Reasoning

1. Industry Scenario Considerations

We recommend first understanding the nature of the tasks in your domain along the three core stages of an information-seeking pipeline: Query, Retrieval, and Generation. Different industries place different demands on each stage.

Query Stage: How complex or implicit are user queries? In some domains like law or finance, questions can be layered and require a lot of interpretation. For example, in finance, a query might implicitly ask for reasoning about market conditions, so the system needs to parse that and possibly reformulate specific sub-queries. That is, preserving the original intent and nuances through the reasoning process (without misinterpreting) is essential for good results.

Retrieval Stage: Look at the nature of knowledge sources in your domain. Are they static or rapidly changing? Are they homogeneous (all similar type of documents) or heterogeneous (mix of text, databases, etc.)? Domains with frequently updating information (like news or real-time data) demand a RAG system that is adaptable and can incorporate updates seamlessly. This might involve special indexing strategies (e.g., cold-hot tiered indexing where frequently needed info is quickly accessible) and reasoning methods that can decide when to fetch new data.

Generation Stage: What does a “good” answer look like? In casual settings, it can be concise and just correct enough. In critical fields (medical, legal), it needs to be comprehensive, correct, and come with explanations or citations. Also consider latency: do users need answers instantly, or is a short wait acceptable for a better answer? Ensuring the output is trustworthy and clear is especially important in professional settings.

2. Design Principles and Pitfalls

Task-specific design guidelines should be established to define clear operational boundaries. For example:

In predictive tasks, decomposition into sub-tasks and constraint-based generation using structured knowledge (e.g., knowledge graphs) is recommended to avoid unverifiable reasoning jumps.

In dynamic environments, prioritize lightweight retrieval caching and prompt-based adaptation over prolonged fine-tuning to reduce system latency and maintain agility.

In high-risk domains, implement multi-layer verification pipelines and rule-based filters to safeguard decision quality, and strictly prohibit autonomous execution of critical decisions without human oversight.

3. Opportunity Points

Our review also highlights some exciting opportunities for pushing “RAG+Reasoning” further. Here are two examples:

• Cold-hot Tiered Indexing and Dynamic Context Management: Developing smarter retrieval indexing that separates “hot” frequently-used information from “cold” long-tail info, and dynamically managing what context the model sees. This can speed up retrieval and help the reasoning module focus on relevant knowledge by prioritizing likely useful info while still allowing access to the long tail when needed.

• Fine-Grained Layering and Confidence Grading: Creating multi-layer models or workflows where easier subtasks are handled by a simpler (faster) layer and more complex reasoning kicks in only when needed. Also, having the system self-grade its confidence at each reasoning step. If confidence is low, perhaps branch out or retrieve more information. This could optimize the effort spent on each query and ensure that uncertain reasoning is caught and addressed.
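The first opportunity above, cold-hot tiered indexing, can be illustrated with a small two-tier lookup: a bounded "hot" tier in front of a larger "cold" store. The class `TieredIndex` and its least-recently-used promotion policy are assumptions for this sketch; the review does not prescribe a specific policy, and a production index would tier vector shards rather than dictionary entries.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredIndex:
    """Small hot tier with LRU eviction, backed by a full cold store."""

    def __init__(self, cold_store: Dict[str, str], hot_capacity: int = 2) -> None:
        self.cold = cold_store
        self.hot: "OrderedDict[str, str]" = OrderedDict()
        self.hot_capacity = hot_capacity
        self.cold_lookups = 0  # how often the slow tier was consulted

    def get(self, key: str) -> Optional[str]:
        if key in self.hot:
            self.hot.move_to_end(key)  # mark as most recently used
            return self.hot[key]
        self.cold_lookups += 1
        value = self.cold.get(key)
        if value is not None:
            self.hot[key] = value      # promote to the hot tier
            if len(self.hot) > self.hot_capacity:
                self.hot.popitem(last=False)  # evict least recently used
        return value


index = TieredIndex({"a": "A", "b": "B", "c": "C"}, hot_capacity=2)
index.get("a")
index.get("a")  # served from the hot tier, no cold lookup
print(index.cold_lookups, list(index.hot))
```

Long-tail items remain reachable through the cold store, while repeated lookups avoid the slow path.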

Conclusion

The synergy of RAG and reasoning is ushering in a new generation of AI assistants that are both knowledgeable and thoughtful. Our team’s review provides a foundation and a roadmap for developing these next-gen systems, and we are excited about what’s on the horizon. The field is advancing rapidly and shows huge potential in academia and industry alike, from more insightful financial analytics and smarter healthcare support to more reliable intelligent personal assistants.

Looking ahead, there are plenty of research avenues to explore. Promising directions include graph-based knowledge integration (to connect and reason over structured information), coordinated reasoning across modalities (so AIs can seamlessly combine text, images, and more in their thinking), and applying advanced methods like reinforcement learning to further refine how retrieval and reasoning work together. With careful research and engineering, we expect future systems will not only be more powerful in tackling complex tasks, but also more transparent and trustworthy in how they do it. In short, the future looks bright for “RAG+Reasoning”, and we are looking forward to seeing (and contributing to) the innovations to come.

Resources

Full Paper: Synergizing RAG and Reasoning: A Systematic Review (arxiv:2504.15909).

Figures: All of our figures (except for the Sankey charts) were created using Figma. Many thanks to @Yuxi BI for the excellent work.

You are welcome to use the figures from our paper as long as you provide appropriate attribution and citation.

Paper List: As before, for more related resources and a collection of relevant papers, please visit the OpenRAG platform: OpenRAG.
It is a paper knowledge base built on Notion. Our original intention in creating this was to help everyone quickly find and compare relevant papers of interest through multi-dimensional attributes.

Paper List in OpenRAG

Citation

If you found our work useful, a citation would be greatly appreciated. Thanks so much!

@article{gao2025synergizing,
  title={Synergizing RAG and Reasoning: A Systematic Review},
  author={Gao, Yunfan and Xiong, Yun and Zhong, Yijie and Bi, Yuxi and Xue, Ming and Wang, Haofen},
  journal={arXiv preprint arXiv:2504.15909},
  year={2025}
}

Contact Us

Yunfan Gao (First author & blog author) : gaoyunfan1602@gmail.com
Haofen Wang (PI of the lab, Tongji University): carter.whfcarter@gmail.com
Yun Xiong (PI of the lab, Fudan University): yunx@fudan.edu.cn

KaLM-Embedding: Reshaping Multilingual Text Embedding Models

OpenRAG — Mon, 20 Jan 2025 10:22:11 GMT

1. Introduction

In the era of rapid development of large language models (LLMs), retrieval-augmented generation (RAG) has become a key approach for enhancing model performance. However, with the widespread adoption of the RAG framework, text embedding models have increasingly become a bottleneck hindering further progress. Traditional embedding models often perform inadequately in multilingual and multi-domain tasks due to the poor quality of training data. To address this challenge, we introduce the KaLM-Embedding (Knowledge in large Language Models into Embedding) model, which outperforms other models of similar scale in multilingual capabilities, as demonstrated in the MTEB (Massive Text Embedding Benchmark) evaluation.

2. KaLM-Embedding: Innovative Training Methods for Superior Multilingual Models

(1) Data Collection: The Foundation of Model Success

During the development of the KaLM-Embedding model, we meticulously designed a data collection strategy to ensure the model excels in multilingual and multi-domain tasks.

Large-Scale Open Source Datasets: A Combination of Diversity and Quality

Pre-training Data: During the contrastive pre-training phase, large-scale weakly-supervised pairs data is introduced to transform the original language model into an embedding model, enabling it to acquire preliminary text embedding capabilities, which lays the foundation for subsequent fine-tuning. We utilized title-body pairs from various documents as well as symmetric translation sentence pairs, supplemented with a portion of large-scale supervised question-answer datasets to ensure the diversity and coverage of the data.
Fine-tuning Data: During the fine-tuning phase, we introduced over 70 high-quality datasets from different sources. These datasets are diverse and of high quality, providing ideal conditions for the model’s fine-tuning despite their smaller size. We also incorporated multiple classification and clustering datasets, treating each (sentence, category label) pair as a training instance. Additionally, we sampled hard negative examples from all classification datasets to mitigate the issue of insufficient label categories in some datasets. For each specific dataset, we conducted meticulous processing, such as filtering out overly short documents or excluding low-quality parts based on metadata.
Data Purity: To ensure data purity, we only used the training sets of all datasets, explicitly excluding any test sets. For datasets without separate training and test sets, we first filtered out test set samples included in MTEB and then processed the remaining data. This strategy ensures that all examples appearing in MTEB evaluations were not seen by the model during training.

Despite the fine-tuning data being primarily in Chinese and English, with only a small amount of multilingual data, the model’s performance in other languages remains satisfactory, indicating that the multilingual advantages of pre-trained LLMs can be successfully transferred to embedding models.

Persona-Based Synthetic Data: Enhancing Data Diversity and Domain Coverage

We generated 550,000 high-quality synthetic data entries using Qwen2-72B-Instruct, covering six task types and 40,000 unique instructions. To enhance data diversity, we used random personas from Persona Hub as system prompts, effectively increasing the domain diversity of the generated data. Since four retrieval tasks require instruction generation before data generation, we only introduced personas during the instruction generation phase to avoid persona conflicts between the two stages.

(2) Training Strategies: Key to Optimizing Model Performance

Ranking Consistency Filtering: Precise Sample Selection

In addition to using in-batch negative samples, we also retrieved hard negative samples from the dataset’s corpus. However, some queries may correspond to multiple correct documents or answers, or be too broad, leading to associations with multiple documents despite low relevance. These situations can introduce false negative samples, adversely affecting model optimization.

To address this issue, we adopted the ranking consistency filtering method (top-k filtering), ranking the similarity of queries with their original positive sample data across the entire document corpus and filtering out samples not ranked in the top k. This process is conducted simultaneously with hard negative sample mining to avoid redundant calculations.
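
The filtering step above can be sketched as follows. This is a minimal illustration, assuming precomputed embeddings stored as plain Python lists and cosine similarity as the scoring function; the function names and toy vectors are hypothetical and are not taken from the released training code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def positive_rank(query_vec, positive_idx, corpus_vecs):
    """1-based rank of the labeled positive among all corpus documents."""
    pos_score = cosine(query_vec, corpus_vecs[positive_idx])
    higher = sum(
        1
        for i, doc_vec in enumerate(corpus_vecs)
        if i != positive_idx and cosine(query_vec, doc_vec) > pos_score
    )
    return higher + 1

def ranking_consistency_filter(samples, corpus_vecs, k):
    """Keep (query_vec, positive_idx) samples whose positive ranks in the top k."""
    return [s for s in samples if positive_rank(s[0], s[1], corpus_vecs) <= k]

# Toy data (hypothetical): the second sample's labeled positive ranks last,
# so it is treated as an unreliable training pair and dropped.
corpus = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
samples = [([1.0, 0.1], 0), ([1.0, 0.0], 1)]
kept = ranking_consistency_filter(samples, corpus, k=1)
```

In practice the same similarity scores could be reused to pick hard negatives, which is why the two steps are run together.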

Semi-Homogeneous Task Batching: Balancing Difficulty and Risk

Previous research adopted the homogeneous task batching method, where each batch contains samples from a single task to increase the difficulty of in-batch negative samples. However, this also introduces the risk of containing too many false negative samples.

We introduced the concept of semi-homogeneous task batching: first construct a complete homogeneous task batch, then sample, mix, and randomly reassign a specified proportion of samples back to the original batches, balancing the difficulty of negative samples against the risk of false negatives. This method was not used in our latest model, but it provided a controllable means of analysis.
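
One way to read this procedure is sketched below. It is an assumption-laden illustration rather than the authors' implementation: the function name, the `mix_ratio` parameter, and the choice to drop incomplete trailing batches are all hypothetical.

```python
import random

def semi_homogeneous_batches(task_to_samples, batch_size, mix_ratio, seed=0):
    """Build homogeneous batches, then shuffle a fraction of samples across them."""
    rng = random.Random(seed)

    # 1) Homogeneous batches: each batch holds samples from a single task.
    #    Incomplete trailing batches are dropped for simplicity.
    batches = []
    for samples in task_to_samples.values():
        for start in range(0, len(samples) - batch_size + 1, batch_size):
            batches.append(list(samples[start:start + batch_size]))

    # 2) Pull a fixed proportion of positions out of every batch into a pool.
    n_mix = int(batch_size * mix_ratio)
    pool, slots = [], []
    for b_idx, batch in enumerate(batches):
        for pos in rng.sample(range(batch_size), n_mix):
            pool.append(batch[pos])
            slots.append((b_idx, pos))

    # 3) Shuffle the pool and reassign it to the vacated positions.
    rng.shuffle(pool)
    for (b_idx, pos), sample in zip(slots, pool):
        batches[b_idx][pos] = sample
    return batches

tasks = {
    "qa": [("qa", i) for i in range(4)],
    "sts": [("sts", i) for i in range(4)],
}
mixed = semi_homogeneous_batches(tasks, batch_size=4, mix_ratio=0.5)
```

With `mix_ratio=0` the result is fully homogeneous; raising it toward 1 approaches ordinary random batching.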

Matryoshka Representation Learning: Achieving Flexible Dimensional Embeddings

We employ Matryoshka Representation Learning (MRL) for training, setting different vector dimensions such as 896, 512, 256, 128, and 64 to enable flexible selection of encoding dimensions. In scenarios where richer semantic representation is required and high performance is pursued, larger-dimensional vectors can be chosen; whereas in cases where retrieval efficiency is prioritized or the text semantics are simpler, smaller-dimensional vectors are more practical.
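
At inference time, a Matryoshka-style embedding is commonly shortened by keeping its leading components and re-normalizing. The sketch below assumes that convention applies here; `truncate_embedding` is a hypothetical helper, not part of any released API.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

# Toy 4-dimensional vector standing in for a full-size embedding.
full = [3.0, 4.0, 12.0, 0.5]
small = truncate_embedding(full, 2)
```

Smaller vectors reduce index size and similarity-search cost, at the price of some semantic detail.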

Task Instructions: Enhancing Model Understanding and Generalization

Task instructions can significantly improve the performance of embedding models by reducing ambiguities between different tasks in the embedding space. During training, we prepend instruction prefixes to queries from open-source data and adopt a similar setup during testing. For synthetic data, we retained the originally generated instructions, covering various retrieval tasks. In practical applications, it is recommended to customize task instructions based on specific scenarios and requirements. Given that our model has been trained on a large number of synthetic instructions, it demonstrates strong capabilities in understanding and generalizing instructions.
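
Prepending an instruction to a query can be as simple as the sketch below. The template string is an assumption made for illustration; check the model card for the exact format the released model expects.

```python
def with_instruction(query, instruction,
                     template="Instruct: {instruction}\nQuery: {query}"):
    """Prefix a query with a task instruction (hypothetical template)."""
    return template.format(instruction=instruction, query=query)

prompt = with_instruction(
    "How does RAG reduce hallucination?",
    "Given a question, retrieve passages that answer it",
)
```

Only queries receive the prefix in this sketch; documents are embedded unchanged.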

3. Experimental Results: Robust Multilingual Performance

We selected MTEB (Massive Text Embedding Benchmark) as the primary evaluation and analysis dataset due to its coverage of diverse task types and a wide range of datasets. Although our main optimization targets are Chinese (zh) and English (en), we also evaluated the model on French (fr) and Polish (pl). Our KaLM-embedding-mini-instruct model demonstrated significantly superior overall performance across multiple languages compared to other models. However, its performance on Polish was relatively weaker, which may be attributed to the lower proportion of Polish data in the training set, particularly the language distribution bias in synthetic data. The specific results are as follows:

We conducted ablation experiments on training strategies and data selection. Due to the small weight assigned to low-dimensional Matryoshka embeddings during training, the impact of Matryoshka representation learning on the final results was minimal. The influence of task instructions was particularly significant, especially considering the mix of various types of training data we used. Data selection had a more pronounced improvement effect on English than on Chinese, which may be due to the fact that the English evaluation included more out-of-domain data, making the enhancement of generalization ability through data selection more noticeable in English. Additionally, in contrast to other models, our pretraining had a smaller impact on the final results, which may be attributed to the broader and cleaner nature of our fine-tuning data. The specific results are as follows:

4. Conclusion and Future Directions

The success of the KaLM-Embedding model demonstrates the crucial role of high-quality data in enhancing model performance. The research team has open-sourced the model for use by researchers and developers, aiming to promote the development of multilingual embedding technologies.

Future research directions:

Long-text Embedding Representation: Due to the complexity and diversity of long-text information, a single vector representation may lead to information loss and perform worse than sparse representation methods. Effective representation of long texts may require the use of multiple vectors or dynamic-dimensional vectors. As the context length in language models continues to increase, how to train an effective long-text embedding model remains a challenge to address.
Model Merging: Model merging, as an application of multi-task learning, faces conflicts and differences between tasks. In our experimental experience, the performance of merged models may significantly decline, and task-type differences are the primary reason for poor merging results. How to effectively integrate embeddings in a multi-task model is an area worth exploring.
Model Architecture Innovation: The impact of different base models and pooling methods on model performance requires further in-depth research. In our experience, different methods perform consistently and the practical differences between them are not significant. High-quality data and training strategies remain key to pushing the performance limits of models, and innovative model architecture design will be an important research direction in the future.
Adaptive Instructions: Current task instructions still require manual design and selection based on specific tasks. Whether it is possible to generate instructions adaptively and automatically select task-relevant instructions during inference to optimize performance is a direction worth further investigation. Additionally, in embedding models, it may be valuable to explore the use of continuous, non-discrete vectors as instructions for different tasks, which could further enhance task adaptability and performance.

Open Source Links:

OpenRAG Base: Your individual RAG Knowledge Base

OpenRAG — Wed, 03 Apr 2024 10:06:18 GMT

We’re officially launching the RAG Knowledge Base: OpenRAG Base 🎉✨

Check the website!

What is OpenRAG Base?

OpenRAG Base is a Notion repository of RAG knowledge built by collecting, organizing, and aggregating publicly available resources. It may be the most comprehensive source of RAG information at present 🧐. It includes:

Academic Paper
Cutting Edge Readings
Benchmark and Evaluation
RAG Scholar and Institution
Downstream tasks and Dataset
Toolkit
… More content coming soon (e.g., seminars, baselines, cookbooks).

Although more focused on academic research, whether you are just getting started with RAG, are a RAG-related researcher, or are a practitioner, I believe you can benefit from it.

Based on this repository, you have a highly flexible and dynamically updated survey that supports highly customizable analysis and summarization. For example, you can see which papers have open-sourced code, which conferences the papers were published in, or which papers used the same dataset, so that you can quickly compare them.

💡 With OpenRAG Base, you will have a RAG knowledge base tailored exclusively for you.

We will continue to update this project.

Why do this?

The conventional way to organize survey papers is to list the relevant papers in a README on GitHub (we used to do this as well).

However, we find this to be a relatively inefficient and unintuitive method, for the following reasons:

The content is fixed, and users cannot quickly find the corresponding content according to their own needs.
The waterfall display makes the page too long and looks very redundant.
It is relatively difficult to update.
User comments and other interactive methods are more cumbersome.
Only data collection, lacking analysis and summary.
Each user sees the same content, unable to provide personalized services.

💡 We hope to have a more flexible and intuitive platform that helps users analyze and grasp technological developments, rather than simply stacking up materials.

So… ultimately…. we chose Notion.

How to Use?

If you haven’t used Notion before, that’s okay, it’s very easy to operate (See Official Tutorial).

The entire knowledge base consists of multiple Notion Databases, with linked relations between the Databases.

Click on a specific Database in the Database List on the homepage to view detailed content.

Below, we will introduce each Database individually:

💡 Note: When browsing online, you can only view static pages and cannot make modifications. To make changes, you need to duplicate the content into your own Notion workspace.

1. Click “Duplicate” in the top right corner of the homepage (you need to be logged in to Notion).

2. You can duplicate the entire project or a single page (when duplicating a single page, data associated with other databases will not be displayed).

Copy the entire project into your Notion, where you can make any modifications to it. Copying a single page again will synchronize that page with ours.

Academic_Paper

This is the main page, which contains academic papers on RAG under the context of large models. We will use this Database as a case study to detail several common methods of using OpenRAG Base.

1. Overview

Each row represents a paper on RAG, and we have designed over ten properties to help with analysis and summarization, with the paper’s abbreviation serving as the primary key. (See the concept of Notion Database Property)

The main page of Academic Papers.

2. Common operations

The control area is located in the upper right corner, where you can filter, sort, and search fields.
Clicking on ‘···’ opens more controls, where you can select the displayed fields (properties) or group them. The “All” view will display all fields by default, which may contain a lot of redundancy. We will also provide targeted displays in other views, and you can create your own views as well.

The control area is located in the upper right corner and common operations include: querying, filtering, sorting, displaying fields, and grouping.

3. Pages

This is one of Notion’s unique features where each piece of data can be a Page. The paper’s abbreviation serves as the primary key, and hovering the mouse next to it will display “Open”. Clicking on it will expand the detailed page of the paper on the right. In the upper right corner, you can switch the sidebar to full screen mode for quick scrolling through data.

The detailed page includes all the attributes of that data.
It functions as a standalone page where you can record content, images etc.

Click on “Open” next to the paper abbreviation on the right to open the detailed page.

In addition to displaying the attributes, for the convenience of readers to quickly understand the paper’s content, we have three additional sections on the detailed page:

Abstract and author information from the original paper
Important figures and charts from the paper, typically process or model architecture diagrams
Quick reading (In Chinese)

The paper detail page presents a quick interpretation of the original paper.

The quick reading guide uses Papers.cool and Kimi Chat (an excellent LLM tool for assisting in reading papers 💪).

The address for Papers.cool is right below the paper title, for example: https://papers.cool/arxiv/2402.07630

Papers.cool is also very simple to use: you just need to append the arXiv ID to the URL https://papers.cool/arxiv/.

You can further engage with Kimi Chat by accessing it from the bottom of the webpage.

On the Papers.cool page, click on [PDF] to open the original paper and click on [KIMI] to generate a quick reading interpretation.

If you want Kimi to generate an English interpretation, press Ctrl/Command + F, select Settings, and set the Kimi Language to English. Don’t forget to save; after refreshing, Kimi will output in English.

Set the language of the quick reading guide to English.

4. Views

Below the title in the top left corner is the view area, displaying the current list of views.

You can save your custom settings as a view to make the table more targeted and easier to access later.

In addition to tables, views also support formats like boards, timelines, galleries, and more.

For example, if you want to specifically view tasks and datasets in the RAG paper, you can filter out these fields and set corresponding filters. These settings will be displayed below the view.

An example of a view for RAG Tasks and Datasets:

Another commonly used view is the Board view, often used in conjunction with grouping.

For example, if you want to see which papers are related to pre-training, fine-tuning, and inference, you can create a Board view and use “Aug_Stage” as the grouping criteria. You can also select the fields to display, apply filters, and set sorting preferences.

An example of a Board view grouped by the Augmentation Stage

5. Relation

“Relation” is another important concept in Notion. It is represented by an arrow in the properties and can be understood as an attribute that links two databases together. This attribute acts as a regular property in one database and as a primary key in another database.

See Detailed Information for Notion “Relation”.

Official Example for Notion “Relation”

In the OpenRAG Base, we have set up multiple Relation properties such as Scholar, Institution, Dataset, etc. This allows us to conduct more targeted analysis based on these properties. Each property corresponds to a database where it serves as the primary key for linking related information.

“Relation” examples in OpenRAG Base

In the example above, there are three Relations. By clicking on the table header and entering “Edit Property”, you can see the right sidebar where the “Preview” section displays which two databases are linked by this Relation.

In the specific example, the Relation is bidirectional, meaning the “Dataset property” appears in the Academic_paper database and links to the Dataset table as a primary key. Conversely, the primary key “Paper” in the Academic_paper database will automatically link to the Dataset table.

Now, let’s open the Dataset Database on the right to have a more visual understanding. You will see that a “Paper” Relation property automatically appears in the Dataset database.

💡 This means that when you add a dataset to a paper in the Academic_paper database, the “Paper” property in the Dataset table will automatically include that paper. This bidirectional relation ensures that the information stays connected and updated across databases.

In a bidirectional relation, the Dataset table will automatically link to the primary key (Paper) in the associated table (Academic_paper) and update accordingly. This ensures that the data remains synchronized between the two databases and reflects any changes made in either database.

Below we will briefly introduce other databases, which can be used in the same way as Academic_Paper.

Downstream Task and Dataset

These two databases summarize the downstream tasks and datasets used in the RAG domain. The datasets have been presented in the previous section, and the downstream tasks are summarized as follows:

Downstream Task in Gallery View

You can also refer to the summary of downstream tasks and datasets in our survey.

“Retrieval-Augmented Generation for Large Language Models: A Survey”

RAG Readings

In addition to academic papers, there are many other channels that can help us access cutting-edge information about RAG.

We have selected some important reading materials related to RAG and placed them in this database, including Medium articles, WeChat articles (in Chinese), Zhihu articles (in Chinese), official blogs of technical frameworks (such as LangChain, LlamaIndex, and Neo4j), and YouTube videos.

Since these selections are subjective, they may be influenced by personal impressions. If you come across excellent articles, you can also inform me through comments.

RAG Readings from different platforms

RAG Scholar and Institution

Would you like to know which individuals and organizations are researching RAG? The Scholar and Institution databases summarize the main researchers and their institutions in the field of RAG based on papers and blogs. For the purpose of statistical analysis, for a single article, we typically only count the first author, corresponding author, or laboratory head.

The following image is our RAG Scholar Gallery, displaying the institutions of relevant authors and their representative works. Red icons represent researchers in academia, while green icons represent practitioners in the industry. It can be seen that RAG is indeed a direction of great interest in both academia and industry.

The “High” label only represents a subjective evaluation of the researcher’s relevance to the research direction and RAG (usually a researcher’s work involves multiple directions), for reference only, and does not imply any other meaning.

RAG Scholar Gallery

Evaluation, Benchmark and Toolkit

Evaluation has always been a very important step for RAG. Here we summarize the current evaluation tools and benchmarks, as well as the technology frameworks that can be used when developing a RAG system, such as LangChain and LlamaIndex.

Evaluation method and Benchmark In RAG

…

More content awaits your exploration.

What else?

Comment!

You can comment and engage in friendly discussions on any topic of interest to you, such as Database or Pages details. Please maintain good social etiquette and refrain from discussing topics unrelated to RAG.

Duplicate!

💡 Anyone can duplicate the entire repository into their own Notion workspace and make more flexible modifications. Start building your own exclusive RAG knowledge base from here!

Contribute!

If you are interested in helping to maintain this project, please email us!

Who are we?

This project is jointly initiated by

Haofen Wang (王昊奋), Meng Wang (王萌), Tongji University
Yun Xiong (熊贇), Shanghai Key Laboratory of Data Science, Fudan University

Contact Us

For questions and suggestions about this project, please contact:

Yunfan Gao (高云帆), Shanghai Research Institute for Intelligent Autonomous Systems (Tongji University) Email: gaoyunfan1602@gmail.com

For collaboration and other related matters, please contact:

Haofen Wang (王昊奋), Tongji University Email: haofen.wang@tongji.edu.cn
Meng Wang (王萌), Tongji University Email: mengwangtj@tongji.edu.cn
Yun Xiong (熊贇), Fudan University Email: yunx@fudan.edu.cn

Others

Our Survey: Retrieval-Augmented Generation for Large Language Models: A Survey

Our LLMs Evaluation Project: AI-Ceping!

Modular RAG and RAG Flow: Part II

OpenRAG — Mon, 29 Jan 2024 05:32:46 GMT

How to design your own RAG Flow?

In Part I, we primarily discussed the three-tier structure of modular RAG (Module Type - Module - Operator) and briefly mentioned the concept of RAG Flow.

Modular RAG and RAG Flow: Part I

Once Modules and Operators are defined, they let us view various RAG methods from a flow perspective. Each RAG method can be expressed as an arrangement of operators.

Framework of Modular RAG

So, under the paradigm of modular RAG, how should we design our RAG system?

In Part II, we will delve into the typical RAG Flow patterns, specific RAG Flow implementations, and the best industry cases.

Typical RAG Flow Pattern and Implementation

First, let’s explore the prominent patterns for RAG flow, along with the specific flows under each template, illustrating how different modules and operators are orchestrated.

In the context of RAG Flow, we will delineate three distinct flows for the fine-tuning stage and four flows for the inference stage.

Tuning Stage

The three flows in the tuning stage are Retriever Fine-tuning, Generator Fine-tuning, and Dual Fine-tuning.

Retriever FT

In the RAG Flow, common methods for fine-tuning the retriever include:

Direct fine-tuning of the retriever. Constructing a specialized dataset for retrieval and fine-tuning the dense retriever. For example, using open-source retrieval datasets or constructing one based on your domain-specific data.
Adding trainable Adapter modules. Sometimes, direct fine-tuning of API-based embedding models (e.g., OpenAI Ada-002 and Cohere) is not feasible. Incorporating an Adapter module can enhance the representation of your data. Additionally, the adapter module facilitates better alignment with downstream tasks, whether for task-specific (e.g., PCRA) or general purposes (e.g., AAR).
LM-supervised Retrieval (LSR). Fine-tuning the retriever based on the results generated by the LLM.
LLM Reward RL. This still uses the LLM output as the supervisory signal, but employs reinforcement learning to align the retriever with the generator. The whole retrieval process is decomposed into a generative Markov chain.

Typical RAG Flow Pattern for Retriever FT

Generator FT

The primary methods for fine-tuning a generator in RAG Flow include:

Direct fine-tuning. Fine-tuning on an external dataset can supplement the generator with additional knowledge. Another benefit is the ability to customize input and output formats. By setting the Q&A format, the LLM can understand specific data formats and produce output according to instructions.
GPT-4 distillation. When using on-premise deployment of open-source models, a simple and effective method is to use GPT-4 to batch construct fine-tuning data to enhance the capabilities of the open-source model.
Reinforcement Learning from LLM/Human Feedback. Reinforcement learning based on feedback from the final generated answers. In addition to using human evaluations, GPT-4 can also serve as an evaluative judge.

Typical RAG Flow Pattern for Generator FT

Dual FT

Fine-tuning the retriever and the generator jointly is a distinctive feature of RAG systems. It is important to note that the emphasis of system-level fine-tuning is on the coordination between the retriever and the generator. Fine-tuning the retriever and the generator separately is a combination of the previous two flows rather than part of Dual FT.

Typical RAG Flow Pattern for Dual FT

An exemplary implementation is RA-DIT, which fine-tunes both the LLM and the retriever. The LM-ft component updates the LLM to maximize the likelihood of the correct answer given the retrieval-augmented instructions while the R-ft component updates the retriever to minimize the KL-Divergence between the retriever score distribution and the LLM preference.
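
The retriever objective described above can be illustrated numerically. This is a sketch under stated assumptions: both distributions are obtained with a softmax over toy scores, the KL direction shown is one plausible reading of the description, and none of the names come from the RA-DIT code.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy values for three retrieved chunks (hypothetical numbers).
retriever_scores = [2.0, 1.0, 0.1]       # query-chunk similarity scores
lm_log_likelihoods = [-0.5, -2.0, -3.0]  # log p_LM(answer | chunk + prompt)

p_retriever = softmax(retriever_scores)
p_lm = softmax(lm_log_likelihoods)       # the LLM's preference over chunks
loss = kl_divergence(p_retriever, p_lm)  # quantity the retriever update reduces
```

The loss reaches zero only when the retriever ranks chunks exactly as the LLM's likelihoods prefer.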

The framework employs an on-premises Llama as the generator and a state-of-the-art dual-encoder based dense retriever, DRAGON+, as the retriever.

Following REPLUG, RA-DIT retrieves relevant text chunks based on the language model prompt. Each retrieved chunk is prepended to the prompt, and the predictions from multiple chunks are computed in parallel and ensembled by weighted probability to produce the final output.

RAG Flow in RA-DIT

Inference Stage

In the inference stage, we have distilled four typical RAG Flow patterns.

Sequential

The sequential structure of the RAG Flow organizes the modules and operators of RAG in a linear pipeline, as depicted in the following diagram. If it includes both Pre-Retrieval and Post-Retrieval module types, it represents the typical Advanced RAG paradigm; otherwise, it embodies the typical Naive RAG paradigm.

Sequential RAG Flow Pattern

The most widely used RAG pipeline today is the sequential one, which commonly includes Query Rewrite or HyDE before retrieval and a Rerank operator after retrieval, as in the case of QAnything.

The most commonly used sequential RAG Flow
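
A sequential flow of this kind can be sketched with toy stand-ins for each operator. Every function below is a hypothetical placeholder: a real system would use an LLM-based rewriter, a vector or BM25 retriever, a trained reranker, and an LLM generator.

```python
def rewrite_query(query):
    """Pre-retrieval: toy rewrite that normalizes whitespace and case."""
    return " ".join(query.lower().split())

def retrieve(query, corpus, top_k=3):
    """Retrieval: score documents by the number of words shared with the query."""
    q_words = set(query.split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in corpus]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def rerank(query, docs, top_n=2):
    """Post-retrieval: toy reranker that prefers shorter passages."""
    return sorted(docs, key=len)[:top_n]

def generate(query, docs):
    """Generation placeholder: a real system would call an LLM here."""
    return f"Answer to '{query}' using {len(docs)} passage(s)."

def sequential_rag(query, corpus):
    """Run the operators in a fixed linear order."""
    q = rewrite_query(query)
    return generate(q, rerank(q, retrieve(q, corpus)))

corpus = [
    "RAG combines retrieval and generation",
    "BM25 is a sparse retriever",
    "Cats sleep a lot",
]
answer = sequential_rag("  What is RAG  ", corpus)
```

Dropping `rewrite_query` and `rerank` from the chain yields the Naive RAG variant of the same pipeline.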

Rewrite-Retrieve-Read (RRR) is also a typical sequential structure. The Query Rewrite module is a smaller trainable language model, and in the context of reinforcement learning, the optimization of the rewriter is formalized as a Markov decision process, with the final output of the LLM serving as the reward. The retriever utilizes a sparse encoding model, BM25.

RAG Flow in RRR

Conditional

The RAG Flow with conditional structure involves selecting different RAG pathways based on different conditions. Typically, this is accomplished through a Routing module that determines the route based on query keywords or semantics.

Different routes are chosen based on the type of question, directing to different flows for specific scenarios. For instance, when users inquire about serious issues, political matters, or entertainment topics, the tolerance for answers from large models varies. Different routing branches usually differ in retrieval sources, retrieval processes, configurations, models, and prompts.

Conditional RAG Flow Pattern

A classic implementation of Conditional RAG is the Semantic Router.
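
A keyword-based routing module can be sketched as below. This is a simplified illustration, not the Semantic Router library's API; the route names, keywords, and per-route settings are all hypothetical. A semantic variant would compare query embeddings against example utterances instead of matching keywords.

```python
import re

# Hypothetical routes: each maps to trigger keywords.
ROUTES = {
    "politics": {"election", "parliament", "policy"},
    "entertainment": {"movie", "concert", "celebrity"},
}

# Hypothetical per-route settings (retrieval source, prompt style).
PIPELINES = {
    "politics": {"source": "vetted_news", "prompt": "cautious"},
    "entertainment": {"source": "web", "prompt": "casual"},
    "general": {"source": "default_index", "prompt": "neutral"},
}

def route(query, routes=ROUTES, default="general"):
    """Return the first route whose keywords appear in the query."""
    words = set(re.findall(r"\w+", query.lower()))
    for name, keywords in routes.items():
        if words & keywords:
            return name
    return default

config = PIPELINES[route("Who won the election?")]
```

Only one branch runs per query, which is what separates this pattern from the branching structure.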

Branching

The RAG Flow with a branching structure differs from the conditional approach in that it involves multiple parallel branches, as opposed to selecting one branch from multiple options in the conditional approach. Structurally, it can be categorized into two types:

Pre-Retrieval Branching (Multi-Query, Parallel Retrieval). This involves expanding the original query to obtain multiple sub-queries, and then conducting separate retrieval for each sub-query. After retrieval, the approach allows for immediate answer generation based on the sub-questions and the corresponding retrieval content. Alternatively, it may involve using only the expanded retrieval content and merging it into a unified context for generation.
Post-Retrieval Branching (Single Query, Parallel Generation). This approach maintains the original query and retrieves multiple document chunks. Subsequently, it concurrently uses the original query and each document chunk for generation, and finally merges the generated results together.

Branching RAG Flow Pattern
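
Pre-retrieval branching with a merged context can be sketched as follows. The query expansion here is a fixed toy template and the retriever is a word-overlap scorer; both are hypothetical stand-ins, and the branches run one after another rather than in parallel.

```python
def expand_query(query):
    """Toy expansion: the original query plus two templated variants."""
    return [query, f"definition of {query}", f"examples of {query}"]

def retrieve(query, corpus, top_k=2):
    """Score documents by the number of words shared with the query."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in corpus]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def merge_unique(result_lists):
    """Merge branch results into one context, dropping duplicates in order."""
    seen, merged = set(), []
    for results in result_lists:
        for doc in results:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def pre_retrieval_branching(query, corpus):
    """One retrieval branch per sub-query, then a unified context."""
    branches = [retrieve(q, corpus) for q in expand_query(query)]
    return merge_unique(branches)

corpus = [
    "rag definition and overview",
    "examples of rag systems",
    "unrelated text",
]
context = pre_retrieval_branching("rag", corpus)
```

The alternative described above, generating an answer per sub-query, would replace `merge_unique` with one generation call per branch followed by an answer-merging step.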

REPLUG embodies a classic post-retrieval branching structure, in which the probability of each token is predicted for each branch. The branches are aggregated through a weighted probability ensemble, and the final generation result is fed back to fine-tune the retriever, Contriever.

RAG Flow in REPLUG
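
The weighted ensemble over branches can be illustrated for a single decoding step. This sketch assumes each branch yields a next-token distribution as a dictionary and that the branch weights are normalized retrieval scores; the function name and toy numbers are hypothetical.

```python
def ensemble_next_token(per_chunk_probs, chunk_scores):
    """Mix per-branch next-token distributions using normalized chunk weights."""
    total = sum(chunk_scores)
    weights = [score / total for score in chunk_scores]
    vocab = set().union(*per_chunk_probs)
    return {
        token: sum(w * probs.get(token, 0.0)
                   for w, probs in zip(weights, per_chunk_probs))
        for token in vocab
    }

# Two branches, each conditioned on a different retrieved chunk.
per_chunk = [
    {"Paris": 0.9, "Lyon": 0.1},
    {"Paris": 0.4, "Lyon": 0.6},
]
mixed = ensemble_next_token(per_chunk, chunk_scores=[0.75, 0.25])
```

Because the branches are independent until the mixing step, they can be computed in parallel.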

Loop

The RAG Flow with a loop structure, an important characteristic of Modular RAG, involves interdependent retrieval and reasoning steps. It typically includes a Judge module for flow control. This can be further categorized into iterative, recursive, and adaptive (active) retrieval approaches.

Loop RAG Flow Pattern

Iterative Retrieval

At times, a single retrieval and generation may not effectively address complex questions requiring extensive knowledge. Therefore, an iterative approach can be used in RAG, typically involving a fixed number of iterations for retrieval.

An exemplary case of iterative retrieval is ITER-RETGEN, which iterates retrieval-augmented generation and generation-augmented retrieval. Retrieval-augmented generation outputs a response to a task input based on all retrieved knowledge. In each iteration, ITER-RETGEN leverages the model output from the previous iteration as a specific context to help retrieve more relevant knowledge. Termination of the loop is determined by a predefined number of iterations.

RAG Flow in ITER-RETGEN
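
The iterative loop can be sketched as below, assuming the retriever and generator are passed in as callables. The function signature and the way the previous output is appended to the query are simplifications made for illustration, not the ITER-RETGEN implementation.

```python
def iter_retgen(question, corpus, retrieve, generate, iterations=2):
    """Alternate retrieval and generation for a fixed number of rounds."""
    output = ""
    for _ in range(iterations):
        # Generation-augmented retrieval: the previous output enriches the query.
        retrieval_query = f"{question} {output}".strip()
        docs = retrieve(retrieval_query, corpus)
        # Retrieval-augmented generation: answer from the retrieved documents.
        output = generate(question, docs)
    return output

# Toy stand-ins (hypothetical): record queries and return fixed drafts.
seen_queries = []

def toy_retrieve(query, corpus):
    seen_queries.append(query)
    return corpus[:1]

def toy_generate(question, docs):
    return f"draft{len(seen_queries)}"

final = iter_retgen("q", ["doc"], toy_retrieve, toy_generate, iterations=3)
```

The loop has no Judge module: it stops after the preset number of rounds regardless of answer quality.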

Recursive Retrieval

The characteristic feature of recursive retrieval, as opposed to iterative retrieval, is its clear dependency on the previous step and its continuous deepening of retrieval. Typically, there is a termination mechanism as an exit condition for recursive retrieval. In RAG systems, recursive retrieval usually involves Query Transformation, relying on the newly rewritten query for each retrieval.

RAG Flow in ToC

A typical implementation of recursive retrieval, such as ToC, involves recursively executing RAC (Retrieval-Augmented Clarification) to gradually insert sub-nodes into the clarification tree from the initial ambiguous question (AQ). At each expansion step, paragraph re-ranking is performed based on the current query to generate a disambiguated question (DQ). The exploration of the tree concludes upon reaching the maximum number of valid nodes or the maximum depth. Once the clarification tree is constructed, ToC gathers all valid nodes and generates a comprehensive long-text answer to address AQ.

Adaptive (Active) Retrieval

With the advancement of RAG, there has been a gradual shift beyond passive retrieval to the emergence of adaptive retrieval, also known as proactive retrieval, which is partly attributed to the powerful capabilities of LLM. This shares a core concept with LLM Agent.

RAG systems can actively determine the timing of retrieval and decide when to conclude the entire process and produce the final result. Based on the criteria for judgment, this can be further categorized into Prompt-based and Tuning-based approaches.

Prompt-based. The prompt-based approach controls the flow by using prompt engineering to direct the LLM. A typical implementation example is FLARE. Its core concept is that the language model should only retrieve when essential knowledge is lacking, to avoid unnecessary or inappropriate retrieval in a retrieval-augmented LM. FLARE iteratively generates the next provisional sentence and checks for the presence of low-probability tokens. If any are found, the system retrieves relevant documents and regenerates the sentence.

RAG Flow in FLARE
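
The confidence check at the heart of this loop can be sketched as follows. The threshold value, the callable signatures, and the use of the whole provisional sentence as the retrieval query are assumptions made for illustration.

```python
def needs_retrieval(token_probs, threshold=0.4):
    """True if any token in the provisional sentence is low-confidence."""
    return any(p < threshold for p in token_probs)

def flare_step(generate_sentence, retrieve, context, threshold=0.4):
    """Generate one sentence, retrieving and regenerating only when unsure."""
    sentence, probs = generate_sentence(context, docs=None)
    if needs_retrieval(probs, threshold):
        docs = retrieve(sentence)  # provisional sentence used as the query
        sentence, probs = generate_sentence(context, docs=docs)
    return sentence

# Toy stand-ins (hypothetical) for the generator and retriever.
def toy_generate(context, docs=None):
    if docs is None:
        return "Paris was founded in 1900.", [0.9, 0.8, 0.2]
    return "Paris was founded in antiquity.", [0.9, 0.9, 0.9]

retrieval_queries = []

def toy_retrieve(query):
    retrieval_queries.append(query)
    return ["Paris has been inhabited since antiquity."]

sentence = flare_step(toy_generate, toy_retrieve, context="Tell me about Paris.")
```

When every token clears the threshold, the provisional sentence is accepted and no retrieval call is made.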

Tuning-based. The tuning-based approach involves fine-tuning the LLM to generate special tokens, thereby triggering retrieval or generation. This concept can be traced back to Toolformer, where the generation of specific content assists in invoking tools. In RAG systems, this approach is used to control both retrieval and generation steps. A typical case is Self-RAG. Specifically:

1. Given an input prompt and the preceding generation result, the model first predicts the special “Retrieve” token, which indicates whether passage retrieval would help the continued generation.

2. If retrieval is needed, the model generates a critique token to evaluate the retrieved passage’s relevance, the next response segment, and a critique token to evaluate whether the information in the response segment is supported by the passage.

3. Finally, a critique token evaluates the overall utility of the response, and the optimal result is selected as the final output.

RAG Flow in Self-RAG

Best Industry Case

In the preceding sections, we have delved into various research papers, with their distinctive feature being an emphasis on addressing specific details and intricacies. RAG, on the other hand, stands out as a technology that shines brightly in the industrial domain, enabling LLM to be applied across a wide range of task scenarios. This chapter will shed light on several industry-leading RAG practices from the perspective of RAG Flow, offering insights into how to effectively combine and construct the flow of RAG in real-world application scenarios.

OpenAI

The insights from OpenAI’s Demo Day presentation do not fully represent the actual operations of OpenAI.

In their efforts to enhance the success of RAG, the OpenAI team started with a 45% accuracy rate and experimented with various methods, identifying which methods were ultimately adopted for production. They explored hypothetical document embeddings (HyDE), fine-tuning embeddings, and other methods, but the results were not satisfactory. By experimenting with different-sized chunks of information and embedding different content sections, they were able to increase the accuracy to 65%. Through reranking and methods tailored to handle different types of questions, they further improved the accuracy to 85%. Ultimately, by combining prompt engineering, query expansion, and other methods, they achieved a 98% accuracy rate.

OpenAI RAG Flow

The team emphasized the potential of combining model fine-tuning with RAG, noting that simple fine-tuning and prompt engineering alone, without complex techniques, could approach industry-leading levels.


Baichuan

Publicly available information on Baichuan’s approach is limited, so the author has made some speculative assumptions about certain details. See the original (in Chinese)

Baichuan, drawing inspiration from Meta’s CoVe, has devised a method to deconstruct complex prompts into multiple independent and parallel retrievable search-friendly queries. This enables large models to conduct targeted knowledge base searches for each sub-query, thereby providing more accurate and detailed answers and reducing spurious outputs. Additionally, they have leveraged their proprietary TSF (Think-Step Further) to infer and unearth the deeper underlying questions behind user input, allowing for a more precise and comprehensive understanding of user intent. While the technical details of TSF have not been disclosed, it is speculated to be an enhancement of the Step-back prompting method.

In the retrieval step, Baichuan Intelligence has developed the Baichuan-Text-Embedding vector model, pre-trained on high-quality Chinese data comprising over 1.5 trillion tokens. They have addressed the issue of batch size dependency in contrastive learning through a proprietary loss function. This vector model reportedly ranked at the top of the C-MTEB leaderboard.

Additionally, they have introduced sparse retrieval and rerank models (details not disclosed), forming a hybrid approach that runs vector retrieval and sparse retrieval in parallel and raises the recall rate to 95%.

Furthermore, they have introduced self-critique, enabling large models to introspect on the retrieved content based on prompt, relevance, and utility, and undergo a secondary review to select the most matching and high-quality candidate content.

Baichuan RAG Flow

Given the numerous branches in the entire Baichuan RAG Flow and the lack of specific disclosure, it is reasonable to speculate that reranking and selection entail reordering and screening of all materials, whether retrieved or generated from other branches.

Databricks

Databricks, as a leading service provider in the big data domain, has maintained its distinctive features and advantages in RAG design.

When a user inputs a question, the system retrieves relevant information from pre-processed text vector indices, incorporating prompt engineering to generate responses. The upper half, the Unstructured Data Pipeline, follows the mainstream RAG approach and does not exhibit any particular uniqueness.

Databricks RAG Flow

The lower half, the Structured Data Pipeline, represents Databricks’ feature engineering process and is the most significant aspect of Databricks’ RAG implementation. Leveraging its expertise in big data, Databricks conducts additional retrieval from its highly accurate data storage, fully utilizing its advantage in real-time data serving. Databricks’ strategy in the GenAI era appears to be to empower RAG applications with broad market demand by integrating its robust Delta Lake processing capabilities with generative AI technology into a unified solution and promoting that service to its customers.

RAG (Retrieval Augmented Generation) on Databricks

Conclusion

The article delineates three patterns of fine-tuning stages, four patterns of inference stages, as well as the specific flow implementations in seven papers and three industrial best practices. The overall framework is illustrated as follows.

As we also mentioned in Part 1, summarization and abstraction of the RAG paradigm are crucial in this era of rapid technological advancement. It is essential to transcend specific implementations and comprehend the current technological features and trends from a higher dimension, in order to grasp the direction of future development.

Modular RAG Technical Map

Modular RAG and RAG Flow: Part Ⅰ

OpenRAG — Wed, 24 Jan 2024 16:10:35 GMT

A comprehensive, high-level summary of RAG.

In Part I, we focus on the concept and components of Modular RAG, which comprises 6 module types, 14 modules, and 40+ operators.

Intro

Over the past year, Retrieval-Augmented Generation (RAG) has garnered considerable attention as a method for implementing LLM applications. We have authored a comprehensive survey on RAG, delving into the shift from Naive RAG to Advanced RAG and Modular RAG. However, the survey primarily scrutinized RAG technology through the lens of Augmentation (e.g. Augmentation Source/Stage/Process).

This piece centers specifically on the Modular RAG paradigm. We further define a three-tier Modular RAG paradigm, comprising Module Type, Module, and Operator. Under this paradigm, we expound upon the core technologies within current RAG systems, encompassing 6 major Module Types, 14 Modules, and 40+ Operators, aiming to provide a comprehensive understanding of RAG.

By orchestrating different operators, we can derive various RAG Flows, a concept we aim to elucidate in this article. Drawing from extensive research, we have distilled and summarized typical patterns, several specific implementation cases and best industry cases. (Due to space constraints, this part will be addressed in Part II.)

The objective of this article is to offer a more sophisticated understanding of the present state of RAG development and to pave the way for future advancements. Modular RAG presents plenty of opportunities, facilitating the definition of new operators and modules and the configuration of new Flows.

The Figures in our RAG Survey

What is Modular RAG?

The progress of RAG has brought about a more diverse and flexible process, as evidenced by the following crucial aspects:

Enhanced Data Acquisition: RAG has expanded beyond traditional unstructured data and now includes semi-structured and structured data, with a focus on preprocessing structured data to improve retrieval and reduce the model’s dependence on external knowledge sources.
Incorporated Techniques: RAG is integrating with other techniques, including the use of fine-tuning, adapter modules, and reinforcement learning to strengthen retrieval capabilities.
Adaptable Retrieval Process: The retrieval process has evolved to support multi-round retrieval enhancement, using retrieved content to guide generation and vice versa. Additionally, autonomous judgment and the use of LLM have increased the efficiency of answering questions by determining the need for retrieval.

Definition of Modular RAG

Above, we can see that the rapid development of RAG has surpassed the Chain-style Advanced RAG paradigm, showcasing a modular characteristic. To address the current lack of organization and abstraction, we propose a Modular RAG approach that seamlessly integrates the development paradigms of Naive RAG and Advanced RAG.

Modular RAG presents a highly scalable paradigm, dividing the RAG system into a three-layer structure of Module Type, Modules, and Operators. Each Module Type represents a core process in the RAG system, containing multiple functional modules. Each functional module, in turn, includes multiple specific operators. The entire RAG system becomes a permutation and combination of multiple modules and corresponding operators, forming what we refer to as RAG Flow. Within the Flow, different functional modules can be selected in each module type, and within each functional module, one or more operators can be chosen.

The relationship with the previous paradigm

Modular RAG organizes the RAG system in a multi-tiered modular form. Advanced RAG is a special case of Modular RAG, and Naive RAG is in turn a special case of Advanced RAG. The relationship between the three paradigms is one of inheritance and development.

Opportunities in Modular RAG

The benefits of Modular RAG are evident, providing a fresh and comprehensive perspective on existing RAG-related work. Through modular organization, relevant technologies and methods are clearly summarized.

Research perspective. Modular RAG is highly scalable, making it easier for researchers to propose new Module Types, Modules, and Operators based on a comprehensive understanding of current RAG development.
Application perspective. The design and construction of RAG systems become more convenient, allowing users to customize RAG Flow based on their existing data, usage scenarios, downstream tasks, and other requirements. Developers can also reference current Flow construction methods and define new flow and patterns based on different application scenarios and domains.

The Framework of Modular RAG

Module Type — Module — Operators

In this chapter, we will delve into the three-tier structure and construct a technical roadmap for RAG. Due to space constraints, we will refrain from delving into technical specifics; however, comprehensive references will be provided for further reading.

1. Indexing

Indexing, the process of breaking down text into manageable chunks, is a crucial step in organizing the system, facing three main challenges:

Incomplete Content Representation. The semantic information of chunks is influenced by the segmentation method, resulting in the loss or submergence of important information within longer contexts.
Inaccurate Chunk Similarity Search. As data volume increases, noise in retrieval grows, leading to frequent matching with erroneous data, making the retrieval system fragile and unreliable.
Unclear Reference Trajectory. The retrieved chunks may originate from any document, devoid of citation trails, potentially resulting in the presence of chunks from multiple different documents that, despite being semantically similar, contain content on entirely different topics.

Chunk Optimization

Larger chunks can capture more context, but they also generate more noise, requiring longer processing time and higher costs. While smaller chunks may not fully convey the necessary context, they do have less noise.

Sliding Window

One simple way to balance these demands is to use overlapping chunks. By employing a sliding window, semantic transitions are enhanced. However, limitations exist, including imprecise control over context size, the risk of truncating words or sentences, and a lack of semantic considerations.
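
As a rough illustration of overlapping chunks, here is a minimal character-based sliding window. The function name and the default `size` and `overlap` values are assumptions for this sketch; production splitters usually work on tokens or sentences.

```python
from typing import List


def sliding_window_chunks(text: str, size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size chunks where neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # the last window already reaches the end
            break
    return chunks
```

Because the window is purely positional, it can cut through a word or sentence, which is exactly the limitation noted above.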

Small-to-Big

The key idea is to separate the chunks used for retrieval from the chunks used for synthesis. Using smaller chunks can improve the accuracy of retrieval, while larger chunks can provide more context information.

Specifically, one approach could involve retrieving smaller chunks and then referencing parent IDs to return larger chunks. Alternatively, individual sentences could be retrieved, and the surrounding text window of the sentence returned.
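
The parent-ID variant might look like the sketch below. It is not the LlamaIndex implementation: the helper names are hypothetical, and a simple word-overlap score stands in for embedding similarity so that the example stays self-contained.

```python
from typing import List, Tuple

Child = Tuple[int, str]  # (parent_id, child_text)


def build_children(parents: List[str], child_size: int = 8) -> List[Child]:
    """Split each parent chunk into small child chunks that remember their parent."""
    children: List[Child] = []
    for pid, text in enumerate(parents):
        words = text.split()
        for i in range(0, len(words), child_size):
            children.append((pid, " ".join(words[i:i + child_size])))
    return children


def overlap_score(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def small_to_big(query: str, parents: List[str], children: List[Child], top_k: int = 2) -> List[str]:
    """Retrieve on small chunks, then return the larger parent chunks for synthesis."""
    ranked = sorted(children, key=lambda c: overlap_score(query, c[1]), reverse=True)
    parent_ids: List[int] = []
    for pid, _ in ranked[:top_k]:
        if pid not in parent_ids:  # de-duplicate parents hit by several children
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]
```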

Detailed information and LlamaIndex Implementation.

How to Make Your LLM More Accurate with RAG & Fine-Tuning | Towards Data Science

Summary

It is akin to the Small-to-Big concept, where a summary of larger chunks is generated first, and the retrieval is performed on the summary. Subsequently, a secondary retrieval can be conducted on the larger chunks.

Metadata Attachment

Chunks can be enriched with metadata information such as page number, file name, author, timestamp, summary, or the questions that the chunk can answer. Subsequently, retrieval can be filtered based on this metadata, limiting the scope of the search. See the implementation in LlamaIndex.
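
A minimal sketch of metadata-based filtering follows. The `Chunk` class and `filter_by_metadata` helper are hypothetical names for illustration and do not mirror the LlamaIndex API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Chunk:
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(chunks: List[Chunk], **required: Any) -> List[Chunk]:
    """Keep only chunks whose metadata matches every required key/value pair."""
    return [
        chunk
        for chunk in chunks
        if all(chunk.metadata.get(key) == value for key, value in required.items())
    ]
```

In practice the filter would run before (or alongside) similarity search, so that only chunks from, say, a given file or author are scored.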

Structural Organization

One effective method for enhancing information retrieval is to establish a hierarchical structure for the documents. By organizing chunks into a structure, the RAG system can expedite the retrieval and processing of pertinent data.

Hierarchical Index

In the hierarchical structure of documents, nodes are arranged in parent-child relationships, with chunks linked to them. Data summaries are stored at each node, aiding in the swift traversal of data and assisting the RAG system in determining which chunks to extract. This approach can also mitigate hallucinations caused by chunk extraction issues.

The methods for constructing a structured index primarily include:

Structural awareness. Paragraph and sentence segmentation in documents.
Content awareness. Inherent structure in PDF, HTML, and LaTeX.
Semantic awareness. Semantic recognition and segmentation of text based on NLP techniques, such as leveraging NLTK.

Check Arcus’s hierarchical index at large-scale.

KG Organization Docs

The utilization of Knowledge Graphs (KGs) in constructing the hierarchical structure of documents contributes to maintaining consistency. It delineates the connections between different concepts and entities, markedly reducing the potential for hallucinations.

Another advantage is the transformation of the information retrieval process into instructions that LLM can comprehend, thereby enhancing the accuracy of knowledge retrieval and enabling LLM to generate contextually coherent responses, thus improving the overall efficiency of the RAG system.

Check the Neo4j implementation and the LlamaIndex Neo4j query engine.

For organizing multiple documents using a KG, you can refer to the research paper KGP: Knowledge Graph Prompting for Multi-Document Question Answering.

Knowledge Graph Prompting: A New Approach for Multi-Document Question Answering

2. Pre-Retrieval

One of the primary challenges with Naive RAG is its direct reliance on the user’s original query as the basis for retrieval. Formulating a precise and clear question is difficult, and poorly formed queries result in subpar retrieval effectiveness.

The primary challenges in this stage include:

Poorly worded queries. The question itself is complex, and the language is not well-organized.
Language complexity & ambiguity. Language models often struggle when dealing with specialized vocabulary or ambiguous abbreviations with multiple meanings. For instance, they may not discern whether “LLM” refers to large language model or a Master of Laws in a legal context.

Query Expansion

Expanding a single query into multiple queries enriches the content of the query, providing further context to address any lack of specific nuances, thereby ensuring the optimal relevance of the generated answers.

Multi-Query

By employing prompt engineering to expand queries via LLMs, these queries can then be executed in parallel. The expansion of queries is not random, but rather meticulously designed. Two crucial criteria for this design are the diversity and coverage of the queries.

One of the challenges of using multiple queries is the potential dilution of the user’s original intent. To mitigate this, we can instruct the model to assign greater weight to the original query in prompt engineering.

Sub-Query

Sub-question planning generates the sub-questions that, when combined, contextualize and fully answer the original question. This process of adding relevant context is, in principle, similar to query expansion. Specifically, a complex question can be decomposed into a series of simpler sub-questions using the least-to-most prompting method.

Sub Question Query Engine - LlamaIndex 🦙 0.9.36

CoVe

Another approach to query expansion involves the use of Chain-of-Verification (CoVe), proposed by Meta AI. The expanded queries undergo validation by the LLM to reduce hallucinations. Validated expanded queries typically exhibit higher reliability.

Query Transformation

Retrieve and generate using a transformed query instead of the user’s original query.

Rewrite

The original queries are not always optimal for LLM retrieval, especially in real-world scenarios. Therefore, we can prompt the LLM to rewrite the queries. In addition to using an LLM for query rewriting, specialized smaller language models, such as RRR (Rewrite-retrieve-read), can also be utilized.

The implementation of the query rewrite method in the Taobao promotion system, known as BEQUE, has notably enhanced recall effectiveness for long-tail queries, resulting in a rise in GMV.

HyDE

When responding to queries, LLM constructs hypothetical documents (assumed answers) instead of directly searching the query and its computed vectors in the vector database. It focuses on embedding similarity from answer to answer rather than seeking embedding similarity for the problem or query. In addition, it also includes Reverse HyDE, which focuses on retrieval from query to query.

The core idea of both HyDE and Reverse HyDE is to bridge the gap between query and answer.
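
The HyDE retrieval step can be sketched as below. The `write_hypothetical` (an LLM call) and `embed` (an embedding model) callables are assumptions supplied by the caller, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str,
    docs: List[str],
    write_hypothetical: Callable[[str], str],
    embed: Callable[[str], List[float]],
    top_k: int = 1,
) -> List[str]:
    """Rank documents by similarity to a hypothetical answer instead of the raw query."""
    hypothetical_doc = write_hypothetical(query)  # assumed LLM-generated answer
    hypo_vec = embed(hypothetical_doc)
    ranked = sorted(docs, key=lambda d: cosine(hypo_vec, embed(d)), reverse=True)
    return ranked[:top_k]
```

Reverse HyDE would flip the direction: generate hypothetical questions for each chunk at indexing time and match the user query against those.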

Advanced RAG — Improving retrieval using Hypothetical Document Embeddings(HyDE)

Step-back Prompting

Using the Step-back Prompting method proposed by Google DeepMind, the original query is abstracted to generate a high-level concept question (step-back question). In the RAG system, both the step-back question and the original query are used for retrieval, and both the results are utilized as the basis for language model answer generation.

A New Prompt Engineering Technique Has Been Introduced Called Step-Back Prompting

Query Routing

Query routing directs different queries to distinct RAG pipelines, which suits a versatile RAG system designed to accommodate diverse scenarios.

Metadata Router/ Filter

The first step involves extracting keywords (entity) from the query, followed by filtering based on the keywords and metadata within the chunks to narrow down the search scope.

Semantic Router

Another method of routing leverages the semantic information of the query; for a specific approach, see Semantic Router. A hybrid routing approach can also be employed, combining semantic and metadata-based methods for enhanced query routing.

Check Semantic router repo.

Beyond Basic Chatbots: How Semantic Router is Changing the Game

Query Construction

Converting a user’s query into another query language for accessing alternative data sources. Common methods include:

Text-to-Cypher
Text-to-SQL

In many scenarios, structured query languages (e.g., SQL, Cypher) are often used in conjunction with semantic information and metadata to construct more complex queries. For specific details, please refer to the Langchain blog.

Query Construction

3. Retrieval

The retrieval process plays a crucial role in RAG. Leveraging powerful PLMs enables the effective representation of queries and text in latent spaces, facilitating the establishment of semantic similarity between questions and documents to support retrieval.

Three main considerations need to be taken into account:

Retrieval Efficiency
Embedding Quality
Alignment of tasks, data and models

Retriever Selection

Since the release of ChatGPT, there has been a frenzy of development in embedding models. Hugging Face’s MTEB leaderboard evaluates nearly all available embedding models across 8 tasks: Clustering, Classification, Bitext Mining, Pair Classification, Reranking, Retrieval, Semantic Textual Similarity (STS), and Summarization, covering 58 datasets. Additionally, C-MTEB focuses on evaluating the capabilities of Chinese embedding models, covering 6 tasks and 35 datasets.

When constructing RAG applications, there is no one-size-fits-all answer to “which embedding model to use.” However, you may notice that specific embeddings are better suited for particular use cases.

Check the MTEB/C-MTEB Leaderboard.

MTEB Leaderboard - a Hugging Face Space by mteb

Sparse Retriever

While sparse encoding models may be considered a somewhat antiquated technique, often based on statistical methods such as word frequency statistics, they still hold a certain place due to their higher encoding efficiency and stability. Common sparse encoding models include BM25 and TF-IDF.
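
To make the word-frequency basis concrete, here is a compact sketch of Okapi BM25 scoring. Whitespace tokenization and the `k1` and `b` defaults are simplifying assumptions; library implementations differ in tokenization and in the exact IDF variant.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: str, docs: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores: List[float] = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization penalizes documents longer than average.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores
```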

Dense Retriever

Neural network-based dense encoding models encompass several types:

Encoder-only language models built on the BERT architecture, such as ColBERT.
Comprehensive multi-task fine-tuning models like BGE and Baichuan-Text-Embedding.
Cloud API-based models such as OpenAI-Ada-002 and Cohere Embedding.
Next-generation accelerated encoding framework Dragon+, designed for large-scale data applications.

Mix/hybrid Retrieval

The two embedding approaches capture different relevance features and can benefit from each other by leveraging complementary relevance information. For instance, sparse retrieval models can be used to provide initial search results for training dense retrieval models. Additionally, PLMs can be utilized to learn term weights to enhance sparse retrieval. Sparse retrieval models have also been shown to enhance the zero-shot retrieval capability of dense retrieval models and to help dense retrievers handle queries containing rare entities, thereby improving robustness.

Image from IVAN ILIN: Advanced RAG Techniques: an Illustrated Overview

Retriever Fine-tuning

In cases where the context may diverge from what the pre-trained model deems similar in the embedding space, particularly in highly specialized fields like healthcare, law, and other domains abundant in proprietary terminology, adjusting the embedding model can address this issue. While this adjustment demands additional effort, it can substantially enhance retrieval efficiency and domain alignment.

SFT

You can construct your own fine-tuning dataset based on domain-specific data, a task that can be swiftly accomplished using LlamaIndex.

LSR (LM-supervised Retriever)

In contrast to constructing a fine-tuning dataset directly from domain data, LSR utilizes LM-generated results as supervisory signals to fine-tune the embedding model during the RAG process.

RL(Reinforcement learning)

Inspired by RLHF (Reinforcement Learning from Human Feedback), this approach uses LM-based feedback to reinforce the retriever through reinforcement learning.

Adapter

At times, fine-tuning an entire retriever can be costly, especially when dealing with API-based retrievers that cannot be directly fine-tuned. In such cases, we can mitigate this by incorporating an adapter module and fine-tuning it instead. Another benefit of adding an adapter is the ability to achieve better alignment with specific downstream tasks.

Task Specific. PRCA: Fitting Black-Box Large Language Models for Retrieval Question Answering via Pluggable Reward-Driven Contextual Adapter.
Task Agnostic. The AAR (Augmentation-Adapted Retriever) introduces a universal adapter designed to accommodate multiple downstream tasks.

4. Post-Retrieval

Retrieving entire document chunks and feeding them directly into the LLM’s contextual environment is not an optimal choice. Post-processing the documents can aid LLM in better leveraging the contextual information.

The primary challenges include:

Lost in the middle. Like humans, LLM tends to remember only the beginning and end of long texts, while forgetting the middle portion.
Noise/anti-fact chunks. Retrieved noisy or factually contradictory documents can impact the final retrieval generation.
Context Window. Despite retrieving a substantial amount of relevant content, the limitation on the length of contextual information in large models prevents the inclusion of all this content.

Rerank

Rerank the retrieved document chunks without altering their content or length, to enhance the visibility of the more crucial document chunks for the LLM. Specifically:

Rule-base Rerank

According to certain rules, metrics are calculated to rerank chunks. Common metrics include:

Diversity
Relevance
MMR (Maximal Marginal Relevance, 1998)

The idea behind MMR is to reduce redundancy and increase result diversity, and it is used for text summarization. MMR selects phrases in the final key phrase list based on a combined criterion of query relevance and information novelty.
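
A minimal sketch of greedy MMR selection over embedding vectors is shown below. The function names, the use of cosine similarity, and the `lam` trade-off parameter are assumptions for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec: Sequence[float], doc_vecs: List[Sequence[float]], k: int, lam: float = 0.5) -> List[int]:
    """Greedily pick k document indices, trading relevance against redundancy.

    lam=1.0 ranks purely by relevance; lower values favour novelty.
    """
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With a low `lam`, a near-duplicate of an already chosen chunk is skipped in favour of a less similar one, which is the redundancy reduction described above.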

Check the rerank implementation in Haystack.

Enhancing RAG Pipelines in Haystack: Introducing DiversityRanker and LostInTheMiddleRanker

Model-base Rerank

Utilize a language model to reorder the document chunks, with options including:

Encoder-only models from the BERT series, such as SpanBERT
Specialized reranking models, such as Cohere rerank or bge-reranker-large
General large language models, such as GPT-4

Compression and Selection

A common misconception in the RAG process is the belief that retrieving as many relevant documents as possible and concatenating them to form a lengthy retrieval prompt is beneficial. However, excessive context can introduce more noise, diminishing the LLM’s perception of key information and leading to issues such as “lost in the middle”. A common approach to address this is to compress and select the retrieved content.

(Long)LLMLingua

By utilizing aligned and trained small language models, such as GPT-2 Small or LLaMA-7B, the detection and removal of unimportant tokens from the prompt is achieved, transforming it into a form that is challenging for humans to comprehend but well understood by LLMs. This approach presents a direct and practical method for prompt compression, eliminating the need for additional training of LLMs while balancing language integrity and compression ratio.

Check the LLMLingua project.

LLMLingua | Explore the special language for LLMs via Prompt Compression

Recomp

Recomp introduces two types of compressors: an extractive compressor that selects pertinent sentences from retrieved documents, and an abstractive compressor that produces concise summaries by amalgamating information from multiple documents. Both compressors are trained to enhance the performance of language models on end tasks when the generated summaries are prepended to the language models’ input, while keeping the summary concise. When the retrieved documents are irrelevant to the input or provide no additional information to the language model, the compressor can return an empty string, thereby implementing selective augmentation.

Selective Context

By identifying and removing redundant content in the input context, the input can be streamlined, thus improving the language model’s reasoning efficiency. Selective Context is akin to a “stop-word removal” strategy. In practice, Selective Context assesses the information content of lexical units based on the self-information computed by a base language model. By retaining content with higher self-information, this method offers a more concise and efficient textual representation for language model processing, without compromising performance across diverse applications. However, it overlooks the interdependence between compressed contents and the alignment between the target language model and the small language model used for prompt compression.
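
The self-information filter can be illustrated with a toy sketch. Here a caller-supplied `token_prob` function stands in for the base language model's conditional probabilities, and `keep_ratio` is an assumed parameter; this shows the idea and is not the Selective Context implementation.

```python
import math
from typing import Callable, List


def selective_context(
    tokens: List[str],
    token_prob: Callable[[str], float],
    keep_ratio: float = 0.5,
) -> List[str]:
    """Keep the tokens with the highest self-information, preserving their order."""
    # Self-information I(t) = -log p(t): rare tokens carry more information.
    info = [-math.log(token_prob(tok)) for tok in tokens]
    n_keep = max(1, round(len(tokens) * keep_ratio))
    by_info = sorted(range(len(tokens)), key=lambda i: info[i], reverse=True)
    kept_positions = sorted(by_info[:n_keep])
    return [tokens[i] for i in kept_positions]
```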

Tagging-Filter

Tagging is a relatively intuitive and straightforward approach. Specifically, the documents are first labeled, and then filtered based on the metadata of the query.

Tagging | 🦜️🔗 Langchain

LLM-Critique

Another straightforward and effective approach involves having the LLM evaluate the retrieved content before generating the final answer. This allows the LLM to filter out documents with poor relevance through LLM critique. For instance, in Chatlaw, the LLM is prompted to perform self-suggestion on the referenced legal provisions to assess their relevance.

5. Generation

Utilize the LLM to generate answers based on the user’s query and the retrieved context information.

Generator Selection

Depending on the scenario, the choice of LLM can be categorized into the following two types:

Cloud API-base Generator

Cloud API-based generators use third-party LLMs by invoking their APIs, such as OpenAI’s ChatGPT, GPT-4, and Anthropic Claude, among others. Benefits:

No server pressure
High concurrency
Ability to use more powerful models

Drawbacks:

Data passes through third parties, leading to data privacy concerns
Inability to adjust the model (in the vast majority of cases)

On-Premises

Locally deployed open-source or self-developed LLMs, such as the Llama series, GLM, and others. The advantages and disadvantages are opposite to those of Cloud API-based models. Locally deployed models offer greater flexibility and better privacy protection but require higher computational resources.

Generator Fine-tuning

In addition to direct LLM usage, targeted fine-tuning based on the scenario and data characteristics can yield better results. This is also one of the greatest advantages of using an on-premise setup. Common fine-tuning methods include the following:

SFT

When LLMs lack data in a specific domain, additional knowledge can be provided to the LLM through fine-tuning. Fine-tuning datasets available on Hugging Face can also be used as a starting point.

Another benefit of fine-tuning is the ability to adjust the model’s input and output. For example, it can enable LLM to adapt to specific data formats and generate responses in a particular style as instructed.

RL

Aligning LLM outputs with human or retriever preferences through reinforcement learning is a potential approach. For instance, the final generated answers can be manually annotated and the feedback provided through reinforcement learning. In addition to aligning with human preferences, it is also possible to align with the preferences of fine-tuned models and retrievers.

Distillation

When circumstances prevent access to powerful proprietary models or larger-parameter open-source models, a simple and effective method is to distill the more powerful models (e.g. GPT-4).

Dual FT

Fine-tuning both Generator and Retriever to align their preferences. A typical approach, such as RA-DIT, aligns the scoring functions between Retriever and Generator using KL divergence.

6. Orchestration

Orchestration refers to the modules used to control the RAG process. RAG no longer follows a fixed process, and it involves making decisions at key points and dynamically selecting the next step based on the results. This is also one of the key features of modularized RAG compared to Naive RAG.

Scheduling

The Judge module assesses critical points in the RAG process, determining the need to retrieve from external document repositories, whether the answer is satisfactory, and whether further exploration is necessary. It is typically used in recursive, iterative, and adaptive retrieval. Specifically, it mainly includes the following operators:

Rule-base

The next course of action is determined based on predefined rules. Typically, the generated answers are scored, and then the decision to continue or stop is made based on whether the scores meet predefined thresholds. Common thresholds include confidence levels for tokens.
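A rule-based judge of this kind can be sketched in a few lines. The function name `judge_by_confidence`, the threshold of 0.5, and the choice to look at the least confident token are assumptions for illustration, not values taken from any specific system.

```python
def judge_by_confidence(token_probs, threshold=0.5):
    """Decide the next step from per-token probabilities of a draft answer.

    Returns "retrieve" if any token falls below the threshold (or if there
    is no draft at all), otherwise "stop".
    """
    if not token_probs:
        return "retrieve"
    return "retrieve" if min(token_probs) < threshold else "stop"

# A confident draft stops; a draft with one shaky token triggers retrieval.
confident = judge_by_confidence([0.92, 0.88, 0.95])
uncertain = judge_by_confidence([0.92, 0.31, 0.95])
```

Other rules, such as averaging the probabilities or capping the number of iterations, fit the same shape: a score, a threshold, and a fixed mapping to the next action.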

Prompt-base

The LLM autonomously determines the next course of action. There are primarily two approaches to achieve this. The first involves prompting the LLM to reflect or make judgments based on the conversation history, as seen in the ReAct framework. The benefit here is that no fine-tuning of the model is needed. However, the output format of the judgment depends on how well the LLM adheres to instructions. An example of the prompt-base approach is FLARE.
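Because the judgment format depends on instruction following, a prompt-based judge usually needs defensive parsing. The sketch below assumes a hypothetical `llm` callable that takes a prompt string and returns a string; the prompt wording and the three action names are also illustrative assumptions.

```python
JUDGE_PROMPT = (
    "You are controlling a retrieval pipeline.\n"
    "Question: {question}\n"
    "Draft answer: {draft}\n"
    "Reply with exactly one word: RETRIEVE, ANSWER, or REFINE."
)

ALLOWED = {"RETRIEVE", "ANSWER", "REFINE"}

def prompt_judge(llm, question, draft, default="RETRIEVE"):
    """Ask the model for the next action and fall back if the reply is malformed."""
    reply = llm(JUDGE_PROMPT.format(question=question, draft=draft))
    words = reply.strip().split()
    # Normalize the first word: drop surrounding punctuation, ignore case.
    action = words[0].strip(".,:;!\"'").upper() if words else ""
    return action if action in ALLOWED else default

# Stub models standing in for a real LLM call.
well_behaved = prompt_judge(lambda prompt: "answer.", "Who wrote Hamlet?", "Shakespeare")
off_format = prompt_judge(lambda prompt: "I think we should look it up.", "q", "d")
```

Falling back to retrieval on a malformed reply is one possible design choice; a stricter system might retry the prompt instead.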

Tuning-base

The second approach entails the LLM generating specific tokens to trigger particular actions, a method that can be traced back to Toolformer and is applied in RAG, such as in Self-RAG.
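On the orchestration side, handling such tokens amounts to scanning the generated text for a control token and branching on it. The token string `[Retrieve]` and the function name below are hypothetical; a fine-tuned model defines its own token vocabulary.

```python
RETRIEVE_TOKEN = "[Retrieve]"  # hypothetical control token learned during fine-tuning

def split_on_control_token(generated):
    """Return (text_before_token, needs_retrieval) for one generation step."""
    head, sep, _tail = generated.partition(RETRIEVE_TOKEN)
    return head.strip(), bool(sep)

# The model pauses mid-answer to request evidence.
text, needs_retrieval = split_on_control_token("The treaty was signed in [Retrieve]")
```

When `needs_retrieval` is true, the orchestrator would run retrieval using `text` (or the original query) and resume generation with the retrieved chunks appended.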

Fusion

This concept originates from RAG Fusion. As mentioned in the previous section on Query Expansion, the current RAG process is no longer a singular pipeline. It often requires the expansion of retrieval scope or diversity through multiple branches. Therefore, following the expansion to multiple branches, the Fusion module is relied upon to merge multiple answers.

Possibility Ensemble

The fusion method is based on the weighted values of different tokens generated from multiple branches, leading to the comprehensive selection of the final output. Weighted averaging is predominantly employed. See REPLUG.
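The weighted averaging can be illustrated with a toy next-token example. This is a sketch under simplifying assumptions: each branch is reduced to a small dictionary of token probabilities, and the branch weights are given directly (in a system like REPLUG they would be derived from retrieval scores). The function name is hypothetical.

```python
def ensemble_next_token(branch_distributions, branch_weights):
    """Mix per-branch next-token distributions by normalized branch weights."""
    total = sum(branch_weights)
    if total <= 0:
        raise ValueError("branch weights must sum to a positive value")
    mixed = {}
    for dist, weight in zip(branch_distributions, branch_weights):
        for token, prob in dist.items():
            mixed[token] = mixed.get(token, 0.0) + (weight / total) * prob
    return mixed

# Two branches, each conditioned on a different retrieved chunk.
branches = [
    {"Paris": 0.7, "Lyon": 0.3},
    {"Paris": 0.4, "Lyon": 0.6},
]
weights = [0.75, 0.25]  # the first chunk is considered more relevant

mixed = ensemble_next_token(branches, weights)
best = max(mixed, key=mixed.get)
```

Here the mixed probability of "Paris" is 0.75 × 0.7 + 0.25 × 0.4 = 0.625, so it is selected even though the second branch prefers "Lyon".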

RRF (Reciprocal Rank Fusion)

RRF is a technique that combines the rankings of multiple search result lists to generate a single unified ranking. Developed by researchers at the University of Waterloo (Canada) and Google, RRF produces results that are more effective than reordering chunks under any single branch.
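As a minimal sketch, RRF gives each document a score of 1 / (k + rank) in every list where it appears and sums these scores across lists. The constant k = 60 is the value commonly used with RRF; the function name, the document IDs, and the alphabetical tie-break are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Three branches (e.g. three rewritten queries) return overlapping chunks.
fused = reciprocal_rank_fusion([
    ["a", "b", "c"],
    ["b", "c", "a"],
    ["b", "a", "d"],
])
```

Because only ranks are used, the branches do not need comparable similarity scores, which makes RRF convenient for merging results from different retrievers.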

Conclusion

The upcoming content on RAG Flow will be introduced in PART II, to be published soon.

As this is my first time publishing an article on Medium, I am still getting familiar with many features. Any feedback and criticism are welcome.