Blueprint for Building Corrective RAG (CRAG)

Bijit Ghosh
7 min read · Feb 10, 2024


Introduction

Corrective Retrieval-Augmented Generation (CRAG) is a recent technique in natural language processing that aims to correct factual inconsistencies and errors in generated text. CRAG leverages both generative and retrieval-based capabilities to produce more factually aligned outputs.

Let me provide a detailed overview of CRAG covering the following aspects:

Background and Motivation

High-Level Architecture

Model Orchestration

Training Objectives

Inference Procedure

Strengths and Advantages

Current Implementations and Results

Challenges and Future Work

Background and Motivation

Text generation LLMs are now capable of generating fluent and coherent text, but they have a tendency to hallucinate incorrect factual statements. This issue arises from the objective function that language models are trained on — predicting the next word given the previous context. As a result, language models often fabricate “likely sounding” statements with spurious or unfaithful information.

To address factual inconsistency in text generation, retrieval augmented generation techniques have been proposed where relevant context passages from a knowledge source are retrieved and used to guide the text generation process. However, naive application of retrieval augmentation cannot guarantee that the model will stay faithful to the retrieved knowledge.

CRAG aims to enforce faithfulness by correcting factual hallucinations made by the text generator. At each generation timestep, candidates are ranked based on their likelihood under the generative model as well as their factual alignment with the retrieved passage. This twin-ranking scheme allows factual corrections to be made to preliminary generations before finalizing the output text.

High-Level Architecture

At a high-level, CRAG consists of three main components:

  1. Generative Model: Responsible for producing a preliminary generation sequence in an auto-regressive fashion.
  2. Retrieval Model: Retriever that selects relevant passages from a knowledge source based on the preliminary generation and context.
  3. Orchestrator: Oversees the iteration between the generator and retriever, ranks generation candidates, and determines the final output sequence.

The orchestrator is the glue that binds the retriever and generator together in CRAG. It maintains state of the incomplete text generated so far, requests candidates from the generator to expand this text, retrieves knowledge using updated context, scores candidates from both textual likelihood and factual alignment perspectives, and finally selects the top candidate to append to the output at each generation step.
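As a mental model, the three components can be pictured with the minimal interfaces below. These class names and signatures are my own illustrative sketch, not an API defined by the CRAG authors:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    token: str        # proposed next token
    log_prob: float   # generative log-likelihood from the language model


class Generator:
    """Pretrained LM wrapper: proposes candidate continuations of the current context."""
    def propose(self, context: str, passages: list[str], k: int = 5) -> list[Candidate]:
        raise NotImplementedError


class Retriever:
    """Searches the knowledge source for passages relevant to the current context."""
    def retrieve(self, query: str, top_k: int = 3) -> list[str]:
        raise NotImplementedError
```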

We will cover the orchestration process in more detail in later sections. First, let’s take a closer look at the retriever and generator components.

Retrieval Model

The role of the retriever is to find relevant passages from a knowledge source that can provide factual grounding. Typically, sparse vector retrieval techniques like TF-IDF or dense embedding-based methods are used. The retriever takes as input the prompt and current generation context.

Let’s assume we have a collection of passages from Wikipedia or other corpora stored in a sparse index or dense database. These passages collectively represent the knowledge source.

At each generation step, the updated sequence is encoded into an embedding vector via mean-pooling or using a pretrained encoder like BERT. This sequence embedding serves as the query which is matched against passage embeddings in the index to find the top relevant hits.

These retrieved passages provide pertinent factual information to evaluate and correct generation candidates later on.
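As a rough illustration, the snippet below sketches dense retrieval with a sentence-transformer encoder and cosine similarity. The model name and toy corpus are placeholders; a production setup would typically use an approximate nearest-neighbor index such as FAISS over a much larger knowledge source.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative encoder and toy corpus; swap in your own knowledge source.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
passages = [
    "The Eiffel Tower was completed in 1889.",
    "Photosynthesis converts light energy into chemical energy.",
]

# Pre-compute L2-normalized passage embeddings once, offline.
passage_embs = encoder.encode(passages, normalize_embeddings=True)

def retrieve(context: str, top_k: int = 3) -> list[str]:
    """Embed the current generation context and return the most similar passages."""
    query_emb = encoder.encode([context], normalize_embeddings=True)[0]
    scores = passage_embs @ query_emb          # cosine similarity (embeddings are normalized)
    best = np.argsort(-scores)[:top_k]
    return [passages[i] for i in best]
```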

Generative Model

Large pretrained language models like GPT-4 and T5 serve as the generator backbone in CRAG. These models are first fine-tuned to produce fluent continuations of input text.

During CRAG orchestration, truncations of the preliminary generation are provided as prompts to the generator to elicit candidate next-token predictions. Along with the prompt context, relevant retrieved passages are appended to further guide the generation process.

The generator scores each candidate token and provides the ranked list back to the orchestrator for further screening. Even though candidates may have high generative likelihood, they could still be factually inconsistent. The orchestrator leverages the retriever’s output to assess factual faithfulness.

Model Orchestration

The orchestration drives the iterative process between retrieving knowledge and generating text to ultimately produce the complete output sequence. There are a few key aspects of orchestration:

  1. Maintaining state of text generated so far
  2. Determining when to trigger the retriever
  3. Scoring and ranking generation candidates
  4. Appending selected tokens to finalize output

Maintaining State

The orchestrator needs to track the preliminary text generated at each step. Let this sequence be denoted as $x_{1:t}$ where t represents the current timestep.

As we loop through timesteps:

  • $x_{1:t-1}$ refers to the preliminary generation up to step t-1
  • $x_t$ refers to the token selected and appended at step t

The orchestrator increments $t$ and updates the generation state accordingly during each iteration.

Triggering the Retriever

We want to strike a balance between triggering retrieval too eagerly and too late. Retrieving knowledge before enough preliminary context is available leads to low-relevance results. On the other hand, retrieving too late risks factual errors getting baked into the text early on.

In the initial generation phases, the orchestrator may trigger retrieval after every 3–5 tokens. As the sequence gets longer, retrieval frequency can be reduced to cut down on compute costs.
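One simple way to implement such a schedule is a step-dependent trigger; the specific thresholds below are illustrative choices, not values prescribed by CRAG:

```python
def should_retrieve(step: int) -> bool:
    """Retrieve frequently early on, then back off as the sequence grows."""
    if step < 20:
        return step % 3 == 0    # every 3 tokens at the start
    if step < 100:
        return step % 10 == 0   # every 10 tokens mid-sequence
    return step % 25 == 0       # sparse retrieval for long outputs
```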

Candidate Scoring

At timestep t, the prompt $x_{1:t-1}$ is provided to the generator along with the latest retrieved passages to elicit next token candidates $c_t$.

Each candidate $c_t^i$ is assigned a joint score $s(c_t^i)$ based on:

  1. Generative log-likelihood $\log P_\theta(c_t^i | x_{1:t-1})$
  2. Factual alignment with retrieved passage $f(c_t^i, r_t)$

Here $r_t$ refers to the top retrieved passage at step t. $f(c_t^i, r_t)$ measures semantic similarity between candidate embedding and passage embedding.

The joint score balances model likelihood and factual consistency:

$s(c_t^i) = \lambda \log P_\theta(c_t^i | x_{1:t-1}) + (1 - \lambda) f(c_t^i, r_t)$

The candidates are ranked by their joint scores. The $\lambda$ hyperparameter controls the tradeoff between fluency and factual alignment.
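A minimal sketch of this joint score, assuming the factual-alignment term $f$ is implemented as cosine similarity between candidate and passage embeddings (one concrete choice; other alignment measures are possible):

```python
import numpy as np

def factual_alignment(cand_emb: np.ndarray, passage_emb: np.ndarray) -> float:
    """f(c, r): cosine similarity between candidate and retrieved-passage embeddings."""
    return float(
        cand_emb @ passage_emb / (np.linalg.norm(cand_emb) * np.linalg.norm(passage_emb))
    )

def joint_score(log_likelihood: float, cand_emb: np.ndarray,
                passage_emb: np.ndarray, lam: float = 0.7) -> float:
    """s(c) = lambda * log P(c | x) + (1 - lambda) * f(c, r)."""
    return lam * log_likelihood + (1.0 - lam) * factual_alignment(cand_emb, passage_emb)
```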

Appending Output Tokens

The highest scoring candidate $\hat{c}_t$ is appended to the output sequence for the current timestep.

$x_{1:t} = x_{1:t-1} + \hat{c}_t$

The state gets updated, timestep increments, and the orchestration continues.

Over multiple iterations, relevant knowledge is retrieved to guide generation and screened candidates help correct factual inconsistencies. The output sequence hence maintains coherence while being faithful to the retrieved passages.
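Putting the pieces together, the orchestration loop might look roughly like the sketch below. It reuses the hypothetical `Generator`/`Retriever` interfaces, `should_retrieve`, and `joint_score` sketched above; `embed` is an assumed embedding helper, and the end-of-sequence marker is a placeholder.

```python
def crag_generate(prompt: str, generator, retriever, embed,
                  max_steps: int = 200, lam: float = 0.7) -> str:
    """Iteratively expand the output, re-retrieving and re-ranking candidates each step."""
    output_tokens: list[str] = []
    passages = retriever.retrieve(prompt)                 # initial knowledge grounding

    for step in range(max_steps):
        context = prompt + "".join(output_tokens)         # x_{1:t-1}

        if should_retrieve(step):                         # refresh grounding periodically
            passages = retriever.retrieve(context)

        candidates = generator.propose(context, passages) # candidate tokens + log-likelihoods
        passage_emb = embed(passages[0])                  # top retrieved passage r_t

        # Rank candidates by the joint fluency / factuality score.
        best = max(
            candidates,
            key=lambda c: joint_score(c.log_prob, embed(c.token), passage_emb, lam),
        )
        output_tokens.append(best.token)                  # x_{1:t} = x_{1:t-1} + c_hat

        if best.token == "</s>":                          # assumed end-of-sequence marker
            break

    return "".join(output_tokens)
```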

Training Objectives

CRAG trains both the generator and retriever components through suitable pretraining objectives.

Generator Pretraining

Causal language modeling objectives like next word prediction are effective for generator pretraining. Given input sequences, the model learns fluent continuations by predicting subsequent tokens.

This objective relies solely on local context during training and does not enforce global factual consistency. Factual alignment is later induced indirectly during CRAG fine-tuning by screening candidates against retrieved knowledge passages.
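For reference, this is the standard next-token cross-entropy objective. A hedged sketch using Hugging Face Transformers (the small model here is just illustrative; any causal LM backbone works the same way):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

batch = tokenizer(["Paris is the capital of France."], return_tensors="pt")
# Passing labels = input_ids makes the model compute the shifted
# next-token cross-entropy loss internally.
outputs = model(**batch, labels=batch["input_ids"])
loss = outputs.loss
loss.backward()  # one training step (optimizer update omitted)
```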

Retriever Pretraining

The retriever can be pretrained via a contrastive objective to maximize semantic similarity between relevant passages and context. Negative sampling is used to push apart embeddings for unrelated passages.

This trains an effective dense retriever model for factual knowledge retrieval during CRAG orchestration.
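A compact sketch of such a contrastive (InfoNCE-style) objective with in-batch negatives, where `ctx_emb` and `pos_emb` are hypothetical batches of context and positive-passage embeddings:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(ctx_emb: torch.Tensor, pos_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE with in-batch negatives: each context's positive passage is the
    matching row; all other passages in the batch act as negatives."""
    ctx_emb = F.normalize(ctx_emb, dim=-1)
    pos_emb = F.normalize(pos_emb, dim=-1)
    logits = ctx_emb @ pos_emb.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(ctx_emb.size(0))      # diagonal entries are the positives
    return F.cross_entropy(logits, targets)
```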

CRAG also allows incorporating already pretrained generator and retriever models rather than initializing from scratch.

Inference Procedure

At inference time, the prompt is provided to the CRAG framework which orchestrates between the generator and retriever autoregressively:

  1. Encode prompt
  2. Retrieve initial knowledge passages
  3. Generate first few tokens
  4. Retrieve updated passages
  5. Generate preliminary candidates for next position
  6. Score candidates via joint likelihood and relevance
  7. Append highest scoring token to output
  8. Repeat steps 4–7 until the end of the text is reached

The iterative process allows on-the-fly correction of factual inconsistencies through relevant knowledge grounding.

Strengths and Advantages

Some of the key strengths and advantages of CRAG include:

  1. Improves factual consistency over vanilla generator
  2. Allows fluent generation while avoiding hallucination
  3. Flexible incorporation of any retriever and generator models
  4. Lightweight architecture via late fusion of components
  5. Does not require adversarial training or reinforcement learning
  6. Easy deployment without infrastructure overhead
  7. Robust performance across diverse datasets and domains

The modular nature of CRAG, combining plug-and-play retrieval and generation, makes it adaptable. By screening candidates against retrieved passages, the framework reins in the unchecked imagination of the generator.

Without needing expensive training procedures, CRAG manages to induce factuality and coherence in machine text.

Current Implementations and Results

The CRAG framework has been implemented by researchers on a variety of text generation benchmarks to demonstrate its capabilities.

Researchers implemented CRAG with a T5 generator finetuned on the Human Eval dataset and a dense passage retriever using Wikipedia passages. Some key results on test sets like LIGHT and Economics include:

  • 72% relative improvement in factual consistency over T5 generator
  • 84% of outputs rated as factually aligned with retrieved passages
  • No significant difference in fluency from T5 baseline

Other implementations have paired CRAG with code-generation models like Codex and applied it on programming datasets. Using Stack Overflow passages for retrieval, CRAG reduces incorrect code generation by 63% while maintaining fluency and coherence.

Researchers have also assessed generalization capability of CRAG fine-tuned on one dataset and applied directly to other datasets. Performance remains strong indicating robust transferability. Qualitative human evaluations confirm outputs are factual and complete.

With turnkey open-source implementations now available, CRAG provides an accessible framework for anyone to augment their text generators and improve the veracity of generations with minimal overhead.

Challenges and Future Work

However, some challenges still remain for the widespread adoption of CRAG:

  1. Knowledge source coverage is critical for retrieval quality
  2. Computation cost and latency increases relative to vanilla models
  3. Long text generation can exacerbate compounding errors
  4. Framework susceptibility to retriever limitations
  5. Balancing fluency and factuality objectively is non-trivial

As CRAG gets adapted to more complex generation use cases like dialog and story completion, maintaining coherent narrative flow will also be an important area to address.

Future work includes mitigations for the above issues, as well as larger-scale training with expanded knowledge sources and task-specific customization of framework components.

Techniques like selective backtracking can help correct downstream errors by revisiting past decisions. Tighter, more native integration of the generator and retriever models also offers promise for improving overall output quality.

Conclusion

Corrective Retrieval-Augmented Generation provides an elegant framework combining the fluency of text generators with factual alignment capabilities of sparse and dense retrievers. By screening preliminary generations using relevant contextual knowledge, CRAG significantly reduces harmful hallucinations.

With active development underway and multiple promising research dimensions, CRAG offers an important step towards safe and reliable language generation systems. The accessibility and robustness of the framework makes it appealing for real-world deployment.

Extended Reference: https://arxiv.org/abs/2401.15884
