Advanced RAG for LLMs/SLMs

Bijit Ghosh
Dec 24, 2023


Retrieval augmented generation (RAG) has emerged as a powerful technique for improving the capabilities of language models. By retrieving and conditioning on external knowledge, RAG allows models to generate more accurate, relevant, and comprehensive text.

There are three main types of RAG architectures — Naive, Modular, and Advanced RAG:

Naive RAG takes a monolithic model like GPT-3 and simply conditions it on retrieved evidence passages, appending them to the input context. This approach is simple but has efficiency and coherency issues.

Modular RAG breaks the system into explicit retriever, reranker, and generator modules. This provides more flexibility and specialization.

Advanced RAG enhances each module further with innovations like higher-order retrievers, cross-encoder rerankers, and evidence manipulation architectures. It unlocks greater accuracy and scalability.

I’ll focus on innovations in Advanced RAG systems and how they adapt to different model scales, covering techniques for both large language models (LLMs) and small language models (SLMs). I first explain the basics of the RAG framework: how retrieval and generation modules are combined to leverage external knowledge. Next, I dive into recent innovations in the three main components of RAG systems: the retriever module, the reranker module, and the generator module.

For each innovation, I highlight adaptations both for large transformer models with billions of parameters as well as smaller, more efficient models. I analyze tradeoffs between accuracy and efficiency and discuss what techniques work best for which models. I also examine hybrid approaches that utilize different model sizes for different RAG components.

By the end, you will have a solid understanding of the state-of-the-art techniques and considerations for developing performant, scalable RAG systems using both LLMs and SLMs. The content synthesizes recent research and aims to provide both technical depth and practical guidance to engineers and researchers building real-world RAG applications.

RAG Framework Basics

At a high level, RAG systems contain three key modules:

  1. Retriever — retrieves passages of text from a knowledge source that are relevant to the context
  2. Reranker (optional) — rescores and reranks retrieved passages
  3. Generator — integrates context with retrieved passages to generate output text

The overall flow operates as follows:

The retriever identifies relevant passages from the knowledge source based on the context. These passages are optionally scored and reranked by the reranker. Finally, the generator conditions on both the context and retrieved passages to generate output text that incorporates external knowledge.
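
To make the flow concrete, here is a minimal sketch of the retrieve, rerank, generate loop. The `embed`, `rerank_score`, and `generate` callables are hypothetical stand-ins for whatever encoder, reranker, and language model you actually use, not any specific library API.

```python
import numpy as np

def retrieve(query_vec, passage_vecs, passages, k=20):
    # Dot-product similarity between the query and every indexed passage.
    scores = passage_vecs @ query_vec
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

def rag_answer(context, passages, passage_vecs, embed, rerank_score, generate, k=5):
    candidates = retrieve(embed(context), passage_vecs, passages)
    # Optional reranking step: keep only the highest-precision evidence.
    reranked = sorted(candidates, key=lambda p: rerank_score(context, p), reverse=True)[:k]
    # The generator conditions on the fused evidence plus the original context.
    prompt = "\n\n".join(reranked) + "\n\n" + context
    return generate(prompt)
```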

RAG systems leverage external textual knowledge to enhance language generation. Knowledge sources can include Wikipedia articles, news archives, domain-specific corpora, or any collection of textual content related to the generation task.

By conditioning generations on retrieved evidence, models can hallucinate less, answer questions more accurately, and generate more informative and relevant text. The outputs become augmented with external knowledge.

Next, we will do a deep dive on innovations within each RAG module, analyze tradeoffs between accuracy and efficiency, and highlight techniques tailored to both LLMs as well as more efficient SLMs.

Innovations in Retriever Models

The retriever module is responsible for identifying relevant external knowledge given the context. The key goal is high recall — retrieving passages that are potentially relevant even if not all retrievals will be used in the final output.

Common retriever architectures include dual-encoders and sparse models. Dual-encoder retrievers encode both the context and passages independently and score passage relevance based on vector similarity. Sparse retrievers directly estimate probability of relevance based on lexical term matching signals.

Recent innovations that improve retriever accuracy and efficiency for both LLMs and SLMs:

Knowledge Enhanced Dual-Encoders

Standard dual-encoder retrievers encode queries and passages independently without modeling their interactions. This limits performance, since the relevance signal depends solely on vector similarity.

Knowledge enhanced dual-encoders apply cross-attention between context and passages during encoding to model interactions explicitly. This improves relevance matching, especially for long or complex queries.

For LLMs, applying self-attention pooling and optionally self-attention over each passage excerpt further improves results. However, attention is still applied separately over the query and passages.

Alternatively, late-interaction models like ColBERT keep token-level embeddings for both the query and each passage and score relevance through fine-grained token-to-token matching, capturing interactions more directly than a single vector similarity. Performance improves significantly, but memory and compute requirements also increase greatly because token-level passage embeddings must be stored and compared.
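
As a concrete illustration, here is a minimal sketch of ColBERT-style late-interaction (MaxSim) scoring, assuming you already have per-token embeddings for the query and a passage; the random arrays simply stand in for real encoder outputs.

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, passage_tokens: np.ndarray) -> float:
    """query_tokens: (num_query_tokens, dim); passage_tokens: (num_passage_tokens, dim)."""
    # Token-to-token similarity matrix.
    sim = query_tokens @ passage_tokens.T
    # For each query token, take its best-matching passage token, then sum.
    return float(sim.max(axis=1).sum())

# Random embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
query = rng.normal(size=(8, 128))
passage = rng.normal(size=(200, 128))
print(maxsim_score(query, passage))
```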

For more efficient SLMs, approaches like Poly-encoders show strong results while balancing accuracy and efficiency. Passages are encoded independently into single vectors so they can be precomputed, while the query is summarized into a small set of attention codes that interact with each candidate vector at scoring time. This lightweight final attention step reduces computation while retaining strong relevance matching capability.

Term Weighting Optimization

In sparse retrieval, relevance matching depends on lexical term weighting schemes. Advanced optimizers like ANCE and ANS learn to upweight important terms and downweight irrelevant terms automatically based on feedback data.

For LLMs, dense approximations of lexical signals, followed by dimensionality reduction and tuning, also boost performance. However, index size and latency increase. The extreme Encode, Compress, Tokenize (ECT) approach works best for massive models but requires significant infrastructure optimization.

For SLMs, directly optimizing term weights based on bandit feedback works well. Gains can be further improved by initializing weights from simple yet fast heuristic functions before tuning. Computation cost is also reduced by using approximate nearest neighbor search during retrieval.
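
The sketch below illustrates the idea of feedback-driven term weighting for an SLM-friendly sparse retriever: weights start from a simple uniform heuristic and are nudged by click-style feedback. The update rule is a toy stand-in for the bandit-style tuning described above, not any specific published optimizer.

```python
from collections import defaultdict

# Start every term from a simple heuristic weight before tuning.
term_weight = defaultdict(lambda: 1.0)

def sparse_score(query: str, passage: str) -> float:
    # Weighted lexical overlap between query terms and passage terms.
    passage_terms = set(passage.lower().split())
    return sum(term_weight[t] for t in query.lower().split() if t in passage_terms)

def feedback_update(query: str, passage: str, clicked: bool, lr: float = 0.1) -> None:
    # Upweight terms that matched in passages users found useful, downweight otherwise.
    passage_terms = set(passage.lower().split())
    for t in query.lower().split():
        if t in passage_terms:
            term_weight[t] += lr if clicked else -lr
```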

Integration of Semantic Term Matching

Sparse retrievers rely primarily on lexical term matching signals, while dual-encoders capture semantics but can miss exact-match cues. Performance can be boosted by modeling both lexical and semantic relevance between queries and passages.

Approaches like Condenser integrate dense embedding similarity search efficiently into the sparse retrieval pipeline. Embedding augmentations based on knowledge enhanced dual-encoders also improve semantic relevance modeling for long-form queries.

For LLMs, maximum inner product search efficiently indexes passages by semantic embedding vectors while retaining sub-linear query efficiency. However, encoder size, indexing latency, and index size present challenges for operationalization.

For SLMs, lightweight embedding augmentations work well. Using separate, faster encoders for retrieval versus generation improves overall workflow efficiency. Quantization based approximate search also balances accuracy and performance.
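
A common way to operationalize this is to mix normalized lexical and semantic scores into one ranking signal. This is a hedged sketch rather than any specific system's formula; `alpha` is a tunable mixing weight you would validate empirically.

```python
import numpy as np

def hybrid_scores(sparse: np.ndarray, dense: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # Min-max normalize each signal so neither dominates purely by scale.
    def norm(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * norm(dense) + (1 - alpha) * norm(sparse)
```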

Reranker Innovations

While many RAG systems show strong results using just a single retriever, cascade architectures with rerankers offer flexibility to tradeoff between accuracy, latency, and cost. Rerankers rescore initial retrieval results and focus on high precision passages most useful for final generation.

Cross-Encoders

Standard dual-encoder retrievers lack the capacity to deeply model query-passage interactions. Cross-encoders, such as BERT-style pointwise rerankers, explicitly encode the concatenation of the context with each passage to learn richer relevance patterns.

Large transformer LM rerankers show strong gains but require encoding every query-passage pair independently. Poly-encoders improve efficiency by sharing computation across passages using query condition vectors.

For LLMs, full cross-encoders maximize accuracy but at a high computational cost. Poly-encoder efficiency improvements help but still require large models. Encode-Tokenize-Encode designs encode passages just once during indexing, then score the encoded query against precompiled indexes.

For more efficient SLMs, one successful pattern is to use a large model for the initial retriever, then a Poly-encoder reranker that is 3–10x smaller. This provides a good accuracy-efficiency balance.
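
The cascade pattern itself is simple to express. In the sketch below, `first_stage_score` and `reranker_score` are placeholders for whatever retriever and (smaller) reranker models you deploy; the point is the wide-then-narrow structure, not the specific scorers.

```python
def cascade_rank(query, passages, first_stage_score, reranker_score,
                 first_k=100, final_k=5):
    # Stage 1: wide, fast recall over all candidates.
    coarse = sorted(passages, key=lambda p: first_stage_score(query, p),
                    reverse=True)[:first_k]
    # Stage 2: narrow, expensive precision, scoring each query-passage pair jointly.
    return sorted(coarse, key=lambda p: reranker_score(query, p),
                  reverse=True)[:final_k]
```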

Weak Supervision Scaling

Cross-encoder style exhaustive search has high compute demands. Weakly supervised ranking losses allow efficiently training models that score query-passage compatibility with just a single forward pass.

At scale, self-supervised pretraining from contextualized term replacement helps further bootstrap relevance models. Pretrained models can also rapidly adapt to new domains given just a few domain-specific examples.

For LLMs, pretraining provides limited gains since supervised fine-tuning is already heavily optimized. Gains come from architecture adjustments like using softmax temperature to calibrate uncertainty.
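
Temperature scaling is one of the simplest of these calibration adjustments. Below is a minimal sketch; the temperature value is illustrative and would normally be fit on a held-out validation set.

```python
import numpy as np

def calibrated_probs(logits, temperature: float = 2.0) -> np.ndarray:
    # Divide logits by a temperature > 1 to soften over-confident scores.
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

print(calibrated_probs([4.0, 2.5, 0.3]))
```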

For SLMs, pretraining provides more significant improvements in accuracy and sample efficiency. It also enables use of more efficient architectures tailored for scoring not generation.

Specialized Reranker Architectures

Beyond adjustments to model size and pretraining, specialized architectures also improve reranker efficiency.

For example, predictor-estimator models use a small neural network to predict relevance labels. The predictions are fed into a lightweight logistic regression estimator to produce well-calibrated scores. By limiting full cross-attention to just the predictor, overall computation drops greatly while strong relevance estimates are retained.
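
To illustrate the estimator half of that design, here is a toy logistic-regression layer over a predictor's outputs; the feature names and weights are made up for the example, not trained values.

```python
import math

def estimator(features, weights, bias: float) -> float:
    # Logistic regression over the small predictor's outputs -> calibrated probability.
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# e.g. features = [predictor_score, lexical_overlap, length_penalty]
print(estimator([1.8, 0.4, -0.2], weights=[1.2, 0.8, 0.5], bias=-1.0))
```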

For LLMs, the extreme approach uses the generator LLM itself as the ranker. This maximizes accuracy, but at a computational cost that erases the efficiency gains of cascade architectures.

For SLMs, specialized efficient ranker architectures work well. The key is avoiding standard LLM transformers in favor of lightweight, dropout-free models with specialized self-attention pooling. These create the best balance between accuracy and high throughput.

Innovations in Generator Models

The generator module ingests the context alongside relevant retrieved passages and produces output text augmented with external knowledge.

Fusion methods determine how to combine and present retrieved evidence, while conditioning techniques allow integrating these fused inputs into the generative process. Architectural innovations also continue advancing integration effectiveness and efficiency.

Evidence Fusion

At fusion time, decisions include:

  1. How many passages to retain
  2. How much of each passage to extract
  3. Whether to concatenate passages or present separately
  4. How to weight or rank different passages

For LLMs, fusion approaches prioritize accuracy. All retrieved content gets included to maximize potential evidence. Including full passages raises the risk of surfacing spurious or distracting facts, so truncation helps, though it discards potentially useful context.

For SLMs, efficiency is more important. Rigorous distillation into just a few sentences ensures concise, relevant conditioning. Ranking and weighting further enhance quality. Key facts should be distilled crisply without losing vital retrieved knowledge.
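
Here is a hedged sketch of those fusion decisions: keep the top-k passages, truncate each one, and concatenate them in weight order. A production system would distill or summarize rather than cut passages mid-sentence.

```python
def fuse_evidence(passages, weights, k=3, max_chars=500):
    # Keep the k highest-weighted passages, crudely truncated to a length budget.
    keep = sorted(zip(passages, weights), key=lambda pw: -pw[1])[:k]
    return "\n\n".join(p[:max_chars] for p, _ in keep)
```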

Conditioning Design

During generation, retrieved evidence also needs proper contextual integration. The basic approach is to concatenate evidence passages with the input context before encoding.

However, the evidence can overwhelm the original context or introduce redundant information. More advanced solutions improve the coherence of this integration.

For LLMs, working memory architectures show promise. External knowledge is encoded separately from the context and then integrated through attention-based memory reads and writes. This avoids overwriting the original context states during evidence encoding.

For SLMs, lightweight entity linking provides supplementation without risk of overwriting. Linking context entities to relevant passages enables entity-focused augmentation without disrupting context representation.
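
As a toy illustration of that entity-focused augmentation, the sketch below links entities mentioned in the context to retrieved passages by simple string matching and appends only the linked passages, leaving the original context untouched. Real systems would use a proper entity linker rather than substring matching.

```python
def entity_augment(context: str, passages: list[str], entities: list[str]) -> str:
    # Keep only passages that mention an entity from the context.
    linked = [p for p in passages if any(e.lower() in p.lower() for e in entities)]
    if not linked:
        return context
    notes = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(linked))
    return f"{context}\n\nRelevant facts:\n{notes}"
```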

Efficiency Optimized Architectures

Beyond fusion and conditioning, overall RAG generator architecture also impacts efficiency. Tradeoffs balance accuracy with throughput and cost.

Encode-Manipulate approaches optimize efficiency by encoding evidence just once during indexing then manipulating representations during generation requests. However, manipulation functions are often simple, limiting expressiveness.

For LLMs, architecture optimizations favor accuracy over efficiency. Chains of multiple pretrained transformers provide strong results but require breaking generation into complex pipelines across multiple model instances.

For SLMs, efficiency becomes vital. Shared normalization architectures with query-key decomposition and conditional query embedding enable single-pass encoding of evidence during generation requests. Weights can also specialize to each operation without parameter explosion.

These architectural innovations maximize speed and cost efficiency while retaining surprisingly strong generation capability augmented by the indexed evidence.

Hybrid RAG with Heterogeneous Models

So far I have discussed innovations tailored for either LLMs or SLMs exclusively. However, modern RAG solutions actually integrate mix-and-match components utilizing both large and small models in hybrid architectures.

LLMs maximize accuracy for key stages then pass condensed outputs into more efficient SLMs for subsequent operations. This provides an optimized blend of quality and efficiency.

For example, initial retrieval might leverage LLMs for maximum recall. The most relevant results get reranked by medium-sized models then very top passages feed into specialist SLMs for final integration. Certain SLMs also specialize on specific content forms like long documents vs tables vs lists to maximize integration coherence.

This hybrid approach balances accuracy with throughput. It also optimizes cost: larger models are reserved mostly for offline indexing, while heavy-throughput online computation runs on efficient models. Specialization for different tasks prevents unnecessary abstraction and overparameterization.
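
A sketch of that wiring is below. `large_embed`, `mid_rerank`, and `small_generate` are hypothetical handles to a large offline encoder, a mid-sized reranker, and an efficient small generator, and `index_vecs` is a precomputed passage-embedding matrix; the names are illustrative, not a specific stack.

```python
import numpy as np

def hybrid_rag(context, passages, index_vecs,
               large_embed, mid_rerank, small_generate,
               recall_k=50, precision_k=3):
    # Stage 1: recall against the large model's precomputed index.
    scores = index_vecs @ large_embed(context)
    coarse = [passages[i] for i in np.argsort(-scores)[:recall_k]]
    # Stage 2: precision with a mid-sized reranker.
    best = sorted(coarse, key=lambda p: mid_rerank(context, p), reverse=True)[:precision_k]
    # Stage 3: an efficient SLM handles the high-throughput generation step.
    return small_generate("\n\n".join(best) + "\n\n" + context)
```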

The end result is high-performance RAG solutions delivering strong accuracy and scalability tailored to real-world production use cases: the best of both LLM quality and SLM efficiency.

Key Takeaways

Let’s recap the key lessons on advanced RAG techniques for LLMs and SLMs:

  • RAG complements language models with external knowledge retrievals to improve generation accuracy, relevance, and information coverage
  • Retriever innovations enhance lexical, semantic, and contextual relevance matching signals for both long-form queries and keyword queries
  • Reranker architectures specialize in precision relevance predictions using strategies combining pretraining, model sizes, and network architectures
  • Generators fuse external evidence smoothly using truncation, distillation, weighting, working memory, and entity grounding techniques
  • Hybrid RAG systems blend LLMs for maximum quality with efficient SLMs for scalability and throughput

I discussed a variety of techniques across retrieval, ranking, and generation modules — highlighting adaptations for both network scale and architecture.

By combining innovations across query understanding, evidence selection, contextual integration, and output generation, modern RAG delivers extremely strong results unlocking the external knowledge needed to power next-generation applications.

Both industrial and academic research continue to progress rapidly. I hope my analysis has provided a helpful consolidation of the state of the art, along with guiding principles for continued innovation in both advanced LLMs and more efficient SLMs.

Please share any thoughts or questions in the comments! I’m happy to discuss any aspects of RAG techniques and architectures in more detail.

