How Does RAG (Retrieval-Augmented Generation) Work?

Abhinav Pandey
3 min read · Feb 21, 2024


Large language models (LLMs), also known as foundation models (FMs), are trained on a vast corpus of publicly available data. Training adjusts the underlying parameters (weights and biases) of the neural network to minimize the difference (loss) between the model's predictions and the training examples. These models can then be fine-tuned for better performance on specific tasks through additional downstream training on new examples. This adaptability lets them handle a wide range of general questions and produce generalized responses. For instance, the Anthropic Claude model has been trained on web text, books, Wikipedia articles, and other sources amounting to nearly 1 trillion tokens; OpenAI's ChatGPT likewise incorporates a broad sweep of publicly available data in its training corpora. This sheer volume of training data enables LLMs to answer almost any query that falls within the scope of their training material.

However, training an LLM is costly, running to millions of dollars per run, which makes continuous fine-tuning or retraining for every new piece of information infeasible. Training establishes a form of parametric memory: the parameters of the underlying neural network are set during this phase and remain static unless the model is retrained. Consequently, when more recent data is needed to answer a question, the LLM may give an inaccurate response or fabricate one that sounds plausible but is not grounded in fact.

1.1 Contextual Data

An alternative approach, one that avoids fine-tuning and leaves the model's weights unaltered, is to provide the relevant examples or documents as part of the model's input prompt at inference time. This technique is called retrieval-augmented generation (RAG).
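To make the idea concrete, here is a minimal sketch of what "context in the prompt" looks like; the template wording, function name, and example strings are illustrative assumptions, not any particular framework's API:

```python
def build_augmented_prompt(question: str, context_passages: list[str]) -> str:
    """Combine retrieved passages with the user's question into a single prompt."""
    context_block = "\n\n".join(context_passages)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Hypothetical example: the passage would normally come from a retrieval step.
print(build_augmented_prompt(
    "What is the refund window for enterprise customers?",
    ["Enterprise contracts allow refunds within 45 days of invoice."],
))
```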

Contextual data for foundation model (FM) applications includes text documents about products and services that are distinctive enough to identify them uniquely, as well as annotated API data or structured formats such as CSV and JSON. The goal of the contextual data pipeline is to let the underlying FM draw on relevant contextual data as part of its input prompt, so it can answer specific knowledge questions. To achieve this, the system stores all of the contextual data the FM might need in a vector database. Acting as an intermediary between the user or application and the FM, the vector database is queried whenever data is needed for context.

A context agent utilizes an embedding model to encode input contextual data into vectors that capture the meaning and context of an asset. This enables applications to locate similar assets by searching for neighboring data points. The vector database efficiently stores, compares, and retrieves up to billions of embeddings (i.e., vectors).
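A small sketch of this step, assuming the open-source sentence-transformers library for embeddings and a plain in-memory index (the model name and documents are placeholders):

```python
# Sketch: encode contextual documents into vectors and find the nearest neighbor
# of a query by cosine similarity. Assumes `pip install sentence-transformers numpy`.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

documents = [
    "Our premium plan includes 24/7 phone support.",
    "API rate limits are 1,000 requests per minute.",
    "Invoices are issued on the first business day of each month.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

query_vector = model.encode(["When are invoices sent out?"], normalize_embeddings=True)
scores = doc_vectors @ query_vector[0]   # cosine similarity, since vectors are normalized
best = int(np.argmax(scores))
print(documents[best], float(scores[best]))
```

A production vector database does the same comparison, but at the scale of millions or billions of embeddings with approximate nearest-neighbor indexes.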

LLMs cannot undergo frequent retraining with new data, limiting their ability to respond to specific knowledge-based inquiries. Additionally, LLMs are widely accessed models, making it imperative to avoid incorporating proprietary or confidential information into their training to prevent potential loss or leakage of sensitive data. Both of these challenges can be addressed through the implementation of retrieval-augmented generation.

In this approach, specialized or contextual information is initially stored in a vector database by utilizing an embedding model to convert it into vectors. This process can occur on a daily or nightly basis to accommodate the influx of new data within the organization.
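One way such an ingestion job might look, assuming FAISS as the vector store and sentence-transformers for embeddings (the file path, model name, and scheduling are placeholders):

```python
# Sketch of a batch (e.g., nightly) indexing job: embed new documents and
# persist them in a FAISS index. Assumes `pip install faiss-cpu sentence-transformers`.
import faiss
from sentence_transformers import SentenceTransformer

def index_documents(documents: list[str], index_path: str = "contextual_data.faiss") -> None:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vectors = model.encode(documents, normalize_embeddings=True).astype("float32")
    index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine on normalized vectors
    index.add(vectors)
    faiss.write_index(index, index_path)
    # The raw document texts must also be stored (e.g., in a database), so that
    # retrieved vector ids can be mapped back to the passages they came from.

# In practice this would be triggered by a scheduler (cron, Airflow, etc.).
index_documents(["Document text collected since the last run...", "Another new document..."])
```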

When the user queries the LLM with a natural-language question, the query text is vectorized using the same embedding model. The resulting vector is used to locate the k most relevant documents (k-nearest neighbors) in the vector database. The original query is then augmented with these retrieved documents, and the combined text is sent as the prompt to the LLM. Leveraging this added context, the LLM formulates a response to the query and returns it to the orchestrator, which finally delivers the answer to the querying chatbot.
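Putting the query path together, continuing the same FAISS and sentence-transformers assumptions; `load_documents` and `call_llm` are hypothetical stand-ins for the document store and whichever model API the orchestrator actually uses:

```python
# Sketch of the query path: embed the question with the same embedding model,
# retrieve the k nearest documents, build the augmented prompt, and call the LLM.
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("contextual_data.faiss")
documents = load_documents()  # hypothetical: same texts the index was built from, same order

def answer(question: str, k: int = 3) -> str:
    query_vec = model.encode([question], normalize_embeddings=True).astype("float32")
    _, ids = index.search(query_vec, k)               # k-nearest-neighbor lookup
    context = "\n\n".join(documents[i] for i in ids[0])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)                           # hypothetical LLM client call
```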
