An Overview of RAG

Pradeep Goel
4 min read · Dec 25, 2023


Retrieval-augmented generation (RAG) is an effective way to integrate an enterprise's structured and unstructured data with an LLM. RAG enables a large language model (LLM) to generate responses that are specific to the enterprise and grounded in its knowledge.

Large language models are trained on public corpora

A large language model is pretrained on a corpus of text sourced from publicly available information on the internet. For example, Common Crawl is an open repository of web crawl data that is used extensively along with other publicly available resources. Once pretrained, an LLM goes through fine-tuning, which involves supervised fine-tuning and RLHF coupled with safety training, to become usable for chat applications, summarization, content generation, and so on.

Utilizing an enterprise's vast knowledge base along with a fine-tuned LLM to generate enterprise-specific responses is emerging as an important use case. The diagram below shows an architecture for using context-aware LLMs in an enterprise setting.

Context-Aware LLMs

Retrieval-augmented generation enables the use of enterprise data

RAG makes both structured and unstructured enterprise data usable by an LLM. The whole process can be divided into three steps:

  • Transform the data to make it usable alongside an LLM
  • Retrieve the relevant data chunks based on the query supplied by the user
  • Send the retrieved chunks and the query to an LLM to generate a relevant response

Transform the data to make it usable alongside an LLM

In an enterprise, most of the knowledge base consists of documents such as PDFs, Word documents, Markdown files, HTML files, and so on. All such documents need to be split into smaller chunks, embedded, and stored in a vector database.

Step 1 — Transformation

Splitting the documents

The first step is to load the data and split the documents into smaller chunks. These smaller chunks are what gets searched to retrieve information relevant to a query. It is important to store a document's metadata along with its chunks to enable better contextual retrieval at a later stage. Adjacent chunks are overlapped so that context carries consistently from one chunk to the next.
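
As a minimal sketch of this step, the snippet below loads a PDF and splits it into overlapping chunks with LangChain (the library behind the course cited at the end). The file name is hypothetical, and the import paths may differ across LangChain releases.

```python
# Load a document and split it into overlapping chunks.
# Assumes LangChain with the pypdf loader installed; the file name is made up.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("enterprise_handbook.pdf")   # hypothetical document
pages = loader.load()                             # one Document per page, with metadata

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters per chunk
    chunk_overlap=150,  # overlap keeps context consistent across neighboring chunks
)
chunks = splitter.split_documents(pages)          # metadata (source, page) is carried over
print(len(chunks), chunks[0].metadata)
```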

Embedding

Once a document is split into smaller chunks, each chunk needs to be embedded. Embedding is the process of creating a numerical representation of a chunk; these numerical representations are called vectors. Cosine similarity between the query vector and the chunk vectors is used to retrieve chunks similar to the user query. Many embedding models are available for embedding these chunks.
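
As a rough illustration, the snippet below embeds a few short texts and compares them with cosine similarity. OpenAIEmbeddings is just one choice of embedding model, and the example sentences are made up.

```python
# Embed short texts and compare them with cosine similarity.
import numpy as np
from langchain.embeddings import OpenAIEmbeddings  # any embedding model works similarly

embedding = OpenAIEmbeddings()

v1 = np.array(embedding.embed_query("Employees get 20 days of paid leave."))
v2 = np.array(embedding.embed_query("How much annual leave do I get?"))
v3 = np.array(embedding.embed_query("The cafeteria opens at 8 am."))

def cosine(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(v1, v2))  # related sentences -> higher score
print(cosine(v1, v3))  # unrelated sentences -> lower score
```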

Vector DB

So far, we have broken the text into smaller chunks and embedded them; we now need to store the embeddings for eventual retrieval. A vector database is used for this. A vector database uses k-nearest-neighbor and related algorithms to efficiently retrieve relevant vectors. Some of the vector database providers are Chroma, Pinecone, Weaviate, Elasticsearch, and Redis.
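
Continuing the sketch, the chunks from the splitting step can be embedded and stored in Chroma, one of the stores mentioned above; other vector databases expose a similar interface in LangChain, and the persist directory is just an illustrative local path.

```python
# Store the embedded chunks in a local Chroma collection.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectordb = Chroma.from_documents(
    documents=chunks,                 # chunks produced by the splitting step
    embedding=OpenAIEmbeddings(),     # model used to embed each chunk
    persist_directory="./chroma_db",  # hypothetical local path for persistence
)
print(vectordb._collection.count())   # number of vectors stored (Chroma-specific)
```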

Retrieval

Once documents have been transformed and stored in a vector database, we need to retrieve them. A query is passed to the vector database and the matching chunks are retrieved.

Step 2 — Retrieval from a vector store
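
A minimal retrieval call against the vector store built above might look like this; the question is only an example.

```python
# Retrieve the chunks most relevant to a user query.
question = "How many days of paid leave do employees get?"  # example query
docs = vectordb.similarity_search(question, k=3)            # top-3 matching chunks
for d in docs:
    print(d.metadata.get("source"), d.page_content[:80])
```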

Similarity Search and Maximum Marginal Relevance Search

There are two ways to retrieve: similarity search and maximum marginal relevance (MMR) search. Similarity search returns the document chunks that are most similar to the query. MMR search returns a more diverse set: a parameter defines how many candidate chunks are fetched by similarity, and `k` chunks are then selected from that pool that are still relevant to the query but not too similar to each other.
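
The sketch below contrasts the two retrieval modes on the same vector store; the parameter names (`k`, `fetch_k`) follow LangChain's API.

```python
question = "What does the policy say about remote work?"  # example query

# Similarity search: the k most similar chunks, which may be near-duplicates.
similar = vectordb.similarity_search(question, k=3)

# MMR: fetch a larger candidate pool by similarity, then pick k chunks that
# are relevant to the query but dissimilar to each other.
diverse = vectordb.max_marginal_relevance_search(question, k=3, fetch_k=10)
```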

Use of metadata in retrieving a relevant document

At times, a query will be about a specific document, and retrieval improves if metadata context can be provided along with the query. Adding a filter to the query that defines which source should be searched is one way of doing that. Another approach is to use an LLM to extract the metadata context from the query itself and pass it along with the query.
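
Both approaches can be sketched as follows: an explicit metadata filter (shown here with Chroma's filter syntax) and a self-query retriever that lets an LLM infer the filter from the question. The document name and field descriptions are hypothetical.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever  # also needs `lark` installed

# Option 1: restrict the search to one source with an explicit filter.
docs = vectordb.similarity_search(
    "What is the leave policy?",
    k=3,
    filter={"source": "hr_policy.pdf"},  # hypothetical document name
)

# Option 2: let an LLM extract the metadata filter from the query itself.
metadata_field_info = [
    AttributeInfo(name="source",
                  description="The document the chunk comes from",
                  type="string"),
]
retriever = SelfQueryRetriever.from_llm(
    ChatOpenAI(temperature=0),
    vectordb,
    "Enterprise policy documents",  # description of the document contents
    metadata_field_info,
)
docs = retriever.get_relevant_documents("What does hr_policy.pdf say about leave?")
```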

Question and Answer

Now any query posted by a user is passed to the LLM along with the retrieved context, and the LLM makes use of the supplied information in its response. A prompt template can be used to instruct the LLM to avoid hallucination, for example: “You are a helpful assistant. If you do not know the answer, respond by saying that you do not know the answer.”

Step 3 — Pass the query along with retrieved chunks to an LLM
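
A sketch of this step with LangChain's RetrievalQA chain is shown below; the prompt wording and the example question are illustrative.

```python
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Prompt template that grounds the answer in the retrieved context and
# tells the model to admit when it does not know.
template = """You are a helpful assistant. Use the following context to answer
the question. If you do not know the answer, say that you do not know; do not
make anything up.
Context: {context}
Question: {question}
Helpful answer:"""
qa_prompt = PromptTemplate.from_template(template)

qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    retriever=vectordb.as_retriever(),
    chain_type_kwargs={"prompt": qa_prompt},
)
result = qa_chain({"query": "How many days of paid leave do employees get?"})
print(result["result"])
```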

Adding Memory for conversation

An important aspect of a conversation is remembering previous questions and answers and continuing to draw context from them. A simple question-and-answer format does not provide a conversational experience. A memory component is added to keep track of previous turns, thus providing a conversational experience.
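
One way to sketch this is with LangChain's conversation memory and conversational retrieval chain; the follow-up question shows the memory resolving a pronoun from the previous turn.

```python
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain

# Keep the running chat history so follow-up questions retain context.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

chat = ConversationalRetrievalChain.from_llm(
    llm=ChatOpenAI(temperature=0),
    retriever=vectordb.as_retriever(),
    memory=memory,
)

print(chat({"question": "How many days of paid leave do employees get?"})["answer"])
# The follow-up relies on memory to resolve "they".
print(chat({"question": "Do they carry over to next year?"})["answer"])
```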

Summary

Retrieval-augmented generation has become an important technique for working with LLMs and is evolving fast. Newer ways of embedding information and evaluating responses are being introduced. The notes above are based on “LangChain: Chat with Your Data”, available at DeepLearning.AI.

