How does Retrieval-Augmented Generation (RAG) Work?

Nikita Anand
6 min read · May 4, 2024


Have you ever typed a question into a search engine, only to be greeted by a nonsensical answer that leaves you more confused than before? We’ve all been there. Large language models (LLMs) are like powerful cars — impressive technology, but prone to taking unexpected turns, especially on complex roads.

Introducing RAG: Your LLM’s Copilot to Knowledge!

RAG, or Retrieval-Augmented Generation, is the ultimate upgrade for your LLM. It acts as a built-in knowledge detective, ensuring your questions are answered with reliable and informative responses. No more wandering through the vast wilderness of the internet — RAG gets you straight to the answers you need.

So, how does this dynamic duo work?

The Retriever: Your Information Bloodhound

Imagine you walk into a giant library with endless shelves, but you have no idea where to start looking for your answer. That’s where the RAG retriever comes in! It acts like a super-powered librarian, meticulously organizing information (articles, books, websites) into a special database for quick and easy searching.

Indexing

The process of organizing information is called indexing. It is similar to creating a detailed map of a library: the map helps a searcher quickly locate the sections most relevant to their query. During indexing, data is organized and stored in a vector database so that it is easily searchable, which lets the system pull up relevant information when responding to a question.

As shown in the image above, here’s the process:

  • Start with a loader that gathers documents containing your data. These documents could be anything from articles and books to web pages and social media posts.
  • Next, a splitter divides the documents into smaller chunks, typically sentences or paragraphs, because RAG models work better with smaller pieces of text. In the diagram, these are the document snippets.
  • Each text chunk is then fed into an embedding model, which converts the text into a vector embedding.

All generated vector embeddings are stored in an indexed knowledge base, enabling efficient retrieval of similar information.
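To make the indexing step concrete, here is a minimal sketch in Python. It assumes the sentence-transformers library and uses a plain NumPy array as the "indexed knowledge base"; the documents and the sentence-level splitter are placeholder choices, and a real pipeline would use a document loader and a proper vector database instead.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Loader: in a real pipeline these would come from files, web pages, etc.
documents = [
    "RAG combines retrieval with generation. It grounds answers in external data.",
    "Vector databases store embeddings so that similar text can be found quickly.",
]

# Splitter: naive sentence-level chunking (production splitters are smarter).
chunks = [s.strip() for doc in documents for s in doc.split(".") if s.strip()]

# Embedding model: converts each chunk into a dense vector.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(chunks, normalize_embeddings=True)

# Indexed knowledge base: here just a matrix kept alongside the chunks.
index = np.asarray(embeddings)
print(index.shape)  # (number_of_chunks, embedding_dimension)
```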

Query vectorization

Once you have vectorized your knowledge base, you can do the same for the user query. When the model sees a new query, it applies the same preprocessing and embedding techniques, which ensures that the query vector is compatible with the document vectors in the index.
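As a small sketch (again assuming sentence-transformers), the query simply goes through the same embedding model as the documents:

```python
from sentence_transformers import SentenceTransformer

# The same model that embedded the document chunks; a different model would
# produce vectors that cannot be compared with the index.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vector = model.encode("How does RAG ground its answers?", normalize_embeddings=True)
print(query_vector.shape)  # (embedding_dimension,), 384 for this model
```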

Retrieval of relevant documents

When the system needs to find the most relevant documents or passages to answer a query, it utilizes vector similarity techniques. Vector similarity is a fundamental concept in machine learning and natural language processing (NLP) that quantifies the resemblance between vectors, which are mathematical representations of data points.
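Cosine similarity is the most common of these measures: it compares the direction of two vectors, so two texts about the same topic score close to 1 regardless of vector length. A minimal NumPy version looks like this:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```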

The system can employ different vector similarity strategies depending on the type of vectors used to represent the data:

Sparse vector representations

A sparse vector is characterized by a high dimensionality, with most of its elements being zero.

The classic approach is keyword search, which scans documents for the exact words or phrases in the query. It builds sparse vector representations of documents by counting word occurrences and down-weighting words that are common across the collection, so rarer query terms carry more weight in the match.

TF-IDF (Term Frequency-Inverse Document Frequency) and BM25 are two classic related algorithms. They’re simple and computationally efficient. However, they can struggle with synonyms and don’t always capture semantic similarities.
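As an illustration of the sparse approach, scikit-learn's TfidfVectorizer builds exactly this kind of keyword-weighted sparse vector (the toy corpus below is made up; BM25 would need a separate package such as rank_bm25):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Molecules creating body odour",
    "Compounds found in citrus fruit",
    "Vector databases index embeddings for search",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)  # sparse matrix, mostly zeros

# Only the literal word "compounds" overlaps with the corpus, so the citrus
# document scores highest even though the first document is the relevant one.
query_vector = vectorizer.transform(["compounds that cause BO"])
print(cosine_similarity(query_vector, doc_vectors))
```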

Dense vector embeddings

This approach uses language models like BERT to encode the query and passages into dense vector embeddings. These embeddings are compact numerical representations that capture semantic meaning. Vector databases like Qdrant store them and retrieve by semantic similarity, using distance metrics such as cosine similarity, rather than by keywords alone.

This allows the retriever to match based on semantic understanding rather than just keywords. So if I ask about “compounds that cause BO,” it can retrieve relevant info about “molecules that create body odour” even if those exact words weren’t used. We explain more about it in our What are Vector Embeddings article.
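Here is the same "BO" query as a dense-retrieval sketch, again assuming sentence-transformers; with an embedding model that captures the link between "BO" and "body odour", the semantically related document ranks first even with zero word overlap:

```python
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Molecules creating body odour",
    "Compounds found in citrus fruit",
    "Vector databases index embeddings for search",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = model.encode(corpus, normalize_embeddings=True)
query_embedding = model.encode("compounds that cause BO", normalize_embeddings=True)

# Rank documents by cosine similarity in embedding space.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
for score, doc in sorted(zip(scores.tolist(), corpus), reverse=True):
    print(f"{score:.3f}  {doc}")
```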

Hybrid search

However, neither keyword search nor vector search are always perfect. Keyword search may miss relevant information expressed differently, while vector search can sometimes struggle with specificity or neglect important statistical word patterns. Hybrid methods aim to combine the strengths of different techniques.

Some common hybrid approaches include:

  • Using keyword search to get an initial set of candidate documents. Next, the documents are re-ranked/re-scored using semantic vector representations.
  • Starting with semantic vectors to find generally topically relevant documents. Next, the documents are filtered/re-ranked based on keyword matches or other metadata.
  • Considering both semantic vector closeness and statistical keyword patterns/weights in a combined scoring model.
  • Having multiple stages with different techniques. One example: start with an initial keyword retrieval, followed by semantic re-ranking, then a final re-ranking using even more complex models.

When you combine the powers of different search methods in a complementary way, you can provide higher quality, more comprehensive results.
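A toy version of the combined-scoring idea is a weighted sum of normalized keyword and semantic scores. The min-max normalization and the alpha weight below are illustrative assumptions, not a standard recipe; production systems often prefer fusion methods such as reciprocal rank fusion.

```python
import numpy as np

def hybrid_scores(keyword_scores, semantic_scores, alpha=0.5):
    """Blend per-document keyword (e.g. BM25/TF-IDF) and semantic scores.

    alpha=1.0 gives a pure keyword ranking, alpha=0.0 a pure semantic one.
    """
    def normalize(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span else np.zeros_like(x)

    return alpha * normalize(keyword_scores) + (1 - alpha) * normalize(semantic_scores)

# Document 0 is weak on keywords but strong semantically; the blend surfaces it.
print(hybrid_scores([0.0, 0.8, 0.1], [0.9, 0.3, 0.2], alpha=0.4))
```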

Meet the Generator: Weaving the Answer’s Tapestry

With the top relevant passages retrieved, it’s now the generator’s job to produce a final answer by synthesizing and expressing that information in natural language.🤔📝

The LLM is typically a model like GPT, BART or T5, trained on massive datasets to understand and generate human-like text. It now takes not only the query (or question) as input but also the relevant documents or passages that the retriever identified as potentially containing the answer to generate its response.🔍💡

The image below illustrates how the retrieval output feeds into the generator to produce the final generated response.
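In code, the "augmentation" part is often just prompt construction: the retrieved passages are pasted into the prompt next to the question before it is sent to the LLM. A minimal sketch, assuming the Hugging Face transformers library and a small T5-family model (any generation API could stand in for the last two lines):

```python
from transformers import pipeline

# Passages returned by the retriever (placeholders here).
retrieved = [
    "RAG pairs a retriever with a generator.",
    "The retriever finds passages relevant to the user's question.",
]
question = "What does the retriever do in RAG?"

# Augment the prompt with the retrieved context, then generate an answer.
context = "\n".join(retrieved)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)

generator = pipeline("text2text-generation", model="google/flan-t5-small")
print(generator(prompt, max_new_tokens=64)[0]["generated_text"])
```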

In conclusion, RAG (Retrieval-Augmented Generation) is a 🚀 revolutionary technology that combines the strengths of large language models (LLMs) with the power of information retrieval. By doing so, RAG offers a new level of accuracy and reliability in text generation. Think of it as having a 🧑‍🔬 research assistant and a 🗣️ language expert working together to provide you with informative and trustworthy answers to your questions. RAG bridges the gap between the vast knowledge available and your specific needs, ensuring that you receive the most relevant and helpful responses. So, the next time you have a question, consider the power of RAG as it might just be the 🔑 key to unlocking the 🧠 knowledge you seek.

🚀 RAG (Retrieval-Augmented Generation) combines the strengths of LLMs with the power of information retrieval to offer a new level of accuracy and reliability in text generation. It’s like having a research assistant and a language expert working together! 🔑 #AI #NLP #RAG #textgeneration

