What is RAG (Retrieval Augmented Generation)?

Vishwajeethogale
4 min read · Mar 2, 2024


Before we dive into RAG, let’s understand what a large language model is.

If you have ever used ChatGPT or Gemini, you already know that these applications let a user chat with a complex deep learning model that answers whatever question the user asks.

This complex deep learning model is a large language model (LLM). It is trained on a wide range of information from the internet, which is what enables it to answer such a broad variety of questions.

Now consider this situation: you have a research paper to read, but you don’t have the patience to go through the entire paper. You just need a summary.

So what do you do?
You copy the entire paper, paste it into the chat, and ask the LLM to summarize it. The LLM uses the paper’s content as context, performs the summarization task, and gives you a gist of what the research paper is about.
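To make that workflow concrete, here is a minimal sketch of such a summarization request using the OpenAI Python client as one example; the model name and the file holding the paper text are placeholders, and any chat-capable LLM API would work the same way.

```python
# A minimal sketch of the "paste the paper and ask for a summary" workflow,
# using the OpenAI Python client as one example. The model name and the
# research_paper.txt file are placeholders you would replace with your own.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

paper_text = open("research_paper.txt").read()  # the copied paper content

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat-capable model works here
    messages=[
        {"role": "system", "content": "You are a helpful research assistant."},
        {"role": "user", "content": f"Summarize the following paper:\n\n{paper_text}"},
    ],
)

print(response.choices[0].message.content)  # the gist of the paper
```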

Now that you have a brief idea about Large Language Models, let’s dive into the concept of RAG. If you would like to know more about Large Language Models, check out this video: link

What is RAG?

Retrieval Augmented Generation, or RAG, is an architectural approach that improves the performance of LLMs (Large Language Models) by leveraging custom external data. The approach retrieves relevant data/documents and supplies them as additional context for the question or task posed by the user querying the LLM. RAG enables the LLM to stay up to date and to access domain-specific knowledge.
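At its core, the “augmentation” is just prompt construction: the retrieved text is placed in front of the user’s question before the LLM sees it. Here is a toy, hypothetical sketch of that step; `retrieved_docs` stands in for whatever your retrieval layer actually returns.

```python
# A toy illustration of the "augmented" part of RAG: retrieved documents are
# simply prepended to the user's question before it reaches the LLM.
# retrieved_docs is a placeholder for whatever your search / vector-database
# lookup returns for this question.
def build_rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    context = "\n\n".join(retrieved_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_rag_prompt(
    "Which planet has the most moons?",
    ["As of 2023, Saturn has the most confirmed moons of any planet."],
)
```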

What is the challenge that RAG solves?

LLMs are not up-to-date
LLMs are trained on a wide range of public data, so they can respond to many types of questions and tasks. But there is one fault here: an LLM’s knowledge is frozen at training time and doesn’t update itself unless the model is retrained with new data.

For example (this is just to illustrate the point):
Consider a scenario where the LLM was only trained up to 2022 data.
In 2022, Jupiter was considered to be the planet with the most moons.
In 2023, Saturn became the planet with the most moons
Now, if a user asks the LLM which planet has the most moons in our solar system, it would answer Jupiter, and that is wrong. This is the issue that RAG solves.

How does RAG solve the problem?

Keeping the above example in mind, the LLM on its own doesn’t know whether Jupiter is the right answer. Its outputs are probabilistic, based on the data it was trained on: given the context provided by the user, the LLM treats the answer with the highest probability as the right one.

To make sure the LLM answers Saturn, RAG follows these steps (a minimal code sketch follows the list):

  1. Prepare the data/documents that teach the LLM about the new changes in the solar system. To be used in RAG applications, documents need to be chunked into appropriate lengths based on the choice of embedding model and the downstream LLM application that uses these chunks as context.
  2. Convert all these document chunks into embeddings and store them in a vector database.
  3. A user now asks the question: Which planet has the most moons in our solar system?
  4. Before the LLM is queried, data relevant to the question is retrieved from the vector database and passed to the LLM as context along with the question.
  5. The LLM uses the new context and answers Saturn. Yay!
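Here is the minimal end-to-end sketch promised above. It assumes the sentence-transformers and openai packages; a plain in-memory NumPy array stands in for a real vector database, and the document strings and model names are placeholders chosen for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

# 1. Prepare and chunk the documents (the "chunks" here are already short).
docs = [
    "As of 2023, Saturn has the most confirmed moons of any planet.",
    "Jupiter held the record for the most known moons until 2023.",
]

# 2. Convert the chunks into embeddings and store them (here: a NumPy array
#    standing in for a vector database).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(docs, normalize_embeddings=True)

# 3. The user's question.
question = "Which planet has the most moons in our solar system?"

# 4. Retrieve the most relevant chunk by cosine similarity.
query_vector = embedder.encode([question], normalize_embeddings=True)
scores = doc_vectors @ query_vector.T  # cosine similarity (embeddings are normalized)
best_chunk = docs[int(np.argmax(scores))]

# 5. Pass the retrieved chunk to the LLM as context along with the question.
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{
        "role": "user",
        "content": f"Context: {best_chunk}\n\nQuestion: {question}",
    }],
)
print(response.choices[0].message.content)  # expected: Saturn
```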

This way you can tackle the problem of out-of-date large language models and make sure you get the right answers without second-guessing the LLM’s output.

Architecture

What are the benefits of RAG?

The RAG approach has a number of key benefits, including:

  1. Providing up-to-date and accurate responses: RAG ensures that the response of an LLM is not based solely on static, stale training data. Rather, the model uses up-to-date external data sources to provide responses.
  2. Reducing inaccurate responses, or hallucinations: By grounding the LLM’s output in relevant external knowledge, RAG mitigates the risk of responding with incorrect or fabricated information (also known as hallucinations). Outputs can include citations of the original sources, allowing human verification.
  3. Providing domain-specific, relevant responses: Using RAG, the LLM will be able to provide contextually relevant responses tailored to an organization’s proprietary or domain-specific data.
  4. Being efficient and cost-effective: Compared to other approaches to customizing LLMs with domain-specific data, RAG is simple and cost-effective. Organizations can deploy RAG without needing to customize the model. This is especially beneficial when models need to be updated frequently with new data.

When should I use RAG and when should I fine-tune the model?

RAG is the right place to start: it is easy to set up and may be entirely sufficient for some use cases. Fine-tuning is most appropriate in a different situation, when you want the LLM’s behavior to change or to learn a different “language.” The two are not mutually exclusive. As a future step, you can fine-tune a model to better understand domain language and the desired output form, and also use RAG to improve the quality and relevance of the response.

I tried my best to explain the concept of RAG as simply as possible. If you want to know more about RAG and LLMs, watch the video below.

What is Retrieval-Augmented Generation (RAG)? — YouTube
