All You Need to Know About RAG — When, Why, and How to Use It?

Millennium Bismay
Published in Analytics Vidhya · 11 min read · Sep 2, 2024

With the rapid evolution of Large Language Models (LLMs), Retrieval Augmented Generation (RAG) pipelines have quickly become the de-facto framework for chatbots. These LLMs come with impressive abilities that allow them to play the role of your ever-helpful digital assistant by following instructions straight from a prompt. The result? Their responses are far more cohesive than those generated by their earlier, less sophisticated counterparts.

But of course, no superhero is without their kryptonite. They are trained to be relentlessly helpful. They almost never say “No” — and that’s not always a good thing. You see, the real problem arises when they don’t know that they don’t know.

The Hallucination Problem [2]. LLMs are trained on massive volumes of internet data under a Next-Token prediction (or causal) paradigm. Essentially, their job is to predict the next word that’s most likely to follow, based on all the words that have come before it. Now, this might sound clever — and it is! — but it’s also a bit like trying to write a novel one word at a time, without ever planning ahead! The real kicker is that these models don’t come equipped with a built-in fact-checker. So when they stray off the knowledge map, they’re not aware of it — leading to some confident but completely fabricated responses. In short, they lack the ability to stop, reflect, and correct themselves. It’s a bit like having a friend who insists on giving you directions, even though they’re hopelessly lost themselves!

Hallucination refers to the behavior where the next tokens (words) predicted by the LLM drift away from the ground-truth context, resulting in factually irrelevant or wrong outputs.

RAG aims to keep LLMs on track. How? By serving up the most relevant documents that actually answer the user’s query. That’s the key reason RAG has gained so much popularity lately. It takes the impressive In-Context Learning (ICL) abilities of LLMs and anchors them to solid, relevant information, ensuring they deliver responses that are not only cohesive but also accurate and tailored to the user’s needs.

One crucial aspect of RAG is its ability to query your own data. Sure, we have Google for searching the web, but would you really feel comfortable sharing your personal documents with Google? And what about a company — can it afford to expose its internal documents on the internet? Of course not! That’s where RAG comes to the rescue. Not only can it search through your private documents, but it can also answer queries and generate summaries tailored to your specific data — all while keeping everything private and secure.

These days, most tutorials barely skim the surface when it comes to the true potential of a fully-fledged RAG framework. Everyone’s in a rush to showcase the power of Generative AI (though let’s be honest — we still have quite a way to go) with as little code as possible. Call four functions from three different libraries and, voilà! Your RAG is up and running! But do you ever wonder what’s actually happening under the hood of those methods from different libraries? And how can we push past these basic RAG systems, which, let’s face it, sometimes perform like a kindergartner trying to search books to explain Newton’s third law of motion? Well, in this blog series — All You Need to Know About RAG — I’ll take you on a deep dive into the inner workings of the most customizable parts of a RAG framework, with a special focus on building an Agentic RAG. Let’s explore and build together!

A RAG pipeline has four core concepts —

  • Storage
  • Retrieval
  • Response Generation
  • Citations/References
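Before we unpack each of these, here is a rough sketch of how the four pieces typically fit together in code. Every name in it (the vector store’s search method, llm_generate, chunk attributes, and so on) is a hypothetical placeholder for illustration, not any particular library’s API:

```python
# A minimal, illustrative RAG pipeline. All helper names here
# (vector_store.search, llm_generate, chunk.text, chunk.source) are
# hypothetical placeholders for whatever library or model you actually use.

def answer_query(query, vector_store, llm_generate, top_k=5):
    # 1. Storage: chunking and indexing happen offline; here we assume
    #    vector_store already holds the embedded chunks.

    # 2. Retrieval: fetch the chunks most relevant to the query.
    retrieved_chunks = vector_store.search(query, top_k=top_k)

    # 3. Response Generation: let the LLM answer using only the retrieved context.
    context = "\n\n".join(chunk.text for chunk in retrieved_chunks)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    answer = llm_generate(prompt)

    # 4. Citations: report which chunks (and source documents) were used.
    citations = [chunk.source for chunk in retrieved_chunks]
    return answer, citations
```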

We all get excited about the magic of ‘Response Generation,’ but let’s not forget — it’s just the final product of all the hard work that came before it. As the saying goes, “You reap what you sow!” This couldn’t be more true for an effective RAG framework.

Why is Storage Important? Can’t we just dump everything in at once? What good is using an LLM if we need to focus on storage?

Well, well, well… firing off a bunch of questions at once, huh? Let’s break it down. Picture this: you’re writing some code to analyze millions of rows of data using Python and Pandas. Suddenly, you forget how to perform an “Inner Join” on two DataFrames — happens to the best of us! So, naturally, you Google it (don’t worry, you’re in good company!). Now, imagine that instead of taking you directly to the Pandas merge method, Google throws the entire documentation of the Pandas library at you. Sure, you could scroll through and eventually find what you need, but wouldn’t it be far more helpful if Google just pinpointed the exact page you were looking for?

You know which columns matter for drawing meaningful insights from your data, and Google helps you find the right tool for the job. In the same way, LLMs excel at understanding the context of user queries and answering based on relevant information. But imagine asking them to sift through an entire document instead of just focusing on the most relevant parts! It would be like handing someone a library when all they needed was one book.

With large contexts, LLMs generally suffer from the “lost in the middle” problem [3]. It might sound ridiculous, but it has been observed that LLMs, almost like humans, focus mostly on the start and end of a long context, which can lead to missing key concepts present in the middle of the document.

Yes, I heard you screaming — What about LLMs supporting very long context length?

First things first, the “lost in the middle” problem doesn’t vanish completely — it’s more like it gets a bit of a band-aid. Sure, it’s somewhat mollified, but this can still lead to inconsistencies across different LLMs. And let’s not forget, LLMs that can handle very long contexts tend to be huge (we’re talking 70B+ parameters). Naturally, this comes at the cost of much slower inference speeds. So, it’s a bit of a trade-off — longer context, slower performance. You see, LLMs don’t generate sentences as a whole. They use a Transformer decoder as their backbone and generate one word/token at a time. In Fig. 1 below, Step 1, the words in orange, ‘Two roads diverged in’, are the input, and the LLM generates ‘a’, in green, as the output. To generate the next word, in Step 2, it takes the entire text preceding it as the input — ‘Two roads diverged in a’ — and generates ‘yellow’, and so on. Since it looks at (and processes) all the preceding tokens every time it generates a new word, generating a sequence has a time complexity of O(n²).

Fig.1. LLM Inference — One step at a time
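To make that step-by-step behavior concrete, here is a toy greedy-decoding loop. It assumes the Hugging Face transformers library, with GPT-2 purely as an example model. Notice how each step re-feeds the entire growing prefix to the model, which is where the quadratic cost comes from (in practice, key-value caching reduces the recomputation, though attention cost still grows with context length):

```python
# A toy sketch of autoregressive (next-token) generation.
# Assumes the Hugging Face transformers library; GPT-2 is just an example model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Two roads diverged in"
input_ids = tokenizer(text, return_tensors="pt").input_ids

for _ in range(5):  # generate 5 tokens, one step at a time
    with torch.no_grad():
        logits = model(input_ids).logits           # re-processes the whole prefix
    next_id = logits[:, -1, :].argmax(dim=-1)      # greedy pick of the next token
    input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```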

Secondly, we don’t usually search from just one giant encyclopedia, do we? Instead, we gather information from multiple sources and different types of documents. Even if LLMs get really good at handling super long contexts, do you really want to provide a list of documents for them to search through every single time you ask a question? Probably not. What you’d prefer is to simply add a new document, and the next time you query, the LLM should incorporate that fresh info into its response. And let’s not forget the challenge of staying up-to-date. When it comes to integrating RAG with current events, we need the most recent information available. Unfortunately, LLMs have a built-in limitation — they can only know as much as their pre-trained data, which comes with an inevitable knowledge cut-off date. Oops!

Most of our data is unstructured — raw text, images, HTML pages, code, tables in PDFs, etc. Hence, later in this blog series we will cover the key concepts of Storage: Types of Chunking, Types of Indexing, and Effective Use of Metadata.
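As a tiny preview of what chunking can look like in practice, here is a deliberately simple sketch: fixed-size character chunks with overlap, each carrying a bit of metadata about where it came from. The sizes and metadata fields are arbitrary choices for illustration, not a recommendation:

```python
# A minimal chunking sketch: split raw text into overlapping fixed-size chunks
# and keep source metadata alongside each chunk. Sizes are illustrative only.

def chunk_text(text, source, chunk_size=500, overlap=100):
    chunks = []
    step = chunk_size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append({
            "text": text[start:start + chunk_size],
            "metadata": {"source": source, "start_char": start},
        })
    return chunks
```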

Isn’t Retrieval just semantic similarity between user query and stored documents? What’s more to it?

To answer this question, let’s understand what semantic similarity is with an example:

User Query: When should I go to bank before next weekend? I need urgent money.

Context1: The Colorado River is really pretty. There are tons of bars on its bank! People say the river walk on its bank is awesome. But it’s closed on Fridays!

Context2: Banks are generally open from Monday to Saturday. However, next week, Tuesday is a bank holiday.

Which one do you think is the most relevant context for the user query? I am sure most of you agreed on Context2. Well, why? Is it because of the words/phrases ‘bank’, ‘before next weekend’, or ‘need urgent money’? Well, I guess we all know. If you look closely, the word ‘bank’ is present twice in Context1 and once in Context2, and the word ‘money’ is not really present in either of them. But it’s the ‘need for money’ and the sentence structure of Context2 that make it clear Context2 talks about a bank as a financial institution. But how does an LLM know this? By understanding the contextual similarity between the user query and the contexts. A simple keyword search would return Context1 as the better result because ‘bank’ appears twice in it. However, LLMs use embedding models that are trained on huge corpora of data. When we pass a text block through these models, we get a vector representation that captures its context: the model uses the position of the words and looks at every word in the text block to identify its context.
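If you want to see this contextual matching in action, here is a minimal sketch using the sentence-transformers library. The model name is just one common choice, and the exact scores you get will depend on the model; we simply embed the query and both contexts and compare them with cosine similarity:

```python
# A minimal sketch of semantic similarity with an embedding model,
# assuming the sentence-transformers library; the model is one common choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "When should I go to bank before next weekend? I need urgent money."
contexts = [
    "The Colorado River is really pretty. There are tons of bars on its bank! "
    "People say the river walk on its bank is awesome. But it's closed on Fridays!",
    "Banks are generally open from Monday to Saturday. However, next week, "
    "Tuesday is a bank holiday.",
]

query_emb = model.encode(query, convert_to_tensor=True)
context_embs = model.encode(contexts, convert_to_tensor=True)

# Higher cosine similarity means the context is semantically closer to the query.
scores = util.cos_sim(query_emb, context_embs)[0]
for ctx, score in zip(contexts, scores):
    print(f"{float(score):.3f}  {ctx[:60]}...")
```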

But do we always query with sufficient context? What if our query is open-ended and indeed needs diverse contexts to be answered? What exactly is the best context to answer our query?

Believe it or not, most of the time our queries do not carry the entire context. That’s when effective retrieval becomes key. We can use LLMs to augment the query with additional context so that the right documents are retrieved more reliably; this process is called Query Expansion. We can also use Metadata Filtering to narrow down the search space for retrieval. And if our query is too complex, we can break it down into sub-queries, perform retrieval step by step, and finally aggregate the contexts to respond — this process is called Query Transformation.
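As a rough illustration of Query Expansion, here is a hedged sketch in which an LLM rewrites the user query into a few search-friendly variants before retrieval. The call_llm function is a hypothetical placeholder for whatever chat or completion API you use:

```python
# A hedged sketch of Query Expansion: ask an LLM to rewrite the user query
# into several search-friendly variants, then retrieve with all of them.
# `call_llm` is a hypothetical placeholder for your LLM API call.

def expand_query(user_query, call_llm, n_variants=3):
    prompt = (
        f"Rewrite the following question into {n_variants} alternative search "
        f"queries that capture its intent, one per line:\n\n{user_query}"
    )
    variants = [line.strip() for line in call_llm(prompt).splitlines() if line.strip()]
    # Retrieve with the original query plus each variant, then merge the results.
    return [user_query] + variants[:n_variants]
```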

Sometimes, retrieving too few documents can lead to a problem called the Recall Ceiling, i.e. the quality of the generation is constrained by the quality and quantity of the retrieval. So, why not retrieve a large number of documents? Well, providing all of them to the LLM would create very large contexts, which again leads us to the lost-in-the-middle problem. They said it truly — “The End is the Beginning and the Beginning is the End!” (Dark). Well, we can use a neural re-ranker like ColBERT to re-rank the large list of retrieved results. And finally, if our query is too open-ended, we might miss key documents if we only look for semantically similar ones, so why not also use traditional keyword retrievers like BM25? We can, and we call that method Ensemble Retrieval. We will discuss all of these key techniques in the blogs to follow.
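To give a flavour of Ensemble Retrieval, here is a rough sketch that fuses BM25 keyword rankings with dense embedding rankings via reciprocal rank fusion. It assumes the rank_bm25 and sentence-transformers libraries, and the fusion constant is an illustrative default, not a tuned value:

```python
# A rough sketch of Ensemble Retrieval: fuse BM25 (keyword) and dense
# (embedding) rankings with reciprocal rank fusion. Assumes rank_bm25 and
# sentence-transformers; constants are illustrative only.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def ensemble_retrieve(query, docs, model, top_k=5, k_rrf=60):
    # Keyword ranking with BM25.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    bm25_scores = bm25.get_scores(query.lower().split())
    bm25_rank = sorted(range(len(docs)), key=lambda i: -bm25_scores[i])

    # Semantic ranking with embeddings.
    doc_embs = model.encode(docs, convert_to_tensor=True)
    q_emb = model.encode(query, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, doc_embs)[0]
    dense_rank = sorted(range(len(docs)), key=lambda i: -float(sims[i]))

    # Reciprocal rank fusion: documents ranked well by either retriever win.
    scores = {}
    for rank_list in (bm25_rank, dense_rank):
        for rank, idx in enumerate(rank_list):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (k_rrf + rank + 1)
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [docs[i] for i in best]

# Usage (hypothetical):
# model = SentenceTransformer("all-MiniLM-L6-v2")
# top_docs = ensemble_retrieve("bank opening hours", corpus, model)
```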

But can we guarantee that all relevant documents will be retrieved by doing all this?

Unfortunately, No! That’s where Agentic RAG swoops in — cue the drumroll, please! Imagine you’re searching through multiple books, trying to crack a tough assignment. You dive into the first book, find the relevant topic, and read through it eagerly… but nope, no solution there. So, you move on to the second book. You skim through the familiar sections and check out a few new ones, but still no luck. Determined, you reach for the third book, and voilà! You finally find the approach you need! Now, did you notice something? After each book, you made a decision to move on to the next one because the answer wasn’t in the current book.

That’s exactly what Agentic RAG mimics — it recognizes when the current context isn’t sufficient to answer a query and knows when to dig deeper and fetch more context.
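Here is a very rough sketch of that decision loop, with retrieve and call_llm as hypothetical placeholders for your retriever and LLM call. The point is only the control flow: retrieve, check whether the context looks sufficient, and either answer or keep digging:

```python
# A rough sketch of an agentic retrieval loop. `retrieve` and `call_llm`
# are hypothetical placeholders for your retriever and LLM API.

def agentic_answer(query, sources, retrieve, call_llm, max_rounds=3):
    context = ""
    for source in sources[:max_rounds]:
        # "Open the next book": pull more context from the next source.
        context += "\n".join(retrieve(query, source)) + "\n"
        check = call_llm(
            "Can the question be fully answered from this context? Reply YES or NO.\n"
            f"Context:\n{context}\nQuestion: {query}"
        )
        if check.strip().upper().startswith("YES"):  # enough evidence found
            break                                    # otherwise, try the next source
    return call_llm(f"Answer using only this context:\n{context}\nQuestion: {query}")
```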

Pretty smart, right? There are frameworks like Self-RAG that enable this kind of agentic behavior. But let’s not get ahead of ourselves — Self-RAG deserves a tutorial all on its own, which I’ll be covering in future blogs.

Wait, that’s a lot of theory! Just tell me quickly — why do we need Citations? Aren’t we doing all this other stuff already?

Alright, let’s get straight to the good stuff — if you handle Storage and Retrieval efficiently, citations won’t be a headache! But yes, we absolutely need citations. In fact, citations are probably the single most important element in building a system that’s faithful and trustworthy. (If only we could design humans this way too!)

Citations not only provide users with the exact sources from which the responses are generated, but they also help tackle the Hallucination Problem in LLMs. By forcing the model to cite its sources, we create effective guardrails that keep it from wandering too far off track. And don’t worry, the proper use of citations ties into all the other components of RAG, which we’ll be diving into more in upcoming blogs.
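One simple, commonly used way to encourage this behavior is to label each retrieved chunk with an ID and instruct the model to cite those IDs inline. Here is a hedged sketch; call_llm is again a hypothetical placeholder for your LLM call:

```python
# A hedged sketch of prompting for citations: number each retrieved chunk
# and ask the model to cite those numbers after every claim.
# `call_llm` is a hypothetical placeholder for your LLM API.

def answer_with_citations(query, chunks, call_llm):
    # Label every chunk so the model has something concrete to cite.
    numbered = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer the question using ONLY the sources below. After every claim, "
        "cite the supporting source like [1] or [2]. If the sources do not "
        "contain the answer, say so instead of guessing.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {query}"
    )
    return call_llm(prompt)
```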

Hey! Are you cheating? You just missed the most important part — “Response Generation”. Did you do that on purpose?

Yes, it might be the most important part as the final product, but as I stated earlier, “You reap what you sow!” The final response is only as good as all the other components of the RAG framework. And truth be told, given a set of contexts and a query, LLMs are already very good, sometimes too good, at answering the query in the most coherent manner possible. So, we will see how different RAG frameworks can generate different responses and how we can incrementally improve our final response.

We can broadly classify the RAG frameworks into the following three categories —

Fig.2. (Left) Iterative retrieval involves alternating between retrieval and generation, allowing for richer and more targeted context from the knowledge base at each step. (Middle) Recursive retrieval involves gradually refining the user query and breaking down the problem into sub-problems, then continuously solving complex problems through retrieval and generation. (Right) Adaptive retrieval focuses on enabling the RAG system to autonomously determine whether external knowledge retrieval is necessary and when to stop retrieval and generation [1]

We’ll be diving into all these RAG frameworks — with code! — in our upcoming blogs. I know this has been a long, theory-heavy post. Trust me, I wanted to jump straight into the code too! But writing all of this turned out to be way harder than just whipping up some code examples. Still, I realized that understanding when and why to build an advanced RAG system is crucial before diving into the how.

So, if you’ve scrolled all the way down looking for code and are a bit miffed that it’s not here yet, I encourage you to head back and check out the “When” and “Why” first. If you have read the post, thank you! I hope you now understand why RAG is here to stay and why we should go beyond basic RAG. I promise, the “How” is coming soon — and I’ll be bringing the code with me! Until then, keep learning, keep growing, and maybe give me a follow if you enjoyed the content and want to know the “How” inside and out!
