Some Dynamic RAG Implementation = Non-Hallucinating Fine-Tuned Models (?)

Madhur Prashant
8 min read · Sep 21, 2023


Purpose

In my previous blogs, I have talked about how Retrieval Augmented Generation (RAG) helps a model grab information from the context as needed, based on the prompt, and respond "only if the answer is given in the context"; if it is not, RAG helps the model avoid hallucinating and be more reliable. The purpose of this blog is a short dive into the applications where RAG is not a good fit, and then an introduction to the concept of dynamic RAG, with some code walkthroughs. As a prerequisite to this blog, read my previous blogs on the concepts of RAG, LLMs, LangChain, and other basic generative AI concepts. Now, without further ado, let's get into it.

NOTE: I work at AWS, but the thoughts and ideas on these blogs are my own.

Usually, across the different LLM use cases we have worked with, we have a couple of options. Firstly, we can use the following structures to shape these tasks:

  1. Using an agent to perform a certain task, which could be QA, summarization, chain-of-thought reasoning, or conversational tasks.
  2. We could use a pipeline to represent complex flows and scenarios that run in parallel. This could be, for example, a startup that uses generative AI applications to perform several different tasks, such as a product that helps you learn a language.
  3. Lastly, there are workflows, where we group these pipelines together into structured, more organized pieces of LLM functionality. Let's take a look at some examples.

Before going into the nitty-gritty, a quick summary of RAG: it enables developers to retrieve and vectorize data and use it to augment the query that the user sends to the LLM. With this technique in mind, take a look at the solution architecture below:

Here, there is some sort of query, so we first pull the data from the data source and generate embeddings, which are vector representations of the textual data. The text is stored as chunks, the resulting vectors go into a vector store, and we use that store to find the documents most relevant to the user's query. Once this is done, the query pulls out only the most relevant information from those chunks. Pretty fascinating, right? The key point is that we are essentially augmenting the query process.

I think this is just the prototyping phase, and once we find ways to reduce the amount of memory and compute used, we can push generative AI applications much further and really have human-aligned 'bots'. I studied a double major in computer science and cognitive psychology, and knowing how important linguistics is in human understanding, perception, and generation, having something like this in the realm of generative AI is definitely a step up, but there is so much more we can do with it. I will save all of that for later; let's move on.
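To make the flow above concrete, here is a minimal sketch of the chunk, embed, store, and retrieve loop in plain Python. The toy hashing-based embed_text is just a stand-in for a real embedding model, and the chunk size, vector dimension, and top-k value are arbitrary choices for illustration.

import numpy as np

def embed_text(text: str, dim: int = 256) -> np.ndarray:
    # Toy embedding: hash each token into a fixed-size bag-of-words vector.
    # In a real pipeline this would be a call to an embedding model endpoint.
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def chunk(document: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually split on sentences or paragraphs.
    return [document[i:i + size] for i in range(0, len(document), size)]

def build_vector_store(documents: list[str]) -> list[tuple[str, np.ndarray]]:
    # "Vector store": a plain list of (chunk, embedding) pairs kept in memory.
    store = []
    for doc in documents:
        for piece in chunk(doc):
            store.append((piece, embed_text(piece)))
    return store

def retrieve(query: str, store: list[tuple[str, np.ndarray]], k: int = 3) -> list[str]:
    # Rank every stored chunk by cosine similarity to the query vector and return
    # the top k chunks, which are then used to augment the LLM prompt.
    q = embed_text(query)
    def score(item):
        _, vec = item
        return float(np.dot(q, vec) / (np.linalg.norm(q) * np.linalg.norm(vec) + 1e-9))
    return [text for text, _ in sorted(store, key=score, reverse=True)[:k]]

The chunks that come back are simply prepended to the user's question, so the model answers from that retrieved context rather than from whatever it memorized during training.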

Is RAG a prototyping phase? What is wrong with it?

Firstly, there is nothing wrong with RAG, but yes, I agree that it is only at the prototyping stage. It is essential to mention three main aspects that are ongoing challenges with RAG, and how we can solve them using ~Dynamic RAG~.

  1. Once we have this data from any sort of context, and we embed it, vectorize it, and store it in chunks, what is the first thing that comes into play? I think about how much "memory" it takes to work efficiently, and how we can keep processing this data again and again. There is a cost associated with this, which is inevitable. But something to consider here is: even if the cost stays pretty much the same, how can we optimize this process?
  2. Storing this data, using it, and referring to whatever knowledge base your company might have is limited, so memory comes into play too. It is interesting, you know: I have thought of having my knowledge base be something like an Amazon S3 bucket, where we could set lifecycle policies on it and then retrieve data from it. That would be pretty efficient, right? You could offload the historical data, delete it automatically using lifecycle policies, and avoid consuming all of the knowledge base memory (a minimal sketch of such a lifecycle rule follows this list). Anyway, I am already starting to touch on dynamic RAG; before that, let's look at the third and most obvious challenge:
  3. Data is not ALWAYS new: As you might have guessed, we can take a startup as an example here. Say a company 'XYZ' hosts a web server with some information and needs to personalize it for users using RAG. You need access to the customers' clickstream data for this, but you don't need that data all the time; it is eventually going to be outdated. For this, we introduce ~Dynamic RAG~.
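Since point 2 leans on S3 lifecycle policies, here is a minimal sketch of what such a rule could look like with boto3. The bucket name and prefix are made up, and the 90-day expiration window is an arbitrary choice.

import boto3

s3 = boto3.client("s3")

# Hypothetical knowledge-base bucket: expire objects under the "kb/" prefix
# after 90 days so stale context never reaches the retriever.
s3.put_bucket_lifecycle_configuration(
    Bucket="xyz-knowledge-base",  # made-up bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-stale-kb-data",
                "Filter": {"Prefix": "kb/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            }
        ]
    },
)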

Regardless of which approach you use, either one is going to consume memory and cost money. In the case of RAG, your business is always changing and the market is always growing, so we need a dynamic solution, not a static one. This is where dynamic RAG comes into play:

Dynamic RAG: Grow with your dynamic business needs

Before moving forward with some code walkthroughs, here is what Dynamic RAG does to solve the challenges above:

  1. Let's take the example of a RAG bot that helps you learn a language through a website, or a large knowledge base of linguistic information that is always changing. In this scenario, you cannot use plain RAG because it is static, the memory and compute are limited, and the information on the website keeps changing. For this use case, Dynamic RAG works by vectorizing and caching the data on that website in real time, and chain-of-thought prompting over that fresh content is what makes this possible. Only models that are highly trained and large can do this, for example Anthropic Claude and GPT.

Some quick words on how GPT-4 and Anthropic Claude assist with chain of thought to make dynamic RAG a possibility:

Chain of Thought: "chain-of-thought (CoT) prompting [1] is a recently-proposed technique that improves LLM performance on reasoning-based tasks via few-shot learning. Similar to standard prompting techniques, CoT prompting inserts several example solutions to reasoning problems into the LLM's prompt."
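To make that concrete, here is a sketch of what a few-shot chain-of-thought prompt could look like for our language-learning bot. The worked example and the follow-up question are invented purely for illustration.

# A few-shot chain-of-thought prompt: the worked example spells out its
# reasoning step by step, nudging the model to reason before answering
# the new question at the end.
cot_prompt = """Q: The Spanish verb 'hablar' is regular. What is the 'yo' form in the present tense?
A: 'hablar' ends in -ar, so the stem is 'habl-'.
   Regular -ar verbs take -o in the first person singular.
   Therefore the 'yo' form is 'hablo'.

Q: The Spanish verb 'comer' is regular. What is the 'yo' form in the present tense?
A:"""

# This prompt (plus any retrieved context) is what gets sent to the LLM,
# which then continues the same reasoning pattern for the new question.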

  1. Dynamic RAG works with dynamic data, right? I talked a little about S3, so let's take that as our knowledge base, but with limited capacity. In this case, we only want to use small amounts of ever-changing data, so dynamic RAG can work with that data and give responses that are up to date and usable, while also not taking up too much compute (see the sketch after this list). This would be efficient for our startup idea of a bot that helps you learn languages, because it supports conversational techniques.
  2. Lastly, it builds off prompting patterns; let's talk more about this.
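As a rough sketch of the first point, here is one way to pull only the S3 objects that changed since the last refresh and re-embed just those, instead of reprocessing the whole knowledge base on every run. The bucket and prefix are hypothetical, and embed_text refers back to the earlier RAG sketch.

import boto3
from datetime import datetime, timedelta, timezone

s3 = boto3.client("s3")

def refresh_changed_chunks(bucket: str, prefix: str, since_hours: int = 24):
    # Re-embed only the objects modified within the lookback window,
    # so the vector store tracks the ever-changing data cheaply.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=since_hours)
    fresh = []
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            text = body.decode("utf-8", errors="ignore")
            fresh.append((obj["Key"], embed_text(text)))  # embed_text from the earlier sketch
    return fresh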

Architectural Proposal: Dynamic RAG

Usually, in the case of Dynamic RAG, there is no data preprocessing. The client goes to the agent and prompts it using chain-of-thought prompts, so there is some reasoning in the context that the LLM can use to act on that specific prompt. The chain of thought then drives the agent to execute the program, pull the content from the dynamically changing data source, store it in memory, have that memory vectorized, and return an updated, accurate response.
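One way to picture this flow is a small freshness cache in front of the retriever: if the cached vectors for a page are older than some time-to-live, the page is fetched and re-embedded before the query is answered. This is a minimal sketch only; the TTL value, the simplified fetch_page helper, and the placeholder LLM call are assumptions, and chunk, embed_text, and retrieve come from the earlier RAG sketch.

import time
import urllib.request

# Maps a URL to (fetched_at, vector_store) so stale pages are re-scraped
# and re-embedded before they are used to answer a query.
_cache = {}
TTL_SECONDS = 600  # assumed 10-minute freshness window

def fetch_page(url: str) -> str:
    # Simplified fetch; a real scraper would strip HTML and boilerplate.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def get_fresh_vectors(url: str):
    entry = _cache.get(url)
    if entry is None or time.time() - entry[0] > TTL_SECONDS:
        text = fetch_page(url)
        # chunk() and embed_text() come from the earlier RAG sketch.
        vectors = [(piece, embed_text(piece)) for piece in chunk(text)]
        _cache[url] = (time.time(), vectors)
    return _cache[url][1]

def answer(query: str, url: str) -> str:
    store = get_fresh_vectors(url)
    context = retrieve(query, store)  # retrieve() from the earlier RAG sketch
    # Placeholder for the actual LLM call (Claude, GPT, etc.).
    return f"LLM answer for '{query}' using context: {context}"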

This changes based on the data. The agent could help with querying or with passing results to another model or tool, and you can build many different pipelines and architectures on top of this. The point is, you need to implement a solution like this if your business use case is ever-changing. In a medical use case, where the data remains static for a while, you would not need it.

Code Snippets/Walkthroughs: Dynamic RAG

from decouple import config
from startup import utils
from startuplibrary.structures import Agent
from startuplibrary.tools import WebScraper, WebSearch

# Web search tool pointed at the linguistics website, configured with
# API credentials pulled from the environment.
web_search = WebSearch(
    linguistics_website=config("API_KEY"),
    linguistics_website_search=config("API_KEY_SEARCHID"),
)

# Web scraper tool to pull the content of the pages the search tool finds.
web_scraper = WebScraper()

# Agent wired up with both tools.
agent = Agent(
    tools=[web_search, web_scraper],
)

# Start an interactive chat session with the agent.
utils.Chat(agent).start()

Now, in this case, we use a (hypothetical) startup library for our use case: we import the web search tool, the web scraper, and an agent that holds those two tools. After this, we start a chat session and run our bot.

Now we could ask the bot to access the linguistics website and then ask questions based on it, getting all of the updated information as required. We use the web search tool to search the website and the scraper to pull the content of the pages it finds.

Here, we get the search results and return them to the LLM (only in the case where we get too little information), and then the LLM uses the web scraper to extract the most relevant information from those pages.

The point is, not a lot of companies and startups are implementing this yet, but it is just a matter of time before dynamic RAG becomes the go-to way to satisfy dynamic business use cases.

Conclusion

Now that we have discussed this, our next question should be: if we have so much context, and we can use Dynamic RAG to pull out that information and get the relevant results, how much of 'fine-tuning a model' does it replace?
I think this depends on the use case, and if needed, we can do a full walkthrough of dynamic RAG in the next blog. I find all of this fascinating, and the way things are evolving, it is only a matter of time before we can handle these tasks efficiently with models that are reliable, non-hallucinating, and helpful with dynamically changing data.

To be discussed: maybe implementing a bit of dynamic RAG together with fine-tuning of large, capable models like GPT and Claude will remove hallucinations? Let's see!

