Starting to Learn Agentic RAG

Sandeep Shah
10 min read · Jul 25, 2024


SHORT SUMMARY—

I don't typically write coding tutorials, but I enjoy sharing small discoveries and snippets of code that you can build upon. So, consider this more of a blog or article with some code snippets. While learning agentic RAG, I ran into trouble with the advanced strategies because one function only worked with Llama3-70B. After a lot of trial and error, I managed to get the function working with Llama3-8B, and this post is all about that. It demonstrates how taking extra steps, like updating the PromptTemplate (a form of prompt engineering), can help us solve use cases with smaller models. I won't delve deeply into the accuracy of the results, as my primary focus was getting the code to run as shown in the tutorial.
I also attempted to compare Llama2–7B and Llama3–8B, but the results were not always reproducible. Therefore, in this post, I am focusing on Llama3–8B and 70B.

Code — https://github.com/SandyShah/llama_index_experiments/blob/main/ollama_try_post.ipynb

MAIN ARTICLE

Link to the code/tutorial that I am trying to run on my setup — link

It has been months since I last did any substantial learning on LLMs. One of the major reasons was the LlamaIndex upgrade to version 0.10, which I struggled to install on my system. Finally, after a lot of procrastination and reinstallation, I managed to get the new version working and immediately wanted to start with agents, especially agents for RAG.

I have said it many times, and I repeat: there are many tutorials out there, but most of them use OpenAI APIs or other paid versions. Some tutorials do use open-source or free APIs, but that requires sending data outside and getting a response back. If you’ve been following my other posts, you know I like to keep things as local as possible.

DeepLearning.AI and LlamaIndex have released good resource material on open-source agents for RAG, which I have been exploring. Additionally, I made another change: I used to load GGUF models using LlamaCpp, but now I use the Ollama library to make direct calls. This switch has reduced the effort required to set up LLMs on my system and manage local copies of the models.
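For context, here is roughly what that switch looks like. This is only a sketch: the GGUF path in the commented-out LlamaCPP call is a placeholder, not my exact setup, and the Ollama call is the same one that appears again in the walkthrough below.

# Earlier approach: load a local GGUF file through llama-cpp
# (model_path is a placeholder; point it at wherever your GGUF lives)
# from llama_index.llms.llama_cpp import LlamaCPP
# llm = LlamaCPP(model_path="models/llama-3-8b-instruct.Q5_K_M.gguf", context_window=4096)

# Current approach: point at a model that Ollama pulls and serves locally
from llama_index.llms.ollama import Ollama
llm = Ollama(model="llama3:8b-instruct-q5_K_M", request_timeout=120.0)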

I thought I would use the tutorials and Ollama to create something and post it, but then I stumbled upon some ready-made code that perfectly suited my needs. While running it, I got stuck on the SubQuestionQueryEngine. I was initially trying to make the code work as-is before modifying it for my own documents, and that's where I hit a roadblock. The advanced strategies worked with Llama3-70B, a pretty hefty model. In fact, the tutorial even mentioned that it was designed for 70B. This was a concern for me, and when I tried Llama3-8B, it failed. I eventually figured out a way to make it work by modifying the prompt to match the Llama3 template. This post is all about that one small modification.

CODE WALKTHROUGH

The first part involves loading the model and running a basic query to make sure everything is working fine.

from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

llm = Ollama(model="llama3:8b-instruct-q5_K_M", request_timeout=120.0)

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

Settings.llm = llm
Settings.embed_model = embed_model

response = llm.complete("do you like drake or kendrick better?")

print(response)



-- OUTPUT --

I'm just an AI, I don't have personal preferences or opinions, but I can tell you that both Drake and Kendrick Lamar are highly acclaimed artists with their own unique styles and contributions to the music industry.

Drake is known for his melodic flow, introspective lyrics, and ability to blend hip-hop with R&B and pop. He has had a string of successful albums and singles, including "God's Plan," "One Dance," and "In My Feelings."

Kendrick Lamar, on the other hand, is celebrated for his socially conscious lyrics, storytelling ability, and genre-bending sound, which blends hip-hop with jazz, funk, and spoken word. He has released critically acclaimed albums like "Good Kid, M.A.A.D City," "To Pimp a Butterfly," and "DAMN."

Both artists have been praised by critics and fans alike for their innovative approach to music and their ability to push the boundaries of hip-hop as an art form.

Ultimately, whether you prefer Drake or Kendrick depends on your personal taste in music. If you enjoy melodic, introspective hip-hop with a focus on relationships and personal growth, you might prefer Drake. If you appreciate socially conscious, lyrically dense hip-hop that tackles complex themes like racism, inequality, and self-discovery, you might prefer Kendrick.

But remember, both artists are highly talented and have made significant contributions to the music industry. You can't go wrong with either one!
  • Getting the Data and Loading it in the Current Session
# Sharing the links to get the data - this is as per the tutorial I followed.
# !mkdir data
# !wget "https://www.dropbox.com/scl/fi/t1soxfjdp0v44an6sdymd/drake_kendrick_beef.pdf?rlkey=u9546ymb7fj8lk2v64r6p5r5k&st=wjzzrgil&dl=1" -O data/drake_kendrick_beef.pdf
# !wget "https://www.dropbox.com/scl/fi/nts3n64s6kymner2jppd6/drake.pdf?rlkey=hksirpqwzlzqoejn55zemk6ld&st=mohyfyh4&dl=1" -O data/drake.pdf
# !wget "https://www.dropbox.com/scl/fi/8ax2vnoebhmy44bes2n1d/kendrick.pdf?rlkey=fhxvn94t5amdqcv9vshifd3hj&st=dxdtytn6&dl=1" -O data/kendrick.pdf


from llama_index.core import SimpleDirectoryReader

docs_kendrick = SimpleDirectoryReader(input_files=["data/kendrick.pdf"]).load_data()
docs_drake = SimpleDirectoryReader(input_files=["data/drake.pdf"]).load_data()
docs_both = SimpleDirectoryReader(input_files=["data/drake_kendrick_beef.pdf"]).load_data()
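A quick, optional sanity check that the PDFs actually loaded. SimpleDirectoryReader returns a list of Document objects (roughly one per page for PDFs); the exact counts will depend on the files you downloaded.

# Number of Document objects produced from each PDF
print(len(docs_drake), len(docs_kendrick), len(docs_both))

# Peek at the beginning of the first page's text
print(docs_drake[0].text[:200])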
  • Creating Vector Indexes
    These are the basic steps in a RAG flow: you read documents, chunk and embed them, and then search over them. You can also go through some of my other articles (linked at the end) to learn about metadata filters in RAG, query augmentation, etc. There are plenty of other tutorials and blogs covering the basics too.

from llama_index.core import VectorStoreIndex

drake_index = VectorStoreIndex.from_documents(docs_drake)
drake_query_engine = drake_index.as_query_engine(similarity_top_k=3)

kendrick_index = VectorStoreIndex.from_documents(docs_kendrick)
kendrick_query_engine = kendrick_index.as_query_engine(similarity_top_k=3)
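Before wrapping anything into tools, it can be worth firing one plain RAG query at an engine just to confirm retrieval works end to end. The question below is only an example.

# Plain RAG query against the Drake index, no agents or sub-questions yet
response = drake_query_engine.query("What is Drake's most recent album?")
print(response)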
  • Tool Creation
    Here we simply wrap the query engines over those indexes as tools. You can read up on function calling to see what tools and agents are, what they can be, and the different ways to integrate them into your workflow. In this post I assume you already have some idea about agents, so I will proceed.
from llama_index.core.tools import QueryEngineTool, ToolMetadata

drake_tool = QueryEngineTool(
    drake_index.as_query_engine(),
    metadata=ToolMetadata(
        name="drake_search",
        description="Useful for searching over Drake's life.",
    ),
)

kendrick_tool = QueryEngineTool(
    kendrick_index.as_query_engine(),
    metadata=ToolMetadata(
        name="kendrick_summary",
        description="Useful for searching over Kendrick's life.",
    ),
)
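If you want to see exactly what the LLM will be shown when it decides which tool to call, you can print each tool's metadata. These are the same name and description strings we just set above.

# The name/description pairs are what the sub-question generator sees
for tool in (drake_tool, kendrick_tool):
    print(tool.metadata.name, "->", tool.metadata.description)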
  • MAIN QUERY ENGINE CREATION
    This is the crux of the current post. The SubQuestionQueryEngine generates additional sub-questions related to our query, which helps retrieve more relevant documents and answer the main question better.

Source — https://docs.llamaindex.ai/en/stable/examples/query_engine/sub_question_query_engine/
It first breaks the complex query down into sub-questions for each relevant data source, then gathers all the intermediate responses and synthesizes a final response.

We provide the llm as an argument, and we will try different LLMs and look at the responses. Again, I am not going into details of accuracy; I will focus on whether the query engine executes or fails.

from llama_index.core.query_engine import SubQuestionQueryEngine

query_engine = SubQuestionQueryEngine.from_defaults(
    [drake_tool, kendrick_tool],
    llm=llm,
    verbose=True,
)
  • I would like to jump directly into the query engine. Let us see how it looks.
[print(i) for i in query_engine.get_prompts().keys()]


-- OUTPUT --
question_gen:question_gen_prompt
response_synthesizer:text_qa_template
response_synthesizer:refine_template

So the query engine has three prompts; let us glance through them once. This makes the next part of the blog a bit cluttered, but we need to go through it.

print(query_engine.get_prompts()['question_gen:question_gen_prompt'].template)


-- OUTPUT --
Given a user question, and a list of tools, output a list of relevant sub-questions in json markdown that when composed can help answer the full user question:

# Example 1
<Tools>
```json
{{
"uber_10k": "Provides information about Uber financials for year 2021",
"lyft_10k": "Provides information about Lyft financials for year 2021"
}}
```

<User Question>
Compare and contrast the revenue growth and EBITDA of Uber and Lyft for year 2021


<Output>
```json
{{
"items": [
{{
"sub_question": "What is the revenue growth of Uber",
"tool_name": "uber_10k"
}},
{{
"sub_question": "What is the EBITDA of Uber",
"tool_name": "uber_10k"
}},
{{
"sub_question": "What is the revenue growth of Lyft",
"tool_name": "lyft_10k"
}},
{{
"sub_question": "What is the EBITDA of Lyft",
"tool_name": "lyft_10k"
}}
]
}}
```

# Example 2
<Tools>
```json
{tools_str}
```

<User Question>
{query_str}

<Output>

If you observe, in this prompt we ask the LLM to generate relevant sub-questions and, for each question, to tell us which tool can be used to answer it. The output format is critical, because this output is passed on to the next part of the query engine, and all of this happens automatically behind the scenes.

Example output format for our use case (note that the tool_name values must match the names registered in the ToolMetadata above); a minimal parsing sketch follows the example.

{
  "items": [
    {
      "sub_question": "Songs by Kendrick",
      "tool_name": "kendrick_summary"
    },
    {
      "sub_question": "Songs by Drake",
      "tool_name": "drake_search"
    }
  ]
}
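To make it concrete why this format matters, here is a simplified stand-in for what the engine's output parser has to do; this is only an illustration, not LlamaIndex's actual parser. It strips the markdown fence and loads the JSON, and if either step fails you get exactly the kind of OutputParserException shown later in this post.

import json

# A well-formed response from the question generator (illustrative)
raw = '''```json
{
  "items": [
    {"sub_question": "Songs by Kendrick", "tool_name": "kendrick_summary"},
    {"sub_question": "Songs by Drake", "tool_name": "drake_search"}
  ]
}
```'''

# Strip the ```json fence, parse the JSON, and read out the sub-questions
payload = raw.strip().removeprefix("```json").removesuffix("```")
for item in json.loads(payload)["items"]:
    print(item["tool_name"], "->", item["sub_question"])

Now, the second prompt: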
print(query_engine.get_prompts()['response_synthesizer:text_qa_template'].get_template())


-- OUTPUT --
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer:

The one above is a SelectorPromptTemplate; I am only showing part of it, and I advise you to dig into it yourself if required. I won't be modifying it in this post, so I am not going deeper. The last one is much clearer: we use the template above to generate an answer, then use the refine template below to further improve that answer.

print(query_engine.get_prompts()['response_synthesizer:refine_template'].get_template())

-- OUTPUT --

The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer (only if needed) with some more context below.
------------
{context_msg}
------------
Given the new context, refine the original answer to better answer the query. If the context isn't useful, return the original answer.
Refined Answer:
  • Example of how to use the above query engine and expected output.

response = query_engine.query("Which albums did Drake release in his career?")
print(response)


-- OUTPUT --

Generated 1 sub questions.
[drake_search] Q: What are the albums released by Drake?
[drake_search] A: Academy, So Far Gone, Care Package, Dark Lane Demo Tapes, Certified Lover Boy.
Academy, So Far Gone, Care Package, Dark Lane Demo Tapes, and Certified Lover Boy.

MAIN CRUX: all of the above works as-is if I use the following models:

  1. llama3:70b-instruct-q2_K (approx. 26 GB)
  2. llama3:8b-instruct-q8_0 (approx. 8.5 GB), only if you have a big enough GPU, say 8 GB or more.

Now I wanted to try a smaller model. I can't afford a 26 GB model, especially since, to answer a single query, the engine makes three LLM calls:
1. Generate the sub-questions
2. Generate an answer
3. Refine the answer

llama3:8b-instruct-q8_0 was not entirely reliable either; it would fail at times on a smaller-GPU system. So I dug into the problem and figured out that the prompt template can be modified to suit Llama3. The 70B model is powerful, no doubt, and can make sense of a different format, but I wanted to test whether I could tweak the input prompt and still run the engine and get some sort of answer. I am a strong believer in and supporter of small to medium models; I feel we should focus more on making the input context as good and relevant as possible. Anyway, so I modified the sub-question generation prompt. Before that, how did I know that modifying the prompt template could help? Let us look at the error message first, and then I will show what I did.

I will use an even smaller quantized model for this: llama3:8b-instruct-q5_K_M (approx. 5.7 GB).

  • THIS will FAIL —
llm = Ollama(model='llama3:8b-instruct-q5_K_M', request_timeout=240.0)

Settings.llm = llm

query_engine = SubQuestionQueryEngine.from_defaults(
    [drake_tool, kendrick_tool],
    llm=llm,
    verbose=True,
)

response = query_engine.query("Which albums did Drake release in his career?")

print(response)



-- PART OF THE OUTPUT --

OutputParserException: Got invalid JSON object. Error: Extra data: line 2 column 5 (char 7) expected '<document start>', but found '<scalar>'
in "<unicode string>", line 2, column 5:
output["items"] = []
^. Got JSON string: {}
output["items"] = []

for tool in tools:

I saw it was a format error, then dug deeper and got to where I am now. I looked at the prompt for sub-question generation and tried different iterations. Finally, I converted the prompt into the Llama3 chat format (Llama3 format: here).

Below is the prompt string in Llama3 format. There may be scope to further improve and polish it, but for now this works. Along the way, you will also see how to modify the prompts for any template.

llama_3_prompt_str = (
'''<|begin_of_text|><|start_header_id|>system<|end_header_id|>Given a user question, and a list of tools, output a list of relevant sub-questions in 'json' markdown that when composed can help answer the full user question:
OUTPUT RESULT IN JSON FORMAT AS SHOWN IN EXAMPLE.

# Example 1

[Tools]
```json
{{
"uber_10k": "Provides information about Uber financials for year 2021",
"lyft_10k": "Provides information about Lyft financials for year 2021"
}}
```

[User Question]
Compare and contrast the revenue growth and EBITDA of Uber and Lyft for year 2021


[Output]
```json
{{
"items": [
{{
"sub_question": "What is the revenue growth of Uber",
"tool_name": "uber_10k"
}},
{{
"sub_question": "What is the EBITDA of Uber",
"tool_name": "uber_10k"
}},
{{
"sub_question": "What is the revenue growth of Lyft",
"tool_name": "lyft_10k"
}},
{{
"sub_question": "What is the EBITDA of Lyft",
"tool_name": "lyft_10k"
}}
]
}}
```

# Example 2
[Tools]
```json
{tools_str}
```

<|eot_id|><|start_header_id|>user<|end_header_id|>

[User Question]
{query_str}

[Output]<|eot_id|><|start_header_id|>assistant<|end_header_id|>'''
)
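Because of the doubled braces, this is still a normal Python format string with only tools_str and query_str as placeholders, so you can preview the fully rendered prompt before wiring it in. The tool JSON below is just a hand-written stand-in.

# The {{ }} pairs collapse to single braces, leaving valid JSON in the examples
preview = llama_3_prompt_str.format(
    tools_str='{"drake_search": "Useful for searching over Drake\'s life.", '
              '"kendrick_summary": "Useful for searching over Kendrick\'s life."}',
    query_str="Which albums did Drake release in his career?",
)
print(preview)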

Let us update the prompt template. Besides the prompt string itself, the PromptTemplate has other things in it, as seen below (most importantly the output parser), and I intend to keep them. So I copy the template object into a variable, modify its prompt string, and assign it back to our query engine.

query_engine.get_prompts()['question_gen:question_gen_prompt']


-- OUTPUT --

PromptTemplate(metadata={'prompt_type': <PromptType.SUB_QUESTION: 'sub_question'>}, template_vars=['tools_str', 'query_str'], kwargs={}, output_parser=<llama_index.core.question_gen.output_parser.SubQuestionOutputParser object at 0x000001E63F46CEB0>, template_var_mappings=None, function_mappings=None, template='Given a user question, and a list of tools, output a list of relevant sub-questions in json markdown that when composed can help answer the full user question:\n\n# Example 1\n<Tools>\n```json\n{{\n "uber_10k": "Provides information about Uber financials for year 2021",\n "lyft_10k": "Provides information about Lyft financials for year 2021"\n}}\n```\n\n<User Question>\nCompare and contrast the revenue growth and EBITDA of Uber and Lyft for year 2021\n\n\n<Output>\n```json\n{{\n "items": [\n {{\n "sub_question": "What is the revenue growth of Uber",\n "tool_name": "uber_10k"\n }},\n {{\n "sub_question": "What is the EBITDA of Uber",\n "tool_name": "uber_10k"\n }},\n {{\n "sub_question": "What is the revenue growth of Lyft",\n "tool_name": "lyft_10k"\n }},\n {{\n "sub_question": "What is the EBITDA of Lyft",\n "tool_name": "lyft_10k"\n }}\n ]\n}}\n```\n\n# Example 2\n<Tools>\n```json\n{tools_str}\n```\n\n<User Question>\n{query_str}\n\n<Output>\n')
  • UPDATING the Prompt: after this, our system will work with the smaller model as well.
llm = Ollama(model='llama3:8b-instruct-q5_K_M', request_timeout=240.0)

Settings.llm = llm

query_engine = SubQuestionQueryEngine.from_defaults(
    [drake_tool, kendrick_tool],
    llm=llm,
    verbose=True,
)

# Take the existing PromptTemplate so we keep its output parser and metadata,
# swap in the Llama3-formatted string, and hand it back to the engine.
A = query_engine.get_prompts()['question_gen:question_gen_prompt']
A.template = llama_3_prompt_str
query_engine.update_prompts(
    {"question_gen:question_gen_prompt": A}
)

response = query_engine.query("Which albums did Drake release in his career?")

print(response)


-- OUTPUT -- (not shown here; with the updated prompt, the same query now runs on the smaller model)

This is escalating quickly, so I will wrap up by sharing how the different models behaved with the different prompt templates. After this, you can experiment further: try other models and see which work as-is and which don't. There is plenty to do from here.

Performance of different Llama3 models and prompt templates

With this, I wrap up. I agree this is not comprehensive; you can reach out to me with any specific query or feedback, and I will try my best to address it. I look forward to seeing more work and experimentation with smaller models and prompt engineering.

Other related posts —
Advance RAG — Query Augmentation using Llama 2 and LlamaIndex
Enhancing Retrieval Augmented Generation with — ReRanker and UMAP visualization: Llama_Index and Llama 2
Exploring RAG Implementation with Metadata Filters — llama_Index
Langchain agents and function calling using Llama 2 locally
PandasDataFrame and LLMs: Navigating the Data Exploration Odyssey
