Exploring RAG Implementation with Metadata Filters — LlamaIndex

Sandeep Shah
7 min readMar 16, 2024


Hello friends! Welcome back.

Today, I am trying out another feature specifically for RAG implementation. RAG essentially involves using any pre-trained LLM on our own documents (which can also be sourced from the internet). I won’t delve into the details of RAG here. This post assumes that you have experimented a bit with LLMs and llama-index. Today, I will demonstrate how we can utilize the llama-index metadata filters. I will provide an example of how to create metadata and then explain how to use it in your queries.

Caution: I’m currently using llama-index version 0.9.34/48, and some of the features I intended to use do not work as expected. For instance, llama-index provides the capability to apply Filter A OR Filter B, but this appears to be non-operational: it always applies Filter A AND Filter B, so the OR operation is not working for me as of now. I’ll provide a clear example to illustrate this issue. I attempted to move to the latest release of llama-index, but I ran into problems and was unable to deploy it, so it is possible this has been addressed in the newer release. If you are already on the latest release (0.10.*), I would appreciate knowing whether the OR condition works correctly for you. Nevertheless, the current features are still valuable and worth exploring. In this post I will focus mainly on filtering and will not delve into prompt engineering or the accuracy of text generation. Additionally, I’ll discuss the potential of combining filtering with other concepts, such as agents and named entity recognition, to build an advanced RAG system.

As always, in this post, I will share code snippets and critical outputs. Towards the end, I will provide the link to the notebook containing all the steps, allowing you to delve further into exploration.

For this demonstration, I will once again use my blog post data. I have accumulated over 100 posts on Blogger.com (https://sandeeprshah.blogspot.com/) and scraped them using Python, saving each post as a text file. The filename serves as one of our metadata components; metadata provides additional details about our text, in this case the post title. Additionally, when saving the files, I include the year of publication as part of the filename, and we will use that as a second piece of metadata.

Creating Metadata

Example — for Exploring Airport Lounge__2010.pdf, here:

post_title = 'Exploring Airport Lounge'
post_year = '2010'

Below is the code to read the documents, extract this information, and save it as metadata.

# location of files
transcript_directory = r"Sandy_blogspot_pdf"

# Add post title and post year as metadata to each chunk associated with a document/transcript
filename_fn = lambda filename: {
    'post_title': os.path.splitext(os.path.basename(filename))[0].split('__')[0],
    'post_year': os.path.splitext(os.path.basename(filename))[0].split('__')[1],
}

documents = SimpleDirectoryReader(transcript_directory, filename_as_id=True,
                                  file_metadata=filename_fn).load_data()
documents[12].metadata
---sample output---
{'page_label': '1',
'file_name': 'Sandy_blogspot_pdf\\A decade of transition__2020.pdf',
'post_title': 'A decade of transition',
'post_year': '2020'}


documents[89].metadata
---sample output---
{'page_label': '1',
'file_name': 'Sandy_blogspot_pdf\\Inch by Inch__2018.pdf',
'post_title': 'Inch by Inch',
'post_year': '2018'}

You can see how easy it is to add metadata. You could store the sentiment of a document as metadata, or the frequency of certain words, or extract the names of the people involved and store those as metadata.
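For instance, a word-frequency variant is only a few lines. The `KEYWORDS` list and helper names below are made up for illustration; `keyword_metadata` has the same filename-in, dict-out shape that `file_metadata` expects:

```python
# Hypothetical extra metadata: counts of a few keywords per post.
KEYWORDS = ['marathon', 'ride', 'travel']

def keyword_counts(text):
    # Case-insensitive substring counts for each keyword.
    text = text.lower()
    return {f'count_{word}': text.count(word) for word in KEYWORDS}

def keyword_metadata(filename):
    # Same signature as filename_fn, so it could be passed as file_metadata.
    with open(filename, encoding='utf-8') as f:
        return keyword_counts(f.read())

print(keyword_counts('The marathon was long, but the marathon medal was worth it.'))
# → {'count_marathon': 2, 'count_ride': 0, 'count_travel': 0}
```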

Creating Chunks

You could use sentence-window chunking or other methods here too — I am not touching that part and will use simple size-based chunking.

# Exclude metadata from the LLM, meaning it won't read it when generating a response.
# Future - consider looping over documents and setting the id_ to basename, instead of fullpath
for document in documents:
    document.excluded_llm_metadata_keys.append('post_title')

parser = SimpleNodeParser.from_defaults(chunk_size=600, chunk_overlap=50)
pdf_nodes = parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes=pdf_nodes)
print('Number of documents:'+str(len(documents)))
print('Number of nodes:'+str(len(pdf_nodes)))
---Output---
Number of documents:196
Number of nodes:218

I have approximately 110 posts, but the SimpleDirectoryReader we used saves each page as a separate document. Some of my posts span multiple pages, resulting in multiple documents for a single post — hence 196 documents. Each page is then chunked separately, and depending on its content, a page may be split into more than one node, giving 218 nodes in total.
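If you want to see which posts account for the extra documents, you can group the per-page metadata by title. A minimal sketch with `Counter` — the sample metadata here is mocked, standing in for `[d.metadata for d in documents]`:

```python
from collections import Counter

# Each page becomes its own document, so multi-page posts repeat a title.
# Mock metadata standing in for [d.metadata for d in documents]:
doc_metadata = [
    {'post_title': 'A decade of transition', 'post_year': '2020'},
    {'post_title': 'A decade of transition', 'post_year': '2020'},
    {'post_title': 'Inch by Inch', 'post_year': '2018'},
]

pages_per_post = Counter(m['post_title'] for m in doc_metadata)
multi_page = {t: n for t, n in pages_per_post.items() if n > 1}
print(multi_page)
# → {'A decade of transition': 2}
```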

Different Methods of Filtering

Method 1: The code snippets below show filtering on an individual metadata key. I then show how we can combine filters on two different keys in the same filter.

filters = MetadataFilters(filters=[
    ExactMatchFilter(
        key="post_title",
        value='A decade of transition'
    ),
])

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running") # for this post - the prompt I give here is not critical

# printing out the metadata
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:A decade of transition, Year:2020
title:A decade of transition, Year:2020

filters = MetadataFilters(filters=[
    ExactMatchFilter(
        key="post_year",
        value='2013'
    ),
])

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running") # for this post - the prompt I give here is not critical

# printing out the metadata
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:Congrats Sandeep !!!, Year:2013
title:Night before the Ride, Year:2013

filters = MetadataFilters(filters=[
    ExactMatchFilter(
        key="post_year",
        value='2013'
    ),
    ExactMatchFilter(
        key="post_title",
        value='Night before the Ride'
    ),
])

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running") # for this post - the prompt I give here is not critical

# printing out the metadata
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:Night before the Ride, Year:2013

You can see that if we combine filters, the search space is reduced and we get to a specific document/chunk. Below I show the same filters with a slightly different, more recent style of implementation.

Method 2:

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="post_title", value="A decade of transition"),
    ],
)

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running") # for this post - the prompt I give here is not critical

# printing out the metadata
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:A decade of transition, Year:2020
title:A decade of transition, Year:2020

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="post_year", value="2013"),
    ],
)

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running") # for this post - the prompt I give here is not critical

# printing out the metadata
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:Congrats Sandeep !!!, Year:2013
title:Night before the Ride, Year:2013

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="post_year", value="2013"),
        MetadataFilter(key="post_title", value="Night before the Ride")
    ],
    condition='and'  # We can also give condition as 'or', but this is not working
    # You can also write it as condition=FilterCondition.AND
)

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running") # for this post - the prompt I give here is not critical

# printing out the metadata
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:Night before the Ride, Year:2013

Here I use the OR condition but get output as if it were AND. Feel free to look into the issue, or stay tuned to see whether a new release fixes it.

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="post_year", value="2013"),
        MetadataFilter(key="post_title", value="Night before the Ride")
    ],
    condition='or'
)

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running") # for this post - the prompt I give here is not critical

# printing out the metadata
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:Night before the Ride, Year:2013

Now, consider a scenario where you need to provide multiple values for the same filter. For instance, if I want to retrieve posts from either 2015 or 2017, the process aligns closely with what was described earlier. However, due to the OR operator’s malfunction, the desired outcome isn’t achieved. Below, I demonstrate that the syntax is correct, and no errors are encountered. The only problem arises with the OR condition itself.

filters = [
    MetadataFilter(
        key='post_year',
        value=year,
        operator='==',
    )
    for year in ['2015', '2017']
]

filters = MetadataFilters(filters=filters, condition="or")

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running")

for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
# I get no output actually

filters = [
    MetadataFilter(
        key='post_year',
        value=year,
        operator='==',
    )
    for year in ['2015', '2015']
]

filters = MetadataFilters(filters=filters, condition="or")

retriever = index.as_retriever(filters=filters)
docs = retriever.retrieve("Marathon running")

for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])

# ---Output---
title:First Half Marathon + 2015 Resolutions, Year:2015
title:Reflecting on First Half of 2015, Year:2015
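Until the OR condition behaves, one workaround is to run a separate retrieval per value and merge the results yourself, deduplicating by node id. The merge step is plain Python; the dicts below are stand-ins for retrieved nodes (with real results you would key on each node's `node_id`):

```python
# Merge several retrieval result lists, keeping the first occurrence
# of each node id - a stand-in for OR semantics across filter values.
def union_results(*result_lists):
    seen, merged = set(), []
    for results in result_lists:
        for node in results:
            if node['id'] not in seen:
                seen.add(node['id'])
                merged.append(node)
    return merged

docs_2015 = [{'id': 'n1', 'post_year': '2015'}]
docs_2017 = [{'id': 'n2', 'post_year': '2017'}, {'id': 'n1', 'post_year': '2015'}]
print([n['id'] for n in union_results(docs_2015, docs_2017)])
# → ['n1', 'n2']
```

With real nodes you would call retriever.retrieve(...) once per single-year filter and pass the resulting lists to this helper.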

Finally, putting the filter in the query engine:

%%time

filters = MetadataFilters(
    filters=[
        MetadataFilter(key="post_year", value="2017"),
    ],
)

# You pass the filter as an argument. Any of the filter types
# we saw above can be passed to the query engine.
query_engine = index.as_query_engine(service_context=service_context,
                                     similarity_top_k=5,
                                     filters=filters,
                                     response_mode='tree_summarize')

response = query_engine.query("Marathon Running")
print(response)

print('\n Metadata')

# Note: `docs` below is left over from the earlier retrieval cell, which is
# why the metadata printed in the output shows 2015 posts despite the 2017
# filter; the nodes the query engine actually used are in response.source_nodes.
for i in range(len(docs)):
    print("title:"+docs[i].metadata['post_title']+", Year:"+docs[i].metadata['post_year'])


# ---OUTPUT---

Based on the information provided in the three documents, the theme that emerges
is the importance of self-motivation and self-belief. The author of the first
document made a promise to themselves to never doubt their capabilities and to
always remember this promise when facing challenges. The author of the second
document also made a promise to themselves to never give up on their dreams
and to keep pushing themselves, despite feeling discouraged or defeated.
The author of the third document is currently trying to accomplish a task that
they have been struggling with for some time, and they are using their own
experiences and the power of self-belief to motivate themselves and push
through their challenges.

Overall, the theme that emerges is the power of self-belief and self-motivation
in overcoming challenges and achieving one's goals. The documents highlight
the importance of having faith in oneself and one's abilities, and of never
giving up on one's dreams and aspirations.


Metadata
title:First Half Marathon + 2015 Resolutions, Year:2015
title:Reflecting on First Half of 2015, Year:2015

Next Steps

What one can do from here is a lot. One option is a dynamic filter, that is, based on the query, edit the values passed to the filter. Let us take an example.

Prompt — ‘Did author run any marathon in 2016?’

  1. One approach is to employ an agent/function calling to extract or determine if the user is requesting information for a specific year. Alternatively, named entity recognition can be utilized to detect mentions of titles or years.
  2. Once the year has been identified, apply the filter to the query engine to retrieve only the relevant documents.
  3. Proceed to address the prompt based on the filtered documents.
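For step 1, when you only care about years, even a regular expression will do (an agent or NER model generalizes this; `extract_years` is a hypothetical helper name):

```python
import re

def extract_years(prompt):
    # Four-digit years (1900-2099); non-capturing group so findall
    # returns the full match rather than just the century prefix.
    return re.findall(r'\b(?:19|20)\d{2}\b', prompt)

print(extract_years('Did author run any marathon in 2016?'))
# → ['2016']
```

The extracted year(s) would then be dropped into MetadataFilter(key='post_year', value=...) entries before querying.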

Metadata filters, when combined with agents, chunking, and other techniques, offer a plethora of possibilities. If you have any ideas or are already engaged in similar endeavors, I’d love to hear from you. Let’s connect and innovate together. Feel free to share your thoughts on the usefulness of this information and whether you’d like to see more posts on this topic or something else. Your feedback is invaluable.

Code — https://github.com/SandyShah/llama_index_filters.git
