RAG: prepare your knowledge domain for Retrieval-Augmented Generation

Jérôme DIAZ
6 min read · Jan 24, 2024


Photo by Tom Hermans on Unsplash

When using a Large Language Model (LLM) like ChatGPT to generate a context-aware answer to a user query, the most important thing is to start by building one or more knowledge sources adapted to the expected usage.

But we mustn’t restrict ourselves to using only databases to hold those knowledge sources.

A brief reminder of what Retrieval-Augmented Generation is

RAG is about asking an LLM to answer a query using a given context text. This text is provided by a retriever, which uses the query as input to select/fetch documents.

Basic RAG, illustration by the author.
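To make that flow concrete, here is a minimal sketch in Python using the OpenAI client; the `retrieve` function is a hypothetical placeholder for whatever retriever you plug in, and the model name is only an example.

```python
# Minimal RAG flow: retrieve some context, then ask the LLM to answer with it.
# `retrieve` is a hypothetical placeholder for your actual retriever
# (vector store, database query, API call, ...).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def retrieve(query: str) -> list[str]:
    """Hypothetical retriever: return text snippets relevant to the query."""
    return ["<document 1>", "<document 2>"]


def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any chat-completion model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```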

The retriever

In basic cases the retriever is given the user’s original query as is.

In general the query is first transformed in order to get more relevant results (i.e. we ask an LLM to generate a condensed query, that is to say one that takes into account the previous interactions between the user and the chatbot/system).

Query context aware RAG, suitable for conversational RAG. Illustration by the author.

In advanced cases, the retriever can choose between multiple data sources/knowledge bases and apply filters based either on the conversation context or derived from the query.

The augmented generation part

The LLM is not provided with only the user query, but with a prompt asking it to answer the query using the provided text context, that is to say a prompt combining:

  • the text output by the retriever,
  • a query that might have been transformed first (i.e. to take the previous interactions into account).

The prompt can also be used to customise how the LLM will answer, for instance (a small template sketch follows this list):

  • asking for a specific output format (text, JSON, markdown, …),
  • instructions about the «role» the LLM plays (helpful assistant, sales representative, …),
  • data about the user (how to address him/her, the user’s expected level in the field).
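As mentioned above, here is a sketch of what such a prompt template could look like; the placeholder names are illustrative only, not a fixed schema.

```python
# A sketch of a prompt combining the retrieved context, the (possibly
# transformed) query and the customisation options listed above.
# All placeholder names are illustrative.
PROMPT_TEMPLATE = """You are a {role}.
Answer the user's question using only the context below.
Address the user as {user_name}, who is a {user_level} in this field.
Reply in {output_format}.

Context:
{context}

Question:
{question}
"""

prompt = PROMPT_TEMPLATE.format(
    role="helpful assistant",
    user_name="Alice",
    user_level="beginner",
    output_format="markdown",
    context="<text returned by the retriever>",
    question="<condensed user query>",
)
```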

Considerations about the retriever part

The retriever must provide text relevant to the query, which is best achieved by treating the knowledge domain as dynamic.

What is a static/closed knowledge domain?

A static knowledge domain is what is proposed by many tutorials and articles about RAG based on a vector store. In a closed knowledge domain we try to ingest as much knowledge as possible to cover every case, with the drawback of increasing the risk that the retriever provides no usable documents.

In those tutorials, the data is first indexed into a database as embeddings (vector representations of the text). That index is then used at query time to find the data that is semantically closest to what we intend to find.
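As a rough sketch of that setup with LangChain (module paths vary between versions; an OpenAI embedding model and a local FAISS index are assumed here):

```python
# Ingestion: index texts as embeddings, then query by semantic similarity.
# Module paths may differ depending on your LangChain version;
# faiss-cpu (or faiss-gpu) must be installed for the FAISS backend.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = [
    "The A380 can carry up to 853 passengers in an all-economy layout.",
    "The A320 typically seats around 150 passengers.",
]

# Done ahead of time: compute the embeddings and build the index.
store = FAISS.from_texts(texts, OpenAIEmbeddings())

# Done at query time: fetch the documents semantically closest to the query.
docs = store.similarity_search("How many passengers can an A380 carry?", k=2)
```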

pros:

  • you can easily find data about an item (e.g. an article) using a description of its content, just by providing the query,
  • it is quite easy to set up with frameworks such as LangChain or LlamaIndex.

cons:

  • if you have a query asking for data about a specific item, you can’t be sure the retriever will actually return text related to that item.

For instance: you ask how many passengers an A380 can carry; the retriever will give you x texts about passenger capacity, but you have no guarantee any of them are about the A380, as the name “A380” carries little semantic meaning by itself.

  • the index must be prepared beforehand.

Illustration by the author.

If you take the illustration above,

  • the blue circle represents what a vector store without filtering might return as the k-nearest documents,
  • the green circle represents what a database would return by filtering on the mentioned items (like the A380).

Basing our RAG only on semantic search would retrieve only blue documents, but hopefully the overlap between the two circles helps answer the query.

The issue is that there is no guarantee this overlap exists for a real-life query. So to answer item-related queries, a filter must be applied to the vector store so that only documents in this overlapping area are kept.
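One way to stay in that overlapping area is to attach metadata at ingestion time and filter on it at query time. A sketch, again with LangChain and FAISS, where the `aircraft` metadata field is an invented example (filter support and syntax depend on the vector store backend):

```python
# Attaching metadata at ingestion time so the semantic search can be filtered.
# The "aircraft" field is an invented example; filter support and syntax
# depend on the vector store backend.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

texts = [
    "The A380 can carry up to 853 passengers in an all-economy layout.",
    "The A320 typically seats around 150 passengers.",
]
metadatas = [{"aircraft": "A380"}, {"aircraft": "A320"}]

store = FAISS.from_texts(texts, OpenAIEmbeddings(), metadatas=metadatas)

# Only documents about the A380 (the overlapping area) can now be returned.
docs = store.similarity_search(
    "passenger capacity",
    k=2,
    filter={"aircraft": "A380"},
)
```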

What is a dynamic/adaptive knowledge domain?

I will start here by giving you a use case that needs one. Imagine you ask your chatbot/RAG the following question.

What are my appointments for today?

The retriever mustn’t return:

  • other users’ appointments,
  • appointments for another day,

but it must return:

  • an up to date list of appointments,
  • sorted in chronological order,

and, depending on the exact query given:

  • it might skip appointments that have already taken place,
  • it might skip personal appointments and return only professional ones (or vice versa).

In this example, the knowledge domain should be at the same time:

  • open, that is to say it needs access to up-to-date data fetched at query time,
  • restricted, so that it contains only data relevant to the user and their query.

This is what I call a dynamic knowledge domain, as it adapts to the need.
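A sketch of what building such a dynamic knowledge domain could look like; `fetch_appointments` is a hypothetical stand-in for a call to a live calendar API:

```python
# Sketch of a dynamic knowledge domain for the appointments query.
# `fetch_appointments` is a hypothetical stand-in for a live calendar API.
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Appointment:
    start: datetime
    title: str
    personal: bool


def fetch_appointments(user_id: str, day: date) -> list[Appointment]:
    """Hypothetical calendar API call: always returns up-to-date data."""
    return []  # replace with the real API call


def build_context(user_id: str, today: date, include_personal: bool) -> str:
    # Open: the data is fetched at query time, so it is up to date.
    # Restricted: only this user's appointments, only for today.
    appointments = [
        a
        for a in fetch_appointments(user_id, today)
        if include_personal or not a.personal
    ]
    appointments.sort(key=lambda a: a.start)  # chronological order
    return "\n".join(f"{a.start:%H:%M} - {a.title}" for a in appointments)
```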

Can we still use a vector store to build a dynamic knowledge base?

Even with their limitations, vector stores remain great at extracting, from large texts, the parts that are most likely to answer the query.

In a fully dynamic approach, the query and the user context should be used to determine both the keywords used to fetch text documents and the filters used to restrict the results.

If those texts are too long, they can be cut into smaller parts and then ingested into a vector store, which can be:

  • in memory only and used only for the query,
  • or persisted (to cache previously ingested documents).

The query (and filters, in the case of a persisted vector store) will then be used to extract the relevant parts from the original documents.
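A sketch of the in-memory variant with LangChain (module paths may differ between versions; nothing here is persisted):

```python
# Cut long documents fetched at query time into chunks, index them in an
# in-memory vector store and keep only the parts relevant to the query.
# Module paths may differ depending on your LangChain version.
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter


def relevant_parts(fetched_texts: list[str], query: str, k: int = 4) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.create_documents(fetched_texts)
    # In-memory index built only for this query (nothing is persisted).
    store = FAISS.from_documents(chunks, OpenAIEmbeddings())
    return [doc.page_content for doc in store.similarity_search(query, k=k)]
```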

Notes about the usage of LLMs in a RAG system

While in basic systems LLMs are used only at generation time, in more complex setups they can also be used at other stages.

Before retrieval

An LLM can be used before retrieval to transform the user query into one that takes different elements into account, like the chat history.

Remember, an LLM has no memory of its own: if you want it to answer taking past questions and answers into account, you must provide a query that condenses those interactions.
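A sketch of that condensation step, using the OpenAI client and an illustrative prompt:

```python
# Sketch of the "condense the query" step: an LLM rewrites the latest user
# message as a standalone question, using the chat history as context.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CONDENSE_PROMPT = """Given the conversation below, rewrite the last user
message as a single standalone question that can be understood without
the rest of the conversation.

Conversation:
{history}

Last user message: {question}

Standalone question:"""


def condense(history: str, question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # any chat-completion model works here
        messages=[
            {
                "role": "user",
                "content": CONDENSE_PROMPT.format(history=history, question=question),
            }
        ],
    )
    return response.choices[0].message.content
```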

At retrieval time to select documents

At this stage we can use an LLM:

  • to select relevant data sources (use a database? call an API? …),
  • to determine filters to apply / parameters to call an API with (a routing sketch follows the chain-of-thoughts example below),
  • to transform (e.g. summarize) fetched document parts,
  • to generate intermediate questions and answer them; the retriever then provides this question-answer exchange to the final LLM as the context to answer the user query. (This is called a chain of thoughts, or CoT.)

NB: A chain of thoughts can be used to answer questions that involve multiple steps, like:

“When was the richest man in the world born?”

  • Thought: “Who is the richest man in the world?”
  • Answer: “Elon Musk is the richest man in the world.”
  • Thought: “When was Elon Musk born?”
  • Answer: “Elon Musk was born in 1971.”
  • Final answer: “The richest man in the world is Elon Musk; he was born in 1971.”
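Coming back to the first two points of the list above (selecting a data source and determining filters), here is a sketch of an LLM used as a router; the source names and filter keys are illustrative only, and the JSON output should of course be validated before use.

```python
# Sketch of an LLM used as a "router" at retrieval time: it picks a data
# source and the filters to apply. Source names and filter keys are
# illustrative only, and the JSON output should be validated before use.
import json

from openai import OpenAI

client = OpenAI()

ROUTER_PROMPT = """You route user questions to a data source.
Available sources: "calendar_api", "product_vector_store".
Return only a JSON object with two keys: "source" and "filters".

Question: {question}
JSON:"""


def route(question: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": ROUTER_PROMPT.format(question=question)}],
    )
    # Will raise if the model does not return valid JSON.
    return json.loads(response.choices[0].message.content)


# Expected shape, for example:
# {"source": "calendar_api", "filters": {"day": "today", "type": "professional"}}
```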

Thanks for reading!

Feel free to comment or contact me!


Jérôme DIAZ

Software designer with two master's degrees and a natural curiosity. I enjoy understanding how things work: to fully use them and also create new solutions.