A tale of RAGs, metadata and LLMOps.

Fares
5 min read · Mar 10, 2024


There is no need to introduce LLMs anymore. They have been thrust so far into the public eye that anyone with basic internet literacy has at least heard of ChatGPT.

Internet connoisseurs will notice LLM applications everywhere: reimagined search engines (perplexity.ai), coding assistants (GitHub Copilot), healthcare tools (Nabla…). One thing stands out, though: the scoped expertise of each. Copilot won't tell you how to bake a chocolate cake the way ChatGPT will, but it will give you far more accurate boilerplate for a Python web server than ChatGPT does.

So where does this expertise come from? Are the user prompts being limited? Are these LLMs dumbed down? Nope. It is either the result of heavy fine-tuning (which is costly, time-consuming, and relies on a lot of trial-and-error experimentation), or RAG (Retrieval Augmented Generation).

What is RAG?

RAG is the act of giving your queries a scope: a set of documents, PDF files, a dataset, a list of repositories, medical articles, etc.

Simply put, it consists of drawing borders around the "unlimited potential" of LLMs. By combining the strengths of a language model with a data source you provide, it offers a compelling alternative to costly fine-tuning, with relatively high-quality answers.
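To make this concrete, here is a minimal sketch of a RAG pipeline with LlamaIndex, assuming a local ./data folder of documents and an OPENAI_API_KEY in the environment (exact imports may vary between LlamaIndex versions):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load the documents that will act as the LLM's scope
documents = SimpleDirectoryReader("./data").load_data()

# Embed the documents and build an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Query the index: retrieval + generation in one call
query_engine = index.as_query_engine()
response = query_engine.query("What does the Pythagorean theorem state?")
print(response)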

RAG Diagram — (LlamaIndex)

Pros:

  • Relatively easy setup if you use something like LlamaIndex.
  • Relatively mature ecosystem (embeddings, vector databases, data ingestion, etc.).
  • High-quality results compared to a generic LLM, without the time it would have taken to fine-tune the model.
  • Data governance: you control the data source, and scope your LLM to that data only.

Cons:

  • Extremely prompt-dependent (wrong prompts lead to wrong results).
  • Somewhat generic if you don't configure it (especially in a multi-tenant/multi-user situation).

And most applications fall into that multi-tenant, multi-user situation. Businesses leveraging LLMs want to serve a personalized experience to their users. Suppose we are a healthcare startup. We want to ensure that the agent assisting the doctor with a patient is limited (and limited only, for medical confidentiality reasons) to that conversation and that patient's information. How can we manage that so it is a) HIPAA compliant and b) useful to both the doctor and the patient?

That’s where metadata comes into play!

Metadata-driven RAG

Please forgive my artistic skills

Let's go back to our example. A truly personalized experience can't be built with plain RAG; it needs to be driven by user-specific, context-specific metadata.

Let’s see a concrete example.

import logging

# llama-index >= 0.10 import paths
from llama_index.core import StorageContext, VectorStoreIndex

logger = logging.getLogger(__name__)


def generate_datasource(metadata: dict = {}):
    try:
        logger.info("Creating new index")
        # Load the documents (tagged with the caller's metadata) and create the index.
        # get_documents and get_vector_store are application-level helpers.
        documents = get_documents(metadata)
        storage_context = StorageContext.from_defaults(
            vector_store=get_vector_store()
        )
        # Embed the documents and persist them into the configured vector store
        VectorStoreIndex.from_documents(
            documents, storage_context=storage_context, show_progress=True
        )
    except Exception as e:
        logger.error(f"Error while creating new index: {e}")
        raise e
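The get_documents helper isn't shown above; here is a minimal sketch of what it could look like, assuming the files live in a local ./data folder and are read with LlamaIndex's SimpleDirectoryReader (the path and helper name are illustrative):

from llama_index.core import SimpleDirectoryReader


def get_documents(metadata: dict):
    # Read the raw files (adjust the path to your storage layout)
    documents = SimpleDirectoryReader("./data").load_data()
    # Tag every document with the caller's metadata (e.g. user, topic)
    # so the resulting chunks carry it into the vector store
    for doc in documents:
        doc.metadata.update(metadata)
    return documents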

Let's suppose we are an edtech platform wishing to provide a personalized experience between an educational agent and a student. We can store files provided by students (such as a course on the Pythagorean theorem), ingest them as chunks into our vector store, and tag them with relevant metadata (which topic? which student?).
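At query time, that metadata is what scopes retrieval to a single student and topic. A sketch using LlamaIndex metadata filters, assuming index is the VectorStoreIndex built earlier and that user_id and topic keys (illustrative names) were attached at ingestion:

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="user_id", value="student-42"),
        ExactMatchFilter(key="topic", value="pythagorean-theorem"),
    ]
)

# Only chunks matching this student's metadata are retrieved
query_engine = index.as_query_engine(filters=filters)
response = query_engine.query("Can you remind me of the Pythagorean theorem?")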

Topic and user-relevant metadata
Math reminder… in French!

We end up with an agent capable of answering relevant questions in any language:

English on the left, Arabic on the right

And it will refuse to use outside data for completely irrelevant ones:

This is all great. As a business, you maintain control over your data sources and you can scope the context of your queries while serving user-specific, personalized experiences. What's missing?

Control & tracking.

Enter: LLMOps monitoring.

Build monitoring using Langfuse & LlamaIndex

Langfuse is an observability library that tracks the costs and interactions of your app: the prompts used and, in our case, the data sources used. It can be integrated as a plugin into tools like LangChain and LlamaIndex.

Langfuse comes in two editions, cloud and self-hosted. Either way, it works the same way:

Retrieval tracking
Generation tracking

It can track across several layers: retrieval, generation, embedding, etc.

Langfuse’s integration with LlamaIndex is pretty straightforward too:

import os

from langfuse.llama_index import LlamaIndexCallbackHandler
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager
from llama_index.llms.openai import OpenAI

# CHUNK_SIZE and CHUNK_OVERLAP are defined in the app's configuration


def init_base_settings():
    # Pick the model from the environment, defaulting to gpt-3.5-turbo
    model = os.getenv("MODEL", "gpt-3.5-turbo")
    Settings.llm = OpenAI(
        model=model,
    )


def init_settings():
    init_base_settings()
    # Route LlamaIndex callback events (retrieval, generation...) to Langfuse
    langfuse_callback_handler = LlamaIndexCallbackHandler()

    Settings.callback_manager = CallbackManager([langfuse_callback_handler])
    Settings.chunk_size = CHUNK_SIZE
    Settings.chunk_overlap = CHUNK_OVERLAP
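The callback handler picks up its credentials from the environment, so pointing it at the cloud edition or a self-hosted instance is just configuration. A minimal sketch using Langfuse's standard environment variables (the key values are placeholders, and the host should be your own URL if you self-host):

import os

os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-..."  # placeholder
os.environ["LANGFUSE_SECRET_KEY"] = "sk-..."  # placeholder
# Cloud edition host, or the URL of your self-hosted instance
os.environ["LANGFUSE_HOST"] = "https://cloud.langfuse.com"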

Conclusion

With the advent of LLM-driven apps, several topics will come to the fore. With open-source alternatives to OpenAI, such as Mistral, companies will consider self-hosting their own models. Chief Data Officers and Chief Security Officers will be less and less open to the idea of sharing private business and/or customer data with external partners; fine-tuning and RAG will satisfy both the data governance requirements of security teams and the user experience requirements of business teams.

Both developers and executives will push for transparency and metrics, and a new generation of LLMOps tooling will provide the dashboards and numbers to satisfy both technical and non-technical people.

LLM apps will soon stop being exempt from the rules that apply to regular business apps. It will be up to both LLM providers and ecosystem members to build the right solutions and tools, and to find new ways to comply with the same old requirements.

If you've enjoyed reading this post, don't hesitate to share it and leave a comment! If you have any feedback, feel free to share it!

See you soon!
