RAG easily explained: implementation with LangChain 🦜🔗, ChromaDB and OpenAI API

Isabella
Target Reply | Insights Hub
8 min read · Jan 8, 2024
The demo is called ‘pot-AI-to’ in honor of the former president of the Italian AI Commission

In a nutshell

Delving deeper into Large Language Model (LLM) applications, I learned about the Retrieval-Augmented Generation (RAG) architecture for optimizing LLM responses. As I found few step-by-step implementations of this technique, I created a Colab notebook for it.

This article is for you if you are looking for a Python implementation of the RAG technique to optimize LLM responses through a source-specific knowledge base stored in an open-source vector database.

Intro

To illustrate the RAG concept, let’s draw upon a notable statement made by the (former) president of the AI Commission in Italy during a ministerial hearing. He asserted that AI empowers journalists to swiftly access all potato recipes — a statement somewhat true but unexpected from a key figure in such a strategic commission.

In particular, to illustrate the added value of the RAG technique, I test the model on unfamiliar requests, such as recipes with a made-up ingredient, here called Amato's secret stuff (from Michael's secret stuff; Space Jam fan here).

By using a made-up ingredient, one can better understand the model's behavior when faced with data it was not trained on, and what to do when the LLM does not meet expectations.

In this implementation, the task of requesting potato recipes serves only as an illustrative example, emphasizing the importance of providing context and updated data to a pre-trained LLM in order to receive more factual and informative responses. While this project is rooted in that statement, the specific application was chosen for the meme rather than for its intrinsic significance, so focus on the architecture rather than the actual task being performed (which you can customize to your specific needs).

Retrieval-Augmented Generation (RAG)

What is it

Retrieval-Augmented Generation (RAG) is a technique to augment the knowledge of Large Language Models (LLMs) with additional data, putting context and updated information into the prompts. Facebook AI introduced it in a research paper published in 2020.

A simple way to understand this concept is to think about your own knowledge. As a comparison, using an LLM without RAG would be like a person relying only on the knowledge acquired during high school and other levels of formal education. Clearly, there would be several limitations in that person's ability to complete all sorts of tasks accurately.

For example, even when you start a new job, there is always domain-specific information that you need to embed in your own knowledge. Incorporating the RAG technique into an LLM application is no different from what you do yourself when you read the documentation of a new project, the newsletters you subscribe to, books, and whatever medium (🥁 did you get it?) you use to keep up with new information.

RAG essentially provides a way of updating the model's knowledge base and making it domain-specific.

Why do you need to know about it

Large Language Models are trained on a finite set of public data up to a specific point in time. When predicting a token through the transformer architecture, the model can only use the parametric knowledge stored in its weights, which means it may produce:

  • inaccurate or outdated information (output that is no longer relevant);
  • hallucinations (fabricated outputs presented with confidence).

One approach to increase the accuracy of the model and reduce the likelihood of these kinds of responses is to augment it with RAG. This technique combines the parametric knowledge of the pre-trained model with the non-parametric knowledge contained in domain-expertise documents (e.g., internal documentation), recent data (anything after January 2022), or any other proprietary data source.

An alternative way to tackle the shortcomings of LLMs is to fine-tune a pre-trained model on a specific, customized task. However, this approach requires further training, meaning far more computational resources, money and time. It is also not the most strategic choice when data sources change frequently or when you need a higher level of transparency (RAG is more transparent relative to fine-tuning, although both ultimately have untraceable reasoning processes). Hence, the RAG technique is a more immediate and accessible way to enhance the capabilities and value delivered by LLM applications.

RAG Architecture

A RAG architecture has two components:

1. Indexing: the component that ingests the external source and indexes it. This typically happens offline and uses a vector database to store the new information.

2. Retrieval and generation: the component that turns the external information into non-parametric knowledge. Essentially, it is the chain that takes the user query, retrieves the relevant data from the vector database through its index, and passes it to the LLM through an augmented prompt.

See the documentation for further details.
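To make the two components concrete, below is a minimal, library-free sketch of the flow. Everything in it is illustrative: the documents, the function names, and the keyword-overlap retrieval that stands in for a real vector search.

```python
# Conceptual sketch of the two RAG components (illustrative only).

# 1. Indexing (offline): store the external documents in a searchable
#    knowledge base. A real setup would embed them and write the vectors
#    to a vector database such as ChromaDB.
knowledge_base = [
    "Potato gratin: slice potatoes, layer with cream and cheese, then bake.",
    "Amato's secret stuff is a made-up ingredient used to test the model.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """2a. Retrieval (online): find the documents most relevant to the query.
    A naive keyword overlap stands in for vector similarity search here."""
    return sorted(
        knowledge_base,
        key=lambda doc: sum(word in doc.lower() for word in query.lower().split()),
        reverse=True,
    )[:k]

def augment_prompt(query: str) -> str:
    """2b. Generation: pass the retrieved context to the LLM inside the prompt."""
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(augment_prompt("Do you have a recipe with Amato's secret stuff?"))
```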

How I picture in my mind the LLM entering the vector database

Implementation

A common application of the RAG technique is in Q&A scenarios, as it allows the LLM's answering power to be augmented with specific source information, making the output more specific, diverse, factual, context-aware, and overall more informative and accurate.

The code first showcases how to develop a conversational interface with a standard LLM using LangChain.

Initialization of the conversational interface
Interaction with the LLM
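For reference, a minimal version of that setup could look like the sketch below (assuming the langchain and openai packages available at the time of writing; the exact classes and model used in the notebook may differ).

```python
# Minimal conversational interface with LangChain and the OpenAI API.
# Assumes the OPENAI_API_KEY environment variable is set.
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# ConversationChain keeps the chat history in memory between calls.
conversation = ConversationChain(llm=llm, memory=ConversationBufferMemory())

print(conversation.predict(input="Can you suggest a quick potato recipe?"))
print(conversation.predict(input="And one that uses Amato's secret stuff?"))
```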

Then, it displays the limitations of querying a pre-trained LLM. These are mainly:

  1. having to face unfamiliar situations → the model is asked about recipes with a made-up ingredient, here called “Amato’s secret stuff”.
  2. being asked about recent events → the model is asked who the newly appointed president of the AI Commission in Italy is.
This is the best-case scenario: in the worst case (shown in the notebook), the model fabricates a potato recipe with the made-up ingredient, with no factual reference behind it, merely replicating the pattern of a standard recipe.

To overcome such limitations, the RAG is first implemented in a naive way by simply incorporating the new knowledge as a string in the prompt.

RAG technique — naive implementation
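In other words, the naive version is just string concatenation: the new knowledge is pasted into the prompt by hand. A sketch of the idea (the context text below is made up for illustration; the notebook uses its own):

```python
# Naive RAG: paste the new knowledge directly into the prompt as a string.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Illustrative context; in the notebook this comes from the recipe source.
context = (
    "Amato's secret stuff is a spicy herb paste. "
    "Roasted potatoes with Amato's secret stuff: toss potato wedges with olive oil "
    "and a spoonful of the paste, then roast at 200°C for about 40 minutes."
)
question = "Can you give me a potato recipe that uses Amato's secret stuff?"

augmented_prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}"
)
print(llm.predict(augmented_prompt))
```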

Then, the architecture is further developed towards a more production-ready setting by storing the new information as vector embeddings in a vector database. For this implementation, ChromaDB was chosen, as it is open source and does not even require registration.

First, create the Dataset Python object
Then, create the collection in the vector database (I include these snippets as a walkthrough, but executing the notebook will make things clearer)
Finally, the database can be queried; for more instructions, see the documentation
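Put together, the ChromaDB flow is roughly the following (the collection name, documents, metadata keys and ids are illustrative, not the notebook's exact values):

```python
import chromadb

# In-memory client; ChromaDB also offers a persistent client if needed.
client = chromadb.Client()

# Create the collection; Chroma embeds the documents with its default
# embedding model unless a different embedding function is supplied.
collection = client.get_or_create_collection(name="potato_knowledge")

# Add the source-specific documents together with metadata and ids.
collection.add(
    documents=[
        "Roasted potatoes with Amato's secret stuff: toss wedges with the paste and roast.",
        "Mashed potatoes with Amato's secret stuff: stir a spoonful into the mash.",
    ],
    metadatas=[{"type": "recipe"}, {"type": "recipe"}],
    ids=["recipe-1", "recipe-2"],
)

# Query by semantic similarity; the most relevant documents come back first.
results = collection.query(
    query_texts=["a recipe with Amato's secret stuff"],
    n_results=2,
)
print(results["documents"])
```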

Now that the new information is embedded and stored in the vector database, it can be used to augment the LLM and tackle its limitations.

Results

The first use case is handled by the contextualize_prompt function, which augments the prompt with the data fetched from the vector database.

The new pieces of information are retrieved from the vector embeddings in the vector database
Notice the difference with the earlier query: the model now provides recipes with the made-up ingredient. That's the power of the non-parametric knowledge incorporated through the RAG technique.
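The notebook's contextualize_prompt function presumably does something along these lines (a sketch, not the exact code):

```python
def contextualize_prompt(question: str, collection, n_results: int = 2) -> str:
    """Fetch the most relevant documents from the vector database and
    prepend them to the user question as context for the LLM."""
    results = collection.query(query_texts=[question], n_results=n_results)
    # query() returns one list of documents per query text; take the first.
    context = "\n".join(results["documents"][0])
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Usage, with collection and llm created as in the previous snippets:
# answer = llm.predict(contextualize_prompt(
#     "Can you give me a recipe with Amato's secret stuff?", collection))
```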

To test another use case, the database is further extended with a different type of information focusing on the latest news. In fact, the president resigned on Friday, but the LLM cannot know this using only its parametric knowledge. Hence, when asked “Who is the new president of the AI Commission in Italy?”, the answer it returns is similar to the one below.

The model when told that the president of the commission has just resigned

However, by augmenting the prompt with the data fetched from the vector db (this time querying on the metadata), the model overcomes its limitations and accurately answers the query.

Before RAG with the latest news
After RAG with the latest news
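A sketch of that step, assuming the news documents are tagged with a metadata key such as "type" (the key, values and wording are illustrative):

```python
import chromadb
from langchain.chat_models import ChatOpenAI

client = chromadb.Client()
collection = client.get_or_create_collection(name="potato_knowledge")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Extend the collection with the latest news, tagged via metadata.
collection.add(
    documents=["The president of the Italian AI Commission resigned on Friday."],
    metadatas=[{"type": "news"}],
    ids=["news-1"],
)

question = "Who is the new president of the AI Commission in Italy?"

# Restrict the retrieval to news documents through the metadata filter.
results = collection.query(
    query_texts=[question],
    where={"type": "news"},
    n_results=1,
)
latest_news = results["documents"][0][0]

prompt = f"Context:\n{latest_news}\n\nQuestion: {question}"
print(llm.predict(prompt))
```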

Consider the value of this kind of application if the news were scraped programmatically or pulled from a public API and directly loaded into the vector database.

Technical choices

In the notebook, the code is thoroughly commented and all component choices are further explained, so I'll keep this technical section concise. Essentially, the components are:

  • LLM: the large language model engine that powers the responses.
  • LangChain: a framework to architect the pipeline (the Chain component) for applications powered by LLMs (the Lang-uage component).
  • Embedding model: the new source information needs to be stored in a machine-readable way, so embedding models are what translate this information into its mathematical representation (vectors).
  • Vector databases: these vectors need to be stored in a knowledge base that can handle high-dimensional data while providing efficient querying and retrieval mechanisms — and this is what vector databases are for.

Head to the notebook for more technical details. At any step of the chain, there is a great amount of potential complexity, so feel free to let me know if there are choices that you would have made differently.

Conclusions

Summing up, the RAG technique:

  • optimizes the LLM by making the responses more factual and accurate, reducing the likelihood of hallucinations through new, targeted source information.
  • can be a more effective technique than fine-tuning, particularly in scenarios where data change frequently.

However,

  • the integration of additional knowledge introduces greater complexity, potentially leading to longer processing and development times as well as greater resource usage (computational resources, but also cost per token). This means there is a risk that the benefits do not outweigh the costs in a real production setting.
  • the real challenge is properly structuring the knowledge base, especially if it is siloed and comes in a range of different formats. As in all projects, if there is no proper documentation, there is no person (or AI) who can get the job done.

Additional info

  • The project is inspired by this video, but my implementation modifies the architecture by using the open-source vector database ChromaDB, which is free and does not require registration.
  • To delve deeper into the difference between RAG and fine-tuning, I believe this article has some useful insights.
  • To delve deeper into vector databases, I think this article is a useful starting point.
