Multimodal RAG for URLs and Files, in 40 Lines of Python

Emmett McFarlane
4 min read · Jun 3, 2024


How retrieval-augmented generation works. Image from Chroma.

Vision-language models can generate text based on multimodal inputs. However, they have a very limited useful context window. Retrieval-Augmented Generation (RAG) is a technique that allows you to bypass this problem by equipping a large language model (LLM) with a searchable knowledge base.

If you have tried applying this technique to your own data using OpenAI's new multimodal format for GPT-4o, you may have found it unreasonably difficult with deeply complex frameworks such as Langchain, LangGraph, LlamaIndex, or CrewAI.

At least I did.

Trust me — doing this is easy, and only requires a few lines of vanilla Python. We certainly don’t need frameworks or dozens of dependencies, despite what Langchain and others may tell you in their advertising.

Here are a few caveats that make this process seem more difficult than it is:

  • The steep learning curve for these heavy frameworks can be overwhelming, especially if your goal is to build something functional rather than to chase whichever framework is currently trending.
  • Most vector database integrations assume that “multimodal” means you want to index both images and text as vectors. In the vast majority of my past use cases, I wanted to index only the text and have the images available to the prompt upon retrieval.
  • Nearly all modern data extraction infrastructure assumes you are using a text-only language model. That means your PDF loaders, URL loaders, and so on will not capture complex visual data such as tables, charts, and figures. That is so 2023.

Here, I’ll show you how I set up multimodal RAG on my documents using The Pipe and ChromaDB in just 40 lines of Python.

The Idea

With RAG, documents are stored in a database indexed by their vector embeddings. Indexed this way, contextually relevant information can be supplied on the fly to the LLM, improving response quality for the user's exact query. This guide assumes you already understand RAG, how it works, and why it's useful. Here's a quick overview of how we'll implement RAG for GPT-4o on your documents using The Pipe and ChromaDB:

  1. Create a Collection: We start by setting up a persistent collection in ChromaDB. Since we index only the chunks' text, the default text embedding function is sufficient.
  2. Extract Data: Using The Pipe, we extract data from a specified source into prompt messages.
  3. Prepare Chunks: These messages are then chunked into RAG-ready segments.
  4. Embed Text: Each chunk’s text is embedded into the collection with its corresponding prompt message as metadata.
  5. Retrieve Prompts: Retrieve relevant prompt messages from ChromaDB.
  6. Generate Response: These messages are then fed into GPT-4o to generate a response.
We’ll be using GPT-4o in this example, which as of June 2024 offers state-of-the-art performance.
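
To make steps 2 and 6 concrete: a “prompt message” here is simply an entry in OpenAI's chat format, where the content list can mix text and images. The sketch below is hand-written (the text and image data are placeholders, not real extracted content), but messages produced by The Pipe have roughly this shape, which is why we can store each whole message as metadata and replay it to GPT-4o at query time.

# A rough sketch of one multimodal prompt message in OpenAI's chat format.
# The text and image URL below are placeholders, not real extracted data.
example_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Figure 2 shows the observed transit light curve..."},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<base64-encoded figure>"}},
    ],
}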

The Setup

Before diving into the implementation, make sure you have the necessary packages installed and your API keys set up. You will need keys for the OpenAI API and the ThePipe API. Alternatively (albeit with some extra setup), you can use a local language model and a local installation of ThePipe at no cost.

Obtain API Keys

  1. OpenAI API Key: Visit OpenAI to get your API key. You can opt to do this without a key by using a (less intelligent) local language model such as LLaMa 3, which you can set up with this guide.
  2. ThePipe API Key (optional): Register for the API and obtain your API key. You can opt to do this without a key by following the local installation instructions (more details in the documentation).

Installations & Environment Variables

Set the API keys as environment variables. For Windows:

setx OPENAI_API_KEY "your_openai_api_key"
setx THEPIPE_API_KEY "your_thepipe_api_key"

For Mac or Linux:

export OPENAI_API_KEY="your_openai_api_key"
export THEPIPE_API_KEY="your_thepipe_api_key"

Then install the necessary libraries using pip:

pip install openai requests thepipe_api chromadb

Restart your terminal to apply all of these changes.
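
If you want to sanity-check that the keys actually made it into your environment, a tiny snippet like this (it only reads the variables, nothing framework-specific) will tell you:

import os

for key in ("OPENAI_API_KEY", "THEPIPE_API_KEY"):
    print(f"{key} is {'set' if os.getenv(key) else 'NOT set'}")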

The Code

The following scripts are designed to be run independently. The first script adds new documents to the vector database, and the second script queries the database to retrieve relevant content and generates a response with GPT-4o.

Script 1: Add docs to your vector database

from thepipe_api import thepipe
import chromadb
import json

def add_documents_to_collection(data_source, collection_name):
    # Initialize ChromaDB client
    chroma_client = chromadb.PersistentClient(path="db")
    collection = chroma_client.get_or_create_collection(name=collection_name)
    # Prepare RAG-ready chunks from the data source
    messages = thepipe.extract(data_source)
    chunks = thepipe.core.create_chunks_from_messages(messages)
    # Embed the text of each chunk, with its prompt message stored as metadata
    for i, (chunk, message) in enumerate(zip(chunks, messages)):
        if chunk.text:  # if there is no text, the item is skipped
            collection.add(
                ids=[data_source + str(i)],
                documents=[chunk.text],
                metadatas=[{"message": json.dumps(message)}]
            )

if __name__ == "__main__":
    data_source = "https://arxiv.org/pdf/0806.1525.pdf"
    collection_name = 'vectordb'
    add_documents_to_collection(data_source, collection_name)
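
Nothing in add_documents_to_collection is specific to URLs, so indexing several sources is just a loop. The file name below is a hypothetical placeholder:

if __name__ == "__main__":
    # Hypothetical sources; swap in your own files and URLs
    sources = [
        "https://arxiv.org/pdf/0806.1525.pdf",
        "my_report.pdf",
    ]
    for source in sources:
        add_documents_to_collection(source, "vectordb")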

Script 2: Retrieve and generate response

from openai import OpenAI
import chromadb
import json

def query_vector_db(collection_name, query):
    # Initialize ChromaDB client
    chroma_client = chromadb.PersistentClient(path="db")
    collection = chroma_client.get_collection(name=collection_name)
    # Retrieve prompt messages from ChromaDB related to the user query
    retrieved_metadatas = collection.query(query_texts=[query], n_results=4)['metadatas'][0]
    retrieval_messages = [json.loads(md['message']) for md in retrieved_metadatas]
    # Prepare a prompt message for the user query in OpenAI format
    user_message = [{"role": "user", "content": [{"type": "text", "text": query}]}]
    # Generate a response from GPT-4o using the retrieved prompt messages
    openai_client = OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=retrieval_messages + user_message
    )
    print(response.choices[0].message.content)

if __name__ == "__main__":
    collection_name = 'vectordb'
    query = input("Enter your query: ")
    query_vector_db(collection_name, query)
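
If the answers look off, it helps to see what was actually retrieved before it reaches the model. A small helper like the one below (not part of the scripts above, just a debugging suggestion) prints each retrieved chunk alongside its distance score:

def inspect_retrieval(collection_name, query, n_results=4):
    # Peek at what ChromaDB would hand to the model for a given query
    chroma_client = chromadb.PersistentClient(path="db")
    collection = chroma_client.get_collection(name=collection_name)
    results = collection.query(
        query_texts=[query],
        n_results=n_results,
        include=["documents", "distances"],
    )
    for doc, dist in zip(results["documents"][0], results["distances"][0]):
        print(f"(distance {dist:.3f}) {doc[:200]}")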

Happy coding! 🚀


Emmett McFarlane

ML engineering & astrophysics geek in Toronto. Nothing makes me prouder than building AI pipelines and seeing them work in production.