Let’s Normalize Online, In-Memory RAG!

Andrew Nguonly
Published in The Deep Hub · 7 min read · Feb 1, 2024


This is the third article in a series about Lumos, a RAG LLM co-pilot for browsing the web. Reading the prior articles is recommended!

ChatGPT prompt: “Generate an image of Retrieval Augmented Generation (RAG). RAG is all the rage! The image should mimic the scene of The Last Supper. It’s a party and everyone is excited about AI, LLMs, and RAG! The image should be in the style of a fun animation. Make the scene exciting and chaotic!”

RAG is all the rage.

By now, we’re all familiar with the “R” in RAG. “R” is for “retrieval” (Retrieval Augmented Generation): the process of retrieving documents that have been indexed into a vector store and using them to supplement a prompt with additional context before passing the complete prompt to a Large Language Model (LLM). RAG has become an effective architecture for augmenting LLMs: a RAG pipeline can supply an LLM with new information that the model was never trained on. There’s plenty of research and experimentation around fine-tuning embedding models, search optimization, and document reranking, but little is said about the operational burden of maintaining a sophisticated RAG pipeline.

What does it take to operate and scale a robust RAG pipeline with near real-time updates? It’s a huge challenge, no doubt. Imagine the effort required to index Slack messages in real-time into an Elasticsearch cluster. This example alone contains more challenges with database operations than it does with RAG pipelines.

Off-the-cuff post…

In this article, I want to shed light on what I feel is an undervalued or underrated approach to RAG: online and in-memory document embedding generation. Instead of focusing on the “R”, I’ll turn our attention to the steps that precede retrieval. In the implementation of Lumos, a RAG pipeline is executed in the Chrome extension’s background script. Documents are stored in memory in LangChain’s MemoryVectorStore and retrieved immediately after indexing. The entire workflow of generating embeddings and prompting the LLM with the retrieved documents happens in a single request context initiated by the end user.
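As a rough sketch of that single request context (the message shape and the runRagPipeline helper below are hypothetical placeholders for the splitting, embedding, retrieval, and prompting steps shown later in this article), the background script can handle everything inside one message listener:

// Hypothetical sketch of a single-request RAG flow in a background script.
// runRagPipeline() and the message fields are illustrative, not Lumos's actual code.
declare function runRagPipeline(prompt: string, pageContent: string): Promise<string>;

chrome.runtime.onMessage.addListener((request, _sender, sendResponse) => {
  if (request.type === "prompt") {
    // split, embed, retrieve, and prompt the LLM in the same request context
    runRagPipeline(request.prompt, request.pageContent)
      .then((answer) => sendResponse({ answer }))
      .catch((err) => sendResponse({ error: String(err) }));
    return true; // keep the message channel open for the async response
  }
});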

The approach certainly has notable downsides (e.g. increased latency). However, I suspect this version of RAG has more undiscovered use cases and, in some cases, may be preferable (e.g. easier to operate, cheaper) to a real-time, offline architecture. This article explores the implementation of online, in-memory RAG embedding generation in Lumos.

Definitions 📋

Before we begin, it’s helpful to clarify the two highlighted terms.

  1. Online refers to the state where a procedure occurs during the normal execution of an application. In contrast, “offline” refers to the state where a procedure occurs outside the context of an application and not necessarily during the application’s runtime.
  2. In-memory refers to an application's immediate allocated memory. In contrast, an external data store (e.g. Elasticsearch) is not in-memory.

Simply browsing LangChain’s vector store documentation reveals the plethora of external data stores that can serve as vector stores.

Naive RAG In the Background 😅

The original RAG pipeline of Lumos processed all content on the current tab each time a prompt was issued by the user. There was no caching of any kind. The vector store initialized in memory was cleared after each run. The following is an old code snippet of the original implementation.

// imports shown for reference; exact paths may vary by LangChain version
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OllamaEmbeddings } from "langchain/embeddings/ollama";
import { RunnablePassthrough, RunnableSequence } from "langchain/schema/runnable";
import { StringOutputParser } from "langchain/schema/output_parser";
import { formatDocumentsAsString } from "langchain/util/document";

// split page content into overlapping documents
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: chunkSize,
  chunkOverlap: chunkOverlap,
});
const documents = await splitter.createDocuments([context]);

// load documents into vector store
const vectorStore = await MemoryVectorStore.fromDocuments(
  documents,
  new OllamaEmbeddings({
    baseUrl: lumosOptions.ollamaHost,
    model: lumosOptions.ollamaModel,
  }),
);
const retriever = vectorStore.asRetriever();

// create chain
const chain = RunnableSequence.from([
  {
    filtered_context: retriever.pipe(formatDocumentsAsString),
    question: new RunnablePassthrough(),
  },
  formatted_prompt,
  model,
  new StringOutputParser(),
]);

The design has obvious issues. If a user issued multiple prompts from the same page, the app would naively reprocess the same content for each request. Calling the Ollama embeddings API in sequence is not necessarily a fast operation. For a webpage with a lot of text, generating embeddings for all documents could take well over a minute before a response is generated from the LLM.
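As a rough back-of-envelope example (the numbers are illustrative): if a page splits into 100 chunks and each embedding call to a local Ollama server takes on the order of one second, the pre-retrieval step alone costs roughly 100 seconds before the LLM produces a single token.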

Ollama server logs

Preserving the vector store was the obvious low-hanging fruit, but there were several considerations and constraints to take into account.

Browsing Session RAG 🥳

The updated RAG pipeline in Lumos is driven by memory constraints and user behavior. It’s important to highlight the user’s behavior because it’s likely that typical usage/behavior will be the driving factor behind other online, in-memory RAG implementations.

Considerations and Constraints 🪨

  1. Chrome’s background script memory is limited (i.e. there can’t be too many documents stored in memory).
  2. Most documents will become out of date after some time. Typically, users aren’t browsing the same websites constantly, or the websites they do revisit have content that changes frequently.
  3. Documents should not be indexed if they were just indexed.
  4. Highlighted content is an exception. Highlighted content should always be indexed in a vector store for RAG retrieval.
  5. Ideally, vector store documents should be deleted after a “browsing session” is complete.

URL-Level Vector Store Caching 💵

Given the prior considerations and constraints, a URL-level vector store cache is implemented so that documents from a single URL are cached together. Tracking individual documents across all URLs is challenging because each document does not inherently have a unique identifier that is easily accessible (i.e. reproducible without ID collisions). Instead, the app relies on the URL as the ID for each document from the page. This approach immensely simplifies the logic for caching and naturally aligns with a user’s typical behavior for browsing the internet. A Map object is created in the Chrome extension’s persistent background script to store a MemoryVectorStore for each URL.

interface VectorStoreMetadata {
  vectorStore: MemoryVectorStore
  createdAt: number
}

// map of url to vector store metadata
const vectorStoreMap = new Map<string, VectorStoreMetadata>();

At the beginning of each prompt request, the background script evicts all MemoryVectorStore instances that have exceeded the configurable TTL (time to live) set for all vector stores. An aggressive TTL may seem unusual, but if we think about the types of content we consume on the internet, it’s very seldom that we constantly return to a URL containing unchanged or infrequently changed content. That being said, there are notable exceptions (e.g. Wikipedia articles).
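The TTL, along with the Ollama connection settings used when generating embeddings, comes from the extension’s user-configurable options. The option names below match the snippets in this article, but the full shape and defaults are assumptions; a minimal sketch:

// Hypothetical sketch of the user-configurable options referenced in these
// snippets; Lumos's actual options type and defaults may differ.
interface LumosOptions {
  ollamaHost: string;         // base URL of the local Ollama server
  ollamaModel: string;        // model used for embeddings and generation
  vectorStoreTTLMins: number; // TTL applied to each URL's cached vector store
}

const lumosOptions: LumosOptions = {
  ollamaHost: "http://localhost:11434", // Ollama's default address
  ollamaModel: "llama2",                // illustrative model name
  vectorStoreTTLMins: 60,               // illustrative TTL of one hour
};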

// delete all vector stores that are expired
vectorStoreMap.forEach((vectorStoreMetadata: VectorStoreMetadata, url: string) => {
  if (Date.now() - vectorStoreMetadata.createdAt > lumosOptions.vectorStoreTTLMins * 60 * 1000) {
    vectorStoreMap.delete(url);
    console.log(`Deleting vector store for url: ${url}`);
  }
});

Next, a new vector store is declared. The declared identifier is either bound to an existing vector store in memory (with all the documents already indexed) or a new MemoryVectorStore is initialized, which would then trigger the process of generating new embeddings.

// check if vector store already exists for url
let vectorStore: MemoryVectorStore;

if (!skipCache && vectorStoreMap.has(url)) {
  // retrieve existing vector store
  console.log(`Retrieving existing vector store for url: ${url}`);
  vectorStore = vectorStoreMap.get(url)?.vectorStore!;
} else {
  // create new vector store
  console.log(`Creating ${skipCache ? "temporary" : "new"} vector store for url: ${url}`);

  // split page content into overlapping documents
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: chunkSize,
    chunkOverlap: chunkOverlap,
  });
  const documents = await splitter.createDocuments([context]);

  // load documents into vector store
  vectorStore = await MemoryVectorStore.fromDocuments(
    documents,
    new OllamaEmbeddings({
      baseUrl: lumosOptions.ollamaHost,
      model: lumosOptions.ollamaModel,
    }),
  );

  // store vector store in vector store map
  if (!skipCache) {
    vectorStoreMap.set(url, {
      vectorStore: vectorStore,
      createdAt: Date.now(),
    });
  }
}

const retriever = vectorStore.asRetriever();

The retriever is used later when constructing the LangChain chain. The implementation also has functionality to skip the caching workflow completely. This is needed for the use case where highlighted content is supplied to the background script.
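To illustrate how that flag might be driven (the message fields below are hypothetical, not Lumos’s actual schema), a content script could send the user’s highlighted text with skipCache enabled so it is always freshly indexed:

// Hypothetical sketch: a content script sends highlighted text to the
// background script with skipCache enabled so it bypasses the URL-level cache.
// The message fields are illustrative, not Lumos's actual schema.
const userPrompt = "Summarize the highlighted text."; // illustrative prompt
const selection = window.getSelection()?.toString().trim() ?? "";

chrome.runtime.sendMessage({
  prompt: userPrompt,
  context: selection !== "" ? selection : document.body.innerText,
  skipCache: selection !== "", // highlighted content is never cached
});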

Next Steps 📶

After the addition of the cache, the latency of subsequent prompt requests is greatly improved. LLM response generation happens almost immediately.

Ollama server logs for 2 consecutive prompt requests

There are still many improvements left to make in Lumos’s RAG pipeline. Ideally, calls to Ollama’s embeddings API would be made in parallel. Search optimization techniques and document reranking can also be implemented. However, a real-time, offline approach is not being considered at this time. For a completely client-side application, the ease of implementing and operating an online, in-memory approach outweighs the burden of running an external data store for RAG.
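For example, one way to fan out embedding calls is to run a small pool of concurrent workers over the document texts. This helper is illustrative only (it is not part of Lumos or LangChain), and the import path may vary by LangChain version:

import { OllamaEmbeddings } from "langchain/embeddings/ollama";

// Illustrative only: fan out embedQuery() calls across a small worker pool
// instead of embedding each chunk strictly in sequence.
async function embedInParallel(
  embeddings: OllamaEmbeddings,
  texts: string[],
  concurrency = 4,
): Promise<number[][]> {
  const results: number[][] = new Array(texts.length);
  let next = 0;

  // each worker repeatedly claims the next unprocessed text
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < texts.length) {
      const i = next++;
      results[i] = await embeddings.embedQuery(texts[i]);
    }
  });

  await Promise.all(workers);
  return results;
}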

Use Cases for Online and In-Memory 🧠

What are other use cases for online, in-memory RAG?

A couple of users on Hacker News brought to my attention the use case of using Lumos with web apps for chat products (e.g. Discord, WhatsApp).

Screenshot from Hacker News

This reminded me of a funny troll I saw on Twitter about indexing Slack messages into an offline RAG pipeline (which I cannot find at the moment). Do we really need to index every Slack message into a vector store? Do we need to keep them for all time? When was the last time you searched for context from a year ago? I suppose this approach is useful if Slack has become the de facto source of truth for a company’s knowledge base. If that is the case, then there may be bigger challenges that the company is facing.

When building a RAG LLM app, consider online, in-memory RAG. If the approach does not rise to the occasion, we know there’s always Elasticsearch waiting in the shadows 🫣

References

  1. Lumos (GitHub)
  2. Local LLM in the Browser Powered by Ollama (Part 1)
  3. Local LLM in the Browser Powered by Ollama (Part 2)
  4. Supercharging If-Statements With Prompt Classification Using Ollama and LangChain (Part 4)
  5. Bolstering LangChain’s MemoryVectorStore With Keyword Search (Part 5)
  6. A Guide to Gotchas with LangChain Document Loaders in a Chrome Extension (Part 6)
