Don’t put RAG projects into production too quickly!

Philippe Prados
13 min read · Jun 3, 2024


Searching for a solution

In our consulting engagements over the past few months, we have encountered numerous projects using LLMs. The vast majority are of the RAG type. When we study these applications in depth, almost all of them exhibit significant weaknesses that require code revisions before production deployment. Nevertheless, these projects are often deployed anyway, because the problems only manifest in edge cases or under heavy load.

The issue that concerns us is the resilience of these applications. A resilient application is capable of handling all kinds of difficulties: in particular, it can be interrupted at any moment without major impact on its stability or its data. Why might an application be abruptly interrupted? (See the fallacies of distributed computing.) Reasons include an application update, a server crash, network issues, the activity of a Chaos Monkey, etc. Murphy’s law reminds us that the worst will always happen (the law of maximum trouble).

Let’s explore the different problems we encounter to understand them and propose alternative solutions.

First, a brief reminder of RAG architectures. For more details, we invite you to ask Google.

A RAG architecture is divided into two parts. The first part involves importing documents from various sources (websites, wikis, FAQs, PDFs, etc.), splitting them into more or less coherent chunks, computing a semantic vector (embedding) for each chunk, and storing these vectors in a vector database.

The second part involves taking the user’s natural language question as input, calculating the corresponding semantic vector, and then retrieving, for example, the four document fragments with very similar vectors. It is highly likely that the answer to the question is in one of these fragments. The four fragments are then injected into the prompt along with the question, and the LLM is invoked to provide an answer.
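To make this concrete, here is a minimal sketch of both parts, assuming LangChain with OpenAI embeddings, a FAISS store, and a hypothetical manual.pdf as the source; any loader, embedding model, or vector database could be substituted.

```python
# A minimal RAG sketch (assumes langchain-openai, langchain-community and
# faiss-cpu are installed; "manual.pdf" and the model names are examples).
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# Part 1: import, split, embed, store.
docs = PyPDFLoader("manual.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# Part 2: embed the question, retrieve the four closest chunks, call the LLM.
question = "How do I reset the device?"
fragments = vectorstore.similarity_search(question, k=4)
context = "\n---\n".join(doc.page_content for doc in fragments)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(ChatOpenAI(model="gpt-4o-mini").invoke(prompt).content)
```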

Now, let’s look at the various problems that a RAG project might encounter. We have identified several problematic scenarios in different applications:

  • The impact of managing different file formats from cloud storage
  • The lifecycle management of chunks
  • Simultaneous access to the application by many users
  • The implementation of the API
  • The implementation of the user interface

Demos work only once

There are hundreds of code demos showing how to set up this type of architecture. Often, they work only once without the developer realizing it. Let’s detail the scenario.

  • A list of documents is split into chunks.
  • Each chunk is stored in a vector database.
  • The user’s query is used to retrieve the four closest chunks.
  • The LLM is invoked and answers the question.

So far, so good. Then, the same code is executed again (be cautious with notebooks!).

  • The same list of documents is split into chunks (with the same values as before).
  • Each chunk is stored in the same vector database.
  • The user’s query is used to retrieve the four closest chunks. However, since each chunk is now present twice, the vector database returns the first chunk twice, then the second chunk twice. Ultimately, these four chunks (actually two distinct chunks) are injected into the prompt.
  • The LLM responds using only two chunks (not four). If the first or second chunk is enough to answer, no one notices anything. However, if the third or fourth chunk provided the answer during the first invocation, the response is no longer as good.

Thus, the application degrades more and more with each cycle. We must precisely track the lifecycle of all chunks.

This brief reminder will help us understand the impacts of duplicate vectors in a vector database.
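To illustrate, here is a sketch of how a re-run duplicates everything, assuming a persistent Chroma store (the ./chroma_db path is made up) and the chunks from the previous sketch.

```python
# Re-running this notebook cell appends the same chunks again: the persistent
# store ends up holding every fragment twice.
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma(persist_directory="./chroma_db",
                     embedding_function=OpenAIEmbeddings())
vectorstore.add_documents(chunks)   # run 1: N chunks stored; run 2: 2 x N chunks

hits = vectorstore.similarity_search("my question", k=4)
# After the second run, `hits` contains the two nearest fragments twice each,
# so only two distinct pieces of context reach the prompt instead of four.
```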

How is the diversity of file formats managed?

A framework like LangChain offers numerous Loaders: components capable of extracting a text stream from any type of file. If we look at the code a bit, the process often takes place in several stages. For example, if the file we want to handle is a CSV located in S3 object storage (AWS), it needs to be downloaded to a temporary directory before it can be analyzed. Too often, these temporary files are not deleted at the end of the process, and over time the disk becomes saturated. In development, this issue is easily overlooked. But in production?

The question to ask is: How does the temporary directory evolve with the use of the application?
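A minimal sketch of one way to keep the temporary directory under control, assuming boto3 and made-up bucket and key names: the local copy lives only as long as the parsing step.

```python
# Download an S3 object into a TemporaryDirectory so the local copy is removed
# even if parsing raises (bucket and key names are illustrative).
import tempfile
from pathlib import Path

import boto3

def load_csv_from_s3(bucket: str, key: str) -> str:
    s3 = boto3.client("s3")
    with tempfile.TemporaryDirectory() as tmp_dir:      # deleted on exit, even on error
        local_path = Path(tmp_dir) / Path(key).name
        s3.download_file(bucket, key, str(local_path))
        return local_path.read_text()                   # parse while the file still exists

text = load_csv_from_s3("my-bucket", "exports/data.csv")
```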

Import code also tends to assume that files are healthy while they are being analyzed. This is not always the case. Partial files (where the copy was interrupted) or unstable files (updated at the same time as the program is loading them into memory) are common, or the file might simply use features that the parser doesn’t recognize, causing it to crash.

Question: Does my code tolerate errors during data import?

Develop with a defensive approach: test for and tolerate these malformed files. There are several strategies. Here are a few we propose.

When the parser fails, a specific exception is usually raised. The code can then at least catch it and write a warning to the logs.

It can also move the incorrect files to a dedicated directory, somewhat like the dead-letter queues in messaging technologies.
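Here is a sketch of such a defensive import loop; the rejected/ directory and the parse callable are placeholders for whatever loader you actually use.

```python
# Log and quarantine files the parser rejects, instead of crashing the import.
import logging
import shutil
from pathlib import Path

log = logging.getLogger(__name__)
DEAD_LETTER_DIR = Path("./rejected")       # quarantine directory, like a dead-letter queue
DEAD_LETTER_DIR.mkdir(exist_ok=True)

def import_all(files: list[Path], parse) -> list:
    documents = []
    for path in files:
        try:
            documents.extend(parse(path))  # `parse` stands for your loader of choice
        except Exception:                  # malformed, truncated or unreadable file
            log.warning("Rejecting %s", path, exc_info=True)
            shutil.move(str(path), DEAD_LETTER_DIR / path.name)
    return documents
```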

Another strategy involves “committing to disk.” This means using a specific extension during the copy process and only at the last moment switching from the old version to the new one. It’s not that simple. Here’s an example implementation scenario:

  • When copying the file toto.csv, copy it under the name `toto.csv.download`.
  • When reading the file, only consider files with the *.csv extension, ignoring the *.csv.download files, which are only temporary.
  • Once toto.csv.download has finished copying, rename toto.csv to toto.csv.old, then rename toto.csv.download to toto.csv, and finally delete toto.csv.old.
  • At the beginning of the import process, search for any *.csv.old files. These are files for which the renaming process did not complete. If any are found:
      • Check whether there is a toto.csv file newer than the toto.csv.old file. If so, delete toto.csv.old.
      • If not, rename toto.csv.old back to toto.csv, effectively rolling back to the last known good file.

This scenario is viable only if the file system ensures the atomicity of the rename operation. This implies that once a file is renamed, the change is instantly visible to everyone. However, not all file systems provide this guarantee! (Be particularly cautious with FTP, for example.)

With cloud storage, files are only visible once the upload is complete, so this process is simpler.
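Here is one possible sketch of this scenario on a local file system, relying on the atomicity of os.replace(); the file names follow the toto.csv example above.

```python
# Promote a freshly downloaded file and recover from interrupted promotions.
import os
from pathlib import Path

def publish(download: Path) -> None:
    """Promote toto.csv.download to toto.csv, keeping a .old rollback copy."""
    final = download.with_suffix("")                 # toto.csv.download -> toto.csv
    old = final.with_suffix(final.suffix + ".old")   # toto.csv -> toto.csv.old
    if final.exists():
        os.replace(final, old)        # keep the previous good version
    os.replace(download, final)       # atomic switch to the new version
    old.unlink(missing_ok=True)

def recover(directory: Path) -> None:
    """On startup, clean up or roll back any promotion that did not complete."""
    for old in directory.glob("*.old"):
        final = old.with_suffix("")                  # toto.csv.old -> toto.csv
        if final.exists() and final.stat().st_mtime > old.stat().st_mtime:
            old.unlink()              # the switch completed; drop the backup
        else:
            os.replace(old, final)    # roll back to the last known good file
```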

Do you manage the lifecycle of chunks?

Another question you need to ask yourself to verify the quality of your application is: What happens during the update of chunks?

The developments we encounter are generally very naive. They assume that nothing can go wrong, that everything always works perfectly, and that “in case of a problem, it’s not a big deal, you just need to restart the process.”

What will inevitably happen in production? Temporary errors:

  • Saturation of the number of connections to the database.
  • The network can be temporarily overloaded and has no other option but to drop packets. Everything eventually returns to normal, but after several seconds without communication between components. This will trigger timeouts or significant slowdowns.
  • Middleware crashes (VM shutdowns by cloud providers, abrupt stopping of containers by Kubernetes or other orchestrators).

Unresolvable errors:

  • Incorrect file formats
  • An explosion in memory consumption. File imports typically load everything into memory, then duplicate the memory consumed by splitting each file into chunks, and usually make a third copy before injection into the vector database. If the code is not memory-efficient, good luck re-importing a large document database! (A lazy-loading sketch follows this list.)
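A more memory-frugal import loop might look like the following sketch, which assumes a LangChain loader with lazy_load() and pushes chunks to the store one document at a time.

```python
# Stream documents instead of materialising the whole corpus in memory.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

def import_file(path: str, vectorstore) -> None:
    for doc in PyPDFLoader(path).lazy_load():    # yields one page at a time
        vectorstore.add_documents(splitter.split_documents([doc]))
```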

The naive solution to these problems is simply to replay the batch. This doesn’t resolve the unresolvable errors, and in the case of temporary errors it degrades the vector database: after the new attempt, identical fragments are present multiple times. As we’ve seen, responses then degrade without anyone linking cause and effect.

Another approach is to destroy the vector database and re-import everything. Be cautious of memory usage in this case, and plan for the service being unavailable until the database is back online. This is often a bad idea, especially if you invoke a large language model several times for each document during the import: the bill will be hefty.

LangChain provides the index() API to keep track of the link between documents and their chunks (it stores a hash of each document’s content along with the vector IDs in a record database). Thus, only documents that have been modified are refreshed in the vector database: all chunks associated with a modified document are deleted, then the new chunks are injected.
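A sketch of this API, reusing the chunks and vectorstore names from the earlier sketches; the SQLite URL and namespace are illustrative.

```python
# The record manager stores document hashes so a re-run updates changed
# documents instead of duplicating them.
from langchain.indexes import SQLRecordManager, index

record_manager = SQLRecordManager("chroma/my_docs",
                                  db_url="sqlite:///record_manager.db")
record_manager.create_schema()

result = index(chunks, record_manager, vectorstore,
               cleanup="incremental", source_id_key="source")
print(result)   # {'num_added': ..., 'num_updated': ..., 'num_skipped': ..., 'num_deleted': ...}
```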

Question: Does the association between the vector database data and the traceability of the different chunks remain stable if there is a crash during the import? What to do if a crash occurs? What is the transactional guarantee of data import between the vector database and the SQL database that maintains the link between documents and vectors?

To our knowledge, there isn’t a vector database compatible with two-phase commit, which would guarantee that in case of a crash, both the vector database and the SQL database would be rolled back.

What we want is all or nothing: during import, either every chunk is committed, or none of them are.

Another similar problem arises during the saving of conversation history. The SQLChatMessageHistory class doesn’t guarantee that the entire exchange will be saved in the same SQL transaction (v0.2.0). If a crash occurs between saving the question and the answer, you end up with an unstable session (not to mention the subpar asynchronous implementation).

Pgvector seems like a promising integration avenue, but it comes with other issues that we’ll address through PRs and a forthcoming article. For instance, it doesn’t implement an asynchronous API.

Does the LLM invocation support multiple users simultaneously?

LLM projects use GPUs to predict the next words. Unlike CPUs, GPUs aren’t as easily shared, particularly with language models. These models are initially designed to handle only one inference at a time.

If you’re using managed services such as AWS Bedrock or Google Vertex AI, or the OpenAI, Claude, or Mistral APIs, you don’t have to worry about this. However, if you want to manage model hosting yourself, whether on-premise or in the cloud with dedicated GPUs, you need to consider the quality of your model’s exposure.

A naive developer might simply use an LLM locally for development. When they want to offer an API (usually via FastAPI) for the LLM, they realize that only one invocation is possible at a time. If multiple users want to use the LLM, a queue must be maintained, and if generation involves many tokens, each invocation has to finish before the first token of the next job is produced.

To improve upon this, the first idea is to create micro-batches: a few inference requests are combined without exceeding the token limit. In this scenario, the longest request in the batch must complete before the next micro-batch can be launched.

An enhancement is to inject a new request as soon as one of the requests in the micro-batch completes, a kind of streaming micro-batch (continuous batching). Model utilization then becomes much more efficient.

These strategies are complex to implement: they require careful scheduling, sometimes modifications to the model, and monitoring of memory between the CPU and GPU. Fortunately, open-source projects like vLLM handle all of this. You then deploy a model server alongside your application. Google’s Vertex AI can expose your model via vLLM (currently an older version of the vLLM API).
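As a sketch, here is vLLM’s offline API batching several prompts on one GPU; the model name is only an example, and in production you would more likely expose the model through vLLM’s OpenAI-compatible HTTP server and call it from your application.

```python
# Several prompts are submitted together; the engine schedules them on the GPU
# with continuous batching (the model name is an example).
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
params = SamplingParams(max_tokens=256, temperature=0.2)

outputs = llm.generate(["Summarise RAG in one sentence.",
                        "What is a vector database?"], params)
for output in outputs:
    print(output.outputs[0].text)
```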

Question: Does the LLM handle multiple requests simultaneously?

If not, seek elastic solutions capable of optimizing GPU usage for this kind of workload. Avoid embarking on manual development.

Do you only use the async versions of APIs?

It’s important to understand that Python is not a truly parallel language: it is incapable of executing multiple streams of bytecode simultaneously. Only one Python instruction is executed at a time, regardless of the number of cores or threads you use. This is enforced by a lock called the GIL (Global Interpreter Lock). Why this constraint? Among other reasons, Python uses reference counting to manage memory, and this strategy doesn’t cope well with concurrent updates from multiple cores and their caches.

When Python is asked to handle multiple threads, it employs two strategies. First, all APIs performing I/O release the GIL, allowing other threads to resume execution; the bytecode interpreter then switches processing flow, but still executes only one instruction at a time. Second, the interpreter times how long each thread runs: if a thread exceeds a threshold between two Python instructions, the interpreter can switch to another thread. Note that each thread consumes memory to maintain its call stack, so the number of active threads is limited by the memory needed for these stacks.
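A tiny illustration of the GIL’s effect, under the assumption of a CPU-bound task with no I/O: two threads take roughly as long as running the work twice sequentially.

```python
# Two CPU-bound threads do not run in parallel: only one executes bytecode
# at a time, so the elapsed time is about twice that of a single call.
import time
from threading import Thread

def busy(n: int = 20_000_000) -> int:
    return sum(i * i for i in range(n))

start = time.perf_counter()
threads = [Thread(target=busy) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"2 threads: {time.perf_counter() - start:.1f}s")   # ~2x one call, not ~1x
```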

To improve this and be able to handle thousands of users, Python offers the async/await framework, integrated into the language itself, which divides each async method or function into slices delimited by the keywords async and await. The compiler converts your method or function into a set of coroutines and organizes the sequence of calls. A single thread consumes a queue of coroutines, each holding the next code fragment to execute. This queue evolves as awaited results arrive: when a future is resolved (the object awaited by await is received), the next code block is added to the queue and is processed by the single thread when its turn comes. This one thread is then capable of handling thousands of users (similar to what Node.js does, for example). There’s no longer a need for a thread pool, which is necessarily limited in size.

This framework requires close collaboration from the developer: they must fulfill their part of the contract for it to work. In async code, there should be no blocking network, disk, or GPU calls, and the CPU should be relinquished as quickly as possible.
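For example (a sketch using httpx and requests as stand-ins for any I/O library):

```python
# Inside async code, I/O must be awaited, never blocking.
import httpx
import requests

async def bad(url: str) -> str:
    return requests.get(url).text          # blocks the event loop: every user waits

async def good(url: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)   # yields control while waiting for the network
        return response.text
```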

Most web servers in Python are ASGI (Asynchronous Server Gateway Interface) compatible. It’s a standardized link between the HTTP protocol and a Python application, used by frameworks like FastAPI, Starlette, and many others. The key term here is “Asynchronous.”

Here’s what to watch out for during development if your code is intended to be exposed via ASGI.

Indeed, many developers construct their processing pipelines in a notebook and then, with just a few lines, publish the code into an API. With this naive approach, only one processing task will be executed regardless of the number of users.

If you don’t adhere to the framework, a single processing stream executes each block of code, one after the other. If one of them blocks, everything stalls. Imagine a temporary network issue causing a 10-second delay during the invocation of your LLM: no user makes progress anymore, no matter how many LLM instances you’ve provisioned. The same applies if there’s an issue with the disk or any other degradation of the infrastructure.

Our advice for fully grasping the impact of such an error: choose a function from your API and add a time.sleep(120) (careful: not an await asyncio.sleep(120)), then try to use your code with multiple users, one of whom hits the patched endpoint. Make sure to use only one worker for a more instructive test (--workers 1). You might experience some cold sweats.
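A sketch of this test with FastAPI and uvicorn (the endpoint names are made up): while /frozen is running, /healthy stops answering too.

```python
# main.py: one endpoint blocks the event loop, the other should stay responsive.
import asyncio
import time

from fastapi import FastAPI

app = FastAPI()

@app.get("/frozen")
async def frozen():
    time.sleep(120)        # blocks the event loop (await asyncio.sleep(120) would not)
    return {"status": "done"}

@app.get("/healthy")
async def healthy():
    await asyncio.sleep(0)
    return {"status": "ok"}

# Run with:  uvicorn main:app --workers 1
# Call /frozen, then /healthy in another terminal: it hangs for two minutes.
```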

In practice, FastAPI is often launched with as many workers as there are vCores in the processor: as many Python instances, each with its own GIL and garbage collector, and no shared cache between them. This hides the limitation described above (two or four requests being processed in as many Python instances), but it becomes apparent as soon as the number of simultaneous users exceeds the number of vCores.

The question you need to ask yourself is: Does the code exclusively use the async APIs of all frameworks? This applies to database drivers, vector database APIs, API invocations, embedding or LLM computation invocations, and so on.

LangChain offers two sets of APIs within the same framework. This isn’t for aesthetics, but out of necessity: every method is duplicated, one version synchronous and the other asynchronous. The asynchronous version should be used systematically if your code is to be exposed via an ASGI server.
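For instance, LangChain runnables expose invoke() and its asynchronous counterpart ainvoke(); behind an ASGI server, the “a” variants should be used end to end (a sketch, assuming langchain-openai):

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

def sync_path(question: str) -> str:
    return llm.invoke(question).content      # fine in a script or a batch job

async def async_path(question: str) -> str:
    result = await llm.ainvoke(question)     # required under FastAPI and friends
    return result.content
```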

From experience, it’s rare to encounter code that gets this right, and sometimes it can be quite challenging. For example, the LangChain pgvector driver doesn’t offer an asynchronous API… (We’ve proposed a patch to make pgvector asynchronous.)

What do you use for the user interface?

Most chat-type demos use frameworks or components that allow for quickly proposing a user interface. Examples include Streamlit, Gradio, and Chainlit.

Issue: These frameworks are generally incompatible with the async/await framework, leading to the same problem as before. (Note that they also present other issues related to security, resilience, etc.)

Question: Is the user interface used compatible with async?

If not, consider hunting for other solutions, or implement the web interface yourself as a thin layer that simply invokes an API.

Why does no one notice?

As you can see, there are many bad practices that are too often found in these applications. You might recognize yourself in some of these situations. But then, why do so few developers react?

LLM projects are new and have high execution costs, so currently, they are accessible to only a small number of people. This limited access conceals many of the challenges. As these projects gain success, their code will likely need to be thoroughly revised.

If I were an MLOps engineer and unable to guarantee the resilience of these applications, I would likely refuse to put some of them into production. Given Murphy’s law, the worst-case scenario is bound to happen eventually.

It’s time to take charge of the resilience of applications.

For your generative AI projects, ask yourself these different questions before going into production:

  • How does the temporary directory evolve when using the application?
  • Does the code tolerate errors during data import?
  • Is the code protected against the addition of identical vectors?
  • Does the association between the vector database data and the traceability of different chunks remain stable if there is a crash during import?
  • Can the LLM handle multiple simultaneous requests?
  • Does the code use only the async APIs of all frameworks?
  • Is the user interface used compatible with async?

You need to pay attention to all the bugs that don’t occur during development. This might be how one becomes an “expert.”

An upcoming article will offer solutions for the LangChain framework. These solutions are not simple, as LangChain in its current state (version 0.2.0) does not allow for a resilient application! Nevertheless, we will provide you with solutions.

We hope you are not in these situations. 😉

