USE CASE: Self-hosted RAG-powered LLM solution for Confluence and Microsoft SharePoint — Part 2
Previous blog post from Self-hosted RAG-powered LLM solution for Confluence and Microsoft SharePoint series:
Part 1
Introduction
In the previous blog post in this series, we discussed the motivation behind building the pipeline mentioned in the title, the initially proposed idea, the initial architecture, and the particular caveats tied to each part of it. All of these are crucial to understanding how the pipeline operated, not just in that initial version but, in most ways, throughout the versions that followed.
In this blog, we’ll cover the first few version changes we made to the initial pipeline and the reasoning behind them. Stuff we tried and kept, as well as stuff we tried but ended up not using.
So, without further ado, let’s get started!
2nd Version — Making the UI Serverless and Moving Away from Cloud Agnosticism
The initial idea for the UI was very clear: let’s create an external Load Balancer Kubernetes service, which will serve as the singular exposed element of the pipeline to the users, through which the pipeline will facilitate all of the input prompts and return successful results.
In theory, that approach could work, but is there a better one?
Although we want the UI to be available whenever someone needs it, it does not need to be running at all times: it really only has to be up while a user is actively using it. Essentially, the UI seems tailor-made for a serverless approach.
Serverless means that you pay on a per-usage basis. Your workload still runs on the servers of the company supplying the backend services, but you pay only for what you use, not for the number of servers provisioned.
This means the cost of the service fluctuates with the number of users.
Once we established that, the next question was finding a service that provides serverless capabilities while also being able to scale when the need arises. This also meant moving away from our initial cloud-agnostic approach and choosing a cloud provider that could satisfy the current and potential future needs of the pipeline.
Based on the requirements mentioned prior and the team’s familiarity, we decided to opt for the Google Cloud Platform (GCP) as the cloud provider and its Cloud Run as the service to host our UI.
Cloud Run ticks all the boxes we needed to make our UI fully functional:
- Serverless.
- Autoscaling instances when the needs increase.
- Works with Dockerized Python applications.
Cloud Run offers two ways of working with your code:
- Cloud Run Services — your code runs continuously, responding to web requests and events.
- Cloud Run Jobs — your code runs once and quits once the work is done.
It is clear that for our purposes we needed to use a Cloud Run Service.
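To make that a bit more concrete, here is a minimal sketch of what a Cloud Run-friendly UI container can look like. Flask and the /prompt endpoint are illustrative assumptions rather than our actual UI code; the one Cloud Run-specific convention is that the container listens on the port passed in through the PORT environment variable.

```python
# app.py - minimal sketch of a containerized UI service for Cloud Run.
# Flask and the /prompt endpoint are placeholders, not the pipeline's
# actual UI implementation.
import os

from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/prompt", methods=["POST"])
def prompt():
    # In the real pipeline this would forward the prompt to the RAG backend
    # and return the generated answer; here we simply echo it back.
    user_prompt = request.get_json(force=True).get("prompt", "")
    return jsonify({"answer": user_prompt})


if __name__ == "__main__":
    # Cloud Run tells the container which port to listen on via $PORT.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", "8080")))
```

With request-based billing and no minimum instances configured, a service like this incurs effectively no cost while nobody is using the UI.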
The biggest drawback of this new approach was losing the ability to stay cloud agnostic and easily deployable to any cloud provider.
Our reasoning was that since all of the major cloud providers offer some sort of serverless service similar to GCP's Cloud Run, it should be fairly easy to replicate what we built on GCP on another cloud such as AWS or Azure. The knowledge we gathered around Cloud Run would make it easier to identify the equivalent cloud-specific service on another provider when necessary.
An additional reason was that if we were ever to deploy a similar solution for a client, they would almost certainly want to use one of the major cloud providers, and this experience would help us better understand and deliver the pipeline for their specific needs.
Once the changes were applied to the pipeline, our new architecture looked something like this:
Attempt to Host the Vector DB on Cloud Run
Once we started hosting our UI on Cloud Run and the idea of serverless hosting of services arose, we made an attempt to move our vector database to Cloud Run as well.
Just like the UI, the vector database does not need to be available at all times: only while documents are being loaded and whenever a user sends an input prompt. In practice, it would end up doing only slightly more work than the UI. The only thing it needs while not processing a request is a volume of sorts where all of the database data is stored. Besides, if Cloud Run can spin up a Dockerized web application, why wouldn't it be able to spin up a database instance as well?
The idea sounded reasonable enough. Unfortunately, those dreams were quickly shattered.
The first problem was finding a proper way to persist the state of a database, since Cloud Run Services are stateless by nature. We did find a workaround: second-generation Cloud Run Services support volume mounting, which entails mounting a Cloud Storage bucket to the Cloud Run Service so data can be persisted there. In our eyes, this essentially made the service stateful. Unfortunately, that is not the case, due to an aspect of Cloud Run Services we mentioned earlier: autoscaling instances.
The server instances allocated for incoming requests to a Cloud Run Service come and go, and not all requests are routed to the same instance. That means not all clients will see the same data, which breaks ACID compliance and makes this setup unusable for hosting a database.
This is where our efforts to use a Cloud Run Service for the vector database stopped and we decided to stick with the Kubernetes Stateful Set instead.
3rd Version — Adding the Document Hash Database
At this point, we were pretty happy with our architecture and all the components seemed to behave as expected. Then, after quite a bit of testing, we noticed that the retrieved context was not being served properly. To be more specific, after the initial load and a few scheduled runs, the context being served would simply start to degrade.
When we went to check the vector database, we quickly noticed that a large amount of the data was being overwritten and removed, with just some of the points persisting.
What we had initially failed to account for was a way for the Confluence Loader Cron Job to load only new data without overwriting existing data.
The following needed to be covered by the Confluence Loader Cron Job:
- No duplicated values should be allowed in the Qdrant vector database.
- If a Confluence page gets moved to another place in the Confluence space, the pipeline has to recognize that it is the same page which was already loaded and not load it again.
- If a Confluence page gets edited, the pipeline has to recognize that it was edited and update the chunks stored for that page.
Luckily for us, the LangChain framework provides a solution for exactly this problem: the LangChain Indexing API.
The LangChain Indexing API essentially lets you load documents from various sources into a vector database and keep them in sync. Its main purpose is to:
- Avoid writing duplicated content into the vector database.
- Avoid re-writing unchanged content.
- Avoid re-computing embeddings over unchanged chunks.
As we can see, this covers all of our loading needs, while saving the system compute time.
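As a rough sketch of how this looks in code, the whole loading step boils down to a single index() call over the loader output, a vector store, and a record manager (introduced just below). The Confluence URL and credentials, the Qdrant address, the collection name, and the embedding model here are placeholders, not our actual configuration.

```python
# Rough sketch of the Confluence Loader step with LangChain's Indexing API.
# The Confluence URL/credentials, Qdrant address, collection name and the
# embedding model are placeholders, not the pipeline's real configuration.
from langchain.indexes import SQLRecordManager, index
from langchain_community.document_loaders import ConfluenceLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant
from qdrant_client import QdrantClient

# 1. Load the Confluence pages (chunking/splitting omitted for brevity).
docs = ConfluenceLoader(
    url="https://example.atlassian.net/wiki",
    username="<user>",
    api_key="<api-key>",
    space_key="DOCS",
).load()

# 2. The Qdrant collection that holds the embedded chunks.
vectorstore = Qdrant(
    client=QdrantClient(url="http://qdrant:6333"),
    collection_name="confluence_docs",
    embeddings=HuggingFaceEmbeddings(),
)

# 3. The record manager (explained below) that remembers what was written.
record_manager = SQLRecordManager(
    "qdrant/confluence_docs",
    db_url="postgresql+psycopg2://user:password@postgres:5432/indexing",
)
record_manager.create_schema()

# 4. index() writes only new or changed chunks, so re-running the Cron Job
#    neither duplicates nor blindly overwrites existing data.
stats = index(docs, record_manager, vectorstore, cleanup="full", source_id_key="source")
print(stats)  # e.g. {'num_added': 120, 'num_updated': 0, 'num_skipped': 980, 'num_deleted': 3}
```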
The Indexing API enables this by making use of a so-called record manager, which keeps track of all the chunks written into the vector database.
The gist of how it works is that, as content is loaded into the vector database, it simultaneously gets indexed, and the information about that indexing is stored in the record manager.
For the purposes of indexing, the record manager stores a few pieces of information per chunk: a hash of the chunk's content and metadata, the namespace it was written to, and a timestamp of when it was last updated. Since there could be multiple collections in the vector database, the namespace field allows you to load the same chunks into multiple collections without marking them as duplicates.
The record manager itself is hosted as a PostgreSQL deployment in a Kubernetes Stateful Set, alongside the rest of the pipeline in the same Kubernetes cluster. The values mentioned above are stored in a single table, called upsertion_record, inside a newly created schema of the PostgreSQL instance.
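As an illustration of the namespace idea, here is a hedged sketch of two record managers sharing that same PostgreSQL instance, one per collection; the connection string and the second collection (a hypothetical SharePoint one) are placeholders.

```python
# Sketch: two record managers sharing the same PostgreSQL instance,
# one per Qdrant collection. The connection string and the second
# (SharePoint) collection are purely illustrative.
from langchain.indexes import SQLRecordManager

PG_URL = "postgresql+psycopg2://user:password@document-hash-db:5432/indexing"

# The namespace conventionally encodes the vector store and collection name,
# so identical chunks indexed into different collections are not treated
# as duplicates of each other.
confluence_rm = SQLRecordManager("qdrant/confluence_docs", db_url=PG_URL)
sharepoint_rm = SQLRecordManager("qdrant/sharepoint_docs", db_url=PG_URL)

# create_schema() creates the upsertion_record table on first use.
confluence_rm.create_schema()
sharepoint_rm.create_schema()
```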
Because the table stores hashes of the data and uses them to compare content, we decided to name this additional database the Document Hash DB, and it will be referred to as such in the rest of the text.
Adding an entirely new database to the architecture just to get rid of duplicates and unwanted content might seem like an overreaction, but this database also opens up additional uses that could be added to the future roadmap. Some of the more notable ideas include adding the concept of users to the system, keeping track of each user's history, and supporting additional sources with their own indexing, all of which will be further discussed in the final blog of the series.
With these new changes, the architecture looked like this:
Deletion Mode of the Document Hash DB
Quite often, the information that gets loaded into the vector database and indexed by the record manager becomes irrelevant or simply outdated. In such cases, choosing which information to keep, which to delete, and when, is as important to the pipeline as the quality of the data itself.
Luckily, the Indexing API provides several cleanup modes for exactly that: none, incremental, and full. Considering that in our case we want to clean up all the documents that have been removed from the Confluence knowledge base, as well as all the ones that were edited so that their replacements can be added, we opted for the full cleanup mode.
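To make the difference between the modes concrete, here is a sketch of how each one behaves when the loader is re-run, reusing the docs, record_manager, and vectorstore objects from the earlier sketch.

```python
# Behaviour of the three cleanup modes on a re-run of index(), reusing
# docs, record_manager and vectorstore from the earlier sketch.
from langchain.indexes import index

# No cleanup: duplicates are still avoided, but chunks whose source pages
# were deleted or edited away remain in Qdrant indefinitely.
index(docs, record_manager, vectorstore, cleanup=None, source_id_key="source")

# Incremental: old versions of pages that are re-loaded in this run are
# cleaned up, but pages removed from Confluence are never deleted.
index(docs, record_manager, vectorstore, cleanup="incremental", source_id_key="source")

# Full (our choice): anything not present in the current load, i.e. deleted
# pages and the stale chunks of edited pages, is removed at the end.
index(docs, record_manager, vectorstore, cleanup="full", source_id_key="source")
```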
Conclusion
That is the version we had after the first two iterations on the initial architecture. Although we were in a much better spot than at the start, there was still a way to go before we had something we could call production-worthy.
Join us in the following blog post, where we discuss our final improvements and the final version we created, and find out about some of our immediate and long-term plans for the pipeline.
References
- Service
- What is serverless computing? | Serverless definition | Cloudflare
- Cloud Run
- Is my app a good fit for Cloud Run? | Cloud Run Documentation | Google Cloud
- What is Cloud Run | Cloud Run Documentation | Google Cloud
- Cloud Storage
- What Does ACID Compliance Mean? | An Introduction | MongoDB
- Indexing | LangChain
- Deletion modes
Next blog post from Self-hosted RAG-powered LLM solution for Confluence and Microsoft SharePoint series:
Part 3
Originally published at https://www.syntio.net on October 16, 2024.