USE CASE: Self-hosted RAG-powered LLM solution for Confluence and Microsoft SharePoint — Syntio

Syntio · Published in SYNTIO · Oct 8, 2024 · 18 min read

Motivation

With the ever-growing popularity of GenAI over the course of the last two years, we decided to dip our toes in and try out this new and exciting field. Due to the current AI bubble and the staggering number of new LLM startups popping up almost every day, it seems impossible to find a niche in this oversaturated market that could make you stand out (you can read more about the oversaturation in AI in the article Has AI oversaturation already killed the opportunity to stand out?).

With that in mind, our initial idea was to cover the two following requirements:

  1. Find an industry-grade project in the LLM landscape — Take an existing idea for an LLM-powered pipeline or architecture that has proven useful for other companies dabbling in this field. This was mostly due to our initial lack of experience and our wish to rely on projects already proven by major companies.
  2. Customize it and make it our own — Build on top of that idea and find areas for improvement. Essentially, ask ourselves “Where can we, as a Data Engineering firm, build upon this idea, whilst relying on our own internal data and areas of expertise?”.

Introduction

Building on top of the two previously mentioned requirements, we landed on the following ideas:

1. Out of all the ideas available when working with LLMs, the one that pops up the most and has proven itself is the RAG method.

  • RAG, or Retrieval-Augmented Generation, is a method of further improving the output of an LLM with the use of context. Essentially, the idea is to “feed” the LLM contextual information about the current question during inference and in doing so enable it to give a more precise and accurate answer based on domain-specific information.
  • There are various ways of achieving this, but the consensus approach is to: gather domain information (from various sources), store it (most likely in a vector database), retrieve it during inference and reuse it for LLM response generation. You can see an example of this in the image below:
Simple RAG Workflow (source: Building RAG with Open-Source and Custom AI Models)

In order to better understand the RAG method, it is important to understand vector databases.

As opposed to regular relational databases, which store various collections of tables and organized sets of data in a tabular format, a vector database stores and indexes vector embeddings.

This means that for each new piece of information, the input gets split into chunks. These chunks are then turned into embeddings (vector representations of chunks) and finally stored in the vector database.

Vector databases enable fast retrieval and similarity search over these embeddings, with additional capabilities like CRUD operations, metadata filtering, horizontal scaling and the ability to work in a serverless manner.

The RAG method is a well-known problem-solver in the GenAI landscape and it can offer additional and much-needed support to regular pipelines which depend on LLM inference, due to the fact that it does not require any additional fine-tuning of the model.

The usual way of putting the RAG method to use is by creating a UI in the form of a chatbot, where users can ask questions and the LLM, with the help of context, answers them. People are familiar with chatbots and are used to them, so we decided very early on that adding a chatbot was a crucial part of the whole architecture.

Once we settled on the main idea, the next step was thinking of improvements.

2. Although RAG is a proven method, there are aspects of it that could be further improved, or areas where a different approach could be taken. Most RAG solutions base parts of their pipeline either on cloud-specific services (e.g. the Vertex AI platform on GCP) or on external services (e.g. OpenAI for running the models themselves) during their runtime. We wanted to change that and make the whole system as cloud-agnostic as possible. With that in mind, we decided to propose the following changes:

  1. Self-host the pipeline on Kubernetes — create a Kubernetes cluster where the pipeline will be hosted. This will in turn make the whole system a lot more flexible deployment-wise, meaning it could be hosted on all the major cloud providers, as well as locally.
  2. Self-host an open-source LLM — most pipelines simply rely on calling an existing OpenAI model. We wanted to change that by making use of an open-source model and deploying its image as a service, as part of our aforementioned Kubernetes cluster.
  3. “Feeding” the LLM with our Internal Knowledge Base (“The Data Engineering Gold Mine”) — even though the whole idea of the RAG method is feeding context to the LLM, Syntio’s internal knowledge base (more specifically, our collection of Confluence spaces) is not just any domain context. It carries the in-depth information needed for mastery of the Data Engineering field, alongside a copious amount of information about various DevOps, SysOps and Software Engineering tools and frameworks. It was created by our very own engineers and is constantly being maintained. This makes the proposed system incredibly valuable not just for incoming Data Engineers, but for engineers in general who might use the chatbot.

Although the proposed solution might not seem groundbreaking, it felt like a good start for getting to know the LLM ecosystem, all the while building a RAG-powered LLM solution that is further improved by the level of our expertise in the area of deployment management and the known quality of our data.

We will now go over the initial development plan and the proposed architecture we started with. Once you get a good idea of how this was planned to function, the following blogs will explain the enhancements and changes we made to the architecture and the code to reach the high speed at which it operates now, as a finished product.

Architecture

The central component of the developed pipeline is the LLM itself. For the purposes of this solution, the initial LLM serving framework in our case was HuggingFace’s TGI — Text Generation Inference, and the model we used was the 7-billion-parameter Mistral model, more specifically mistralai/Mistral-7B-Instruct-v0.2 (mistralai/Mistral-7B-Instruct-v0.2 · Hugging Face).

The decision to focus on this model came down to several factors: high performance, efficient inference and suitability for real-time applications, but most importantly its compliance with GDPR (General Data Protection Regulation), making it the perfect model for enabling our solution in other companies’ ecosystems without compromising their sensitive data. Even though the initial idea was to build this pipeline on top of our own data, we also wanted to leave the option of making the pipeline production-grade if possible, which is why the choice of model was crucial.

The TGI framework was initially chosen on account of it addressing the core challenges of deploying LLMs, such as resource optimization and scalability. Pair that with the fact that it has extensive documentation online, and it made for the perfect framework to guide us through the initial part of our LLM journey.
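To give a concrete picture of how a service in the pipeline talks to the self-hosted model, here is a minimal sketch of querying a TGI instance over its REST API; the in-cluster service URL and the generation parameters are placeholders rather than the values from our actual deployment.

```python
import requests

# Hypothetical in-cluster address of the TGI service hosting Mistral-7B-Instruct.
TGI_URL = "http://llm-service:8080/generate"

def generate(prompt: str, max_new_tokens: int = 256) -> str:
    """Send a prompt to TGI's /generate endpoint and return the generated text."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.2},
    }
    response = requests.post(TGI_URL, json=payload, timeout=120)
    response.raise_for_status()
    return response.json()["generated_text"]

if __name__ == "__main__":
    # Mistral-Instruct expects the [INST] ... [/INST] prompt format.
    print(generate("<s>[INST] What is Retrieval-Augmented Generation? [/INST]"))
```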

The project’s architecture was designed to support two distinct flows, each serving a specific purpose:

  1. Storing the data needed for context in the vector database (Datastore Flow).
  2. Taking questions from the chatbot end users, combining them with the context, running inference on the LLM and returning the answer (Prompt Flow).

These flows work in tandem to provide a comprehensive and efficient solution.

Datastore Flow

The Datastore Flow is responsible for creating and populating the datastore, more specifically the vector database. This essentially lays the foundation for the entire system to function in a RAG manner, by feeding the system much-needed context. The architecture of the flow can be seen in the image below:

Initial Datastore Flow

The Datastore Flow consists of:

  • Kubernetes Job used for loading the data from Confluence to the vector database.
  • Kubernetes Stateful Set, hosting an instance of the vector database itself.
  • Connector to the internal Confluence spaces (this is not a standalone component, more of an integral part of the Kubernetes Job used for loading the data from Confluence)

A quick run-through of how the flow works:

  1. Initially, the Confluence Loader job creates a connection to the appropriate Confluence spaces, using the Confluence API token, where we want to pull the data from.
  2. The newly created connection then pulls all the data from those Confluence spaces in the form of documents.

1 document = 1 page in a Confluence space

  3. These documents then get split into chunks.
  4. Once split into chunks, these chunks get turned into embeddings.
  5. These embeddings then get pushed into the appropriate vector database collection, where they will be stored permanently and used for context during the Prompt Flow.
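For illustration, the five steps above can be sketched with the LangChain framework (which we also used for the chunking strategies described below). The Confluence URL, space key, Qdrant address and chunk sizes are placeholders, and the exact ConfluenceLoader arguments vary between LangChain releases, so treat this as a sketch rather than the actual job code.

```python
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import ConfluenceLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Qdrant

# Steps 1-2: connect to the Confluence space and pull every page as a document.
loader = ConfluenceLoader(
    url="https://example.atlassian.net/wiki",   # placeholder Confluence URL
    username=os.environ["CONFLUENCE_USER"],
    api_key=os.environ["CONFLUENCE_API_TOKEN"],
    space_key="ENG",                            # placeholder space key
)
documents = loader.load()

# Step 3: split each page into chunks small enough for the embedding model.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# Steps 4-5: let the vector store embed the chunks and persist them in Qdrant.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
Qdrant.from_documents(
    chunks,
    embeddings,
    url="http://qdrant:6333",                   # placeholder in-cluster Qdrant address
    collection_name="confluence",
)
```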

That is essentially the entire gist of the Datastore Flow. There are a few additional caveats that are necessary to cover in order to better understand how this works.

Chunking strategies

When we talk about chunking any large amount of data (like a large paragraph, an entire document, etc.), we need to consider the different types of chunking strategies. These strategies decide the size and number of chunks that a loaded piece of data can create.

They are necessary for the pipeline, since they enable the embedding model to create vector representations of the text itself. Due to the input-size limitations imposed by most embedding models, they constitute a fundamental piece of the RAG pipeline.

Chunking strategies are numerous and by relying on the Langchain framework we were able to test out a few: RecursiveCharacterTextSplitter, NLTKTextSplitter, MarkdownTextSplitter and SpacyTextSplitter.

After some initial testing, the NLTKTextSplitter and the SpacyTextSplitter, being token-based splitters (chunkers), proved to be too granular for the task at hand. The decision was to proceed only with the RecursiveCharacterTextSplitter and the MarkdownTextSplitter, since they worked in a more suitable manner, splitting the text while trying to keep related pieces next to each other, each with its own way of splitting:

  • RecursiveCharacterTextSplitter — recursively splits text on a list of user defined characters, in our case using the recommended default characters: "\n\n", "\n", " ", "".
  • MarkdownTextSplitter — splits the text along Markdown-formatted headings.
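To make the difference between the two splitters concrete, here is a small illustrative example on a toy Markdown page; the chunk sizes are arbitrary and chosen only for demonstration.

```python
from langchain.text_splitter import MarkdownTextSplitter, RecursiveCharacterTextSplitter

page = (
    "# Kafka basics\n\n"
    "Kafka is a distributed event streaming platform.\n\n"
    "## Topics\n\n"
    "A topic is an append-only log that producers write to and consumers read from."
)

# Splits on "\n\n", "\n", " ", "" in order, trying to keep related text together.
recursive = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=10)
print(recursive.split_text(page))

# Splits along Markdown-formatted headings, so each section stays in its own chunk.
markdown = MarkdownTextSplitter(chunk_size=80, chunk_overlap=10)
print(markdown.split_text(page))
```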

HuggingFace — Embedding Model

In order to store vector representations of unstructured data (such as the data stored on Confluence, but also audio, images, videos, etc.) and use vector databases, one must use an embedding model. An embedding model is nothing more than an ML model which creates those vector representations from the input data. These embedding models are pretrained, so they can simply output a vector representation for any given input when needed.
In the case of our pipeline, it works exactly like that, except that we don’t perform the step of creating embeddings ourselves; rather, we pass the embedding model as an argument when initializing the vector database connection. The database then creates the embeddings for the input data (in our case, our newly created chunks) itself.

When it comes to choosing an embedding model, the choices are endless. During our testing, we mostly based our decision on the Massive Text Embedding Benchmark (MTEB) leaderboard (https://huggingface.co/spaces/mteb/leaderboard). In short, this leaderboard provides a comprehensive evaluation of various embedding models across multiple tasks and datasets. In the context of a RAG system, like the one we were building, the 'Retrieval' aspect of the MTEB was the most important. It assesses how well an embedding model can fetch relevant documents from a large corpus, which directly correlates to its utility in a RAG system. Higher retrieval scores mean a model is better suited for a system where the accuracy and relevance of retrieved information is critical.

Studying the leaderboard and seeing how it fluctuated over the span of a week led us to narrow the search for the embedding model down to the following four candidates:

These were deemed to be potentially the best for our use case, based on the size of the Mistral model itself and the vector size of the embeddings they produce.

As is visible from their respective links, these were all taken from HuggingFace.

HuggingFace is an enormous online AI community, serving essentially as a repository of models, datasets and different applications, all applicable to AI. Their selection of models is considered among the best by industry standards, which is why it served as a starting point when looking for the perfect embedding model for our pipeline.

To test these, we used our own specialized set of questions, tailored to our internal knowledge base. The metrics we followed were simply the time to reply and the accuracy of the answer.

Based on the testing we did, the model that stood out the most for us, being the most reliable and accurate when answering the questions while also taking the least amount of time, was the all-mpnet-base-v2 model.

It is important to note that this same embedding model is used both for storing the context in the vector database and for retrieving it when it is needed during inference.
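As a minimal sketch of that point, the same all-mpnet-base-v2 model can be loaded once and reused on both sides: when chunks are written to the database and when a user prompt is embedded for the similarity search (shown here through LangChain's HuggingFaceEmbeddings wrapper).

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# One embedding model instance, shared by the write path and the read path.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Datastore Flow: embed document chunks before storing them in the vector database.
chunk_vectors = embeddings.embed_documents(["Kafka is a distributed event streaming platform."])

# Prompt Flow: embed the user question with the same model before the similarity search.
query_vector = embeddings.embed_query("What is Kafka?")

print(len(chunk_vectors[0]), len(query_vector))  # both vectors are 768-dimensional
```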

Qdrant — Vector Database

When it comes to vector databases, similarly to embedding models, there are numerous options. A lot of that has to do with the recent boom in GenAI popularity, which has spawned a lot of companies specializing in creating their own, “better” version of a vector database.

Due to those reasons, we decided that the best option when choosing a vector database would be to focus on one that has proven to be a good candidate for the job at hand. This led us to the following page: Vector DB Comparison. This page gives a comparison between most of the currently available solutions for one's vector database needs.

Based on this table, the opinions of experts on various tech forums, the sheer amount of documentation and the trust the AI community and its investors have built in the product, we decided to opt for Qdrant.

Qdrant is a Rust-based vector database which stores its vector embeddings as so-called points in collections. It comes with its own UI, where you can easily view your data through the console tab (where you write requests to fetch your data) or through the collections tab (where your data is displayed as objects which can be viewed in more detail), as well as learn more about how to query it by viewing the tutorial tab. This made Qdrant easy to navigate in the early stages, and we faced no hiccups the more we used it.
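As a rough illustration of the points-and-collections model, the snippet below uses the qdrant-client Python package to create a collection, upsert a point and run a similarity search. The collection name, the in-cluster address and the 768-dimensional dummy vectors are assumptions made for the sake of the example.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://qdrant:6333")  # placeholder in-cluster address

# A collection holds points: an id, a vector and an optional payload with metadata.
client.recreate_collection(
    collection_name="confluence",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

client.upsert(
    collection_name="confluence",
    points=[PointStruct(id=1, vector=[0.05] * 768, payload={"page": "Kafka basics"})],
)

# Similarity search: return the points whose vectors are closest to the query vector.
hits = client.search(collection_name="confluence", query_vector=[0.04] * 768, limit=3)
print([hit.payload for hit in hits])
```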

Welcome page of the Qdrant UI

Prompt Flow

The Prompt Flow focuses on handling user prompts and generating appropriate responses based on the information stored in the datastore. This essentially covers the entire LLM inference process, as well as ensuring the inference is context-based when needed. The architecture of the flow can be seen in the image below:

Initial Prompt Flow

The Prompt Flow consists of:

  • Four Kubernetes Services
    - One hosts the UI and propagates the user requests to the rest of the pipeline.
    - One generates an embedding from the user prompt and searches for similar embeddings in the vector database (VectorDB Service).
    - An instance of the LLM itself (Mistral).
    - The central component which communicates with the other three Kubernetes services and propagates requests or responses to each of them (QuestionAPI).
  • Kubernetes Stateful Set, hosting an instance of the vector database itself (the same one mentioned in the Datastore Flow).

A quick run-through of how the flow works:

  1. Initially, the user inputs the question prompt through the UI chatbot.
  2. The UI propagates the prompt to the Question API service.
  3. The Question API then propagates a new request to the VectorDB service to retrieve context.
  4. The VectorDB service generates the embedding of the initial user prompt and sends a request to perform a similarity search for similar embeddings in the Qdrant database.
  5. The Qdrant database returns the most favorable matches, if it finds any useful ones.
  6. The VectorDB service returns the favorable matches, in the form of strings, to the Question API.
  7. The Question API then combines the returned context and the initial prompt and propagates it to the LLM.
  8. Once the LLM returns a response, the Question API service then finally propagates the response to the UI.
  9. The UI then displays the response to the user, based on the initial prompt and context.

It is important to note here that the entire pipeline can function without context, meaning steps 3 to 6 can be skipped if the user chooses to query the LLM directly, i.e. the bare open-source Mistral model.

This is ONLY recommended for questions where domain knowledge is not useful for the answer.
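To tie the numbered steps together, here is a hedged sketch of the Question API logic; the service endpoints, payload shapes and prompt template are illustrative assumptions rather than the exact implementation.

```python
import requests

VECTORDB_URL = "http://vectordb-service:8000/context"  # hypothetical in-cluster endpoints
TGI_URL = "http://llm-service:8080/generate"

def answer(question: str, use_context: bool = True) -> str:
    """Optionally fetch context from the VectorDB service, then infer the LLM via TGI."""
    context = ""
    if use_context:
        # Steps 3-6: ask the VectorDB service for chunks similar to the question.
        resp = requests.post(VECTORDB_URL, json={"question": question}, timeout=30)
        resp.raise_for_status()
        context = "\n".join(resp.json().get("chunks", []))

    # Step 7: combine the retrieved context and the user question into a single prompt.
    prompt = (
        "<s>[INST] Use the following context to answer the question.\n"
        f"Context:\n{context}\n\nQuestion: {question} [/INST]"
    )

    # Steps 7-8: run inference on the self-hosted Mistral model through TGI.
    llm_resp = requests.post(
        TGI_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": 512}},
        timeout=120,
    )
    llm_resp.raise_for_status()
    return llm_resp.json()["generated_text"]
```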

That is, in short, how the Prompt Flow works. As with the Datastore Flow, there are additional caveats which require further explanation to fully grasp the technology behind this.

FastAPI + Uvicorn — Question API & VectorDB Service

The core idea of the pipeline is essentially to combine the user prompt with the domain context and run inference on the LLM. This could technically be done with only two Kubernetes Services:

  • One that receives user requests, fetches context, runs LLM inference and displays the answer.
  • One that only hosts an instance of the Mistral LLM.

Since that would require the UI Kubernetes Service to become a lot more complicated than it needs to be, while juggling multiple tasks at once, we decided to opt for a more microservice-oriented approach. This meant splitting the aforementioned UI component into three separate Kubernetes Services:

  • One that handles the user prompts and displays the answers (baseline version of the UI)
  • One that handles fetching the context (VectorDB Service)
  • One that handles coordination between the Kubernetes services, as well as running LLM inference with or without context (Question API)

The two newly added components become fully-fledged Kubernetes Services. In order to make them communicate with each other, each becomes its own Dockerized API application.

Since all of the Kubernetes Services (apart from the LLM itself) are written in Python, the best option for building the APIs was the FastAPI framework.

FastAPI is a high-performance web framework for building APIs in Python. The reason it performs better than other Python frameworks which could be leveraged for building APIs, like Flask for instance, is that it is an ASGI (Asynchronous Server Gateway Interface) framework, based on the Starlette ASGI microframework. FastAPI is essentially a sub-class of Starlette, which means it cannot be faster than Starlette, but it does add additional features, like using the Pydantic Python library for data validation and serialization. This alone can save engineers quite a bit of time when developing, which is why it is considered the much more favorable option of the two and has become an industry standard in itself.

The reason why Starlette is considered so fast is that it relies on the Uvicorn server. Uvicorn is a high-performance, minimal, low-level ASGI server for asynchronous frameworks in Python. It stands out as the standard for most use cases which require a server for Python applications, due to it being the fastest and most reliable option.

You can look at it this way: Uvicorn is the basis of Starlette and Starlette is the basis of FastAPI.

That, in a sense, is the reason why FastAPI has gained so much popularity over the last few years and why it caught our eye for the pipeline we were trying to build.

Apart from that, FastAPI’s ease of use stands as a bonus and a warm welcome for engineers who have only just started working with web frameworks in Python, sparing them unnecessary confusion from the get-go.

VectorDB Service Startup
Question API Startup

When a FastAPI application is starting up, it essentially runs a Uvicorn server on the designated port. That’s what we mean when we say: Uvicorn is the basis for all of this.
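As a minimal, illustrative example of such a startup, the sketch below shows a FastAPI application standing in for the VectorDB Service (with a hypothetical /context endpoint) being served by Uvicorn on a designated port.

```python
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="VectorDB Service")

class ContextRequest(BaseModel):
    question: str

@app.post("/context")
def get_context(request: ContextRequest) -> dict:
    # In the real service this would embed the question and query Qdrant;
    # a stub response stands in for the similarity search here.
    return {"chunks": [f"context relevant to: {request.question}"]}

if __name__ == "__main__":
    # Starting the FastAPI app means running a Uvicorn server on the designated port.
    uvicorn.run(app, host="0.0.0.0", port=8000)
```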

Streamlit — LLM UI

When it came time to decide what type of UI we wanted to build, we knew we did not want to spend too much time racking our brains on setting up a simple and straightforward UI application:

  • We knew we wanted a simple chatbot.
  • We knew we wanted something Python-supported.
  • And we knew we wanted something which is easily runnable in a Dockerized environment.

That’s when we stumbled onto Streamlit.

Streamlit is an open-source Python library built for creating simple UIs in Python, with a fraction of the code that the usual web development libraries require. With Streamlit, you can build a UI in a few lines of code. All you have to do is import the Streamlit library and just start adding the elements you need in your application.

Streamlit offers a wide range of widgets, covering everything your standard web application might need. Apart from the standard selection of web elements (checkboxes, radio buttons, toggles, etc.), Streamlit also offers the ability to create custom components, meaning any element written in HTML can easily be added to Streamlit’s UI.
It also offers the ability to store values for the various elements in a session-based key-value store called st.session_state. This way, Streamlit easily performs transitions on the UI based on the current state of the elements for the current session.
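As a small sketch of how such a chatbot comes together, the example below keeps the conversation history in st.session_state and uses Streamlit's chat widgets; the call to the Question API is replaced with a placeholder answer.

```python
import streamlit as st

st.title("Knowledge Base Chatbot")  # placeholder title

# Keep the conversation history in the session state between reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Re-render the conversation so far on every rerun.
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.write(message["content"])

if prompt := st.chat_input("Ask a question"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    # In the real UI this answer comes from the Question API service.
    answer = f"(placeholder answer to: {prompt})"
    st.session_state.messages.append({"role": "assistant", "content": answer})
    with st.chat_message("assistant"):
        st.write(answer)
```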

Apart from the fact that it offers all that we needed for the simple UI we wanted to create, it has also gained quite a bit of popularity in the GenAI community for the exact reasons mentioned above: it is the best Python library for creating simple and easily usable UIs.

Based on the aforementioned information, that should come as no surprise.

Simple Streamlit UI Chatbot

General Architecture

Now that we have described both flows of our architecture and covered all the details, let's combine them into a single architecture and see where they intertwine. The general architecture of the pipeline can be seen in the image below:

Initial General Architecture

From the image above, we can see that the only intertwining point of the two flows is the Qdrant vector database, but even that is used for different purposes in the two flows:

  • Datastore Flow — writes into Qdrant’s collections to store the documents coming from Confluence.
  • Prompt Flow — reads Qdrant’s collections in order to perform the vector similarity search and find relevant vectors for contextual purposes.

Considering that these two flows don’t necessarily impact each other directly, they can work in the manner that would be optimal for them:

  • Datastore Flow — doesn’t need to be running all the time; it can just periodically, every 12 hours or so, add any new information it finds on Confluence and store it in the Qdrant database; for that reason, we decided to run this flow in a scheduled manner.
    That’s why the most important part of the Datastore Flow, the Confluence Loader, can work as a scheduled Kubernetes Job.
    To be more precise, it actually consists of two separate Kubernetes Jobs:
    ▹ One for the initial load, which is performed immediately once the entire pipeline is deployed to Kubernetes and finishes when done.
    ▹ Another one which works as a cron job and, on a schedule, checks for any updates to the Confluence spaces we follow and propagates those updates to the Qdrant database.
  • Prompt Flow — needs to be running all the time; the idea is for the chatbot UI to be up-and-running whenever a user might need to ask it a question; for that reason, it operates in a continuous manner.
    ▹ That’s why all of the Prompt Flow Kubernetes Services are up-and-running at all times.
    ▹ The Kubernetes cluster is scaled up when initially deployed and the idea is for it to stay deployed as such until it is terminated.

Conclusion

That is the initial architecture and pipeline we proposed for our RAG-powered solution. Although this solution seemed quite mature at the time, we still had a long way to go until we could reach something that could be considered production-grade, with various tweaks and improvements still waiting to be found.
So that's the idea for this blog series: going over how we improved the pipeline and with what, while also mentioning the additions we tried but that did not prove useful at the time. Ultimately, that will lead us to the final version of the architecture we developed.

References

Next blog posts from the Self-hosted RAG-powered LLM solution for Confluence and Microsoft SharePoint series:
Part 2 and Part 3

Originally published at https://www.syntio.net on October 8, 2024.
