USE CASE: Self-hosted RAG-powered LLM solution for Confluence and Microsoft SharePoint — Part 3 — Syntio
Previous blog posts from the Self-hosted RAG-powered LLM solution for Confluence and Microsoft SharePoint series:
Part 1 and Part 2
Introduction
In the previous blog of this series, we discussed the first few version changes we made to the initial pipeline and the reasoning behind them. We dove into why we switched from a cloud-agnostic approach to a more cloud-specific one, and why we believe that is the right path toward easier development of custom solutions in the future. We also talked about how we added a whole new PostgreSQL database instance to solve the problems of indexing and data duplication in the context database.
In this blog, we'll cover the final changes we made to the pipeline, not just from an architectural standpoint but also the tweaks we made to the LLM serving frameworks we use. We'll finally introduce the SharePoint connector we've been teasing in the title all along, wrap up this journey, and outline our plans for the future.
Without much delay, let’s dive in!
4th Version — Encapsulating the Pipeline inside a VPC Network
Once we had fixed the problems with properly loading additional data into the pipeline, we started to reconsider the security aspect of our deployment.
Although our deployment was hosted entirely on GCP, with most of it running on a Kubernetes cluster, we wanted to find a way to enclose the entire pipeline so that unwanted traffic could not even reach any of our endpoints.
What we wanted to achieve was to move the deployment's traffic into a private network that performs none of its communication across the public internet. That's where we landed on the idea of enclosing the deployment in a Virtual Private Cloud, or VPC. A private VPC also lets us shrink the network we use, dropping regions that are of no use to us by removing unnecessary subnets.
This VPC network would not only give us the security of keeping traffic within a strictly private setting but also enable the following:
- Lower latency of requests/responses between the various components.
  - This in turn enables near real-time logging.
- Although it closes off the architecture from the public internet, it still provides global scope across regions.
  - Meaning that regardless of the zone or region where you deploy, the pipeline stays enclosed within the VPC network.
  - This is particularly useful when you want to deploy the system in a region that does not have high availability of GPU machines, making it easy to switch to a neighboring region/zone.
- Simplified management.
  - E.g. a security policy can be applied on a global scale to the entire pipeline.
Setting up a VPC could also serve as the starting point for connecting external databases to the LLM pipeline through a Cloud VPN connector.
This would make it easy to connect on-prem databases to the proposed pipeline and use their data for context, much like the Confluence connector does.
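To make this concrete, here is a minimal sketch of how such a custom-mode VPC and a single subnet could be created with the google-cloud-compute Python client. The project ID, resource names, region, and CIDR range below are placeholders rather than the values we actually use, and in our case the network is provisioned as part of the deployment's infrastructure setup rather than an ad-hoc script.

```python
# Minimal sketch: custom-mode VPC plus one subnet (placeholder values throughout).
from google.cloud import compute_v1

PROJECT_ID = "my-gcp-project"  # placeholder

# Custom-mode network: no auto-created subnets, so unused regions stay empty.
network = compute_v1.Network(name="llm-pipeline-vpc", auto_create_subnetworks=False)
compute_v1.NetworksClient().insert(
    project=PROJECT_ID, network_resource=network
).result()

# One subnet in the region where the GPU machines and the Kubernetes cluster live.
subnet = compute_v1.Subnetwork(
    name="llm-pipeline-subnet",
    ip_cidr_range="10.10.0.0/20",
    network=f"projects/{PROJECT_ID}/global/networks/llm-pipeline-vpc",
    private_ip_google_access=True,  # reach Google APIs without public IPs
)
compute_v1.SubnetworksClient().insert(
    project=PROJECT_ID, region="europe-west3", subnetwork_resource=subnet
).result()
```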
Given all of these benefits, the decision to adopt a VPC network really seemed like the only logical move at that point. With the addition of the VPC network, our architecture was upgraded to the following look:
Adding Support for the vLLM Framework
Around the time the VPC network was added, we noticed quite a buzz brewing in blog posts and AI media about a new framework for deploying and serving LLMs, called Versatile Large Language Model (vLLM).
vLLM is an open-source LLM framework that promises a more efficient way of serving and using LLMs. It promises various benefits compared to other frameworks, such as Text Generation Inference (TGI), which we use:
- 3.5x higher throughput than TGI.
- Near-optimal memory usage, with under 4% of space wasted.
- Fewer GPUs needed to achieve the same amount of output.
  - Tie these features together and you get reduced inference costs, which is the end goal when using such a framework.
The way vLLM achieves this is by leveraging a completely new memory-allocation algorithm of its own, named PagedAttention.
To give a brief overview of how PagedAttention came to be: most inference engines used for serving LLMs make effective use of only about 30% of the available GPU memory. Inspired by the concept of paging, which operating systems use to swap out parts of memory that are not actively used and reclaim them for extra space, a group of researchers set out to create the PagedAttention algorithm.
The main memory-allocation issue they identified when working with LLMs was that incoming requests of variable size were each allocated a fixed, contiguous block of space. The key source of the resulting memory waste was external fragmentation: when a fixed-size memory block does not match the request's sequence length, unused gaps of memory are left between blocks.
The PagedAttention algorithm set out to solve that very problem and, according to the authors' results, pushed memory utilization to above 95%, way above the initial 30%. This in turn leads to the aforementioned reduced costs, as well as to much higher throughput in general.
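To get a feel for where those utilization figures come from, here is a toy, made-up calculation of reservation waste. It is not vLLM's actual allocator, just a comparison between an engine that reserves a fixed, contiguous maximum-length slot per request and one that hands out small fixed-size blocks on demand, PagedAttention-style:

```python
# Illustrative only: back-of-the-envelope KV-cache utilization with fixed,
# contiguous per-request reservations vs. small paged blocks. Numbers are made up.

MAX_SEQ_LEN = 2048   # each request pre-allocates space for this many tokens
BLOCK_SIZE = 16      # PagedAttention-style block size, in tokens

request_lengths = [310, 1250, 75, 640, 980]  # actual tokens per request

used = sum(request_lengths)

# Contiguous allocation: every request reserves MAX_SEQ_LEN tokens up front.
contiguous_reserved = len(request_lengths) * MAX_SEQ_LEN

# Paged allocation: each request only reserves whole blocks as it grows.
paged_reserved = sum(-(-n // BLOCK_SIZE) * BLOCK_SIZE for n in request_lengths)

print(f"tokens actually used:   {used}")
print(f"contiguous reservation: {contiguous_reserved} ({used / contiguous_reserved:.0%} utilization)")
print(f"paged reservation:      {paged_reserved} ({used / paged_reserved:.0%} utilization)")
```

The exact figures depend entirely on the request mix, but the shape of the result, roughly 30% utilization for fixed contiguous reservations versus close to 100% for small blocks, is the effect PagedAttention exploits.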
If you wish to learn more about PagedAttention, check out the official paper written by the creators of vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention.
All of this sounded amazing so we decided to test it out for ourselves.
We noticed an immediate drop in memory usage on our deployed machines. Although inference itself was only somewhat faster, we realized that in a case like ours, with a smaller number of users, the difference would not be very noticeable right now and would only become visible at a larger scale.
For those reasons, we decided to leave both frameworks available during deployment and see later on how each of them behaves once the number of users skyrockets.
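For reference, this is roughly what generating a completion through vLLM's offline Python API looks like. The model name, prompt, and sampling settings here are illustrative; in the pipeline itself the model is served behind an endpoint that the RAG chain calls, rather than being invoked like this.

```python
from vllm import LLM, SamplingParams

# Load the model once; vLLM manages the KV cache with PagedAttention under the hood.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2", gpu_memory_utilization=0.90)

sampling = SamplingParams(temperature=0.2, max_tokens=256)
outputs = llm.generate(
    ["[INST] In two sentences, what is a VPC network? [/INST]"],
    sampling,
)
print(outputs[0].outputs[0].text)
```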
Final Version — Adding the SharePoint Loader and Quantization Methods
Once we had encapsulated the pipeline in a VPC network, added an additional LLM serving framework, and ensured that the entire deployment was loading data successfully in an incremental manner, we wanted to turn our attention to other sources, and then analyze the pipeline in more detail to find small ways to speed up LLM inference.
Addition of the SharePoint Loader
We figured that adding new sources would be the best way to demonstrate how seamlessly the proposed solution could integrate with a prospective company's systems. When looking for sources to cover, we looked at our own case first: "Apart from Confluence, what is another online service we use to store sensitive/general/useful information tied to the work we do?". The answer was very clear: Microsoft SharePoint.
Just like at many other firms, a lot of our internal information is stored on SharePoint, across the firm's numerous SharePoint sites. The information stored there comes in various file formats, which means it would be impossible to write a parser for every existing file extension, since SharePoint allows most of them to be uploaded. So when it came time to decide which files to parse from SharePoint, we landed on the ones most important to us:
- PowerPoint files (.pptx extension)
- Microsoft Word files (.docx extension)
- PDF files (.pdf extension)
- Simple text files (.txt extension)
Although this might seem like a big undertaking at this point in the development, we were able to get a SharePoint Loader up and running fairly quickly, thanks to the easily achievable connection to SharePoint sites through Python, the countless existing parsers for the file extensions mentioned above, and the fact that we made adding new loaders to the deployment as modular as possible.
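As an illustration of how approachable that connection is, here is a minimal sketch using the Office365-REST-Python-Client package to list and download files from a document library. The site URL, app credentials, and library name are placeholders; the real loader goes on to parse each downloaded file and push the resulting chunks into the context database.

```python
# Minimal sketch of pulling supported files from a SharePoint document library.
from office365.runtime.auth.client_credential import ClientCredential
from office365.sharepoint.client_context import ClientContext

SITE_URL = "https://example.sharepoint.com/sites/engineering"  # placeholder
ctx = ClientContext(SITE_URL).with_credentials(
    ClientCredential("app-client-id", "app-client-secret")     # placeholder credentials
)

library = ctx.web.lists.get_by_title("Documents")
files = library.root_folder.files.get().execute_query()

SUPPORTED = (".pptx", ".docx", ".pdf", ".txt")
for f in files:
    if f.name.lower().endswith(SUPPORTED):
        with open(f.name, "wb") as local_file:
            f.download(local_file).execute_query()  # fetch the raw file content
        print(f"downloaded {f.name}")
```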
This newly added SharePoint Loader works in the same manner as the Confluence Loader, meaning it consists of two Kubernetes Jobs:
- One for the initial load from the SharePoint site of choice, which runs immediately once the entire pipeline is deployed to Kubernetes and terminates when done.
- Another that runs as a cron job, periodically checking for any updates on the chosen SharePoint site and propagating them to the Qdrant database (a sketch of this update step follows below).
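The core of that scheduled update step boils down to upserting re-embedded chunks under deterministic IDs, so that changed documents overwrite their old entries instead of duplicating them. Below is a simplified sketch; the Qdrant URL and collection name are placeholders, and the SharePoint change check and the embedding model are passed in rather than shown.

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct


def push_updates(changed_docs, embed, client, collection="sharepoint_context"):
    """Upsert updated SharePoint documents into the Qdrant context collection.

    `changed_docs` is an iterable of {"path": str, "chunks": [str, ...]} dicts
    produced by the (not shown) SharePoint change check; `embed` maps a text
    chunk to its embedding vector.
    """
    for doc in changed_docs:
        points = [
            PointStruct(
                # Deterministic UUID per chunk: re-running the job overwrites
                # existing points instead of inserting duplicates.
                id=str(uuid.uuid5(uuid.NAMESPACE_URL, f"{doc['path']}#{i}")),
                vector=embed(chunk),
                payload={"source": doc["path"], "text": chunk},
            )
            for i, chunk in enumerate(doc["chunks"])
        ]
        client.upsert(collection_name=collection, points=points)


# Example wiring inside the CronJob container (URL is a placeholder):
# push_updates(fetch_changed_docs(), embedding_model.encode, QdrantClient(url="http://qdrant:6333"))
```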
With the addition of the SharePoint Loader, our architecture reached its final form and looked something like this:
Addition of Quantization Methods
Once we were satisfied with the look of our architecture and had no pressing architectural matters left to attend to, we decided it would be best to turn our attention inward and examine the particulars of our deployment, essentially digging deep to find areas for improvement.
One of the first places we looked was the parameters of the frameworks we use for serving and deploying our Mistral model: TGI and vLLM.
Apart from a few nifty features we found, we also noticed the constant mention of so-called quantization methods. We decided to dig further and discovered that these might be exactly what we were looking for.
A full list of parameters for each of the frameworks can be found on the following links:
- TGI — Text-generation-launcher arguments
- vLLM — Engine Arguments — vLLM
Quantization is a compression technique in which you map values from a higher precision to a lower one. The simplest example might be mapping a float with 32-bit precision to a float with 16-bit precision.
For LLMs, this simply means reducing the precision of their trained weights, making them less memory intensive. The obvious drawback is that the model loses some accuracy. That is the most common trade-off: choosing between the accuracy and the speed of the response.
The reason this speeds up inference is that the reduced memory footprint lowers the memory bandwidth needed and increases cache utilization. The reason it is usually acceptable in scenarios like the one presented here is that our problem is strictly a retrieval problem in a very closed environment, where the lost precision does not noticeably affect the quality of the LLM's responses.
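As a toy illustration of that precision trade-off (real schemes such as EETQ and AWQ are far more sophisticated), here is a simple per-tensor 8-bit absmax quantization of a handful of weights:

```python
import numpy as np

# Fake "weights": a small float32 tensor, as a stand-in for a layer's weights.
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(scale=0.02, size=8).astype(np.float32)

# Symmetric 8-bit quantization: scale by the largest absolute value, round to int8.
scale = np.abs(weights_fp32).max() / 127
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)
dequantized = weights_int8.astype(np.float32) * scale

print("fp32 bytes:", weights_fp32.nbytes, "-> int8 bytes:", weights_int8.nbytes)
print("max absolute rounding error:", np.abs(weights_fp32 - dequantized).max())
```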
That is quantization in a nutshell. As for the various methods that enable quantization, there are two major categories we need to mention:
- Post-Training Quantization (PTQ) — quantization which is performed on an LLM after training.
- Quantization-Aware Training (QAT) — quantization which is performed on an LLM during training.
For the purposes of our solution, we decided to test out a different quantization method for each of the frameworks. We landed on EETQ (Easy & Efficient Transformer Quantization) for the TGI framework, a post-training method that quantizes the original weights on the fly when the model is loaded, and on AWQ (Activation-aware Weight Quantization) for the vLLM framework, which is also applied after training but requires model weights that have already been quantized offline.
Both of these methods sped up inference roughly 2x in their respective frameworks.
There was no noticeable difference in the way each of these methods behaved. The only major difference was in how each of them gets applied to its framework (see the sketch below):
- For AWQ, a new image of the Mistral model had to be used, one in which the model is already quantized, because the quantization is performed offline, ahead of deployment. The framework's quantization parameter also had to be set to the matching method.
- For EETQ, no new model image was needed; we could keep the same Mistral image we were already using and only set the framework's quantization parameter, since the quantization is performed on the original weights while the deployment is being set up.
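In practice, the difference comes down to which weights you point the framework at, plus one configuration parameter. The sketch below shows the vLLM side using a community AWQ checkpoint of Mistral (the model name is illustrative); on the TGI side, the equivalent is launching the unmodified Mistral model with the launcher's `--quantize eetq` argument.

```python
from vllm import LLM

# Weights in this repository are already AWQ-quantized offline; the parameter
# below just tells vLLM how to load and run them. Model name is illustrative.
llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",
    quantization="awq",
)
```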
Conclusion and Next Steps
With the quantization methods set up, the last part of working on the solution was done, at least for now. We decided to draw the line here, since we believe the core solution presented is more than enough to cover the needs we started with. As for the roadmap, one of the more imminent plans on our side is to add more source connectors: not to oversaturate the code with too many, but to focus on a few key ones which we believe could add significant value to the pipeline. One of the best examples would be an Excel document connector. Apart from that, we have also thought about shifting our focus to enabling real-time data updates in the context database, which could lead to real-time data predictions.
We also have some less pressing features we would like to attend to, which should prove very useful to end users:
- The ability to generate a concrete document alongside the answer, e.g. a PDF file of the information returned for a request.
- User authentication, so each user’s history could be stored, similar to how ChatGPT does it.
In general, the whole experience of building this RAG-powered pipeline had a positive upskilling effect on the team involved, making each member much more knowledgeable about every aspect of the LLM landscape. Apart from that, the solution we built helped us a lot, not just in working with our knowledge base, but also in understanding the quality of the data stored in it. We're sure it could have the same effect on any other company out there. If you're interested in the proposed solution, would like us to build you a customized version, or have any other questions about it, be sure to contact us at info@syntio.net.
References
- https://cloud.google.com/vpc/docs/overview
- https://cloud.google.com/vpc/docs/vpc
- https://cloud.google.com/network-connectivity/docs/vpn/concepts/overview
- https://github.com/vllm-project/vllm
- https://blog.runpod.io/introduction-to-vllm-and-how-to-run-vllm-on-runpod-serverless/
- https://www.geeksforgeeks.org/paging-in-operating-system/
- https://www.geeksforgeeks.org/external-fragmentation-in-os/
- https://arxiv.org/abs/2309.06180
- https://support.microsoft.com/en-us/office/use-the-sharepoint-team-collaboration-site-template-75545757-36c3-46a7-beed-0aaa74f0401e
- https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher
- https://docs.vllm.ai/en/latest/models/engine_args.html
- https://medium.com/@techresearchspace/what-is-quantization-in-llm-01ba61968a51
- https://github.com/NetEase-FuXi/EETQ
- https://github.com/mit-han-lab/llm-awq
Originally published at https://www.syntio.net on October 23, 2024.