Deployment of RuterGPT — AI-lab at Ruter — Part 3.2

Theo Strand
Ruter — Produktutvikling og teknologi
10 min read · Jan 21, 2024

This article serves as a follow-up to our previous piece, “How we created RuterGPT”, in which we fine-tuned a large language model (LLM) for Norwegian. Here, we delve into the implementation and deployment of this model within our organization.

We’ll first explore the HR use case, followed by a detailed description of the deployment process.

HR use case with RAG

After fine-tuning the base model and evaluating different models, we selected the best one for the next task: developing an HR bot powered by an LLM. We tackled this by implementing Retrieval-Augmented Generation (RAG), which enabled the bot to handle HR-related queries effectively.

Figure 1 — RAG Flowchart

A common issue with LLMs is ‘hallucination’, where the model makes up information when trying to answer user queries. To prevent this, we connected the LLM to our private data source, enabling it to use real, external information provided in its prompts to answer questions accurately. This method, referred to as RAG, helps prevent the model from providing incorrect answers by grounding its responses in verifiable data, as illustrated in figure 1.

We implemented the HR bot in the following steps.

Loading the external data

The HR data at Ruter is hosted on an internal server; we scraped it and saved it as “.txt” documents. In addition, HR provided an FAQ document in “.docx” format. These documents were loaded using the Document Loaders provided by LangChain. The same document loaders are also available in Llama Index, and we saw no significant differences in the loaded content between the two libraries.
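A minimal sketch of this step is shown below; the paths and file names are assumptions, and depending on the LangChain version these loaders may live in langchain_community.document_loaders instead.

from langchain.document_loaders import DirectoryLoader, TextLoader, Docx2txtLoader

# Hypothetical paths: a folder of scraped HR pages and the FAQ document from HR.
txt_loader = DirectoryLoader("hr_pages/", glob="**/*.txt", loader_cls=TextLoader)
faq_loader = Docx2txtLoader("hr_faq.docx")

# Combine everything into a single list of LangChain Document objects.
documents = txt_loader.load() + faq_loader.load()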

Splitting the data

LLMs like ours have a limited context window, in our case 4096 tokens. To ensure relevant data fits within this window, the loaded data has to be divided into smaller segments, called chunks. Overloading the prompt with too much information can confuse the LLM, potentially leading to incorrect responses. Both LangChain and Llama Index offer various Text Splitters for this purpose. After experimenting, we selected the RecursiveCharacterTextSplitter from LangChain and set the chunk size to 512. The results obtained from the RAG system are highly sensitive to the chunk size. No universal chunk size will achieve the best result, as it depends on the type of content and the questions asked; evaluating the chunks retrieved by the RAG system for different chunk sizes helps identify the best one.
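A small sketch of this step, reusing the documents loaded above; the chunk overlap is an assumption.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded documents into overlapping chunks of about 512 characters.
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)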

Creating Embeddings

Finding the right embedding model was a major challenge for us. Most of the available embedding models work seamlessly with English; however, with few embedding models available for Norwegian, we had to rely on multilingual embedding models.

Using Llama Index Metrics for Retrieval Evaluation, we were able to evaluate the embedding models quantitatively. These metrics are as follows:

1. Hit Rate

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

2. Mean Reciprocal Rank (MRR)

For each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on. To illustrate the evaluation of embedding models, we compare two distinct models in table 1.

Table 1 — Embedding model comparison

In comparing two embedding models, it is evident that the ‘intfloat/multilingual-e5-large’ model outperforms ‘NbAiLab/nb-sbert-base’. A key observation from this comparison is that the MRR being lower than the hit rate suggests that the highest-ranking results are not always the most relevant. This discrepancy highlights the importance of evaluating models beyond just their top results to measure their overall effectiveness in finding relevant information.
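For reference, both metrics can be computed directly from the rank of the first relevant chunk for each evaluation query. The snippet below is a minimal sketch of that calculation, not the Llama Index implementation itself.

# `ranks` holds, for each query, the rank of the first relevant chunk
# among the retrieved results (None if it was not retrieved at all).
def hit_rate(ranks, k=5):
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Example: relevant chunk ranked 1st for query A, 3rd for query B, missing for query C.
ranks = [1, 3, None]
print(hit_rate(ranks))              # ~0.67
print(mean_reciprocal_rank(ranks))  # ~0.44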

Vector Store and Retrieval

Following the embedding model, we chose Chroma DB, an open-source database for managing and storing vector embeddings. While we are open to exploring other vector databases in the future, our current experience with Chroma DB has been satisfactory, showing no performance issues.

To find the relevant information based on a given query, we use similarity search in our retriever.
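A minimal sketch of this setup, assuming the chunks from the splitting step and the winning embedding model; the persist directory and the number of retrieved chunks are assumptions, and import paths vary between LangChain versions.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

# The e5 family generally expects "query: "/"passage: " prefixes for best results;
# omitted here to keep the sketch short.
embeddings = HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large")

# Build the Chroma vector store from the chunks and persist it locally.
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="chroma_hr")

# Similarity-search retriever returning the top-k most relevant chunks per query.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})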

We often encounter follow-up questions from users during interactions with the HR bot. Handling these requires maintaining a consistent context throughout the conversation, so we utilized ConversationBufferMemory to initialize and manage the chat history.

When the user sends in a new query, the model looks at the history and the query itself to create a standalone query. We implemented this using prompt templates adapted from the LangChain documentation. By retrieving documents relevant to the standalone question, we ensure that the generated response is based on both the conversation history and the retrieved documents, not only the user’s query. After each response, our model updates its memory, integrating the new interaction to enhance future responses.
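LangChain’s ConversationalRetrievalChain bundles exactly this pattern: condense the history and the new question into a standalone question, retrieve documents for it, then answer. A minimal sketch, assuming the llm (our fine-tuned model wrapped as a LangChain LLM) and the retriever from the sketches above; the example question is hypothetical.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

# Keep the full chat history in memory so follow-up questions stay in context.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

hr_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,              # the RuterGPT model exposed as a LangChain LLM
    retriever=retriever,  # the Chroma similarity-search retriever
    memory=memory,
)

result = hr_chain({"question": "Hvor mange feriedager har jeg rett på?"})
print(result["answer"])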

Generation

The retrieved chunks and the user’s question are given as input to the base model to generate the answer. To prevent the LLM from hallucinating, we took the following two measures:

1. The LLM temperature was set to 0 to make it deterministic, factually grounded and less creative.

2. A custom prompt to further prevent the LLM from making up incorrect information, ensuring it effectively incorporates the context provided in the input.

"""<s>[INST]
<<SYS>>
You are a helpful chatbot which answers HR questions at Ruter.
Given the following information and a question, provide a precise answer
to the question based on the information. Do not add extra information.
If you can't find the answer in the information below, say you don't know.
<</SYS>>
{context}
Human: {input} [/INST]
RuterGPT:"""

After following this procedure, the results were exceptional compared to the previous HR bot in use. Some of the main advantages observed were that changes in the documentation could be integrated into the bot almost instantly and that the answers were more precise and relevant. With the source documents added to the answers as links, the bot provided enhanced usability to the employees.

A further enhancement of this solution is to enable the user to attach confidential documents and query them. This would help employees find the information they need without having to read the documents in full.

Integration with Slack

At this point we had access to a model that provided satisfactory results in Norwegian casual conversation tasks. Even though the model was performing decently on our tests, we needed real users to start testing it so that we could get some real-world feedback. We knew that to do this effectively, we needed to make the app easy to use and very accessible. What better way to do this in a company setting than developing a Slackbot frontend for it?

The use case we are dealing with fits right into the conversational style of a typical Slackbot. In addition, all we needed to do to make the bot accessible was to add it to the company’s Slack workspace, and everybody would be able to use it. We also took advantage of the bot’s ability to be added to Slack channels, rendering it visible to a large number of users at once (e.g. the HR department), which meant that we could get a lot more valuable feedback for our use case.

Generally, the bot is responsible for receiving relevant message events from Slack and then running inference with RAG against the model, using the text in the message event as the prompt. The information flows through the application as shown in figure 2. With all these components in place, our model is ready to start predicting tokens.

Figure 2— Diagram showing how the information flows inside the Slackbot application.

Depending on the prompt sent, generation time can vary from a few tens of seconds to a few minutes. This is not an acceptable waiting time for any type of user-facing application. To mitigate this, LLMs can stream their output, which means the user can see results almost instantly and start reading. At the time of writing this article, Slack does not support any form of data streaming, so we had to be creative with our solution.

To simulate this feature, when the bot received a message it wanted to handle, we sent a placeholder message as a response to the user query. Once the information had flowed through the pipeline and the model was ready for inference, we streamed its response and iteratively updated the placeholder message that was sent at the beginning. This allowed results to be shown as soon as the first token was predicted, and the only bottleneck was the speed of our production environment GPU.
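A minimal sketch of this pattern using the slack_bolt library in Socket Mode; the tokens, the event type and the stream_answer helper (which would wrap the RAG pipeline and the streaming model call) are assumptions.

from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token="xoxb-...")  # bot token (placeholder)

@app.event("app_mention")
def handle_mention(event, client):
    channel, user_query = event["channel"], event["text"]

    # 1. Post a placeholder message immediately so the user gets instant feedback.
    placeholder = client.chat_postMessage(channel=channel, text="Tenker...")

    # 2. Stream tokens from the model and iteratively update the placeholder.
    answer = ""
    for token in stream_answer(user_query):  # hypothetical helper: RAG + streaming inference
        answer += token
        client.chat_update(channel=channel, ts=placeholder["ts"], text=answer)

if __name__ == "__main__":
    SocketModeHandler(app, "xapp-...").start()  # app-level token for Socket Mode

In practice, the updates are best batched every few tokens rather than on every single token, to stay within Slack’s rate limits.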

Deployment

After having all the necessary elements in place for running and using the application, we needed to set up a reliable way of serving it. This was no trivial task, as our system was not composed of just one application but of two separate ones that depended on each other at all times. Additionally, LLMs are notorious for requiring a large amount of resources, so scalability was also one of the main areas of consideration.

With all these requirements in mind, AWS was our cloud computing provider of choice. It offers products like Sagemaker and Elastic Container Service which are a perfect fit for our use case. Firstly, we decided to use the Endpoints feature in Sagemaker to host our model since it provides autoscaling and monitoring options out of the box. Secondly, we chose ECS to host our inference code and Slackbot server.

Endpoint

The responsibilities of the endpoint are simple: it only hosts the model and runs inference, with no additional functionality. This decision was made to keep the points of failure in this part of the application to a minimum.

Due to the immense number of calculations that a 13 billion parameter LLM has to perform, some optimizations need to be in place, especially before making it available to a user base, to ensure acceptable inference times. Dealing with optimizations for transformer-based ML models can become a technical rabbit hole if one is not careful, so to avoid that we decided to use one of Amazon’s prebuilt LLM containers for inference. After careful consideration, the container of choice was Amazon’s Hugging Face Text Generation Inference (TGI) container. This pre-built container includes features like the following:

Quantization: this allows us to load the model on smaller instances, which can reduce inference speed but also helps keep costs down.

Hugging Face Text Generation Inference: highly relevant for us since we store the model weights on Hugging Face, which significantly simplifies the model retrieval step.

Flash Attention: an optimization of the attention mechanism at the heart of the transformer architecture, which considerably increases the inference speed of the model.

Parallel GPU Inference: this means that, if we want to, we can run inference on the model at full precision by sharding it across multiple GPUs, which yields better and faster results.

After the prerequisites were ready, we deployed the endpoint on an ‘ml.g5.12xlarge’ instance (which offers 96 GB of GPU memory) running the model at full precision. During our testing, the inference speed of the model hosted on that instance reached up to 50 tokens per second, which is considerably faster than it would be on a smaller instance with quantization.
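A minimal sketch of such a deployment with the SageMaker Python SDK; the model repository, role and endpoint name are assumptions, and the TGI image version will differ depending on when you deploy.

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Pre-built Hugging Face TGI inference container (version is an assumption).
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.1.0")

model = HuggingFaceModel(
    image_uri=image_uri,
    role=role,
    env={
        "HF_MODEL_ID": "ruter-ai/rutergpt-13b",  # hypothetical Hugging Face repo
        "SM_NUM_GPUS": "4",                      # shard across the 4 GPUs of ml.g5.12xlarge
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    endpoint_name="rutergpt-endpoint",           # hypothetical endpoint name
)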

Initially, our user base is not very large, so our scaling strategy defaults to a single active instance at a time. That being said, the option to scale efficiently as the user base grows is there thanks to Sagemaker Endpoints.

Communication

As shown in figure 3, both parts of the system live inside a Virtual Private Cloud. By isolating the hosting environment, we made sure that we had full control over which services had access to the system, avoiding any unnecessary data leaks to the outside. Once the application stack was deployed, the communication flow was as follows:

Step 1: Once the Slackbot task was active, it established a socket connection with the Slack workspace, which reported all events related to the bot.

Step 2: When the right event happened, it triggered the Slackbot, which in turn sent an HTTPS request to the endpoint.

Step 3: As soon as the endpoint started predicting tokens, it streamed each predicted token back to the Slackbot (a minimal sketch of this streaming call is shown below).

Step 4: Once the Slackbot received the first token, it started updating the text response in the Slack workspace.

Figure 3 — Diagram showing the overall deployment architecture of the solution.
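The streaming call in step 3 can be sketched with the SageMaker runtime API, which supports response streaming for TGI-based endpoints; the endpoint name and payload details are assumptions, and the exact chunk format depends on the container version.

import json
import boto3

smr = boto3.client("sagemaker-runtime")

response = smr.invoke_endpoint_with_response_stream(
    EndpointName="rutergpt-endpoint",   # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({
        "inputs": "Hvor mange feriedager har jeg rett på?",
        "parameters": {"max_new_tokens": 512, "do_sample": False},  # deterministic decoding
        "stream": True,
    }),
)

# The response body is an event stream; each event carries a chunk of
# server-sent-event formatted token data from the TGI container.
for event in response["Body"]:
    chunk = event.get("PayloadPart", {}).get("Bytes", b"")
    print(chunk.decode("utf-8"), end="")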

Next steps

As the implementation of RuterGPT expands within the organization, we recognize the importance of further improving its capabilities. To achieve this, we are considering training the Mistral model with Norwegian data and exploring DPO tuning. Additionally, we are anticipating the release of Llama-3 and the possibility of fine-tuning it with our data sets.

This project has underscored the significance of localized LLMs for our company. Moving forward, we aim to extend the application of this technology across various domains, including customer service, data analytics, and document management, to further streamline our operations and services.

Acknowledgement

Our sincere thanks to Simen W. Tofteberg and the entire data science team at Ruter for their support and assistance with this project.

Summary

In “How we created RuterGPT — AI-lab at Ruter — Part 3” and “Deployment of RuterGPT — AI-lab at Ruter — Part 3.2”, a team of five students details their development of RuterGPT, a Norwegian-language Large Language Model (LLM), for Ruter AS, a Norwegian transport company. The project, which grew out of customer support and internal use cases, involved fine-tuning the existing Llama-2 model for Norwegian. It also included the development of an HR bot using RAG and its integration into Slack, emphasizing real-world testing within the company.

Written by Frencis Balla, Nikshubha Kumar, Maryam Lotfigolian, A. Theo Strand and Solveig H. Willoch

