How we created RuterGPT — AI-lab at Ruter — Part 3

This is part 3 of a series of articles about the AI-lab at Ruter. In the first iteration of the AI-lab, the students were tasked with finding ways to help the customer support team (read part 1 here!). For the next iteration, a new group of four students worked on predicting delays across Ruter’s huge transport network (part 2 can be found here). For the 2023 edition, five students were tasked with creating a local Large Language Model (LLM) for the Norwegian language.

This article explores the development of RuterGPT. It details how Llama-2 was fine-tuned to understand and interact in Norwegian, tailored to the specific needs of Ruter AS.

Photo: Ruter As / Redink, Thomas Haugersveen

In recent years, the technology of language models has seen significant progress. Many people, initially skeptical of new tools like ChatGPT, have found their expectations exceeded by models such as GPT-3.5. As familiarity with this technology increases, concerns about data privacy and ethical issues, including access to chat logs and potential corporate influence, have also arisen. Questions about how businesses should handle sharing their data with companies like OpenAI persist. Even though cloud services like AWS and Azure have made significant efforts to ensure data privacy, one can never be too cautious in this domain.

A few months after the release of ChatGPT, Ruter invited students specializing in AI, linguistics, and business to participate in Ruter’s AI lab. The program runs over a span of three months each summer. Given the company’s primary concern about data confidentiality if it were to use ChatGPT, this year’s emphasis was on developing an in-house LLM for several Proof of Concept (POC) tasks:

· An HR-bot to help employees with basic HR inquiries.

· Document Q&A for interacting with documents or documentation.

· Summarization and sentiment analysis.

· Insights from large amounts of text from customer feedback, app reviews, and historical documents.

During our preliminary research in early June 2023, we understood that training an LLM from scratch would be prohibitively expensive. To save on costs, we decided instead to fine-tune an existing open-source model to create RuterGPT.

At the beginning of our project, our research indicated that the Falcon model held a top position on the Open LLM Leaderboard, having been released as an open-source LLM under the Apache License, Version 2.0. This license made the Falcon model widely accessible for research and commercial purposes without fees. Although there were expectations about the upcoming releases of OpenLLaMA and Llama-2, the Falcon model was our main choice at the onset of the project. We therefore allocated our team to explore the Falcon model while also keeping track of progress on LLaMA for our tasks at Ruter’s AI lab.

Our initial research indicated robust community support and technological advancements centered around the LLaMA models. Coincidentally, OpenLLaMA was released a week after we started our project. Within that week, our investigation into the LLaMA model family yielded promising results, leading us to prioritize it. The strong community engagement and the quicker inference times of LLaMA-based models compared to Falcon were determining factors in our decision.

We decided to focus on OpenLLaMA because of its licensing terms, while also monitoring the progress of Llama-2. A key benefit of OpenLLaMA was the availability of its 3B-parameter model. Although this model was too small to produce high-quality fine-tuned models, it was an excellent starting point for experimenting with fine-tuning techniques without the need for advanced GPUs. This phase highlighted the critical role of data and the complexities involved in creating a unique Norwegian dataset for our project. We investigated leading models and their datasets, evaluating the feasibility of adapting any existing Norwegian datasets for our purposes.

Datasets

When seeking resources for Norwegian natural language processing (NLP) tasks, the Norwegian Language Bank (Språkbanken) is a prominent option, reflecting the National Library of Norway’s (Nasjonalbiblioteket) initiative to provide open-source datasets. Although these resources are ideal for pre-training an LLM, our focus on fine-tuning necessitated a broader search. We used datasets from NbAiLab together with our own generated dataset, comprising approximately 70,000 instruction pairs in total.
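As a rough sketch, pulling one of the NbAiLab instruction datasets from the Hugging Face Hub looks like the following. The dataset name is illustrative; the exact mix of datasets we combined is only described at a high level above.

```python
from datasets import load_dataset

# Illustrative dataset name: an NbAiLab instruction-tuning dataset on the
# Hugging Face Hub. The datasets we actually combined may differ.
dataset = load_dataset("NbAiLab/norwegian-alpaca", split="train")

print(dataset.num_rows)  # our combined data totalled ~70,000 instruction pairs
print(dataset[0])        # a single instruction/input/output record
```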

To improve our model with domain-specific knowledge, we incorporated approximately 500 handcrafted data points relevant to Ruter AS, focusing on Oslo and Viken. However, we acknowledge that the quantity of our custom data may be too limited for the model to learn much new information from it.

Part of our dataset was translated into Norwegian with machine translation. While recent advances in machine translation are impressive, such systems often lack contextual understanding, which can lead to errors in context and grammar, as illustrated in Figure 1. The figure shows how the words “stream” and “ran” are translated without contextual consideration, leading to nonsensical output.

Given our time constraints as a small team, we had to accept that these translation errors might negatively impact our model. Ideally, dataset refinement and correction would improve results; a feasible method for enhancing dataset quality could involve using a grammar correction tool. We also considered employing GPT-3.5 for translation, expecting better and more cost-effective outcomes. However, due to uncertainties with OpenAI’s licensing terms, we decided not to pursue this approach.

Figure 1 — Issues with machine translation

At a late stage of the project, we discovered Google’s Fleurs dataset, containing human-translated texts. We converted this data into the Alpaca format, contributing approximately 1,300 rows of high-quality translations to our dataset. Subsequently, we released the Fleurs-Alpaca-EN-NO and OpenOrcaNo-15k datasets on huggingface.co.
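For reference, an Alpaca-format record holds an instruction, an optional input, and the expected output. Below is a minimal sketch of wrapping a Fleurs sentence pair in this format; the instruction wording is our own illustration, not the exact prompt we used.

```python
def to_alpaca(english: str, norwegian: str) -> dict:
    # Wrap a human-translated sentence pair as an Alpaca-style record.
    # The instruction text is illustrative ("Translate the following
    # text from English to Norwegian.").
    return {
        "instruction": "Oversett følgende tekst fra engelsk til norsk.",
        "input": english,
        "output": norwegian,
    }

row = to_alpaca(
    "Public transport reduces congestion in large cities.",
    "Offentlig transport reduserer kø i store byer.",
)
```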

Tokenization

In an LLM, characters and words are encoded into a sequence of numerical tokens. For example, the phrase “public transport” might be tokenized into a sequence like [1, 5236, 8608]. Here, “1” could be a special token indicating the start of a sentence, “5236” might represent “public”, and “8608” “transport”. In this hypothetical example, the phrase takes three tokens. The specific token values depend on the model’s predefined tokenization rules.

When applying this to Norwegian, tokenization might differ due to linguistic differences and may not align perfectly with tokenization patterns developed primarily for English. For instance, if we tokenize “Offentlig transport” in Norwegian, we might get [1, 5947, 296, 3473, 8608]. In this scenario, the model hasn’t seen as much Norwegian during initial training, so it might split the word “Offentlig” into “Off” [5947], “ent” [296], and “lig” [3473], and then use the same token for “transport” [8608]. Hence, tokenization was a challenge we faced when training the model on Norwegian data.

While the accuracy of generated content remains largely unaffected, the model’s efficiency can be compromised. This is particularly due to the model’s unfamiliarity with Norwegian nuances from its initial training. It might utilize approximately 25–35% more tokens per sentence compared to English. This increased token usage translates to longer inference times, making the process less efficient than it could be.
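This overhead is easy to measure directly with the tokenizer. Below is a minimal sketch assuming access to the gated Llama-2 tokenizer on the Hugging Face Hub; the token IDs it produces will differ from the illustrative values above, and a single sentence pair only hints at the 25–35% average.

```python
from transformers import AutoTokenizer

# Assumes access to the gated meta-llama repository on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

english = "Public transport runs on a fixed schedule."
norwegian = "Offentlig transport går etter en fast ruteplan."

en_tokens = tokenizer(english)["input_ids"]
no_tokens = tokenizer(norwegian)["input_ids"]

print(len(en_tokens), len(no_tokens))
print(f"Token overhead: {len(no_tokens) / len(en_tokens) - 1:.0%}")
```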

Training

To keep training costs down, we used the LoRA algorithm with our own training script built on Hugging Face Transformers. The parameters used during training are shown in Table 1.

Table 1 — Training parameters

Utilizing an AWS SageMaker ml.g5.12xlarge instance equipped with 4x A10G GPUs, which provided us with a total of 96 GB of GPU RAM, we fine-tuned our 13B-parameter model for one epoch, taking 72 hours, using a combined dataset of Alpaca and OpenOrca. When fine-tuning the 7B-parameter model, we employed the same instance type but introduced two modifications: we incorporated the Fleurs dataset and extended the training to five epochs, to determine whether a longer training duration could enhance the model’s performance. The outcome of this training was two Norwegian-proficient models, RuterGPT-13b and RuterGPT-7b, both hosted on huggingface.co.

It is also worth noting that we chose to fine-tune only the q_proj and v_proj layers of the model. This decision was guided by our aim to primarily inject the Norwegian language into the model, rather than to provide it with much new information.
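As a sketch, restricting LoRA to those two projection layers looks like this with the peft library. The rank, alpha, and dropout values below are placeholders; our actual hyperparameters are the ones listed in Table 1.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")

lora_config = LoraConfig(
    r=16,                                 # placeholder rank; see Table 1
    lora_alpha=32,                        # placeholder scaling; see Table 1
    lora_dropout=0.05,                    # placeholder; see Table 1
    target_modules=["q_proj", "v_proj"],  # only the attention query/value projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights is trainable
```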

However, the challenge remains: how does one evaluate new language capabilities in an LLM? Most metrics used to evaluate models test the model’s knowledge, not specifically its grammar.

Evaluation

For monitoring the LLM’s progress in Norwegian, we decided to create our own evaluation schema. We used ROUGE scores, which compare the model’s output with a reference text. Additionally, we implemented a semantic scoring system to measure how similar in meaning the model’s output is to the human-written reference.
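A minimal sketch of such a scoring function, assuming the evaluate library for ROUGE and a multilingual sentence-transformers encoder for the semantic comparison (the specific embedding model here is an assumption, not necessarily the one we used):

```python
import evaluate
from sentence_transformers import SentenceTransformer, util

rouge = evaluate.load("rouge")
# Assumed embedding model: any multilingual encoder that handles
# Norwegian reasonably well could fill this role.
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def score_output(prediction: str, reference: str) -> dict:
    rouge_scores = rouge.compute(predictions=[prediction], references=[reference])
    embeddings = embedder.encode([prediction, reference], convert_to_tensor=True)
    semantic = util.cos_sim(embeddings[0], embeddings[1]).item()
    return {"rougeL": rouge_scores["rougeL"], "semantic": semantic}

print(score_output("Bussen er forsinket.", "Bussen kommer for sent."))
```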

While this evaluation method was by no means perfect, it provided us with valuable pointers and helped us determine which parameters needed adjustment to achieve the desired output. All tests were run against a set of 71 input-answer pairs not included in the training data. One can argue that this set is too small and limited, and for a general-purpose LLM we would agree, but since we were only teaching the model a new language, these tests gave us valuable insights into the model’s performance.

Given the notorious difficulty of evaluating LLMs solely through numerical scores, we emphasized extensive real-life testing to verify the model’s ability to write in Norwegian. The metrics served more as a confirmation that our efforts were quantifiable. As illustrated in Figure 2, there is little variation in the metrics across our fine-tuned models.

Figure 2 — Evaluation metrics

Next steps

The next step was to integrate RuterGPT across various domains within our organization. Among these implementations were an HR-bot, designed to answer HR-related questions using RAG (retrieval-augmented generation), and the replacement of Ruter’s existing NLP models with RuterGPT. In the next article, we will share insights into the HR-bot’s implementation and deployment process, further highlighting the potential and versatility of RuterGPT.

Acknowledgement

Our sincere thanks to Simen W. Tofteberg and the entire data science team at Ruter for their support and assistance with this project.

Summary

In “How we created RuterGPT — AI-lab at Ruter — Part 3,” a team of five students details the development of RuterGPT, a Norwegian-language Large Language Model (LLM) for Ruter AS, a Norwegian public transport company. The project, serving customer support and internal use cases, involved fine-tuning the existing Llama-2 model for Norwegian. It also included the development of an HR-bot using RAG and integration into Slack, with an emphasis on real-world testing within the company.

Written by Frencis Balla, Nikshubha Kumar, Maryam Lotfigolian, A. Theo Strand and Solveig H. Willoch

