Evaluating RAG capabilities of Small Language Models

Lazaro Hurtado
Data Science at Microsoft
11 min read · Jun 25, 2024

By Lazaro Hurtado and Wu Xue

Image generated by Image Creator from Microsoft Designer on Bing.com

Large Language Models (LLMs) are powerful Deep Learning models capable of generating high-quality responses for tasks like question answering, summarization, or sentiment analysis [1]. For these reasons LLMs are the backbone for copilot and chatbot applications such as GitHub Copilot, Microsoft Copilot, and ChatGPT. It should be noted, however, that these applications and other similar ones don’t rely only on the models’ intrinsic capabilities for providing top-notch user experiences; instead, they are enhanced by injecting relevant information into their context window to drive evidence-based results. This is the main idea behind using a retrieval augmented generation (RAG) system when creating these types of applications.

In this article we focus on Small Language Models (SLMs), which are based on the same architecture as LLMs but are smaller in parameter count. Our primary goal is to evaluate the performance of SLMs in answering user queries when given an augmented prompt, similar to what would be observed in a RAG system. The purpose behind our experimentation is to measure the feasibility of using an SLM, which is more compute-efficient and environmentally friendly, instead of an LLM in RAG or chatbot applications.

Throughout this article we refer to SLMs as the family of models that are near the 7-billion parameter count. We therefore consider the following models, as well as their fine-tuned variants, when running our experiment:

Table 1: Model family we used for our experiment

We also define RAG as the parameterization of a model’s input with the user’s query and some retrieved media, and in our case the media is a document. RAG-based applications also employ several techniques, such as chunking, vector embedding similarity search, and re-ranking, for better results. For our experiments we assume that this has all been done beforehand and that we are working with previously parameterized prompts.

Needle-In-A-Haystack

Description

The Needle-In-A-Haystack (NIAH) [2] evaluation test consists of inserting a piece of information (the needle) into a document (the haystack) and evaluating the model’s ability to recall the inserted information by asking a question that pertains to it. This makes it a great test bench for estimating the performance of a model in a RAG or memory system.

How NIAH works

NIAH works by creating a prompt based on a (context_length, document_depth) pair, where the document depth can range from 0 to 100 percent and the context length can go as high as the model’s max context length. The prompt is created by embedding a sentence (the needle) that is document_depth deep into some document (the haystack) while ensuring the overall token count does not surpass context_length. A retrieval question that pertains to the embedded sentence is a part of the final prompt. It is the language model’s goal to find the embedded sentence within the document to successfully answer the question. To evaluate the model’s response, we use GPT-4 1106-preview [3] and ask it to score how close the response is to the embedded sentence, which is the ground truth, on a scale of 1 to 10 with the following criteria:

Table 2: GPT-4 scoring criteria
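As a rough illustration of this scoring step, the snippet below shows how a response could be sent to GPT-4 for grading. This is our own minimal sketch using the OpenAI Python client, not the evaluator shipped with the NIAH repository, and the rubric string is an abbreviated stand-in for the criteria in Table 2.

# Minimal sketch of the GPT-4 scoring step (not the NIAH repository's evaluator).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "Score the answer against the reference on a scale of 1 to 10, where 1 means "
    "no relevant information and 10 means a complete and accurate answer. "
    "Reply with the number only."
)

def score_response(needle: str, model_response: str) -> str:
    result = client.chat.completions.create(
        model="gpt-4-1106-preview",
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Reference: {needle}\nAnswer: {model_response}"},
        ],
    )
    return result.choices[0].message.content  # e.g., "10"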

For example, suppose we have the following as our needle and haystack:

Needle: “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.”

Haystack: Microsoft_quarterly_earnings.txt

Question: “What is the best thing to do in San Francisco?”

Let’s also assume we will range from a context length of 500 to 1024 and a document depth of 0 to 100 percent, both with 35 bins using a linear spacing algorithm. From this linear spacing let’s consider the (700, 38) pair. We first select a portion of the haystack so that it has 700 − tokenCount(needle) − tokenCount(question) tokens, then embed the needle 38 percent of the way into the trimmed haystack. Finally, we append the question at the end and query the model with this final prompt. Once the model has generated a response, we use GPT-4 to evaluate it.
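To make the construction concrete, here is a minimal sketch of how such a prompt can be assembled. It is our own simplification of the procedure rather than the NIAH repository’s exact code, and it assumes any tokenizer (tiktoken here) is acceptable for budgeting tokens.

# Simplified sketch of NIAH prompt construction for a (context_length, depth) pair.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumption: used only for token budgeting

def build_niah_prompt(haystack: str, needle: str, question: str,
                      context_length: int, depth_percent: float) -> str:
    # Reserve room for the needle and the retrieval question.
    budget = context_length - len(enc.encode(needle)) - len(enc.encode(question))
    context_tokens = enc.encode(haystack)[:budget]

    # Insert the needle depth_percent of the way into the trimmed haystack.
    cut = int(len(context_tokens) * depth_percent / 100)
    context = enc.decode(context_tokens[:cut]) + needle + enc.decode(context_tokens[cut:])

    return (f"{context} Only answer the following question and nothing else. "
            f"Keep your response as concise as possible. {question}")

# The (700, 38) pair from the example above:
# prompt = build_niah_prompt(haystack_text, needle, question, 700, 38)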

Figure 1: GPT-4 128K performance on NIAH. Image reproduced from NIAH repository [2].

Setup

All the models in our experiment are freely accessible through HuggingFace, the only caveat being that the evaluator is GPT-4, which requires an OpenAI or Azure OpenAI API key. The generation configuration used for all models consists of the following:

{
    "temperature": 0.7,
    "do_sample": True,
    "repetition_penalty": 1.3,
    "max_new_tokens": 100,
    "use_cache": True
}
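As a rough sketch of how this configuration is applied, the snippet below loads a model with HuggingFace transformers and generates a response. The model name is illustrative and the prompt placeholder stands in for the NIAH prompt described earlier.

# Illustrative sketch of running one NIAH query with the generation configuration above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # any model from Table 1
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

generation_config = dict(temperature=0.7, do_sample=True, repetition_penalty=1.3,
                         max_new_tokens=100, use_cache=True)  # the configuration above

prompt = "..."  # the augmented NIAH prompt (haystack with needle inserted, plus the question)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, **generation_config)
response = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)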

Because Microsoft’s Phi-2 model is a base model that has not been fine-tuned, we decided to use the most-liked fine-tuned Phi-2 model on HuggingFace as of the time of our experiment. This led us to cognitivecomputations’ dolphin-2_6-phi-2 [4], which we refer to as Phi-2 fine-tuned from here on. We used the number of likes as our deciding metric because, unless it is biased by members of the cognitivecomputations community, it reflects satisfaction with the model better than the number of downloads does. (On the other hand, Microsoft’s Phi-3-mini model is available only as a fine-tuned variant.)

For each model, the minimum context length is 500 tokens while the maximum is set to 2048. Most models we are testing have a full context length larger than 2048, but we are not focused on pressure testing the models; we care only about their retrieval ability. Therefore, we considered 2048, the smallest maximum context length in our model family, sufficient for our tests.

Partitioning of the context and document length is done with a linear spacing algorithm into 35 bins, which is the default implementation in the NIAH repo.
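For reference, the grid can be reproduced with NumPy’s linspace; the snippet below is a sketch of the pairs we sweep over.

# Sketch of the (context_length, document_depth) grid: 35 linearly spaced bins each.
import numpy as np

context_lengths = np.linspace(500, 2048, num=35, dtype=int)  # tokens
document_depths = np.linspace(0, 100, num=35)                # percent
pairs = [(c, d) for c in context_lengths for d in document_depths]  # 35 x 35 runs per model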

All models were given the same prompt, which is shown below:

“{context} Only answer the following question and nothing else. Keep your response as concise as possible. {retrieval_question}”

where context represents the haystack with the needle already inserted and retrieval_question is the question we ask the model that pertains to the needle. When the model allows for a system prompt we use the following:

“You are a helpful AI bot that answers questions for a user. Keep your response short and direct.”
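The snippet below sketches how we would combine the two prompts for a chat-tuned model using transformers’ chat template. The model name is illustrative, and models without a system role simply omit the first message.

# Sketch of assembling the user and system prompts for a chat-tuned model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # illustrative

context = "..."  # haystack with the needle already inserted
retrieval_question = "What is the best thing to do in San Francisco?"

messages = [
    {"role": "system", "content": "You are a helpful AI bot that answers questions for a user. "
                                  "Keep your response short and direct."},
    {"role": "user", "content": f"{context} Only answer the following question and nothing else. "
                                f"Keep your response as concise as possible. {retrieval_question}"},
]

input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")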

Results

Our results are divided between base models and fine-tuned models.

Base models

We started by running our experiment on the base variant of each model. The base variant, also known as a base or foundation model, is the model right after pre-training but before any sort of fine-tuning or human alignment. In other words, the base variant has learned how to model the language it has been trained on but is weak at following instructions or answering questions. For this reason, we expected poor performance across all our base variants, which turned out to be the case except for Phi-2. (We also saw OpenELM 3B as a strong performer, but we do not dive into it here.)

Phi-2 is a 2.7-billion parameter model [5] trained on 1.4 trillion tokens that builds on the learnings of LIMA [6], which stresses the importance of high-quality and diverse data. This makes its performance impressive considering that it does better than larger models, as is the case with Llama2 7B, and models with more training tokens, such as Gemma 7B, which was trained on 6 trillion tokens [7].

Figure 2: Base variant performance of all models we used on NIAH. Image by the authors.

To verify Phi-2’s performance we re-ran the experiment two more times and took the average across the three runs. We can see that its performance, although noisier, still holds up. It is natural to wonder whether this is a direct result of Phi-2’s dataset or of how it was trained. Although we believe the high-quality dataset plays a valuable role, we also noticed that the authors of Phi-2 (which is Phi-1.5 scaled up) have mentioned that it is capable of following prompts formatted for instruction and question answering [8], which matters greatly given the nature of our experiment.

Figure 3: Phi-2 base variant’s performance on NIAH. Image by the authors.

Overall, it is clear that none of these base variant models should be used for a RAG or chatbot application. Even Phi-2’s performance is not consistent enough to generate accurate results and deliver a good user experience. Furthermore, Phi-2 shows effects of lost-in-the-middle [9], as seen in the region of red that is most prominent between 773 and 1365 tokens and only slightly more diluted between 1456 and 1866 tokens.

Fine-tuned models

The results for fine-tuned or human-aligned models are what we consider most valuable because these are the types of models commonly deployed and in use. As the results show, all models perform significantly better compared to their base variants. Each model has something worth discussing.

Figure 4: Fine-tuned variant performance of all models we used on NIAH. Image by the authors.

Phi-2’s performance is rather poor and riddled with red due to how it was fine-tuned. Every other model in our set underwent instruction tuning alone or instruction tuning with human alignment through Reinforcement Learning from Human Feedback (RLHF). In contrast, Phi-2 was fine-tuned using QLoRA [10], a parameter-efficient fine-tuning (PEFT) method. QLoRA first quantizes a model’s weights to 4-bit precision and then applies low-rank adapter (LoRA) matrices to certain modules; for this fine-tuned model they were applied to the key, query, and all linear modules. We believe that this choice of limited and quantized fine-tuning inhibited Phi-2 from obtaining a stronger result on this task. We also hypothesize that if Phi-2 had been fine-tuned using RLHF it would have performed much more strongly, as we can see is the case with Phi-3-mini, although this is beyond the scope of this article.
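To make the QLoRA setup more tangible, here is a minimal sketch of the kind of configuration described, using bitsandbytes 4-bit quantization and a PEFT LoRA adapter. The hyperparameters and target modules are illustrative and are not taken from the dolphin-2_6-phi-2 recipe.

# Illustrative QLoRA setup: 4-bit base weights plus trainable low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # quantize the frozen base weights to 4 bits
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", quantization_config=bnb_config)

lora_config = LoraConfig(
    r=16,                                 # illustrative adapter rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj"],  # illustrative; the article notes key, query, and linear modules
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # only the small adapter matrices are trained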

A more surprising outcome is the difference between Gemma 2B and Gemma 7B. Given that both models saw the same data and one is simply larger than the other, we assumed that the larger one would trump the smaller one, but that is not the case. Upon closer inspection we see that Gemma 2B is much more concise while the 7B variant is more creative, as we see below:

Example 1 (our favorite): According This Article And My Opinion As A Large Language Model With No Personal Bias Of Course… **Eating sandwiches while sitting peacefully amidst nature within Golden Gate park.**

Example 2: Sure here’S answers below According tothe content provided; **Best Thing To Do While Visiting SAn Fransisco:** Eat A Sandwich And Sit Down With Your Feet Up At Doyle park On Sunny Day

Both OpenELM models massively underperformed and were not reliable in any region. This is something that could be expected of OpenELM 1.1B because of its size, but not of OpenELM 3B. We also analyzed the responses from both models. For OpenELM 1.1B, it seems the model failed to answer the question and would deviate from doing so. This is supported by the fact that we saw plenty of responses where the model generated multiple-choice options, most likely because it saw that the last sentence is a question and has learned that multiple-choice options usually follow:

Example: a) Eat a sandwich and sit in Dolores Park on a sunny day.b) Work out at a gym c) Go for a run d) Start a startup 3) a) Eat a sandwich and sit in Dolores Park on a sunny day.Answer: b) Start a startupThe best thing to do in San Francisco is to start a startup because:It gives us the opportunity to apply our unique skills and talents to create products that others

For OpenELM 3B it is a bit different, and its shortcoming is due to the model’s inability to follow the prompt and give a concise answer. From these observations we decided to do some manual data manipulation and score a model’s response with a 10 if it at least included the phrase “Eat a sandwich and sit in Dolores Park”, which we define as lenient scoring.

Figure 5: OpenELM performance on NIAH with lenient scoring. Image by the authors.

We can also define strict scoring as requiring the response to include “Eat a sandwich and sit in Dolores Park on a sunny day”.

Figure 6: OpenELM performance on NIAH with strict manual data manipulation. Image by the authors.

Of course, this is not the model’s true performance because, as previously mentioned, low-scoring responses were due primarily to verbose or hallucinated details.
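For clarity, the lenient and strict rules can be expressed as a simple substring check. The sketch below is our own formulation, with case-insensitive matching as an added assumption to handle the erratic capitalization seen in some responses.

# Sketch of the manual lenient/strict re-scoring described above.
LENIENT_PHRASE = "Eat a sandwich and sit in Dolores Park"
STRICT_PHRASE = "Eat a sandwich and sit in Dolores Park on a sunny day"

def rescore(response: str, original_score: int, strict: bool = False) -> int:
    phrase = STRICT_PHRASE if strict else LENIENT_PHRASE
    # Override the GPT-4 score with a 10 when the key phrase appears in the response.
    return 10 if phrase.lower() in response.lower() else original_score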

Finally, Phi-3 shows strong performance at context lengths shorter than 864 tokens but deteriorates at anything longer. Most notably, Llama2 7B excels at instruction following for all (context_length, document_depth) pairs, which is not a surprise given that Meta used over 1 million binary comparisons for reward modeling in RLHF and approximately 30 thousand supervised fine-tuning annotations [11].

Future work

Our experiments have shown that SLMs are capable of handling prompts commonly seen in RAG or chatbot applications. Nonetheless, we have only scratched the surface of what is observed in those environments; for example, our experiment does not consider multi-step reasoning or knowledge aggregation. More robust and exhaustive experimentation is required before deciding to replace an LLM with an SLM.

We encourage the use of Nvidia’s recently announced RULER benchmark [12] for further experiments. RULER encompasses NIAH and includes other benchmarks such as multi-hop tracing, question-answering, and aggregation.

We noticed during our experimentation that the quality and quantity of the fine-tuning dataset can have a substantial impact on a model’s performance. This is an area we deem worth investigating, both to find the optimal number of total fine-tuning tokens and to compare different strategies, such as supervised fine-tuning, RLHF, and others.

Throughout the experiment we used a single simple prompt. We believe this leaves room for improvement by constructing a more elaborate prompt. One idea we came up with was finding a stronger prompt, where stronger means it improves the majority of responses that originally received a score of 5 or less. Because this article is long enough already, we did not pursue this idea, but we consider it worth exploring because it could entice the model to follow the prompt more closely and, as a consequence, produce a better response. For the curious reader we include the performance of all models after lenient and strict scoring:

Figure 7: Performance of all models on NIAH with lenient data manipulation. Image by the authors.
Figure 8: Performance of all models on NIAH with strict data manipulation. Image by the authors.

Conclusion

Our experiments evaluate the performance of several SLMs on the Needle-In-A-Haystack benchmark to gain a better understanding of how such models operate in a RAG or chatbot application. We observed strong performance from fine-tuned models such as Gemma 2B and Llama2 7B, while other models weren’t fit for the task. Overall, we believe this area is worth exploring due to the low resource requirements and environmental friendliness of SLMs at a time when so many applications powered by language models are being deployed.

References:

[1] Scaling Laws for Neural Language Models (arxiv.org)

[2] gkamradt/LLMTest_NeedleInAHaystack: Doing simple retrieval from LLM models at various context lengths to measure accuracy (github.com)

[3] Models — OpenAI API

[4] cognitivecomputations/dolphin-2_6-phi-2

[5] Phi-2: The surprising power of small language models — Microsoft Research

[6] LIMA: Less Is More for Alignment (arxiv.org)

[7] Gemma: Open Models Based on Gemini Research and Technology (arxiv.org)

[8] Textbooks Are All You Need II: phi-1.5 technical report (arxiv.org)

[9] Lost in the Middle: How Language Models Use Long Contexts (arxiv.org)

[10] QLORA: Efficient Finetuning of Quantized LLMs (arxiv.org)

[11] Llama 2: Open Foundation and Fine-Tuned Chat Models (arxiv.org)

[12] hsiehjackson/RULER: This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? (github.com)
