Evaluating LLM Performance in RAG Instruct Use Cases

Darren Oberst
10 min read · Oct 15, 2023


While there are many solid and widely used testing benchmarks for LLMs (see the Open LLM Leaderboard on HuggingFace for the best examples), there is a general shortage of RAG-optimized testing benchmarks for evaluating the effectiveness of LLM inference in retrieval augmented generation scenarios.

In our experience, RAG use cases are specialized uses of LLM inference with a slightly different set of requirements and, accordingly, they require a different evaluation framework:

Closed Context vs. Open Context — in RAG, generally, a context passage is passed to the LLM as part of the prompt, and the instruction to the LLM (explicit or implicit) is to read the passage, stick to the content found in the passage, and answer the question (or perform the instruction / analysis) based only on the included materials. The LLM is being evaluated on its ability to read analytically and critically, not its ability to encode knowledge implicitly during its training process. (A minimal prompt sketch follows this list.)

Domain — most RAG scenarios involve complex business, legal and financial materials of one kind or another — legal contracts, regulatory materials, financial releases, technical articles, invoices, or coverage of global events — that are densely packed with specific terms, names and numbers.

Lower Verbosity Preferred — while an interactive chatbot is generally viewed more favorably for the fluidity of its language (with a general bias towards verbosity creating a more positive “conversational” context), in RAG the bias is often the exact opposite: the LLM-based inference will be integrated into a larger workflow, in which “just the facts” and answering a question specifically are more highly valued. LLM output will also often need to be directly fact-checked and/or automatically classified (e.g., Yes/No, Above Threshold, Below Threshold, Multiple Choice, “Not Found”, etc.), so a short, less verbose answer like “Not Found” or “Yes” or “32” is often preferable to a longer description.

Hallucinations — Temperature — Creativity — one of the primary intended benefits of RAG is to minimize hallucination risk, which is correlated with temperature settings in generation, as well as with the preference for “creativity” and variance in the output. In contrast, RAG generations will generally run at lower temperature, produce shorter output, and should strive for consistency in answers rather than creativity.
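To make the closed-context requirement concrete, here is a minimal sketch of the kind of prompt construction involved. The instruction wording is illustrative only, not the exact template used in our tests.

```python
# Minimal sketch of a closed-context RAG prompt. The instruction wording is
# illustrative only -- it is not the exact template used in our tests.
def build_closed_context_prompt(context_passage: str, question: str) -> str:
    return (
        "Please read the following passage and answer the question using only "
        "information found in the passage. If the answer is not in the passage, "
        "respond with 'Not Found'.\n\n"
        f"Passage: {context_passage}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```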

Introducing RAG-Instruct Testing Dataset

To support this initiative, we have published the open source release of a “Basic RAG Testing” dataset with 100 samples with questions, context passages and ‘gold’ answers, which is available on HuggingFace in the llmware repository at: /rag_instruct_test_dataset_0.1.
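As a quick way to get started, here is a minimal sketch of pulling the test set with the HuggingFace `datasets` library. The exact repository id, split name and column names are assumptions on our part; inspect the loaded split to confirm the actual fields.

```python
# Minimal sketch: load the RAG-Instruct test set from HuggingFace.
# The repo id, split name and column names are assumptions -- inspect to confirm.
from datasets import load_dataset

ds = load_dataset("llmware/rag_instruct_test_dataset_0.1", split="train")

print(ds)      # number of rows and column names
print(ds[0])   # one sample: question, context passage, and gold answer
```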

We have designed this dataset as a “basic” level RAG performance test, and have been using it internally to evaluate a current open source experimental initiative involving the use of “little” LLMs fine-tuned for basic RAG scenarios. You will see this reflected in the models selected for initial evaluation, with a major focus on smaller models in the sub-7B parameter category. For more information on this initiative, please check out the llmware/bling series of models. Our focus with BLING (Best Little Instruction-following No-GPU) models is: how small can an LLM be and still be viable for RAG use cases, and can smaller, specialized, “laptop suitable” models play an important role in testing RAG workflows? And perhaps most importantly, what can we learn as best practices from fine-tuning 1–2B parameter models that can be brought to bear on 7B–20B parameter models?

We will be developing additional testing datasets that are at more advanced levels, but we wanted to start with a simple set of baseline tests for RAG.

One of the key variables we are looking to evaluate is context window, and for this initial basic test, we purposefully kept the context window for the passages very short, e.g., 100–500 tokens. We have also included relatively few Boolean, reasoning, selection and multiple-choice questions, and limited question construction to relatively straightforward question-answering, basic analysis, key-value extraction, and summarization.
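For reference, here is a quick, self-contained sketch for checking the token-length distribution of the context passages. It uses the GPT-2 tokenizer as a rough, model-agnostic proxy for token counts, and assumes the dataset exposes a "context" column.

```python
# Quick check of context-passage lengths (in tokens) for the test set.
# Assumes a "context" column; uses the GPT-2 tokenizer as a rough proxy.
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("llmware/rag_instruct_test_dataset_0.1", split="train")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

lengths = [len(tokenizer.encode(sample["context"])) for sample in ds]
print(f"min: {min(lengths)}  max: {max(lengths)}  "
      f"mean: {sum(lengths) / len(lengths):.0f} tokens")
```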

Representative questions are based on real use cases that we see from customers and partners:

  • What is the subtotal amount?
  • What are the names of the two parties?
  • Why did the intraday reversal occur?
  • How many jobs were predicted by economists?
  • What law governs Section 18(a)?
  • When can termination after a material breach occur?
  • Can the salary be decreased?
  • What is a list of the top financial highlights for the quarter?
  • Did 3rd quarter revenue increase over the previous year?
  • What is a list of the top 5 summary points?
  • What is a brief one-line description in 25 words or less?
  • What is a list of the three people who served on the COVID task force?
  • What is a headline in 10 words or less?
  • What is a list of the items being purchased?
  • When was the first paper on AI published?

Categories / Model Size

For this initial blog post, to illustrate the use of the test dataset, we evaluated ten different models using this test framework, spanning from 410M parameters up to 175B+, with four target categories:

Leading Proprietary API-based LLMs (50B+ parameters) — we included the latest OpenAI GPT-4, OpenAI GPT-3.5-Turbo and Anthropic Claude-Instant-V1 as leading representative examples in this category. We expected that this test would be rather easy for these models, especially with the short passage size and relatively straightforward questions. These models are in the “mega” LLM category with 50B+ parameters, require complex GPU topologies to train and run inference, and are designed to be all-encompassing models capable of handling not only RAG but any potential LLM use case.

Enterprise LLM (7B parameters) — this category is our primary commercial focus, and we are holding out most of this analysis for future blogs and releases. We believe that this is the sweet spot for most RAG scenarios. For this blog post, we have included only one model in this initial run: aib-read-gpt, a commercial (‘closed source’) model that we have been developing, which is currently in limited release and has been optimized specifically for RAG tasks. In a future evaluation, we will build a detailed set of tests comparing different open source 7B model bases fine-tuned for RAG. Generally, in contrast with the mega-LLM models, a 7B model can be served in a ‘private cloud’ inference server running on a single GPU.

Little LLM (1B–3B parameters) — we included 4 models that we have fine-tuned for RAG and published on HuggingFace in the BLING (Best Little Instruction-following No-GPU) model series. Each of the models tested below is in the 1–1.5B parameter range, and can run effectively on a CPU laptop out of the box (e.g., with no special quantization needed). (A local-inference sketch follows this list.)

Sub-1B — we provided test results from two smaller RAG-finetuned models, both built on leading open-source foundation models (Pythia and Cerebras), although note that neither of these fine-tuned RAG-instruct models has been released at this point.
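For the Little LLM category, here is a minimal sketch of running one of the BLING models locally on CPU with the HuggingFace `transformers` library. The “&lt;human&gt;: … &lt;bot&gt;:” prompt wrapper and the sample passage are our assumptions for illustration; check the model card for the exact recommended prompt format (and, depending on your transformers version, whether trust_remote_code is needed).

```python
# Minimal sketch: run a BLING model locally on CPU with transformers.
# The "<human>: ... <bot>:" prompt wrapper and sample passage are assumptions --
# check the model card for the exact recommended format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "llmware/bling-falcon-1b-0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

context = ("Services Vendor Inc. will provide 20 hours of consulting services "
           "at a rate of $200 per hour.")
question = "What is the hourly rate?"
prompt = f"<human>: {context}\n{question}\n<bot>:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Decode only the newly generated tokens (the answer), not the prompt
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(answer.strip())
```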

Evaluation Metrics

In the table below, for each of the models, we captured the following:

— Model Parameters

— Delivery Mode: open-source vs proprietary, and multi-GPU vs GPU vs CPU inference

— Output: Output Tokens and Output Tokens as % of Input

— Processing time for the 100 test questions

— Score — 0–100 on judged correct answers against the 100 questions in the test dataset

Note: the results were gathered using manual evaluations, with some subjective determination involved in whether to credit or reject a particular answer. Best efforts were applied to be consistent across the evaluations. We also did not apply any special model-specific instruction or hyper-parameter tunings. Since the test dataset is publicly available, we would encourage anyone interested to run their own tests and comparisons. We used a simple inference loop script built in llmware (which we will be posting separately at our main github repo — llmware-ai/llmware). Across multiple testing runs, results may have minor variances.
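For readers who want to reproduce something similar, below is a minimal sketch of the kind of inference-and-logging loop involved; it is not our exact llmware script. The `generate_fn` callable is any prompt-to-answer function (for example, a wrapper around the local model call sketched earlier), and the column names are assumptions about the dataset fields.

```python
# Minimal evaluation-loop sketch: generate an answer per sample, track output
# tokens vs. input tokens and total processing time, and save results for
# manual scoring. Column names ("context", "query", "answer") are assumptions.
import json
import time

def run_eval(ds, generate_fn, tokenizer, out_path="eval_results.json"):
    results, total_in, total_out = [], 0, 0
    start = time.time()
    for sample in ds:
        prompt = f"{sample['context']}\n{sample['query']}"
        answer = generate_fn(prompt)
        total_in += len(tokenizer.encode(prompt))
        total_out += len(tokenizer.encode(answer))
        results.append({"query": sample["query"], "answer": answer,
                        "gold": sample["answer"]})
    summary = {"processing_time_sec": round(time.time() - start, 1),
               "output_tokens": total_out,
               "output_as_pct_of_input": round(100 * total_out / total_in, 1)}
    with open(out_path, "w") as f:
        json.dump({"summary": summary, "results": results}, f, indent=2)
    return summary
```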

Key Take-aways:

OpenAI and Anthropic Models — the leading API-based models were essentially flawless on this basic RAG-Instruct test. (We will be rolling out more complex tests in the future, more targeted at these models, but on a basic test with relatively short passage contexts, it was easy work for gpt-4, gpt-3.5 and claude-instant.) Each of the three models tested draws only from the context passages, and gets essentially every answer correct. One notable observation is the verbosity of Claude in “based on this passage” mode. We debated whether to award “101” or “99” for its lengthy explanations and for showing its logic. Claude-Instant used more than 2X the tokens of the OpenAI models, and nearly 3X those of the aib-read-gpt model. We purposefully did not try to optimize with model-specific instructions, and gave essentially the same core instruction to both the OpenAI and Anthropic models. Also, we ran these tests at a ‘low traffic’ time early Saturday morning (NY), and did not have any issues with availability or performance.

aib-read-gpt — this is a model that AI Bloks currently has in beta, and it performs well on this instruct scenario. We feature this model as a representative example of a strong baseline 7B RAG-instruct model, which we know well since we have been involved in training it. Notably, the model is also very concise in its answers by virtue of its training, which focuses on answering the question directly.

BLING models — in contrast with the first set of models, all of these models are pulled directly from HuggingFace, carry Apache 2.0 licenses, and can run locally on a laptop. As a result, the processing times are not “apples-to-apples” comparisons, since the model inference is done on a local CPU (e.g., a Mac M1 laptop). (If we were to run these models on specialized hardware, with GPU acceleration, the times would be at least an order of magnitude faster.) We were especially impressed with the results of bling-falcon and bling-pythia-1.4b.

llmware/bling-falcon-1b-0.1 is the best performing of the 1–2B parameter models with a score of 77 accurate answers out of 100. In addition to nearly 80% accuracy, the model did not have any visible hallucinations. The wrong answers were generally understandable, and were akin to the kinds of ‘careless’ errors that a person might make: pulling the wrong number from a text (e.g., referring to Section 18A vs. Section 19, when both are included in the context), using the wrong party name (e.g., referring to the “sending party” rather than the “receiving party”), or failing to identify a “not found” case or to correctly answer a “Yes/No” question.

llmware/bling-pythia-1.4b-0.1 is the second-best performing with a score of 65 accurate answers, compared to a score of 59 for the 1b version. Similarly, there were no visible hallucinations, and the responses were generally sensible. Without any hardware optimization or special configuration, the Pythia models were also the fastest in running local inference.

Models in the 1–2B parameter range can reliably decode basic RAG instructions. With few exceptions, these little models could recognize the right form of an answer (e.g., provide a name, number, address, date, or summarization), and provide answers that were sensible. There were occasional repetitions (especially with lists and summaries), and some “non-responses”, both of which are more likely to be artifacts of over-fitting in the fine-tuning process than reflections of intrinsic limitations of models of this size. Summarizations were often more “net” and lacking in expressive description, but generally fact-based in terms of extracting key data points from the text (if not always complete).

Areas of Further Research. We have done limited training and testing of 1–2B parameter models on more complex instructions, such as multiple-choice, pattern recognition, and common-sense math and logic, but we do see signs that these more “complex” behaviors may be challenging to implement in these smaller models. Finally, we have purposefully kept context window sizes smaller, as we believe that these smaller models will have more challenges with larger context passages. In future blog posts and releases, we will explore both the diversity and complexity of instructions, and context window size, as areas of further evaluation.

Finetuning artifacts. While the performance of the 1b Falcon base appears to be the strongest so far among the series, we would discourage drawing any conclusions about the efficacy of the base models. As an example, we did not include any invoice samples in the fine-tuning training dataset, although there are invoice samples in the testing dataset. Consequently, if there is a variance in performance among the base models on the invoice test questions, it is likely due to whether invoice or invoice-like materials were included in the base pre-training. Furthermore, certain bases may have had pre-training hyper-parameters more similar to those used in finetuning. The only conclusion that should be fairly drawn is the efficacy of the combination reflected in the current bling model. We believe that with further fine-tuning experimentation and optimizations, comparable results could be achieved on any of the three high-quality foundational base models being used currently (Pythia, Falcon and Cerebras).

Models below 1B in parameters. While we believe that the testing results show that models in the 1–2B range merit further experimentation and research, the results for models below 1B were rather discouraging for use in RAG. In addition to overall scores in the 30s, and in contrast with models in the 1–2B range, there were notable extreme inaccuracies in these sub-1B models, suggesting that the models were not able to consistently encode the instruction and follow it. In all of our testing, we have struggled to get a decoder-based model with fewer than 1B parameters to *consistently* respond in a sensible manner to questions. Note the emphasis on *consistently* — among the 30+ correct answers, there was the occasional remarkably good answer.

Next Steps

We will be rolling out improvements to this dataset, as well as additional, more detailed datasets, and we will apply them to the new smaller LLMs that we roll out on HuggingFace from time to time.

If you are interested in this initiative, please check us out on GitHub at https://github.com/llmware-ai/llmware and on HuggingFace at https://huggingface.co/llmware. We believe that RAG-instruct in smaller, specialized LLMs is going to be a major use case. We welcome feedback and collaboration from the community.
