How to Evaluate LLMs for RAG?

Darren Oberst
6 min read · Nov 5, 2023

Introducing New RAG Instruct LLM Benchmark Performance Test

In mid-October, we released our first test datasets to evaluate LLM performance on RAG inferences. Based on feedback we have received, along with a detailed assessment of results across multiple model types and sizes, we have rolled out an enhanced version of this RAG performance benchmark test, which is available here: llmware/rag-instruct-benchmark-tester.

Even as RAG has emerged as the leading paradigm for “fact-based” enterprise LLM applications, we still see a lot of questions and confusion about the level of accuracy that can be expected, as well as the suitability of different LLMs for this role. Most LLM benchmarks are designed for research scientists (and can be a bit theoretical when compared with a specific use case) and fail to answer the most important question for an end user: what accuracy level should I expect in production, and will it be high enough to make this use case viable?

Another practical question that we often hear: how big a model is needed, and how do you assess the trade-off between cost and accuracy as model size grows for a particular use case?

Our objective was to create a standard framework to evaluate LLM accuracy on common retrieval-augmented generation queries, in which a question is answered on the basis of a context passage.

Here is a list of some of the representative questions on the test set:

— What are the payment terms?
— What was the improvement in operating income year-to-year?
— According to the CFO, what led to the increase in cloud revenue?
— Who owns the intellectual property?
— What is the notice period to terminate for convenience?
— How often will the Board review the salary?
— What section of the agreement defines the governing law?
— When will the next dividend be paid?
— Is the expected gross margin greater than 70%?
— What is the amount of expected non-GAAP operating expense?
— What did economists expect for the trade surplus amount?
— Is Barclay’s raising its price target on KHC? (Yes/No)
— Were third quarter sales more than 5 billion euros? (Yes/No)
— Why did the intraday reversal occur? (Complex)
— What is a list of the top 3 financial highlights for the quarter? (Summary)
— What is a summary of the CEO’s statement in 15 words or less? (Summary)
— What are the key terms of the invoice? (Summary)

Introducing RAG-Instruct-Benchmark-Tester

This dataset consists of 200 sample questions and context passages, with answers, and is oriented towards financial services, legal, and more complex business use cases: a question is asked of a business document, with the expectation that the LLM will read the text passage and provide a fact-based answer.
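
To get a feel for the data, the test set can be pulled directly from Hugging Face with the datasets library. The snippet below is a minimal sketch; the column names shown ("query", "context", "answer", "category") are illustrative, so please check the dataset card if they differ.

```python
# Minimal sketch: pull the benchmark test set from Hugging Face.
# Assumes the `datasets` library is installed (pip install datasets) and that
# the dataset exposes columns like "query", "context", "answer" and "category";
# check the dataset card if the column names differ.
from datasets import load_dataset

ds = load_dataset("llmware/rag-instruct-benchmark-tester", split="train")

print(f"{len(ds)} samples")          # expected: 200 questions with contexts and answers
sample = ds[0]
print(sample["query"])               # the question to ask
print(sample["context"][:300])       # the passage the answer must come from
print(sample["answer"])              # the gold answer used for scoring
```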

The test set is organized in several sections as follows:

Section 1- Core Q&A — 100 questions — this is the main component of the test, and is intended to provide a score between 0–100 for each model, based on the accuracy of answering the questions. For this main section, each question can be answered using the associated context passage. These first 100 questions provide a quick and easy way to give a comparative score for any LLM, and the short fact-based questions can be scored in a few minutes.

Section 2- Not Found Classification — 20 questions — for these questions, the answer is not contained in the associated passage. The objective is to assess whether the model correctly identifies “Not Found”, attempts to reconstruct an answer using information in the passage, or, in the worst case, “hallucinates” and draws on inaccurate information from outside the passage. These questions are grouped so that they can be evaluated separately: the importance of this skill varies by use case, and, depending on the model’s fine-tuning, outcomes on this test can range widely.

Section 3- Boolean — Yes/No — 20 questions — for each of these questions, the expectation is that the model will provide a “Yes” or “No” answer, potentially with explanatory and supporting information. This is a critical test in some use cases in which a “threshold” or “classification” question is being posed to the LLM to assess whether a certain condition is true. Similarly, this skill is held out separately as its applicability can vary depending upon the use case, and different models will have different levels of training on making these assessments.

Section 4- Math/Logic — 20 questions — while there are many efforts to push the limits of LLMs to perform complex math, solve equations, and handle other sophisticated algorithms, for this test we focused on “everyday math” and “common sense” inferences involving basic increments, decrements, sorting, ranking, and percentages applied to amounts and times. Without specialized fine-tuning, we find that math performance is very poor for most LLMs. The objective of this test is not to decide whether the LLM should be used to solve equations, but to check whether basic mathematical inferences can be made, and to enable quick relative comparisons among different models.

Section 5- Complex Q&A — 20 questions — this is a mix of specialized question types that probe a variety of complex Q&A skills, such as multiple-choice, advanced table reading, multi-part extraction across a passage, causal (“why”) questions, and location selections. These tests can be used individually or collectively. For our initial benchmark tests, we used these 20 questions to provide an overall rating between 1–5 of a model’s capability.

Section 6- Summarization — 20 questions — we include summarization tasks presented in several different formats, including longer summaries, ‘x-summaries’ (looking for a headline or 25 words or less), and specialized summaries of a particular topic, across multiple passage types. For our initial benchmark tests, we used a qualitative review and an overall rating between 1–5 to assess the model’s summarization capabilities.
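
Since each skill is reported separately, it is useful to filter the test set by section before running a model. Here is a rough sketch of how that grouping might look; the category labels in the code are illustrative placeholders, so print the unique values in the dataset to confirm the actual labels before filtering.

```python
# Minimal sketch: split the 200 questions into the sections described above.
# The category strings below are placeholders; inspect set(ds["category"]) to
# see the actual labels used in the dataset.
from collections import defaultdict
from datasets import load_dataset

ds = load_dataset("llmware/rag-instruct-benchmark-tester", split="train")

sections = defaultdict(list)
for row in ds:
    sections[row["category"]].append(row)

for name, rows in sections.items():
    print(f"{name}: {len(rows)} questions")

# e.g. run only the core Q&A section for a quick comparative score
core_qa = sections.get("core", [])
```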

We have started to benchmark all of our open source models using this framework — for a couple of good examples, please check out:

Best Performing 1.3B Instruct-following-LLM

Evaluated against the benchmark test: RAG-Instruct-Benchmark-Tester
Average of 2 test runs, with 1 point for a correct answer, 0.5 point for a partially correct or blank / “not found” answer, 0.0 points for an incorrect answer, and -1 point for a hallucination.

- Accuracy Score: 84.50 correct out of 100
- Not Found Classification: 20.0%
- Boolean: 66.25%
- Math/Logic: 9.4%
- Complex Questions (1–5): 1 (Low)
- Summarization Quality (1–5): 3 (Coherent, extractive)
- Hallucinations: No hallucinations observed in test runs.
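
For anyone reproducing these numbers, the rubric above maps to a small scoring helper along the following lines. This is a sketch rather than the exact script we used, and it assumes each answer has already been graded (by a human reviewer or a separate checking step) as correct, partial, blank / not found, incorrect, or hallucination.

```python
# Minimal sketch of the scoring rubric: 1.0 for a correct answer, 0.5 for a
# partially correct or blank / "not found" response, 0.0 for an incorrect
# answer, and -1.0 for a hallucination. Grades are assigned by a reviewer
# (or a separate checking step); this helper only aggregates them.

RUBRIC = {"correct": 1.0, "partial": 0.5, "blank_or_nf": 0.5,
          "incorrect": 0.0, "hallucination": -1.0}

def score_run(grades: list[str]) -> float:
    """Sum rubric points for one test run over the 100 core questions."""
    return sum(RUBRIC[g] for g in grades)

def average_runs(run1: list[str], run2: list[str]) -> float:
    """Average of two test runs, as reported in the results above."""
    return (score_run(run1) + score_run(run2)) / 2
```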

This 1.3B parameter model can run fast, free local inferences on a laptop, and in the context of “straightforward” extractive queries against business documents it achieves 80%+ accuracy. We even did a YouTube demo of a simplified contract analysis with this model, and saw even higher accuracy on simple “key-value” extraction questions. The model is also reasonably effective for Yes/No questions and summarizations, and we use it often for internal testing of a workflow. While the model has poor “not found” classification capability, it did not generate any hallucinations. Please also feel free to check out the actual test results, which we included in the files section of the repo linked above.

Best Performing 2.7B Instruct-following-LLM

Evaluated against the benchmark test: RAG-Instruct-Benchmark-Tester
Average of 2 test runs, with 1 point for a correct answer, 0.5 point for a partially correct or blank / “not found” answer, 0.0 points for an incorrect answer, and -1 point for a hallucination.

- Accuracy Score: 94.0 correct out of 100
- Not Found Classification: 67.5%
- Boolean: 77.5%
- Math/Logic: 29%
- Complex Questions (1–5): 3 (Average)
- Summarization Quality (1–5): 3 (Coherent, extractive)
- Hallucinations: No hallucinations observed in test runs.

This nearly 3 billion parameter model can run on a CPU-based laptop (we tested extensively on a Mac M1 with 16 GB of RAM with no issues), although the speed is slow; it also runs at blazing speed on a GPU. Notably, this open source model is 90%+ accurate on extractive questions, with good Yes/No and “not found” classification capability and decent summarizations. Note that the model has very limited math capability. On a positive note, we did not observe any hallucinations in any of the test runs.
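
For context, here is roughly how a small instruct-following model like this can be run locally with Hugging Face transformers over a single passage and question. The model id below is a placeholder (the actual BLING model names are listed on our Hugging Face page), and the "<human>: ... <bot>:" prompt wrapper should be confirmed against the specific model card.

```python
# Minimal sketch: run a small BLING-style model locally on one passage + question.
# The model id is a placeholder, not a real repo name; see the llmware Hugging
# Face page for the actual BLING models. The "<human>: ... <bot>:" wrapper is
# the prompt format we use with BLING; confirm against the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "llmware/bling-model-placeholder"   # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

device = "cuda" if torch.cuda.is_available() else "cpu"   # CPU works, GPU is faster
model.to(device)

context = "Total revenue for the third quarter was $5.2 billion, up 9% year over year."
question = "What was the total revenue in the third quarter?"

prompt = f"<human>: {context}\n{question}\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# decode only the newly generated tokens after the prompt
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())
```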

This is part of an ongoing series of blogs. In upcoming posts, we will review the results of different models, with more comprehensive analysis of the RAG-instruct capabilities of 1B / 3B / 7B open source models in particular. To stay updated, please check out our other blogs, or follow / subscribe to this channel.

For more information about BLING models, please check out: LLMWare Bling RAG Instruct Models.

For more information about llmware, please check out our main github repo at llmware-ai/llmware/.

Please also check out video tutorials at: youtube.com/@llmware
