Georgian Impact Blog

A blog focused on machine learning and artificial intelligence from the Georgian R&D team

Measurably Correct: Strategies for LLM Evaluation.

--

By Ben Wilde, Angeline Yasodhara and Rodrigo Ceballos Lentini

How do you know whether an answer to a question is ‘correct’? For anyone building an application that uses a Large Language Model (LLM), that is much more than a philosophical question. There is a real need to assess whether the generated output of an LLM application is correct. As LLMs are incorporated into more and more software applications, it is worth considering how to incorporate LLM testing into our engineering processes.

Model evaluation (or simply evaluation or ‘evals’) is the process of determining whether the output from an LLM is correct for a given input. Many readers may have already done an evaluation of a large language model by simply testing some prompts in ChatGPT and reviewing the results. Some might even have clicked on the thumbs up / down and in doing so will have provided OpenAI feedback for their model evaluation process.

For those coming from a more traditional software background, a useful analogy is to think of evaluations as the unit and integration tests of LLM applications. Given the non-deterministic nature and rapidly changing capabilities of LLMs, we think it is important to take a ‘test driven’ approach when building LLM-based software applications.

Try, then try again.

Evaluations start when developers begin evaluating the output of their first few prompts to see if they can get a reasonable answer from the model to which they have access. As they make changes to their prompts, developers continue to evaluate the results. Teams are doing an evaluation when determining whether or not a particular LLM (e.g. Llama3 vs. GPT-4 vs. Mistral) meets the particular requirements of the project.

Evaluations are also typically conducted to check for changes in output between variations of the same model (e.g. gpt-4-1106-preview vs. gpt-4-0613), changes to prompts and even changes to available context for RAG applications (e.g. adding new documents to an index). Teams also usually evaluate a model after fine tuning or after quantization.

That’s a lot of evaluation that needs doing and so development teams try to find ways to reduce the amount of manual evaluation required in each case.

Towards (more) scalable LLM evaluations.

The proposed approach outlined in Figure 1 below is based on our work with software companies. In this approach, we typically see teams make use of manual evaluation at the start of the LLM selection and prompt development process and again at the end. In the middle, teams tend to apply more scalable techniques that make it possible to scale testing out across more data, edge cases, different LLMs, different fine tuning strategies, different versions of the same LLM and so on, without requiring extensive manual evaluation.

Figure 1: Scaling LLM Evaluation

To illustrate this approach, consider an example scenario of extracting medical conditions and interventions mentioned in a short section of text. The first step is to try a simple prompt such as “extract conditions and interventions from the following text: …” Example input text and the desired output from the LLM is shown in Figure 2. Once we have adjusted our prompt as necessary to get the desired output, we may want to try it again on an additional sample or two.

Figure 2: Example Input Text and Desired Output (source)

Next we may consider how to curate a dataset we can use to evaluate model performance. Fortunately, for our medical text example there is a public dataset available from HuggingFace that provides input and output pairs. The plan will be to use this dataset across each of the LLMs that we evaluate to see which models appear able to provide the desired output given the input text.
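
As a sketch of this step, the snippet below loads such a dataset with the Hugging Face datasets library. The dataset name and column names are placeholders for whichever dataset you settle on, not a reference to a specific resource.

```python
# A minimal sketch of loading an input/output dataset from the Hugging Face Hub
# for the medical entity-extraction example. The dataset name below is a
# placeholder -- substitute the dataset you have selected.
from datasets import load_dataset

dataset = load_dataset("your-org/medical-conditions-interventions", split="test")  # hypothetical name

# Each record is assumed to contain the raw text plus reference annotations.
for example in dataset.select(range(3)):
    print(example["text"])           # assumed column name for the input passage
    print(example["conditions"])     # assumed column name for reference conditions
    print(example["interventions"])  # assumed column name for reference interventions
```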

After we have the prompt and dataset available for the evaluations, the next step is to decide on the metrics for comparing the performance of one LLM against another. For example, we might choose to use the ‘correctness’ of the entities extracted by the prompt from the text as our metric. In this scenario we are using a relatively simple way to measure success for each LLM, so we may not need to take the final step of manually reviewing the top-performing LLMs from the automated stage. In other cases, we may want to create a small set of additional prompt / answer pairs that are more complicated and that take into account edge cases or specific known requirements for a particular use case. However, for scenarios where there is no dataset available and no direct way to test correctness, e.g. chatbot or summarization use cases, scaling the evaluation may be more difficult and manual evaluation techniques may be required.
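
For the entity-extraction scenario, ‘correctness’ can be scored with a simple set comparison between the entities the LLM extracted and the reference entities in the dataset. The helper below is a minimal sketch of that idea.

```python
# Score the 'correctness' of extracted entities by comparing the set of
# entities returned by the LLM against the reference set, reporting
# precision, recall and F1. Entity strings are lower-cased for a lenient match.
def entity_scores(predicted: list[str], reference: list[str]) -> dict:
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(entity_scores(["diabetes", "metformin"], ["Diabetes", "Metformin", "insulin"]))
# precision = 1.0, recall ~= 0.67, f1 = 0.8
```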

Sourcing a (good) dataset.

When selecting the evaluation dataset, it may be prudent to include inputs that are representative of production data, unusual or edge cases, and high-value examples. The dataset should include examples of desired outputs or reference answers, although limited evaluation is still possible without these inputs.

There are several options for sourcing data, including open-source datasets as in our earlier example, historical data that you have access to (and permission to use!), manual creation of new data, and data generated using LLMs. Indeed, LLMs can be used to expand an existing dataset with new examples, for instance by generating synonyms, or to generate data through more sophisticated approaches such as ‘correct by construction’ and ‘resource overboard’.

Generating a dataset using LLMs

An alternative to sourcing existing datasets for LLM evaluation is to utilize LLMs themselves to generate evaluation data. Two approaches to generating such data are what we refer to as ‘Correct by Construction’ and ‘Resource Overboard’ (also known as ‘Knowledge Distillation’).

The idea behind ‘Correct by Construction’ is that a dataset can be designed in such a way that its correctness is ensured. Consider, for example, an LLM application that generates API calls in Python from a natural language prompt. The input to the LLM would be an instruction asking it to generate the API calls, and the output would be code that can then be executed to confirm that the calls are valid. We note that it may be difficult to know whether the right API call is being made for a particular prompt, because a correct call to the wrong API could be generated, or the wrong parameters could be generated for the right API endpoint (e.g. a call to a weather API that returns the results in English when German was requested in the prompt).

One approach to test that the right API calls are being generated is to work backwards. In this scenario, we can start from the right API calls and ask the LLM to generate natural language descriptions of those API calls. Then, when we want to evaluate the model, we can use the LLM-generated descriptions as inputs (prompts) and check to make sure that the API calls (the LLM output) are the same as the ones that we used to create the descriptions in the first place. This approach enables us to confirm that the model is (or is not) generating the appropriate API calls for each possible user request.
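
A minimal sketch of this ‘work backwards’ loop is shown below. The `describe_call` and `generate_call` functions are hypothetical stand-ins for your own LLM calls; in practice the comparison may also need to normalize parameter order and formatting rather than use exact string equality.

```python
# Start from known-correct API calls, ask an LLM to describe each call in
# natural language, then check that the application regenerates the original
# call from that description.
known_calls = [
    'get_weather(city="Berlin", units="metric", language="de")',
    'get_weather(city="Toronto", units="imperial", language="en")',
]

eval_pairs = []
for call in known_calls:
    description = describe_call(call)  # hypothetical LLM call: API call -> natural language prompt
    eval_pairs.append({"prompt": description, "expected_call": call})

# Later, during evaluation:
for pair in eval_pairs:
    generated = generate_call(pair["prompt"])  # hypothetical call to the LLM application under test
    assert generated == pair["expected_call"], f"Mismatch for: {pair['prompt']}"
```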

Another example of this type of approach is to evaluate factual question answering via Retrieval Augmented Generation (RAG). Taking chunks of a document that we will be including in our knowledge base, we can ask an LLM to generate relevant questions for each chunk of text. Then we can ask those questions of the RAG application and confirm that the correct chunks of text are returned. This extends the ‘correct by construction’ idea beyond LLM evaluation to checking the accuracy of the retrieval stage of the RAG application.
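
The same idea can be sketched for retrieval as follows, where `generate_question` and `retrieve` are hypothetical stand-ins for an LLM call and the retriever under test.

```python
# For each chunk in the knowledge base, have an LLM generate a question that
# the chunk answers, then check that retrieving with that question returns the
# source chunk within the top-k results.
def retrieval_hit_rate(chunks: list[str], top_k: int = 3) -> float:
    hits = 0
    for chunk in chunks:
        question = generate_question(chunk)          # hypothetical LLM call: chunk -> question it answers
        retrieved = retrieve(question, top_k=top_k)  # hypothetical retriever under test
        if chunk in retrieved:
            hits += 1
    return hits / len(chunks)
```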

A second approach to generating data is what we will call “resource overboard” where we can use a more capable LLM, one that is also typically more expensive, to generate desirable outputs for the evaluation dataset. That dataset can be used to test smaller, less capable models to find good candidates for a particular use case. This approach may be useful for chatbots, text summarization and analysis where a model such as GPT 4o or Llama3 405B could be used to generate evaluation data to test suitability of smaller models such as GPT 4o mini, Gemma or Llama3 8B. Note, this approach is similar to the ‘teacher / student’ model where a ‘teacher’ LLM is used to generate training data that is then used to finetune a ‘student’ LLM.
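
A rough sketch of this pattern might look like the following, where `call_model` is a hypothetical wrapper around whichever API or client you use and the model names are only examples.

```python
# Use a stronger (and pricier) model to produce reference outputs once, then
# reuse them to score cheaper candidate models.
inputs = ["Summarize the following text: ...", "Summarize the following text: ..."]  # your evaluation prompts

references = [call_model("gpt-4o", prompt) for prompt in inputs]       # 'teacher' outputs
candidates = [call_model("gpt-4o-mini", prompt) for prompt in inputs]  # model under test

# The (input, reference, candidate) triples can now be scored with ROUGE,
# semantic similarity or an LLM-as-Judge metric, as discussed below.
```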

Measure what matters

With the evaluation dataset in hand, it is time to think about how to quantify the quality of the responses back from the LLM-application. Here teams have a number of strategies available including:

  1. Traditional Natural Language Processing (NLP) metrics such as ROUGE and BLEU.
  2. Property-based metrics (e.g. number of tokens used).
  3. LLM-as-Judge, an emerging technique that uses a language model to grade output.

Traditional NLP Metrics

Examples include ROUGE, BLEU, and the Levenshtein distance, all of which help us measure the accuracy of generated text compared to reference text, but which may not capture semantic similarity.

ROUGE measures how many words from the reference text appear in the generated text. By contrast, BLEU measures how many words from the generated text appear in the reference text. Another more traditional NLP measure is ‘Levenshtein distance’, which measures the minimum number of single-character edits (insertions, deletions, or substitutions) required to change the generated text into the reference text. In each case, a reference text to measure against is needed in order to calculate the metric, and such a reference may not always be available.
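
To make the edit-distance idea concrete, here is a small, dependency-free Levenshtein implementation.

```python
# Minimal Levenshtein distance: the number of insertions, deletions and
# substitutions needed to turn one string into another, using a two-row
# dynamic-programming table.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("intervention", "interventions"))  # 1
```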

Each of these measurements captures exact similarity. That is, they check the words, the tokens and any character changes. They do not check whether two sentences that are only slightly different are actually saying the same thing. To achieve that we need to look at semantic similarity, and that is what the BERT-score can provide. It measures the similarity of two pieces of text using the cosine similarity of their contextual embeddings. It is worth noting that using semantic similarity as a measure of quality has limitations. For example, if you have two sentences, one that says “The stock price goes up” and the other that says “The stock price goes down”, then they are semantically very similar but their meanings are opposite.
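
One common way to approximate semantic similarity is to embed both texts and compare the embeddings with cosine similarity, as in the sketch below. The sentence-transformers model name is just one common choice, and BERT-score applies the same principle at the token level.

```python
# Measure semantic (rather than exact) similarity via sentence embeddings and
# cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model choice
embeddings = model.encode(["The stock price goes up", "The stock price goes down"])
print(util.cos_sim(embeddings[0], embeddings[1]))  # high score despite opposite meanings
```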

Property-Based Metrics

The idea with property-based metrics is that they are simple to measure, giving a quick way to test if the LLM is producing the expected ‘shape’ of output.

Examples include measuring the total length (in terms of tokens or words) of the output, giving an indication of whether the LLM is rambling (too long) or has truncated the answer. Average sentence length could also be measured to make sure the LLM is generating output that matches writing style expectations.

Vocabulary size is another property-based metric where the total number of unique words in a text are measured. For example, if a piece of text contains 100 unique words it has a vocabulary size of 100 words. A larger vocabulary size generally indicates a richer or more varied use of words, but it doesn’t provide insight into how these words are distributed throughout the text.

Lexical diversity can also be used to measure the variety of word usage within a text, reflecting the balance between unique and repeated words. It considers both vocabulary size and word repetition, so two texts could have the same vocabulary size but different lexical diversity depending on how often each word is repeated.

Finally, where the output is structured, such as JSON, testing for valid JSON, inspecting the number of values and so on can quickly provide indications of the correctness of the output.
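
These property-based checks are straightforward to implement directly in Python, for example:

```python
# A few property-based checks of the kind described above: output length,
# vocabulary size, lexical diversity and (for structured output) JSON validity.
import json

def property_metrics(text: str) -> dict:
    words = text.split()
    unique_words = {w.lower() for w in words}
    return {
        "word_count": len(words),
        "vocabulary_size": len(unique_words),
        "lexical_diversity": len(unique_words) / len(words) if words else 0.0,
    }

def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(property_metrics("the cat sat on the mat"))
print(is_valid_json('{"conditions": ["diabetes"], "interventions": ["metformin"]}'))
```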

LLM as Judge: LLMs Evaluating LLMs

Similar to the ‘Resource Overboard’ approach to generating evaluation data, the concept of LLM-as-Judge uses a more capable (and more expensive) LLM to evaluate the output of another, typically smaller and cheaper, LLM. This approach allows for the evaluation of open-ended responses because it doesn’t necessarily require reference answers ahead of time, especially when using a more capable model. For example, some models (e.g. GPT-4) can operate without reference answers / examples, while others (e.g. GPT-3.5), which may be cheaper and less performant, can perform similar evaluations if provided with examples [ref].

The basic approach is that there is the LLM-application that needs to be evaluated and then a second evaluator LLM. The second LLM takes in three things: the user prompt, the response from the LLM being evaluated, and an evaluation prompt. The evaluator LLM then generates an evaluation score.

This approach can approximate human evaluation with LLMs as long as we have the right prompt. In our own work, we have seen it used for measuring creativity, helpfulness, readability and groundedness (testing whether the information in the answer is based on a given context / the user input).

Figure 3: LLM as evaluator.
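
As a minimal sketch of the flow in Figure 3, the example below grades an answer for groundedness. `call_judge_model` is a hypothetical wrapper around the judge LLM’s API, and the rubric is only illustrative.

```python
# LLM-as-Judge sketch: the judge model receives the user question, the context,
# the answer from the model under test and an evaluation prompt, and returns a score.
JUDGE_PROMPT = """You are grading an answer for groundedness.
User question: {question}
Provided context: {context}
Model answer: {answer}
Score from 1 (not grounded in the context) to 5 (fully grounded).
Respond with the number only."""

def judge_groundedness(question: str, context: str, answer: str) -> int:
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    raw = call_judge_model(prompt)  # hypothetical call to the stronger judge LLM
    return int(raw.strip())
```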

Implementing Evaluations.

In our view, it is unlikely that there will be ‘one metric’ that makes sense for any particular use case, and teams will likely end up using a mix of metrics. For example, for measuring the summarization performance of an LLM, a team might use both property-based metrics (the length of the output text is less than that of the input text) and the popular ROUGE. For Retrieval Augmented Generation use cases, the test might first check the LLM output length to see whether the LLM has ‘rambled’ before applying an LLM-as-Judge metric such as RAGAS.

In each case, the benefit is the use of an easy to calculate heuristic as a ‘sanity check’ before applying the more computationally expensive metric.
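
A sketch of that pattern for summarization might look like the following, where `compute_rouge` is a hypothetical helper (for example, built on HuggingFace’s Evaluate library, shown below).

```python
# 'Cheap heuristic first': only run the more expensive metric when the output
# passes a simple length sanity check.
def evaluate_summary(source: str, summary: str, reference: str) -> dict:
    result = {"length_ok": len(summary.split()) < len(source.split())}
    if result["length_ok"]:
        # Only pay for the heavier metric when the cheap check passes.
        result["rouge"] = compute_rouge(prediction=summary, reference=reference)  # hypothetical helper
    return result
```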

In terms of implementing simple property-based metrics such as word or token counts, those metrics can be calculated directly in Python. For traditional NLP evaluation metrics, HuggingFace’s Evaluate library [link] provides support for metrics such as ROUGE and BLEU, as well as access to pre-trained models such as those that detect toxicity.
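
For example, computing ROUGE and BLEU with the Evaluate library looks roughly like this (the example texts are made up):

```python
# Compute ROUGE and BLEU with Hugging Face's Evaluate library.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["the patient was treated with metformin"]
references = ["the patient received metformin treatment"]

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
```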

For “LLM-as-Judge” evaluations, there are a number of open source options to consider, including RAGAS [link], DeepEval [link] (which includes RAGAS and 13 other metrics), and Uptrain [link]. Some evaluation frameworks, such as Promptfoo, span both LLM-as-Judge and more deterministic metrics, including whether or not the LLM output contains certain words [ref].

There are also a growing number of platforms emerging to tackle the challenge of end-to-end LLM evaluations including Langsmith, Langfuse, Galileo and Deep Checks.

One example is Galileo’s Evaluation Intelligence Platform for AI. Galileo’s platform is intended to help teams evaluate model performance, detect hallucinations and optimize prompts, with the goal of improving production reliability [ref]. Modules include experimentation tools, real-time monitoring, and AI protection. More information is available at galileo.ai.

CleanLab is another platform; it provides developers with an API, as part of its CleanLab Studio, that wraps LLM interactions with its “Trusted Language Model” (TLM) [ref] in order to provide a confidence score for any LLM response [ref], which can be used to automate the screening of erroneous responses.

For developers using LLM-application development frameworks such as LangChain, LlamaIndex, DSPy and Haystack, there are a range of options for incorporating evaluations. These include built-in evaluation capabilities, the ability to write custom evaluations, and, in most cases, integrations with third-party evaluation frameworks such as DeepEval [ref] [ref] [ref].

In our view, the market is early, and teams should stay close to developments over the near term, including evaluating multiple platforms to see which one fits a particular use case.

LLM-as-Judge considerations

While the LLM-as-Judge approach shows promise, it is an active area of research and development. Potential issues include bias in the judge model due to gaps in its knowledge. For example, if the LLM acting as judge has not been sufficiently exposed to domain-specific knowledge, it might incorrectly penalize a fine-tuned model’s outputs. Further, the judge LLM might have its own inherent biases or preferences based on its training data, making it less sensitive to the strengths (and weaknesses) of the specialized model.

Conclusion

As teams move from a purely experimental approach to one where LLM applications are heading towards production deployment, we think it is important to take an evaluation / test-driven approach to LLM application development. Once in production, evaluations should continue on an ongoing basis: with each prompt or model change, LLM performance should be re-evaluated, and user adoption and feedback should be closely monitored to improve the system. As with many aspects of generative AI, given the pace of change in the field at the time of writing, we think teams should take a ‘crawl / walk / run’ approach. This helps teams learn quickly, stay close to new developments and switch evaluation tools and techniques as necessary.
