All about LLM Evals

Christmas Carol
9 min read · Mar 25, 2024


Over the past year, I’ve been engaged in building applications powered by large language models (LLMs), in addition to having extensive conversations with some bright minds at several leading LLM/AI companies. One common pain point I’ve identified through these discussions is the lack of easily pluggable evaluations for LLM systems, at both the model level and the application level. Specifically, there’s often confusion about the vast array of LLM evaluation benchmarks, and about when to use machine feedback, human feedback, or a combination of both.

Here, I aim to share a selection of the most useful readings I’ve encountered regarding LLM Evals.

Regarding my background: I have been a PM/partial founder focused on developing AI/ML-powered applications, with previous experience at AWS AI.

What is Evaluation?

Evaluation, often shortened to ‘Evals’, is the systematic assessment and measurement of the performance of LLMs and their applications. Think of evaluations as a series of tests and metrics meticulously crafted to judge the “production-readiness” of your application.

Evals are crucial instruments that offer deep insights into how your app interacts with user inputs and real-world data. Robust evaluation of your application means ensuring that it not only adheres to technical specifications but also resonates with user expectations and proves its worth in practical scenarios.

What makes a good eval?

A good evaluation:

  1. Covers the most important outcomes of your LLM application
  2. Uses a small number of metrics, preferably interpretable ones
  3. Is fast and automatic to compute
  4. Is tested on a diverse and representative dataset
  5. Correlates highly with human judgment (a quick check of this is sketched below)

Better data, better metrics -> Better Evals — Source
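On point 5, a quick sanity check for any candidate metric is to score a handful of outputs with both the metric and human annotators, then measure the rank correlation between the two. Below is a minimal sketch using SciPy; the scores are made-up illustrative numbers, not real data.

```python
# Minimal sketch: check how well an automated metric tracks human judgment.
# The scores below are made-up illustrative numbers.
from scipy.stats import spearmanr

# Hypothetical scores for the same six outputs, from an automated eval (0-1)
# and from human annotators (1-5).
metric_scores = [0.92, 0.40, 0.75, 0.31, 0.88, 0.55]
human_scores = [5, 2, 4, 1, 5, 3]

corr, p_value = spearmanr(metric_scores, human_scores)
print(f"Spearman correlation with human judgment: {corr:.2f} (p={p_value:.3f})")
```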

Traditional Evaluation Metrics

In NLP, traditional metrics have played a pivotal role in shaping our understanding of language models and their capabilities. From precision and recall to BLEU and ROUGE scores, these metrics have offered a way to quantitatively assess the performance of various models and algorithms. They have been crucial in benchmarking progress, comparing different approaches, and setting the direction for future research and development.

Timeline of the introduction of NLP metrics and their original application — Source

However, as the complexity of language models, especially LLMs, continues to grow, the limitations of traditional metrics become increasingly apparent. This shift calls for re-evaluating how we measure success and effectiveness in NLP, leading to the exploration of more refined metrics that can keep pace with the advancements in the field.

Limitations of Traditional Metrics

Take, for instance, the BLEU (Bilingual Evaluation Understudy) score, a common metric used in machine translation. BLEU evaluates the quality of translated text by comparing it with a set of high-quality reference translations. However, its focus is predominantly on the precision of word matches, often overlooking the context and semantics.

As a result, a translation can score high on BLEU because its words match the reference in a technically correct order, yet still miss the mark in conveying the tone, style, or even the meaning intended in the original text.

Text similarity measured with BLEU drops drastically when different words with similar meanings are used — Source
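To make this limitation concrete, here is a minimal sketch using NLTK's sentence-level BLEU: a paraphrase that swaps in synonyms scores far lower than the exact wording, even though the meaning is preserved. The example sentences are my own illustration.

```python
# Minimal sketch: BLEU penalizes paraphrases even when meaning is preserved.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "sat", "on", "the", "mat"]]
exact = ["the", "cat", "sat", "on", "the", "mat"]
paraphrase = ["a", "feline", "rested", "on", "the", "rug"]  # same meaning, different words

smooth = SmoothingFunction().method1  # avoid zero scores when higher n-grams have no matches
print(sentence_bleu(reference, exact, smoothing_function=smooth))       # ~1.0
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # much lower, despite identical meaning
```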

The Evolution of Evaluation: The Rise of LLM-Assisted Evals ⚖️

Using language models to evaluate language models marked a significant shift in evaluation methods, reflecting a broader trend: the tools we develop to understand human language are also becoming the benchmark for evaluating themselves. As these language models advanced, the metrics evolved beyond traditional and non-traditional scores to what we can now refer to as LLM-assisted evals.

Source — All about evaluating LLMs

In the current era of modern LLMs, the same principle applies but on a more sophisticated scale. Researchers are now employing LLMs like GPT-4 to evaluate the outputs of similar models. This recursive use of LLMs for evaluation underscores the continuous cycle of improvement and refinement in the field. By using LLMs as both the subject and the tool of evaluation, we unlock a deeper level of introspection and optimization.

Some of the most impactful papers that have popularized this approach include:

  • GPTScore: A novel evaluation framework that leverages the zero-shot capabilities of generative pre-trained models for scoring text. Highlights the framework’s flexibility in evaluating various text generation tasks without the need for extensive training or manual annotation.
  • LLM-Eval: A method that evaluates multiple dimensions of conversation quality using a single LLM prompt. Offers a versatile and robust solution, showing a high correlation with human judgments across diverse datasets.
  • LLM-as-a-judge: Explores using LLMs as a surrogate for human evaluation, tapping into the model’s alignment with human preferences. Demonstrates that LLM judges like GPT-4 can achieve an agreement rate exceeding 80% with human evaluations, suggesting a scalable and effective method for approximating human judgments (a minimal judging sketch follows below).

Human and GPT-4 judges can reach above 80% agreement on correctness and readability scores; if agreement is defined as a score difference of at most 1, the agreement level can reach above 95% — Source
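As a concrete illustration of the judging pattern these papers describe, below is a minimal LLM-as-a-judge sketch using the OpenAI Python client. The prompt wording, model name, and 1-5 scale are illustrative assumptions, not the exact setups from the papers.

```python
# Minimal LLM-as-a-judge sketch: grade a candidate answer to a question.
# The model name, prompt, and 1-5 scale are illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the answer to the question
on a scale of 1 (poor) to 5 (excellent) for correctness and helpfulness.
Reply with only the integer rating.

Question: {question}
Answer: {answer}"""

def judge(question: str, answer: str, model: str = "gpt-4o") -> int:
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce the stochasticity discussed below
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

# Example usage:
# score = judge("What is the capital of France?", "Paris is the capital of France.")
```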

Limitations of LLM-Assisted Evaluations

While LLM-assisted evaluations represent a significant leap in NLP, they are not without their drawbacks. Recognizing these limitations is key to ensuring accurate and meaningful assessments.

  • Application-Specific: One major constraint is that LLM-driven evaluators produce application-specific metrics. A numeric score given by an LLM in one context does not necessarily equate to the same value in another, hindering the standardization of metrics across diverse projects.
  • Position Bias: According to a study, LLM evaluators often show a position bias, favoring the first result when comparing two outcomes. This can skew evaluations in favor of responses that appear earlier, regardless of their actual quality.
  • Verbosity Bias: LLMs also tend to prefer longer responses. This verbosity bias means that longer, potentially less clear answers may be favored over concise and direct ones.
  • Self-Affinity Bias: LLMs may exhibit a preference for answers generated by other LLMs over human-authored text, potentially leading to a skewed evaluation favoring machine-generated content.
  • Stochastic Nature: The inherent fuzziness within LLMs means they might assign different scores to the same output when invoked separately, adding an element of unpredictability to the evaluation.

To mitigate these biases and improve the reliability of LLM evaluations, several strategies can be employed:

  • Position Swapping: To counteract position bias, run each pairwise comparison twice with the order of the two responses swapped, and only accept a verdict that is consistent across both orderings (a sketch follows this list).
  • Few-shot Prompting: Introducing a few examples or prompts into the evaluation task can calibrate the evaluator and reduce biases like verbosity bias.
  • Hybrid Evaluation: To achieve a more grounded evaluation, integrating LLM-based assessments with human judgment or advanced non-traditional metrics can be highly effective. This combined approach offers a comprehensive assessment framework that balances the innovative capabilities of LLMs with the proven accuracy of non-traditional metrics.
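Here is the position-swapping idea sketched as code: the judge is called twice with the two answers in opposite positions, and a win only counts if both orderings agree. The pairwise prompt, model name, and tie-breaking rule are my own illustrative choices.

```python
# Minimal sketch of position swapping: run the pairwise comparison twice with
# the answer order flipped, and only keep a verdict both orderings agree on.
# Prompt wording and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

PAIRWISE_PROMPT = """Which answer to the question is better, A or B?
Reply with exactly one character: A, B, or T for a tie.

Question: {question}
Answer A: {a}
Answer B: {b}"""

def compare_once(question: str, a: str, b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": PAIRWISE_PROMPT.format(question=question, a=a, b=b)}],
    )
    return response.choices[0].message.content.strip().upper()[:1]

def compare_with_position_swap(question: str, ans1: str, ans2: str) -> str:
    first = compare_once(question, ans1, ans2)   # ans1 shown as A
    second = compare_once(question, ans2, ans1)  # positions swapped: ans1 shown as B
    if first == "A" and second == "B":
        return "ans1"  # ans1 wins in both orderings
    if first == "B" and second == "A":
        return "ans2"  # ans2 wins in both orderings
    return "tie"       # inconsistent or tied verdicts are treated as a tie
```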

From Theory to Practice: Evaluating Your LLM Application 🔍

Here’s a broad categorization of LLM applications, each with its unique context:

  • Simple LLM Wrappers: User-friendly interfaces that connect users directly with an LLM for general-purpose tasks like summarization, extraction, and content generation.
  • RAG (Retrieval Augmented Generation): Complex systems that combine LLMs with additional data sources to enrich the model’s responses with more precise and contextually relevant information.
  • Agents: Advanced autonomous agents equipped with multi-step reasoning, capable of navigating complex tasks that mimic a human’s decision-making process.

The evaluation approach for each of these application types will differ, tailored to their specific functionalities and user requirements.

Evaluation Methodology

The journey of evaluating an LLM application ideally follows a structured framework, incorporating a suite of specialized tools and libraries. By systematically applying evaluation methods, we can gain meaningful insights into our applications, ensuring they meet our standards and deliver the desired outcomes.

Source — Best Practices for LLM Evaluation of RAG Applications

Step 1. Crafting a Golden Test Set:

The evaluation begins with the creation of a benchmark dataset, which should be as representative as possible of the data the LLM will encounter in a live environment. This is often referred to as a ‘gold test set’ — the standard against which the LLM’s performance is measured.

Many libraries allow us to generate synthetic test sets, such as Langchain’s QA generation chain, llama-index, and ragas, each using its own technique to produce an effective test set.

The creation of a diverse and representative dataset is a crucial step that sets the foundation for a holistic assessment of your LLM application. In this context, I used RAGAS’ test-set generation feature, whose approach to creating evaluation sets is particularly effective at mirroring real-world scenarios, making it an excellent choice for accurately measuring an application’s performance.
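For readers who want to see the shape of the process before picking a library, below is a hand-rolled sketch of synthetic test-set generation: prompting an LLM to draft question/ground-truth pairs from your own document chunks. This is not the RAGAS or Langchain API, just an illustration; the prompt, model name, and JSON format are assumptions.

```python
# Hand-rolled sketch of synthetic test-set generation (not the RAGAS/Langchain API):
# ask an LLM to draft question / ground-truth pairs from your own document chunks.
import json
from openai import OpenAI

client = OpenAI()

GEN_PROMPT = """From the passage below, write {n} question-answer pairs that a user
of our application might realistically ask. Return a JSON list of objects with
"question" and "ground_truth" keys, and nothing else.

Passage:
{passage}"""

def generate_test_cases(passage: str, n: int = 3) -> list[dict]:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.3,
        messages=[{"role": "user", "content": GEN_PROMPT.format(n=n, passage=passage)}],
    )
    # A real implementation should validate the output; this sketch assumes bare JSON.
    return json.loads(response.choices[0].message.content)

# golden_set = [case for chunk in document_chunks for case in generate_test_cases(chunk)]
```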

Step 2. Grading the Results:

Once we have our evaluation test set complete with ground truth and responses generated by our LLM application, the next step is to grade these results. This phase involves using a mix of LLM-assisted evaluation prompts and more integrated, hybrid approaches.

There’s a plethora of open-source libraries equipped with ready-to-use evaluation prompts, each offering unique features and methodologies.
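To give a sense of how small such a grading prompt can be, here is a minimal reference-based correctness grader in the same spirit; the rubric, model name, and binary 0/1 output are illustrative assumptions rather than any particular library's prompt.

```python
# Minimal reference-based grader: compare the application's answer with the
# ground truth from the golden test set. Rubric and model are illustrative.
from openai import OpenAI

client = OpenAI()

GRADER_PROMPT = """You are grading an answer produced by an application.
Given the question, the ground-truth answer, and the application's answer,
reply with 1 if the application's answer is factually consistent with the
ground truth and addresses the question, otherwise reply with 0.

Question: {question}
Ground truth: {ground_truth}
Application answer: {answer}"""

def grade(question: str, ground_truth: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            question=question, ground_truth=ground_truth, answer=answer)}],
    )
    return int(response.choices[0].message.content.strip())

# accuracy = sum(grade(c["question"], c["ground_truth"], app(c["question"])) for c in golden_set) / len(golden_set)
```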

Langchain

Source — Langchain: How correct are LLM Evaluators?

Llama-Index

RAGAS

RAGAS Evaluation Metrics — Source

TruLens

TruLens by TruEra is an innovative tool and a notable contender in the world of Large Language Model Operations (LLMOps). It helps developers objectively measure the quality and effectiveness of LLM-based applications using a set of feedback functions that support feedback-driven analysis, interpretability, and effectiveness metrics.

Source — Trulens: Evaluate and Track your LLM Experiments

These feedback functions enable developers to programmatically evaluate the quality of inputs, outputs, and intermediate results. The library is versatile, supporting a wide range of use cases and integrations.
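To illustrate the general idea of a feedback function (a conceptual sketch only, not TruLens's actual API), you can think of it as a callable that maps an input, output, or intermediate result of a single run to a score:

```python
# Conceptual sketch of a "feedback function" (not the TruLens API): a callable
# that scores some slice of a single application run.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Record:
    user_input: str
    retrieved_context: str
    output: str

# A feedback function maps a record to a score in [0, 1].
FeedbackFn = Callable[[Record], float]

def conciseness(record: Record) -> float:
    """Crude heuristic: shorter answers score higher (illustrative only)."""
    return max(0.0, 1.0 - len(record.output.split()) / 200)

def run_feedback(record: Record, feedbacks: list[FeedbackFn]) -> dict[str, float]:
    """Apply every feedback function to one record, keyed by function name."""
    return {fn.__name__: fn(record) for fn in feedbacks}
```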

LLM-augmented Eval with human-in-the-loop

Below is a chart from a recent Scale AI paper illustrating one possible flow of incorporating machine + human feedback in the evaluation framework.

The crux is how to select the most effective human evaluators, how many you need, and whether these humans can be trusted as the final adjudicators.

Two quick thoughts here, as I previously designed a human-in-the-loop labeling platform: try to incorporate direct customer feedback, and if that is not possible, the human evaluators should closely mimic the distribution of the target customer segments of your LLM application.

Secondly, the machine feedback and human feedback modules should be flexible: each LLM application should have its own sequence of machine and human feedback modules. For example:

Flow One:

Deterministic evaluation → Machine feedback → Human feedback

Flow Two:

Machine feedback → Human feedback → Machine feedback → Deterministic evaluation

Flow Two adds another layer of machine checking after the human feedback, which can be quite useful when you are still selecting the most suitable human evaluators and are not yet fully confident in their adjudication accuracy.
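To make the idea of per-application feedback sequences concrete, here is a small sketch that composes evaluation stages into the two flows above. The stage bodies are stubs; the function names and signatures are illustrative assumptions.

```python
# Sketch of configurable evaluation flows: each stage takes a test case and the
# verdicts gathered so far, and returns the updated verdicts. Stage bodies are stubs.
from typing import Callable

Stage = Callable[[dict, dict], dict]

def deterministic_eval(case: dict, verdicts: dict) -> dict:
    # e.g. exact-match, regex, or schema checks against the ground truth
    verdicts["deterministic"] = case["answer"] == case["ground_truth"]
    return verdicts

def machine_feedback(case: dict, verdicts: dict) -> dict:
    # e.g. an LLM-assisted grader such as the grade() sketch shown earlier
    verdicts["machine"] = None  # placeholder score
    return verdicts

def human_feedback(case: dict, verdicts: dict) -> dict:
    # e.g. queue the case for a human evaluator and record their label
    verdicts["human"] = None  # filled in asynchronously
    return verdicts

# Flow One and Flow Two from the text. Repeated stages overwrite their key in
# this simple sketch; a real system would record each pass separately.
FLOW_ONE = [deterministic_eval, machine_feedback, human_feedback]
FLOW_TWO = [machine_feedback, human_feedback, machine_feedback, deterministic_eval]

def run_flow(case: dict, flow: list[Stage]) -> dict:
    verdicts: dict = {}
    for stage in flow:
        verdicts = stage(case, verdicts)
    return verdicts
```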
