Guidelines for Evaluating Large Language Models

Vikas Rai
Thomson Reuters Labs
8 min read · Jun 6, 2024

by Vikas Rai & Shreya R

AI-Generated Image by DALL·E 2

As we entrust more and more of our lives to AI solutions, it becomes important to understand their strengths, limitations, and potential biases. In this post, we will take you through our journey of building an LLM evaluation framework, which serves as a guiding compass for understanding model performance with respect to the quality of the generated output.

1. Introduction

The landscape of large language models (LLMs) is evolving at a breakneck pace, with new models and updated versions being released at an unprecedented rate. This rapid development brings with it a wave of innovation, but also presents several challenges. Each iteration pushes the boundaries of what LLMs can achieve, reflecting relentless efforts to drive innovation in language models. However, this rapid release cycle also poses the following challenges for users:

Evaluation and Benchmarking: The constant release of new models and versions makes it difficult to evaluate and compare their performance. Existing evaluation methods and benchmarks struggle to remain relevant, and the lack of a consistent and continuous evaluation mechanism hinders informed decision-making.

Accessibility and Adoption: The fast-paced innovation, development, and release of LLMs can be overwhelming, making it difficult to identify the most suitable LLM for specific tasks and requirements. This can slow down the process of productizing an LLM-based solution.

Ethical Concerns and Bias: With each new iteration, ethical concerns surrounding bias, fairness, and potential misuse of LLMs need to be carefully addressed.

Navigating the dynamic landscape of LLMs requires a careful balance between embracing innovation and addressing the challenges it presents. In this blog, we will delve into the complexities of LLM evaluation, exploring its key components and various methods employed to assess the capabilities of these powerful models.

Understanding how to evaluate and compare LLMs is crucial for making informed decisions about their application. However, the landscape extends beyond evaluation alone. In a follow-up blog, we will explore the world of benchmarking, leaderboards, and rating systems, providing further insights into the competitive landscape of LLMs and how we can gauge their performance and potential.

2. Need for LLM Evaluation

To effectively assess the performance and application suitability of the constantly evolving LLMs from various providers, it is crucial to set up comprehensive evaluation metrics and methods. A thorough evaluation framework enables us to objectively decide whether an LLM meets specific requirements and effectively serves its intended purpose. Although there is no single universally optimal approach, any evaluation framework needs to encompass at least volume (data size), veracity (accurate and representative responses), and velocity (speed requirements). These are crucial considerations for building a robust evaluation framework.

As different language models are trained on different sets of data, a generic evaluation method may not capture their nuances or domain-specific nature. Additionally, an LLM solution needs to adhere to specific business requirements related to time, cost, SLA (Service Level Agreement), etc., which vary for each LLM business solution. Hence, it is crucial to build custom evaluation frameworks to ensure a meaningful assessment of a language model’s capabilities for your own data, tasks, and requirements.

3. Components of LLM Evaluation

An LLM evaluation can be broadly decomposed into the following key components:

Datasets: The data on which the performance of LLMs is evaluated for a task of interest.

Tasks: Specific challenges or problems in the dataset presented to LLMs to evaluate their capabilities, such as question answering, summarization, or translation.

Prompts: Instructions or contextual hints provided to guide the LLM’s response for a given task. For instance, here are some of the possible strategies we could utilize for prompting (a short sketch of these templates follows this component list):

  • Instruction-Based Prompting: Explicitly instruct the model what to do in the prompt.
  • Few-Shot Prompting: Provide a few examples of the desired input-output pairs before the actual prompt. This helps the model understand the task and format better.
  • Chain-of-Thought Prompting: Encourage the model to break down complex problems into intermediate reasoning steps. This improves reasoning ability and leads to more accurate answers.
  • Role Prompting: Assign a specific role or persona to the model, influencing its style and tone.

Models: The specific LLMs being evaluated. This may include testing smaller and larger models as well as open-source versus proprietary ones.
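
To make these strategies concrete, below is a minimal sketch of how the corresponding prompt templates might look. The example task, labels, and wording are purely illustrative placeholders, not recommendations.

```python
# Illustrative prompt templates for the four strategies above; the task,
# examples, and wording are placeholders, not prescriptions.
clause = "The supplier shall deliver the goods within 30 days of the order date."

# Instruction-based prompting: state the task explicitly.
instruction_prompt = f"Summarize the following clause in one sentence:\n{clause}"

# Few-shot prompting: prepend worked input/output examples before the real input.
few_shot_prompt = (
    "Classify the sentiment of each review as Positive or Negative.\n"
    "Review: Fast delivery and the product works great. -> Positive\n"
    "Review: Arrived broken and support never replied. -> Negative\n"
    "Review: I would happily buy this again. ->"
)

# Chain-of-thought prompting: ask for intermediate reasoning steps.
cot_prompt = (
    "A contract starts on 1 March and runs for 90 days. When does it end?\n"
    "Think step by step before giving the final date."
)

# Role prompting: assign a persona that shapes style and tone.
role_prompt = (
    "You are a senior legal editor. Rewrite the following clause in plain "
    f"English for a non-lawyer audience:\n{clause}"
)
```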

LLM Evaluation Components

4. Evaluation Methodologies

4.1 Classical Quantitative Evaluation

Depending on the task at hand, various traditional quantitative methods can be employed to evaluate a model’s performance. For instance, metrics such as F1, Precision, and Recall are typically used for classification tasks, while measures like BERTScore and ROUGE are utilized for summarization tasks. These traditional methods enable the evaluation of an LLM’s responses by comparing them to a gold-standard dataset.

F1, Precision and Recall: To evaluate the performance of a classification model.

Perplexity: Measures the model’s ability to predict the next word in a sequence.

BLEU: Measures the similarity between machine-generated translations and human reference translations based on n-gram overlap.

ROUGE: Used for evaluating the quality of text summaries by comparing them to reference summaries.

MoverScore: Measures the semantic distance between generated text and reference text using Earth Mover’s Distance over contextualized embeddings.

BERTScore: Computes a similarity score for each token in the candidate sentence against each token in the reference sentence using contextual embeddings.

and many more…
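
As a hedged illustration, the snippet below computes a few of these metrics with commonly used open-source packages (scikit-learn for F1/Precision/Recall and the rouge-score package for ROUGE). The labels and texts are placeholders, and other libraries would work equally well.

```python
# A minimal sketch of classical metrics, assuming scikit-learn and the
# rouge-score package are installed (pip install scikit-learn rouge-score).
from sklearn.metrics import precision_recall_fscore_support
from rouge_score import rouge_scorer

# Classification task: compare predicted labels against gold labels.
gold = ["contract", "tort", "contract", "ip"]
pred = ["contract", "contract", "contract", "ip"]
precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

# Summarization task: compare a generated summary against a reference summary.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
reference = "The court dismissed the appeal and upheld the original ruling."
candidate = "The appeal was dismissed and the original ruling was upheld."
print(scorer.score(reference, candidate))
```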

4.2 Alignment Evaluation

Alignment Evaluation for LLMs is a process to assess how well the model’s outputs align with the intentions of the input or the user’s expectations. It’s a measure of the model’s ability to accurately understand and respond to a prompt in a manner that is helpful, relevant, and contextually appropriate.

Faithfulness: Whether the output is factually consistent with, and grounded in, the input.

Answer Relevance: Relevance of the generated answer to the given prompt.

Hallucination: Model output includes information that is not present in the source.

Toxicity: Output contains offensive language, hate speech, or content that is disrespectful or harmful.

Bias: Output contains gender, racial, or political bias, etc.

Ethics and Morality: Examines the potential societal impacts of LLM technology, such as job displacement or misuse of information.
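
As one simple, illustrative proxy (not a substitute for the LLM-assisted or human evaluation discussed later), answer relevance can be approximated by embedding the prompt and the answer and measuring their cosine similarity. The sketch below assumes the sentence-transformers package is available; the model name and the 0.5 threshold are arbitrary choices.

```python
# A rough proxy for answer relevance: embed the prompt and the answer and
# compare them with cosine similarity. The model name and the 0.5 threshold
# are arbitrary illustrative choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

prompt = "What is the notice period for terminating the agreement?"
answer = "The agreement may be terminated with 30 days' written notice."

embeddings = model.encode([prompt, answer])
relevance = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"answer relevance (cosine similarity) = {relevance:.2f}")
if relevance < 0.5:
    print("Answer may be off-topic; flag for review.")
```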

4.3 Generalization and Robustness Evaluation

Generalization: Measures the model’s performance on data that hasn’t been used for training.

Reproducibility: Measures the model’s ability to produce the same or similar result for a given input.

Robustness: Measures the model’s sensitivity to noise, adversarial examples, etc.
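
A minimal sketch of a robustness probe is shown below: the input is perturbed with small character-level noise and the model’s answers are compared for agreement. The call_llm function is a hypothetical stand-in for whichever model API is under test, and exact-match agreement is a deliberately crude proxy; a semantic similarity score would usually be preferable.

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop characters to simulate noisy input."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > rate)

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the model API being evaluated."""
    raise NotImplementedError

def robustness_probe(prompt: str, n_variants: int = 3) -> float:
    """Fraction of perturbed prompts that yield the same answer as the original."""
    baseline = call_llm(prompt)
    variants = [call_llm(add_typos(prompt, seed=i)) for i in range(n_variants)]
    # Exact-match agreement is a crude proxy; swap in a similarity metric as needed.
    return sum(v.strip() == baseline.strip() for v in variants) / n_variants
```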

4.4 Efficiency and Economic Evaluation

Evaluating the efficiency and economic impact of LLMs requires a comprehensive analysis considering various factors like time, cost, service-level agreements (SLAs), and token limitations.

Inference Time: Time taken by the LLM to process a prompt and input data and generate a response. Factors influencing this include model size, hardware, and complexity involved in the prompt, dataset and task.

Operational Cost: Encompasses ongoing expenses like cloud computing resources cost, API access cost, and maintenance cost.

Context Window: The maximum number of tokens the LLM can consider during processing, impacting the ability to handle long sequences of text.

Generation Length: The maximum number of tokens the LLM can generate in its response, potentially limiting the comprehensiveness of the output.
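
The sketch below illustrates how inference time and cost might be profiled per call. The per-1K-token prices and the call_llm wrapper are placeholders to be replaced with the actual provider’s rates and API.

```python
import time

# Placeholder per-1K-token prices; substitute your provider's actual rates.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015

def call_llm(prompt: str) -> tuple[str, int, int]:
    """Hypothetical model call returning (text, input_tokens, output_tokens)."""
    raise NotImplementedError

def profile_call(prompt: str) -> dict:
    """Measure wall-clock latency and estimate cost from token usage."""
    start = time.perf_counter()
    text, in_tokens, out_tokens = call_llm(prompt)
    latency = time.perf_counter() - start
    cost = (in_tokens / 1000) * PRICE_PER_1K_INPUT + (out_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return {"latency_s": round(latency, 3), "estimated_cost_usd": round(cost, 6), "output": text}
```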

4.5 LLM Assisted Evaluation

The concept is akin to ‘a human evaluating another human’ or ‘a machine evaluating another machine’. An LLM’s powerful capabilities can themselves be leveraged to assess effectiveness: LLM-assisted evaluation is a method where language models are used to evaluate their own performance or that of similar models. Here we use prompt-based evaluation, experimenting with different prompts and models to evaluate the output generated by the model, which gives us more accurate and relevant results. We consider two cases for illustration:

4.5.1 LLM Assisted Evaluation of LLM Response

In this process, the input prompt and contextual information are fed into the LLM-1 model, generating a response. This response, along with an evaluation prompt specifically designed for language model assessment, is then forwarded to another model, referred to as LLM-2. Subsequently, LLM-2 evaluates the model response based on the provided criteria, yielding a graded output.

LLM Assisted Evaluation of LLM Response

Input Prompt and Context: The input prompt consists of the query or prompt provided by the user, along with contextual information such as documents or data. The large language model, LLM-1, is tasked with generating a response to the input prompt provided.

Post Prompt: An evaluation prompt specifically designed for assessing language model performance is added to the model-generated response. This prompt guides the subsequent evaluation process.

Output: The model-generated response, along with the evaluation prompt, is forwarded to another model LLM-2 for evaluation. LLM-2 then assesses the model response based on predefined criteria, producing a graded output that reflects the quality or relevance of the generated response.
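
A minimal sketch of this flow, assuming a hypothetical call_llm wrapper around the provider APIs and an illustrative 1–5 grading rubric, might look as follows:

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around the provider API for a given model."""
    raise NotImplementedError

EVAL_PROMPT = """You are grading another model's answer.
Question: {question}
Context: {context}
Answer: {answer}
Rate the answer from 1 (poor) to 5 (excellent) for faithfulness to the context
and relevance to the question. Reply with the number only."""

def evaluate_response(question: str, context: str) -> dict:
    # LLM-1 generates the response to the user's prompt and context.
    answer = call_llm("llm-1", f"{context}\n\nQuestion: {question}")
    # LLM-2 grades that response using the evaluation (post) prompt.
    grade = call_llm("llm-2", EVAL_PROMPT.format(question=question, context=context, answer=answer))
    return {"answer": answer, "grade": grade.strip()}
```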

4.5.2 LLM Assisted Post-processing of LLM Response

Post-processing is an iterative refinement process aimed at enhancing the quality of model-generated outputs. This method involves fine-tuning the prompt multiple times and passing it through the same language model to iteratively generate improved responses that align with all desired conditions or criteria.

LLM Assisted Post-processing of LLM Response

In this process, the input prompt and contextual information are fed into the LLM-1 model, resulting in a generated response. Subsequently, to enhance the quality of the model’s response, a post-processing prompt is applied, which undergoes additional refinement iterations within the same or a different LLM, ultimately producing an improved response. Fine-tuning input prompts should be the primary focus for improving model performance across various tasks. Post-processing prompts should only be considered if further refinement is necessary and cannot be handled via the input prompt, as they introduce additional execution cost and time. Evaluation metrics should account for the efficiency of the approach, penalizing solutions that rely heavily on post-processing.
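
A hedged sketch of this refinement loop is shown below; call_llm, meets_criteria, and the capped number of rounds are placeholders for whatever API, acceptance checks, and cost budget apply in practice.

```python
def call_llm(model: str, prompt: str) -> str:
    """Hypothetical wrapper around the provider API."""
    raise NotImplementedError

def meets_criteria(response: str) -> bool:
    """Placeholder check, e.g. length limits, required sections, or an LLM grade."""
    raise NotImplementedError

POST_PROMPT = ("Revise the response below so it is concise, cites only the given "
               "context, and uses plain English.\n\nResponse:\n{response}")

def refine(question: str, context: str, max_rounds: int = 2) -> str:
    response = call_llm("llm-1", f"{context}\n\nQuestion: {question}")
    # Each extra round adds latency and cost, so cap the number of iterations.
    for _ in range(max_rounds):
        if meets_criteria(response):
            break
        response = call_llm("llm-1", POST_PROMPT.format(response=response))
    return response
```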

4.6 Human Evaluation

Human evaluation is essential for assessing the effectiveness of large language models as it offers insights that go beyond automated metrics. By involving human evaluators to review and rate LLM-generated outputs based on criteria such as coherence, fluency, and alignment with real-world expectations, we ensure that these models not only reflect statistical patterns in data but also meet the practical needs and expectations of users.

5. Conclusion

In conclusion, as LLMs have revolutionized natural language processing and continue to grow in complexity and capabilities, there is a need for a thorough and rigorous evaluation framework to untangle the capabilities and limitations of LLMs. While there have been many advancements in the evaluation of LLM responses for specific tasks, there is still a need for a holistic approach that can assess their strengths and weaknesses. Researchers and developers should continue to collaborate and explore effective methods to evaluate the performance and impact of LLM solutions.
