An Introduction to Large Language Model Evaluations
A critical step in developing large language model (LLM) based applications is assessing how well the LLM solves the task at hand. This step not only tells you how the LLM performs but also points to concrete actions to improve it. In this post I review the details of running evaluations in LLM applications.
What are LLM evaluations?
LLM evaluations are the processes used to assess an LLM against certain metrics, which can depend on the task the LLM is supposed to solve or on the behavior of the LLM itself. For example, if the task of the LLM is to summarize a body of text, a metric could be defined as the “quality” of the generated summary.
Different metrics by task
Each of the statistical metrics below provides a different perspective on model performance and helps in understanding the strengths and weaknesses of LLMs across tasks. A great post about statistical metrics can be found here.
Text Generation
- Perplexity: Measures how well a probability model predicts a sample; lower perplexity indicates better performance (a short sketch of computing it follows this list).
- BLEU (Bilingual Evaluation Understudy): Compares the overlap of n-grams between generated text and reference text.
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures the overlap of n-grams, word sequences, and word pairs between generated text and reference text.
- METEOR (Metric for Evaluation of Translation with Explicit ORdering): Considers synonymy, stemming, and alignment between generated and reference text.
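For instance, perplexity can be computed directly from the loss of a causal language model. A minimal sketch, assuming the Hugging Face transformers library and the gpt2 checkpoint (an arbitrary choice, any causal LM works):
# Perplexity sketch (assumes transformers and torch are installed; gpt2 is an illustrative choice)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The patient has no history of smoking."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # the model returns the average cross-entropy loss when labels are provided
    outputs = model(**inputs, labels=inputs["input_ids"])
# perplexity is the exponential of the average negative log-likelihood
perplexity = torch.exp(outputs.loss).item()
print(f"Perplexity: {perplexity:.2f}")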
Text Classification
- Accuracy: The ratio of correctly predicted instances to the total instances.
- Precision, Recall, and F1 Score: Precision measures the proportion of true positives among the predicted positives, recall measures the proportion of true positives among the actual positives, and the F1 score is the harmonic mean of precision and recall.
- Confusion Matrix: Shows the breakdown of true positives, true negatives, false positives, and false negatives. A scikit-learn sketch of these classification metrics follows this list.
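A minimal sketch, assuming scikit-learn; the labels are made up for illustration:
# Classification metrics sketch (assumes scikit-learn; labels are made up for illustration)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
cm = confusion_matrix(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}, F1: {f1:.2f}")
print(f"Confusion matrix:\n{cm}")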
Question Answering
- Exact Match (EM): Measures the percentage of predictions that match any one of the ground truth answers exactly.
- F1 Score: Measures the overlap between the prediction and the ground truth answer, considering both precision and recall (a sketch of EM and F1 follows this list).
- Mean Reciprocal Rank (MRR): Measures the rank of the first correct answer, with higher ranks yielding higher scores.
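Exact match and token-level F1 are simple enough to implement directly; a minimal sketch with deliberately simple normalization:
# Exact Match and token-level F1 sketch (simplified normalization, for illustration only)
from collections import Counter

def exact_match(prediction, ground_truth):
    return int(prediction.strip().lower() == ground_truth.strip().lower())

def token_f1(prediction, ground_truth):
    pred_tokens = prediction.lower().split()
    truth_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(truth_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                     # 1
print(round(token_f1("in Paris France", "Paris"), 2))    # 0.5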
Summarization
- ROUGE: Particularly ROUGE-N (for n-gram overlap) and ROUGE-L (for longest common subsequence).
- BLEU: Although less common for summarization, it can still be used to measure n-gram overlap.
- Content Overlap: Measures how much important content from the source text is preserved in the summary.
Machine Translation
- BLEU: Commonly used to measure the accuracy of translated segments compared to reference translations.
- METEOR: Accounts for synonymy and word order, providing a more nuanced evaluation (see the sketch after this list).
- TER (Translation Edit Rate): Measures the number of edits required to change the system output into one of the references.
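METEOR is also available in nltk; a minimal sketch (recent nltk versions expect pre-tokenized input and need the wordnet data downloaded):
# METEOR sketch (assumes nltk with the wordnet data available)
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)

reference = "the cat sat on the mat"
hypothesis = "the cat was sitting on the mat"

# recent nltk versions expect pre-tokenized references and hypothesis
score = meteor_score([reference.split()], hypothesis.split())
print(f"METEOR: {score:.2f}")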
Language Understanding (e.g., Sentiment Analysis)
- Accuracy: Measures the proportion of correctly predicted labels.
- Precision, Recall, and F1 Score: Provide a detailed evaluation of the model’s performance across different classes.
- ROC-AUC (Receiver Operating Characteristic – Area Under Curve): Measures the trade-off between true positive rate and false positive rate (a sketch follows this list).
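ROC-AUC works on predicted probabilities rather than hard labels; a minimal sketch with scikit-learn and made-up scores:
# ROC-AUC sketch (assumes scikit-learn; probabilities are made up for illustration)
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]  # predicted probabilities of the positive class

print(f"ROC-AUC: {roc_auc_score(y_true, y_scores):.2f}")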
Named Entity Recognition (NER)
- Precision, Recall, and F1 Score: Evaluate the model’s ability to correctly identify entities.
- Entity-Level F1 Score: Measures the accuracy of entity boundaries and types (see the seqeval sketch after this list).
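Entity-level metrics are commonly computed with the seqeval package, which expects BIO-tagged sequences; a minimal sketch with made-up tags:
# Entity-level NER metrics sketch (assumes seqeval; BIO tags are made up for illustration)
from seqeval.metrics import precision_score, recall_score, f1_score

y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG", "I-ORG", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O"],     ["O", "B-ORG", "I-ORG", "O"]]

print(f"Precision: {precision_score(y_true, y_pred):.2f}")
print(f"Recall: {recall_score(y_true, y_pred):.2f}")
print(f"F1: {f1_score(y_true, y_pred):.2f}")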
Dialogue Systems
- BLEU: Measures the overlap between generated and reference responses.
- METEOR: Considers synonymy and word alignment for generated responses.
- Human Evaluation: Involves human judges rating the quality, relevance, and coherence of the responses.
How to implement LLM evaluations: tools and examples
These metrics help determine the performance of the LLM, and several Python modules provide implementations for most of them.
For example, for a summarization task you can start with ROUGE and BLEU scores [1,3], as implemented below:
# ROUGE metrics
from rouge_score import rouge_scorer
# BLEU metrics
from nltk.translate.bleu_score import sentence_bleu
from nltk.translate.bleu_score import SmoothingFunction
# pandas is used below to log the results
import pandas as pd
# define the reference and candidate sentences
reference = "The patient has no emergency contact, no occupation, no marital status, no dental insurance, no history of smoking, no alcohol usage, no recreational drug use, no height, no weight, and no dental anxiety. The reason for scheduling this visit is unknown."
candidate = "The patient is 43 years old, he has no ocupation, no marital status, he dental insurance, no history of smoking, no alcohol usage, no recreational drug use, no height, no weight, and no dental anxiety. The reason for scheduling this visit is unknown."
# apply metrics
# ROUGE
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)
# BLEU
# define the smoothing function
smoothie = SmoothingFunction().method4
# sentence_bleu expects a list of tokenized references and a tokenized candidate
bleu_score = sentence_bleu([reference.split()], candidate.split(), smoothing_function=smoothie)
# print the results
print(f"ROUGE: {scores}")
print(f"BLEU: {bleu_score}")
# log the results in a dataframe
metrics = pd.DataFrame({
    'ROUGE': [scores],
    'BLEU': [bleu_score],
})
Other modules that cover additional metrics include sumy, rouge-score, bert-score, etc. For example, sumy can evaluate different ROUGE metrics:
from sumy.evaluation import rouge_1, rouge_2, rouge_l
# note: sumy's rouge functions expect sequences of sumy Sentence objects
# (e.g. produced by one of sumy's parsers), not raw strings
rouge1_score = rouge_1(hypothesis, reference)
rouge2_score = rouge_2(hypothesis, reference)
rougel_score = rouge_l(hypothesis, reference)
One important thing to note about statistical metrics is that, in general, they measure the “difference” or “overlap” between the generated (hypothesis) text and the reference without reasoning about meaning; they only look at the occurrence of n-grams (sequences of consecutive words).
LLM as a judge
Another way to evaluate the output of an LLM on a given task is to use another LLM as a proxy for a human judge. This type of evaluation, in contrast with statistical metrics, pays attention to the meaning of the output rather than just its surface form.
The way this type of evaluation works is, basically, to construct a prompt with which the judge LLM assigns a rating, say from 1 to 5, to the candidate (hypothesis) text according to a certain task (which could be a comparison between candidate and reference texts); a minimal sketch is shown below. One open source evaluation project you can use for this and other approaches is DeepEval.
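The following sketch illustrates the idea, assuming the OpenAI Python client and the gpt-4o-mini model (both arbitrary choices; any capable model and client would do), and reusing the reference and candidate strings from the earlier example:
# LLM-as-a-judge sketch (assumes the OpenAI Python client; model name and rubric are illustrative)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the candidate summary against the reference
on a scale from 1 (very poor) to 5 (excellent), considering faithfulness and completeness.
Reply with the number only.

Reference: {reference}
Candidate: {candidate}"""

def judge(reference: str, candidate: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

score = judge(reference, candidate)  # reference and candidate defined in the earlier example
print(f"Judge score: {score}")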
Once you have a way to evaluate the performance of the LLM on a certain task under some metrics, you can modify either the parameters of the LLM (e.g., fine-tuning) or the prompt itself (prompt engineering).
Usually, the modification of the prompt to improve the performance of the LLM is done manually, based on trial and error. This is not only tedious but also difficult to reproduce and keep track of. There are attempts to do this in a more systematic way to bring consistency and efficiency to the prompt engineering process [2].
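Even without a dedicated framework, keeping prompt experiments reproducible can be as simple as looping over prompt variants, scoring each output with the metrics above, and logging the results. A minimal sketch, where generate_summary and source_text are hypothetical placeholders for your own LLM call and input text, and reference is the reference string from the earlier example:
# Prompt-iteration logging sketch; generate_summary and source_text are hypothetical placeholders
import pandas as pd
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

prompt_variants = {
    "v1": "Summarize the following clinical note:",
    "v2": "Summarize the following clinical note in one sentence, keeping all negations:",
}

rows = []
for name, prompt in prompt_variants.items():
    candidate = generate_summary(prompt, source_text)  # hypothetical LLM call
    rougeL = scorer.score(reference, candidate)["rougeL"].fmeasure
    rows.append({"prompt": name, "rougeL_f1": rougeL})

results = pd.DataFrame(rows)
print(results.sort_values("rougeL_f1", ascending=False))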
Human-based evaluations
A very important step in developing good evaluations for an LLM application is to start with a human-based evaluation. This step is key to refining your automated evaluations later (particularly the custom ones), because it provides a baseline or target against which to improve your evaluations and metrics. In the simplest case, just read the source text and the generated output side by side. That builds the intuition needed to improve the prompt of the judge LLM and to identify metrics that could be applied automatically.
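In practice this can start as a very small review loop that shows each source text next to the generated output and records a rating; a minimal sketch, where examples is an assumed list of (source, output) pairs:
# Minimal human-review loop sketch; `examples` is an assumed list of (source, output) pairs
import pandas as pd

ratings = []
for source, output in examples:
    print("SOURCE:\n", source)
    print("OUTPUT:\n", output)
    score = int(input("Rating 1-5: "))
    notes = input("Notes: ")
    ratings.append({"source": source, "output": output, "score": score, "notes": notes})

human_eval = pd.DataFrame(ratings)
human_eval.to_csv("human_eval.csv", index=False)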
Summary
Once you define the task for an LLM to solve, its performance needs to be assessed. The first step is to define the metrics that will be used for the evaluation; having good metrics for the particular task at hand is critical. Once that is done, the systematic evaluation of the LLM is the basis for improving its performance. In this post an introduction to LLM evaluation was provided, with simple generic metrics for diverse tasks and examples of some of the Python modules that implement them.