Replacing Judges with Juries: LLM Generation Evaluation with a Panel of LLM Evaluators

SACHIN KUMAR
8 min read · May 1, 2024

Automatically evaluating the correctness of an LLM’s free-form generations is challenging, and relying on human annotators is especially time-intensive. A common workaround is to use an LLM as a judge to score the quality of other LLMs’ outputs, but this can introduce intra-model bias, where a judge favors outputs from its own model family.

In a recent paper [1], the authors introduce a Panel of LLM evaluators (PoLL). Across three distinct judge settings, they find that a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias because it is composed of disjoint model families, and is over seven times less expensive.

Key contributions of paper:

  • Proposes evaluating LLM generations with a Panel of LLM evaluators (PoLL) drawn from different model families rather than with a single large judge.
  • Shows that an instantiation of PoLL correlates better with human judgements than a single large judge (GPT-4), while being over seven times cheaper.
  • Shows that in some scenarios GPT-4 is a relatively weak judge, exhibiting high variance under minor changes to the prompt.
  • Shows that intra-model scoring bias is reduced by pooling judgements across a panel of heterogeneous evaluator models.

Top: Rankings of model performance change drastically depending on which LLM is used as the judge on KILT-NQ. Bottom: The Panel of LLM evaluators (PoLL) has the highest Cohen’s κ correlation with human judgements.

Methods

i) Background: LLM as a Judge

  • A judge evaluator model J is used to score the output a from a test model A.

a) Single-point Scoring

  • Evaluator model J is tasked with rating the quality of a single model output independently of any point of comparison
  • The prompt given to J often includes natural language instructions on how the grading should be performed
  • Rating is based solely on J’s internal model of what a quality output is, with score = J(a)

b) Reference-based Scoring

  • The judge model is additionally given a ’gold’ reference r, which contains the information that should be included in a

c) Pair-wise Scoring

  • The goal is to choose which of two outputs is better.
  • Given outputs a and b generated by two models A and B, an evaluator J compares them and generates a preference score over the outputs as score = J(a, b)
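
The three scoring modes differ only in what the judge gets to see. As a minimal sketch of that difference, the hypothetical helper below assembles the judge input for each mode. The function name and the prompt wording are assumptions made for illustration and are not the prompts used in the paper.

```python
from typing import Optional


def build_judge_prompt(
    answer: str,
    reference: Optional[str] = None,
    other_answer: Optional[str] = None,
) -> str:
    """Assemble the text shown to a judge J for one of the three modes.

    The wording is illustrative only; it is not the paper's prompt.
    """
    if other_answer is not None:
        # Pair-wise scoring: J sees both outputs a and b.
        return (
            "Which answer is better? Reply 'A' or 'B'.\n"
            f"Answer A: {answer}\n"
            f"Answer B: {other_answer}"
        )
    if reference is not None:
        # Reference-based scoring: J sees the output a and the gold reference r.
        return (
            "Does the answer contain the information in the reference? "
            "Reply 'correct' or 'incorrect'.\n"
            f"Reference: {reference}\n"
            f"Answer: {answer}"
        )
    # Single-point scoring: J sees only the output a.
    return f"Rate the quality of this answer from 1 to 5.\nAnswer: {answer}"
```

Calling `build_judge_prompt("Paris")` gives a single-point prompt, while passing `reference=` or `other_answer=` switches to the reference-based or pair-wise form.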

ii) Panel of LLM Evaluators

  • Answer correctness is scored not by a single judge but by a panel composed of multiple evaluator models.
  • To calculate the PoLL score, each evaluator model independently scores a given model output, just as it would in any of the LLM-as-a-judge scenarios outlined above.
  • The individual scores are then pooled through a voting function such that the final score = f(j ∈ P : j(a)), where P is a panel composed of individual judges j and f is a voting function.
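
The formula score = f(j ∈ P : j(a)) can be read as "apply every judge in the panel to the same output, then hand the list of scores to f". The sketch below assumes that each judge is a plain Python callable. The three toy judges are hypothetical stand-ins for real LLM calls and exist only to make the example runnable.

```python
from statistics import mean
from typing import Callable, Sequence

Judge = Callable[[str], float]                 # j(a) -> score
VotingFn = Callable[[Sequence[float]], float]  # f(scores) -> final score


def poll_score(answer: str, panel: Sequence[Judge], vote: VotingFn) -> float:
    """Compute f(j in P : j(a)): score independently, then pool."""
    individual_scores = [judge(answer) for judge in panel]
    return vote(individual_scores)


# Toy stand-in judges (hypothetical; a real panel would call three LLMs).
def strict_judge(answer: str) -> float:
    return 1.0 if answer.strip() == "Paris" else 0.0


def lenient_judge(answer: str) -> float:
    return 1.0 if "paris" in answer.lower() else 0.0


def harsh_judge(answer: str) -> float:
    return 0.0


panel = [strict_judge, lenient_judge, harsh_judge]
final = poll_score("The capital is Paris.", panel, vote=mean)
```

Because the voting function is passed in as an argument, the same panel can be reused with different pooling rules for different tasks.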

Experimental Settings

i) PoLL Composition and Voting

  • The PoLL was constructed from three models drawn from three disparate model families (Command R, Haiku, and GPT-3.5).
  • Two different voting functions were used to aggregate scores across the judges.
  • For the QA datasets, the authors used max voting, as all judgements are binary [correct, incorrect].
  • For Chatbot Arena, the authors used average pooling, because judgements are scores ranging from 1 to 5 and a three-judge panel often does not produce a clear majority decision.
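
The two voting functions can be sketched in a few lines. This sketch assumes that "max voting" means picking the label that receives the most votes, which is one reading of the term and is not confirmed by the text above. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean
from typing import Sequence


def max_vote(labels: Sequence[str]) -> str:
    """Return the label with the most votes (assumed reading of 'max voting').

    With three judges and binary labels there is always a strict majority.
    """
    return Counter(labels).most_common(1)[0][0]


def average_pool(scores: Sequence[float]) -> float:
    """Return the mean of the judges' numeric scores (e.g. on a 1-5 scale)."""
    return mean(scores)


qa_decision = max_vote(["correct", "incorrect", "correct"])
arena_score = average_pool([4.0, 5.0, 3.0])
```

The second call also illustrates why averaging suits the 1 to 5 scale: the three judges give three different scores, so there is no majority label to pick.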

ii) Model Families Used

  • Command R Family: Command R was used as one of the models in the PoLL.
  • GPT Family: GPT-3.5 was used as a member of the PoLL.
  • Claude-3 Family: Claude’s Haiku model was used in the PoLL.
  • Mistral Family: no Mistral model was used as a judge, but Mistral generations were evaluated as a point of comparison, so that one model family is ’unaffiliated’ with any of the judges.

iii) Single-hop Question Answering

  • The question answering (QA) tasks are open-book: a model m is given a question q, must retrieve evidence e from some retrieval system (such as the internet or a dense index over Wikipedia), and must generate an answer g as g = m(q, e).
  • The datasets are the KILT [2] versions of Natural Questions (NQ) [3], TriviaQA (TQA) [4], and HotpotQA (HPQA) [5].

iv) Multi-hop Question Answering

  • questions are designed such that models must perform multiple rounds of retrieval to answer sub-questions and collect sufficient evidence to ultimately answer the initial question.
  • experiments conducted on two datasets: Bamboogle [6] and HPQA.

v) Chatbot Arena Hard

  • benchmark for evaluating LLM head-to-head performance
  • Authors treated Chatbot Arena crowdsourced annotations as ground truth for calculating correlation between evaluator models and human judgements.

vi) Prompting Judges

  • judge models need to be prompted in different ways depending on the particular task setup.
  • The authors’ QA experiments use reference-based scoring, and the judge prompts contain few-shot in-context examples of valid and invalid (q, a, r) triples.
  • For exact match (EM), the authors used the ’containment’ version from prior work [7], which is more amenable to long-form LLM generation: it checks whether a reference answer string appears within the generated model response (after normalization).
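
Containment EM is simple enough to sketch directly. The normalization steps below (lowercasing, stripping punctuation, dropping English articles, collapsing whitespace) are an assumption modeled on common QA evaluation scripts. The text above only says that some normalization is applied.

```python
import re
import string

_ARTICLES = re.compile(r"\b(a|an|the)\b")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower().translate(_PUNCT_TABLE)
    text = _ARTICLES.sub(" ", text)
    return " ".join(text.split())


def containment_em(reference: str, response: str) -> bool:
    """True if the normalized reference appears inside the normalized response."""
    ref = normalize(reference)
    # Guard: an empty reference would otherwise match every response.
    return bool(ref) and ref in normalize(response)


hit = containment_em("The Eiffel Tower", "It is the Eiffel Tower, in Paris.")
miss = containment_em("Berlin", "Paris is the capital of France.")
```

A plain substring check like this also matches inside longer words, which is one reason string matching and human judgements can disagree.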

vii) Human Judgements

  • For gathering human reference judgements, authors utilized Cohere’s internal highly-qualified annotation workforce.
  • Annotators were shown a single anonymized model generated answer at a time along with the original question and reference answer.
  • Annotators were asked to judge whether the reference answer is semantically contained inside the generated answer.

Evaluation Results

i) Correlation to Human Judgements

  • Cohen’s κ Correlation: Cohen’s kappa measures inter-rater reliability, quantifying the level of agreement between two raters or judges while correcting for the agreement expected by chance.
  • The table below shows Cohen’s kappa for each judge model on the single-hop QA datasets from KILT; the best results are in bold and the second-best results are underlined.
  • In the table above, it can be observed that, overall, PoLL has the strongest correlation across the various tasks, while GPT-4 is one of the weaker evaluators on this particular task setup.
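
Cohen’s kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each rater’s label frequencies. A minimal sketch of that computation follows. The labels in the example are made up for illustration and are not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is 0/0.
        # Returning 1.0 here is a choice made by this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


human = [1, 1, 0, 0]   # toy labels: 1 = correct, 0 = incorrect
judge = [1, 1, 0, 1]
kappa = cohens_kappa(human, judge)
```

In this toy case the raters agree on 3 of 4 items (p_o = 0.75), chance agreement is 0.5, and kappa is 0.5.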

ii) Rank Correlation on Chatbot Arena

  • The table below shows Pearson and Kendall-Tau correlations between the rankings produced by different judge models and the rankings of the Chatbot Arena overall leaderboard.
  • In the table above, it can be observed that PoLL is best correlated with the gold rankings, particularly at the top of the ranked list.
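
Both correlation measures can be computed by hand for small rankings. The sketch below assumes rankings without ties and uses the simple tau-a form of Kendall’s tau. The example rankings are made up for illustration and are not the paper’s data.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs. Assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


gold_ranks = [1, 2, 3, 4]    # toy leaderboard positions
judge_ranks = [1, 3, 2, 4]   # toy judge ranking with one swapped pair
tau = kendall_tau(gold_ranks, judge_ranks)
r = pearson(gold_ranks, judge_ranks)
```

With one swapped pair out of six, five pairs are concordant and one is discordant, so tau is 4/6.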

iii) Judgement Variance by Prompt Changes

  • Based on the observation that GPT-4 was the weakest judge model on the KILT evaluations, the authors investigated how the model reacts to modifications of its prompt.
  • GPT-4 is the most powerful judge model the authors tested, yet it performed worse than less capable models on what is essentially a fuzzy string matching exercise.
  • One possible explanation is that GPT-4 over-reasons, injecting too much background knowledge into its decision about the correctness of an answer rather than simply aligning the gold reference with the generation.
  • The table below shows kappa values on NQ for different prompt variants with GPT-4 as judge.
  • In the table above, it can be observed that the correlation between GPT-4 and human annotators varies as the prompt changes.
  • An explicit instruction not to ’overthink’ brings the agreement level of GPT-4 up to that of GPT-3.5 with the standard few-shot prompt, though it remains below Command R, Haiku, and PoLL.

iv) Judge Bias and Consistency

  • For the multi-hop QA datasets, the authors compared the absolute accuracy scores assigned by the individual judges and by PoLL against the scores given by human annotators, and reported the difference (delta).
  • The figure below shows accuracy changes of different evaluation judges as compared to human judgements on HotpotQA (multi-hop).
  • The figure below shows accuracy changes of different evaluation judges as compared to human judgements on Bamboogle.
  • As observed from the figures above, PoLL has the smallest overall spread in scores, with a standard deviation of 2.2, compared to EM and the individual judges. GPT-3.5 has the highest spread, with a standard deviation of 6.1. The highest positive delta for each individual model being scored occurs when it is judged by itself.
  • The figure below shows rankings of model performance on Chatbot Arena Hard as judged by GPT-4 or PoLL. Ranks are compared to those in the original Chatbot Arena.
  • As observed from the figure above, PoLL rankings correlate better with the ground truth, particularly at the top of the ranked list. Intra-model bias can also be observed: the GPT-4 judge ranks another GPT-4 variant in position 2, higher than its actual position 4.
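
The delta and spread computation can be sketched as follows. All accuracy numbers are toy values invented for illustration and are not results from the paper. The use of the population standard deviation is also an assumption, since the text above does not say which variant was used.

```python
from statistics import pstdev
from typing import Dict


def accuracy_deltas(judge_acc: Dict[str, float],
                    human_acc: Dict[str, float]) -> Dict[str, float]:
    """Per-model difference: accuracy under the judge minus accuracy under humans."""
    return {model: judge_acc[model] - human_acc[model] for model in human_acc}


def spread(deltas: Dict[str, float]) -> float:
    """Population standard deviation of the deltas (assumed variant)."""
    return pstdev(list(deltas.values()))


# Toy accuracies in percent (hypothetical model names and values).
human_acc = {"model_x": 70.0, "model_y": 60.0, "model_z": 50.0}
judge_acc = {"model_x": 72.0, "model_y": 59.0, "model_z": 53.0}

deltas = accuracy_deltas(judge_acc, human_acc)
judge_spread = spread(deltas)
```

A judge whose deltas cluster tightly shifts every model by a similar amount, which is why a small spread is read as a sign of consistency.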

v) Cost and Latency

  • The cost of running the authors’ specific instance of PoLL is $1.25/input + $4.25/output (USD per million tokens), whereas the cost of running GPT-4 Turbo is $10/input + $30/output.
  • Depending on the ratio of input to output tokens in a given task, running the entire three-model PoLL is seven to eight times less expensive than running a single GPT-4 judge.
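
The "seven to eight times" range follows directly from the prices quoted above. The sketch below assumes the prices are in USD per million tokens. The ratio itself does not depend on the unit, and the token counts in the example are hypothetical.

```python
from typing import Tuple

# (input price, output price), assumed to be USD per million tokens.
POLL_PRICE: Tuple[float, float] = (1.25, 4.25)
GPT4_TURBO_PRICE: Tuple[float, float] = (10.0, 30.0)


def cost(price: Tuple[float, float], input_tokens: int, output_tokens: int) -> float:
    """Total cost in USD for the given token counts."""
    in_price, out_price = price
    return (in_price * input_tokens + out_price * output_tokens) / 1_000_000


def cost_ratio(input_tokens: int, output_tokens: int) -> float:
    """How many times more expensive GPT-4 Turbo is than the PoLL."""
    return (cost(GPT4_TURBO_PRICE, input_tokens, output_tokens)
            / cost(POLL_PRICE, input_tokens, output_tokens))


input_heavy = cost_ratio(1000, 0)    # 10 / 1.25
output_heavy = cost_ratio(0, 1000)   # 30 / 4.25
typical = cost_ratio(1000, 100)      # a judge prompt with a short verdict
```

The two extremes bound the ratio: 8x when only input tokens are billed and about 7.06x when only output tokens are billed, so any mix falls between them.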

Conclusions and Limitations

  • The authors showed that a Panel of LLM Evaluators composed of smaller models is not only an effective method for evaluating LLM performance, but also reduces intra-model bias, latency, and cost.
  • Limitations exist: the study covers only three evaluator settings and a limited number of judges and panel compositions.
  • PoLL is shown to be an effective alternative to a single large model in these settings, but further work is needed to see how broadly applicable the method is, for example in math or reasoning evaluations.
  • The task of ’panel selection’, or identifying the best models to include in PoLL in terms of quality and cost, is left as future work.

References:

  1. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models, by Verga et al. arXiv:2404.18796
  2. KILT: A benchmark for knowledge intensive language tasks, by Petroni et al. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics
  3. Natural Questions: A benchmark for question answering research, by Kwiatkowski et al. Transactions of the Association for Computational Linguistics
  4. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, by Joshi et al. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics
  5. HotpotQA: A dataset for diverse, explainable multi-hop question answering, by Yang et al. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
  6. Measuring and narrowing the compositionality gap in language models, by Press et al. In Findings of the Association for Computational Linguistics: EMNLP 2023
  7. Lost in the Middle: How Language Models Use Long Contexts, by Liu et al. Transactions of the Association for Computational Linguistics
