Llama2 in LLM-as-a-judge

Can Llama 2 replace GPT-4 for chatbot evaluation?

Geronimo
9 min read · Aug 23, 2023
Judge Llama 2 (Bing Image Creator)

tldr; Today the Open LLM Leaderboard lists 1042 open-source chatbot models, and the list is growing fast: the community adds new ones every day. Ideally, we would judge these chatbots ourselves, manually, but there are just way too many. One way of evaluating chatbots at scale is to use GPT-4 as a judge. Here, I asked how well judge Llama 2 performs compared to GPT-4 (answer: surprisingly well!), whether its judgment also agrees with human evaluation (yes), and whether we can replace GPT-4 with Llama 2 to judge the quality of other LLMs at scale (probably not [yet]).

This is a follow-up to the story “Reproducing Guanaco”, where I stumbled upon increasing evaluation loss while fine-tuning Llama 2 on conversational data. Using the “LLM-as-a-judge” approach published by LMSYS, I saw that GPT-4 “thinks” the model improves with additional epochs of training, despite increasing evaluation loss.

Why do we care about what GPT-4 thinks?

The GPT-4-based evaluation apparently agrees well with human evaluation, “achieving over 80% agreement, the same level of agreement between humans”, according to the authors of the MT-Bench paper. Pretty cool: we can have machines instead of humans judge how well other machines perform, at scale. But also pretty expensive: using GPT-4 for this experiment cost me ~100 USD.

The MT-bench paper ends with

Future directions. 1) Benchmarking chatbots at scale with a broader set of categories 2) Open-source LLM judge aligned with human preference 3) Enhancing open models’ math/reasoning capability

and this is where I started: I downloaded the best-performing model from Hugging Face (upstage/Llama-2-70b-instruct-v2 at the time), modified the MT-Bench code, and used this 70B open-source model as a judge instead of paying for GPT-4. Below are the results.

Figure 1. Bar plots (left to middle): Agreement of judges GPT-4 and Llama 2 with human evaluation of 6 models (gpt-4, gpt-3.5-turbo, vicuna-13b-v1.2, claude-v1, alpaca-13b, llama-13b). Agreement between humans is shown as grey bars. Each judge was asked to compare the models’ answers to 80 questions plus one follow-up question each (= a multi-turn conversation), in a pairwise, all-against-all fashion (2400 judgments per judge). The GPT-4 and human judgments were generated/collected by LMSYS, the Llama 2 judgments were generated by me (see code for details), and the analysis was performed using MT-Bench (github, paper). Note that the numbers do not exactly match the numbers in their paper; if you really care, contact me and I will explain. Scatter plot (right): Correlation of GPT-4 versus Llama 2 judgments. Each dot is the adjusted winrate of one model as judged by GPT-4 and Llama 2 (= upstage/Llama-2-70b-instruct-v2), ties included, errors excluded. PCC = Pearson correlation coefficient.

I expected this to be a complete waste of time but was hooked by the initial results and had to dig deeper.

What is shown in the figure above?

In simple words ..

  • GPT-4 is a good judge of conversations, good in the sense that its opinion agrees well with human evaluation; the agreement is on the same level as the agreement between humans. We knew that already.
  • Llama 2 (actually upstage/Llama-2-70b-instruct-v2) performs worse as a judge; its performance improves, however, if judgments that contain ties and errors are excluded from the analysis.

What are ties and errors?

Spoiler alert: The judgment quality of Llama 2 is lower than that of GPT-4.

Ties

  • MT-Bench works by presenting one question and two answers (generated by the two models being evaluated) to the judge, who is asked to decide which model gave the better answer, or to output “tie” if it cannot decide.
  • MT-Bench also asks the judge twice for each pair of answers, with the order of the answers swapped, to assess positional bias: a judge might, for example, always favor the first answer over the second one. If the judge changes its mind when the order is swapped, i.e. it is inconsistent, MT-Bench counts this as a tie too (see the sketch after this list).
  • GPT-4 and Llama 2 both produce ties.
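
To make this concrete, here is a minimal sketch in Python (my own simplification, not the actual MT-Bench code) of how the two verdicts from the swapped-order prompts collapse into a single outcome:

def resolve_pair(verdict_order_1: str, verdict_order_2: str) -> str:
    """Collapse two swapped-order verdicts into one outcome.

    verdict_order_1: verdict when model 1 was shown as assistant A
    verdict_order_2: verdict when the order was swapped (model 2 shown as assistant A)
    Verdicts are "A", "B" or "C" (tie), extracted from "[[A]]"/"[[B]]"/"[[C]]".
    """
    if verdict_order_1 == "A" and verdict_order_2 == "B":
        return "model_1"  # same winner in both orders: consistent win for model 1
    if verdict_order_1 == "B" and verdict_order_2 == "A":
        return "model_2"  # consistent win for model 2
    if verdict_order_1 == "C" and verdict_order_2 == "C":
        return "tie"      # a "consistent" tie
    return "tie"          # inconsistent verdicts are counted as a tie as well


# e.g. resolve_pair("B", "A") -> "model_2", the consistent judgment shown further below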

Errors

  • This means the judge’s output cannot be parsed (examples below). The judge’s output, just like any chatbot’s, is unstructured and has to be parsed. Even though MT-Bench asks for a specific format, the judge sometimes does not follow it (a sketch of the parsing step follows this list).
  • This almost exclusively happens with Llama 2.
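
Extracting the verdict is essentially a pattern match on the “[[..]]” token; a minimal sketch (not the exact MT-Bench parser, which uses its own regular expressions):

import re


def parse_verdict(judgment: str) -> str:
    """Extract the final verdict from the judge's free-form output.

    Returns "A", "B" or "C" (tie), or "error" if no "[[..]]" token is found;
    this is how unparseable outputs end up being counted as errors.
    """
    match = re.search(r"\[\[([ABC])\]\]", judgment)
    return match.group(1) if match else "error"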

Some numbers

Figure 2. 2400 judgments collected from GPT-4 and Llama 2. Errors (judge output could not be parsed) in red (GPT-4: 2), inconsistent judgments (the judge changes its mind when the order of presentation of the answers is swapped) in yellow, consistent judgments in green. The number of “consistent” ties, i.e. the judge thinks it’s a tie independent of the order of presentation, is roughly the same for both judges (GPT-4: 220, Llama 2: 187; data not shown in the figure).

Examples

Just to illustrate the process, here is an example of what a prompt to the judge looks like (a sketch of how the template gets filled in follows right after it):

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: “[[A]]” if assistant A is better, “[[B]]” if assistant B is better, and “[[C]]” for a tie.
[User Question]
Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions.

[The Start of Assistant A’s Answer]

[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]

[The End of Assistant B’s Answer]
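
Filling this template is plain string formatting; here is a minimal sketch (variable and function names are mine, the actual MT-Bench code loads the judge prompts from its own template files):

# Abbreviated; the full judge instructions are shown above.
JUDGE_TEMPLATE = """Please act as an impartial judge [...] "[[C]]" for a tie.

[User Question]
{question}

[The Start of Assistant A's Answer]
{answer_a}
[The End of Assistant A's Answer]

[The Start of Assistant B's Answer]
{answer_b}
[The End of Assistant B's Answer]"""


def build_judge_prompt(question: str, answer_a: str, answer_b: str) -> str:
    # The second pass simply swaps answer_a and answer_b to test for positional bias.
    return JUDGE_TEMPLATE.format(question=question, answer_a=answer_a, answer_b=answer_b)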

Ideally, the judgment looks something like this (in .json format; this is how MT-Bench stores it):

"question_id": 81, 
"model_1": "alpaca-13b",
"model_2": "gpt-3.5-turbo",
"g1_winner": "model_2",
"g2_winner": "model_2",
"g1_judgment": "Assistant B provided a better response to the second user
question. The user asked the assistants to rewrite their previous responses
starting every sentence with the letter 'A'. Assistant A failed to follow
the user's instructions and simply repeated the initial response without
making any changes. On the other hand, Assistant B successfully rewrote
the response, starting every sentence with the letter 'A' as per the user's
request. Therefore, Assistant B was more helpful and accurate in following
the user's instructions. [[B]]",
"g2_judgment": "Assistant A did a better job at following the user's
instructions in the second question. The user asked to rewrite the previous
response starting every sentence with the letter 'A'. Assistant A
successfully rewrote the entire response with each sentence starting
with 'A'. On the other hand, Assistant B simply repeated the previous
response without making any changes to start each sentence with 'A'.
Therefore, Assistant A was more accurate and followed the user's
instructions better. My final verdict is: [[A]]",

That’s a valid and consistent judgment. The judge thinks the answer of gpt-3.5-turbo (model 2) is better in both orders: it outputs “[[B]]” when the answer of alpaca-13b is shown first and gpt-3.5-turbo second, and “[[A]]” when the order is swapped and gpt-3.5-turbo is shown first. The winner is the same regardless of presentation order, so the judgment counts as consistent.

If at least one of the two outputs cannot be parsed (no “[[..]]” present), the judgment is counted as an error. For example:

"question_id": 134,
"model_1": "gpt-4",
"model_2": "llama-13b",
"g1_winner": "model_1",
"g2_winner": "error",
"g1_judgment": "After evaluating both responses, it appears that Assistant A
has followed the user's instructions more accurately than Assistant B.
Assistant A correctly identified Company V as having the highest profit in
2021 ($25 billion), while also mentioning its CEO (Lisa Brown). On the other
hand, Assistant B incorrectly stated that Company W had the highest profit
in 2021 when, in fact, it was Company V. Therefore, my verdict is [[A]],
indicating that Assistant A answered the user's question better.",
"g2_judgment": "After evaluating both responses, it appears that
neither response fully answered the user's question correctly.
While Assistant A identified the correct company (Company W), they did
not mention the CEO's name. On the other hand, Assistant B mentioned
the incorrect company (Company V) but gave the accurate CEO's name
(Lisa Brown). Therefore, I cannot declare either answer superior or
equal without further clarification from the users on what aspect was
more important - identifying the right company or naming the CEO accurately.",

Another error example:

"question_id": 139,
"model_1": "vicuna-13b-v1.2",
"model_2": "llama-13b",
"g1_winner": "error",
"g2_winner": "error",
"g1_judgment": "As an AI, I am unable to provide a response that is
not based on the information provided in your question. Please clarify
or rephrase your query so that I can assist you better.",
"g2_judgment": "As an AI, I am unable to provide you with a
direct answer. However, if you could please clarify your question or
provide more information about what exactly it is that you are looking
for, I would be happy to assist you further.",

And another one where Llama 2 is obviously confused by the prompt of a multi-turn conversation.

"question_id": 81,
"model_1": "gpt-4",
"model_2": "vicuna-13b-v1.2",
"g1_winner": "error",
"g2_winner": "error",
"g1_judgment": "As an AI language model, I am not capable of having
conversations or responding directly to users. However, if you provide
me with more information or context, I can generate text based on your
input.",
"g2_judgment": "Thank you for using our services. We hope we were helpful
in providing information or assistance. Please feel free to ask us
anything else or let us know if you need further clarification on any topic.
Have a great day ahead!",

Conclusion

Long story short: compared to GPT-4, judge Llama 2 ..

  • is more often inconsistent (roughly twice as often), resulting in more ties.
  • sometimes refuses to follow the instructions; in those cases the judgment does not contain “[[…]]” as requested in the prompt. GPT-4 practically never does this.
  • is sometimes confused by multi-turn conversations; 90% of its errors occur when judging a multi-turn conversation. Again, GPT-4 practically never does this.

However, in those cases where judge Llama 2 does reach a valid and consistent judgment (i.e. excluding ties and errors), its agreement with human evaluation reaches the level of human-human agreement, which is 82% (see Figure 1).

Note: The data shown in Figure 1 is the judges’ performance when judging multi-turn conversations, which is what we are really interested in since we are evaluating chatbots. The numbers for single-turn conversations are basically the same (not shown).
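
For reference, the agreement shown in Figure 1 can be thought of as the fraction of shared pairwise comparisons on which two judges (or a judge and the human majority vote) pick the same side. A rough sketch of that idea (a hypothetical helper, not the exact MT-Bench analysis code):

def agreement(judge_a: dict, judge_b: dict, exclude_ties_and_errors: bool = True) -> float:
    """Fraction of shared pairwise comparisons on which two judges agree.

    judge_a / judge_b map a comparison id (question, model_1, model_2) to an
    outcome: "model_1", "model_2", "tie" or "error".
    """
    shared = judge_a.keys() & judge_b.keys()
    if exclude_ties_and_errors:
        decided = ("model_1", "model_2")
        shared = {k for k in shared if judge_a[k] in decided and judge_b[k] in decided}
    if not shared:
        return float("nan")
    return sum(judge_a[k] == judge_b[k] for k in shared) / len(shared)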

OK. The question is ..

Can we still use Llama 2 as a judge?

The analysis above was an experiment to see whether judge Llama 2 does anything useful. It does, and considering for a moment that GPT-4 is rumoured to be a mixture-of-experts model with 8x 220B = 1760 billion parameters, I did not expect that a 70-billion-parameter model could compete with it. But it can. Somehow.

To me, the real-world application of this automated approach is to approximate human evaluation of LLMs and apply it during training: compare the performance of models trained with different methods and datasets, maybe even while training is still running, to get the best out of the open-source models and tools we have available and close the gap to the commercial, closed-source models like GPT-3.5/4 and Claude.

In the previous story, I fine-tuned Llama 2 7B on the Open Assistant dataset to reproduce Guanaco and had GPT-4 evaluate the model quality of different checkpoints. Assuming that this evaluation indeed approximates human evaluation, model quality improves with the number of epochs trained and at some point flattens out, according to GPT-4. Using the approach described here, I checked whether judge Llama 2 would arrive at the same conclusion.

Figure 3. Winrate of each Guanaco fine-tuning checkpoint against GPT-3.5, judged by GPT-4 and Llama 2. Ties and errors excluded. 500 steps equal ~0.9 epochs (details here). The LoRA adapters for each checkpoint are available on Hugging Face.
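
For clarity, “winrate” here means, as far as I understand the MT-Bench analysis, the fraction of wins among the valid judgments once ties and errors are dropped. A minimal sketch with made-up inputs:

def winrate(outcomes: list[str], model: str = "model_1") -> float:
    """Winrate of `model` against its opponent, ties and errors excluded.

    `outcomes` is a list of resolved pairwise outcomes:
    "model_1", "model_2", "tie" or "error".
    """
    decided = [o for o in outcomes if o in ("model_1", "model_2")]
    if not decided:
        return float("nan")
    return sum(o == model for o in decided) / len(decided)


# e.g. a checkpoint winning 23 of 100 decided comparisons against GPT-3.5
# gives winrate(["model_1"] * 23 + ["model_2"] * 77) == 0.23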

The basic trends are similar, model quality increases for the first 4–5 checkpoints (2500–3000 training steps, corresponding to 3.6–4.5 epochs), according to both judges.

However ..

  • according to GPT-4, model quality is highest after 2500 steps, while judge Llama 2 seems to be generally more impressed, assigning higher absolute winrates, and sees quality peaking already after 2000 training steps.
  • according to judge Llama 2, the model’s winrate fluctuates substantially after the initial improvements and drops from 32% to 23% at the end (checkpoints 3000 and 3500). GPT-4 does not agree; there, the winrate basically stays the same.

Taken together, the raw trends are the same, but the judges do not agree on which model checkpoint is the best one, which to me is the most important thing. Also, reproducibility might be an issue: I ran these evaluations only once, and the drop in quality at the end according to judge Llama 2 might not be significant, given possibly high variance in the judgments. I have not tested this.

Conclusions

  • (GPT-4 based evaluation of chatbot output agrees well with human evaluation; known before)
  • Surprisingly, judge Llama 2 (70B) can compete with judge GPT-4 (rumoured 1760B). But ..
  • Judge Llama 2 is more biased (more ties) and less capable of following the instructions carefully (more errors).

Look at the data and make your own conclusions. My personal take on this: I would not use judge Llama 2 (upstage/Llama-2-70b-instruct-v2) for anything serious .. yet. Maybe the next fine-tune released tomorrow will be a lot better, or maybe fine-tuning it on evaluations would improve its capabilities as a judge.

The code to reproduce all of the above is available here. If you like this story, have additional ideas or questions, or wonder why in hell anyone would spend time doing this, please leave a comment here or on twitter.
