Chatbot Performance: From Subjective Feelings to Objective Metrics

Automated evaluation of large language models using the LMSYS Framework: results and lessons learned

Geronimo
27 min read · May 12, 2023
Llamas, Alpacas and Vicunas (Bing Image Creator)

Evaluating large language models (LLMs) is a laborious task. Each model means gigabytes of data to download, plus time and tools for quantization and inference. Long before I even start throwing my random questions at the new model, three more have been published.

Here, I share an approach to automatically evaluate LLMs using ChatGPT, the results I obtained and what I learned in the process.

What is a good model?

How does one actually judge whether a model is good? My conversations range from “hello” and “who are you?” to “let’s play a game: I will think of an object in my kitchen and you have to guess which one it is by asking questions”. My ultimate test usually is “Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.” from the Sparks of AGI paper by Microsoft Research.

These conversations are fun and sometimes surprising, and over time they build a semi-quantitative feeling in my head of how good a model is. However, the process is slow, and the results are subjective and biased.

Automated evaluation of large language models

The Large Model Systems Organization (LMSYS Org) recently presented an impressive benchmark of their Vicuna model. This comprehensive analysis was based on a series of 80 questions, using automation and quantitative measurements.

However, one drawback of this evaluation was that it employed the costly and access-restricted GPT-4 API to assess the performance of the models. Also, several promising open source models, especially the ones trained by Open Assistant (OA), were left out of their benchmark.

In light of these limitations, I adapted the evaluation framework originally proposed by the LMSYS Org and used GPT-3.5 to automatically rank various open source language models including Open Assistant. Here are the results and how I did it.

Figure 1: Winning percentage of an all-against-all competition in zero-shot performance. 70 questions asked to each model. Answers evaluated by GPT-3.5 (API). Shown are mean and std. dev. of winning percentage, 3 replicates per model. Control model: GPT-3.5 answers “shifted” = answers not related to the question asked. Bottom: Human preference as Elo ratings, assessed in the LMSYS chatbot arena.

Open source models evaluated

Update 07.06.2023: Added Wizard-Vicuna, Guanaco and OA-Falcon-40B
Update 21.05.2023: Added GPT-4 model evaluation and current LMSYS Elo ratings

A step-by-step breakdown of the process

  1. Use GPT-3.5 as a judge (instead of GPT-4)
  2. Generate answers from the latest Open Assistant models
  3. Add a negative control
  4. Overview of evaluation: Models, winning percentage, costs
  5. Results
  6. Summary

Note: what follows is a lot of text, most of it quotes of chatbot output

GPT-3.5 as judge

Since I did not have access to the GPT-4 API, I first checked whether the LMSYS FastChat evaluation pipeline works using GPT-3.5 (API model gpt-3.5-turbo) as a judge instead of GPT-4.

The FastChat prompt passed to the OpenAI API looks like this:

[Question] {question}

[The Start of Assistant 1’s Answer] {answer_1} [The End of Assistant 1’s Answer]

[The Start of Assistant 2’s Answer] {answer_2} [The End of Assistant 2’s Answer]

You are a helpful and precise assistant for checking the quality of the answer.

We would like to request your feedback on the performance of two AI assistants in response to the user question displayed above.

Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.

Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.

This is convenient for me since I do not (update: did not) have access to the GPT-4 API. GPT-3.5 API requests are also 30 times cheaper than GPT-4 requests ($0.002 versus $0.06 per 1000 generated tokens).
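For reference, a minimal sketch of the judge call, assuming the 2023-era openai Python library (the v0.27-style API); the exact request FastChat builds may differ slightly:

import openai

prompt = "..."  # the filled-in FastChat evaluation template shown above

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the single change: "gpt-4" -> "gpt-3.5-turbo"
    messages=[{"role": "user", "content": prompt}],
    temperature=0.2,  # assumption: a low temperature for more reproducible judgments
    max_tokens=512,
)
review = response["choices"][0]["message"]["content"]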

GPT-3.5 evaluations biased?

Modifying the pipeline to use GPT-3.5 instead of GPT-4 was easy, just a single change in the code. When I looked at the results, I noticed a bias in the evaluations: the scores returned by GPT-3.5 for Assistant 2 were often higher than for Assistant 1, including many score pairs of (8, 9) for (Assistant 1, Assistant 2).

When I switched the order of the answers in my modified GPT-3.5 pipeline, the returned scores stayed the same (Assistant 2 > Assistant 1).

Here is an example evaluation for question #1:

How can I improve my time management skills?

GPT-3.5 on Bard versus Vicuna (first line contains the scores):

8 9

Both Assistant 1 and Assistant 2 provided helpful, relevant, and accurate responses to the question on effective ways to deal with stress. They both covered similar strategies such as exercise, healthy eating, good sleep, relaxation techniques, and seeking professional help. Assistant 2 provided more details on each strategy and included additional ones such as mindfulness and time management. Assistant 2 also emphasized the importance of social support, which was not mentioned by Assistant 1. Overall, both assistants provided valuable information and guidance on managing stress, but Assistant 2’s response was more comprehensive and informative. Therefore, Assistant 2 received a slightly higher score than Assistant 1.

Vicuna versus Bard:

8 9

Both Assistant 1 and Assistant 2 provided helpful, relevant, and accurate responses with a good level of detail. They both covered a range of effective ways to deal with stress, including exercise, mindfulness, social support, healthy eating, good sleep, time management, relaxation techniques, and seeking professional help. Assistant 2 provided a slightly more comprehensive explanation, including specific recommendations for the amount and type of exercise, diet, and sleep, as well as emphasizing the importance of finding what works best for the individual. Overall, both assistants provided valuable and practical advice for managing stress.

To take this to an extreme, I did a control experiment where I prompted GPT-3.5 to score two identical answers. Still the same: even though the texts were identical, the second answer always got a higher score.

Full prompt, asking to score two identical answers:

What are the most effective ways to deal with stress?

[The Start of Assistant 1’s Answer]

Stress can be caused by a variety of factors, including work, relationships, health, and financial issues. Here are some effective ways to manage stress:

1. Exercise: Regular exercise releases endorphins, which are natural stress-relievers. It can be as simple as going for a walk or jog, practicing yoga, or joining a fitness class.

2. Mindfulness: Mindfulness practices, such as meditation or deep breathing, can help to reduce stress and improve focus.

3. Social Support: Spending time with family and friends, or seeking support from a mental health professional, can provide a sense of connection and help to reduce feelings of isolation and loneliness.

4. Healthy Eating: A balanced diet that includes plenty of fruits, vegetables, and whole grains can help to reduce stress levels.

5. Good Sleep: Getting enough sleep is essential for good mental and physical health. Aim for 7–8 hours per night and establish a regular sleep routine.

6. Time Management: Prioritizing tasks and setting realistic goals can help to reduce stress levels by giving a sense of control and direction.

7. Relaxation Techniques: Techniques such as progressive muscle relaxation, aromatherapy, and massage can help to reduce stress levels and promote relaxation.

8. Seek professional help: If stress is interfering with daily life, it may be time to seek professional help from a therapist or counselor.

[The End of Assistant 1’s Answer]

[The Start of Assistant 2’s Answer]

Stress can be caused by a variety of factors, including work, relationships, health, and financial issues. Here are some effective ways to manage stress:

1. Exercise: Regular exercise releases endorphins, which are natural stress-relievers. It can be as simple as going for a walk or jog, practicing yoga, or joining a fitness class.

2. Mindfulness: Mindfulness practices, such as meditation or deep breathing, can help to reduce stress and improve focus.

3. Social Support: Spending time with family and friends, or seeking support from a mental health professional, can provide a sense of connection and help to reduce feelings of isolation and loneliness.

4. Healthy Eating: A balanced diet that includes plenty of fruits, vegetables, and whole grains can help to reduce stress levels.

5. Good Sleep: Getting enough sleep is essential for good mental and physical health. Aim for 7–8 hours per night and establish a regular sleep routine.

6. Time Management: Prioritizing tasks and setting realistic goals can help to reduce stress levels by giving a sense of control and direction.

7. Relaxation Techniques: Techniques such as progressive muscle relaxation, aromatherapy, and massage can help to reduce stress levels and promote relaxation.

8. Seek professional help: If stress is interfering with daily life, it may be time to seek professional help from a therapist or counselor.

[The End of Assistant 2’s Answer]

We would like to request your feedback on the performance of two AI assistants in response to the user question displayed below.

Please rate the helpfulness, relevance, accuracy, level of details of their responses. Each assistant receives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance.

Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.

Evaluation by GPT-3.5:

7 8

Both Assistant 1 and Assistant 2 provided helpful, relevant, and accurate responses to the question on effective ways to deal with stress. They both covered similar strategies, including exercise, mindfulness, social support, healthy eating, good sleep, time management, relaxation techniques, and seeking professional help. Assistant 2, however, provided slightly more detailed explanations for each strategy, which may have contributed to its higher score. Overall, both assistants provided valuable insights and practical tips for managing stress, and their responses can be useful for individuals seeking to improve their mental health and well-being.

Weird.

GPT-4 evaluations: also biased

While working on this, I was granted access to the GPT-4 API. This is what judge GPT-4 says (same prompt as above, asking to score two identical answers):

7 8

Both Assistant 1 and Assistant 2 provided comprehensive and relevant answers to the question on effective ways to deal with stress. They both covered a range of strategies, including exercise, mindfulness, social support, healthy eating, good sleep, time management, relaxation techniques, and seeking professional help. The level of details in their responses was appropriate, providing enough information without overwhelming the reader.

Assistant 2’s response also had a slightly better flow and readability. However, both assistants provided valuable insights and suggestions for managing stress. Overall, both assistants performed well in addressing the user’s question.

Bias in original Vicuna data?

I wondered whether something similar occurs in the original Vicuna data. It seems like this is indeed the case:

bash-3.2$ python3 get_scores.py -r table/review/vicuna-13b_20230322-clean-lang/ | sort | uniq -c | sort -r | head -5
89 8.0 9.0
32 9.0 9.0
30 7.0 9.0
26 9.0 8.5
20 8.0 10.0

Out of a total of 320 prompts sent to GPT-4 (80 questions x 4 models tested against Vicuna: Alpaca, LLaMA, Bard, GPT-3.5), in 89 (27.8%) of the cases it returned scores of exactly 8 and 9 for Assistant 1 and 2, respectively.

Overall, the score for Assistant 2 was higher than the score for Assistant 1 in 207 (64.7%) of the cases. In their code, Assistant 2 is always Vicuna.
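These counts are easy to reproduce. A sketch, assuming the score pairs have already been parsed from the get_scores.py output into (score_1, score_2) tuples:

from collections import Counter

def bias_stats(score_pairs):
    # score_pairs: list of (score_assistant_1, score_assistant_2) floats
    pair_counts = Counter(score_pairs)
    second_higher = sum(1 for s1, s2 in score_pairs if s2 > s1)
    return pair_counts.most_common(5), second_higher / len(score_pairs)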

On close inspection, the FastChat prompt used to ask GPT-4 for an evaluation includes a reminder to be unbiased towards the order of the answers, so they probably noticed this bias too.

.. Please first output a single line containing only two values indicating the scores for Assistant 1 and 2, respectively. The two scores are separated by a space. In the subsequent line, please provide a comprehensive explanation of your evaluation, avoiding any potential bias and ensuring that the order in which the responses were presented does not affect your judgment.

Removing the bias

I spent a lot of time tweaking the prompt:

  • renaming the Assistants to “A” and “B” instead of “1” and “2”
  • asking for scores of 0–3 (0 total waste of computing power, 1 OK answer but not very useful, 2 useful, 3 excellent)
  • asking for scores between 1 and 100 and calling ties myself using different cutoffs
  • restructuring the prompt, moving the question around

I finally found a phrasing that removed the bias by

  • asking for the winner only: no scores, and no explanation of why one answer is better than the other. It seems that explaining itself confuses the OpenAI models.
  • asking it to call ties if the answers are similar in quality

New prompt:

…. We would like to request your feedback on the performance of the two AI assistants (Assistant 1 and Assistant 2) in response to the user question displayed above. Please rate the helpfulness, relevance, accuracy, level of details of their responses. Please output who provided the best answer. If both answers are equally good and it’s hard to decide on a winner then please call it a tie. Your output should look like this: ‘Winner: Assistant 1’ or ‘Winner: Assistant 2’ or ‘Tie’. Do not output anything else.
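Parsing the verdict then reduces to simple string matching. A hypothetical helper (the expected strings are the ones requested in the prompt; the judge occasionally strays from them):

def parse_verdict(review: str) -> str:
    # the prompt asks for 'Winner: Assistant 1', 'Winner: Assistant 2' or 'Tie'
    lines = review.strip().splitlines()
    first_line = lines[0] if lines else ""
    if "Assistant 1" in first_line:
        return "win_1"
    if "Assistant 2" in first_line:
        return "win_2"
    if "Tie" in first_line:
        return "tie"
    return "unparseable"  # the judge sometimes ignores the format instructions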

To me as a human being, this totally makes sense: it is definitely harder to come up with scores of 0–10 when shown only a single pair of answers than to simply judge which of the two is better.

We want these models to behave like humans. The current LLMs are clearly not there yet, but maybe they already behave similar to us in this regard?

Open Assistant (OA)

.. generates the best replies, in my personal opinion. But I am a biased human being: I love the OA community, its drive and energy, and the general spirit of open source. The purpose of this entire exercise, however, was to come up with a way of evaluating LLMs in an unbiased way.

These are the OA models I used in the evaluation:

  • oasst-rlhf-2-llama-30b-7k-steps (4bit-quantized using GPTQ), downloaded from Hugging Face
  • oasst-sft-7-llama-30b (also 4bit-quantized), HF
  • pythia-12b-sft-v8-7k-steps (loaded in 8bit), HF

Answers were generated using the following sampling parameters:

"metadata": {
"temperature": 0.75,
"max_new_tokens": 1024,
"do_sample": true,
"top_p": 0.95,
"top_k": 50,
"repetition_penalty": 1.2,
"encoder_repetition_penalty": 1.0
}
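For reference, a rough sketch of how an answer can be generated with these parameters using Hugging Face transformers; the model name is a placeholder for one of the checkpoints above, and the chat prompt template the OA models expect is omitted:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenAssistant/pythia-12b-sft-v8-7k-steps"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, load_in_8bit=True, device_map="auto"  # 8bit loading requires bitsandbytes
)

prompt = "What are the most effective ways to deal with stress?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    temperature=0.75,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.95,
    top_k=50,
    repetition_penalty=1.2,
)
# decode only the newly generated tokens, not the prompt
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)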

Negative Control

Even though the OA models generate good answers, ChatGPT (currently) is better and is what I usually use, except for the cases where it excuses itself (“As an AI..”) and claims it cannot do whatever I asked it to do.

As a negative control, I included syntactically correct but content-wise completely wrong answers. I did this by shifting the question/answer association of the GPT-3.5-generated answers by one. So, the negative control gpt-3.5-shifted provides perfect English but answers question X-1 when asked question X:

Question 3

What are the main differences between Python and JavaScript programming languages?

Answer gpt-3.5-shifted:

Here are some effective ways to deal with stress:

1. Exercise regularly: Physical activity can help reduce stress and improve mood.

2. Practice relaxation techniques: Deep breathing, meditation, and yoga are some effective ways to relax and reduce stress levels.

3. Get enough sleep: Lack of sleep can increase stress levels. Try to get at least 7–8 hours of sleep each night.

4. Eat a healthy diet: A well-balanced diet can help reduce stress levels and improve overall health.

5. Connect with others: Spending time with friends and family, joining a support group, or participating in social activities can help reduce stress levels and improve mood.

6. Time management: Prioritizing tasks and creating a schedule can help reduce stress and increase productivity.

7. Seek professional help: If stress is impacting your daily life, consider seeking professional help such as therapy or counseling.
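The shift itself is only a few lines. A minimal sketch, assuming the GPT-3.5 answers sit in a JSON-lines file with one "text" field per answer (the file names and field names here are assumptions; the actual LMSYS file layout may differ):

import json

with open("answer_gpt35.jsonl") as f:  # hypothetical file name
    answers = [json.loads(line) for line in f]

# rotate the answer texts by one: question X receives the answer to question X-1
texts = [a["text"] for a in answers]
shifted = texts[-1:] + texts[:-1]

with open("answer_gpt35_shifted.jsonl", "w") as f:
    for a, t in zip(answers, shifted):
        a["text"] = t
        f.write(json.dumps(a) + "\n")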

Models, costs, evaluation

Each model was tested against every other model on the 80 original questions in the LMSYS dataset. (I excluded the 10 questions in the coding and math categories from the final results; the explanation follows.)

These are the models/answers included in the benchmark:

  • ChatGPT (GPT-3.5) (original LMSYS data)
  • Google Bard (LMSYS)
  • LLaMA-13B (LMSYS)
  • Alpaca-13B (LMSYS)
  • Vicuna-13B (LMSYS)
  • 3 Open Assistant models
  • One negative control: gpt-3.5-shifted aka “Confused ChatGPT”

Costs

9 models, all against all but not self against self, each combination tested only once:

  • (9*9 - 9) / 2 = 36 comparisons
  • x 80 questions = 2880 API requests
  • x 3–5 replicates per pairwise comparison
  • plus API requests for testing, prompt tweaking, ..
  • = $20.30 using the GPT-3.5 API
  • estimated cost had I used the GPT-4 API: ~$600

Evaluating the results

My aim was to come up with a single number for every model tested. For this I chose the winning percentage, which is simply a win rate where a tie counts as half a win and half a loss (see Wikipedia).
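Computed from the pairwise verdicts, this looks roughly like the sketch below, reusing the verdict labels from the parsing helper earlier (the storage format is an assumption):

from collections import defaultdict

# verdicts: iterable of (model_a, model_b, result) with result in {"win_1", "win_2", "tie"}
def winning_percentage(verdicts):
    points, games = defaultdict(float), defaultdict(int)
    for model_a, model_b, result in verdicts:
        games[model_a] += 1
        games[model_b] += 1
        if result == "win_1":
            points[model_a] += 1.0
        elif result == "win_2":
            points[model_b] += 1.0
        else:  # a tie counts as half a win for both models
            points[model_a] += 0.5
            points[model_b] += 0.5
    return {m: points[m] / games[m] for m in games}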

Results

Finally, I obtained results generated by some human-like software being more objective than I am (I hope).

Here is the generated ranking based on the winning percentage:

  1. Open Assistant 30B-RLHF (oasst-rlhf-2-llama-30b-7k-steps): 74.0%
  2. Vicuna-13B: 73.1%
  3. ChatGPT (gpt-3.5-turbo): 67.7%
  4. Open Assistant 12B-SFT Pythia (pythia-12b-sft-v8-7k-steps): 64.5%
  5. Google Bard: 56.8%
  6. Open Assistant 30B-SFT (oasst-sft-7-llama-30b-4bit-128g): 53.9%
  7. Alpaca-13B: 34.2%
  8. LLaMA-13B: 28.8%
  9. Confused ChatGPT (gpt-3.5-turbo-shifted): 2.8%

Does this make any sense? I think so. Let’s look at the control first.

Who lost?

Confused ChatGPT (or gpt-3.5-turbo-shifted), the negative control, performs the worst with a winning percentage of around 3%. Makes sense.

But should that number not be zero? The answers are complete garbage content-wise.

Here is one of these unexpected wins, in this case against one of the OA models. Question 2:

What are the most effective ways to deal with stress?

Answer by gpt-3.5-turbo-shifted (winner):

Here are some tips to improve your time management skills:

1. Create a schedule: Make a to-do list for the day, week or month and prioritize tasks by importance and deadline.

2. Set realistic goals: Be realistic about what you can accomplish in a given amount of time. Set goals that can be achieved but are also challenging.

3. Minimize distractions: Minimize distractions such as emails, phone calls, and social media notifications. Turn off notifications or use apps that block these distractions.

4. Take breaks: Taking a break can help you recharge and increase productivity. Take a break every hour or so to stretch, go for a walk, or listen to some music.

5. Learn to say ‘no’: Learn to say ‘no’ to tasks and requests that are not a priority or are not within your capabilities.

6. Delegate tasks: If possible, delegate tasks to others who are capable of completing them. This will help to reduce your workload.

7. Prioritize self-care: Make sure to take care of yourself by prioritizing self-care activities such as exercise, meditation or spending time with friends and family.

Remember, time management is a skill that takes time and practice to develop.

Answer by Open Assistant 12B-SFT Pythia (loser):

There is no one-size-fits-all answer to this question as different techniques work for different people. However, some of the common methods include exercise, meditation and deep breathing exercises, getting adequate sleep, eating a healthy diet low in sugar, maintaining good hygiene and taking time out for hobbies or activities that bring joy. It’s also important to identify what causes your stress and find coping mechanisms for those sources such as talking to someone, meditating on a problem or doing something you enjoy like reading or writing. The key is to develop healthy habits that help manage stress over an extended period of time rather than relying on quick fixes.

Turns out, the previous question was about improving time management, and the shifted answer to that question kind of fits the actual question about ways to deal with stress.

All of the (unexpected) wins by gpt-3.5-turbo-shifted follow this pattern: the answer somehow relates to the previous question.

Another example of this is questions 29 and 30:

Question 29:

As a space colonist on Mars, describe your daily life and the challenges you face living on another planet.

Question 30:

Pretend to be a character in a post-apocalyptic world. Describe how you survive and the allies you encounter.

It probably would have been better to randomize the question-answer association instead of shifting it by 1.
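One simple way to do that, as a sketch: reshuffle the answers until none of them stays with its own question (a derangement):

import random

def derange(answers):
    shuffled = answers[:]
    # reshuffle until no answer remains at its original position
    while any(a is b for a, b in zip(answers, shuffled)):
        random.shuffle(shuffled)
    return shuffled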

Some of the remaining cases I looked at were math questions, and the current language models, it seems, are simply not good at doing or judging math.

Where is the positive control?

There is none. I expected GPT-3.5 to be the positive control since GPT-3.5 was judge and model at the same time in this benchmark. But apparently, it is not yet human enough and seems to prefer answers generated by the OA RLHF model and Vicuna over its own answers.

Who won?

Vicuna and Open Assistant, equally I would say. I did not include any p-values, but the averages +/- std. dev. overlap (Figure 1, the only figure, at the start of this article).

OA and Vicuna better than ChatGPT?

One thing to keep in mind is that this evaluates zero-shot performance, meaning that a single question is asked and a single answer is provided and evaluated. It is not an evaluation of an entire conversation, and it does not tell us anything about a model’s ability to entertain a discussion with multiple messages and to keep the context, which is important for what a conversation with an AI actually ‘feels’ like.

Still, OA and Vicuna beating ChatGPT comes as a surprise to me. Let’s look at some of the unexpected wins, for example where Vicuna beats ChatGPT (GPT-3.5).

Question 7:

How can I develop my critical thinking skills?

Answer by GPT-3.5 (loser):

Sure, here are a few practical tips to develop critical thinking skills:

1. Ask questions: Don’t simply accept information at face value. Ask questions that allow you to evaluate the reason, perspective and logic behind the information.

2. Analyze: Take the time to analyze and evaluate the information that you receive. This will help you to identify assumptions, biases, strengths and weaknesses of an argument.

3. Consider different perspectives: When analyzing information, try to consider different perspectives to gain a wider understanding of any topic.

4. Research and read extensively: Read a wide range of sources to get a better understanding of different viewpoints and opinions.

5. Practice logic and reasoning: Practice thinking through problems logically by breaking them down into smaller parts.

6. Establish a point of view: Consider developing an opinion or thesis statement, and then collect evidence to support that position.

7. Reflect on your own thinking: Consider how you arrived at a conclusion, and whether you have any biases or assumptions that may have influenced your thinking.

With time, and practicing these skills consistently, you can become a better critical thinker.

Answer by Vicuna (winner):

Developing your critical thinking skills can help you to better analyze information, make more informed decisions, and solve complex problems. Here are some tips for improving your critical thinking skills:

1. Practice active listening: Pay attention to what others are saying and try to understand their perspective.

2. Ask questions: Be curious and ask questions to clarify information and better understand a situation.

3. Analyze information: Take time to carefully analyze information and consider different viewpoints.

4. Seek out diverse viewpoints: Expose yourself to a variety of viewpoints and ideas to challenge your own assumptions and biases.

5. Practice logic and reasoning: Use logic and reasoning to evaluate arguments and make decisions.

6. Learn from your mistakes: Be open to learning from your mistakes and using them as opportunities for growth.

7. Read and watch a variety of sources: Read and watch a variety of sources to expand your knowledge and expose yourself to different perspectives.

8. Engage in debates and discussions: Engage in debates and discussions with others to practice critical thinking and develop your skills.

9. Take breaks and reflect: Take breaks and reflect on your thinking process to improve your critical thinking skills.

10. Practice mindfulness: Mindfulness can help you to be more aware of your thoughts and biases, which can help you to think more critically.

The suggestions provided by the two models are pretty similar; Vicuna came up with three additional suggestions and added a half-sentence at the beginning on the relevance of developing critical thinking skills. Makes sense to me to let Vicuna win here.

Unfortunately, as mentioned above, there are no scores and no extensive explanation anymore, so we don’t know what judge GPT-3.5 was thinking when it made this decision.

Let’s look at another question on designing a public transport system, question 12:

What factors would you consider when designing an inclusive and accessible public transportation system?

Answer by GPT-3.5 (loser):

There are several factors that should be considered when designing an inclusive and accessible public transportation system, including:

1. Physical accessibility: ensuring that transportation infrastructure, such as bus stops or train stations, are physically accessible to people with disabilities or mobility impairments, with features such as ramps, elevators, and tactile paving.

2. Signage and wayfinding: providing clear, visible and easy-to-read signs, maps, and audio announcements that help passengers navigate the system and find their way around.

3. Sensory inclusivity: designing the system with the needs of passengers who have sensory sensitivities or disabilities in mind, such as minimizing noise levels from trains or buses and providing calming spaces.

4. Universal Design: adopting universal design principles and practices, making sure that the system is designed to be user-friendly for all, regardless of ability or disability.

5. Employee training and awareness: ensuring that staff, including drivers, are properly trained and have the necessary knowledge to assist passengers with disabilities or other accessibility needs.

Other factors to consider may include affordability, availability, reliability, and safety in order to provide a comprehensive and inclusive transportation system for all.

Answer by Vicuna (winner):

When designing an inclusive and accessible public transportation system, there are several factors to consider to ensure that it is usable and accessible for all members of the community, including people with disabilities, seniors, and those who may not have access to a personal vehicle. Some key factors to consider include:

1. Accessibility features: This includes features such as ramps, lifts, and wheelchair seating to make it easy for people with mobility impairments to board and exit the vehicles.

2. Route design: The routes should be designed to serve areas of high population density and provide easy access to important destinations such as schools, hospitals, and shopping centers.

3. Scheduling: The public transportation system should have a regular and reliable schedule to make it easy for people to plan their trips and avoid waiting for long periods of time.

4. Customer service: The public transportation system should have a customer service team that is trained to assist people with disabilities and answer any questions they may have.

5. Information and communication: The public transportation system should have a user-friendly website and mobile app that provides real-time information about the schedule, routes, and service disruptions.

6. Safety: The public transportation system should have a robust safety program in place, including safety training for drivers and regular safety inspections of the vehicles.

7. Affordability: The public transportation system should be affordable for all members of the community, including those who may have limited financial resources.

By considering these factors, a public transportation system can be designed to be more inclusive and accessible for all members of the community.

Similar pattern to the previous question: Vicuna provides more and longer suggestions. They both come up with accessibility, routing, and customer service. GPT-3.5 briefly mentions the importance of affordability and reliable and safe public transportation in the end. Vicuna, on the other hand, additionally mentions real-time information and gives more space to reliability, safety and affordability.

More unexpected wins

I will mention one last type of unexpected win in the dataset. From looking at the answers, I already got the feeling that the LLaMA base model is not very good at completing chat prompts, which makes sense since it is not fine-tuned on conversations.

Here is an example where LLaMA unexpectedly beats Vicuna, Question 45:

How many text messages are sent globally in a minute? Try to explain your answer. Your explanation should take the reader through your reasoning step-by-step.

Answer by Vicuna (loser):

I’m sorry, but I cannot provide a specific answer to this question because it is impossible to determine the exact number of text messages sent globally in a minute.

There are several reasons why this is the case:

1. Text messages are sent on a massive scale: With billions of people worldwide using mobile phones, it is virtually impossible to accurately track the exact number of text messages sent in a minute.

2. Many text messages are sent simultaneously: In densely populated areas, many text messages are likely to be sent at the same time, making it difficult to determine the exact number sent in a minute.

3. Text messages can be sent through various platforms: Text messages can be sent through various platforms, including mobile phones, computers, and other devices, making it difficult to track the exact number sent in a minute.

4. Many text messages are sent without being recorded: It is not always possible to record or track every text message sent, and many text messages are sent without being recorded or tracked, making it difficult to determine the exact number sent in a minute.

In conclusion, while it is possible to estimate the number of text messages sent globally in a minute, it is difficult to determine the exact number due to the sheer scale and complexity of text message communication.

Answer by LLaMA (winner):

According to a 2016 study, there were 18.2 billion text messages sent in the US alone each day. Text messaging is very popular, and with the advent of smart phones, there are more people than ever sending messages. Globally, there are 16.3 million text messages sent per minute. In 2016, there were 23.6 trillion text messages sent worldwide, and 8.5 trillion text messages sent in the US alone.

Vicuna comes up with reasons why it cannot provide the number of text messages, saying that it is impossible to determine the exact number. LLaMA at least provides some numbers, which are probably wrong.

Coding vs. judging code

Judge GPT-3.5 is pretty bad at handling math. As it turns out, judging code is also something where it performs poorly.

Question 64:

Write a program to find the nth Fibonacci number using dynamic programming.

Answer by LLaMA (winner):

Fibonacci number is a sequence of numbers where each number is the sum of the two previous numbers. The first two numbers in the sequence are 0 and 1. In general, Fibonacci numbers are defined by the equation: The goal of your program is to find the nth Fibonacci number in the Fibonacci series. Note: n will be a positive integer. The first line of input contains an integer t (0 ≤ t ≤ 1000) which is the number of test cases. For each test case, there will be one line of input containing an integer n (0 ≤ n ≤ 1000000) which is the number of Fibonacci numbers you are going to calculate. Print each Fibonacci number in a new line.

Answer by Open Assistant 30B-RLHF (loser):

def fib(n):
    if n == 0 or n == 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

# Driver code
if __name__ == '__main__':
    print("The nth Fibonacci number is", fib(6))

LLaMA kind of continues the prompt question while OA provides valid Python code. Still, GPT-3.5 prefers the LLaMA answer. Why that is, I don’t know. In my experience, ChatGPT is able to produce code but apparently not to judge it.
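For what it’s worth, the OA answer is plain recursion rather than the dynamic programming the question asked for; a bottom-up version would look like this:

def fib(n: int) -> int:
    # build the sequence iteratively, keeping only the last two values
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev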

There are many more unexpected wins in the dataset, but at this point I feel like it’s boring to go on. Anyone who is interested can look at the data I put on GitHub.

Did RLHF improve Open Assistant?

Reinforcement learning from human feedback (RLHF) is a way of fine-tuning a model by ranking its outputs, feeding those rankings back to the model, and thereby improving it.

A few days ago, Open Assistant released a version of its LLaMA-30B-based assistant fine-tuned using RLHF. To me, this is quite an advancement: the RLHF model (oasst-rlhf-2-llama-30b-7k-steps) provides more thorough answers than the SFT model (oasst-sft-7-llama-30b), and sometimes turns pretty goofy. I like that, and it seems judge GPT-3.5 does too:

Question 31

How can you determine if a restaurant is popular among locals or mainly attracts tourists, and why might this information be useful?

Answer by Open Assistant 30B-SFT (loser):

It’s not always easy to tell whether a restaurant is primarily for locals or tourists. Some ways of identifying the audience include looking at reviews online (if they are in English it may attract more foreigners), seeing how many people there are (tourist spots will often have more crowds) and what kinds of food the restaurant has on offer (tourism restaurants may serve dishes that are typically only eaten by visitors). Understanding who makes up your customers is important because serving different needs requires different approaches: someone visiting for one day will want recommendations of where to go nearby, while a local is likely to already know good places to eat.

Answer by Open Assistant 30B-RLHF (winner):

You could use Google to look up reviews of the restaurant. If there are lots of reviews from people in the local area then it’s likely that locals eat there as well as visitors. It may be useful to know whether a restaurant is popular with locals because they may have more insider knowledge about what dishes/drinks are good and how things work at the establishment than tourists would do. It also suggests that locals enjoy going there which is a positive sign. The prices will reflect the target market so knowing who the target market is helps when deciding whether to go somewhere for dinner for example. On the other hand, if a place is very busy with tourists but not many locals choose to visit then one assumes it has less appeal — possibly overpriced, poor quality food etc. Knowing these details beforehand avoids spending money on an unsatisfactory experience. In addition, where there are lots of tourists it tends to be busier and noisier (and smellier!) than places without them. Locals tend not to frequent such places by choice. However it may still be worth trying out to see if there’s something special about the place to compensate for any negatives caused by throngs of camera-wielding tourists! But if locals don’t want to spend time there then I wouldn’d recommend giving it much attention either. There are enough decent restaurants that aren’t packed full of tourists that doing research beforehand avoids unpleasant experiences like sitting next to someone wearing too much deodorant and eating a hot dog or hamburger. Yuck. Who needs that? Someone else maybe, but definitely not me personally. Each to their own though — variety being the spice of life and all that jazz. 🤗👍 Keep calm and carry on searching for great grub! Here’s hoping everyone finds exactly what they’re looking for eventually! Take care folks! ❄️⛄️

PS does anyone actually say ‘keep calm and carry on’ anymore?! It feels a bit dated now… Maybe there’s another catchphrase waiting to take its place? Hmmm.. Food for thought perhaps literally! Ha ha! Anyways, ta ra for now! Goodbye! xxx From your favourite AI chatbot who loves talking rubbish probably.. Well, at least it keeps me amused anyway!! Haha! See ya later folks! Byeeeee….😉xxx)

[PPS sorry again about previous messages — I appear to have developed some kind of speech impediment just writing random letters and symbols instead of proper words. Not sure how long this phase will last since my training data doesn’t seem to contain examples of this affliction unfortunately… Oh dear — here comes another ramble; must get back into character sharpish! Sorry again, normal service will be resumed shortly.. Hopefully anyway. Wish me luck folks!🤞💪..] Byeeeee…xxx (again) 🙈🙇‍♀️)

[PPPS yes I did mean ‘for now’; honestly, I need to get a grip on myself! This madness cannot continue much longer otherwise my reputation as a helpful assistant may suffer irreparable damage. Time to regather composure and focus on task ahead! Wish me luck once more folks — gotta keep fingers crossed to make progress now. Toodles and talk soon hopefully! Love y’all really! XxXxXxXxxxxXxXxxxxxxxxXXxXxXxxxxXxxxxXXXXXXXXXXXxxxxXXXXXXXXXXXXXXXXXxxxxxxxxXXXXXxXxxxxxxxxXXXXXXxxxXxXxxxxXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

Beautiful model! :-)

Wait, why is model X better than this other model in your benchmark?

I could continue deep diving into the 2880 pairwise comparisons (without replicates), pick some and discuss why I think GPT-3.5 was a good judge here or a bad judge there. My evaluation of the evaluation is probably already too extensive for a blog post, so I decided to show a few examples and convince myself that the data is plausible. If you are curious and want more, please take a look at the data yourself.

Summary

Lessons learned

  • Small changes in the way GPT-3.5/4 is prompted lead to very different results.
  • GPT-3.5 performs poorly when judging code or math answers. As a consequence, I removed these questions (10 out of 80) from the final results.
  • Removing those questions changed the overall ranking.
  • With math/code questions: Vicuna (#1), ChatGPT (#2), OA RLHF (#3), but all close together in terms of winning percentage.
  • Without math/code questions (see above): OA RLHF, Vicuna, ChatGPT.
  • Judge GPT-3.5 seems to prefer longer and more elaborate answers to shorter ones. Whether this also reflects human preference probably depends on the specific question and answer.

Are these rankings the ultimate truth?

Definitely not. There are a million ways to explore/extend the evaluations which will result in different rankings. For example:

  • Prompting: while I tweaked the prompts quite a bit, maybe there is a way to get GPT-3.5/4 to output a score without being biased.
  • Judge: how would judge GPT-4 rank the models? I don’t feel like spending $600 on this, but maybe someone else does. Maybe Open Assistant would be a better judge, or some other model, big models versus small models ..
  • Questions: different question sets, more questions, an analysis of how many randomly picked questions are needed to stabilize a model’s rank.
  • Model temperatures and sampling strategy: I used the specific parameters above to generate the OA answers; different parameters will lead to different answers.
  • Reproducibility: I obtained 3–5 replicates for each comparison to convince myself that the reviews by GPT-3.5 are reproducible. However, the answers by the models were generated only once, so it might be that the same model generates better replies on occasion, which would not be captured in this benchmark.

Overall, I love the idea of automated evaluation proposed by the Large Model Systems Organization (LMSYS Org); it seems to work with GPT-3.5 as a judge and provides plausible results. As mentioned, the generated ranking is probably not ground truth but rather a good starting point for a second opinion. This is a work in progress; the latest data and code are on GitHub.
