Exploring ChatGPT vs open-source models on slightly harder tasks

Marco Tulio Ribeiro
17 min read · May 12, 2023


All images were generated by Marco and Scott.

Open-source LLMs like Vicuna and MPT-7B-Chat are popping up all over the place, which has led to much discussion on how these models compare to commercial LLMs (like ChatGPT or Bard).

Most of the comparisons have focused on answers to simple, single-turn questions or instructions. For example, the folks at LMSYS Org did an interesting analysis (+1 for being automated and reproducible) comparing Vicuna-13B to ChatGPT on various short questions, which is great as a comparison of the models as simple chatbots. However, many interesting ways of using LLMs require complex instructions and/or multi-turn conversations, plus some prompt engineering. We think that in the ‘real world’, most people will want to compare different LLM offerings on their own problem, with a variety of different prompts.

This blog post (written jointly with Scott Lundberg) is an example of what such an exploration might look like with guidance, an open-source project that helps users control LLMs. We compare two open source models (Vicuna-13B, MPT-7b-Chat) with ChatGPT (3.5) on tasks of varying complexity.

Warmup: Solving equations

By way of warmup, let’s start with the toy task of solving simple polynomial equations, where we can check the output for correctness and shouldn’t need much prompt engineering. This will be similar to the Math category in the analysis above, with the difference that we evaluate models as correct / incorrect against the ground truth, rather than using GPT-4 to rate the output.

Quick digression on chat syntax: each of these models has its own chat syntax, with special tokens separating utterances. Here is how the same conversation would look in Vicuna and MPT (where [generated response] is where the model would generate its output):

Vicuna:

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.  
USER: Can you please solve the following equation? x^2 + 2x + 1 = 0
ASSISTANT: [generated response] </s>

MPT:

<|im_start|>system
- You are a helpful assistant chatbot trained by MosaicML.
- You answer questions.
- You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
- You are more than just an information source, you are also able to write poetry, short stories, and make jokes.
<|im_end|>
<|im_start|>user Can you please solve the following equation? x^2 + 2x + 1 = 0<|im_end|>
<|im_start|>assistant [generated response]<|im_end|>

To avoid the tedium of translating between these, guidance supports a unified chat syntax that gets translated to the model-specific syntax when calling the model.
Here is the prompt we'll use for all models (note how we use {{system}}, {{user}} and {{assistant}} tags rather than model-specific separators):

find_roots = guidance('''
{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please find the roots of the following equation: {{equation}}
Think step by step, find the roots, and then say:
ROOTS = [root1, root2...]
For example, if the roots are 1.3 and 2.2, say ROOTS = [1.3, 2.2].
Make sure to use real numbers, not fractions.
{{~/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

We then load the models. Note that we use each model’s default system message in the prompt above (via {{llm.default_system_prompt}}).

import guidance

mpt = guidance.llms.transformers.MPTChat('mosaicml/mpt-7b-chat', device=1)
vicuna = guidance.llms.transformers.Vicuna('yourpath/vicuna-13b', device_map='auto')
chatgpt = guidance.llms.OpenAI("gpt-3.5-turbo")

Let’s try these prompts on a very simple example.
Here is ChatGPT:

equation = 'x^2 + 3.0x = 0'
roots = [0, -3]
answer_gpt = find_roots(llm=chatgpt, equation=equation)

Vicuna (we omit the system and user part from now on):

answer_vicuna = find_roots(llm=vicuna, equation=equation)

MPT:

answer_mpt = find_roots(llm=mpt, equation=equation)

The answer was [-3, 0], and thus only ChatGPT got it right (Vicuna didn’t even follow the specified format).

In the notebook accompanying this post, we write a function to generate random quadratic equations with integer roots between -20 and 20, and run the prompt 20 times with each model.
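
For reference, a minimal sketch of that generation-and-scoring loop might look like the following (the notebook’s version differs in its details; random_quadratic and parse_roots are illustrative helpers, not the exact code we used):

import random
import re

def random_quadratic():
    # Pick two integer roots in [-20, 20] and expand (x - r1)(x - r2) into x^2 + bx + c = 0.
    # (Signs are left unsimplified in this sketch, e.g. 'x^2 + -3x + 2 = 0'.)
    r1, r2 = random.randint(-20, 20), random.randint(-20, 20)
    b, c = -(r1 + r2), r1 * r2
    return f'x^2 + {b}x + {c} = 0', sorted([float(r1), float(r2)])

def parse_roots(answer):
    # Look for the 'ROOTS = [...]' line the prompt asks for; return None if missing or malformed.
    match = re.search(r'ROOTS\s*=\s*\[(.*?)\]', answer)
    if match is None:
        return None
    try:
        return sorted(float(x) for x in match.group(1).split(','))
    except ValueError:
        return None

correct = 0
for _ in range(20):
    equation, roots = random_quadratic()
    out = find_roots(llm=chatgpt, equation=equation)  # repeat with llm=vicuna and llm=mpt
    correct += parse_roots(out['answer']) == roots
print(f'Accuracy: {correct / 20:.0%}')

The results were as follows: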

╔═══════════╦══════════╗
║ Model     ║ Accuracy ║
╠═══════════╬══════════╣
║ ChatGPT   ║ 80%      ║
║ Vicuna    ║ 0%       ║
║ MPT       ║ 0%       ║
╚═══════════╩══════════╝

While ChatGPT made a few mistakes, Vicuna and MPT did not solve a single quadratic equation correctly, often making mistakes in intermediate steps (MPT typically did not even write intermediate steps). Here is an example of a ChatGPT mistake:

ChatGPT makes a calculation error in the last step, where (13 ± 25) / 2 should yield [19, -6] rather than [19.5, -6.5].
Now, since Vicuna and MPT failed on quadratic equations, we look at even simpler equations, such as x - 10 = 0. For these equations, we get these numbers:

╔═══════════╦══════════╗
║ Model     ║ Accuracy ║
╠═══════════╬══════════╣
║ ChatGPT   ║ 100%     ║
║ Vicuna    ║ 92.8%    ║
║ MPT       ║ 23.0%    ║
╚═══════════╩══════════╝

Here is an example of a mistake from MPT:

Discussion

This was a very toy task, but it served as an example of how to compare models with different chat syntaxes using the same prompt. For this particular task / prompt combination, ChatGPT far surpasses Vicuna and MPT in terms of accuracy (measured against ground truth).

Task: extracting snippets + answering questions about meetings

We now turn to a more realistic task, where evaluating accuracy is not as straightforward. Let’s say we want our LLM to answer questions (with the relevant conversation segments for grounding) about meeting transcripts.
This is an application where some users might prefer to use open-source LLMs rather than commercial ones, for privacy reasons (e.g. some companies might not want to send their meeting data to OpenAI).

Here is a toy meeting transcript to start with:

Meeting Transcript:
John: Alright, so we’re all here to discuss the offer we received from Microsoft to buy our startup. What are your thoughts on this?
Lucy: Well, I think it’s a great opportunity for us. Microsoft is a huge company with a lot of resources, and they could really help us take our product to the next level.
Steven: I agree with Lucy. Microsoft has a lot of experience in the tech industry, and they could provide us with the support we need to grow our business.
John: I see your point, but I’m a little hesitant about selling our startup. We’ve put a lot of time and effort into building this company, and I’m not sure if I’m ready to let it go just yet.
Lucy: I understand where you’re coming from, John, but we have to think about the future of our company. If we sell to Microsoft, we’ll have access to their resources and expertise, which could help us grow our business even more.
Steven: Right, and let’s not forget about the financial benefits. Microsoft is offering us a lot of money for our startup, which could help us invest in new projects and expand our team.
John: I see your point, but I still have some reservations. What if Microsoft changes our product or our company culture? What if we lose control over our own business?
Steven: You know what, I hadn’t thought about this before, but maybe John is right. It would be a shame if our culture changed.
Lucy: Those are valid concerns, but we can negotiate the terms of the deal to ensure that we retain some control over our company. And as for the product and culture, we can work with Microsoft to make sure that our vision is still intact.
John: But won’t we change just by virtue of being absorbed into a big company? I mean, we’re a small startup with a very specific culture. Microsoft is a huge corporation with a very different culture. I’m not sure if the two can coexist.
Steven: But John, didn’t we always plan on being acquired? Won’t this be a problem whenever?
Lucy: Right
John: I just don’t want to lose what we’ve built here.
Steven: I share this concern too

Let’s start by just trying to get ChatGPT to solve the task for us. We’ll test it on the question ‘How does Steven feel about selling?’. Here is a first attempt at a prompt:

qa_attempt1 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Please answer the following question:
Question: {{query}}
Extract from the transcript the most relevant segments for the answer, and then answer the question.
{{/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

While the response is plausible, ChatGPT did not extract any conversation segments to ground the answer (and thus fails our specification). We actually iterate through 5 different prompts in the notebook, but we’ll only show a couple here as examples, for the sake of discussion.
Here is prompt iteration #3:

qa_attempt3 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript.
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')

ChatGPT did extract relevant segments, but it did not follow our output format (it did not summarize each segment, nor did it include the participants’ names). After a couple more iterations, here is prompt iteration #5, where we place the one-shot example as a separate conversation round and create a fake meeting transcript for it. That finally does the trick:

qa_attempt5 = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: What were the main things that happened in the meeting?
Here is a meeting transcript:
----
Peter: Hey
John: Hey
Peter: John, how is the weather today?
John: It's raining.
Peter: That's too bad. I was hoping to go for a walk later.
John: Yeah, it's a shame.
Peter: John, you are a bad person.
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.
{{/user}}
{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{~/assistant~}}
{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
Here is a meeting transcript:
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract from the transcript whichever conversation segments are most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns.
Please extract at most 3 segments. If you need less than three segments, you can leave the rest blank.
{{~/user}}

{{#assistant~}}
{{gen 'answer'}}
{{~/assistant~}}''')
qa_attempt5(llm=chatgpt, transcript=meeting_transcript, query=query1)

The reason we needed five (!) prompt iterations is that the OpenAI API does not allow us to do partial output completion yet (i.e. we can’t specify how the assistant begins to answer), and thus it’s hard for us to guide the output.
If, instead, we use one of the open source models, we can guide the output more clearly, forcing the model to use our structure.
For example, here is how we might modify qa_attempt3 so that the output format is specified:

qa_guided = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will read a meeting transcript, then extract the relevant segments to answer the following question:
Question: {{query}}
----
{{transcript}}
----
Based on the above, please answer the following question:
Question: {{query}}
Please extract the three segments from the transcript that are the most relevant for the answer, and then answer the question.
Note that conversation segments can be of any length, e.g. including multiple conversation turns. If you need less than three segments, you can leave the rest blank.

As an example of output format, here is a fictitious answer to a question about another meeting transcript:
CONVERSATION SEGMENTS:
Segment 1: Peter and John discuss the weather.
Peter: John, how is the weather today?
John: It's raining.
Segment 2: Peter insults John
Peter: John, you are a bad person.
Segment 3: Blank
ANSWER: Peter and John discussed the weather and Peter insulted John.
{{/user}}

{{#assistant~}}
CONVERSATION SEGMENTS:
Segment 1: {{gen 'segment1'}}
Segment 2: {{gen 'segment2'}}
Segment 3: {{gen 'segment3'}}
ANSWER: {{gen 'answer'}}
{{~/assistant~}}''')
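
To make the mechanics concrete, here is roughly how we run the guided program and read back each captured field (a sketch; each {{gen}} variable can be read off the executed program by name):

out = qa_guided(llm=vicuna, transcript=meeting_transcript, query=query1)
# Each {{gen}} in the assistant block is captured as a named variable on the executed program.
print(out['segment1'])
print(out['segment2'])
print(out['segment3'])
print(out['answer'])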

If we run this prompt with Vicuna, we get the right format the first time around (and all the time):

We can, of course, run the same prompt with MPT:

While MPT follows the format, it ignores the question and takes snippets from the format example rather than from the real transcript.
From now on, we’ll just compare ChatGPT and Vicuna.

Let’s try another question: “Who wants to sell the company?”

Here is ChatGPT:

Vicuna:

Both seem to work really well. Let’s switch the meeting transcript to the first few minutes of an interview with Elon Musk. The relevant portion for the question we’ll ask is:

Elon Musk: Then I say, sir, that you don’t know what you’re talking about.
Interviewer: Really?
Elon Musk: Yes. Because you can’t give a single example of hateful content. Not even one tweet. And yet you claimed that the hateful content was high. That’s false.
Interviewer: No. What I claimed-
Elon Musk: You just lied.

Then we ask the following question:
“Does Elon Musk insult the interviewer?”

ChatGPT:

Vicuna:

Vicuna has the right format and even the right segments, but it surprisingly generates a completely wrong answer, saying “Elon Musk does not accuse him of lying or insult him in any way”.

We tried a variety of other questions and conversations, and the overall pattern was that Vicuna was comparable to ChatGPT on most questions, but got the answer wrong more often than ChatGPT did.

Task: do things with bash

Now we try to get these LLMs to iteratively use a bash shell to solve individual problems. Whenever they issue a command, we run it and insert the output back into the prompt, until the task is solved.

Here is the ChatGPT prompt (notice that {{shell this.command}} calls a user-defined function with this.command as its argument):

terminal = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
{{~/assistant~}}

{{#user~}}
Output: guidance project
{{/user}}

{{#assistant~}}
The files or folders in the current directory are:
- guidance
- project
DONE
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can give me one bash command to run at a time, using the syntax:
COMMAND: command
I will run the commands on my terminal, and paste the output back to you. Once you are done with the task, please type DONE.
{{/user}}

{{#geneach 'commands' stop=False}}
{{#assistant~}}
{{gen 'this.command'}}
{{~/assistant~}}

{{~#user~}}
Output: {{shell this.command}}
{{~/user~}}
{{/geneach}}''')
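
The shell function in these prompts is a user-defined Python function that we pass in when executing the program. A minimal sketch (our notebook version handles a few more details) might be:

import subprocess

def shell(command):
    # Run the model's command in a shell and return the combined stdout/stderr,
    # which gets inserted back into the conversation as the 'Output:' turn.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()

# 'task' holds the instruction we want to carry out.
terminal(llm=chatgpt, task=task, shell=shell)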

We created a dummy repo in ~/work/project, with a file named license.txt (not the standard LICENSE file name).
Without communicating this to ChatGPT, let’s see if it can figure it out when told to ‘Find out what license the open source project located in ~/work/project is using.’:

Indeed, ChatGPT follows a very natural sequence, and solves the task. It does not follow our instruction to say DONE, but we are able to stop the iteration automatically because it does not issue any COMMANDs.

For the open-source models, we write a simpler (guided) prompt with a sequence of command-output pairs:

guided_terminal = guidance('''{{#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
Please complete the following task:
Task: list the files in the current directory
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use the COMMAND: DONE.
{{/user}}

{{#assistant~}}
COMMAND: ls
OUTPUT: guidance project
COMMAND: DONE
{{~/assistant~}}

{{#user~}}
Please complete the following task:
Task: {{task}}
You can run bash commands using the syntax:
COMMAND: command
OUTPUT: output
Once you are done with the task, use the COMMAND: DONE.
{{~/user}}

{{#assistant~}}
{{#geneach 'commands' stop=False ~}}
COMMAND: {{gen 'this.command' stop='\\n'}}
OUTPUT: {{shell this.command}}{{~/geneach}}
{{~/assistant~}}''')

Here is Vicuna:

Here is MPT:

In an interesting turn of events, Vicuna is unable to solve the task, but MPT succeeds. Besides privacy (we’re not sending the session transcript to OpenAI), open-source models have a significant advantage here: the whole program executes as a single LLM run (and we even accelerate it by not having the model generate the output structure tokens like COMMAND:).
In contrast, we have to make a new call to ChatGPT for each command, which is slower and more expensive.

Now we try a different command: “Find all jupyter notebook files in ~/work/guidance that are currently untracked by git”.

Here is ChatGPT:

Once again, we run into a problem with ChatGPT not following our specified output structure (which makes it impossible to use inside a program without a human in the loop). Our program just executes commands, and thus it stopped after the last ChatGPT message above.

We suspected that the empty output threw ChatGPT off. We can’t fix the general problem of not being able to force ChatGPT to follow our specified output structure, but we did fix this particular failure by changing the message returned when there is no output.
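
The fix can be as small as substituting a placeholder message when a command produces nothing; for example (our exact wording may differ):

import subprocess

def shell(command):
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    # An explicit message instead of an empty string keeps ChatGPT from going off-script.
    return output if output else 'The command ran successfully and produced no output.'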

ChatGPT was able to solve the problem after this small modification. Let’s see how Vicuna does:

Vicuna follows our output structure, but unfortunately runs the wrong command to do the task. MPT (not shown) calls git status repeatedly, so it also fails.

We ran these programs with various other instructions, and found that ChatGPT almost always produces the correct sequence of commands, while sometimes not following the specified format (and thus needing human intervention). The open-source models didn’t work as well (we could probably improve them with more prompt engineering, but they failed on most of the harder instructions).

Takeaways

In addition to the examples above, we tried various inputs for both tasks (question answering and bash). We also tried a variety of other tasks involving summarization, question answering, “creative” generation, and toy string manipulation tasks where we can evaluate accuracy automatically.
Here is a summary of our findings:

  • Quality on task: For every task we tried, ChatGPT (3.5) is still stronger than Vicuna on the task itself. MPT performed poorly on almost all tasks (perhaps we are using it wrong?), while Vicuna was often close to ChatGPT (sometimes very close, sometimes much worse as in the last example task above).
  • Ease of use: It is much more painful to get ChatGPT to follow a specified output format, and thus it is harder to use inside a program (without a human in the loop). Further, we always have to write regex parsers for its output (as opposed to Vicuna, where parsing the output of a prompt with clear structure is trivial).
    We are typically able to solve the structure problem by adding more few-shot examples, but writing them is tedious, and sometimes ChatGPT goes off-script anyway. We also end up with prompts that are longer, clumsier, and uglier, which is unsatisfying.
    Being able to specify the output structure is a significant benefit of open-source models, to the point that we might sometimes prefer Vicuna over ChatGPT even when it is a little worse on the task itself.
  • Efficiency: having the model locally means we can solve tasks in a single LLM run (guidance keeps the LLM state while the program is executing), which is faster and cheaper. This is particularly true when any substeps involve calling other APIs or functions (e.g. search, terminal, etc), which always requires a new call to the OpenAI API. guidance also accelerates generation by not having the model generate the output structure tokens, which sometimes makes a big difference.

In summary, our preliminary assessment is that MPT is not ready for real-world use yet (unless we’re using it wrong), and that Vicuna is a viable (weaker) alternative to ChatGPT (3.5) for many tasks — in part due to the ability to specify the output structure. Now, it may be that these findings don’t generalize, and are instead specific to the tasks and inputs we tried (or to the kinds of prompts we tend to write). We acknowledge that this is just preliminary exploration, not an attempt at formal evaluation.
However, we think that anyone who tries to use LLMs for real-world tasks will start with something like this to figure out which LLM is stronger for their use case / preferred prompt style (in addition to considerations of cost, privacy, model versioning, etc).

We should acknowledge that we are biased by having used OpenAI models a lot in the past few years, having written various papers that depend on GPT-3 (e.g. here, here), and a paper that is basically saying “GPT-4 is awesome, here are a bunch of cool examples”.
Speaking of which, while Vicuna is somewhat comparable to ChatGPT (3.5), we believe GPT-4 is a much stronger model, and are excited to see if open source models can approach that. While guidance plays quite well with OpenAI models, it really shines when you can specify the output structure and accelerate generation.

Again, we are clearly biased, but we think that guidance is a great way to use these models, whether with APIs (OpenAI, Azure) or locally (huggingface). Here is a link to the jupyter notebook with code for all the examples above (and more).

Disclaimer: this post was written jointly by Marco Tulio Ribeiro and Scott Lundberg. It strictly represents our personal opinions, and not those of our employer (Microsoft).

Acknowledgments: We are really thankful to Harsha Nori for insightful comments on this post.
