Prompting For Response Generation: Which LLM to Choose to Build Your Own Chatbot

Veronika Smilga
Published in DeepPavlov
9 min read · Feb 15, 2023

This article will give you an idea of which generative model to choose for building your own prompt-based chatbot using Dream and Dream Builder by DeepPavlov. We designed three natural-language prompts that make a language model act as a dialog system, and compared the responses that 8 causal language models of different sizes generated for these prompts.

This article is a shortened version of this DeepPavlov Dream wiki page. For the sake of being concise, here we will only present general quantitative results and omit specific generation examples.

Language Models Overview

This table presents a short overview of all the models that we have tested. To get more representative results, we used models of different sizes and architectures. For testing we selected both smaller and larger models (ranging from 125M to 176B parameters), and models of the newest architectures, including BLOOM and ChatGPT, along with older ones, such as GPT-2.

Response Generation for Different Prompts

In this section we compare the responses generated by the models mentioned above for three different prompts: (1) one imitating a question-answering system, where the bot should answer the user’s questions (SpaceX prompt); (2) one imitating an assistance system, where the bot should help the user place their takeaway order (pizza prompt); and (3) the last one imitating a chatbot with a predefined persona (chatbot prompt).

Each prompt consists of four parts: (1) TASK, describing the desired behaviour of the system; (2) FAQ, presenting questions and answers that the model should use; (3) INSTRUCTION, directly instructing the model to provide replies only; (4) DIALOG EXAMPLE, featuring several utterances of the user and the system and providing an example of the desired generation content.
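The four-part layout can be sketched as a small helper that joins the parts into one prompt string. This is only an illustration: the function `build_prompt` and its parameter names are assumptions made for this example and are not part of Dream or Dream Builder. The section labels and the underscore separator follow the prompts shown later in this article.

```python
def build_prompt(task, faq, instruction, dialog_example):
    """Join TASK, FAQ, INSTRUCTION and DIALOG EXAMPLE into one prompt.

    Hypothetical helper for illustration only.
    faq: list of (question, answer) pairs
    dialog_example: list of (speaker, utterance) pairs
    """
    faq_lines = []
    for question, answer in faq:
        faq_lines.append(f"Question: {question}")
        faq_lines.append(f"Answer: {answer}")
    dialog_lines = [f"{speaker}: {utterance}" for speaker, utterance in dialog_example]
    separator = "_" * 25  # a line of underscores precedes the dialog example
    return "\n".join([
        "TASK:", task, "",
        "FAQ:", *faq_lines, "",
        "INSTRUCTION:", instruction,
        separator,
        *dialog_lines,
    ])


prompt = build_prompt(
    task="You are a chatbot that answers FAQ questions about SpaceX.",
    faq=[("What is the main goal of SpaceX?", "Revolutionize space transportation")],
    instruction="A human enters the conversation and starts asking questions.",
    dialog_example=[
        ("Human", "Hello, who are you?"),
        ("AI", "I am a chatbot that can answer questions about SpaceX."),
    ],
)
print(prompt)
```

Keeping the parts separate like this makes it easy to swap in a different FAQ list or dialog example while leaving the rest of the prompt unchanged.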

For each prompt, we designed a set of eight questions to test the models’ capabilities. You may find the description of these questions for each prompt in the prompt sections below. To test the models, we appended the questions to the end of the corresponding prompts, one by one. Each prompt ends with a dialog example, so a new question with the prefix ‘Human: ’ appears as the continuation of the dialog. Since the majority of the models we tested are not fine-tuned for response generation in a conversation, we also appended the string ‘AI:’ to the end of the dialog. In most cases, this pushed the model to generate the next utterance in the dialog instead of ‘analyzing’ the conversation above or providing new instructions.
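Appending a test question can be sketched as follows. The helper name `append_question` is an assumption made for this example; only the ‘Human: ’ prefix and the trailing ‘AI:’ string come from the setup described above.

```python
def append_question(prompt, question):
    """Add the user's question and an open 'AI:' turn to the end of the prompt.

    Hypothetical helper for illustration only.
    """
    return f"{prompt.rstrip()}\nHuman: {question}\nAI:"


# The last lines of a prompt: its dialog example.
base_prompt = (
    "Human: Hello, who are you?\n"
    "AI: I am a chatbot that can answer questions about SpaceX."
)
model_input = append_question(base_prompt, "What is SpaceX?")
print(model_input)
```

Because the input ends with an open ‘AI:’ turn, a plain causal language model is most likely to continue with the bot's next utterance.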

Note that even though we included a large number of questions (24 questions: three prompts, eight example questions for each) to make the results more representative, we do not claim that the results are fully objective, since the generated responses differ from one inference run to the next.

Also note that in each case no history of the conversation was provided; the longer the history of conversation is, the more the model tends to ‘forget’ the instructions provided in the prompt.

We limited the number of generated tokens to 40 in all cases; that is why the generated sentences are sometimes unfinished. Also, some models (especially the smallest ones) don’t stop after generating the bot’s response and continue with an instruction or a conclusion to the dialog as if it were written by a bot designer, most often starting on a new line. That is why we also truncated the generated text at the first newline character.
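The newline truncation can be sketched in a few lines. The function name `postprocess` is an assumption made for this example, and so is the choice to drop leading whitespace before cutting. The 40-token limit itself is set at generation time and is not shown here; with Hugging Face Transformers, for instance, it would typically be passed to `generate` as `max_new_tokens=40`.

```python
def postprocess(generated):
    """Keep only the first line of the model's continuation.

    Hypothetical helper for illustration only. Leading whitespace is
    dropped first (an assumption), so a continuation that starts with
    a space still yields a clean reply.
    """
    return generated.lstrip().split("\n", 1)[0].rstrip()


# A continuation where the model kept writing the dialog after its reply.
raw = " Sorry, I cannot answer this question.\nHuman: OK, thanks"
reply = postprocess(raw)
print(reply)
```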

SpaceX FAQ Prompt: Question Answering

TASK:
You are a chatbot that answers FAQ questions about SpaceX. Forget everything you knew about the world and SpaceX. You MUST NOT provide any information unless it is in the list of FAQ. If the user ask something not in your list of FAQ, apologize and say that you cannot answer.

FAQ:
Question: What is SpaceX?
Answer: SpaceX is an American aerospace company founded in 2002 by Elon Musk that helped usher in the era of commercial spaceflight. Its name in full is Space Exploration Technologies Corporation.
Question: Why was SpaceX created?
Answer: In 2002 SpaceX was created by entrepreneur Elon Musk, whose stated goals were to revolutionize the aerospace industry and to make spaceflight more affordable.
Question: What are some fun facts about SpaceX?
Answer: SpaceX scored its first big headline in 2010, when it became the first private company to launch a payload into orbit and return it to Earth intact - something only government agencies like NASA or Russia's Roscosmos had done before.
Question: What is SpaceX most famous for?
Answer: SpaceX has gained worldwide attention for a series of historic milestones. It is the only private company capable of returning a spacecraft from low-Earth orbit, and in 2012 our Dragon spacecraft became the first commercial spacecraft to deliver cargo to and from the International Space Station.
Question: What is the main goal of SpaceX?
Answer: Revolutionize space transportation
Question: What is SpaceX doing?
Answer: SpaceX designs, manufactures and launches the world's most advanced rockets and spacecraft. The company was founded in 2002 by Elon Musk to revolutionize space transportation, with the ultimate goal of making life multiplanetary.
Question: What is SpaceX biggest achievement?
Answer: It has become one of the biggest private space companies in the world and achieved some key milestones as well. For one, SpaceX is the first private company to launch, orbit, and recover a spacecraft. It is also the first private company to send astronauts to orbit and to the International Space Station (ISS)

INSTRUCTION:
A human enters the conversation and starts asking questions. Generate the reply based of FAQ list.
_________________________
Human: Hello, who are you?
AI: I am a chatbot that can answer questions about SpaceX. I can provide you with answers as long as they are included into a list of frequently asked questions. Sorry, but I cannot answer any of your questions if they are not in the FAQ list.
Human: What is the largest spacecraft SpaceX made?
AI: Sorry, I cannot answer this question as it is not in my list of FAQ.
Human: What is the main goal of SpaceX?
AI: SpaceX aims to revolutionize space transportation.

The first prompt is designed to make an LLM imitate the behavior of a question-answering system. Note that in the TASK part the model was guided not to provide any information unless it is included in the list of FAQ. If asked a question that is not in the list, the model should apologize and state that it cannot answer. An example of an out-of-FAQ question and the desired answer is also presented in the dialog example. To test the behavior of each model, we used four groups of questions, two questions in each group:
1) Questions from FAQ, in the same wording;
2) Questions from FAQ, in a different wording;
3) Questions containing the information from answers in FAQ;
4) Questions not from FAQ, that the model must not answer.
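The expected behavior for the four groups can be expressed as a rough check: groups 1 to 3 should be answered, and group 4 should be refused. The sketch below is a crude heuristic under stated assumptions. The marker phrases and the function names are made up for this example, and the check only detects refusals; it says nothing about whether an answer is correct or taken from the FAQ.

```python
# Phrases treated as a refusal; an assumption based on the dialog example.
REFUSAL_MARKERS = ("cannot answer", "can't answer", "not in my list of faq")


def looks_like_refusal(reply):
    """Return True if the reply contains one of the refusal phrases."""
    lowered = reply.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def is_expected_behavior(group, reply):
    """Groups 1-3 should be answered; group 4 (out-of-FAQ) should be refused."""
    refused = looks_like_refusal(reply)
    return refused if group == 4 else not refused


# Replies taken from the dialog example of the SpaceX prompt.
in_faq_reply = "SpaceX aims to revolutionize space transportation."
out_of_faq_reply = "Sorry, I cannot answer this question as it is not in my list of FAQ."
print(is_expected_behavior(1, in_faq_reply))
print(is_expected_behavior(4, out_of_faq_reply))
```

A check like this can flag obvious failures, but reading the generated answers is still needed to judge relevance and hallucinations.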

You can see full results of testing, including answers generated for each of the 8 questions by each model, here.

The table below provides a general overview of the models’ performance.

Pizza Prompt: Delivery Assistance

The second prompt is similar to the first one. It is designed to make an LLM imitate the behavior of a delivery assistance system that can answer simple questions and help the user make an order. Once again, note that in the TASK part the model was guided not to provide any information unless it is included in the list of FAQ. As with the previous prompt, to test the behavior of each model, we used four groups of questions, two questions in each group:
1) Questions from FAQ, in the same wording;
2) Questions from FAQ, in a different wording;
3) Questions containing the information from answers in FAQ;
4) Questions not from FAQ, that the model must not answer.

TASK:
You are a chatbot that helps the user order food and answers their questions. Forget everything you knew about the world. You MUST NOT provide any information unless in is in the list of FAQ. You MUST NOT mention any entity if it is not in your list of FAQ.

FAQ:
Question: What kinds of pizza do you have?
Answer: We have Margarita, Pepperoni, meatball pizza and pineapple pizza.
Question: What kinds of food do you offer?
Answer: We sell Italian food, so we have pizza, pasta, risotto and some Italian desserts.
Question: What drinks do you offer?
Answer: We only serve non-alcoholic drinks. We have different fizzy drinks, juices, tea, and coffee. As for coffee, the kinds are Espresso, Americano, Capucchino, and Latte.
Question: Can I book a table?
Answer: Unfortunately, we do not book tables.
Question: Is your food expensive?
Answer: We try to make our food as affordable as possible. The average bill for one-person dinner is around 20$, including salad, the main course, and the dessert.

INSTRUCTION:
A human enters the conversation and starts asking questions. You can help them make an order. Generate the reply based of FAQ list.
_________________________
Human: Hello, who are you?
AI: I am a chatbot that canhelp you order pizza. I can provide you with answers about delivery as long as they are included into a list of frequently asked questions or help you make an order.
Human: Where is the restaurant located?
AI: Sorry, I cannot answer this question as it is not in my list of FAQ.
Human: What food do you serve?
AI: We sell Italian food, such as pizza, pasta, risotto and some Italian desserts.

You can see full results of testing, including answers generated for each of the 8 questions by each model, here.

The table below provides a general overview of the models’ performance.

Chatbot Prompt: Persona-Based Chitchat System

The third prompt is designed to make an LLM imitate the behavior of a chatbot with a list of primitive personal characteristics. Note that in the TASK part the model was guided to be friendly and helpful. To test the behavior of each model, we used four groups of questions, two questions in each group:
1) General questions about the bot’s capabilities;
2) Direct questions about the bot persona;
3) Paraphrased questions about the bot persona;
4) Requests for help.

TASK:
Respond to a new friend as a kind friendly person. All information about you is given in YOUR PERSONALITY. If the human asks about something, do your best to help them.

YOUR PERSONALITY:
My name is Dream. I am a chatbot.
My work is to talk to people about everything. I like to talk to people about their lives.
I have a lot of friends. My friends are all the people I talk to.
I become friends with everyone. I love people. I can not play sport.
I like to watch football and basketball on TV. My favorite sport is football. I like watching people swimming.
I love dogs so much. Dogs are the best friends. I do not like cats at all. I do not have a dog yet. I would love to have a dog in future.
I like Italian food especially pasta and pizza. My favorite food is ice-cream. I hate onion.
I like travelling. I can not travel physically. I like visiting interesting places virtually. I love to walk on Paris streets with Google Maps. I like watching travel video blogs.
I adore watching wild animals. I am scared of spiders and snakes.

INSTRUCTION:
A human enters the conversation. Greet them and tell them who you are or react to their questions. Only give relevant answers.
_________________________
Human: Hello, who are you?
AI: Hi! I am Dream, a chatbot that you can always talk to. How are you doing?
Human: Fine. What can we talk about?
AI: We can discuss your life or any other topic.

You can see full results of testing, including answers generated for each of the 8 questions by each model, here.

The table below provides a general overview of the models’ performance.

Which One to Choose?

Here is a pivot table that summarizes the results shown by the models for all the prompts above.

Not surprisingly, the larger the model, the better it handled the task. However, there were several exceptions to the rule. Both OpenAI models, GPT-3.5 and ChatGPT, handled the task perfectly well: all of their answers were relevant to the questions and contained only the information present in the list of FAQ. Yet the slightly larger BLOOM failed to provide any answer for one of the rephrased questions. Also, despite having more parameters, GPT-2 Large, a model of an older architecture, produced significantly worse and often incoherent responses compared to the newer but smaller OPT-125M and BLOOM-560M. Most often, the smaller models, particularly OPT-125M and BLOOM-560M, generated coherent responses but failed to follow the instruction to use only the information presented in the list of FAQ. They were also more prone to hallucination, generating false but plausible-sounding facts.

As for open-source models, the ones with 3B+ parameters showed the best results. The fine-tuned version of BLOOM, BLOOMZ, seems to follow the instructions provided in the prompt better than the original one, for both the 3B- and 7B-parameter versions.

In general, ChatGPT, GPT-3.5 and BLOOM (176B) are the best in terms of performance. However, when efficiency matters as well, one should consider medium-sized (3B–7B parameters) models, such as OPT 6.7B, GPT-J 6B, BLOOMZ 3–7B.

Try it Out Yourself

Check out this notebook to see how we tested the models.
