PromptGPT: Optimizing the Prompt for GPT-4

Dmitrii Lukianov
16 min read · May 14, 2023


Suppose we are solving a seq2seq NLP problem. There is a great variety of such problems: text/code generation, translation/stylization, summarization, text correction, named entity recognition, and even text classification can be reduced to seq2seq.

If we’re short on time, need a high-quality solution, and can afford to pay for a commercial API, then the GPT-4 API will do the job.

When solving a seq2seq problem using the GPT-4 API, we will need to formulate a prompt, which consists of a system message and a list of examples for the model (few-shot learning). You can find a GPT-4 API tutorial in my previous article.

We are faced with the problem of measuring the model’s performance and optimizing the prompt:

  1. Changes in the system message randomly affect the results
  2. It’s unclear which subset of examples we should choose
  3. The longer the prompt, the more expensive it is to use the model. We want to find the shortest prompt that still leads to good enough model performance.

Additionally, we can use either GPT-3.5 or GPT-4 (if we have access to it). GPT-4 is significantly more expensive, so we want to determine whether it improves quality enough to justify its cost.

I have developed a small tool that measures the LLM quality on various prompts, helping you find the optimal one: https://github.com/einhornus/prompt_gpt. In this article I’ll first explain how it works and then demonstrate the process of prompt optimization using a specific example: the grammar correction task from my previous article.

Solving NLP Problems with GPT-4 API

The API has a ChatCompletion method that takes a list of messages (the prompt), and returns the next message in the dialogue.

Each message in this list is a dictionary with two fields: role and content.

Messages are classified into 3 types based on their role value: system messages (role=”system”), user messages (role=”user”), and assistant messages (role=”assistant”).

The system message contains high-level instructions for the assistant explaining how it should behave in general. In the system message, we explain to the model in plain language what we want from it.

If there are several system messages in a prompt, only the last one will matter (new system messages overwrite old ones).

Messages from the user and the assistant form the preceding dialogue. However, these messages can also be used to demonstrate examples of solving the problem (a method of learning from a small number of examples — few-shot learning). We’re going to take advantage of that. Each user message will contain an example of an input and the assistant’s response message will contain the expected output. These examples are followed by a user message with the input we need a solution for. This way, the most natural continuation of the chat is the message containing the solution to our problem.

For example, if we want to build an English to Spanish machine translator, its prompt could look like this:

[
    {
        "role": "user",
        "content": "Hello how are you?"
    },
    {
        "role": "assistant",
        "content": "Hola, ¿cómo estás?"
    },
    {
        "role": "user",
        "content": "Despite the heavy rain, they decided to continue their hike through the dense forest"
    },
    {
        "role": "assistant",
        "content": "A pesar de la fuerte lluvia, decidieron continuar su caminata a través del denso bosque"
    },
    {
        "role": "user",
        "content": "The chef, inspired by flavors from around the world, has created a unique fusion cuisine that attracts food enthusiasts and critics alike"
    },
    {
        "role": "assistant",
        "content": "El chef, inspirado en sabores de todo el mundo, ha creado una cocina de fusión única que atrae a entusiastas y críticos gastronómicos por igual"
    },
    {
        "role": "system",
        "content": "You are TranslateGPT. You translate user messages from English to Spanish. You are the most accurate English to Spanish translator in the world."
    },
    {
        "role": "user",
        "content": "{a message to translate}"
    }
]

I discovered that placing the system message near the end of the prompt is more effective, as the model will pay greater attention to it.

For the same reason, it is also better to place better examples at the end of the prompt.

If you send an API query with this prompt, the response will contain an assistant message with the translation of our text.

The following code translates the text from English to Spanish:

import openai
import os


def translate(text):
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "user",
                "content": "Hello how are you?"
            },
            {
                "role": "assistant",
                "content": "Hola, ¿cómo estás?"
            },
            {
                "role": "user",
                "content": "Despite the heavy rain, they decided to continue their hike through the dense forest"
            },
            {
                "role": "assistant",
                "content": "A pesar de la fuerte lluvia, decidieron continuar su caminata a través del denso bosque"
            },
            {
                "role": "user",
                "content": "The chef, inspired by flavors from around the world, has created a unique fusion cuisine that attracts food enthusiasts and critics alike"
            },
            {
                "role": "assistant",
                "content": "El chef, inspirado en sabores de todo el mundo, ha creado una cocina de fusión única que atrae a entusiastas y críticos gastronómicos por igual"
            },
            {
                "role": "system",
                "content": "You are TranslateGPT. You translate user messages from English to Spanish. You are the most accurate English to Spanish translator in the world."
            },
            {
                "role": "user",
                "content": text
            }
        ],
        temperature=0
    )
    return completion["choices"][0]["message"]["content"]

You can find the complete API reference here and play around with the API in the playground.

How Does It Work

We have two datasets: a training set and a test set. Each item in both datasets contains two fields: input and output.

The training set is used to create examples for the prompt, while the test set is used to measure the quality of the model.

To measure the quality of the model on a particular prompt, we run the model on each item of the test set and compute a similarity score between the model’s output (the actual output) and the item’s output field (the expected output). By statistically analyzing the distribution of scores for different prompts, we can choose the optimal one.

To save time and money, we store the results of running the model in a cache.
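Here is a minimal sketch of what this evaluation loop with caching could look like. The function names (cached_completion, evaluate), the cache schema, and the call_llm/score callbacks are illustrative assumptions, not the actual prompt_gpt code:

import hashlib
import json
import sqlite3


def cached_completion(conn, model, messages, call_llm):
    # Cache key: hash of the model name and the full prompt (assumed scheme).
    key = hashlib.sha256(json.dumps([model, messages]).encode("utf-8")).hexdigest()
    conn.execute("CREATE TABLE IF NOT EXISTS cache(key TEXT PRIMARY KEY, response TEXT)")
    row = conn.execute("SELECT response FROM cache WHERE key = ?", (key,)).fetchone()
    if row is not None:
        return row[0]  # reuse a previously stored answer
    response = call_llm(model, messages)
    conn.execute("INSERT INTO cache(key, response) VALUES (?, ?)", (key, response))
    conn.commit()
    return response


def evaluate(conn, model, prompt_prefix, test_items, call_llm, score):
    # Run the model on every test item and collect similarity scores.
    scores = []
    for item in test_items:
        messages = prompt_prefix + [{"role": "user", "content": item["input"]}]
        actual = cached_completion(conn, model, messages, call_llm)
        scores.append(score(item["output"], actual))
    return scores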

We would also like to vary the system message: we have a set of possible system messages in a separate folder (each one is in its own file), and we can put each of those messages in the prompt.

We create the list of examples as follows: we set the parameter k (the maximum total number of characters in the set of examples) and proceed from the beginning of the training set to its end. If the current item fits into the remaining character budget, we add it to the set of examples; otherwise, we skip it. At the end of the procedure, the list of examples is reversed so that the better examples are closer to the end.

The performance of the example set often depends on the order in which the items of the training set were initially placed, and this is very hard to predict. So, I decided to build the example set n times: the first time, the items are taken in their default order, and on the next n-1 iterations they are randomly shuffled (the seed equals the iteration number). This way, the model can be run multiple times with different sets of examples, and the best one can be found.
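A rough sketch of this selection procedure follows; the per-item character accounting and the helper name build_example_set are my assumptions about the details:

import random


def build_example_set(train_items, k, seed=None):
    # Assumed cost model: an item "fits" if its input and output together
    # keep the running total under the k-character budget.
    items = list(train_items)
    if seed is not None:
        random.Random(seed).shuffle(items)  # iterations 2..n use a shuffled order
    examples, used = [], 0
    for item in items:
        cost = len(item["input"]) + len(item["output"])
        if used + cost <= k:
            examples.append(item)
            used += cost
        # otherwise skip the item and keep scanning
    examples.reverse()  # better examples end up closer to the end of the prompt
    return examples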

Creating the Reports

We have a create_report function that generates a .json report about the model performance. It receives the following arguments:

  1. project_name; the name of the project
  2. model; gpt-3.5-turbo or gpt-4
  3. system_message_file; the name of the file containing the system message
  4. parameters; this argument requires special attention. I’ll discuss it later in the “Parameters” section
  5. k; the maximum total number of characters in the set of examples
  6. n; the number of runs; on each run, the example set is built from a differently ordered training set
  7. metric; the metric used to calculate the similarity score between the expected and the actual results. Currently, only BLEU is supported; BLEU is the gold standard for estimating machine translation quality. I plan to add other metrics such as the Levenshtein distance or the cosine similarity between text embeddings obtained from different language models (BERT, GPT-3 embeddings, and so on). A minimal sketch of computing a BLEU score follows this list.
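For reference, a BLEU-based similarity score between the expected and actual outputs can be computed with nltk roughly as follows (this is a sketch, not necessarily the exact tokenization and smoothing used in metrics.py):

import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

nltk.download("punkt", quiet=True)  # tokenizer data used by word_tokenize


def bleu_score(expected: str, actual: str) -> float:
    reference = [nltk.word_tokenize(expected)]   # BLEU expects a list of reference token lists
    hypothesis = nltk.word_tokenize(actual)
    # Smoothing avoids zero scores on short texts with no higher-order n-gram matches.
    return sentence_bleu(reference, hypothesis, smoothing_function=SmoothingFunction().method1)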

The function creates a .json report whose file name encodes the arguments, for example: “system=1 language=English model=gpt-3.5-turbo k=0 metric=bleu.json”.
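The file name could be assembled like this (an illustrative sketch of the naming scheme shown above, not the tool's exact code):

def report_filename(system_message_file, parameters, model, k, metric):
    # e.g. "system=4 language=English model=gpt-3.5-turbo k=2000 metric=bleu.json"
    params = " ".join(f"{key}={value}" for key, value in parameters.items())
    return f"system={system_message_file} {params} model={model} k={k} metric={metric}.json"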

The report contains the results for the best of the n runs — the one with the highest average score.

The report includes:

  1. create_report arguments
  2. Mean, median and the percentiles of the score distribution
  3. Items of the test set sorted by the score value (from best to worst)
  4. The prompt as a list of messages
  5. history; the mean score on each of the n runs

You can find an example of a report here.

Parameters

For each item in the training/test set, we also store a list of its “parameters”. Combinations of parameter values divide the datasets into independent subsets; all reports are generated on subsets whose items share the same value for every parameter.

For example, if we want to create a machine translation model, then the parameters will be the source and target languages, because we need to make separate sets of examples for each language pair. Each report will be related to a specific language pair, defined by the values of these two parameters.

When writing the system message, you can use the parameter values, for example: “You translate messages from {source_language} to {destination_language}”.
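A tiny sketch of this substitution (using str.format here is my assumption about the mechanism; the placeholder names come from the example above):

# Hypothetical system message template with parameter placeholders.
system_template = "You translate messages from {source_language} to {destination_language}."
parameters = {"source_language": "English", "destination_language": "Spanish"}

system_message = system_template.format(**parameters)
# -> "You translate messages from English to Spanish."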

Visualization

The visualize function displays the data from several reports as box plots.

The visualization example

Boxes are drawn using the boxplot function from matplotlib, so the box bounds represent the 25th and 75th percentiles, with the bold line in the center representing the median. The triangular notch in the center indicates the confidence interval for the median.
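A minimal matplotlib sketch of this kind of chart; the real visualize function reads the scores from the reports, whereas here the data and labels are placeholders:

import matplotlib.pyplot as plt


def draw_boxes(score_lists, labels, title):
    plt.boxplot(score_lists, notch=True)           # notch = confidence interval for the median
    plt.xticks(range(1, len(labels) + 1), labels)  # one label under each box
    plt.ylabel("BLEU score")
    plt.title(title)
    plt.show()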

visualize receives the following arguments:

  1. project_name; the name of the project
  2. system_message_files; the array of possible values for the system_message_file argument
  3. models; the array of possible values for model argument
  4. parameters_configurations; the array of possible values for the parameters argument
  5. ks; the array of possible values for k
  6. metric; the metric used

The function filters the reports: it draws box plots only for the reports whose arguments all fall within these ranges.

The chart title includes the argument values shared among all the boxes. The label under each box lists the argument values that are specific to that box.

The Project Structure


In the root of the data folder, we create a folder with the name of our project, which will store all the data related to the project.

The data folder also contains llm_cache.sqlite3, which is used to cache LLM requests, and dataset.sqlite3, an empty dataset template that you copy into your project folder and fill with your own data.

The system folder contains files with different variations of the system message. The name of the message is the name of its corresponding file without .txt.

The reports folder stores generated reports.

dataset.sqlite3 is an SQLite database with two tables: train and test. The train table contains the elements of the training set, while the test table contains the elements of the test set.

Both tables have the same schema:

CREATE TABLE "train" (
    "id" INTEGER NOT NULL UNIQUE,
    "name" TEXT,
    "parameters" TEXT,
    "input" TEXT,
    "output" TEXT,
    PRIMARY KEY("id" AUTOINCREMENT)
)
  1. id; the primary key. On the first iteration of building the example set, the items of the train set are taken in order of id, so it makes sense to put the best examples at the beginning of the train set. However, when n>1, the items are randomly shuffled on every iteration except the first, and the original order does not matter
  2. name; the name of the item. It is only there for convenience and does not affect anything
  3. parameters; the parameter values separated by a line break
  4. input; the input message for the model
  5. output; the expected output of the model
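Reading the items back for a given parameter configuration could look roughly like this (the matching of the parameters column against the configuration, and the load_items helper itself, are assumptions about how the tool filters items):

import sqlite3


def load_items(db_path, table, parameter_values):
    # parameter_values: list of values, in the same order as in the "parameters" column
    conn = sqlite3.connect(db_path)
    rows = conn.execute(f"SELECT name, parameters, input, output FROM {table} ORDER BY id")
    items = []
    for name, params, input_text, output_text in rows:
        if params.split("\n") == parameter_values:  # values are separated by line breaks
            items.append({"name": name, "input": input_text, "output": output_text})
    conn.close()
    return items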

The src folder contains the code:

  1. create_reports.py; the script for creating the reports, which contains the create_report function
  2. visualize.py; the visualization script, which contains the visualize function
  3. llm_utils.py; contains the code for calling the API and for the caching
  4. metrics.py; the code used to calculate the similarity scores

Project dependencies are:

  1. openai; for calling the LLM
  2. nltk; for the tokenization and calculating the BLEU score
  3. matplotlib; for the visualization
  4. numpy; for statistical computations

Grammar Correction

In this chapter, I will show you how to optimize the prompt for a specific seq2seq problem: grammar correction.

API Key

First, we need to create an OpenAI API key and put it into an environment variable called OPENAI_API_KEY.

Creating the Dataset

Lang8 is a website where native speakers correct mistakes in messages written by language learners. The website no longer operates; however, it is possible to download the language correction data accumulated over many years.

I have heavily preprocessed the data and got this dataset as a result: for each of the 6 languages (English, Spanish, German, French, Russian, Dutch), I have 30–90 items in the test set and half as many items in the train set.

In this project, we will have one parameter — the language. We will conduct the majority of tests on the English subset only (language=English) in order to save time and money.

The vast majority of Lang8 users are native speakers of Japanese, Chinese, and Korean, so I decided to balance out speakers of different languages. For example, in the English subset, there are 15 texts from native speakers of English, Spanish, German, French, Russian, Japanese, Chinese, and Korean, plus 15 texts from speakers of other languages. The name field encodes the language of the message as well as the native language of its author.

Selecting the System Message

Let’s choose the optimal system message.

I have created six different system messages and placed them in the correct_grammar/system folder: 1.txt, 2.txt, …, 6.txt.

The first and second messages are the ones I used in my previous article for two different types of correction.

1.txt:

You’re ImproveGPT.

You improve the provided {language} text language-wise: you fix grammar and spelling mistakes and make it sound more natural.

Your reply should contain ONLY the corrected text, nothing else.

Please use exactly the same formatting as the original text.

2.txt:

You’re CorrectGPT.

You fix grammar and spelling mistakes in {language} texts.

Please only fix grammar and spelling mistakes in the given user message.

Your reply should contain ONLY the corrected text, nothing else.

Please use exactly the same formatting as the original text.

In the third and fourth messages, I have decided to add “You’re the best proofreading tool in the world.”

3.txt:

You’re ImproveGPT.

You’re the best proofreading tool in the world.

You improve the provided {language} text language-wise: you fix grammar and spelling mistakes and make it sound more natural.

Your reply should contain ONLY the corrected text, nothing else.

Please use exactly the same formatting as the original text.

4.txt:

You’re CorrectGPT.

You’re the best proofreading tool in the world.

You fix grammar and spelling mistakes in {language} texts.

Please only fix grammar and spelling mistakes in the given user message.

Your reply should contain ONLY the corrected text, nothing else.

Please use exactly the same formatting as the original text.

The fifth message expands upon the second one by adding more details.

5.txt:

You’re CorrectGPT.

You fix grammar and spelling mistakes in {language} texts.

You’re the best proofreader in the world.

You can fix any grammar and spelling mistakes in any text.

The user supplies a text in {language} and you have to write a reply with the same text but with all grammar and spelling mistakes fixed.

You’re allowed to make slight changes to the text so it sounds more natural, but it should still have the same meaning as the original.

Your reply should contain ONLY the corrected text, nothing else.

Please use exactly the same formatting as the original text.

I made the sixth message as concise as possible.

6.txt:

You fix grammar and spelling mistakes in {language} texts.

Let’s create a report for each message, with n=10 and either k=0 (no examples) or k=2000.

To do this, we run the following code from create_reports.py:

for i in range(1, 7):
    create_report("grammar_correction", "gpt-3.5-turbo", str(i), {"language": "English"}, 2000, "bleu", 10)
    create_report("grammar_correction", "gpt-3.5-turbo", str(i), {"language": "English"}, 0, "bleu", 10)

Please note that report generation is not very fast: for each of the 6 system messages, we run GPT-3.5 10 times on each of the 90 items in the test set. It took an entire night and several dollars of API usage.

Let’s visualize the results with k=0 and with k=2000 by calling the code in visualize.py:

visualize("grammar_correction",
["1", "2", "3", "4", "5", "6"],
["gpt-3.5-turbo"],
[ {"language": "English"}, ],
[0],
"bleu"
)

visualize("grammar_correction",
["1", "2", "3", "4", "5", "6"],
["gpt-3.5-turbo"],
[ {"language": "English"}, ],
[2000],
"bleu"
)

With k=2000, all the prompts except the last one show very similar results; with k=0, however, the difference is quite clear.

We can conclude that in our task it’s crucial to formulate the system message well when there are no examples. If we do have examples, however, the exact wording of the system message becomes much less important, as the model figures out what it needs to do from those examples.

I have decided to use system message number 4 because it shows the best performance when k=0 and, when k=2000, the lowest standard deviation of the score.

How Does the Performance Depend on k

Let’s now vary k (without changing anything else):

create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "English"}, 500, "bleu", 10)
create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "English"}, 1000, "bleu", 10)
create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "English"}, 2000, "bleu", 10)
visualize("grammar_correction",
["4"],
["gpt-3.5-turbo"],
[
{"language": "English"},
],
[0, 500, 1000, 2000],
"bleu"
)

The model’s performance improves only slightly as we increase the number of examples.

GPT-3.5 vs GPT-4

Let’s generate a report using the GPT-4 model:

create_report("grammar_correction", "gpt-4", "4", {"language": "English"}, 0, "bleu", 10)

Let’s visualize 3 scenarios: GPT-3.5 with k=0 and k=2000, and GPT-4 with k=0:

visualize("grammar_correction",
["4"],
["gpt-3.5-turbo", "gpt-4"],
[
{"language": "English"},
],
[0, 2000],
"bleu"
)

GPT-4 is evidently superior: without examples, it outperforms GPT-3.5, which uses 2000 characters in total for examples.

Other Languages

Now, let’s examine the performance in other languages:

create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "Spanish"}, 0, "bleu", 10)
create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "French"}, 0, "bleu", 10)
create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "German"}, 0, "bleu", 10)
create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "Dutch"}, 0, "bleu", 10)
create_report("grammar_correction", "gpt-3.5-turbo", "4", {"language": "Russian"}, 0, "bleu", 10)
visualize("grammar_correction",
["4"],
["gpt-3.5-turbo"],
[
{"language": "English"},
{"language": "German"},
{"language": "French"},
{"language": "Spanish"},
{"language": "Dutch"},
{"language": "Russian"},
],
[0],
"bleu"
)

The model performs poorly on Russian due to the inferior tokenization of Cyrillic characters.

Conclusions

The following system message appears to be the most suitable:

You’re CorrectGPT.

You’re the best proofreading tool in the world.

You fix grammar and spelling mistakes in {language} texts.

Please only fix grammar and spelling mistakes in the given user message.

Your reply should contain ONLY the corrected text, nothing else.

Please use exactly the same formatting as the original text.

If we have access to GPT-4, the best results will be achieved with it (GPT-4 without examples outperforms GPT-3.5 with examples).

If we’re using GPT-3.5, then we can boost the performance slightly by using examples. The optimal prompt can be taken from a report file such as “system=4 language=English model=gpt-3.5-turbo k=<500, 1000, or 2000> metric=bleu.json”. For example, with k=2000 the optimal prompt is as follows:

[
    {
        "role": "user",
        "content": "Please check my English questions\nPlease tell your shy story."
    },
    {
        "role": "assistant",
        "content": "Please check my English questions.\nPlease tell your shy story."
    },
    {
        "role": "user",
        "content": "I was disappointed.\nToday I went to a Japanese school where I am working as a volunteer every Sunday.\nI met there a Australian yong man who is studing Japanese.\nI asked him to tolk with me in English for a while.\nBut I could not understand well what he said.\nI was nervous with speaking English.\nI was frustrated it was useless my English study until now."
    },
    {
        "role": "assistant",
        "content": "I was disappointed.\nToday, I went to a Japanese school where I am working as a volunteer every Sunday.\nI met a young Australian man who is studying Japanese.\nI asked him to talk to me in English for a while.\nBut I could not understand what he said.\nI was nervous to speak English.\nI was frustrated; my English study until now was useless."
    },
    {
        "role": "user",
        "content": "Business\nYesterday i've seen an old friend, we had a small talk about the daily important things.\nThan I wanted to ask how long he has his business, but I don't know the right words,\nIs \"how long have you your business\" or \" how long have you been in that business\"\nThat drives me crazy..."
    },
    {
        "role": "assistant",
        "content": "Business\nYesterday, I saw an old friend. We had small talk about some daily important things.\nThen, I wanted to ask him for how long he has had his business, but I didn't know the right words.\nIs it \"for how long have you had your business\" or \"for how long have you been in the business\"?\nIt's driving me crazy..."
    },
    {
        "role": "user",
        "content": "Whenever I have a chance, I will try to write my diary.\nIt plans to write two times a day.\nIt's time at noon and I have a sleep before.\nThat is not a promise and It is a plan!\nI know that promises is important.\nSo, I don't promises easily."
    },
    {
        "role": "assistant",
        "content": "Whenever I get the chance, I will try to write in my diary.\nI plan to write two times a day.\nI will write at noon and before I sleep.\nThat is not a promise, it is a plan!\nI know that promises are important.\nSo I don't make promises easily."
    },
    {
        "role": "system",
        "content": "You're CorrectGPT.\nYou're the best proofreading tool in the world.\nYou fix grammar and spelling mistakes in English texts.\nPlease only fix grammar and spelling mistakes in the given user message.\nYour reply should contain ONLY the corrected text, nothing else.\nPlease use exactly the same formatting as the original text."
    }
]
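In production, the prompt from the report can simply be reused: append the user’s text as the final user message and call the API. A sketch, assuming the report stores the messages under a "prompt" key and the report file sits in the current directory:

import json
import os
import openai


def correct_grammar(text, report_path="system=4 language=English model=gpt-3.5-turbo k=2000 metric=bleu.json"):
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    with open(report_path, encoding="utf-8") as f:
        prompt = json.load(f)["prompt"]  # the report's prompt as a list of messages (assumed key name)
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=prompt + [{"role": "user", "content": text}],
        temperature=0
    )
    return completion["choices"][0]["message"]["content"]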

The model’s performance in Spanish, German, French, and Dutch is similar to its performance in English. However, the quality significantly degrades when it comes to Russian.
