Generate Your Own Datasets to Experiment with Fine-Tuning on GPT Models

Elbert Bae
Published in BertCode · Oct 31, 2023

Learning how to fine-tune a model takes more than making a few requests to OpenAI. You need to understand the problem, figure out where the base model falls short, and then get the training data. Finding the right dataset can be challenging. There are plenty of great data sources out there for specific needs, but when you’re learning, generating your own is a great way to get started quickly.

Throughout this article, we’ll look at a simple CLI tool I’ve created called oaift-nodejs. It’s a wrapper around the OpenAI APIs that makes generating data and creating fine-tuning jobs easier. It’s not a perfect tool and is still a work in progress, but I believe it’s good enough for those starting their learning journey with fine-tuning models.

To keep this simple, we’ll replicate how I generated the dataset used for the sentiment recognition fine-tuned model mentioned in this article about fine-tuning.

Setup

The CLI program runs on Node.js, but one of its scripts relies on Python due to some code extracted from OpenAI’s cookbook. To get started, ensure that you have the following installed on your machine:

  • Python 3
  • Node.js
  • oaift-nodejs (install it globally with npm i -g oaift-nodejs)
  • An OpenAI API key, set in your terminal as the environment variable OPENAI_API_KEY (e.g. export OPENAI_API_KEY=sk-yourkey)

Confirm that the CLI command is installed by running oaift -h. You should see the command’s help output printed in your terminal.

Initiating a Project

First, we want a new directory to work in. Create one with mkdir oaift-example and enter it with cd oaift-example.

One of the core ideas behind the tool is to track everything inside of projects. So before we start anything, let’s initialize a project with the init command:

oaift init --name my_project_name

This will create the following folder structure with two configuration files:

- projects
  - my_project_name
    - oaift_config.json
    - oaigen_config.json

With these files created, we can get started with generating some datasets.

Generating Datasets

The configuration file we’ll work with most closely is oaigen_config.json, which contains properties we can modify to customize the data generation process.

The default template comes with an example:

{
  "system": "You are an assistant that is excited to provide assistance to new clients of your organization. The company called GroupStart is a consulting organization specializing in helping clients foster greater teamwork in their teams. You are conversational, empathetic, and professional in the services that you provide.",
  "topics": [
    "Overcoming difficult conversations with underperforming employees.",
    "Providing support for deadlines.",
    "Celebrating success with everyone on the team."
  ],
  "count": 3,
  "variables": {
    "audience": "user"
  },
  "model": "gpt-3.5-turbo",
  "template": "Generate conversations between you and the {{audience}} about the topic below and call convertToTrainingData({ conversation: MESSAGES[] }).\n\n{{topic}}"
}

  • system is the base prompt for chat completions. It is the initial set of instructions that reinforces how your chat completions should behave.
  • topics is an array of values used to create variety in your datasets.
  • count is the number of chat completions to generate per topic, useful when you want variations of similar conversations.
  • variables are properties that can replace placeholders in the template.
  • template is the main user prompt, sent as the first message of each chat completion request. The string interpolates values from variables along with each item of topics. Placeholders must be wrapped in handlebars-style double curly braces (e.g. {{VALUE}}), and the current item from topics is available as {{topic}} (see the sketch after this list).
  • model is the OpenAI model to use. You can experiment between GPT-3.5 and GPT-4 to see which one generates more accurate datasets for your use case.
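
To make the interpolation concrete, here is a minimal sketch of how the template rendering could work. Note that renderTemplate is a hypothetical helper written for illustration, not the actual oaift-nodejs source:

// A minimal sketch of the template interpolation (hypothetical helper,
// not the actual oaift-nodejs implementation).
function renderTemplate(
  template: string,
  variables: Record<string, string>,
  topic: string
): string {
  // Replace each {{placeholder}} with the matching variable, or the topic itself.
  return template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) =>
    key === "topic" ? topic : variables[key] ?? `{{${key}}}`
  );
}

const template =
  "Generate conversations between you and the {{audience}} about the topic below.\n\n{{topic}}";

// Prints the prompt with {{audience}} -> "user" and {{topic}} -> the first topic.
console.log(
  renderTemplate(template, { audience: "user" }, "Overcoming difficult conversations with underperforming employees.")
);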

With these in mind, let’s see how the generate command works!

Previewing Chat Completions

The required fields are --project and --name. For --project, we’ll use the same value we entered in the initialization command, my_project_name; --name is the name of the dataset. We can preview each chat completion request payload by running:

oaift generate --project my_project_name --name template

This is a great way to confirm that your configured template string is constructed correctly before running the API calls, which do incur a cost. For our example, we expect the entire process to cost about $0.10.

Sentiment Recognition Dataset Config

Since our dataset should focus on sentiment recognition, let’s replace oaigen_config.json with the following values. With 33 topics and a count of 2, this will create two chat completions per topic, or 66 requests in total.

{
  "system": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral.",
  "topics": [
    "Positive weather",
    "Negative weather",
    "Neutral weather",
    "Positive experience",
    "Negative experience",
    "Neutral experience",
    "Positive meal",
    "Negative meal",
    "Neutral meal",
    "Positive idea",
    "Negative idea",
    "Neutral idea",
    "Positive thoughts",
    "Negative thoughts",
    "Neutral thoughts",
    "Positive family",
    "Negative family",
    "Neutral family",
    "Positive game",
    "Negative game",
    "Neutral game",
    "Positive game review",
    "Negative game review",
    "Neutral game review",
    "Positive show review",
    "Negative show review",
    "Neutral show review",
    "Positive movie review",
    "Negative movie review",
    "Neutral movie review",
    "Positive restaurant review",
    "Negative restaurant review",
    "Neutral restaurant review"
  ],
  "count": 2,
  "variables": {
    "audience": "user"
  },
  "model": "gpt-3.5-turbo",
  "template": "Generate a conversation where the user talks about a {{topic}} and the assistant responds with one of these words: positive, negative, or neutral. Then call convertToTrainingData({ conversation: MESSAGES[] })"
}

Here’s the first conversation prompt. Not perfect, but enough for our demo!

{
  "model": "gpt-3.5-turbo",
  "stream": false,
  "messages": [
    {
      "role": "system",
      "content": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."
    },
    {
      "role": "user",
      "content": "Generate a conversation where the user talks about a Positive weather and the assistant responds with one of these words: positive, negative, or neutral. Then call convertToTrainingData({ conversation: MESSAGES[] })"
    }
  ]
}

Once we’ve confirmed that these request payloads look good, we can run the same command with the --apply flag to actually make the requests:
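
oaift generate --project my_project_name --name template --apply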

Viewing the Generated Dataset

As each chat completion request resolves, you’ll notice that the file ./my_project_name/datasets/template/chat_completions.jsonl is continually appended to. Upon completion, two new files are created:

  • ./my_project_name/datasets/template/generated_report.json
  • ./my_project_name/datasets/template/training_set.jsonl

chat_completions.jsonl:

  • This file contains the raw responses from our chat completion requests. It’s a great place to inspect the results of the generation.

generated_report.json:

  • This file contains a record of the oaigen_config.json that was used to generate the dataset, as well as a summary of the tokens used during generation. This can be used to calculate the estimated cost (see the sketch after this list).

training_set.jsonl:

  • This file contains the training data parsed from responses where the assistant replied with a function_call to convertToTrainingData. Any responses without the function call are ignored, since the function call is what ensures the training set is in the correct format for fine-tuning.
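
For example, here is a rough sketch of estimating cost from that token summary. The per-token rates are assumptions based on OpenAI’s published gpt-3.5-turbo pricing at the time of writing, so check the current pricing page before relying on them:

// Rough cost estimate from the token counts in generated_report.json.
// The rates below are assumptions (gpt-3.5-turbo pricing circa late 2023).
const PROMPT_RATE = 0.0015 / 1000;    // USD per prompt token (assumed)
const COMPLETION_RATE = 0.002 / 1000; // USD per completion token (assumed)

function estimateCostUSD(promptTokens: number, completionTokens: number): number {
  return promptTokens * PROMPT_RATE + completionTokens * COMPLETION_RATE;
}

// For example, 66 requests at roughly 131 prompt and 69 completion tokens each:
console.log(estimateCostUSD(66 * 131, 66 * 69).toFixed(4)); // ~0.0221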

Here’s an example of one chat completion response in chat_completions.jsonl:

{
  "id": "chatcmpl-8Db6ADlUzzU4cHl3B52wAF6dgeJHc",
  "object": "chat.completion",
  "created": 1698251562,
  "model": "gpt-3.5-turbo-0613",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "function_call": {
          "name": "convertToTrainingData",
          "arguments": "{\n \"conversation\": [\n {\n \"role\": \"user\",\n \"content\": \"The weather is so beautiful today! The sun is shining and there's a gentle breeze.\"\n },\n {\n \"role\": \"assistant\",\n \"content\": \"Positive\"\n }\n ]\n}"
        }
      },
      "finish_reason": "function_call"
    }
  ],
  "usage": {
    "prompt_tokens": 131,
    "completion_tokens": 69,
    "total_tokens": 200
  }
}

And its counterpart item in the training_set.jsonl file:

{
  "messages": [
    {
      "role": "system",
      "content": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."
    },
    {
      "role": "user",
      "content": "The weather is so beautiful today! The sun is shining and there's a gentle breeze."
    },
    { "role": "assistant", "content": "Positive" }
  ]
}
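
The conversion between the two presumably works something like the sketch below. Note that toTrainingExample is a hypothetical helper written for illustration, not the actual oaift-nodejs implementation:

// A minimal sketch of turning a raw chat completion into a training_set.jsonl entry.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

interface Completion {
  choices: { message: { function_call?: { name: string; arguments: string } } }[];
}

function toTrainingExample(systemPrompt: string, completion: Completion): { messages: Message[] } | null {
  const call = completion.choices[0]?.message?.function_call;
  // Responses without the expected function call are skipped entirely.
  if (!call || call.name !== "convertToTrainingData") return null;
  const { conversation } = JSON.parse(call.arguments) as { conversation: Message[] };
  // Prepend the system prompt so each example matches the fine-tuning chat format.
  return { messages: [{ role: "system", content: systemPrompt }, ...conversation] };
}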

Verifying the Generated Dataset

Fine-tuning depends heavily on the quality of your data. Since the dataset is generated by GPT, we should not expect it to be fully accurate. It is therefore crucial to inspect the dataset, confirm everything looks correct, and make adjustments where necessary.

For example, in our dataset we notice that the assistant responds inconsistently from one conversation to the next:

Good data — user message is positive and assistant responded with “Positive”

{
  "messages": [
    {
      "role": "system",
      "content": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."
    },
    {
      "role": "user",
      "content": "I just read a really positive movie review!"
    },
    { "role": "assistant", "content": "Positive" }
  ]
}

Bad data — user message is negative but assistant responded with lowercased “negative”

{
  "messages": [
    {
      "role": "system",
      "content": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."
    },
    {
      "role": "user",
      "content": "I had a terrible experience at a restaurant last night."
    },
    { "role": "assistant", "content": "negative" }
  ]
}

Bad data — assistant responds with a phrase or continued conversation

{
  "messages": [
    {
      "role": "system",
      "content": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."
    },
    {
      "role": "user",
      "content": "I just read a really positive restaurant review!"
    },
    {
      "role": "assistant",
      "content": "That's great! What did the review say?"
    },
    {
      "role": "user",
      "content": "The review said that the food was delicious and the service was excellent."
    },
    { "role": "assistant", "content": "Sounds like a positive review!" }
  ]
}

Bad data — assistant incorrectly thinks the phrase is positive

{
  "messages": [
    {
      "role": "system",
      "content": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."
    },
    { "role": "user", "content": "The weather today is neither hot nor cold." },
    { "role": "assistant", "content": "Positive" }
  ]
}

Bad data — duplicates

{"messages":[{"role":"system","content":"Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."},{"role":"user","content":"I had the most delicious meal today!"},{"role":"assistant","content":"Positive"}]}
{"messages":[{"role":"system","content":"Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."},{"role":"user","content":"I had the most delicious meal today!"},{"role":"assistant","content":"Positive"}]}

Fixing the dataset will require some manual work. It is therefore recommended that you experiment with prompts in ChatGPT first to increase the accuracy of the data before using this tool. Sometimes the right initial prompt and/or the GPT-4 model will yield better results. Here are the datasets created and cleaned for this demo, if you’d like to inspect them.
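
Much of the mechanical cleanup can be scripted. Here is a minimal sketch of such a pass, assuming the training_set.jsonl layout shown above. The normalization, filtering, and dedupe rules are my own illustration, not part of oaift-nodejs, and mislabeled sentiments still need a manual review:

// cleanup.ts: a sketch of a cleanup pass over training_set.jsonl.
// Normalizes label casing, drops multi-turn or off-format conversations,
// and removes exact duplicates. Mislabeled examples (e.g. a neutral message
// tagged "Positive") still require manual review.
import { readFileSync, writeFileSync } from "node:fs";

const VALID_LABELS = new Set(["Positive", "Negative", "Neutral"]);
const seen = new Set<string>();
const cleaned: string[] = [];

const lines = readFileSync("training_set.jsonl", "utf8").split("\n").filter(Boolean);
for (const line of lines) {
  const example = JSON.parse(line) as { messages: { role: string; content: string }[] };
  // Expect exactly three messages: system, user, assistant.
  if (example.messages.length !== 3) continue;
  const label = example.messages[2].content.trim();
  const normalized = label.charAt(0).toUpperCase() + label.slice(1).toLowerCase();
  // Drops phrases like "Sounds like a positive review!" and fixes "negative" -> "Negative".
  if (!VALID_LABELS.has(normalized)) continue;
  example.messages[2].content = normalized;
  const key = JSON.stringify(example.messages);
  if (seen.has(key)) continue; // skip exact duplicates
  seen.add(key);
  cleaned.push(JSON.stringify(example));
}

writeFileSync("training_set.cleaned.jsonl", cleaned.join("\n") + "\n");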

Once you have cleaned up the dataset, you’ll be able to start the fine-tuning job. We won’t cover the fine-tuning process in this article, but you can check out my article on fine-tuning for more details!

Summary

You don’t have to be an AI expert to start learning how to fine-tune models with OpenAI. Practice recognizing the differences between good and bad data by using this tool to generate datasets for your own use cases. The more you experiment, the better prepared you’ll be for real-life applications.

You can check out the full CLI project for oaift-nodejs and the example we created here.

See if you can improve this process, or adjust it for your own needs, by updating the generation prompt template in oaigen_config.json or by experimenting in ChatGPT. More experimentation will yield better results as you build an understanding of how different prompts get you closer to what you want to achieve. Along the way, you might even discover that fine-tuning isn’t necessary for some of your use cases, while identifying areas where it would yield real benefits.

Until next time!
