Fine-tuning with OpenAI
I finally got to test fine-tuning with OpenAI! Fine-tuning one of OpenAI's GPT models (gpt-3.5-turbo-1106, gpt-3.5-turbo-0613, babbage-002, davinci-002, gpt-4-0613) consists of 4 steps:
- Make a training file (datafile.jsonl) using JSON lines notation: load the desired inputs/prompts (user_content) and outputs (assistant_content)
- Upload datafile.jsonl to OpenAI
- Fine-tune one of the existing OpenAI gpt models
- Use the fine-tuned model
Setup (Install and set OpenAI key)
!pip install openai
import os
from os import environ
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', 0)
# https://pypi.org/project/python-dotenv/
from dotenv import load_dotenv, find_dotenv
dot_env_file_exist = load_dotenv(find_dotenv()) # read local .env file
# Returns true or false if .env exists in current directory
print('dot_env_file_exist: ', dot_env_file_exist)
import openai
import json
# Save key to .env file
!dotenv set OPENAI_API_KEY PUT_KEY_HERE
# !dotenv get OPENAI_API_KEY
# PYTHON Solution: Read the key directly from the .env file
def python_get_dotenv_data(parms):
    with open('.env', 'r') as reader:
        out = reader.readlines()
    for i in parms:
        for j in range(len(out)):
            ind = out[j].rfind(i)
            if ind != -1:
                # Skip past "NAME='" and drop the trailing "'\n"
                st = ind + (2 + len(i))
                end = len(out[j]) - 2
                globals()[f'{i}'] = out[j][st:end]

# Save key as a Python variable
parms = ['OPENAI_API_KEY']
python_get_dotenv_data(parms)
openai.api_key = OPENAI_API_KEY
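For what it's worth, load_dotenv() also exports the variables into the process environment, so the key can be read back with os.environ instead of parsing the .env file by hand. A minimal sketch (get_api_key is my own helper name, not part of any library):

```python
import os

def get_api_key(name="OPENAI_API_KEY"):
    """Read an API key from the environment; raise if it is missing."""
    key = os.environ.get(name)
    if key is None:
        raise RuntimeError(f"{name} is not set; check your .env file")
    return key

# After load_dotenv(), this returns the same value as the manual parser above:
# openai.api_key = get_api_key()
```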
Create a dataset
In a previous blog post, I made a chatbot that told users the opening hours for three store locations. Let’s test how many examples we need to pre-train a reliable chatbot that gives the desired information for store opening hours.
system_content = """You are a helpful store hour customer assistant. You are to tell people
when a store location opens for a particular day."""
prompt_Q = ["What time is location0 open on Monday?",
"What time is location0 open on Tuesday?",
"What time is location0 open on Wednesday?",
"What time is location0 open on Thursday?",
"What time is location0 open on Friday?",
"What time is location0 open on Saturday?",
"What time is location0 open on Sunday?",
"When is location0 open on Monday?",
"When is location0 open on Tuesday?",
"When is location0 open on Wednesday?",
"When is location0 open on Thursday?",
"When is location0 open on Friday?",
"When is location0 open on Saturday?",
"When is location0 open on Sunday?",
"What time is location1 open on Monday?",
"What time is location1 open on Tuesday?",
"What time is location1 open on Wednesday?",
"What time is location1 open on Thursday?",
"What time is location1 open on Friday?",
"What time is location1 open on Saturday?",
"What time is location1 open on Sunday?",
"What time is location2 open on Monday?",
"What time is location2 open on Tuesday?",
"What time is location2 open on Wednesday?",
"What time is location2 open on Thursday?",
"What time is location2 open on Friday?",
"What time is location2 open on Saturday?",
"What time is location2 open on Sunday?",
"Are you open on Monday?",
"Are you open on Tuesday?",
"Are you open on Wednesday?",
"Are you open on Thursday?",
"Are you open on Friday?",
"Are you open on Saturday?",
"Are you open on Sunday?",
"Are you open on Monday?",
"Are you open on Tuesday?",
"Are you open on Wednesday?",
"Are you open on Thursday?",
"Are you open on Friday?",
"Are you open on Saturday?",
"Are you open on Sunday?"]
len(prompt_Q)
assistant_content_A = ["Location0 opens at 9 am on Monday.",
"Location0 opens at 9 am on Tuesday.",
"Location0 opens at 9 am on Wednesday.",
"Location0 opens at 9 am on Thursday.",
"Location0 opens at 9 am on Friday.",
"Location0 opens at 9 am on Saturday.",
"Location0 opens at 9 am on Sunday.",
"Location0 opens at 9 am on Monday.",
"Location0 opens at 9 am on Tuesday.",
"Location0 opens at 9 am on Wednesday.",
"Location0 opens at 9 am on Thursday.",
"Location0 opens at 9 am on Friday.",
"Location0 opens at 9 am on Saturday.",
"Location0 opens at 9 am on Sunday.",
"Location1 opens at 8 am on Monday.",
"Location1 opens at 8 am on Tuesday.",
"Location1 opens at 8 am on Wednesday.",
"Location1 opens at 8 am on Thursday.",
"Location1 opens at 9 am on Friday.",
"Location1 opens at 12 midday on Saturday.",
"Location1 opens at 12 midday on Sunday.",
"Location2 opens at 8 am on Monday.",
"Location2 opens at 8 am on Tuesday.",
"Location2 opens at 8 am on Wednesday.",
"Location2 opens at 8 am on Thursday.",
"Location2 opens at 9 am on Friday.",
"Location2 opens at 10 am on Saturday.",
"Location2 opens at 10 am on Sunday.",
"On Monday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"On Tuesday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"On Wednesday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"On Thursday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"On Friday, location0, location1, and location2 open at 9 am, 9 am, and 9 am respectively.",
"On Saturday, location0, location1, and location2 open at 9 am, 12 midday, and 10 am respectively.",
"On Sunday, location0, location1, and location2 open at 9 am, 12 midday, and 10 am respectively.",
"Monday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"Tuesday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"Wednesday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"Thursday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively.",
"Friday, location0, location1, and location2 open at 9 am, 9 am, and 9 am respectively.",
"Saturday, location0, location1, and location2 open at 9 am, 12 midday, and 10 am respectively.",
"Sunday, location0, location1, and location2 open at 9 am, 12 midday, and 10 am respectively."]
len(assistant_content_A)
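Since the prompts follow fixed templates, the two lists above could also be generated with loops instead of being typed out by hand. A sketch that reproduces the same 42 question/answer pairs (the hours dictionary is hard-coded to match the hand-written answers):

```python
days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
# Opening time per location per day, matching the hand-written answers above
hours = {
    "location0": dict.fromkeys(days, "9 am"),
    "location1": {"Monday": "8 am", "Tuesday": "8 am", "Wednesday": "8 am",
                  "Thursday": "8 am", "Friday": "9 am",
                  "Saturday": "12 midday", "Sunday": "12 midday"},
    "location2": {"Monday": "8 am", "Tuesday": "8 am", "Wednesday": "8 am",
                  "Thursday": "8 am", "Friday": "9 am",
                  "Saturday": "10 am", "Sunday": "10 am"},
}

prompt_Q, assistant_content_A = [], []
# "What time is ..." for every location, plus "When is ..." for location0
for loc, template in [("location0", "What time is {loc} open on {day}?"),
                      ("location0", "When is {loc} open on {day}?"),
                      ("location1", "What time is {loc} open on {day}?"),
                      ("location2", "What time is {loc} open on {day}?")]:
    for day in days:
        prompt_Q.append(template.format(loc=loc, day=day))
        assistant_content_A.append(f"{loc.capitalize()} opens at {hours[loc][day]} on {day}.")

# "Are you open ..." asked twice per day, with two answer phrasings
for prefix in ("On {day}, ", "{day}, "):
    for day in days:
        prompt_Q.append(f"Are you open on {day}?")
        answer = (prefix + "location0, location1, and location2 open at "
                  "{h0}, {h1}, and {h2} respectively.")
        assistant_content_A.append(answer.format(
            day=day, h0=hours["location0"][day],
            h1=hours["location1"][day], h2=hours["location2"][day]))

len(prompt_Q), len(assistant_content_A)
```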
Step 1: Make the training file
A training file can be made using Python, pandas, or bash. Producing a JSON Lines file in the OpenAI format was the trickiest part: each line of the file must be a valid JSON object, and valid JSON requires double-quoted strings. Python's default str() representation of a dictionary uses single quotes, which is not valid JSON and causes the upload to fail; pandas adds a different wrinkle by wrapping the message list in an extra string. In the examples below, I post-process datafile.jsonl with the sed bash command, for both the Python and pandas versions, so that it satisfies the OpenAI format.
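The difference is easy to see on a single record:

```python
import json

example = {"role": "user", "content": "What time is location0 open on Monday?"}

# str() of a dict uses single quotes, which is not valid JSON
print(str(example))
# json.dumps() emits true JSON with double quotes, so no cleanup is needed
print(json.dumps(example))
```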
Python
# Way 0: using python ONLY
# JSONL is a text-based format using the .jsonl file extension: basically the same as
# JSON, but with newline characters separating the JSON values. It is also known as
# JSON Lines.
which_way = 'txt_save'  # 'json_save'
for i, desired_A in enumerate(assistant_content_A):
    line_out = {"messages": [{"role": "system", "content": system_content},
                             {"role": "user", "content": prompt_Q[i]},
                             {"role": "assistant", "content": desired_A}]}
    mode = 'w' if i == 0 else 'a'  # overwrite on the first record, append after
    if which_way == 'json_save':
        with open("datafile.jsonl", mode) as wf:
            json.dump(str(line_out) + '\n', wf)
    else:
        # Save as text
        with open("datafile.jsonl", mode, encoding='UTF8') as fptr:
            fptr.write(str(line_out) + '\n')
!cat datafile.jsonl
# Way 0 part 1: Modify the JSON python file to match OpenAI fine-tuning format
# [0] remove ' and replace with "
# (caveat: this blanket replacement breaks if any prompt or answer contains an apostrophe)
!cat datafile.jsonl | sed "s/'/\"/g" > datafile_no_singlequotes.jsonl
# bash: view .jsonl file to confirm contents (results are clear)
!cat datafile_no_singlequotes.jsonl
# OR
# python: view .jsonl file to confirm contents
# with open('datafile_no_singlequotes.jsonl', 'r') as reader:
# print(reader.read())
# OR
# pandas: view .jsonl file to confirm contents
# df = pd.read_json('datafile_no_singlequotes.jsonl', lines=True)
# df
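For what it's worth, the sed step can be skipped entirely by serializing each record with json.dumps, which emits valid double-quoted JSON directly. A sketch on a two-line toy dataset (the filename datafile_jsondumps.jsonl is my own choice for illustration):

```python
import json

system_content = "You are a helpful store hour customer assistant."
toy_prompts = ["What time is location0 open on Monday?",
               "When is location0 open on Tuesday?"]
toy_answers = ["Location0 opens at 9 am on Monday.",
               "Location0 opens at 9 am on Tuesday."]

with open("datafile_jsondumps.jsonl", "w", encoding="UTF8") as wf:
    for q, a in zip(toy_prompts, toy_answers):
        line_out = {"messages": [{"role": "system", "content": system_content},
                                 {"role": "user", "content": q},
                                 {"role": "assistant", "content": a}]}
        wf.write(json.dumps(line_out) + "\n")  # valid JSON, double quotes

# Each line now parses as JSON with no post-processing
with open("datafile_jsondumps.jsonl", encoding="UTF8") as rf:
    records = [json.loads(line) for line in rf]
```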
Pandas
# Way 1 part 0: using pandas ONLY
for i, desired_A in enumerate(assistant_content_A):
    df_temp = pd.DataFrame([["system", system_content],
                            ["user", prompt_Q[i]],
                            ["assistant", desired_A]],
                           columns=["role", "content"])
    df_temp_json = df_temp.to_json(orient='records', lines=False)
    df_temp_line = pd.Series(df_temp_json)
    messages = pd.DataFrame([[df_temp_line[0]]], columns=["messages"])
    # to_json writes messages as a JSON string, one line per record
    mode = 'w' if i == 0 else 'a'
    messages.to_json('datafile.jsonl', orient='records', lines=True,
                     compression='infer', mode=mode)
!cat datafile.jsonl
Pandas creates files that contain a dictionary wrapped in a string (i.e. {"messages":"[{\"role\": ... ]"} instead of {"messages": [{"role": ... ]}), so you need to remove the surrounding string quotes and the backslash escape characters.
# Way 1 part 1: Modify the JSON pandas file to match OpenAI fine-tuning format
# [0] remove "[ and replace with <space>[, [1] remove ]" and replace with ], [2] remove \" and replace with "
!cat datafile.jsonl | sed 's/\"\[/ \[/g' | sed 's/\]\"/\]/g' | sed 's/\\"/\"/g' > datafile_nostring_around_data.jsonl
!cat datafile_nostring_around_data.jsonl
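Before uploading, it may be worth checking that every line of the cleaned file parses as JSON, since a stray apostrophe or backslash will make the upload fail. A small validator sketch (validate_jsonl is my own helper name):

```python
import json

def validate_jsonl(path):
    """Return the 1-based line numbers that fail to parse as JSON."""
    bad_lines = []
    with open(path, encoding="UTF8") as rf:
        for lineno, line in enumerate(rf, start=1):
            try:
                json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(lineno)
    return bad_lines

# bad = validate_jsonl("datafile_nostring_around_data.jsonl")
# an empty list means the file is ready to upload
```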
Step 2: Upload the training file
# Define the client
from openai import OpenAI
client = OpenAI(api_key = OPENAI_API_KEY)
# Using python
client.files.create(
    file=open("datafile_nostring_around_data.jsonl", "rb"),
    purpose="fine-tune"
)
outputs = FileObject(id='file-XXXXXXXXXXXXXXXXXXXXXXXX', bytes=13824, created_at=1701694679, filename='datafile_nostring_around_data.jsonl', object='file', purpose='fine-tune', status='processed', status_details=None)
Step 3: Create a fine-tuning job with the training file name
Use the FileObject id to identify the training file when creating the fine-tuning job.
response = client.fine_tuning.jobs.create(
    training_file='file-XXXXXXXXXXXXXXXXXXXXXXXX',
    model="gpt-3.5-turbo"
)
print(response)
response = FineTuningJob(id='ftjob-YYYYYYYYYYYYYYYYYYYYYYYY', created_at=1701694743, error=None, fine_tuned_model=None, finished_at=None, hyperparameters=Hyperparameters(n_epochs='auto', batch_size='auto', learning_rate_multiplier='auto'), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-ZZZZZZZZZZZZZZZZZZZZZZZZ', result_files=[], status='validating_files', trained_tokens=None, training_file='file-XXXXXXXXXXXXXXXXXXXXXXXX', validation_file=None)
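The job runs asynchronously: it starts in status 'validating_files' and finishes some minutes later. Rather than refreshing the dashboard, you could poll the job until it reaches a terminal status; a sketch of my own helper (wait_for_job and the poll_seconds default are my choices, not part of the openai library):

```python
import time

def wait_for_job(client, job_id, poll_seconds=30):
    """Poll a fine-tuning job until it succeeds, fails, or is cancelled."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in ("succeeded", "failed", "cancelled"):
            return job
        time.sleep(poll_seconds)

# job = wait_for_job(client, response.id)
# print(job.status, job.fine_tuned_model)
```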
Step 4: Use the fine-tuned model
List the fine-tuning jobs, or go to the fine-tuning page on https://platform.openai.com, to obtain the model name. I trained two models: one with 28 examples and one with 42 examples.
# List fine-tuning jobs
client.fine_tuning.jobs.list(limit=5)
outputs = SyncCursorPage[FineTuningJob](data=[FineTuningJob(id='ftjob-YYYYYYYYYYYYYYYYYYYYYYYY', created_at=1701694743, error=None, fine_tuned_model='ft:gpt-3.5-turbo-0613:personal::PPPPPPPP', finished_at=1701695867, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-ZZZZZZZZZZZZZZZZZZZZZZZZ', result_files=['file-RRRRRRRRRRRRRRRRRRRRRRRR'], status='succeeded', trained_tokens=7662, training_file='file-XXXXXXXXXXXXXXXXXXXXXXXX', validation_file=None), FineTuningJob(id='ftjob-YYYYYYYYYYYYYYYYYYYYYYYY', created_at=1701692879, error=None, fine_tuned_model='ft:gpt-3.5-turbo-0613:personal::QQQQQQQQ', finished_at=1701693238, hyperparameters=Hyperparameters(n_epochs=3, batch_size=1, learning_rate_multiplier=2), model='gpt-3.5-turbo-0613', object='fine_tuning.job', organization_id='org-ZZZZZZZZZZZZZZZZZZZZZZZZ', result_files=['file-RRRRRRRRRRRRRRRRRRRRRRRR'], status='succeeded', trained_tokens=5199, training_file='file-XXXXXXXXXXXXXXXXXXXXXXXX', validation_file=None)], object='list', has_more=False)
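The fine_tuned_model field of each succeeded job holds the model name, so it can also be pulled out programmatically instead of copied from the dashboard. A sketch (succeeded_models is my own helper name):

```python
def succeeded_models(jobs):
    """Return fine-tuned model names for succeeded jobs, preserving input order."""
    return [job.fine_tuned_model for job in jobs
            if job.status == "succeeded" and job.fine_tuned_model]

# models = succeeded_models(client.fine_tuning.jobs.list(limit=5).data)
# model = models[0]  # jobs.list returns newest first
```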
# model="ft:gpt-3.5-turbo-0613:personal::QQQQQQQQ" # 28 examples
model = "ft:gpt-3.5-turbo-0613:personal::PPPPPPPP" # 42 examples
# With assistant content
system_content = "You are a helpful store hour customer assistant. You are to tell people when a store location opens for a particular day."
prompt = "Are you open on Monday?"
assistant_content = "On Monday, location0, location1, and location2 open at 9 am, 8 am, and 8 am respectively."
for i in range(20):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": assistant_content},
        ]
    )
    parse_out = response.choices[0].message.content
    print(parse_out)
# Without assistant content
system_content = "You are a helpful store hour customer assistant. You are to tell people when a store location opens for a particular day."
prompt = "Are you open on Monday?"
for i in range(20):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_content},
            {"role": "user", "content": prompt},
        ]
    )
    parse_out = response.choices[0].message.content
    print(parse_out)
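To make "consistent" measurable, the 20 completions could be collected into a list and scored against the expected answer instead of eyeballed. A sketch (consistency is my own helper; exact string matching is a deliberately strict choice):

```python
def consistency(responses, expected):
    """Fraction of responses that exactly match the expected answer."""
    if not responses:
        return 0.0
    return sum(r.strip() == expected.strip() for r in responses) / len(responses)

# responses = []
# for i in range(20):
#     out = client.chat.completions.create(model=model, messages=[...])
#     responses.append(out.choices[0].message.content)
# print(f"consistency: {consistency(responses, assistant_content):.0%}")
```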
Summary
The fine-tuned model trained on 42 examples gave the desired results consistently for this simple use case. I chose 42 by roughly doubling the 28 examples, reasoning that more training data might give more consistent results.
It appears that when the question format is consistent, like "What time is XXX open on YYY?", the model needs 3 or more examples of a unique XXX paired with a YYY; I gave 3 examples of XXX for each of the 7 YYY values (21 examples total). For example, "What time is location2 open on Friday?" appears as a single training example, yet the model responds correctly 20 times in a row with "Location2 opens at 9 am on Friday." Similarly, I needed 2 examples of the question format "Are you open on YYY?" for each YYY item, in order to obtain consistent results.
Therefore, it appears that each unique question type needs 2 or more repeated examples to achieve consistent, reliable results when fine-tuning the gpt-3.5-turbo-0613 model!
Happy Practicing! 👋
References
- OpenAI Fine-tuning: https://platform.openai.com/docs/guides/fine-tuning