How to Fine-Tune GPT-3.5 Turbo

A step-by-step guide on how to fine-tune GPT-3.5 Turbo

Syed Muhammad Ali Musa Raza
8 min read · Sep 12, 2023

Fine-Tuning:

Fine-tuning refers to the process of taking a pre-trained model (a model trained on a large dataset) and adapting it to a new, but related, task. Instead of training a model from scratch, which can be time-consuming and require a lot of data, fine-tuning leverages the knowledge captured in the pre-trained model to achieve better performance on the new task with potentially less data.

Use cases:

  1. Setting the style, tone, format, or other qualitative aspects
  2. Improving reliability at producing a desired output
  3. Correcting failures to follow complex prompts
  4. Handling many edge cases in specific ways
  5. Performing a new skill or task that’s hard to articulate in a prompt

Preparing your dataset:

When you’ve concluded that fine-tuning is the optimal approach (after maximizing the effectiveness of your prompt and pinpointing persistent model issues), it’s time to get your data ready for training. Craft a varied collection of sample conversations that closely mirror the interactions you expect the model to handle during real-time usage in a production environment. Keep in mind that the minimum dataset size is 10 examples, but around 100 examples is a decent size for fine-tuning.
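For example, a single training example in the chat format that gpt-3.5-turbo fine-tuning expects (covered in detail later in this guide) could look like this hypothetical support exchange:

{"messages": [{"role": "system", "content": "You are a helpful support assistant."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to Settings, choose Account, and click Reset Password."}]}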

Generating your OpenAI API Key:

Once you have made your OpenAI account, you need to generate your OpenAI API key. To do this, go to your account photo in the top right corner of your screen and select View API Keys. This will take you to the API keys section. Click the Create new secret key button to create a new secret key, then name it.

After this, click the Create secret key button at the bottom left of the tab. Your key will then be displayed; make sure to store it somewhere safe on your device, because for security purposes it will not be shown to you again.
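A safer alternative to a loose file is to store the key in an environment variable and read it from your code. A minimal sketch, assuming you have already set an OPENAI_API_KEY environment variable:

# Read the API key from an environment variable instead of hard-coding it
import os

api_key = os.environ["OPENAI_API_KEY"]  # assumes OPENAI_API_KEY was set beforehand

You can then pass api_key wherever the examples below use a hard-coded key.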

Adding credit to your OpenAI account:

You will need to add some credit to your account for fine-tuning. Go to Manage account, click Billing on the left side of your screen, and then click Add payment method. Once you have entered your credit card or another payment method, you can add a credit balance. For the fine-tuning prices of different OpenAI models, visit the pricing page by clicking Pricing, which is just below Payment methods.

Installing required libraries:

First, you need to install all of these libraries in your Windows PowerShell:

  1. pip install openai
  2. pip install requests
  3. pip install tiktoken
  4. pip install -U tokenizers
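If the installs succeed, a quick sanity check is to import the libraries from Python:

# Quick sanity check that the installed libraries import correctly
import openai
import requests
import tiktoken

print("All libraries imported successfully")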

Making a connection to the OpenAI API using your API key:

Use the following code to make a connection to the OpenAI API; printing the response confirms that it is working.

# Importing the required libraries
import openai
import requests

url = "https://api.openai.com/v1/chat/completions"

YOUR_OPENAI_API_KEY = "YOUR_API_KEY_HERE"  # Replace with your actual API key

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {YOUR_OPENAI_API_KEY}"  # Interpolate the key into the header
}

data = {
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "Say this is a test!"}],
    "temperature": 0.7
}

response = requests.post(url, headers=headers, json=data)

print(response.json())

Checking the dataset:

The following code loads your dataset and displays its first few rows so you can check it.

import pandas as pd
import json

# Load the JSON file into a variable
with open('PATH_OF_YOUR_DATASET', 'r') as file:
    dataset = json.load(file)

# Convert the JSON data to a pandas DataFrame for easier manipulation (optional)
df = pd.DataFrame(dataset)

# Display the first few rows of the DataFrame to check the data
df.head()

The following analyses will tell you what you need to know about your dataset before fine-tuning it:

Checking the number of examples in the dataset:

To check the number of examples in your dataset and the first example of the dataset, use the following code:

print("Num examples:", len(dataset))
print("First example:")

first_example = dataset[0]
print("Prompt:", first_example["prompt"])
print("Completion:", first_example["completion"])

Checking for Errors in the dataset:

Use this code to check the dataset for formatting errors before fine-tuning it:

from collections import defaultdict

# Format error checks for the provided dataset structure
format_errors = defaultdict(int)

for ex in dataset:
    if not isinstance(ex, dict):
        format_errors["data_type"] += 1
        continue

    # Check for existence of "prompt" and "completion" keys
    if "prompt" not in ex or "completion" not in ex:
        format_errors["missing_key"] += 1
        continue

    # Check if "prompt" and "completion" are non-empty strings
    if not ex["prompt"] or not isinstance(ex["prompt"], str):
        format_errors["invalid_prompt"] += 1

    if not ex["completion"] or not isinstance(ex["completion"], str):
        format_errors["invalid_completion"] += 1

# Displaying the format errors
if format_errors:
    print("Found errors:")
    for k, v in format_errors.items():
        print(f"{k}: {v}")
else:
    print("No errors found")

Distribution Calculations for every key:

The following code calculates token-count distributions for the prompt and completion keys. It first defines a small print_distribution helper that prints the min, max, mean, and median of a list of values:

import tiktoken
import numpy as np

# Load the encoding (adjust the model name if needed)
encoding = tiktoken.get_encoding("cl100k_base")

# Helper to print summary statistics for a list of token counts
def print_distribution(values, name):
    print(f"\n#### Distribution of {name}:")
    print(f"min / max: {min(values)}, {max(values)}")
    print(f"mean / median: {np.mean(values)}, {np.median(values)}")

def num_tokens_from_prompt_and_completion(data_entry):
    num_tokens = 0
    # Counting tokens for the "prompt"
    num_tokens += len(encoding.encode(data_entry["prompt"]))
    # Counting tokens for the "completion"
    num_tokens += len(encoding.encode(data_entry["completion"]))
    return num_tokens

def num_completion_tokens_from_data_entry(data_entry):
    return len(encoding.encode(data_entry["completion"]))

# Use the dataset to compute token counts for each entry
prompt_and_completion_token_counts = [num_tokens_from_prompt_and_completion(entry) for entry in dataset]
completion_token_counts = [num_completion_tokens_from_data_entry(entry) for entry in dataset]

# Print the distribution for the token counts
print_distribution(prompt_and_completion_token_counts, "prompt and completion tokens")
print_distribution(completion_token_counts, "completion tokens")

Distributions for all prompts and completions:

The following code counts the tokens in every prompt and completion and reports the minimum, maximum, mean, and median token counts for prompts, completions, and their combined totals. Since the limit for a fine-tuning prompt-completion pair is 4096 tokens, it also reports how many examples exceed that limit; any tokens above it will be truncated during fine-tuning.

# Token counts
prompt_lens = []
completion_lens = []
total_lens = []

for ex in dataset:
    prompt_len = len(encoding.encode(ex["prompt"]))
    completion_len = len(encoding.encode(ex["completion"]))

    prompt_lens.append(prompt_len)
    completion_lens.append(completion_len)
    total_lens.append(prompt_len + completion_len)

print_distribution(prompt_lens, "num_tokens_per_prompt")
print_distribution(completion_lens, "num_tokens_per_completion")
print_distribution(total_lens, "num_total_tokens_per_example")

# Check for examples that may exceed the 4096 token limit
n_too_long = sum(l > 4096 for l in total_lens)
print(f"\n{n_too_long} examples may be over the 4096 token limit, they will be truncated during fine-tuning")

Token Count and Pricing:

The following code gives you the total billable token count of your dataset and the default number of training epochs, which together determine the fine-tuning price:

# Constants for Pricing and default n_epochs estimate
MAX_TOKENS_PER_EXAMPLE = 4096

TARGET_EPOCHS = 3
MIN_TARGET_EXAMPLES = 100
MAX_TARGET_EXAMPLES = 25000
MIN_DEFAULT_EPOCHS = 1
MAX_DEFAULT_EPOCHS = 25

# Adjustments based on the constants
n_epochs = TARGET_EPOCHS
n_train_examples = len(dataset)
if n_train_examples * TARGET_EPOCHS < MIN_TARGET_EXAMPLES:
    n_epochs = min(MAX_DEFAULT_EPOCHS, MIN_TARGET_EXAMPLES // n_train_examples)
elif n_train_examples * TARGET_EPOCHS > MAX_TARGET_EXAMPLES:
    n_epochs = max(MIN_DEFAULT_EPOCHS, MAX_TARGET_EXAMPLES // n_train_examples)

# Sum the billable tokens, capping each example at MAX_TOKENS_PER_EXAMPLE (uses total_lens from above)
n_billing_tokens_in_dataset = sum(min(MAX_TOKENS_PER_EXAMPLE, length) for length in total_lens)
print(f"Dataset has ~{n_billing_tokens_in_dataset} tokens that will be charged for during training")
print(f"By default, you'll train for {n_epochs} epochs on this dataset")
print(f"By default, you'll be charged for ~{n_epochs * n_billing_tokens_in_dataset} tokens")

OpenAI dataset format for fine-tuning:

A raw dataset is normally in JSON format, for example:

[ { "prompt": "What's the capital of France?"},
{ "completion": "Paris, as if everyone doesn't know that already."} ]

For fine-tuning, however, your dataset must be in JSONL format, with one chat-format example per line. The following example shows just that:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, 
{"role": "user", "content": "What's the capital of France?"},
{"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}

Since most datasets come in JSON format, the following code converts JSON to JSONL:

import json

# Convert JSON to JSONL, writing one JSON object per line
with open("dataset_for_finetuning.json", "r") as json_file:
    data = json.load(json_file)

jsonl_filename = "dataset_for_finetuning.jsonl"
with open(jsonl_filename, "w") as jsonl_file:
    for entry in data:
        jsonl_file.write(json.dumps(entry) + "\n")

print("Wrote", jsonl_filename)
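Note that gpt-3.5-turbo fine-tuning expects the chat messages format shown above, so if your entries are prompt/completion pairs you will also need to restructure them. A minimal sketch, assuming a fixed hypothetical system prompt:

import json

SYSTEM_PROMPT = "You are a helpful assistant."  # hypothetical; adapt it to your use case

with open("dataset_for_finetuning.json", "r") as json_file:
    data = json.load(json_file)

# Wrap each prompt/completion pair in the chat messages structure
with open("dataset_for_finetuning_chat.jsonl", "w") as jsonl_file:
    for entry in data:
        record = {"messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": entry["prompt"]},
            {"role": "assistant", "content": entry["completion"]},
        ]}
        jsonl_file.write(json.dumps(record) + "\n")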

Creating a fine-tuning job:

Use the following code to upload your dataset and start a fine-tuning job:

import openai
import time

# Set your OpenAI API key
openai.api_key = "YOUR_API_KEY"

# Change the filename to match the name of your dataset file
file_name = "dataset_for_finetuning.jsonl"

# Upload the training file
file_upload = openai.File.create(file=open(file_name, "rb"), purpose="fine-tune")
print("Uploaded file id", file_upload.id)

# Wait until the file has been processed
while True:
    print("Waiting for file to process...")
    file_handle = openai.File.retrieve(id=file_upload.id)
    if file_handle.status == "processed":
        print("File processed")
        break
    time.sleep(3)

# Start the fine-tuning job
job = openai.FineTuningJob.create(training_file=file_upload.id, model="gpt-3.5-turbo")

# Poll until the job succeeds
while True:
    print("Waiting for fine-tuning to complete...")
    job_handle = openai.FineTuningJob.retrieve(id=job.id)
    if job_handle.status == "succeeded":
        print("Fine-tuning complete")
        print("Fine-tuned model info", job_handle)
        print("Model id", job_handle.fine_tuned_model)
        break
    time.sleep(3)

Fine-tuning runs on OpenAI's servers, so it will take some time depending on the size of your dataset and the current job queue. While it runs, the script will keep printing the waiting message.

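While you wait, you can also poll the job's event log for progress updates. A sketch using the same 0.28-style openai SDK as above:

# Optional: inspect the latest fine-tuning events while the job runs
events = openai.FineTuningJob.list_events(id=job.id, limit=10)
for event in events["data"]:
    print(event["message"])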

Once the fine-tuning is finished, the script prints the details of the fine-tuned model, including its model id.

To use your fine-tuned model, go to the OpenAI Playground and select your fine-tuned model from the model list.

Then you can enter your prompt in the user section, and the model will respond according to its fine-tuning.
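You can also call the fine-tuned model directly from code instead of the Playground. A sketch using the same 0.28-style SDK, assuming the job_handle from the script above:

# Call the fine-tuned model by its id (requires the job to have succeeded)
completion = openai.ChatCompletion.create(
    model=job_handle.fine_tuned_model,
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
print(completion.choices[0].message["content"])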

For more information, see the OpenAI documentation to learn more about fine-tuning and the steps involved.
