OpenAI API: Fine-Tuned GPT-3.5 vs. Base GPT-3.5

Gabriel Grinberg
10 min read · Aug 27, 2023


A few months ago, I wrote about my journey comparing the output quality, costs, and latency of OpenAI’s chat completion APIs with fine-tuned models. My use case was integrating AI into Flyde, a visual programming tool I’m working on. I’ve written in detail about the process and results.

The first part is available here and contains the detailed background story and process. Spoiler alert: GPT-3.5 turbo was the best-performing model when weighing cost, latency, and quality.

At the time the first article was published, only the lower-tier GPT-3 models (ada, babbage, curie, and davinci) could be fine-tuned. However, on August 22, 2023, OpenAI announced that GPT-3.5 turbo can now be fine-tuned as well.

AI Startups queueing up to use GPT-3.5 Fine-tuning: an illustration. Photo by Levi Jones on Unsplash

This leaves me no choice but to take it for a spin and see how well the fine-tuned model compares to the non-fine-tuned model.

I’m sure it’ll outperform the base GPT-3.5 version, but I’m excited to see how much!

Quick Recap

My initial goal was to use OpenAI to generate “code nodes” for Flyde, based on the user’s prompt. Think of it as a function that adheres to a certain format and can later be used in a visual programming environment.
The training data consisted of the existing 150+ Flyde standard library of code nodes, along with some GPT-4-based synthesized data. GPT-4 was also used to determine the quality score of the results.

Plan

Armed with the code from the previous attempt (view it here), I started by going through the new fine-tuning guide to get a sense of the process.

The first step is to gather training data, but it’s a bit different this time. Unlike the “Completion” APIs that just need a prompt and an output, the Chat Completion API requires a full conversation as input. This is the same kind of input you’d use with the API itself.
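For reference, each training example ends up looking like a single, complete chat exchange. The content below is illustrative rather than copied from my actual dataset:

{
  "messages": [
    { "role": "system", "content": "You generate Flyde code nodes..." },
    { "role": "user", "content": "create a node that reverses a string" },
    { "role": "assistant", "content": "<the node's code, exactly as it should be generated>" }
  ]
}

(In the JSONL file, each of these examples sits on its own line.)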

Another important detail I noticed is that the recommended dataset size is now 50 to 100 training examples. That’s much lower than the 500+ required for the previous generation. This means the dataset for fine-tuning is now simpler and easier to create!

The second step will be starting the fine-tuning job and waiting for it to complete. According to OpenAI, it can take minutes or hours depending on the model and dataset size.

The third and last step will be to test the fine-tuned model using the 17 pre-defined prompts from the previous project and compare it with the previous results.

Time to get our hands dirty.

Photo by SAMS Solutions on Unsplash

Dataset preparation

The previous code for dataset preparation needed some tweaks to be adapted to the chat-completion API:

  1. Fewer examples are needed now. Previously, I used GPT-4 to make two variations of each node, matched with four different prompts. That was a bit of a mess. This time, I can stick with just the original node and one prompt.
  2. No need for special separator characters to distinguish between the prompt and the completion.
  3. A system role will be added to all examples. In the last version, I compared 3 versions of system roles (aka “prompts”) of varying length.
    For now, I’ll use the longest and most detailed version, since in the training phase we optimize for quality rather than cost and latency.
  4. Validation examples aren’t required this time, which means I can use all 200 examples as training data.

Here’s the new script, and here’s the diff from the previous one.

The next step is to convert the resulting JSON into a JSONL file, the format OpenAI expects: each line is a standalone JSON object, with no enclosing array. I noticed they removed any hints about how to create the JSONL file and no longer suggest their CLI tool. Luckily, I found an online tool that does the job just fine.
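If you’d rather skip the online tool, the conversion itself is tiny. Here’s a minimal sketch, assuming the prepared examples live in a dataset.json array (the file names here are mine, not from the original script):

import fs from "fs";

// JSONL = one standalone JSON object per line, with no enclosing array
const examples: object[] = JSON.parse(fs.readFileSync("dataset.json", "utf-8"));
const jsonl = examples.map((example) => JSON.stringify(example)).join("\n");
fs.writeFileSync("dataset-cc.jsonl", jsonl);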

Now that I had my JSONL file ready, I validated it using OpenAI’s suggested data-formatting script: https://platform.openai.com/docs/guides/fine-tuning/check-data-formatting.

This is the output of running it with my dataset:

Num examples missing system message: 0
Num examples missing user message: 0

#### Distribution of num_messages_per_example:
min / max: 3, 3
mean / median: 3.0, 3.0
p5 / p95: 3.0, 3.0

#### Distribution of num_total_tokens_per_example:
min / max: 1052, 1291
mean / median: 1084.8, 1070.0
p5 / p95: 1058.0, 1127.0

#### Distribution of num_assistant_tokens_per_example:
min / max: 19, 255
mean / median: 50.29333333333334, 36.0
p5 / p95: 24.0, 87.39999999999998

0 examples may be over the 4096 token limit, they will be truncated during fine-tuning
Dataset has ~162720 tokens that will be charged for during training
By default, you'll train for 3 epochs on this dataset
By default, you'll be charged for ~488160 tokens
See pricing page to estimate total costs

It looks good, and will cost only about $3.90 (~488,160 tokens × $0.0080 per 1K tokens) at the price at the time of writing.

Creating the fine-tuned model

This should be pretty straightforward. One difference is that the creation process is done purely programmatically, unlike legacy fine-tuning, which also supported OpenAI’s CLI tool.

Note: If you’re also using the TypeScript SDK, make sure to update it to version 4 to gain access to the new fine-tuning API.

First, I had to upload the JSONL file:

import fs from "fs";
import OpenAI from "openai";

// the client reads OPENAI_API_KEY from the environment
const openai = new OpenAI();

// upload the JSONL dataset so the fine-tuning job can reference it
const file = await openai.files.create({
  file: fs.createReadStream("dataset-cc.jsonl"),
  purpose: "fine-tune",
});

console.log("File:", file);

Then, using the file ID from the previous step, I started the fine-tuning job:

// start the fine-tuning job, referencing the uploaded file's ID
const fineTune = await openai.fineTuning.jobs.create({
  training_file: "file-e2S8ILDEdmthfF8TZ5tPh5WR",
  model: "gpt-3.5-turbo",
  suffix: "flyde-23-08-27",
});
console.log("Fine-tune:", fineTune);

Heads up: I initially ran the script all at once and it failed. Why? OpenAI was still processing the file. Sure, you could poll the status and proceed, but that’s overkill for this project.
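If you did want to wait programmatically, a minimal sketch would be to poll the uploaded file until it’s processed (assuming the file object’s status field, which at the time reported “uploaded”, “processed”, or “error”):

// hypothetical wait loop: re-check the uploaded file every few seconds
let fileStatus = "uploaded";
while (fileStatus !== "processed") {
  await new Promise((resolve) => setTimeout(resolve, 5000));
  fileStatus = (await openai.files.retrieve(file.id)).status;
}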

This was the response from the job-creation call:

{
  object: 'fine_tuning.job',
  id: 'ftjob-qDUjcaL5jei4j1VnyUKbY6Kl',
  model: 'gpt-3.5-turbo-0613',
  created_at: 1693138636,
  finished_at: null,
  fine_tuned_model: null,
  organization_id: 'org-1l7LD0Bz2S293jhft0POo9FM',
  result_files: [],
  status: 'created',
  validation_file: null,
  training_file: 'file-e2S8ILDEdmthfF8TZ5tPh5WR',
  hyperparameters: { n_epochs: 3 },
  trained_tokens: null
}

Being both lazy and eager, I wrote a little script that polls the job every minute and notifies me (using node-notifier) if something changes:

import notifier from "node-notifier";

let lastData = "";
while (true) {
  const job = await openai.fineTuning.jobs.retrieve(
    "ftjob-qDUjcaL5jei4j1VnyUKbY6Kl"
  );
  const strJob = JSON.stringify(job);
  // only notify when something about the job actually changes
  if (lastData !== strJob) {
    console.log(job);
    notifier.notify({ message: "Something changed in the fine-tune job!" });
  }
  lastData = strJob;
  // wait a minute before polling again
  await new Promise((resolve) => setTimeout(resolve, 60000));
}
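One refinement worth adding, which my original script skips: break out of the loop once the job reaches a terminal state and grab the resulting model name, e.g. inside the loop:

// hypothetical addition: stop polling once the job is done
if (["succeeded", "failed", "cancelled"].includes(job.status)) {
  console.log("Final status:", job.status, "| model:", job.fine_tuned_model);
  break;
}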

And guess what? Exactly 17 minutes later (!!), a wild notification appeared on my desktop. I checked the terminal and saw it succeeded.

I quickly rushed to write a small script to test the fine-tuned model:

// the fine-tuned model is used exactly like any other chat model
const completion = await openai.chat.completions.create({
  messages: [
    { role: "user", content: "create a node that adds 2 numbers" },
  ],
  model: "ft:gpt-3.5-turbo-0613:personal:flyde-23-08-27:7s9Gy7SR",
});

Only to receive a very generic response that has nothing to do with Flyde.

I guess I was too optimistic thinking fine-tuning reduces the need for a “system” role. I tried again with a very short system role:

[
  { role: "system", content: "you create flyde code nodes" },
  { role: "user", content: "create a node add 2 numbers" }
]

And also nothing, very generic. Not good.

My hopes of not needing a long system role and reducing code generation latency for Flyde dropped sharply.

Me, after trying the fine-tuned model: an illustration. Photo by Arash Payam on Unsplash

I decided to continue with the original plan. In the last attempt, I used 3 variations of system roles for testing GPT-3.5 and GPT-4. To truly test the effect of the fine-tuning process, I added 3 new, shorter variations.

This gives a glimpse of how short the system role can be without hurting quality.

Next, I generated the benchmark data and ran the GPT-4-based “judge” on it, and ended up with a CSV full of interesting data. A friendly reminder: the previous post describes this process in detail.
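For context, the “judge” is essentially just another chat completion call against GPT-4. A simplified sketch (the wording and variable names here are made up; the real prompt lives in the repo):

// hypothetical judging call: ask GPT-4 to score a generated node from 1 to 5
const judgement = await openai.chat.completions.create({
  model: "gpt-4",
  messages: [
    {
      role: "system",
      content: "You review Flyde code nodes. Score the following node from 1 to 5.",
    },
    // generatedNodeCode is a placeholder for the output being evaluated
    { role: "user", content: generatedNodeCode },
  ],
});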

Some insights

Before trying to reach a clear verdict, here are some patterns and insights I noticed:

GPT-3.5 became much faster
The first thing I observed is totally unrelated to fine-tuning: the chat completion API is muuuch faster, around 300% faster:

S, M, and L are the sizes of the prompt

Fine-tuned GPT-3.5 is faster than the base one

The data is the average completion time, in seconds.

The first row taking longer must have been a network hiccup.

Fine-tuned GPT-3.5 is expensive
The second thing worth mentioning is that using fine-tuned GPT-3.5 models is roughly 8x more expensive than using the base model: base GPT-3.5 costs $0.0015 and $0.002 per 1K input/output tokens respectively, while a fine-tuned GPT-3.5 model costs $0.012/$0.016.
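A rough back-of-the-envelope example, assuming ~1,000 input tokens and ~50 output tokens per generated node (in line with the dataset stats above):

Base GPT-3.5:       1,000 × $0.0015/1K + 50 × $0.002/1K ≈ $0.0016 (~0.16¢)
Fine-tuned GPT-3.5: 1,000 × $0.012/1K  + 50 × $0.016/1K ≈ $0.0128 (~1.28¢)

That’s about 8x per request, which lines up with the roughly 7x difference observed on the actual dataset below.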

Here are the price differences for our dataset (in cents).

Using a fine-tuned model was roughly 7 times more expensive than the base one.

Fine-tuning !== no system role
One huge misunderstanding I had when starting this article was that I’d get to skip the system role when using fine-tuned models. Well, this benchmark shows that’s not the case.
While the fine-tuned model performs much better with fewer instructions, you can’t skip important details. In my case, a prompt with zero code examples failed to generate proper code.
However, when given just a few examples and minimal instructions, the fine-tuned model scored 20% better than its counterpart.

On the other hand, when the system role content size increased, the difference in quality became insignificant:

Score from 1–5 as generated by a GPT-4 based judge

Conclusion

A quick reminder: 6 prompt variations were tested against a fine-tuned GPT-3.5 model and the base GPT-3.5 model. The prompts varied in instruction depth and example count. The full prompts are available on GitHub. Each model+prompt combination was used to generate 17 Flyde code nodes. A GPT-4-based “judge” was used to score the quality of the output.

Following the last insight from the previous section, the most minimal version of the prompt (S0) failed to produce a single working node. Therefore, it was removed from the candidates.

Here are the winners in each category:

Latency
All candidates averaged 1.5–3s to complete, with p80 times of 2–3.7s, and as mentioned above, the fine-tuned models were 50%+ faster than their base counterparts.
Winner: FT CC 3.5: S, with a 2s average and a 1.98s p80.

Costs
The cost to generate a node ranged from 0.07¢ to 1.54¢ (or $0.0007 to $0.0154 if you like zeros). As mentioned above, fine-tuned models are more expensive.

Winner: CC 3.5: S2, with an average cost of 0.07 cents per node. That’s more than 1,000 nodes per dollar!

Score
Surprisingly, the shorter prompts performed best here. Both the fine-tuned and the base model reached an average of 4.76/5 with a short prompt accompanied by the full list of examples, so I used the p10 metric to break the tie. The fine-tuned model aced the 10th percentile as well, with a whopping 5/5, while the base model reached 4/5.
Winner: FT CC 3.5: S

Overall Best Performer

Again, OpenAI managed to surprise me. Fine-tuned models showed better latency and slightly better quality, but that came with a much higher price tag.
So, bottom line, fine-tuned models aren’t a fit for my use case.

The real surprise was that since I last benchmarked GPT-3.5, it has become faster and cheaper, and produces better results. On the last attempt, using the same test data, the short-prompt version (CC 3.5: S) scored poorly on quality: only 2.35/5. On this attempt, it reached the top of the leaderboard!

So, full of mystery and awe for the inner workings of LLMs in general, and the magical labs of OpenAI in particular, I am happy to declare the winner:

Overall winner: the base GPT-3.5 model with a short prompt, aka CC 3.5: S!

“CC 3.5: S”, an illustration. Photo by Museums Victoria on Unsplash

It produces the best-quality results and is relatively fast and cheap, making it a no-brainer pick for Flyde’s AI feature!

The full results can be viewed here.

Summary

That was an interesting ride. I admit I was surprised by the results. I truly hoped fine-tuning would prove a better overall alternative. The main surprise, though, was the pace at which OpenAI improved the latency of its existing models while reducing their costs.

And don’t get me wrong: I’m pretty sure there are plenty of use cases where fine-tuning GPT-3.5 is the ideal choice. For example, cases where the 4K (or even 16K) context limit isn’t enough to fit all the necessary instructions and examples in the prompt, or where shaving 1–2 seconds of latency is critical.

I’m also sure there are dozens of things I could have done better that would have helped fine-tuning shine brighter. If you have any suggestions, please let me know in the comments!

Thanks a lot for reading, and stay tuned; I promise to continue this journey with the next OpenAI advancement.

If you found this article useful, please consider giving us a star on our GitHub repository. Your support really makes a difference. ⭐️.

Have questions or insights to share? Leave a comment below. I value your feedback and it could influence future articles.


Gabriel Grinberg

Building https://www.flyde.dev, a batteries-included, open-source, visual, flow-based programming tool that fully integrates with existing developer workflows