Learn Fine Tuning: Making GPT-3.5 Better at Recognizing Sentiment
It is important to understand the limitations of large language models (LLMs) and where fine tuning can create value for us as software developers. By experimenting with prompt engineering in ChatGPT, you can see what GPT models are not great at doing. For example, you'll notice that the base model is unreliable for factual, mathematical, or analytical tasks due to its tendency to "hallucinate" information. This has led developers to create fine tuned models by training base models on curated data that meets their needs.
While this article will go through an example of how to create a fine tuned model, it will not go through the hardest part in great detail: recognizing when fine tuning is applicable. For the purposes of this article, we will focus on one task to fine tune our model on: recognizing sentiment in text.
The article uses the following programming language and tools:
- Install Node.js
- Install the OpenAI client (npm or yarn)
- Create an OpenAI secret key
Sentiment Recognition
Recognizing whether a paragraph is positive, negative, or neutral can be useful in industries ranging from data analytics and marketing to customer support. Before we discuss fine tuning a model, let's see how good the GPT-3.5 base model is at sentiment recognition with a simple prompt:
Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral.
Positive message:
- Remember that each day is a new opportunity to embrace the beauty of life. You’ve got the strength and resilience to overcome any challenge that comes your way. Keep shining brightly and making the world a better place with your presence!
Negative message:
- I have to be honest, I’m just not enjoying this movie. The plot seems all over the place, and the characters are hard to relate to. I was really hoping for a great cinematic experience, but it’s just not living up to my expectations.
Neutral message:
- This is a cherry pie.
Try these out in ChatGPT and see what kind of responses you get. In my case, I got messages like the following:
The message you provided sounds very [positive, negative, neutral]. [Some explanation].
For the purpose of our exercise, we want the model to return only one of the words "Positive", "Negative", or "Neutral", without the explanations. Of course, we could adjust our prompt to achieve this, but let's see if we can achieve it through fine tuning instead.
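If you'd rather script this check than paste prompts into ChatGPT, the same experiment can be run through the chat completions API. Here's a minimal sketch with the Node.js client; the buildMessages and classify helpers are names I've made up for illustration:

```javascript
const SYSTEM_PROMPT =
  "Assistant takes inputs from users and determines if the message sounds " +
  "positive, negative, or neutral.";

// Build the two-message conversation sent for a single classification.
function buildMessages(userText) {
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: userText },
  ];
}

// Ask the base model for a sentiment judgment on one piece of text.
async function classify(text) {
  const OpenAI = require("openai"); // lazy require so buildMessages works without the package installed
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const response = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: buildMessages(text),
  });
  return response.choices[0].message.content;
}

// Only call the API when a key is configured in the environment.
if (process.env.OPENAI_API_KEY) {
  classify("This is a cherry pie.").then(console.log);
}
```

Running this a few times with the example messages above is a quick way to see how inconsistent the base model's phrasing is.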
Training Dataset
The training dataset is probably the most important aspect of fine tuning a model. Fine tuning requires at least 10 example conversations, and the process takes the whole conversation into account as context, with the last assistant message being the completion the model will attempt to learn. For sentiment analysis, our conversations can be simple: one system message, one user message, and one assistant message.
Here's an example of one conversation; the full set used for training can be found here:
{
  "messages": [
    {
      "role": "system",
      "content": "Assistant takes inputs from users and determines if the message sounds positive, negative, or neutral."
    },
    {
      "role": "user",
      "content": "The weather today is absolutely beautiful! The sun is shining, and there's a gentle breeze."
    },
    { "role": "assistant", "content": "Positive" }
  ]
}
The system message is always consistent, while the user and assistant messages should be checked to ensure each conversation progresses as we expect it to. Once we've verified the dataset is accurate (OpenAI provides a guide with a script to verify datasets here), each conversation is added to a JSONL file for us to upload and start the fine tuning job.
Uploading the Training File
OpenAI expects you to upload the dataset to their platform with the purpose tag of fine-tune. This can be done by using the client to upload the file with the code below.
const OpenAI = require("openai");
const fs = require("fs");

const client = new OpenAI({ apiKey: "Your Secret" });

client.files.create({
  file: fs.createReadStream("./PATH_TO_DATASET.jsonl"),
  purpose: "fine-tune",
}).then((data) => {
  console.log(data);
});
Once the request is completed, the returned data will look like the following; we want to grab the file ID for the next step:
{
"object": "file",
"id": "file-UNIQUE_FILE_ID",
"purpose": "fine-tune",
"filename": "training_set.jsonl",
"bytes": 17962,
"created_at": 1697840660,
"status": "uploaded",
"status_details": null
}
Creating the Fine Tuning Job
Once your training file has been uploaded, you can create the fine tuning job. The suffix will make it easier for us to identify the fine tuned model on the OpenAI platform.
client.fineTuning.jobs.create({
model: "gpt-3.5-turbo",
training_file: "file-UNIQUE_FILE_ID",
hyperparameters: {
n_epochs: "auto",
},
suffix: "sentiment"
}).then((data) => {
console.log(data)
});
Now we just need to wait for the fine tuning to complete, which we can do by checking the events endpoint or the OpenAI fine tuning dashboard.
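If you'd rather poll from code than watch the dashboard, a loop like the following can wait on the job. The waitForJob helper is a hypothetical sketch; the status values and the fine_tuned_model field come from the job object the API returns:

```javascript
// Poll a fine tuning job until it reaches a terminal status.
// `client` is an OpenAI client instance; `jobId` looks like "ftjob-...".
async function waitForJob(client, jobId, intervalMs = 30000) {
  for (;;) {
    const job = await client.fineTuning.jobs.retrieve(jobId);
    // "succeeded" is terminal; the job object then carries the new model id.
    if (job.status === "succeeded") return job.fine_tuned_model;
    if (job.status === "failed" || job.status === "cancelled") {
      throw new Error(`Fine tuning ended with status: ${job.status}`);
    }
    // Still queued, validating, or running: wait and check again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Once it resolves, the returned model id (with our suffix embedded in it) is what you pass as the model in later requests.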
Identifying the model: [screenshot of the fine tuned model on the dashboard]
Fine tuning events: [screenshot of the job's event log]
Comparing GPT-3.5 to the Fine Tuned Model
Anytime we fine tune a model, we should verify that something meaningful has changed between the base model and the new one. In OpenAI's playground, we can test the differences between GPT-3.5 and our fine tuned model.
Remember to use the same system message as the one you fine tuned on!
Below are examples of the differences you can see between the base model and the fine tuned model. As you can see, the negative messages show a big difference: the base model returns a phrase, whereas the fine tuned model returns "Negative". Tested 10 times, the base model produced a variety of responses between "Negative", "negative", and some variation of "This message sounds negative".
The neutral example was quite interesting. The base model repeatedly responded with "Positive" (cherry pie is always positive, but the sentence isn't the most expressive), whereas the fine tuned model marked it as "Neutral". Again, repeated about 10 times, the fine tuned model was consistent while the base model varied between "Positive", "Neutral", and some variation of "This message is [Positive, Neutral]".
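A repeated test like this is easy to script. The tallyResponses helper below is a hypothetical sketch that sends the same message several times to one model and counts the distinct replies, which makes the consistency gap between the two models easy to see:

```javascript
// Send the same message `runs` times to one model and count each distinct reply.
// `client` is an OpenAI client instance; `model` is either "gpt-3.5-turbo"
// or your fine tuned model id.
async function tallyResponses(client, model, text, runs = 10) {
  const counts = {};
  for (let i = 0; i < runs; i += 1) {
    const response = await client.chat.completions.create({
      model,
      messages: [
        {
          role: "system",
          content:
            "Assistant takes inputs from users and determines if the message " +
            "sounds positive, negative, or neutral.",
        },
        { role: "user", content: text },
      ],
    });
    const reply = response.choices[0].message.content;
    counts[reply] = (counts[reply] || 0) + 1;
  }
  return counts;
}
```

Running it against both models with the cherry pie message should show one entry for the fine tuned model and a spread of phrasings for the base model.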
Conclusion
Fine tuning a model can be incredibly useful if you identify the right use cases, but there are factors to consider before jumping to fine tuning as the solution. Here are a few questions to ask before fine tuning for a project.
Can adjusting your prompts achieve the same thing? The following prompt very consistently makes the LLM respond with only one word:
Assistant takes inputs from users and determines the sentiment of the message and returns one of the following words: “Positive”, “Negative”, or “Neutral”.
Is GPT-3.5 (or even GPT-4) already good at what you want it to achieve? In our case, we saw that it had trouble recognizing neutral statements. That alone may be a valid reason to fine tune it further.
Is the cost worth it? Using a fine tuned model is up to 7x more expensive than the base model, plus the cost of fine tuning itself. Can you achieve your goal with a longer prompt while still keeping costs lower?
There are many other factors to consider before fine tuning a model, and I have found that identifying good problems where fine tuning is warranted is the most difficult one to solve. I encourage you to do a deeper dive into prompt engineering to learn when fine tuning will be beneficial, something I am still learning myself.