How do fine-tuned LLMs compete with commercial models?

Johan Leduc
Published in Sarus Blog
3 min read · Jan 23, 2024

In a previous post, we explored how fine-tuning a large language model (LLM) compares with using a commercial service like OpenAI’s ChatGPT, considering factors such as cost and privacy. In this post, we examine the models’ performance on specific tasks more closely, focusing first on narrow tasks like text classification and then on open-ended question answering. You will hopefully gain insight into how well a fine-tuned open-source LLM performs compared to a big generalist model like GPT-4.

Text classification

One of the use cases of LLMs is text classification. Large language models build powerful representations of text in order to predict the next words. These representations carry rich meaning, which makes them good inputs for a text classifier. Alternatively, one can directly ask the LLM to generate the category name, as in the sketch below. This approach has the advantage of preserving the model’s architecture.
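To make the second approach concrete, here is a minimal sketch of asking a model to emit the category name directly. It uses OpenAI’s Python SDK; the model and the category names are illustrative assumptions, not the exact setup of our experiment.

# Minimal sketch: ask the model to output a category name directly.
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["cardiology", "neurology", "oncology"]  # hypothetical labels

def classify(record: str) -> str:
    prompt = (
        "Classify the following medical record into one of these "
        f"categories: {', '.join(CATEGORIES)}.\n\n"
        f"Record: {record}\n\nCategory:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()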

We conduct a small experiment to estimate how good a fine-tuned LLM can be compared to ChatGPT, using a text classification dataset of medical records. Many businesses use LLMs on their own data, which may be very domain-specific; we aim to replicate this situation here.

For each medical record, we ask the model to generate the correct category. Since LLMs can produce arbitrary text, the models sometimes return answers that match no category. In the results below, we show the percentage of valid answers; invalid answers are discarded and the model’s accuracy is computed over valid answers only.
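Concretely, the two metrics reported below can be computed as in this short sketch (function and variable names are ours, for illustration):

# An answer is valid if it matches a known category; accuracy is
# computed over the valid answers only.
def score(predictions, labels, categories):
    valid = [(p, l) for p, l in zip(predictions, labels) if p in categories]
    valid_rate = len(valid) / len(predictions)
    valid_accuracy = (
        sum(p == l for p, l in valid) / len(valid) if valid else 0.0
    )
    return valid_rate, valid_accuracy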

+--------------------------------------+---------------+----------------+
| Model | Valid answers | Valid accuracy |
+--------------------------------------+---------------+----------------+
| GPT 3.5 fine tuned 1 epoch | 100.0% | 60.0% |
| Llama 2 7b fine tuned 1 epoch | 98.0% | 59.0% |
| Llama 2 13b fine tuned 1 epoch | 98.0% | 59.0% |
| GPT 3.5 (1 shot) | 97.0% | 50.0% |
| GPT 4 turbo (1 shot) | 97.0% | 49.0% |
| GPT 4 turbo (0 shot) | 98.0% | 43.0% |
| Vicuna 7b (0 shot) | 79.0% | 42.0% |
| GPT 3.5 (0 shot) | 95.0% | 40.0% |
| Llama 2 13b (0 shot) | 70.0% | 33.0% |
+--------------------------------------+---------------+----------------+

As we can see, big commercial models are not better than a fine-tuned 7-billion-parameter model on this specialized, narrow task. Without fine-tuning, both GPT-4 and GPT-3.5 stall at around 50% accuracy. Fine-tuning GPT-3.5 boosts the accuracy to around 60%, and a fine-tuned Llama 2 with 7 billion parameters achieves a comparable 59%. We sketch below how such a fine-tuning run can be set up.
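We do not reproduce our full training setup here, but a typical way to fine-tune Llama 2 7b on such a task is parameter-efficient fine-tuning with LoRA. The sketch below uses the Hugging Face transformers, datasets and peft libraries; the hyperparameters and the records.jsonl file are illustrative assumptions.

# Sketch: LoRA fine-tuning of Llama 2 7b for one epoch.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Only the low-rank adapter weights are trained.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Each line of records.jsonl would hold a prompt such as
# {"text": "<medical record>\nCategory: <label>"} (hypothetical format).
dataset = load_dataset("json", data_files="records.jsonl")["train"]
dataset = dataset.map(lambda ex: tokenizer(ex["text"], truncation=True))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-medical",
                           num_train_epochs=1,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()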

Open-ended question answering

Next, we experiment with open-ended question answering. We make the models generate answers to the Medical QnA dataset and ask GPT-4 to grade each answer, following the LLM-as-a-judge method, as sketched below.
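The grading step can be sketched as follows; the prompt wording and the judge model name are our assumptions, the idea being simply that GPT-4 returns a grade out of 10 for each answer.

# Sketch: GPT-4 as a judge, grading an answer from 0 to 10.
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> float:
    prompt = (
        "You are grading answers to medical questions. Rate the "
        "following answer from 0 to 10 and reply with the number only.\n\n"
        f"Question: {question}\nAnswer: {answer}\n\nGrade:"
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())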

+-------------------------------+------------------+
| Model                         | Mean GPT-4 grade |
+-------------------------------+------------------+
| GPT 4 turbo                   | 8.87 / 10        |
| GPT 3.5 turbo                 | 7.97 / 10        |
| Dataset reference answers     | 6.82 / 10        |
| GPT 3.5 fine tuned 1 epoch    | 6.28 / 10        |
| Llama 2 7b fine tuned 1 epoch | 4.46 / 10        |
| Llama 2 7b                    | 2.63 / 10        |
| Llama 2 13b                   | 2.23 / 10        |
+-------------------------------+------------------+

As could be expected, GPT-4 excels at answering open questions. Answers from GPT-4 and GPT-3.5 are rated even higher than the dataset’s own reference answers. A fine-tuned model therefore has little hope of performing better than the dataset baseline; as a matter of fact, fine-tuning GPT-3.5 on the dataset degrades its performance. Nonetheless, fine-tuning Llama 2 7b for one epoch boosts its average grade from 2.63 to 4.46.

Conclusion

How commercial models compare with fine-tuned open-source models greatly depends on the task at hand. A fine-tuned open-source model can be a great fit when the task is specialized and narrow; the more complex the task, the better big commercial models fare comparatively. If you need to generate short texts on specialized data, a fine-tuned open-source model can be a cheaper alternative.


Johan Leduc

Senior data scientist at Sarus Technologies. I write about artificial intelligence, synthetic data generation and data privacy.