David vs. Goliath in AI

Can Fine-Tuned Smaller Models like XLNet, DistilBERT and T5 Compete with Large Models for Classification Tasks?

Ian
Published in Sarus Blog
4 min read · Jun 12, 2024


Introduction

Large Language Models (LLMs) have revolutionized NLP, with tech giants developing trillion-parameter models that excel across many applications. However, their broad applicability can lead them to underperform on specific tasks when used with only few-shot learning.

In contrast, smaller fine-tuned models have a much narrower scope than LLMs and can outperform them on specific tasks. This post compares the performance of smaller models fine-tuned for a specific task against large models used with few-shot learning. For example, a DistilBERT model fine-tuned for sentiment analysis of Twitter posts can outperform LLMs like GPT-4.

Experiment Setup

To test this hypothesis, we conducted an experiment evaluating fine-tuned models for text classification. We compare three small models — XLNet, DistilBERT, and T5-small (117M, 67M, and 60.5M parameters, respectively) — across three datasets: Yelp-5, nuSentiment, and Medical-Transcript-40.

Using HuggingFace’s library, we selected models pre-trained for text classification, and datasets with 3, 5, and 40 classes. Fine-tuning was straightforward, following standard training procedures. We used accuracy and AUROC as metrics, training models on both full and quarter-sized datasets. Claude-3-Opus was used for comparison, prompted via its API.
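
For illustration, here is a minimal sketch of the kind of fine-tuning loop this setup implies, using the HuggingFace Transformers Trainer. The model and dataset names are stand-ins (DistilBERT on the public Yelp review dataset), not our exact training script, and the hyperparameters are placeholders.

```python
# Minimal fine-tuning sketch with HuggingFace Transformers (illustrative, not our exact script).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"      # swap for e.g. xlnet-base-cased
dataset = load_dataset("yelp_review_full")  # stand-in 5-class dataset

tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    # Truncate/pad reviews to a fixed length for batching.
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=5)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    per_device_train_batch_size=16,  # reduced for datasets with longer sequences
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
print(trainer.evaluate())
```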

Results

Testing Metrics

The models trained on the nuSentiment dataset showed high testing accuracy, outperforming Claude-3 by a large margin. The AUROC scores for the three models were all above, or very nearly at, 0.9, and the models trained on a quarter of the dataset did not perform significantly worse than the fully trained ones.
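
For reference, multi-class AUROC can be computed from the models' predicted class probabilities. Below is a minimal sketch using scikit-learn; the function and variable names are illustrative, not taken from our evaluation code.

```python
# Sketch: accuracy and multi-class AUROC from model outputs (illustrative names).
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate(probs: np.ndarray, labels: np.ndarray) -> dict:
    """probs: (n_samples, n_classes) softmax outputs; labels: (n_samples,) integer classes."""
    preds = probs.argmax(axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        # One-vs-rest AUROC averaged over classes, the usual choice for multi-class problems.
        "auroc": roc_auc_score(labels, probs, multi_class="ovr", average="macro"),
    }
```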

For Yelp-5, models approached state-of-the-art accuracy (73%), although more training on the full dataset would have improved results. Still, Claude-3’s prompting performance was markedly lower.

As for Medical-Transcript-40, only XLNet was trained (on the entire dataset) because the other models do not support a long enough maximum sequence length. It achieved 72.4% accuracy and a 0.97 AUROC score, significantly better than Claude-3's 59% accuracy. XLNet's high accuracy can be attributed to the detailed medical text, from which the model can extract a great deal of information.
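
One quick way to see this constraint is to inspect each tokenizer's reported maximum sequence length (an illustrative check, not part of our pipeline): XLNet's relative positional encodings let it handle longer inputs than the fixed 512-token limits reported by the other two models.

```python
# Quick check of reported maximum sequence lengths (illustrative).
from transformers import AutoTokenizer

for name in ["xlnet-base-cased", "distilbert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    # XLNet reports no fixed limit (relative position encodings);
    # DistilBERT and T5-small report 512 tokens.
    print(name, tok.model_max_length)
```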

Training Times

Models trained on nuSentiment and Medical-Transcript-40 took under 4 hours, while models trained on Yelp-5 took much longer. Compared to nuSentiment, the Yelp-5 dataset has 2.5x as many examples, with an average sequence length 4.5x longer (roughly 11x the token volume), which accounts for some of the discrepancy.

The nuSentiment training times were reasonable, but the Yelp-5 training times were concerning: the larger dataset and longer sequences forced us to use smaller batch sizes, which slowed training further.

Cost analysis revealed that fine-tuning plus inference with the smaller models was cheaper than using Claude-3's API. For nuSentiment, fine-tuning our most expensive model cost ~€15. The VM we used is priced at €5/hr, and the minimum evaluation speed was 50,000 tokens/sec, equating to roughly €0.03 per million tokens.

Meanwhile, Claude-3-Opus is priced at €7 per million input tokens.

Thus, it can be over 200 times cheaper to run inference on smaller models than to use an LLM's API.
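
To make the arithmetic explicit, here is a small back-of-the-envelope calculation reproducing these figures (all numbers are taken from the paragraphs above, not measured in this snippet).

```python
# Back-of-the-envelope cost comparison (figures from the post).
vm_price_per_hour = 5.0      # € per hour for the VM
tokens_per_second = 50_000   # minimum observed evaluation speed

cost_per_million_small = vm_price_per_hour / (tokens_per_second * 3600) * 1_000_000
claude_cost_per_million = 7.0  # € per million input tokens (Claude-3-Opus)

print(f"Small model inference: €{cost_per_million_small:.3f} / million tokens")  # ~€0.028
print(f"Claude-3-Opus input:   €{claude_cost_per_million:.2f} / million tokens")
print(f"Ratio: ~{claude_cost_per_million / cost_per_million_small:.0f}x cheaper")  # ~250x
```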

Conclusion

Our study suggests that fine-tuned smaller classification models can be just as effective as, if not more effective than, general-purpose LLMs used with few-shot learning, while also being more cost-efficient. Compared to Claude-3-Opus, our models showed better performance on nuSentiment, near state-of-the-art accuracy on Yelp-5, and substantially better accuracy on Medical-Transcript-40.

Generalizability should be assessed individually based on the task, but our findings reinforce the potential of using fine-tuned smaller models instead of LLMs for certain applications.

It should be noted that prompt engineering matters greatly, and it is possible that Claude's results could have been better. However, genuine effort was put into refining the prompts to improve Claude's results.
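
For context, the comparison relied on few-shot prompting through the Anthropic API. The snippet below is purely illustrative of that style of prompt; the example reviews and wording are hypothetical and not our actual prompts.

```python
# Illustrative few-shot classification prompt via the Anthropic API (not our actual prompt).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

prompt = (
    "Classify the sentiment of the review as one of: negative, neutral, positive.\n\n"
    "Review: The food was cold and the staff ignored us.\nSentiment: negative\n\n"
    "Review: Best ramen I've had in years!\nSentiment: positive\n\n"
    "Review: The pasta was amazing but the service was slow.\nSentiment:"
)

response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=5,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text.strip())
```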

This post is one in a series on AI and privacy: how to use AI, and in particular commercial LLMs (for in-context learning, RAG, or fine-tuning), with privacy guarantees, but also how AI and LLMs can help us solve privacy challenges. If you are interested in learning more about existing AI-with-privacy solutions, contact us and try our open-source framework: Arena (WIP).
