Behind the Hype: Models based on T5 (2019) Still Better than Vicuna, Alpaca, MPT, and Dolly
A new study shows that there hasn’t been much progress behind the recent surge of chat models.
A research team from Alibaba and the Singapore University of Technology and Design has recently released a new leaderboard for instruction-tuned large language models (LLMs):
- Leaderboard
- Scientific paper: INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (Chia et al., 2023)
- GitHub
The recently released chat models all belong to this class: Vicuna, Alpaca, Dolly, and ChatGPT.
The results on benchmarks for “problem solving” are very interesting:
ChatGPT is the best on average. But if you look at the 3rd rank, you’ll see “Flan-T5”: a base model (T5) that was released in 2019 and then fine-tuned with instructions to become Flan-T5.
Flan-T5 outperforms all the LLaMA- and OPT-based models, which have billions more parameters.
This is the first time we see such a comparison, because recently published chat models are usually only compared to other recent ones, e.g., Vicuna versus Alpaca.