Behind the Hype: Models based on T5 (2019) Still Better than Vicuna, Alpaca, MPT, and Dolly

A new study shows that there hasn’t been much progress behind the recent surge of chat models.

2 min read · Jun 14, 2023

A research team from Alibaba and the Singapore University of Technology and Design recently released a new leaderboard for instruction-tuned large language models (LLMs):

Leaderboard
Scientific paper: INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (Chia et al., 2023)
GitHub

The recently released chat models, such as Vicuna, Alpaca, Dolly, and ChatGPT, all belong to this class of models.

The results on benchmarks for “problem solving” are very interesting:

Source: https://declare-lab.net/instruct-eval/ (June 14th, 2023)

ChatGPT is the best on average. But look at the 3rd rank and you’ll see “Flan-T5”: a model obtained by fine-tuning T5, a base model released in 2019, with instructions.

Flan-T5 outperforms all the LLaMA- and OPT-based models, which have billions more parameters.

This is the first time we see such a comparison, because recently published chat models are usually only compared with other recent ones, e.g., Vicuna versus Alpaca.

Written by Benjamin Marie