An analysis of Italian LLM evaluations

Alessandro Ercolani
Apr 1, 2024


Samuele Colombo and I are the maintainers of the Italian Leaderboard, and I have contributed to lm-evaluation-harness by adding evaluation tasks for different languages, mainly driven by our interest in Italian LLMs, as in this PR. As part of the releases above we have evaluated many different Italian open-source models on different tasks. From all this experimentation we have collected many data points and conducted a simple exploratory analysis. In this article we share the data and some interesting findings.

List of metrics used in the evals:

  • HellaSwag: Evaluates how well an LLM can complete a sentence https://rowanzellers.com/hellaswag/
  • MMLU: Massive Multitask Language Understanding evaluates the LLM's knowledge across a wide range of subjects https://github.com/hendrycks/test
  • ARC-c: the Challenge subset of the AI2 Reasoning Challenge (ARC), a large-scale collection of multiple-choice questions that require reasoning and commonsense knowledge to answer.
  • Belebele: is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset
  • Lambada: LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) is a benchmark whose task is very similar to language modeling. The assignment is to recover a missing word from a portion of text, where the missing word is always the last word of its sentence.
  • Xcopa: Cross-lingual Choice of Plausible Alternatives (XCOPA), a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.

Command used to reproduce the data:

Zero-shot:

lm_eval --model hf --model_args pretrained=YOURHUGGINGFACEMODEL --tasks xcopa_it,hellaswag_it,lambada_openai_mt_it,belebele_ita_Latn,arc_it --device cuda:0 --batch_size 8

Only mmlu_it, with 5-shot:

lm_eval --model hf --model_args pretrained=YOURHUGGINGFACEMODEL --tasks mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 8
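
For completeness, below is a minimal sketch of the same runs through the library's Python API. It assumes a recent lm-evaluation-harness release where simple_evaluate is exposed at the package level; YOURHUGGINGFACEMODEL is a placeholder exactly as in the commands above.

# Sketch: the same evaluations via the lm-evaluation-harness Python API
# (assumes a recent version of the library exposing simple_evaluate).
import json

import lm_eval

# Zero-shot tasks
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOURHUGGINGFACEMODEL",
    tasks=["xcopa_it", "hellaswag_it", "lambada_openai_mt_it",
           "belebele_ita_Latn", "arc_it"],
    device="cuda:0",
    batch_size=8,
)

# mmlu_it with 5-shot
mmlu_results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=YOURHUGGINGFACEMODEL",
    tasks=["mmlu_it"],
    num_fewshot=5,
    device="cuda:0",
    batch_size=8,
)

print(json.dumps(results["results"], indent=2, default=str))
print(json.dumps(mmlu_results["results"], indent=2, default=str))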

Data Analysis

The data can be viewed and downloaded from this gsheet, and the simple analyses and visualizations were produced in this colab.

Model rankings

In the chart below, the models are ordered by their average across all evaluation metrics, excluding perplexity:

In the chart below, the models are ordered by the average of all evaluation metrics, with perplexity normalized and included as

normalized_perplexity = (100 - perplexity) / 100:
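
As a reference, here is a minimal sketch of how these two averages can be computed from a CSV export of the gsheet with pandas; the file name and column names are assumptions for illustration, not the actual sheet headers.

# Sketch of the ranking computation (file and column names are hypothetical;
# accuracies are assumed to be stored as fractions in [0, 1]).
import pandas as pd

df = pd.read_csv("italian_llm_evals.csv")  # local CSV export of the gsheet

accuracy_cols = ["mmlu_it", "arc_it", "hellaswag_it",
                 "belebele_ita_Latn", "xcopa_it", "lambada_openai_mt_it_acc"]

# Average over all evaluation metrics, excluding perplexity
df["avg_no_perplexity"] = df[accuracy_cols].mean(axis=1)

# Normalize perplexity onto a "higher is better" 0-1 scale: (100 - perp) / 100
df["normalized_perplexity"] = (100 - df["lambada_openai_mt_it_perplexity"]) / 100

# Average including the normalized perplexity
df["avg_with_perplexity"] = df[accuracy_cols + ["normalized_perplexity"]].mean(axis=1)

print(df.sort_values("avg_with_perplexity", ascending=False)[
    ["model", "avg_no_perplexity", "avg_with_perplexity"]])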

Rankings on individual metrics

MMLU_IT (5-shot) is never improved

This is a very interesting finding: no model improves on mmlu_it compared to the mistral-7B-v0.1 base model. It seems that none of the models fine-tuned with the different strategies (continual pre-training, SFT or DPO) is able to improve on this metric. I suspect that deep task-specific knowledge is forgotten when the model is updated with language-specific knowledge: the more broad language knowledge is added, the less capable the model becomes on specific tasks.

Belebele

Maestrale works very well on this task. It is also interesting that Saiga-7B, a merged model, performs well here.

Hellaswag

The maestrale series has very strong performance on hellaswag, getting close to LLaMA-70B, a roughly 10x larger model with about 70% accuracy, and to Mixtral-8x7B, which scores about 75% on this task.

Lambada Perplexity

Perplexity measures how surprised the model is when it sees new data: the lower the perplexity, the better the training. It is interesting that SFT models seem to work better than DPO models, and also, looking at the maestrale versions, that more SFT leads to worse perplexity.
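
As a rough illustration of the metric itself (not the harness's implementation), perplexity can be computed from a causal LM's mean negative log-likelihood per token; the model name and the Italian sentence below are just placeholders.

# Minimal perplexity sketch: perplexity = exp(mean negative log-likelihood per token).
# The lower the value, the less "surprised" the model is by the text.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # any causal LM works; swap in a smaller one for a quick test
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

text = "Il gatto dorme sul divano tutto il pomeriggio."
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean cross-entropy
    # loss, i.e. the mean negative log-likelihood per predicted token.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity = {math.exp(loss.item()):.2f}")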

Lambada openai

The zefiro trilogy performs well on this benchmark.

XCopa

The maestrale series is very strong on this evaluation.

Arc-c

Saiga-7B, a merged model, is the best model on this task. It could be valuable to understand why it is so strong on this metric compared to mistral-7B.

Below, all models are compared on all tasks:

The zefiro trilogy

The zefiro trilogy, its training strategies and its datasets are described in depth in this article. As noted above, and as shown in the chart below, mmlu_it is the only metric not improved by applying continual pre-training, SFT and DPO. In the majority of the other evaluations, every training strategy (continual pre-training, SFT, DPO) seems to improve the LLM's capabilities. On average, with respect to the base mistral-7B-v0.1 model, there is a gain of about 5 percentage points, from 45% to 50%. In my opinion this is a good indicator that models can be improved on language-specific tasks, and I suspect that with more training on more data it is possible to improve a lot more. Another interesting finding is that on many metrics there is a continuous improvement going from continual pre-training to SFT to DPO.

Maestrale series

Maestrale is a series of very strong LLMs on Italian tasks and is among the best on many metrics. The data suggests that from version 0.3-alpha to 0.3-beta there has been, in some cases, a small degradation in performance, in particular on the perplexity metric, which could be interesting to understand and discuss. In any case it performs very well, being the best LLM on many evals and on average.

Conclusion

Fine-tuned open-source models trained on single GPUs have been able to improve on the foundational base model by about 5 percentage points on average across the evaluation tasks considered in this article. With more training time on more data this number can be improved a lot. Already, on important metrics such as arc-c and hellaswag, a 7-billion-parameter model specialized on a specific language can get very close to 10x bigger open-source models such as LLaMA-70B and Mixtral-8x7B. MMLU remains a difficult metric to improve. Evaluation will become a central part of the LLM ecosystem: LLMs can be specialized in different directions at the same time, many more specialized evaluation datasets and benchmarks will be born, and every LLM has peculiarities to be discovered.
