Making LLMs Speak Non-English Languages

Victor Galindo
4 min read · Nov 20, 2023


Falcon generated with DALL-E3 (Cosmic Dream)

Language models have become increasingly popular, with new open-source models emerging daily. However, most of them focus primarily on English, and there is growing interest in developing and evaluating LLMs in other languages. Aguila-7b, by project-aina, is one such initiative: it uses continual pre-training to extend Falcon-7b into a 7-billion-parameter language model covering Catalan, Spanish, and English. The continual pre-training of Falcon-7b on Spanish and Catalan was done on a single node with 8x 80GB H100 GPUs and took around 320 hours, roughly 2,560 GPU-hours. At runpod.io prices that works out to around 11,000 Euros, so although the cost is far lower than training a foundational model from scratch, it is not insignificant.

At Abzu, we frequently explore the idea of expanding our chat agents’ capabilities to include multiple languages. The critical questions we face include the cost of such an expansion and whether the model would retain its logical skills. Our aim is to delve into these aspects by examining and contrasting the Falcon and Aguila models.

Continual Pre-Training Approach

Continual pre-training lets Aguila reuse the knowledge already encoded in Falcon-7b, so a model can be trained without an enormous corpus of Catalan and Spanish text and at reduced cost. The trade-off is that while further training improves performance in the new languages, it can also degrade capabilities that were not part of the continual pre-training data. We therefore wanted a quick benchmark that quantifies the performance gains and losses of this process. The approach is excellently explained by mapama247 in the Medium post ‘Introducing Ǎguila, a new open-source LLM for Spanish and Catalan’.
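To make the approach concrete, here is a minimal sketch of continual pre-training on top of Falcon-7b with the HuggingFace Trainer. The corpus file, batch sizes, and learning rate are illustrative assumptions, not Aguila’s actual configuration:

```python
# Minimal sketch of continual pre-training on top of Falcon-7b using the
# HuggingFace Trainer. The corpus file, batch sizes, and learning rate are
# illustrative assumptions, not Aguila's actual setup.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Falcon's tokenizer has no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Catalan/Spanish text corpus, one document per line.
corpus = load_dataset("text", data_files={"train": "ca_es_corpus.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="falcon-7b-ca-es",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,  # a low learning rate limits forgetting of English
    num_train_epochs=1,
    bf16=True,
)

# mlm=False gives standard causal-LM labels (inputs shifted by one token).
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```

In practice, training a 7B-parameter model requires multiple GPUs and tooling such as DeepSpeed or FSDP; the point of the sketch is that continual pre-training uses the same next-token objective as the original pre-training, just on new-language data.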

Evaluation and Results

To evaluate Aguila-7b and Falcon-7b, we used the XQuAD dataset, in which the model is given a context and a question and has to generate the answer. Since we are dealing with foundational models, we used a few-shot technique, placing a few solved examples before the question so that the model answers in the expected format.

Here is a sample prompt:

Six-time Grammy winner and Academy Award nominee Lady Gaga performed the national anthem, while Academy Award winner Marlee Matlin provided American Sign Language (ASL) translation.
- -
Question: What did Lady Gaga sing?
Response: the national anthem
- -
Question: Who sang the national anthem?
Response: Lady Gaga
- -
Question: How many Grammys has Lady Gaga won?
Response:
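
Prompts like this can be assembled programmatically. Below is a minimal sketch, assuming the HuggingFace datasets library and the English XQuAD config on the Hub (the Catalan version is a separate release by project-aina):

```python
# Sketch of building a few-shot XQuAD prompt like the one shown above.
# Dataset id and config are assumptions: "xquad"/"xquad.en" is the English
# split on the HuggingFace Hub; XQuAD only ships a "validation" split.
from datasets import load_dataset

xquad = load_dataset("xquad", "xquad.en", split="validation")

def build_prompt(shots, target):
    """Context, then a few solved QA pairs, then the unanswered target question."""
    parts = [target["context"], "- -"]
    for ex in shots:
        parts += [
            f"Question: {ex['question']}",
            f"Response: {ex['answers']['text'][0]}",
            "- -",
        ]
    parts += [f"Question: {target['question']}", "Response:"]
    return "\n".join(parts)

# XQuAD groups several questions per context, so the first rows share a context.
shots, target = [xquad[0], xquad[1]], xquad[2]
print(build_prompt(shots, target))
```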

We compared both models on English and Catalan versions of the dataset.
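Generating an answer for a single prompt then looks roughly like the following sketch. The checkpoint ids are the public Hub names, and cutting the completion at the first newline is our assumption about how to extract the short answer:

```python
# Sketch of querying one of the models with the few-shot prompt built above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "tiiuae/falcon-7b"  # or "projecte-aina/aguila-7b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = build_prompt(shots, target)  # from the sketch above
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        **inputs, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id
    )
completion = tokenizer.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
prediction = completion.split("\n")[0].strip()  # keep only the first generated line
```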

The results are as follows:

| Model  | Task  | Language | F1 (mean) |
|--------|-------|----------|-----------|
| Aguila | XQuAD | ca       | 0.507363  |
| Aguila | XQuAD | en       | 0.523871  |
| Falcon | XQuAD | ca       | 0.477440  |
| Falcon | XQuAD | en       | 0.750167  |

As expected, Aguila achieved the higher F1 score on the Catalan version of the dataset. However, it performed notably worse than Falcon on the English version. This suggests that improving Falcon’s performance in Catalan may come at the cost of its performance in English. Interestingly, Falcon also performs reasonably well on Catalan without any Catalan-specific pre-training.

Limitations of the F1 Score

We chose the F1 score for its simplicity, but we acknowledge that it may not be the most accurate metric. For instance, token-level F1 does not account for synonyms or alternative phrasings, which can be crucial when judging generated answers. In the Lady Gaga example above, both Aguila and Falcon receive an F1 score of 0: Falcon predicts “6”, which is correct, but the reference answer is spelled out as “six”, so there is no token overlap and the score is 0.
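
To make this failure mode concrete, here is a minimal token-overlap F1 in the SQuAD style (the official SQuAD script additionally strips articles and punctuation, which would not rescue this case either):

```python
# Token-overlap F1, SQuAD-style, illustrating the "6" vs "six" failure mode.
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("6", "six"))                                 # 0.0 despite being correct
print(f1_score("the national anthem", "national anthem"))   # 0.8
```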

Despite this, the metric gives a good hint of generative capability and serves as an initial estimate of how the models compare.

Conclusion

We have evaluated Aguila-7b and Falcon-7b on a Catalan and English question-answering dataset to compare the performance gains and losses of continually pre-training an English base model on a new language. According to these initial results, Aguila shows a slight improvement in Catalan, but at a high cost to its performance in English.

Future work includes a more thorough evaluation with more datasets and tasks, as well as better evaluation metrics for text comparison. As the authors of Aguila point out, the lack of good non-English datasets for evaluating generative language models is a key obstacle to progress in the field, which is why initiatives like project-aina’s deserve funding and public support.

Stay tuned for a follow-up post in which we introduce a new language model design for Catalan/Spanish and compare it with Aguila and Falcon. This post was also co-written using the language model that we will present in that post.

Sources/References

You can find the code to reproduce these results on our GitHub:

Co-authored with Emil L. Larsen and our LLM Abzuito :)

