Introducing Ǎguila, a new open-source LLM for Spanish and Catalan

mapama247
Jul 17, 2023

A Falcon-based model with 7B parameters that has been further pre-trained on 26B tokens of open access data.

AI-generated image of an Iberian imperial eagle, an endemic bird of the Iberian Peninsula.

The Language Technologies Unit from Barcelona Supercomputing Center is releasing a new open-source Large Language Model (LLM), licensed for both research and commercial use.

Ǎguila is a 7B-parameter LLM that has been trained on a mixture of Spanish, Catalan and English data, adding up to a total of 26B tokens.
It uses Falcon-7B, a state-of-the-art English language model openly released just a few months ago by the Technology Innovation Institute, as its starting point.

This blog post provides a general overview of the work performed to train the Ǎguila model. For further details, refer to the official model card.

Continual pre-training from a different language ♻️

The amount of data required to pre-train foundational models grows with the number of parameters. This means that when scaling up to larger model sizes, it can become quite expensive to train an LLM from scratch. For many languages, and in particular those with scarce resources, it can be very hard to collect the amount of data required to properly train billion-parameter models.

This work explores the possibility of using an English LLM as a starting point to train a model in a different language. In particular, we adapt the Falcon-7B model to two additional languages, namely Spanish and Catalan, by swapping the tokenizer and adjusting the embedding layer.

The main motivation behind this approach is to leverage all the knowledge that Falcon-7B has already acquired from a vast amount of English data, hoping that it will transfer to our target languages. The final model can then benefit from a non-random initialization of its weights, and as a result, it should require fewer tokens to master new languages.

Plot of the training loss (blue) and validation loss (orange).

The first step for a successful language adaptation is the replacement of the model’s tokenizer. This is crucial because the original English-centric tokenizer would produce a very high token-to-word ratio on Spanish and Catalan text, so a new BPE tokenizer was trained on a mixture of Spanish, Catalan and English data. The second step is to adjust the embedding layer: the weights of tokens shared between the old and new vocabularies are kept, while the remaining entries are initialized with the mean of the original embedding matrix.
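
As a rough sketch of this initialization step (simplified, with a placeholder tokenizer path; not the exact code used for Ǎguila), the remapping can be performed directly on the embedding matrix:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Sketch of the embedding re-initialization: copy weights for tokens shared between
# the old and new vocabularies, use the mean embedding for everything else.
# "path/to/trilingual-tokenizer" is a placeholder, not a released artifact.
old_tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")
new_tokenizer = AutoTokenizer.from_pretrained("path/to/trilingual-tokenizer")
model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-7b", trust_remote_code=True)

old_embeddings = model.get_input_embeddings().weight.data.clone()
mean_embedding = old_embeddings.mean(dim=0)
old_vocab = old_tokenizer.get_vocab()

# Resize to the new vocabulary, then overwrite every row explicitly.
model.resize_token_embeddings(len(new_tokenizer))
new_embeddings = model.get_input_embeddings().weight.data

for token, new_id in new_tokenizer.get_vocab().items():
    if token in old_vocab:
        new_embeddings[new_id] = old_embeddings[old_vocab[token]]  # shared token: keep learned weights
    else:
        new_embeddings[new_id] = mean_embedding                    # new token: mean initialization

# If the output head is not tied to the input embeddings, the same remapping
# should be applied to it as well.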

Once the model has been successfully initialized, it is then possible to start a standard pre-training procedure with our trilingual corpus.
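
For reference, such a continual pre-training run can be set up like an ordinary causal language modeling job with the Transformers Trainer. The snippet below is only a minimal sketch with placeholder paths and hyperparameters, not our actual training script:

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder paths: the adapted checkpoint from the previous step and raw text files.
tokenizer = AutoTokenizer.from_pretrained("path/to/trilingual-tokenizer")
model = AutoModelForCausalLM.from_pretrained("path/to/adapted-falcon-7b", trust_remote_code=True)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding token required by the data collator

raw = load_dataset("text", data_files={"train": "corpus/*.txt"})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="aguila-continual-pretraining",  # placeholder
        per_device_train_batch_size=1,              # placeholder hyperparameters
        gradient_accumulation_steps=8,
        bf16=True,
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM: no masking
)
trainer.train()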

Training data 📚

Our training corpus is composed of 26B tokens from different sources. It includes Spanish and Catalan data in equal proportions (roughly 40% each) and a smaller amount of English data (~17%) to prevent a catastrophic forgetting of the original language. The table below breaks down the training corpus into the different datasets that compose it:

Table of datasets from the model card.

It is important to note that, in addition to these 26B tokens, the model had previously seen 1,500B tokens of mostly English text. This is the size of the corpus used to train Falcon-7B, which is mainly composed of a massive filtered Common Crawl corpus (see the RefinedWeb paper), plus a smaller percentage of books, conversations, code and technical reports. We expect most of the knowledge acquired in that preliminary phase to be preserved in the further pre-trained checkpoint, enabling knowledge transfer between languages and greatly reducing the training cost.

Evaluation 📊

It is well known in the NLP community that evaluating generative models is a challenging task. In our case, current benchmarks are not sufficient to evaluate the model’s capabilities in languages other than English, due to the lack of Spanish and Catalan datasets suitable for assessing decoder-only models (e.g. common sense reasoning, reading comprehension, word prediction, etc.).

So far, we have only conducted a qualitative study at a small scale, but we intend to perform a thorough human evaluation and to collect results of zero- and few-shot experiments on standard benchmarks.

Cherry-picked examples 🍒

Ǎguila’s capabilities are showcased in this section with a few examples of real interactions. Here we can see how the model performs in a zero-shot setting, when prompted to answer specific questions or to complete a given text. Note that it can do so regardless of the input language, being able to respond in Catalan and Spanish alike. In fact, as illustrated by one of the examples, it has even proven to be useful for translation.

Additionally, when prompted with a few examples, the base model is able to perform complex tasks such as Named-Entity Recognition or paraphrase generation. The following are some examples that exhibit the model’s behavior in a few-shot setting:
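
As a rough illustration of what such a few-shot prompt looks like (the sentences, entities and format below are our own made-up examples, not taken from the original interactions), one can simply concatenate a few solved cases before the query and let the base model continue the pattern:

import torch
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="projecte-aina/aguila-7b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# A few solved NER examples (in Catalan) followed by the query sentence.
prompt = (
    "Frase: Ahir vaig visitar la Sagrada Família amb la Maria.\n"
    "Entitats: Sagrada Família (LOC), Maria (PER)\n"
    "\n"
    "Frase: El Barça va guanyar el partit al Camp Nou.\n"
    "Entitats: Barça (ORG), Camp Nou (LOC)\n"
    "\n"
    "Frase: La Núria treballa al Barcelona Supercomputing Center.\n"
    "Entitats:"
)

output = generator(prompt, max_new_tokens=30, do_sample=False)
print(output[0]["generated_text"])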

Finally, here are also a few cases of single-turn instruction following, in this case, using an instruction-tuned version of the model:

How To Use 🎮

A Python code snippet with a very simple usage example:

import torch
from transformers import pipeline, AutoTokenizer

input_text = "El mercat del barri és fantàstic, hi pots trobar"

model_id = "projecte-aina/aguila-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The checkpoint ships custom Falcon modeling code, hence trust_remote_code=True.
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample a continuation of the Catalan prompt.
generation = generator(
    input_text,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)

print(f"Result: {generation[0]['generated_text']}")

We highly recommend reading the Hugging Face blog post on how to fine-tune and run inference with Falcon; everything said there applies to our model as well. Please note that PyTorch 2.0 is required to use Falcon-based models with Transformers.

Released versions 🤹

We are releasing the base model under a permissive Apache 2.0 license.

In addition, and purely out of scientific interest, we have also created an instruction-tuned version. This one, however, is only intended for non-commercial use, since it was trained on the Alpaca dataset, which is subject to OpenAI’s Terms of Use. We are currently working on a translation of the dolly-15k dataset that will allow us to release an instruction-tuned version of Ǎguila under a commercial-friendly license. This section will be updated with any newly released checkpoint.

Compute Infrastructure 🧮

Training was conducted with Hugging Face’s Transformers library on a single NVIDIA DGX node with eight 80GB H100 GPUs, and took around 320 hours in total. Due to memory constraints, it was necessary to partition the optimizer states across GPUs using DeepSpeed’s Zero Redundancy Optimizer (ZeRO).
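
For reference, optimizer-state partitioning (ZeRO stage 1) can be enabled by passing a DeepSpeed configuration to the Trainer shown in the earlier sketch; the values below are placeholders meant to illustrate the mechanism, not the exact configuration we used:

from transformers import TrainingArguments

# Minimal DeepSpeed config: ZeRO stage 1 shards only the optimizer states across GPUs,
# which is what relieves the memory pressure described above. Values are placeholders.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="aguila-continual-pretraining",  # placeholder
    per_device_train_batch_size=1,              # placeholder hyperparameters
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed=ds_config,  # the Trainer initializes DeepSpeed with this config
)

# The job is then launched across the 8 GPUs with a distributed launcher, e.g.:
#   torchrun --nproc_per_node=8 train.py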

License ⚖️

The base model, Ǎguila-7B, is released under an Apache 2.0 license, making it available for both commercial and research purposes. This means that any organization or NLP practitioner is free to use the model weights as they please, including fine-tuning them for specific use cases. It goes without saying that the license of any fine-tuned version will depend on the type of data used to produce it.

Disclaimer ⚠️

Even though we have devoted great effort to minimizing bias and toxicity in our training data, our models may still hallucinate or even produce harmful content. This is a well-known and widespread limitation of generative models, and Ǎguila is no exception. Make sure to take the necessary safeguards before any production use.

Contact ☎️

For further information, please send an email to langtech@bsc.es.
Feel free to contact us if you are interested in any sort of collaboration, or simply to let us know about real-world use cases of our models :)
