🌹 FLOR-6.3B: a chinchilla-compliant model for Catalan, Spanish and English

mapama247
5 min read · Dec 22, 2023

New open-source model based on BigScience's BLOOM, trained on 140B tokens.

Following the release of Ǎguila, our research group further explored language adaptation techniques to recycle Large Language Models (LLMs) via continued pre-training.

Our main motivation is that low- and mid-resource languages like Catalan cannot afford the widespread strategy of pre-training from scratch with randomly initialized weights. The amount of data required for such an endeavor can be prohibitively large, but by starting from a fully-trained LLM it is possible to benefit from the knowledge already embedded in it.

To do so, only the weights of the embedding layer need to be re-initialized before continued pre-training. Given source and target vocabularies, the weights corresponding to overlapping tokens are preserved, while the rest are initialized as the average of the source embeddings.
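As an illustration, here is a minimal sketch of that vocabulary swap using the HuggingFace transformers API. It is not the actual adaptation script: the checkpoint and tokenizer names are only examples, with the published FLOR-760M tokenizer standing in for "the new custom tokenizer".

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative sketch: build a new embedding matrix for a smaller target
# vocabulary, keeping the trained rows of overlapping tokens and filling the
# remaining rows with the average of all source embeddings.
source_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-1b1")
source_tok = AutoTokenizer.from_pretrained("bigscience/bloom-1b1")
target_tok = AutoTokenizer.from_pretrained("projecte-aina/FLOR-760M")  # new, smaller vocabulary

source_emb = source_model.get_input_embeddings().weight.data   # [V_src, d]
mean_emb = source_emb.mean(dim=0)                              # fallback for new tokens
target_emb = mean_emb.repeat(len(target_tok), 1)               # [V_tgt, d]

source_vocab = source_tok.get_vocab()
for token, tgt_id in target_tok.get_vocab().items():
    src_id = source_vocab.get(token)
    if src_id is not None:              # overlapping token: keep its trained embedding
        target_emb[tgt_id] = source_emb[src_id]

# Swap the embedding layer; BLOOM ties input and output embeddings,
# so resizing and copying the input matrix is enough.
source_model.resize_token_embeddings(len(target_tok))
source_model.get_input_embeddings().weight.data.copy_(target_emb)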

Following this approach, we trained FLOR, a new family of open-source models based on publicly released checkpoints of BLOOM. These checkpoints were pre-trained on 341B tokens of multilingual data covering 46 natural languages and 13 programming languages.

Catalan and Spanish are among those languages, with an estimated amount of 3.75B tokens and 36.83B tokens, respectively. This inequality highlights an important limitation of highly multilingual models, which is the unfair representation of minority languages in their training data.

So, unlike Ǎguila, which was Falcon-based, FLOR is multilingual at its core. Our preliminary experiments with smaller models suggest that this is beneficial, even when the languages of interest are underrepresented in the training data. Our internal evaluations also confirm that, despite its smaller size, FLOR-6.3B is significantly better than Ǎguila-7B, although this may also be due to the fact that it was trained on 5x more data.

Released versions 🤹

The FLOR family comes in three sizes: FLOR-760M, FLOR-1.3B and FLOR-6.3B, initialized from the public BLOOM-1.1B, BLOOM-1.7B and BLOOM-7.1B checkpoints, respectively.

Embedding shrinkage 🪆

It is quite obvious that the sizes of FLOR and BLOOM models do not match. This is because, as part of the language adaptation process, the original BLOOM tokenizer was replaced by a custom-made one. The new tokenizer requires far fewer unique tokens, since it has to cover a much smaller number of languages, which is why the two models differ in vocabulary size:

BLOOM has a larger vocabulary size, a common practice in highly multilingual models.

As a consequence, the height of the embedding layer is significantly reduced, which, in turn, reduces the overall size of the model. The shrinkage is more prominent in the smaller versions, because their embeddings represent a higher percentage of the total network; this is why FLOR-760M is roughly 30% smaller than the original BLOOM-1.1B.

Despite being 5 times smaller, the new vocabulary shares 66% of its tokens with BLOOM’s. The rest are replaced by subwords that are more prevalent in Catalan and Spanish.
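
As a back-of-the-envelope check of that figure (the vocabulary sizes and hidden dimension below are taken from the public model cards, not from this post):

# Rough estimate of the parameters saved by shrinking the vocabulary.
bloom_vocab, flor_vocab = 250_880, 50_257   # embedding rows (assumed from model cards)
hidden = 1536                               # BLOOM-1.1B hidden size
bloom_params = 1.07e9                       # total BLOOM-1.1B parameters

saved = (bloom_vocab - flor_vocab) * hidden
print(f"{saved / 1e6:.0f}M parameters removed")        # ~308M
print(f"{(bloom_params - saved) / 1e9:.2f}B remain")   # ~0.76B -> FLOR-760M
print(f"{saved / bloom_params:.0%} smaller")           # ~29%, i.e. roughly 30%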

Training data 📚

FLOR-6.3B was trained on a curated corpus of 140B tokens, while the smaller versions used the 26B-token corpus previously used to train Ǎguila-7B. Note that the choice of corpus size is not arbitrary at all: it roughly matches the ~20 training tokens per parameter prescribed by the Chinchilla scaling laws for a model of this size.
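
As a quick sanity check of that claim, using the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per model parameter:

# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 6.3e9
optimal_tokens = 20 * params
print(f"{optimal_tokens / 1e9:.0f}B tokens")   # 126B, comfortably covered by the 140B corpus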

In both cases, several data sources were added to the mix in an attempt to increase domain diversity. The corpus includes books, Wikipedia articles, doctoral dissertations, legal documents, research papers, patents, news site articles, forum discussions, political discourses, governmental legislation, and massive crawlings (e.g., mC4 or OSCAR), among others. For a detailed list of data sources, refer to the official model card.

The 140B corpus contains the same amount of Catalan, Spanish and English tokens (33% each), unlike the 26B corpus, which has a smaller portion of English data. In the case of Catalan, given its mid-resource nature, some oversampling was required to reach the desired amount of tokens; however, no data source was repeated for more than 4 epochs.

The resulting corpus also reflects the bilingualism of Catalan society, where most people are fluent in both languages. Besides, combining the two languages makes a lot of sense given their cultural closeness and high linguistic similarity.

Evaluation 📊

The models have been tested on a set of standard NLP tasks. We perform an extensive evaluation that includes reading comprehension, commonsense reasoning, question answering, natural language inference, paraphrase identification and machine translation.

We choose datasets for which we have data in the three languages of interest, enabling a fair comparison across languages. In all cases, we assess model performance with commonly used metrics in a 5-shot setting. We rely on the well-known Language Model Evaluation Harness from EleutherAI, extending the framework with additional datasets that it did not previously include.
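
For reference, this is roughly how such a 5-shot evaluation can be launched through the harness's Python API. This is only a sketch: the task names are placeholders that must match whatever tasks are registered in the harness version you install, and our additional Catalan datasets are not part of the upstream release.

from lm_eval import evaluator

# Sketch only: task names are placeholders; pick tasks available in your
# installed version of lm-evaluation-harness.
results = evaluator.simple_evaluate(
    model="hf",                                   # HuggingFace causal-LM backend
    model_args="pretrained=projecte-aina/FLOR-6.3B",
    tasks=["xstorycloze_es", "xnli_es"],          # placeholder task names
    num_fewshot=5,
)
print(results["results"])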

As can be seen in the results below, FLOR obtains consistent gains over BLOOM across Catalan and Spanish tasks. This table will soon be extended with other open-source models of similar size.

How To Use 🎮

The following is a usage example with the HuggingFace ecosystem.

import torch
from transformers import pipeline

input_text = "El mercat del barri és fantàstic, hi pots trobar"  # "The neighborhood market is fantastic, there you can find"
model_id = "projecte-aina/FLOR-6.3B"

# Load the model in bfloat16 and let Accelerate place it on the available devices.
# Note: `device` and `device_map` are mutually exclusive in pipeline(), so only
# device_map="auto" is passed here.
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
generation = generator(
    input_text,
    do_sample=True,
    max_new_tokens=20,
    top_k=10,
)

print(f"Result: {generation[0]['generated_text']}")

Result: El mercat del barri és fantàstic, hi pots trobar gairebé de tot i sempre està molt ple de vida.
("The neighborhood market is fantastic, you can find almost everything and it is always full of life.")

Compute Infrastructure 🧮

The FLOR family of models was trained using subsets of Condor Galaxy 1, the AI Supercomputer built by Cerebras Systems and G42. The smaller models were trained using a single Cerebras CS-2 system, while FLOR-6.3B was trained using 16 CS-2s. Cerebras completed the entire training of FLOR-6.3B on 140 billion tokens in 2.5 days.

License ⚖️

FLOR is released as a family of open-weight models, all made available under the permissive Apache-2.0 license, so the three models can be used in both research and commercial applications.

Disclaimer ⚠️

Even though we have devoted great effort to minimizing bias and toxicity in our training data, our models may still hallucinate or even produce harmful content. This is a well-known and widespread limitation of generative models, and FLOR is no exception. Make sure to take the necessary safeguards before any production use.

Acknowledgments 🙏

We are very grateful for the collaboration of our colleagues at Cerebras, especially Yishi Xu and Duncan Hoskinson.

Contact ☎️

For further information, feel free to send an email to langtech@bsc.es.
