FanFabler: Fine-Tuning Llama 3 to Be a Multilingual Fanfic Writing Assistant
How I used a custom training dataset and information retrieval for global storytelling. 好样的! Bravo! वाह! ¡Guau! 브라보!
The rise of Large Language Models (LLMs) has ushered in a new era of text-based AI systems. Although these models are highly capable, their training data is predominantly English. The largest commercial LLMs can generate text reasonably well in “low resource” languages, but smaller open-source models don’t fare well with non-European languages.
However, Meta trained the new Llama 3 model on a wider variety of languages, as the company announced when the model was released [1].
To train the best language model, the curation of a large, high-quality training dataset is paramount. In line with our design principles, we invested heavily in pretraining data. … To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages. However, we do not expect the same level of performance in these languages as in English. — Meta
Five percent doesn’t sound like much, but it’s more non-English data than previous versions of Llama [2] or other small LLMs like Mistral [3] were trained on…