Llama-2-Amharic: LLMs for Low Resource Languages

Mike Andersland
9 min read · Dec 18, 2023


Large language models (LLMs) like GPT, Llama and others have demonstrated unprecedented capabilities in language understanding and generation. These models excel at a variety of tasks, but primarily in languages that are well-represented in their training sets — such as English. However, they struggle when it comes to low-resource languages like Amharic, the most widely spoken language in Ethiopia with approximately 60 million speakers worldwide. Read on to learn how we trained Llama-2 to speak Amharic! (Model weights available here)

How do Large Language Models Work?

At their core, LLMs work by predicting what comes next in a string of text. For long sentences, the LLM will make predictions repeatedly. If an LLM is asked “Please name a common flavor of sandwich,” it will predict one step at a time, looking at the existing sequence (in black) to predict the next word (in green). Eventually it will predict a “stop token”, which means it has found a good place to end the response.

LLM predicting “Peanut butter and jelly” one step at a time
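
To make this concrete, here is a minimal sketch of that prediction loop using greedy decoding with the Hugging Face transformers library. The model name, prompt, and 20-token cap are illustrative choices, not the exact setup used in this post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint; any causal LM works the same way.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

ids = tokenizer.encode("Please name a common flavor of sandwich:", return_tensors="pt")
with torch.no_grad():
    for _ in range(20):                           # predict at most 20 new tokens
        logits = model(ids).logits[:, -1, :]      # scores for the next token only
        next_id = torch.argmax(logits, dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)   # append the prediction and repeat
        if next_id.item() == tokenizer.eos_token_id:  # the "stop token"
            break
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```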

When a model is trained, it looks at a large amount of text and repeatedly tries to predict the next word. Early on in training, it performs poorly at this task, guessing randomly and making nonsensical predictions. But it learns with each incorrect prediction. Models like GPT see trillions of words throughout the course of their training process, and with this much data, they are much better at making accurate predictions by the time they finish training.
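
The "learning from each incorrect prediction" happens through a next-token loss. Below is a rough, self-contained sketch of that objective with random numbers standing in for a real model's outputs; the shapes and vocabulary size are arbitrary.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 32000, 8
logits = torch.randn(1, seq_len, vocab_size)         # stand-in for model predictions
tokens = torch.randint(0, vocab_size, (1, seq_len))  # the training text as token ids

# Predict token t+1 from everything up to token t: shift predictions and targets by one.
loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, vocab_size),
    tokens[:, 1:].reshape(-1),
)
print(loss.item())  # large for random guesses; training pushes this down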

The Challenge of Low-Resource Languages

Languages like Amharic are often classified as “low-resource” due to the limited availability of training data compared to “high-resource” languages, which are richly represented in most training sets. Large models can seamlessly process multiple high-resource languages but tend to fall short on less-represented ones. For Amharic, there are fewer than 1 billion words across public datasets, roughly 1,000 to 10,000 times less than what is available for English. With so much less data, it is far more difficult for a model to learn.

The Importance of Tokenization

Apart from the scarcity of training data, tokenization is another crucial factor that can affect the performance of language models in low-resource languages. A tokenizer breaks input text into smaller pieces, or “tokens,” that the model can process, and each token is represented to the model as a unique number. In languages like English, tokenizers generally need only about one token per common word.

Llama’s tokenizer encodes “Hi, how are you” to a sequence of numbers: [1, 6324, 29892, 920, 526, 366, 29973].

Common phrases or words may have specific tokens dedicated to them, which makes learning easier. With Amharic, a single character often gets split into multiple tokens, and these tokens are not dedicated to Amharic words and phrases but are instead combinations of “default” tokens. You can try pasting various phrases into the GPT-4 tokenizer to see how many tokens each one uses.
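
You can also reproduce this comparison locally. Below is a small sketch using the Llama-2 tokenizer through Hugging Face transformers; the Amharic phrase (roughly “hello, how are you”) is just an example.

```python
from transformers import AutoTokenizer

# Requires access to the Llama-2 checkpoint (or any local copy of its tokenizer).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

english = "Hi, how are you"
amharic = "ሰላም እንዴት ነህ"  # roughly "hello, how are you"

for text in (english, amharic):
    ids = tokenizer.encode(text)
    print(f"{text!r}: {len(ids)} tokens -> {ids}")
```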

Tokenizer for English vs Amharic

This inefficient tokenization can lead to lackluster performance. The model learns a worse internal representation of the concepts behind these characters, and it also needs many more tokens to either “read” or “write” them. This means that in a model with a given context length (e.g., 4,000 tokens), far less information can be provided as input or generated as output, and generation will be much slower. Try speaking to GPT-4 in Amharic, and the output will likely be slower than in English.

Learning from Existing Work

After the release of Llama, researchers successfully adapted the model for Chinese, which also suffered from limited data and a poor tokenized representation. They accomplished this by designing a specialized tokenizer for Chinese and then extending the pre-training phase before fine-tuning the model. Inspired by their paper, we decided to implement a similar approach for Amharic with Llama-2.

Custom Tokenizer

We knew that creating a custom tokenizer was crucial for improving Llama-2’s performance on Amharic. To tackle this, we collected a corpus of Amharic text from various web sources and used Google’s SentencePiece to learn our new tokenizer. SentencePiece looks at the distribution of characters, words, and phrases in the text and figures out which pieces are statistically common enough to deserve their own tokens. This gave us a more efficient tokenization tailored to Amharic, and it laid a strong foundation for the subsequent training phases.

The Amharic Llama Tokenizer uses 1/6 the number of tokens for the same Amharic text.
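
As a rough sketch, learning a tokenizer with SentencePiece looks like the following. The file names, vocabulary size, and BPE model type are assumptions for illustration, not our exact configuration.

```python
import sentencepiece as spm

# Train a subword tokenizer on a plain-text Amharic corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="amharic_corpus.txt",      # illustrative path
    model_prefix="amharic_sp",
    vocab_size=20000,                # illustrative size
    model_type="bpe",
    character_coverage=1.0,          # keep every Ge'ez-script character
)

sp = spm.SentencePieceProcessor(model_file="amharic_sp.model")
print(sp.encode("ሰላም እንዴት ነህ", out_type=str))  # pieces learned from Amharic text
```

In practice, the newly learned Amharic pieces also need to be merged into Llama-2’s existing vocabulary, as the Chinese Llama work did, so the model keeps its original English tokens alongside the new ones.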

Pre-Training

Conceptually, pre-training is pretty simple. The model sees lots of text and repeatedly tries (often failing) to predict the next token. It slowly learns from its mistakes, and the predictions improve over time. We collected a small mix of open-source Amharic text and continued training Llama-2 in this manner, picking up right where the original Llama training had left off, but now on mostly new tokens the model had never seen before: the Amharic language.
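
A minimal sketch of what this continued pre-training could look like with the Hugging Face Trainer is shown below; the paths, sequence length, and hyperparameters are illustrative assumptions rather than our exact recipe.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, LlamaForCausalLM, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

# Illustrative paths: a Llama-2 tokenizer extended with the Amharic vocabulary,
# and a plain-text Amharic corpus.
tokenizer = AutoTokenizer.from_pretrained("path/to/extended-amharic-tokenizer")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model.resize_token_embeddings(len(tokenizer))  # add rows for the new Amharic tokens

dataset = load_dataset("text", data_files={"train": "amharic_pretrain.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-amharic-pretrain",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-5, bf16=True),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # next-token objective
)
trainer.train()
```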

Fine-Tuning

A pretrained model will often be very good at what it was trained to do: complete text by guessing what comes next. But this is not so helpful for many tasks, especially conversational ones. Fine-tuning is an additional training step that focuses the extensive knowledge of a pretrained model on a specific task, such as answering questions, writing stories, or summarizing text. A popular fine-tuning approach is to prepare question-and-answer pairs of conversation between a user and an AI assistant, then train the model to produce the correct AI responses. Many open-source datasets of this kind exist for English, but none for Amharic. Instead, we used the Google Translate API to translate English datasets into Amharic. While the translations are not perfect, they are surprisingly accurate, and preferable to doing no fine-tuning at all.
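
Here is a sketch of how such a translated fine-tuning set can be assembled with the Google Cloud Translate client. The Alpaca-style source dataset, its field names, and the slice size are illustrative assumptions, not necessarily what we used.

```python
from datasets import load_dataset
from google.cloud import translate_v2 as translate  # requires Google Cloud credentials

client = translate.Client()
pairs = load_dataset("tatsu-lab/alpaca", split="train[:1000]")  # illustrative source

def to_amharic(example):
    # Translate each conversational field; "am" is the Amharic language code.
    for field in ("instruction", "input", "output"):
        if example[field]:
            example[field] = client.translate(
                example[field], target_language="am"
            )["translatedText"]
    return example

amharic_pairs = pairs.map(to_amharic)
amharic_pairs.to_json("amharic_instructions.jsonl")
```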

Initial Results

The training run was a success at first glance! After completing fine-tuning, we had a model that could respond to questions in fluent Amharic and provide half-decent answers. But we quickly realized the model was heavily biased toward certain topics. It would often veer off topic and begin reciting Bible verses, news, or both, and in other areas it performed poorly and often produced incoherent text. The most likely culprit: a heavily biased dataset. As is the case with many low-resource languages, the majority of available digital text is concentrated in a few domains, in this case news articles and religious texts.

The paper on Chinese Llama that we had used as a guide relied on significantly larger pre-existing datasets. While Chinese was poorly represented in the Llama training set, it is much more common across the internet than Amharic, and covers a far wider range of topics. If we wanted a performant model, we had to find another source of text.

Data Augmentation

While state-of-the-art models like Llama are trained on trillions of tokens spanning many terabytes of text, we were able to find only a couple of gigabytes of Amharic text, totaling fewer than 500 million tokens. Although we couldn’t realistically reach trillions of tokens, we wanted to see how quickly the model’s performance might improve with more data, and more importantly, more diverse data. After poring over numerous open-source datasets and finding hardly any Amharic text, we decided to take a different approach entirely: what if we used translation to obtain not only the fine-tuning conversation pairs, but also the pre-training dataset? Our first thought was to use a translation API again, but translating even a few hundred million tokens this way would be prohibitively expensive (for fine-tuning, we only needed a few thousand).

Around the same time we were pondering this problem, researchers at Meta released an open-source translation model called SeamlessM4T, offering high-quality translation for over 100 languages, including several low-resource languages like Amharic. With this model, we were able to perform the translations locally and quickly obtain over 3 billion Amharic tokens, giving us a total dataset of around 3.8 billion tokens. Our hope was that while performance would likely be bounded by the quality of the translation model, it would still significantly exceed that of the biased original model.
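
A minimal sketch of text-to-text translation with SeamlessM4T through Hugging Face transformers is shown below, assuming the medium checkpoint; the batching and throughput tricks needed to reach billions of tokens are omitted.

```python
from transformers import AutoProcessor, SeamlessM4TModel

processor = AutoProcessor.from_pretrained("facebook/hf-seamless-m4t-medium")
model = SeamlessM4TModel.from_pretrained("facebook/hf-seamless-m4t-medium")

# English in, Amharic ("amh") out, as plain text rather than speech.
inputs = processor(text="The weather is nice today.", src_lang="eng", return_tensors="pt")
output_tokens = model.generate(**inputs, tgt_lang="amh", generate_speech=False)
print(processor.decode(output_tokens[0].tolist()[0], skip_special_tokens=True))
```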

Evaluation

Qualitatively, our model seemed to pass the eye test, offering impressive results after training. It appeared both more knowledgeable and better able to produce coherent paragraphs without suddenly rambling about the news.

Chatting with Llama-2-Amharic

Quantitatively evaluating language models is challenging even in English; numerous evaluation benchmarks exist, but the best way to quantify LLM performance remains an open question. No such datasets exist for Amharic, so as a baseline we picked a popular English benchmark, MMLU, and translated it to Amharic. MMLU measures LLM performance in various subjects, including math, science, philosophy, and more, by asking multiple-choice questions.
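
Scoring the translated benchmark then reduces to checking multiple-choice accuracy. The sketch below is a simplification of standard MMLU harnesses; model_answer_fn is a hypothetical callable that returns the model’s reply to a prompt.

```python
import re

LETTERS = ["A", "B", "C", "D"]

def score_mmlu(model_answer_fn, questions):
    """questions: dicts with 'question', 'choices' (4 strings), 'answer' (index 0-3)."""
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {choice}" for letter, choice in zip(LETTERS, q["choices"])
        ) + "\nAnswer:"
        reply = model_answer_fn(prompt)        # the model's generated answer
        match = re.search(r"[ABCD]", reply)    # take the first choice letter it emits
        if match and LETTERS.index(match.group()) == q["answer"]:
            correct += 1
    return correct / len(questions)            # random guessing scores ~0.25
```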

Amharic MMLU Evaluation

Performance on our “Amharic-MMLU” was varied, with the model performing best in areas like law and sociology, and worst in math and chemistry. In a few cases, the original model with the smaller dataset performed better. In some cases, both models scored worse than a baseline random guess! We suspect that STEM scores suffer the most from minor translation errors, as a single character change could alter the meaning of a question. For example, a single mistranslation might render a math question indecipherable, or make a wrong answer appear to be correct. Our hacky approach of translating the benchmark data to Amharic likely exacerbated this issue for these topics.

Amharic MMLU: Topics where the 3.8b model performed better
Amharic MMLU: Topics where the 436m model performed better

Despite the issues with our rudimentary evaluation, we observed a clear trend of substantial improvement after adding the translated tokens to the pre-training dataset. On average across all topics, the 3.8b model scored 0.29 while the 436m model scored 0.26, barely passing the baseline of random guessing. The 3.8b model’s highest score was 0.52 on international_law. The worst scores for both models were on formal_logic, with neither model even matching the baseline.

An example of a formal_logic question where the translation model failed to translate certain symbols, leading to a difficult task for the model.

Next Steps

There are numerous ways to improve and expand upon the approach we’ve described here. At a minimum, we could improve our process with more real data. Collecting billions of tokens is a daunting and, for now, perhaps intractable task, but even a few thousand tokens of real data for fine-tuning, or just a few hundred for evaluation, could have an outsized effect.

More experiments around the selection of English data for translation would also likely prove useful. For example, which types of data are most valuable to translate? How should we allocate translation resources? What about using larger models (we used the smallest, Llama-2-7b)? And finally, what if we just keep translating more data to further grow our dataset?

We have released Llama-2-Amharic on HuggingFace, and plan to share a full technical report in the near future.

Github repo: https://github.com/iocuydi/amharic-llama-llava

Huge thanks to the Google for Startups and Google Cloud teams for helping us with access to A100 GPUs and cloud credits for this project and others!

Get in touch: developer@garrilogistics.com
