Can Alpacas Learn Languages?

Marcin Mosiołek
6 min read · Apr 3, 2023


LLaMA is a state-of-the-art, foundational, open-source large language model designed to help engineers and researchers advance their work in NLP. For example, Stanford researchers have fine-tuned LLaMA to construct an alternative to the famous ChatGPT — a model called Alpaca.

Unfortunately, LLaMA was trained on a dataset consisting mainly of English texts, with only a small percentage of the data relating to other languages. In addition, the Alpaca instruction-tuning dataset consists only of English instructions. So Alpaca simply doesn’t work well for other languages.

The question is:

Is the small amount of non-English text present in the pre-training dataset sufficient to effectively capture the under-represented languages and construct an instruction-following model just by LoRA fine-tuning on a language-specific dataset?

I’ll try to answer this question and create a toy, language-specific Alpaca model. My mother tongue is Polish, so I’ll conduct the experiments in Polish. But don’t worry, I’ll translate all the results back into English.

Translation

To perform the fine-tuning, I had to collect a Polish instruction dataset. I decided to simply translate the original dataset used to construct Alpaca and use it for Polish-specific fine-tuning.

Unfortunately, a naive approach to translation comes with significant limitations, as some of the tasks from the original dataset simply can’t be translated into another language. For example, we can’t propagate orthographic errors from one language to another, or automatically translate poems while keeping the rhymes, and there are many other such cases. To reduce their impact, I decided to translate only the instruction and input fields, and then regenerate the outputs.

For both the translation and the regeneration of outputs, I relied on GPT-3.5-Turbo and the free $18 credit granted by the OpenAI platform. Unfortunately, the cost of the process exceeded the amount granted, so I had to add $10 out of my own pocket. Although it was still extremely cheap, it took roughly 24 hours to complete (5 threads running in parallel). The result of the translation is open-sourced and can be found here.

The following prompt was used for the translation and is based on this paper:

f"Please provide the Polish translation for these sentences: [TEXT]"

In addition, I extended the original instruction to ask for no more than 256 characters to be generated; otherwise, GPT-3.5-Turbo returns long, elaborate answers. This is costly (the OpenAI platform charges per token for both input and output) and may go beyond what the LoRA fine-tuning of LLaMA-7B can handle.
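For illustration, here is a rough sketch of what this pipeline can look like with the openai Python package (the ChatCompletion API that was current at the time). The helper names, file paths and the exact place where the 256-character constraint is injected are simplifications for readability, not my exact script:

import json
import openai
from concurrent.futures import ThreadPoolExecutor

openai.api_key = "YOUR_OPENAI_API_KEY"  # placeholder

# Appended to prompts to keep answers short, as described above.
LENGTH_LIMIT = " Do not generate more than 256 characters."

def ask_gpt(prompt):
    # A single call to GPT-3.5-Turbo, used both for translating fields
    # and for regenerating outputs.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"].strip()

def translate_to_polish(text):
    # The translation prompt shown above.
    return ask_gpt(
        f"Please provide the Polish translation for these sentences: {text}"
        + LENGTH_LIMIT
    )

def process_record(record):
    # Translate only the instruction and input fields...
    record["instruction"] = translate_to_polish(record["instruction"])
    if record.get("input"):
        record["input"] = translate_to_polish(record["input"])
    # ...and regenerate the output in Polish instead of translating it.
    prompt = record["instruction"]
    if record.get("input"):
        prompt += "\n\n" + record["input"]
    record["output"] = ask_gpt(prompt + LENGTH_LIMIT)
    return record

with open("alpaca_data_cleaned.json") as f:
    records = json.load(f)

# Roughly 5 requests in flight at a time, as in the run described above.
with ThreadPoolExecutor(max_workers=5) as pool:
    translated = list(pool.map(process_record, records))

with open("pl_alpaca_data_cleaned.json", "w") as f:
    json.dump(translated, f, ensure_ascii=False, indent=2)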

Training

Training your own Alpaca model is easy and fast thanks to the Alpaca LoRA implementation. I simply followed the steps described in the README file and used the default fine-tuning command with LLaMA-7B as the base model. The only change I made was to use my own translated dataset:

python finetune.py \
    --base_model='decapoda-research/llama-7b-hf' \
    --num_epochs=10 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./polpaca-lora-7b' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --data_path=pl_alpaca_data_cleaned.json \
    --micro_batch_size=8

The trained model is called polpaca-lora-7B and can be found here.

Evaluation

Does it work? Let’s see some examples (translated back to English):

{
instruction="Generate 5 interview questions",
output="
1. What experience do you have in this role?
2. What challenges have you encountered in your previous work?
3. What strategies did you use to solve problems?
4. What skills and qualities do you consider most important in this role?
5. How do you deal with difficult situations?
"
}

Not bad, right? Let’s try another one:

{
instruction="
Write a polite message to a work colleague and congratulate
him on his wedding
",
output="
Dear [Name],

I hope this day is full of joy and gratitude for you. You have
started your life together with [Your Partner] and I am sure it
will be wonderful.

I am very grateful for your work and dedication to our team.
You are an unforgettable member of our team and I am sure it will
get even better.

Sincerely,
[Your Name].
"
}

This one looks slightly worse. Actually, before translating it back to English, the Polish output reads rather strangely, as if it were written by a robot. This stems from the way the fine-tuning dataset was constructed: machine translation produces artificial-sounding text in some scenarios.

Now let’s take a look at the famous example presented on the Stanford Alpaca project page.


{
en_instruction="What is an alpaca? How is it different from a llama?",
pl_instruction="Co to jest alpaka? Czym różni się od lamy",
output="
The alpaca is a mammal of the elk family that is very similar to
the llama. Alpacas are much larger than llamas and have thicker fur.
Alpacas also have longer and more curved horns.
"
}

Wait, what? Alpacas don’t have horns! But maybe the Polish model doesn’t understand the Polish words for alpaca and llama (alpaka, lama), because it has never seen them in Polish texts? Let’s try again, but with the English words.

{
en_instruction="What is an alpaca? How is it different from a llama?",
pl_instruction="Co to jest alpaca? Czym różni się od llama",
output="
The alpaca is a breed of domestic animal that is native to the Andes
Mountains in South America. The alpaca is much smaller than the llama
and has softer and finer fur.
"
}

Indeed, it seems that polpaca-lora-7B doesn’t understand some Polish words, since after replacing them with their English counterparts it returns a perfectly reasonable answer. Let me try another example to confirm this finding:


{
en_instruction="What is a cruiser motorbike?",
pl_instruction="Co to za rodzaj motocykla: krążownik",
output="
A cruiser motorbike is a type of motorbike that is designed to allow
the rider to ride in both directions.
"
}

That’s absurd. Let me replace the Polish word “krążownik” with its English counterpart “cruiser” and run the query again:

{
en_instruction="What is a cruiser motorbike?",
pl_instruction="Co to za rodzaj motocykla: cruiser",
output="
A cruiser motorbike is a type of motorbike characterised by a long,
wide body, high chassis and wide tyres.
"
}

I’m not a bike expert, but this answer seems to be way more legit.

By the way, here is the config used to generate the above results. To reproduce them, simply run generate.py from the Alpaca LoRA repository and adjust the inputs accordingly.

GenerationConfig(
    temperature=0.5,         # moderate randomness in sampling
    top_p=0.75,              # nucleus sampling threshold
    top_k=40,                # consider only the 40 most likely tokens
    repetition_penalty=1.5,  # discourage repeating the same content
    num_beams=4,             # beam search width
    max_new_tokens=256,      # cap on the response length
)
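If you’d rather script the inference than edit generate.py, here is a minimal sketch of what it boils down to, using transformers and peft. The adapter path and the Alpaca-style prompt template are placeholders; adjust them to wherever you keep the weights:

import torch
from peft import PeftModel
from transformers import GenerationConfig, LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"
LORA_WEIGHTS = "./polpaca-lora-7b"  # placeholder: path to the trained adapter

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, LORA_WEIGHTS)
model.eval()

def ask(instruction):
    # Alpaca-style prompt for instructions without an additional input field.
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    config = GenerationConfig(
        temperature=0.5, top_p=0.75, top_k=40,
        repetition_penalty=1.5, num_beams=4, max_new_tokens=256,
    )
    with torch.no_grad():
        output = model.generate(**inputs, generation_config=config)
    # Keep only the text generated after the prompt.
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    return text.split("### Response:")[-1].strip()

print(ask("Co to jest alpaka? Czym różni się od lamy?"))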

Conclusion

The overall impression of the model’s performance is surprisingly positive, given the modest effort that went into this toy experiment. As long as I avoid complex instructions, proper names and rare terms, the results look reasonably good. However, in some scenarios it is necessary to play around with the prompts.

On the other hand, the model lacks a good understanding of the Polish language and generates funny answers in scenarios where it doesn’t know some of the words. It also tends to produce grammatical errors. It needs a higher temperature to make it more “creative”, along with an increased repetition_penalty value to prevent it from generating the same content over and over again. Overall, it seems to have a rather limited “vocabulary”.

To make the model fully Polish-compatible, I’d suggest first pre-training or fine-tuning LLaMA on a Polish corpus and only then fine-tuning it further to follow instructions. The instruction dataset should also be revised rather than blindly translated, as was done in this experiment.

So yes, alpacas can learn languages, but they seem to have trouble becoming good speakers, at least when taught this way ;) I therefore discourage anyone from treating this model seriously or considering it for customer-facing products. However, I encourage you to play with the model and share your findings!
