Are Llama3 and Mixtral open models ready for RAG in Greek?

Vassilios Antonopoulos
Published in 11tensors
Apr 30, 2024

Open source models have been shaping the AI field in recent weeks. Mistral released a new MoE 8x22B model, and a few days ago Meta introduced the first models of its Llama3 series, with 8B and 70B parameters (a larger 400B model is expected later this year). These models climbed near the top of all the existing leaderboards (Hugging Face, LMSYS), proving that they are extremely powerful for a large variety of applications. Although they seem to support more languages better than their predecessors, Greek is, of course, not among them. So, at this point in time, are we still left with closed systems like those of OpenAI and Cohere for applications dealing with Greek textual data?

Let’s see where we are.

We have interacted with Llama3-70B and Mixtral MoE 8x22B. It quickly becomes clear that the models are quite capable of conversing in Greek and following instructions to handle Greek content in several scenarios, and that they understand the meaning and concepts of Greek text very well. Mixtral MoE 8x7B was also quite capable, although you could notice some grammatical or syntactical errors in its generations.

One of the most common AI components is RAG (Retrieval Augmented Generation), especially if you only need AI on your own data. RAG is at the heart of most AI systems today. At 11tensors we have developed end-to-end solutions incorporating RAG architectures for systems such as:

  1. Conversational agents (chatbots)
  2. Question answering based on your own data (external knowledge bases)
  3. Search engines

Multiple LLMs can be used in multiple roles in the above systems. Here we focus on the generator role of the RAG architecture, i.e. the LLM that combines the retrieved text segments and forms the final response to the user, and we ran an evaluation whose results are quite interesting.
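
To make the generator role concrete, here is a minimal sketch of how the retrieved segments and the user question can be combined into a single prompt for the generator LLM; the template wording and the function name are our own illustration, not the exact prompt we use in production.

```python
# Minimal sketch of the generator step of a RAG pipeline.
# The template wording and the function name are illustrative only.

def build_generator_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine the retrieved text segments and the user question
    into a single prompt for the generator LLM."""
    context = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks)
    )
    return (
        "Answer the question using only the context below. "
        "Answer in Greek.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```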

So, for this task, we started by selecting a suitable benchmark. We used the Greek version of the Belebele dataset, a reading comprehension task containing multiple-choice questions on a given text excerpt (https://arxiv.org/abs/2308.16884). Recently, the Institute for Language and Speech Processing (ILSP/Athena R.C.) released Meltemi, a version of Mistral-7B-v0.1 pretrained and fine-tuned in Greek (https://huggingface.co/ilsp/Meltemi-7B-v1). Although the Instruct version of the model lags behind what we are used to from today's top LLMs and needs further improvement in user alignment and instruction following, both versions are very good at generating correct Greek text. This model is therefore a natural reference point for Greek. We successfully replicated the published 0.6367 score of the base model on the Belebele ell task using the EleutherAI LM Evaluation Harness and compared it to the Llama3 8B base and Instruct models. The results are shown in the table below.
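
For reference, the replication can be run with the harness roughly along these lines; treat this as a sketch, since the exact Belebele task name for Greek (we write belebele_ell_Grek here) and the evaluation settings may differ between harness versions.

```python
# Sketch of replicating the Belebele (Greek) score with EleutherAI's
# LM Evaluation Harness. The task name "belebele_ell_Grek" and the
# batch size are assumptions and may differ between harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ilsp/Meltemi-7B-v1",
    tasks=["belebele_ell_Grek"],
    batch_size=8,
)
print(results["results"])  # per-task accuracy metrics
```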

It is interesting to note that the small Llama3 model with 8B parameters clearly outperforms the Greek model on this Greek benchmark. Llama3 seems to have a very high comprehension level when handling Greek texts. This observation brought us to the next step: comparing the models on a RAG task.

11tensors possesses the largest available online Greek textual corpus, crawled from public sources between 2009 and 2021 and cleaned for media analysis and information extraction. We use this corpus to synthetically create datasets and fine-tune LLMs according to our clients' needs. We had previously created a proprietary RAG-specific dataset that contains Question-Context-Answer triplets. We use this dataset for fine-tuning LLMs, and in this case we fine-tuned Meltemi-7B-Instruct-v1 to create the model meltemi-11tensors-RAG.
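
To give an idea of the structure (with invented content, not an actual entry from the proprietary dataset), a triplet looks roughly like this:

```python
# Hypothetical Question-Context-Answer triplet; the content is invented
# and only illustrates the structure of a training example.
example = {
    "question": "Πότε ιδρύθηκε ο οργανισμός;",
    "context": "Ο οργανισμός ιδρύθηκε το 1998 στην Αθήνα και δραστηριοποιείται ...",
    "answer": "Ο οργανισμός ιδρύθηκε το 1998.",
}
```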

We needed a similar dataset for our RAG evaluation, so we created a new one by selecting a random set of 50 documents and writing 5 question-answer pairs for each of them, resulting in a final set of 250 triplets. Starting from these questions, we implemented a typical RAG workflow: the documents were chunked, embedded with a Sentence Transformer model and imported into a vector database, and for each of the 250 questions we retrieved the 3 most similar chunks to use as common context for all the LLMs we wanted to evaluate. We deliberately did not optimize this procedure, so that all models would be evaluated under common conditions and on rather difficult cases.
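
A rough sketch of this retrieval setup is shown below; the specific embedding model and the Chroma vector store are stand-ins chosen for illustration, not necessarily the exact components of our pipeline.

```python
# Sketch of the retrieval side of the evaluation pipeline:
# chunk documents, embed them, and fetch the 3 nearest chunks per question.
# The embedding model and the Chroma store are illustrative choices.
import chromadb
from sentence_transformers import SentenceTransformer

documents = ["..."]   # the 50 selected Greek documents (placeholder)
questions = ["..."]   # the 250 evaluation questions (placeholder)

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
client = chromadb.Client()
collection = client.create_collection("greek_rag_eval")

def chunk(text: str, size: int = 500) -> list[str]:
    # Naive fixed-size chunking; real pipelines usually chunk more carefully.
    return [text[i:i + size] for i in range(0, len(text), size)]

# Index all document chunks with their embeddings.
for doc_id, doc in enumerate(documents):
    chunks = chunk(doc)
    collection.add(
        ids=[f"{doc_id}-{j}" for j in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
    )

# Retrieve the 3 most similar chunks for each question,
# to be used as common context for every model under evaluation.
contexts = {}
for q in questions:
    hits = collection.query(
        query_embeddings=embedder.encode([q]).tolist(),
        n_results=3,
    )
    contexts[q] = hits["documents"][0]
```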

The models we evaluated were the following:

  • Meltemi-7B-Instruct-v1
  • Meltemi-11tensors-RAG
  • Llama3-8B-Instruct
  • Llama3-70B-Instruct (served by Groq; btw… insane tokens/sec speed, just insane!)
  • Mixtral MoE 8x22B (served by the Mistral AI API)

For the evaluation we followed the LLM-as-a-Judge approach, using OpenAI's GPT-3.5 model as the judge. We selected the two metrics that measure the performance of the final RAG component we are interested in: answer correctness (compared to the ground truth) and answer relevancy (compared to the query and context), using Prometheus model templates as discussed here. The results are presented below.
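
The judging step looked roughly like the sketch below; the rubric wording is a simplified stand-in for the Prometheus-style templates we actually used, and the judge is called through the standard OpenAI chat API.

```python
# Simplified sketch of LLM-as-a-Judge scoring with GPT-3.5.
# The rubric text is a stand-in for the Prometheus-style templates we used.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_correctness(question: str, ground_truth: str, answer: str) -> int:
    """Ask the judge model to rate answer correctness on a 1-5 scale."""
    prompt = (
        "You are an impartial judge. Rate how correct the candidate answer is "
        "compared to the reference answer, on a scale of 1 (wrong) to 5 "
        "(fully correct). Reply with the number only.\n\n"
        f"Question: {question}\n"
        f"Reference answer: {ground_truth}\n"
        f"Candidate answer: {answer}\n"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the judge replies with a bare number, as instructed.
    return int(response.choices[0].message.content.strip())
```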

There are some conclusions and observations that we can make:

  • 11tensors’ fine-tuned model, although significantly smaller, performs on par with the larger models. Not a surprise though… This model is fine-tuned for this specific task. This is the beauty of adapting and using smaller models for specific tasks.
  • The previous observation also indicates the importance and the potential of the Meltemi LLM. It seems it can serve as a basis for building very efficient solutions in various applications dealing with Greek text.
  • Both the Llama3-70B and Mixtral models confirm that they understand the Greek language very well and can generate syntactically and grammatically correct Greek text. These models are very powerful in reasoning and instruction following, and the fact that they perform so well on this task signifies that they can be used in a wide range of applications involving Greek text.
  • The Llama3-8B-Instruct model, while performing very well on the reading comprehension task, struggles to generate correct Greek text efficiently on this task, and its score is low. However, we expect it to perform much better if it is fine-tuned on the task (see our earlier post on Llama2 here), although the extent of its general applicability compared to the larger models is questionable.

Finally, to return to the original question that led to this analysis… well… actually, yes! Although Greek is underrepresented in their training data compared to widely spoken languages, the models are still very capable and demonstrate robust performance. Moreover, they offer the flexibility for further fine-tuning to tailor them to our specific use cases.
