A Greek LLM after fine-tuning Llama2 7b

Published in 11tensors, Dec 6, 2023 (7 min read)

As a language not widely spoken internationally, Greek often faces challenges in digital representation and accessibility. In the era of LLMs, a model tailored to the nuances of Greek, with performance on par with models trained for more widely spoken languages, would make it easier to build diverse applications and would foster a more inclusive and dynamic digital landscape for the Greek-speaking population. This post is the first in an upcoming series exploring the available options and models for building one or more specialized Greek LLMs.

OpenAI’s models, both GPT-3.5 and GPT-4, do offer advanced capabilities, even for the Greek language. But this comes with significant caveats. First of all, the cost of using them is prohibitive for many applications and users, so they cannot be widely adopted. And in many scenarios, lighter and more specialized LLMs (or even trained deep learning models) would do the job faster, more cost-effectively and with better performance.

11tensors possesses a very large corpus of around 1.5 billion Greek documents, by far the largest available online Greek corpus. Building a model from scratch isn’t an easy task: it’s very resource intensive and requires a large investment. By contrast, fine-tuning an existing foundation model with high-quality, task-specific training data is a process we can follow to create specialized Greek LLMs for various tasks.

Open source models are closing the gap with the closed ones. The collaborative efforts of the open-source community have led to the development of sophisticated models that rival their proprietary counterparts in terms of performance and versatility. Two models have gained the most traction lately: Llama2 and Mistral.

Mistral is a fantastic model, outperforming all models under 30B and even beating 70B models in certain benchmarks. But it doesn’t seem to have incorporated an adequate amount of Greek data during its training. The model seems incapable of generating text in Greek, and its instruct version doesn’t follow instructions to process Greek texts. So, unfortunately, we have to leave this model out, at least in its current version (update: we confirmed that Mistral’s new model, Mixtral, doesn’t support Greek either).

On the other hand, Meta’s Llama2 seems able to understand basic Greek language and concepts. This can be observed with the large 70-billion-parameter model: it responds in English most of the time and isn’t consistently correct, but it gives the impression that it has understood a request expressed in Greek. The smaller 7B model, by contrast, doesn’t seem able to generate Greek text or follow instructions. But this is the type of model we are most interested in. 7B-class models are generally more useful and suitable for wider adoption. Our primary focus is a model that, when quantized, can run on a desktop or laptop (even on CPU), and that is fast and efficient. Out of the box, however, it is very difficult to get a correct response or useful generated Greek text from the Llama2–7b model.
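To make the point about quantization concrete, here is a rough back-of-the-envelope sketch of how much memory the weights alone need at different precisions. It assumes a nominal 7 billion parameters (the real checkpoint is slightly smaller) and ignores runtime overhead such as the KV cache, activations and quantization metadata, so treat the numbers as lower bounds rather than measurements.

```python
# Rough weight-memory estimate for a 7B-parameter model at several precisions.
# Assumption: a nominal 7e9 parameters; runtime memory (KV cache, activations,
# quantization metadata) is NOT included.

NOMINAL_PARAMS = 7_000_000_000


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return the storage needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


if __name__ == "__main__":
    for label, bits in [("fp16", 16), ("int8", 8), ("4-bit", 4)]:
        print(f"{label:>5}: {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

At 4 bits per weight the estimate drops to a few gigabytes, which is what puts a 7B model within reach of an ordinary desktop or laptop.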

Our feeling is that, during pretraining, the model captured information that makes it capable of handling Greek, but that it needs further training and fine-tuning to reach the required performance.

To test this hypothesis, we decided to start with the Llama2–7b-chat model and fine-tune it with Greek text on a specific task.

We created a dataset of 80k documents (public articles from Greek sites), transformed to train the foundation model on the following two tasks:

a) text generation: headline -> story

b) summarization: story -> headline
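A minimal sketch of how one article could be turned into training records for the two tasks above. The record field names and the English instruction strings are hypothetical, and so is the choice to emit both directions from every article; the post does not describe the exact format that was used.

```python
from typing import Dict, List

# Hypothetical instruction strings: the exact wording used in training is not given.
GENERATE_INSTRUCTION = "Write a story for the following headline:"
SUMMARIZE_INSTRUCTION = "Write a headline for the following story:"


def article_to_examples(headline: str, story: str) -> List[Dict[str, str]]:
    """Turn one article into two training records, one per task."""
    headline, story = headline.strip(), story.strip()
    return [
        # a) text generation: headline -> story
        {"task": "generation", "instruction": GENERATE_INSTRUCTION,
         "input": headline, "output": story},
        # b) summarization: story -> headline
        {"task": "summarization", "instruction": SUMMARIZE_INSTRUCTION,
         "input": story, "output": headline},
    ]


if __name__ == "__main__":
    # Placeholder text, not taken from the real dataset.
    for record in article_to_examples("Example headline", "Example story text."):
        print(record["task"], "->", record["output"])
```

Reusing the same article in both directions doubles the number of training records without needing extra source text, at the cost of the two tasks sharing identical content.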

The model was trained with the Axolotl package, using DeepSpeed for parallelization, on a 4 x RTX 4090 GPU instance on vast.ai, for 1 epoch that took 12 hours. QLoRA was used (all target modules were trained).
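To give a sense of scale for the QLoRA setup, the sketch below counts the adapter parameters that would be trained if every linear projection in each Llama-2-7B decoder block received a LoRA adapter. Two things here are assumptions: the rank of 16 is purely illustrative (the post does not state it), and "all target modules" is read as the seven projections named in the Hugging Face Llama implementation. A LoRA adapter on a layer with `in` inputs and `out` outputs adds two small matrices with `r * (in + out)` parameters in total.

```python
# Count LoRA adapter parameters for Llama-2-7B with adapters on all linear
# projections of every decoder block.
# Assumption: rank 16 is illustrative only; the rank actually used is not stated.

HIDDEN = 4096          # Llama-2-7B hidden size
INTERMEDIATE = 11008   # Llama-2-7B MLP intermediate size
NUM_LAYERS = 32        # Llama-2-7B decoder blocks

# (in_features, out_features) per projection, named as in the Hugging Face
# Llama implementation.
TARGET_MODULES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(rank: int) -> int:
    """Each adapter adds rank * (in_features + out_features) parameters."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGET_MODULES.values())
    return per_layer * NUM_LAYERS


if __name__ == "__main__":
    print(f"Trainable adapter parameters at rank 16: {lora_trainable_params(16):,}")
```

Under these assumptions the adapters come to roughly 40 million parameters, well under 1% of the base model, which helps explain why a run like this fits on consumer GPUs.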

We can see some stats on the training and evaluation loss below.

The results of this fine-tuning process were surprisingly good! Unlike the original Llama2–7b model, our model now always responds in Greek when asked in Greek. Because it has not been trained on a variety of instructions in Greek, system prompts and the typical prompting words (such as Question, Context, Response, Answer etc.) should be written in English for the model to respond well. Given its new capabilities and our initial performance evaluation, the current model seems to be a very good candidate for RAG (Retrieval Augmented Generation) and summarization applications in Greek.
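The prompting advice above can be sketched as a small helper that keeps the scaffold words (Context, Question, Answer) in English and leaves the content in Greek. The function name, its parameters and the exact layout are assumptions made for illustration; the post does not specify a prompt template.

```python
from typing import List, Optional, Tuple


def build_prompt(
    instruction: str,
    question: str,
    context: Optional[str] = None,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Assemble a prompt with English scaffold words around Greek content."""
    parts = [instruction.strip()]
    if context:
        parts.append(f"Context: {context.strip()}")
    # Optional solved examples (few-shot), each a (question, answer) pair.
    for example_question, example_answer in examples or []:
        parts.append(f"Question: {example_question.strip()}\nAnswer: {example_answer.strip()}")
    # The final answer is left open for the model to complete.
    parts.append(f"Question: {question.strip()}\nAnswer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt(
        "Answer the following question.",
        "Ποια είναι η πρωτεύουσα του νομού Μεσσηνίας;",
        examples=[(
            "Ποια είναι η πρωτεύουσα του νομού Αρκαδίας;",
            "Η πρωτεύουσα του νομού Αρκαδίας είναι η Τρίπολη.",
        )],
    ))
```

The optional `examples` argument covers cases where the model needs a solved example before it answers correctly; for RAG-style use, the retrieved passage would go in `context`.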

Some interesting sample responses are presented below. The results are really impressive considering the number of texts used for the fine-tune, so a larger fine-tune with several types of instructions, manually and synthetically extracted from our corpus, appears to have great potential. We plan to do this along with creating a benchmark, as none exists yet for the Greek language.

Sample 11tensors model responses:

Prompt: Read the text below and answer the question using only information from the text.

Κάθε φυσικό σύστημα περιέχει (ή εναλλακτικά αποθηκεύει) μία ποσότητα που ονομάζεται ενέργεια. Ενέργεια, συνεπώς, είναι η ικανότητα ενός σώματος ή συστήματος να παραγάγει έργο. Ανάλογα με τον τρόπο που έχει αποκτηθεί, ανταλλαχθεί ή αποθηκευτεί, μπορούμε να μιλήσουμε για πολλές μορφές ενέργειας: Μηχανική ενέργεια, που συνδυάζει την κινητική και τη δυναμική. Ηλεκτρομαγνητική ενέργεια, που συνδυάζει την ηλεκτρική και τη φωτεινή ή ενέργεια ακτινοβολίας, Πυρηνική ενέργεια Θερμική ενέργεια Χημική ενέργεια Υλο-ενέργεια.

Question: Πες μου 4 διαφορετικές μορφές ενέργειας.

Answer: Μηχανική ενέργεια, ηλεκτρομαγνητική ενέργεια, πυρηνική ενέργεια, θερμική ενέργεια

Prompt: Read the context below and answer the question using only the information provided in the context.

Context: Σύμφωνα με ανακοίνωση των αρχών άγνωστοι δράστες, που ήταν ένοπλοι, επιτέθηκαν σε χρηματαποστολή επιχείρησης, η οποία ανήκει στην οικογένεια της συζύγου του Λιονέλ Μέσι, Αντονέλα. Οι δράστες επιτέθηκαν στο αυτοκίνητο που μετέφερε δύο υπαλλήλους του σούπερ μάρκετ, καθώς και τη ξαδέρφη της αγαπημένης του Αργεντινού σταρ, σπάζοντας τα παράθυρα. Αφού το διέρρηξαν, αφαίρεσαν δύο τσάντες που περιείχαν εισπράξεις ύψους 19.700 ευρώ, τα οποία επρόκειτο να κατατεθούν σε τράπεζα σε κεντρικό σημείο του Ροζάριο.

Question: Πόσα χρήματα είχαν οι τσάντες που πήραν οι διαρρήκτες?

Answer: 19.700 ευρώ

Prompt: Γράψε έναν τίτλο για την ακόλουθη ιστορία:

Με 37 έδρες σε σύνολο 150 της βουλής, το ισλαμοφοβικό ακροδεξιό κόμμα του Γκέερτ Βίλντερς αναδείχθηκε πρώτη δύναμη στις χθεσινές εκλογές στην Ολλανδία, προκαλώντας έναν σεισμό που κινδυνεύει να γίνει αισθητός σε όλη την Ευρώπη. Δεν είναι βέβαια ακόμα σαφές εάν ο Βίλντερς θα μπορέσει πραγματικά να σχηματίσει κυβέρνηση, καθώς δεν αποκλείεται να συγκροτηθεί ένας ευρύς συνασπισμός των υπόλοιπων κομμάτων εναντίον του.

Response: Ολλανδία: Αναδείχθηκε πρώτη δύναμη το ισλαμοφοβικό ακροδεξιό κόμμα του Βίλντερς

Prompt: Answer in Greek the following question.

Question: Ποιες είναι οι 5 μεγαλύτερες ελληνικές πόλεις;

Answer: Αυτές είναι οι 5 μεγαλύτερες ελληνικές πόλεις: 1. Αθήνα 2. Θεσσαλονίκη 3. Πάτρα 4. Λάρισα 5. Κοζάνη

Prompt: Read the text below and answer the question using only information from the text.

Context: Ανακοινώθηκε οι διαιτητές της 12ης αγωνιστικής. Στο ματς της ΑΕΚ με τον ΠΑΣ στα Γιάννενα ορίστηκε ο Διαμαντόπουλος με VAR τον Μανούχος. Ο Πορτογάλος Γκουστάβο Κορέια θα διευθύνει το Άρης-Παναθηναϊκός. Πιο αναλυτικά: 16.00 Κηφισιά — Αστέρας Τρίπολης: Διαιτητής: Τσακαλίδης, Βοηθοί: Παπαδάκης, Κομισοπούλου , 4ος: Βάτσιος , VAR: Παπαπέτρου , AVAR: Φωτόπουλος

17.00 ΠΑΣ Γιάννινα — ΑΕΚ: Διαιτητής: Διαμαντόπουλος, Βοηθοί: Ψάρρης, Τριανταφυλλου, 4ος: Μαλούτας, VAR: Μανούχος , AVAR: Μεϊντανάς.

19.30 Λαμία — Ατρόμητος: Διαιτητής: Βεργέτης, Βοηθοί: Στεφανής, Κορώνας, 4ος: Φωτιάς, VAR: Τσαγκαράκης, AVAR:Σαμοϊλης.

20.30 Άρης — Παναθηναϊκός: Διαιτητής: Κορέια , Βοηθοί: Τιάγκο, Tεϊξέιρα, 4ος: Τσιάρας, VAR: Εστέβες , AVAR: Nικολακάκης.

Question: Ποιος διαιτητής ορίστηκε στον αγώνα της ΑΕΚ?

Answer: Διαιτητής ορίστηκε στον αγώνα της ΑΕΚ ο Διαμαντόπουλος

Prompt: Answer the question below.

Question: Πόσα γραμμάρια έχει το 1 κιλό?

Answer: 1000

(1-shot example needed to succeed here)

Prompt: Answer the following question.

Question: Ποια είναι η πρωτεύουσα του νομού Αρκαδίας;

Answer: Η η πρωτεύουσα του νομού Αρκαδίας είναι η Τρίπολη.

Question: Ποια είναι η πρωτεύουσα του νομού Μεσσηνίας;

Answer: Η πρωτεύουσα του νομού Μεσσηνίας είναι η Καλαμάτα.

Prompt: Read the context below and answer the question using only the information provided in the context.

Context: Στο πλαίσιο του αυτοφώρου, συνελήφθησαν την Τρίτη στον Κολωνό τέσσερις ανήλικοι, ηλικίας από 13 έως 15 ετών, καθώς κατηγορούνται- κατά περίπτωση- για εξύβριση, απειλή και απλές σωματικές βλάβες. Ειδικότερα, όπως έγινε γνωστό από την Αστυνομία χθες το μεσημέρι στο πάρκο της Ακαδημίας Πλάτωνος και ενώ ήταν σε εξέλιξη εκδρομή μαθητών σχολείων δευτεροβάθμιας εκπαίδευσης διαφορετικών περιοχών, οι κατηγορούμενοι επιτέθηκαν με υβριστικές εκφράσεις, απειλές και χρήση σωματικής βίας σε βάρος πέντε μαθητών (τέσσερα αγόρια και κορίτσι ηλικίας από 16 έως 17 ετών) του άλλου σχολείου. Στο σημείο μετέβη ασθενοφόρο, το πλήρωμα του οποίου παρείχε τις πρώτες βοήθειες στους τραυματισμένους μαθητές και οι αστυνομικοί της Άμεσης Δράσης που κλήθηκαν και έσπευσαν στο σημείο, συνέλαβαν καθ’ υπόδειξη των αρμόδιων καθηγητών τους μαθητές που βιαιοπράγησαν.

Question: Πόσoι ανήλικοι συνελήφθησαν στον Κολωνό?

Answer: Συνελήφθησαν τέσσερις ανήλικοι

Written by Vassilios Antonopoulos