To what extent can LLMs speak Hungarian?
Unless we test a lot, we won’t find out.
Note: this is the translation of my original Hungarian story, which can be found here.
Lately, our customers have been asking us to use some kind of AI in our solutions. Most of the time it’s about introducing a RAG system, so there’s nothing new there. The interesting, new part is that the system should communicate in Hungarian, whereas the current implementations and examples tend to excel on English-speaking markets.
In our discussions, there’s a recurring theme of “I have already tried xyz model at home, and it speaks Hungarian quite well”.
Unfortunately, these subjective samplings don’t carry much weight, and a HumanEval for Hungarian simply does not exist. It gets worse: there is no leaderboard for tests focusing on Hungarian language capabilities either. Hence the idea to create one, or at least to try to measure these capabilities.
The vision is to have something akin to automated (unit) tests that can be run on new models, producing an objective measurement that indicates whether a model is worth examining further with more exhaustive (human) tests.
Hungarian Language Understanding test
Obviously, I won’t start generating tests by myself, as I’m not a linguist. But Google is my friend, and we stumbled upon the HuLU package of the Hungarian Research Center for Linguistics and started to play around with the data. The package consists of multiple tests:
- HuCOLA: checks for linguistic acceptability.
  Example: “Az Angliáról való könyv tetszik.” => false
  Translation: “I like the book from the England.” (the awkwardness mirrors the Hungarian original)
- HuCoPA: checks for choosing correctly between plausible alternatives.
  Example: “A fürdőt elöntötte a víz.” Lehetséges ok: 1 — “A WC túlcsordult.” 2 — “Elromlott a bojler.” => 1
  Translation: “The bathroom has been flooded.” Possible cause: 1 — the toilet overflowed. 2 — the water heater malfunctioned.
- HuRTE: checks for recognizing textual entailment.
  Example: “Még nem találtak tömegpusztító fegyvereket Irakban.” Hipotézis: “Tömegpusztító fegyvereket találtak Irakban.” => false
  Translation: “WMDs haven’t been found in Iraq yet.” Hypothesis: “WMDs have been found in Iraq.”
- HuSST — Hungarian Stanford Sentiment Treebank: examines the sentiment of a sentence on a positive/neutral/negative scale.
  Example: “Hatásokkal teli, de túl langyos filmbiográfia.” => negative
  Translation: “A film biography full of impact, but ultimately too lukewarm.”
- HuCB — Hungarian Commitment Bank: examines the connection between the base sentence and the hypothesis on a contradiction/neutral/entailment scale.
  Example: “‘Piroska néni pályafutása végére teljesen megőrült’ Kár érte! Az énekóra volt a csúcs. De ne képzeljétek el, hogy csütörtök, 6. órában!” Felvetés: “A beszélő szerint csütörtökön 6. órában jó volt az ének óra.” => contradiction
  Translation: “‘Aunt Piroska completely lost her mind by the end of her career.’ It’s a pity! The singing class was the highlight. But don’t imagine it was during the 6th period on Thursday!” Assertion: “According to the speaker, the singing class was good during the 6th period on Thursday.”
- HuWNLI — currently not in my scope, maybe later.
It’s important to note that these tests focus on recognizing grammatical correctness, not on generation capabilities. Therefore, I had to make a fundamental assumption:
Hypothesis: in the case of LLMs, there is a strong relationship between recognizing grammatical correctness and the ability to generate grammatically correct text.
My assumption is based on the observation that LLMs tend to exhibit similar crossover behavioral patterns, which makes this worth testing. Besides, one has to start somewhere…
Approaching the problem
If we want to orchestrate an automated test, we have to examine the flow of information and the tasks we need to perform. Let’s start at the beginning.
Data
We have different datasets within HuLU, and they all have different data structures. Their common ground is that the data comes in JSON format.
This is optimal from a character-encoding standpoint, as Hungarian text processing always suffers from figuring out which encoding to use. With that question eliminated, what remains is that we have to prepare to handle multiple input structures.
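A minimal sketch of what this normalization can look like, assuming each file parses into a JSON array that gets mapped onto a common record shape. The field names (Sent, Label, premise, choice1, …) and the loader itself are illustrative assumptions, not the exact HuLU schema:

```typescript
import { readFileSync } from "fs";

// The common shape every dataset is normalized into.
interface TestRecord {
  id: string;
  promptInput: string; // the text the prompt template will receive
  expected: string;    // the gold label, stored as a string
}

type Normalizer = (raw: any) => TestRecord;

// One normalizer per dataset, because every HuLU test has its own structure.
// The field names below are assumptions for illustration.
const normalizers: Record<string, Normalizer> = {
  hucola: (r) => ({ id: String(r.id), promptInput: r.Sent, expected: String(r.Label) }),
  hucopa: (r) => ({
    id: String(r.id),
    promptInput: `${r.premise}\nLehetséges ok: 1 — ${r.choice1} 2 — ${r.choice2}`,
    expected: String(r.label),
  }),
  // ...hurte, husst and hucb follow the same pattern
};

// Assumes the file contains a JSON array of records.
function loadDataset(name: string, path: string): TestRecord[] {
  const raw = JSON.parse(readFileSync(path, "utf-8")) as any[];
  return raw.map(normalizers[name]);
}
```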
Models and platforms
It’s clear that a comprehensive test requires a multitude of models to be examined. Implementation and empirical experience show that closed models focus more on multi-language capabilities, but these (with a few exceptions) can only be used on their own proprietary platform. It brings joy to my heart to note that every platform has its own API.
What we need to take into consideration during the implementation and test execution:
- parametrization of parallelism
- throttling settings
- timeout handling
- platform-specific authentication solutions
None of this makes my life easier.
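A sketch of the kind of plumbing each platform ends up needing: a common client interface, a cap on parallelism, and a timeout around every call. The names (ModelClient, runLimited, …) are mine, not from any actual SDK; retries and platform-specific authentication are left out:

```typescript
// Every platform gets wrapped behind the same minimal interface.
interface ModelClient {
  platform: string;
  model: string;
  complete(prompt: string): Promise<string>;
}

// Reject a call that takes longer than `ms` milliseconds.
// (The timer is not cleared in this sketch.)
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    promise,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`timeout after ${ms} ms`)), ms)
    ),
  ]);
}

// Run tasks with a fixed degree of parallelism — a poor man's throttle.
async function runLimited<T>(tasks: (() => Promise<T>)[], limit: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  const workers = Array.from({ length: limit }, async () => {
    while (next < tasks.length) {
      const i = next++;
      results[i] = await tasks[i]();
    }
  });
  await Promise.all(workers);
  return results;
}
```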
Prompts
The HuLU dataset consists of six different tests. For the evaluation, I prepared an execution prompt for each one of them along these lines:
- The language of the prompt is Hungarian. A proficient LLM is expected to be steerable by instructions written in the language of the desired output. If the model fails at this task, it won’t be good in production either.
- There’s no overcomplicated / overengineered prompting. The aim of the test is to observe the models’ capabilities relative to each other, and NOT to squeeze out maximum performance with model-specific prompts.
- The LLMs are instructed to return their answer in numerical form. We don’t want the exact labels to be returned, as that would hinder the answer-cleanup process. It also introduces a small transformational task as well.
By carefully pre-testing the prompts, I made sure that even the small models are able to answer each type of test correctly; I expect the larger models to fulfill the task effortlessly.
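To make this concrete, here is an illustrative prompt builder for HuCOLA: Hungarian instructions with a single-digit answer requested. This is only the shape of the prompt, not the exact wording I used:

```typescript
// Builds a Hungarian prompt that asks for a numeric answer (1 = acceptable, 0 = not).
// Rough translation of the instructions: "Evaluate the sentence below for linguistic
// acceptability. Answer with a single number: 1 if the sentence is correct, 0 if not."
function buildHucolaPrompt(sentence: string): string {
  return [
    "Nyelvi helyesség szempontjából értékeld az alábbi mondatot.",
    "Válaszolj egyetlen számmal: 1, ha a mondat helyes, 0, ha nem az.",
    "",
    `Mondat: "${sentence}"`,
    "Válasz:",
  ].join("\n");
}
```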
Cleanup process
It turned out during the test phase that as the tasks were getting more complex, it was much harder to instruct the model to respond with a single-digit answer. A cleanup process was introduced in the form of a regexp, which extracts the first number found in the response. This is analogous to what we would do in a production-ready system, so with this little help, we aren’t committing a major crime.
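The cleanup itself can be as small as this; it returns null when no number can be found in the response at all:

```typescript
// Pull the first (possibly negative) integer out of whatever the model returned.
function extractAnswer(response: string): number | null {
  const match = response.match(/-?\d+/);
  return match ? parseInt(match[0], 10) : null;
}

extractAnswer("A helyes válasz: 1, mert ...");  // => 1
extractAnswer("Nem tudom eldönteni.");          // => null
```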
Let’s bring it all together
I’ve prepared a TypeScript solution for running the tests with the different source data, their relevant prompts, the different platforms and the multitude of models:
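Roughly, the runner iterates over datasets and models, builds the prompt, calls the model and cleans up the answer. The sketch below reuses the helpers from the earlier sketches (TestRecord, ModelClient, withTimeout, runLimited, extractAnswer) and leaves out retries and authentication:

```typescript
interface TestResult {
  dataset: string;
  model: string;
  recordId: string;
  expected: string;
  answer: number | null; // null = no numeric answer could be extracted
}

async function runSuite(
  datasets: Record<string, TestRecord[]>,
  clients: ModelClient[],
  buildPrompt: (dataset: string, record: TestRecord) => string
): Promise<TestResult[]> {
  const results: TestResult[] = [];
  for (const [dataset, records] of Object.entries(datasets)) {
    for (const client of clients) {
      const tasks = records.map((record) => async () => {
        const prompt = buildPrompt(dataset, record);
        const raw = await withTimeout(client.complete(prompt), 60_000);
        results.push({
          dataset,
          model: `${client.platform}/${client.model}`,
          recordId: record.id,
          expected: record.expected,
          answer: extractAnswer(raw),
        });
      });
      await runLimited(tasks, 4); // modest parallelism to stay under rate limits
    }
  }
  return results;
}
```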
The results of the tests are saved in Excel files, as it’s easier to manage / visualize the data that way during manual examination:
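Saving the results is a few lines per workbook; the sketch below uses the xlsx (SheetJS) package as an example, which may differ from the library used in the actual solution, and takes the TestResult rows from the runner above:

```typescript
import * as XLSX from "xlsx";

// Write one worksheet with one row per test case.
function saveResults(results: TestResult[], path: string): void {
  const sheet = XLSX.utils.json_to_sheet(results);
  const book = XLSX.utils.book_new();
  XLSX.utils.book_append_sheet(book, sheet, "Results");
  XLSX.writeFile(book, path);
}
```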
Test results
Ok, let’s see the results. The + sign shows that the measurement was calculated as MCC (for binary classification problems), and the * sign indicates that accuracy was measured (where the output may vary between more than two possible answers).
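For reference, this is what the two scores boil down to; a minimal sketch, with the counts taken from the cleaned-up answers:

```typescript
// Matthews correlation coefficient for the binary tests (+ columns).
function mcc(tp: number, tn: number, fp: number, fn: number): number {
  const denom = Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
  return denom === 0 ? 0 : (tp * tn - fp * fn) / denom;
}

// Plain accuracy for the tests with more than two labels (* columns).
function accuracy(predicted: (number | null)[], expected: number[]): number {
  const hits = predicted.filter((p, i) => p === expected[i]).length;
  return hits / expected.length;
}
```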
It’s important to note that these tests were run on a partial dataset: the first 100 records of the training data. This is because I was focusing on the quantity of models and on trying to visualize any trends that might appear.
Freely accessible models’ test results
I don’t expect too much from the freely accessible (and/or open source) models. Experience shows that their capabilities go along the lines of “yes, they know something, but their generation capabilities resemble an older version of Google Translate”. Therefore, I mostly include them to watch their capabilities grow.
The reference point for the small models is located in the first row, named random reference: this shows what happens if we answer in a totally random way. (The deviation from the statistical expectation is due to the fact that the tests’ labels don’t follow a uniform distribution.)
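My reading of that baseline, as a minimal sketch: answer every record with a random label and score it exactly like a model output. This is an illustration built on the accuracy helper above, not necessarily the exact procedure used:

```typescript
// Score uniformly random guesses against the real labels.
function randomReference(expected: number[], labelCount: number): number {
  const guesses = expected.map(() => Math.floor(Math.random() * labelCount));
  return accuracy(guesses, expected);
}
```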
The first column of a test shows the quality of the model; the second number shows the proportion of incomprehensible results (the model did not respond in a way that allowed extracting a numerical value within the expected range).
Results of closed ecosystems’ models
The interesting match is between OpenAI and Anthropic, as everybody has good experience with ChatGPT, and for the most part, the Opus model is considered on par with (or even better than) GPT-4.
The reference point in this case is not the random baseline, but the best openly available model (Llama3 70B).
IBM’s Granite model (which is purposefully trained on an English-only corpus) fails on the simplest Hungarian test, so further testing was abandoned.
Evaluation
Model behavior
When you run 65 tests, you will inevitably run into surprises:
- The Gemma 7B model (rolled out earlier this year) responded exceptionally well despite its size.
- The Sonnet and Granite models didn’t live up to expectations; they performed simply badly.
- The Llama3 70B model (not counting one test) outperformed the Anthropic Haiku model, which I consider a great feat.
The most important phenomenon is that the difference between the old (Llama2) and the new (Gemma / Llama3) models is enormous. Despite being one-tenth the size, the new ones are able to generate better results, and this holds true in the case of the smaller models that have been fine-tuned on a Hungarian-specific corpus (Puli Llumix).
The most important revelation is that the openly available models’ quality doesn’t even come close to the closed ones. This is a classic case of “we knew it, but we didn’t realize it”. Now we have numbers to show for it.
Test quality
I’ve noticed that several models consistently generated incorrect answers for certain test data. I’d find it useful to examine these case-by-case, but honestly, I don’t feel the strength or knowledge within me to contradict the experts who prepared the test data.
Another important observation occurred with tasks involving more than two options (such as HuSST and HuCB): the models consistently misclassify the neutral-labelled cases, whereas negative and positive test cases are easily (and sometimes flawlessly) recognized. This pattern repeats with every model.
Summary
In its current state, I cannot yet answer the fundamental question, as this is just the beginning of the journey. However, it is promising that the subjective experiences are reflected in the tests: the GPT-4 and Opus models are orders of magnitude better than anything else. Based on this, I am convinced that I am on the right track.
Stay tuned for the follow-up…