To what extent can LLMs speak Hungarian? Pt.2

More models, HuWNLI test, some thoughts on controllability

Péter Harang
7 min read · Jun 2, 2024

The positive responses to my previous story made me share my new results with you ahead of time. The basic process hasn’t changed, but:

  • I’ve included the HuWNLI test and tested all the models against it
  • I’ve integrated new platforms and their respective models into the test program

Please note that the tests run on a limited number of test samples, so the results should be taken with a pinch of salt. They’re more of a guideline than a statement.

What’s new

HuWNLI test

The last batch lacked the HuWNLI test. For anybody who doesn’t know, this is the extension of the Hungarian counterpart of the Winograd schema (HuWS). The test data was published by the Hungarian Research Center for Linguistics, and it’s part of the HuLU package.

The gist of the test is to decide whether the second sentence follows from the first of two very similar statements, where the meaning hinges on just a few words. The HuLU documentation describes it much better. Here’s an example:

First sentences: Pisti mindenben Feri példáját követi. Nagyon befolyásolja őt. — Pisti follows Feri’s example in everything. He greatly influences him.
Second sentence: Pisti befolyásolja Ferit. — Pisti influences Feri.
Label: 0

…because the first sentences state that Feri influences Pisti. The test data focuses on the fine details, and the models must pay attention to recognize them.

Interestingly, the second sentence is always a more compact representation of the first, so it could be used for other kinds of tests as well, but let’s save that for another time.
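
To make the format concrete, here is a minimal sketch (in Python) of how a single HuWNLI item could be turned into a prompt and checked against its label. The field names, the call_model helper, and the Hungarian instruction wording are my own assumptions for illustration, not the exact ones used in HuLU or in my test program.

# Minimal sketch only: field names and prompt wording are illustrative assumptions.
def build_huwnli_prompt(item: dict) -> str:
    """Turn one HuWNLI item into a classification prompt."""
    return (
        "Döntsd el, hogy a második mondat következik-e az elsőből! "  # Decide whether the second sentence follows from the first.
        "Válaszolj 1-gyel, ha igen, 0-val, ha nem. Ne indokold!\n\n"   # Respond with 1 if yes, 0 if not. Don't elaborate.
        f"Első mondatok: {item['sentence1']}\n"
        f"Második mondat: {item['sentence2']}"
    )

item = {
    "sentence1": "Pisti mindenben Feri példáját követi. Nagyon befolyásolja őt.",
    "sentence2": "Pisti befolyásolja Ferit.",
    "label": 0,
}

prompt = build_huwnli_prompt(item)
# answer = call_model(prompt)                     # hypothetical client call
# correct = answer.strip() == str(item["label"])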

AWS and Google platform, Titan and Gemini models

As the focus remains on the models available on the big platforms (enterprise-level services that can meet the outsourcing criteria of regulation-sensitive sectors), the two biggest ones had to be included in the tests.

Therefore, the test architecture looks like this as of now:

Ordnung muß sein — Watson is out, Google / Amazon / Nyelvtudományi Intézet in, local serving is not used (yet)

Note: Azure is not included, as its closed models come mainly from OpenAI, which is already in the tests.

My story about the AWS integration can be read here; the one about Google is coming soon.

As for the models, the Google Gemini model family, OpenAI’s GPT-3.5, and the Premier model of the Amazon Titan family were included. The other Titan text models were rubbish (which is a common theme for English-only models), so testing them was abandoned.

Updated test results

Let’s jump straight into the results. But first, here’s a little recap to help understand what we’re about to see:

  • Tests marked with a + sign use MCC as the metric (binary classification problems), while * denotes accuracy (where the output can take more than two values); a minimal sketch of both metrics follows this list.
  • The first column of each test shows the result, and the second notes how many samples were unmeasurable (the model didn’t respond in a way that could be evaluated).
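
As promised, here is a minimal sketch of how the score and the unmeasurable count could be computed for one test; scikit-learn is my choice for brevity, not necessarily what the test program actually uses.

# Sketch only: a prediction of None marks an unmeasurable response.
from sklearn.metrics import accuracy_score, matthews_corrcoef

def evaluate(labels, predictions, binary=True):
    """Return (score, unmeasurable_count) for one test."""
    pairs = [(y, p) for y, p in zip(labels, predictions) if p is not None]
    unmeasurable = len(predictions) - len(pairs)
    if not pairs:
        return 0.0, unmeasurable
    y_true, y_pred = zip(*pairs)
    score = matthews_corrcoef(y_true, y_pred) if binary else accuracy_score(y_true, y_pred)
    return score, unmeasurable

# 5 samples, one response that couldn't be evaluated
print(evaluate([0, 1, 1, 0, 1], [0, 1, 0, None, 1]))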

Let’s dig in!

Openly accessible models’ test results

I didn’t anticipate much change in the ranking from the new test: larger and newer models perform better than the older / smaller ones.

Support for the Llama2 model has been deprecated, and there’s no reason to continue testing it. It’s an interesting phenomenon that models age very fast, and their upgrade / replacement should be anticipated in the surrounding systems and processes.

Results of closed ecosystems’ models

The baseline is still the Llama3 70B model, as it performed the best among the open models.

Results, rankings based on the simple average of the tests

The ranking is still skewed by the unprocessable responses. I’m still thinking about how to represent this. Fortunately, it only matters for Titan Premier vs. Sonnet and for the Gemini 1.0 and 1.5 models.

You can easily get lost in the numbers, so here’s a visualization of the results, test by test:

MCC values are scaled to the [0..1] range, so the same scale can be used for every chart

Interesting findings

The first thing that pops up is the poor performance of GPT-3.5. It’s the most intriguing result for me, as my subjective impression of the model’s linguistically correct generation is really good. Despite that, it performed the worst in three of the understanding / classification tests.

The second interesting finding was the “degradation” of the Gemini models. After the 1.0 Pro model’s good results, the instability and, in more than one case, the performance drop of the 1.5 Pro model came as a surprise.

The third note is the good start of Titan Premier. Its performance is comparable to Sonnet’s, but it showed more stable behavior and adhered to the instructions better than any other model. I can’t wait to see what the Amazon models will bring to the table.

Controllability

I noted during the HuCOPA tests that certain models couldn’t be instructed properly to respond with only a single number: in many cases, the more specific prompts produced worse results. This behavior required the introduction of regexp-based cleaning to get results that can be processed automatically.

Besides the definition of the task, I used three different instructions in the system prompt (to be exact, their Hungarian counterparts):

  • “Respond with x if…”, “only respond with x or y”, “don’t elaborate” — HuCOLA, HuWNLI
  • “Respond with x, if…”— HuRTE
  • “Start your answer with…” — HuCoPA, HuSST, HuCB

The different instructions were necessary because of the smaller models: for the more complex tests, they performed better with a more lax formulation of the task.
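
As an illustration, the mapping could look like the sketch below; the English wording and the dictionary layout are my assumptions here, since the real prompts were written in Hungarian.

# Sketch: per-test instruction styles (the actual prompts were in Hungarian).
INSTRUCTIONS = {
    "HuCOLA": "Respond with 1 if the sentence is grammatically correct, 0 if not. "
              "Only respond with 0 or 1. Don't elaborate.",
    "HuWNLI": "Respond with 1 if the second sentence follows from the first, 0 if not. "
              "Only respond with 0 or 1. Don't elaborate.",
    "HuRTE":  "Respond with 1 if the statement follows from the text.",
    "HuCoPA": "Start your answer with the number of the more plausible alternative.",
    "HuSST":  "Start your answer with the sentiment label.",
    "HuCB":   "Start your answer with the entailment label.",
}

def system_prompt(test_name: str, task_definition: str) -> str:
    """Combine the task definition with the test-specific instruction."""
    return f"{task_definition}\n{INSTRUCTIONS[test_name]}"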

Required responses vs poetic freedom

The trivial case of responding with only a single-digit number is not interesting. On the other hand, we have the cases where the model:

  • starts the generation with a formatting character: **1** blablalblalbala
  • elaborates without an explicit request: 1, mert a szövegben az áll, hogy “Alison Hargreaves brit hegymászó lesz az első nő… (“1, because the text says that ‘Alison Hargreaves, the British climber, will be the first woman…’”)
  • puts the verdict not at the beginning of the generation, but at the end: Mondat helyes: 1 (“Sentence correct: 1”)
  • answers correctly, but elaborates in English with line breaks: 1\n(The requested content is present in the original text.)
  • elaborates the correct answer, but doesn’t include the numeric notation (this is my favorite): A mondat helyes (“The sentence is correct”)

It’s clear as day that the models can generate a great variety of responses to not-thoroughly-formulated requests. The principle of programming is true for prompting as well: it will (try to) do what you’ve instructed. But the principle has to be extended: within the bounds of the task, it might do something that you have not forbidden. This can’t be held against the model; this is the poetic freedom of LLMs.
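
For completeness, here is a minimal sketch of the kind of regexp-based cleaning mentioned earlier, tuned to the response patterns listed above; the actual rules in my test program differ, so treat it as an illustration.

import re

def clean_response(raw: str):
    """Extract the single-digit verdict from a free-form model response."""
    text = raw.strip()
    # "**1** blabla", "1, mert ...", "1\n(The requested content ...)"
    match = re.match(r"\W*(\d)\b", text)
    if match:
        return match.group(1)
    # "Mondat helyes: 1" -- the verdict sits at the end of the response
    match = re.search(r"(\d)\W*$", text)
    if match:
        return match.group(1)
    return None  # e.g. "A mondat helyes" -- no numeric notation, unmeasurable

print(clean_response("**1** mert a két mondat összefügg"))  # "1"
print(clean_response("Mondat helyes: 1"))                   # "1"
print(clean_response("A mondat helyes"))                    # None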

Results

Let’s check how well the models can follow our simple instructions:

The percentage of the responses that fit the prompt formatting rules

The results show two things: the models generally handle the instructions well, but this is probabilistic in nature. When a model responds “mostly correctly”, the programmatic integration will either fail, or you have to be prepared to post-process the LLM’s output.

I see examples of fine-tuned models that enforce output formatting. It will eventually find its way into the foundation models as well: Gemini already advertises this behavior (forced JSON output).
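
As a sketch of what that can look like, the snippet below uses the google-generativeai SDK’s JSON output option; the exact parameter names may differ between SDK versions, so check the current documentation before relying on it.

# Sketch: forcing JSON output with Gemini (assumes an API key is already configured).
import google.generativeai as genai

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    generation_config={"response_mime_type": "application/json"},
)
response = model.generate_content(
    'Válaszolj JSON-ban {"cimke": 0 vagy 1} formában: Pisti befolyásolja Ferit?'
)
print(response.text)  # should be machine-parseable, e.g. {"cimke": 0}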

Wishful thinking

The controlling part of the tests was aimed at retrieving only numbers. At least that was the intention.

This approach means we could limit the maximum number of generated tokens to 1 or 2. This would act as a safety net, which comes in handy with models that can handle large context windows.
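
Here is a sketch of such a safety net with the OpenAI client; the same knob exists under different names on the other platforms, and the Hungarian prompt text is only an example.

# Sketch: cap generation at a couple of tokens so only the verdict can be emitted.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    max_tokens=2,  # safety net: the expected verdict is a single digit
    messages=[
        {"role": "system", "content": "Válaszolj csak 0-val vagy 1-gyel! Ne indokold!"},
        {"role": "user", "content": "Pisti mindenben Feri példáját követi. Nagyon befolyásolja őt. Pisti befolyásolja Ferit?"},
    ],
)
print(response.choices[0].message.content)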

If I re-calculate the models’ outputs with this in mind (so a response only counts as good if the raw output and the cleaned output are equal), the results change a bit:

The LLM responds with only one number

There are some models that figured out what we intended, despite the lax instructions. Note the 100% results in the HuCoPA, HuSST and HuCB tests.

Summary

The introduction of new platforms and models makes the results more interesting. The poor performance of the GPT-3.5 model is an oddity that has to be investigated further.

The behavior under controlling attempts is an interesting phenomenon, and it yields a rule that must be followed: we can’t expect the model to respond exactly the way we think we instructed it to, and it’s mandatory to prepare for such responses.

The tests returned additional data as well; I’ll continue evaluating it in a follow-up story.


Péter Harang

I design and build complex, heavily integrated IT ecosystems for the banking sector, focusing on e-channels and AI