Sergey Tsimfer
Data Analysis Center
7 min read · Mar 5, 2024


In our previous article, we covered the usage of LLMs in geology: recipes for working with open-source models, common gotchas, and techniques for augmenting a model's knowledge. We also described potential use cases (general assistance, planning, document analysis), which we have extensively tested in our seismic exploration workflow.

Today we want to assess the quality of open-source language models. You will learn about different types of metrics and benchmarks, and, finally, we'll discuss the results of our evaluation of popular models.

The why

Before we begin, let's consider whether we really need our own method of evaluating models. After all, there are already several comprehensive and thorough leaderboards (HuggingFace, Toloka). Why can't we just rely on them?

The main part of the answer is specificity. We are not interested in general performance alone; we want to compare LLMs on tasks closely related to our geoscience domain. Using your own dataset means computing the performance metrics yourself, which gives a much better assessment of quality for your particular needs.

Furthermore, if the plan is to fine-tune models on domain-specific data, metrics are an absolute must. As usual in ML, automated numerical evaluation is crucial for making progress, and training models without it is meaningless.

The how

To start evaluating LLMs, we need questions and corresponding reference answer(s). For this purpose, we'll use the GeoBench dataset: a collection of such pairs, open-sourced by the creators of the K2 model. Let's look at an example from it:

Question: The umbrella theory explaining the Earth’s movement, contact, and flattening of large land plates is known as:
A. the Coriolis effect
B. plate tectonics
C. hotspots
D. the Richter Magnitude Scale
E. the subduction zone

What kind of output can we expect? Depending on the model, the generated answer may look like any of the following:

  • B
  • (B)
  • plate tectonics
  • B. plate tectonics
  • The correct answer to your question is (B) Plate Tectonics. This theory explains the movement, collision, and flattening of large land plates on Earth. The Coriolis effect (A) refers to the apparent deflection of objects (like wind or ocean currents) due to Earth’s rotation, while hotspots (C) are areas of intense volcanic activity caused by mantle plumes. The Richter Magnitude Scale (D) measures the energy released during earthquakes, and a subduction zone (E) is a region where one tectonic plate slides beneath another, leading to earthquakes and volcanic activity.

Even though all of these answers are technically correct, evaluating them is not easy. Depending on the context, either a more concise or a more detailed answer may be desired. Correspondingly, there are metrics for assessing either very short outputs (based on accuracy) or wordier ones (BLEU, ROUGE, …). So how do we judge our models?

Accuracy-based metrics

The easiest approach is to specify the instructions more precisely. As usual, we use the prompt for this: by prefacing each question with “This is a multiple choice question. Answer with one letter and one letter only”, we narrow the expected answer down to exactly one form.

This is the most common approach: MMLU, the most popular benchmark, is implemented exactly like that. Under the hood, however, there is an additional choice in how the metric is computed, depending on which tokens are considered.

image from: https://blog.allenai.org/a-guide-to-language-model-sampling-in-allennlp-3b1239274bc3

Language models generate a probability distribution for the next token over the set of known tokens. Given the sentence “2 + 2 =”, we hope that the probability of the token “4” is higher than that of any other number-like token, but that does not drive the other probabilities to zero.
At the same time, we also hope that the token “3” has a higher probability than something completely unrelated, for example, the token “the”. With the answer “3”, the model output would still be incorrect, but at least the question and the instruction were understood correctly.
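To make this concrete, here is a minimal sketch of reading off the next-token distribution with the Hugging Face transformers library (gpt2 is just a stand-in for any causal LM):

    # Minimal sketch: inspect next-token probabilities of a causal LM.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; any causal LM works the same way
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("2 + 2 =", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (1, seq_len, vocab_size)
    probs = torch.softmax(logits[0, -1], dim=-1)   # distribution over the next token

    for candidate in [" 4", " 3", " the"]:
        token_id = tokenizer.encode(candidate, add_special_tokens=False)[0]
        print(f"P({candidate!r}) = {probs[token_id].item():.4f}")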

For our evaluation, this means that there are multiple possible approaches. The token corresponding to the correct letter should have the highest probability among:

  • all of the model tokens (often referred to as the HELM implementation);
  • letter tokens only (the ORIGINAL implementation of this metric);
  • something in between the two previous options: in our evaluation, we use all of the model tokens except technical ones (like linebreak or generation-start symbols).

These options correspond to somewhat different flavors of testing: the first requires both a correct understanding of the instruction and a correct output, while comparing only a few letter tokens is much easier and assesses raw model knowledge (see the sketch below).
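Here is a rough sketch of the two extreme flavors, with variable names of our own choosing: probs is the next-token distribution at the answer position, letter_ids are the token ids of the letters A–E, and correct_idx is the index of the reference letter.

    # Sketch of the two accuracy flavors; names are illustrative, not our exact code.
    import torch

    def check_answer(probs: torch.Tensor, letter_ids: list, correct_idx: int):
        # HELM-style: the correct letter must out-rank every token in the vocabulary
        helm_correct = probs.argmax().item() == letter_ids[correct_idx]
        # ORIGINAL-style: the correct letter only has to out-rank the other letters
        original_correct = probs[letter_ids].argmax().item() == correct_idx
        return helm_correct, original_correct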

Following the most common implementations, we also add a few examples to the model prompt and finish it with “Answer: ”.
Our full implementation will be available shortly.
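For illustration, such a prompt could be assembled as follows; the instruction wording matches the one above, while the few-shot example is a placeholder of ours:

    # Hypothetical prompt assembly in the spirit of MMLU-style evaluation.
    INSTRUCTION = ("This is a multiple choice question. "
                   "Answer with one letter and one letter only.\n\n")

    FEW_SHOT = (  # placeholder few-shot example
        "Question: Which mineral has a hardness of 10 on the Mohs scale?\n"
        "A. quartz\nB. talc\nC. diamond\nD. calcite\n"
        "Answer: C\n\n"
    )

    question = (
        "Question: The umbrella theory explaining the Earth's movement, "
        "contact, and flattening of large land plates is known as:\n"
        "A. the Coriolis effect\nB. plate tectonics\nC. hotspots\n"
        "D. the Richter Magnitude Scale\nE. the subduction zone\n"
        "Answer: "
    )

    prompt = INSTRUCTION + FEW_SHOT + question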

Benchmark

We have computed the accuracy metric (with all of the mentioned implementations) and ranked models according to it. So far, we've included the most popular models (the llama-2 family), currently trending ones (Mixtral and miqu), as well as models that directly target geology (K2 and geogalactica). To put things into perspective, we've also evaluated some llama-1 LLMs. Take a look:

Evaluation results. Sorted by the rightmost (ORIGINAL) column.

Other than just saying “model A is better than model B on GeoBench”, we can see that:

  • as expected, values for HELM are considerably lower than for the ORIGINAL implementation. This is because the correct token is less likely to be the top choice among all of the model tokens than among the letter tokens alone. The difference between these two columns can be seen as the gap between raw knowledge (ORIGINAL) and the ability to actually apply it appropriately (HELM);
  • fine-tuned geological models (K2, geogalactica) are significantly worse at following instructions than their base LLMs. As we pointed out in the previous article, this is one of the common pitfalls of fine-tuning;
  • fine-tuned geological models lag behind general-purpose ones by a wide margin. Partly, this may be attributed to having fewer parameters and not keeping up with the latest architectural advances;
  • while the table is sorted by the rightmost column (accuracy on letter-only tokens), model size is clearly a big factor. Also, the discrepancy between accuracy implementations is large for small models, which are notoriously worse at following instructions;
  • overall, the progress over just one year (even less, come to think of it) is mind-blowing. Comparing llama-2 and llama-1 in terms of error rates (25.6% vs 31.4%), there is almost a 20% relative improvement! Notably, this growth is even more pronounced for smaller sizes;
  • quantized models lose a few percentage points of accuracy, with small models losing more;
  • finally, Mixture of Experts models (Mixtral and miqu) do exceptionally well, while also being considerably faster.

It should be noted that this kind of benchmark is not perfect: just like MMLU, the dataset contains a certain percentage of incorrect or incoherent examples. Nevertheless, until we approach very high (90+%) accuracy, we can still use it to rank LLMs.

We plan to regularly update this benchmark with new models, so don't hesitate to suggest which ones to add!

Metrics for generation

Accuracy-based metrics have a huge downside: they work only with quiz-like tasks, where the correct answer is short, unambiguous, and known in advance. When it comes to comparing the quality of longer (and more diverse!) outputs, your options are quite limited:

  • BLEU and ROUGE compare word overlaps between the target sentence and the generated one;
  • BERTScore and MoverScore compute distances between embeddings of the target and generated sentences. Unlike BLEU and ROUGE, these two are less sensitive to synonym substitutions;
  • perplexity and cross-entropy evaluate the model's certainty in the answer, given the input sequence. These are the same quantities used during model training, and they measure confidence in the answer rather than its quality.

As you can see, the nature of these metrics also limits their applicability: some are better suited for summarization, others for translation or free-form generation. Also, when dealing with longer outputs, metric values can fluctuate a lot due to sampling randomness.
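If you do need them, most of these metrics are available off the shelf; below is a minimal sketch using the Hugging Face evaluate library (assuming evaluate and the underlying scoring packages such as sacrebleu, rouge_score and bert_score are installed; the sentences are toy examples):

    # Minimal sketch of computing generation metrics with the `evaluate` library.
    import evaluate

    predictions = ["Plate tectonics explains the movement of large land plates."]
    references = ["The theory of plate tectonics explains plate movement."]

    bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
    rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
    bertscore = evaluate.load("bertscore").compute(
        predictions=predictions, references=references, lang="en",
    )
    print(bleu["bleu"], rouge["rougeL"], bertscore["f1"][0])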

The final metric we want to mention has a human in the loop: LLM Arena. Essentially, it is an Elo-based leaderboard of models, both open-source and proprietary. Thanks to human feedback and its closeness to the actual use cases of chat systems, it may be the most accurate estimate of performance. If you have narrowed the set of tested models down to just a few candidates, this setup may be the best choice.
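For intuition, a single Arena-style pairwise vote updates the ratings roughly like the textbook Elo rule below (a generic sketch; the actual leaderboard bookkeeping is more involved):

    # Generic Elo update after one pairwise comparison (not the Arena's exact code).
    def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
        """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
        expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
        delta = k * (score_a - expected_a)
        return rating_a + delta, rating_b - delta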

Conclusion

In this article, we've talked extensively about measuring model performance, mainly on multiple-choice questions. As we can see, general open-source models outperform the specifically fine-tuned ones by a huge margin.

We hope to add new models to the table and update it periodically to reflect the progress of the LLM community. If you have a model in mind, be sure to leave a suggestion in the comments.

Now that we know how to evaluate models properly, it is time to talk about actual fine-tuning: that will be the theme of our next article.

Either way, stay (sic!) tuned!


Sergey Tsimfer
Data Analysis Center

Machine learning engineer with years of experience in automating geoscience