Benchmark of LLMs (Part 2): MMLU, HELM, EleutherAI LM Eval

Michael X
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
6 min read · Sep 25, 2023

To catch up on the previous articles in the “Evaluation of LLM” series, please see the earlier installments.

Measuring Massive Multitask Language Understanding (MMLU)

In the MMLU[15], the authors introduce a comprehensive new test designed to evaluate text models’ multitask accuracy and capabilities across a broad range of subjects. This test comprises 57 tasks that cover diverse fields such as elementary mathematics, US history, computer science, law, and many more. The goal is to assess a model’s world knowledge and problem-solving abilities in these varied domains.
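Each MMLU question is multiple choice with four options and is typically answered in a few-shot setting, with up to five solved examples from the subject’s dev split placed before the test question. The sketch below illustrates that format; the `cais/mmlu` dataset name on the Hugging Face Hub and its `question`/`choices`/`answer` field names are assumptions about the public mirror rather than something defined in the paper itself.

```python
# Illustrative sketch: format an MMLU-style multiple-choice prompt.
# Assumes the "cais/mmlu" mirror on the Hugging Face Hub with fields
# question (str), choices (list of str), and answer (int index) --
# verify against the version you actually download.
from datasets import load_dataset

LETTERS = ["A", "B", "C", "D"]

def format_question(example, include_answer=False):
    """Render one question in the canonical four-choice MMLU layout."""
    lines = [example["question"]]
    for letter, choice in zip(LETTERS, example["choices"]):
        lines.append(f"{letter}. {choice}")
    lines.append("Answer:" + (f" {LETTERS[example['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_few_shot_prompt(dev_examples, test_example, subject):
    """Prepend k solved dev examples (5 in the paper) before the test question."""
    header = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    shots = "\n\n".join(format_question(ex, include_answer=True) for ex in dev_examples)
    return header + shots + "\n\n" + format_question(test_example)

if __name__ == "__main__":
    subject = "high_school_physics"
    dev = load_dataset("cais/mmlu", subject, split="dev")
    test = load_dataset("cais/mmlu", subject, split="test")
    prompt = build_few_shot_prompt(list(dev)[:5], test[0], subject.replace("_", " "))
    print(prompt)
```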

Figure 7: Performance with different sizes of language models

Figure 7 illustrates the performance of different model sizes on a commonsense benchmark (HellaSwag[16]), a linguistic understanding benchmark (SuperGLUE), and the massive multitask test (MMLU). On the earlier benchmarks, even smaller models perform well above random chance and improve steadily as model size grows; on the MMLU, by contrast, only the largest GPT-3 model surpasses random-chance performance.

This difference in performance underscores the increased complexity and diversity of the MMLU, which requires extensive world knowledge and problem-solving abilities across a wide array of subjects. Smaller models struggle to demonstrate significant competence in this challenging benchmark, emphasizing the need for more advanced and larger models to address the test’s broad scope and varying difficulty levels.

The fact that only the largest GPT-3 model moves beyond random chance performance on the MMLU highlights the gap between current benchmarks and the actual capabilities of language models. It raises questions about the scalability of these models and their potential to achieve human-like performance across various tasks and domains. This observation offers valuable insights for researchers, motivating them to develop more sophisticated models and improve AI’s understanding and problem-solving capabilities.

Figure 7 also reveals that, while most GPT-3 model sizes exhibit near-random-chance accuracy (25% on four-way multiple-choice questions), the largest GPT-3 model demonstrates a considerable improvement, outperforming random chance by nearly 20 percentage points on average. Despite this progress, even the top-performing models have significant room for improvement before they can achieve expert-level accuracy across all tasks.

The MMLU covers a wide range of subjects, including STEM, humanities, social sciences, and other specialized areas. The researchers found that GPT-3’s performance is uneven across subjects, with the model excelling in certain areas while performing at near-random levels in others. Notably, tasks involving heavy calculations, such as physics and mathematics, as well as subjects related to human values like law and morality, pose significant challenges for the model.

Another key finding of the MMLU is that GPT-3 lacks a reliable understanding of its own knowledge and limitations. The model’s average confidence can deviate substantially from its actual accuracy, indicating that it often fails to recognize when it is incorrect or uncertain.
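That calibration gap can be made concrete with a standard measure such as expected calibration error (ECE), which buckets predictions by stated confidence and compares each bucket’s average confidence with its actual accuracy. Below is a minimal, self-contained sketch of that computation; the bin count and the synthetic data are arbitrary illustration choices, not values from the paper.

```python
# Minimal sketch of expected calibration error (ECE): bucket predictions by
# confidence and compare each bucket's average confidence to its accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        avg_conf = confidences[mask].mean()   # what the model believes
        avg_acc = correct[mask].mean()        # how often it is actually right
        ece += mask.mean() * abs(avg_conf - avg_acc)
    return ece

# Toy example: a model that answers with ~80% confidence but is right
# only ~55% of the time is badly calibrated.
rng = np.random.default_rng(0)
conf = rng.uniform(0.7, 0.9, size=1000)
is_correct = rng.random(1000) < 0.55
print(f"ECE = {expected_calibration_error(conf, is_correct):.3f}")
```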

The proposed test serves as a valuable tool for analyzing models across a wide array of tasks and identifying crucial areas for improvement. By leveraging this test, researchers and developers can work towards creating models with enhanced language understanding and problem-solving skills, ultimately driving progress in the field of artificial intelligence.

HELM

This paper [17] introduces the Holistic Evaluation of Language Models (HELM), a novel approach designed to enhance the transparency of language models by providing a comprehensive assessment of their strengths, limitations, and potential risks. HELM employs a two-tiered methodology that comprises an abstract taxonomy of scenarios and metrics, along with a concrete set of implemented scenarios and metrics that emphasize coverage, value, and feasibility.

Figure 8: HELM

Figure 8 highlights the significance of the HELM taxonomy in comparison to traditional language model benchmarks such as SuperGLUE, EleutherAI LM Evaluation Harness, and BIG-Bench. These previous benchmarks rely on datasets with a standard task framing and a canonical metric, usually accuracy. HELM, on the other hand, employs a top-down strategy that explicitly identifies its evaluation goals (i.e., scenarios and metrics) using a well-organized structure. This approach facilitates informed choices regarding the implementation of specific scenarios and metrics, thus exposing potential gaps in coverage, such as languages other than English.

HELM assesses language models across 16 core scenarios and 7 metric categories. The core scenarios include 6 user-oriented tasks such as question answering, information retrieval, summarization, and toxicity detection, spanning a range of domains and English dialects. The 7 metric categories encompass accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency.
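One way to picture this design is as a grid with scenarios on one axis and metric categories on the other. The snippet below is purely illustrative: it does not use HELM’s actual toolkit, and the model name and score are made up; it only shows how results could be organized along the two axes HELM defines.

```python
# Hypothetical illustration of HELM's scenario-by-metric grid -- the scenario
# and metric names follow the paper, but the model and score are made up.
import pandas as pd

metrics = ["accuracy", "calibration", "robustness", "fairness",
           "bias", "toxicity", "efficiency"]
scenarios = ["question answering", "information retrieval",
             "summarization", "toxicity detection"]

# One row per (model, scenario), one column per metric category.
rows = [{"model": "model_a", "scenario": s, **{m: None for m in metrics}}
        for s in scenarios]
results = pd.DataFrame(rows).set_index(["model", "scenario"])

# Filling a single cell once an evaluation run has produced a score.
results.loc[("model_a", "summarization"), "accuracy"] = 0.42  # placeholder value
print(results)
```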

In addition, the study carries out 7 targeted evaluations via 26 supplementary scenarios to thoroughly investigate particular aspects, such as linguistic understanding, world and commonsense knowledge, reasoning abilities, memorization and copyright concerns, disinformation generation, biases, and toxicity generation.

The authors evaluate 30 prominent language models from 12 major organizations, including AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft/NVIDIA, OpenAI, Tsinghua University, and Yandex. This examination reveals the diverse performance levels of these models across various scenarios and metrics, underscoring the necessity for a more holistic evaluation approach.

With the goal of serving as a dynamic benchmark for the community, the HELM evaluation framework will be regularly updated to incorporate new scenarios, metrics, and models. The authors also release all raw model prompts and completions for further analysis, as well as a versatile modular toolkit that simplifies the process of adding new scenarios, models, metrics, and prompting strategies.

EleutherAI LM Eval

The evaluation of Large Language Models is crucial for understanding their capabilities and limitations, especially as they become central components in NLP and NLU pipelines. Evaluating LLMs, however, presents significant challenges due to their training on massive amounts of data and their ability to perform tasks in zero-shot, one-shot, and few-shot settings. A robust, standardized evaluation framework is essential for comparing and assessing the performance of these models.

The lm-eval package, released by EleutherAI, is a comprehensive evaluation framework for LLMs that supports popular models such as GPT-2, T5, GPT-J, GPT-Neo, GPT-NeoX, and Flan-T5. With a flexible and tokenization-agnostic interface, the library offers a single framework for evaluating and reporting auto-regressive language models on a wide range of Natural Language Understanding (NLU) tasks. The package currently contains over 200 evaluation tasks, covering capabilities such as question answering, summarization, sentiment analysis, machine translation, commonsense reasoning, and more.
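As a concrete (and hedged) example, the sketch below launches a small evaluation through the harness’s Python entry point. The backend identifier (`hf-causal` in older releases, `hf` in newer ones) and the exact keyword arguments of `simple_evaluate` vary between versions, so treat this as an illustration and check the README of the version you install.

```python
# Hedged sketch: running a small evaluation with EleutherAI's harness.
# Argument names differ across harness versions -- consult the installed
# version's documentation before running.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                # Hugging Face autoregressive model backend
    model_args="pretrained=gpt2",     # any causal LM available on the Hub
    tasks=["hellaswag", "arc_easy"],  # two of the 200+ available tasks
    num_fewshot=0,                    # zero-shot evaluation
    limit=100,                        # cap instances per task to keep the run short
)

for task, metrics in results["results"].items():
    print(task, metrics)
```

The same kind of run can also be driven from the harness’s command-line interface; again, the exact flags depend on the installed version.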

The framework allows for task development and customization, enabling users to create new tasks or modify existing ones to suit their specific evaluation needs. It also supports task versioning, ensuring that the evaluation results are reproducible and that any changes to task definitions are clearly indicated. Furthermore, the lm-eval package includes a decontamination tool that helps in measuring the impact of test set leakage by detecting contaminated test examples and producing a clean version of the benchmark.
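To illustrate what a task definition has to pin down, namely prompt rendering, the gold target, the scoring rule, and a version stamp for reproducibility, here is a deliberately simplified schematic. The class and function names below are hypothetical and do not correspond to the harness’s real task API.

```python
# Schematic only -- hypothetical names, not the harness's real base class.
# It illustrates the ingredients a task definition supplies: prompt rendering,
# gold target, scoring rule, and a version stamp for reproducibility.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ToyTaskSpec:
    name: str
    version: int                        # bump when the prompt or scoring changes
    doc_to_text: Callable[[Dict], str]  # render a raw example into a prompt
    doc_to_target: Callable[[Dict], str]
    metric: Callable[[str, str], float]

def exact_match(prediction: str, target: str) -> float:
    return float(prediction.strip().lower() == target.strip().lower())

sentiment_task = ToyTaskSpec(
    name="toy_sentiment",
    version=0,
    doc_to_text=lambda d: f"Review: {d['text']}\nSentiment:",
    doc_to_target=lambda d: d["label"],
    metric=exact_match,
)

# Evaluating a (stubbed) model on two toy documents.
docs = [{"text": "Great movie!", "label": "positive"},
        {"text": "Terrible plot.", "label": "negative"}]
fake_model = lambda prompt: " positive"   # stand-in for a real generate() call
scores = [sentiment_task.metric(fake_model(sentiment_task.doc_to_text(d)),
                                sentiment_task.doc_to_target(d)) for d in docs]
print(f"{sentiment_task.name} v{sentiment_task.version}: accuracy = {sum(scores)/len(scores):.2f}")
```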

Although automated benchmarks provide a scalable way to compare and measure the performance of language models, it is essential to acknowledge the importance of human evaluation in assessing aspects such as creativity, humor, and engagement.

In summary, the lm-eval package offers a robust and standardized method for evaluating LLMs, which is crucial for understanding their performance and guiding their development in areas where they still fall short.

Reference List:

[15] Hendrycks D, Burns C, Basart S, et al. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

[16] Zellers R, Holtzman A, Bisk Y, et al. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

[17] Liang P, Bommasani R, Lee T, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022.

This is a series of articles about “Evaluation of LLM”. Please stay tuned!
