Can LLM benchmark comparisons be trusted? Is LLM benchmark contamination 'cheating', or is it an overlooked detail?
LLMs are trained on vast amounts of human-produced data, which raises concerns about the trustworthiness of public benchmarks: test material can leak into pre-training or fine-tuning datasets.
Training a 13-billion-parameter LLM on 1.4 trillion tokens costs around $1.25 million on average, so accuracy and timelines are of immense importance. How do you choose the right test sets?
Two engaging papers published this week surface details and discuss solutions.
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica]
- Don't Make Your LLM an Evaluation Benchmark Cheater [Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han; School of Information, Renmin University of China; Gaoling School of Artificial Intelligence, Renmin University of China; University of Illinois Urbana-Champaign]
Join me as I delve into this intriguing topic and unpack what LLM performance benchmarks really tell us.
- LLM Benchmarking — 101
- Links to popular LLM benchmarks
- Links to LLM benchmark dataset sources
- Top 5 LLM benchmark dashboards
- LLM Benchmarking Contamination — 101
- Comparative Data from recently published studies
- Recommendations and suggestions based on recently published studies
1. LLM Benchmarking 101
Benchmarking is like testing a new car to see how well it drives. Testers drive the car on different roads and in different conditions and record its performance. They also test the car’s safety features and its fuel efficiency.
What? LLM benchmarking evaluates the performance of large language models (LLMs) on a set of predefined tasks. Metrics like accuracy, fluency, coherence, and context comprehension are used to compare the performance of different LLMs.
Why? LLMs are widely used in high-stakes fields like healthcare and finance. It’s crucial to understand their benchmarking performance and identify biases. For example, Microsoft, Amazon, and Google AI researchers use benchmarks to improve LLM performance and select the best LLMs for their users. Agencies use benchmarks to evaluate the performance and safety of LLMs before using them in mission-critical applications.
Who? Researchers, developers, and companies developing and using LLMs benchmark them. Benchmarking efforts include NIST's Text Analysis Conference (TAC) and suites such as SuperGLUE.
How? LLM benchmarking is typically done by developing a set of benchmark tasks and then evaluating the performance of LLMs on those tasks.
Tasks are designed to assess a variety of LLM capabilities (a minimal evaluation-loop sketch follows this list), such as:
- Language understanding: Can the LLM understand the meaning of text?
- Language generation: Can the LLM generate fluent, coherent, and informative text?
- Translation: Can the LLM translate text from one language to another accurately?
- Question answering: Can the LLM answer questions accurately and comprehensively?
- Code generation: Can the LLM generate code that is correct and efficient?
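To make the "How?" concrete, here is a minimal sketch of what a benchmark evaluation loop can look like. The `query_model` function and the two sample items are hypothetical placeholders, not part of any real benchmark; swap in your actual model API and test set.

```python
# A minimal sketch of a benchmark evaluation loop.
# `query_model` and the sample items are illustrative placeholders.

def query_model(prompt: str) -> str:
    """Placeholder: replace with a real call to the LLM under test."""
    raise NotImplementedError

benchmark = [
    {"question": "What gas do plants absorb during photosynthesis?",
     "answer": "carbon dioxide"},
    {"question": "How many planets are in the Solar System?",
     "answer": "8"},
]

def evaluate(items) -> float:
    """Score exact-match accuracy of the model over a list of QA items."""
    correct = 0
    for item in items:
        prediction = query_model(item["question"]).strip().lower()
        if prediction == item["answer"].lower():
            correct += 1
    return correct / len(items)

# accuracy = evaluate(benchmark)
# print(f"Exact-match accuracy: {accuracy:.2%}")
```

Real harnesses add prompt templates, few-shot examples, and task-specific scoring, but the structure is the same: run the model on held-out items and aggregate a metric.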
2. Links to popular benchmarks used for evaluating Large Language Models (LLMs):
- AI2 Reasoning Challenge (ARC): A set of grade-school science questions used to test LLMs such as OpenAI's GPT-4, Databricks' Dolly, and Meta's LLaMA 2.
- HellaSwag: This test of commonsense inference is easy for humans but challenging for state-of-the-art models.
- MMLU (5-shot): Measures a text model's multitask accuracy across 57 subjects spanning STEM, the humanities, and more.
- TruthfulQA: Evaluates whether a model's answers avoid reproducing common human falsehoods and misconceptions.
- HumanEval: A set of 164 programming problems created to evaluate the coding capabilities of large language models; results are usually reported with the pass@k metric (see the sketch after this list).
- MBPP (Mostly Basic Python Programming): A collection of roughly 1,000 crowd-sourced Python programming problems aimed at entry-level programmers.
- TriviaQA (1-shot): A reading-comprehension benchmark of trivia questions paired with supporting evidence documents, commonly reported in a one-shot setting.
- BIG-Bench Hard (Beyond the Imitation Game Benchmark, hard subset): A suite of especially challenging BIG-Bench tasks on which earlier language models did not beat the average human rater.
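Since HumanEval and MBPP report coding results with pass@k, here is a small sketch of the unbiased pass@k estimator described in the HumanEval paper. The sample counts in the example are purely illustrative.

```python
# Unbiased pass@k estimator: generate n samples per problem, count the c that
# pass the unit tests, and estimate the chance that at least one of k randomly
# chosen samples passes. Averaged over problems, this gives the benchmark score.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """pass@k = 1 - C(n - c, k) / C(n, k), with n samples and c correct ones."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples: some correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers: 200 samples for one problem, 37 passed the tests.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # much higher with 10 tries
```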
3. Links to sources where you can download datasets for Large Language Model (LLM) benchmarking:
- LLMDataHub: This GitHub repository provides a curated collection of datasets designed explicitly for chatbot training
- GPT-Fathom: Another GitHub repository provides an open-source and reproducible LLM evaluation suite.
- Open-LLMs: This is a list of open LLMs available for instruction-tuning and alignment-tuning
- LLMZoo: This repository provides data for training LLMs.
- List of Open-Sourced Fine-Tuned Large Language Models (LLM): This is a blog post that provides a list of open-sourced fine-tuned LLMs
4. Here are some of the top LLM benchmark dashboards:
- The Big Benchmarks Collection — an open-llm-leaderboard collection: This is a collection of various benchmark leaderboards on Hugging Face, including the Open LLM Leaderboard, MTEB Leaderboard, Chatbot Arena Leaderboard, LLM-Perf Leaderboard, Big Code Models Leaderboard, Open ASR Leaderboard, MT Bench, Toolbench Leaderboard, OpenCompass LLM Leaderboard, OpenCompass MMBench Leaderboard, and Open Ko-LLM Leaderboard.
- Open LLM Leaderboard — a Hugging Face Space by HuggingFaceH4: This is a leaderboard on Hugging Face that tracks, ranks, and evaluates open LLMs and chatbots.
- Large Language Models Leaderboard | Best LLM Models — Accubits: This leaderboard ranks LLMs based on their adoption and potential/capability.
- The ARC Benchmark: Evaluating LLMs’ Reasoning Abilities: This page provides information about the ARC benchmark and its leaderboard.
- LLM Leaderboards: How To Use Effectively — arize.com: This page provides information about how to use LLM leaderboards effectively.
Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”
5. LLM Benchmarking Contamination — 101
Here is a simple analogy to help you understand benchmark contamination:
Imagine that you will have a spelling test, and you know that the teacher will choose the words from a list of 100 words. If you study the list of 100 words before the test, you will have a much easier time passing the test. However, if other students cannot access the list of 100 words, the test will be unfair to them. Benchmark contamination is like studying the list of 100 words before the spelling test. It gives the LLM an unfair advantage and makes it difficult to compare it to other LLMs that do not have access to the same information.
Benchmark contamination is the presence of benchmark test data in a large language model's (LLM's) training or fine-tuning data. It can happen accidentally or intentionally, and either way it undermines the reliability of benchmark results.
When an LLM is trained on test data, it can learn to memorize the test examples and regurgitate them when prompted. This can lead to artificially high performance on the benchmark, even if the LLM does not understand the underlying task well.
6. Comparative Data from recently published studies
Benchmark contamination has been shown to be a problem for many different LLMs and benchmarks. For example, a recent study found that over 90% of the examples in the QuAC, SQuADv2, and DROP benchmarks were flagged as contaminated with GPT-3 pre-training data.
Recent studies have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets for large language models (LLMs). Most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, but these methods are found to be insufficient. Simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures.
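To see why string matching falls short, here is a minimal sketch of n-gram-overlap decontamination. The window size and the documents are illustrative (real pipelines typically hash n-grams over huge corpora): an exact copy of a test question is caught, while a simple paraphrase slips straight through.

```python
# Toy n-gram-overlap decontamination: flag a training document if it shares
# any word-level n-gram with a benchmark test example.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of word-level n-grams in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, test_example: str, n: int = 13) -> bool:
    """True if the training document shares at least one n-gram with the test example."""
    return bool(ngrams(train_doc, n) & ngrams(test_example, n))

test_q = "What is the tallest mountain in the world and how high is it?"
verbatim_copy = "Quiz: What is the tallest mountain in the world and how high is it?"
paraphrase = "Quiz: Which peak is the highest on Earth, and what is its elevation?"

print(is_contaminated(verbatim_copy, test_q, n=8))  # True: the exact copy is caught
print(is_contaminated(paraphrase, test_q, n=8))     # False: the rephrased copy slips through
```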
For example, a 13B model can easily overfit a test benchmark and achieve drastically inflated performance, on par with GPT-4, if such variations of test data are not eliminated. This has been observed in widely used benchmarks such as MMLU, GSM8K, and HumanEval.
To address this growing risk, stronger LLM-based decontamination methods have been proposed. Applied to widely used pre-training and fine-tuning datasets, they revealed significant, previously unknown test overlap. For instance, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, overlaps with 8–18% of the HumanEval benchmark were identified. Interestingly, such contamination was also found in synthetic datasets generated by GPT-3.5/4, suggesting a potential risk of unintentional contamination.
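As a rough sketch of the two-stage idea behind such LLM-based decontamination (shortlist semantically similar training items with embeddings, then let a strong LLM judge whether any is a rephrasing), here is some illustrative scaffolding. The `embed` and `llm_judge_is_rephrase` helpers are hypothetical placeholders for whatever embedding model and chat API you actually use.

```python
# Two-stage rephrase detection sketch: embedding shortlist + LLM judgment.
from typing import Callable, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def find_rephrased_contamination(
    test_example: str,
    train_items: list,
    embed: Callable[[str], Sequence[float]],          # hypothetical embedding model
    llm_judge_is_rephrase: Callable[[str, str], bool],  # hypothetical LLM judge call
    top_k: int = 5,
) -> list:
    """Return training items the LLM judge flags as rephrasings of the test example."""
    test_vec = embed(test_example)
    # Stage 1: cheap semantic shortlist by embedding similarity.
    ranked = sorted(train_items, key=lambda t: cosine(embed(t), test_vec), reverse=True)
    candidates = ranked[:top_k]
    # Stage 2: expensive LLM judgment only on the shortlisted candidates.
    return [t for t in candidates if llm_judge_is_rephrase(test_example, t)]
```

The point of the split is cost: embeddings cheaply catch paraphrases and translations that n-gram matching misses, and the expensive LLM call is reserved for a handful of candidates per test example.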
The community is urged to adopt stronger decontamination approaches when using public benchmarks. Moreover, the development of fresh one-time exams to evaluate models accurately is encouraged. A decontamination tool is publicly available for use.
Source credit: the two papers published this week that surface these details and discuss solutions.
- Rethinking Benchmark and Contamination for Language Models with Rephrased Samples [Shuo Yang, Wei-Lin Chiang, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica]
- Don't Make Your LLM an Evaluation Benchmark Cheater [Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and Jiawei Han; School of Information, Renmin University of China; Gaoling School of Artificial Intelligence, Renmin University of China; University of Illinois Urbana-Champaign]