ProLLM: LLM Benchmarks for Real-World Use-Cases
The true measure of a Large Language Model’s (LLM’s) capabilities lies beyond the realm of academic benchmarks. Consider the complexities of a customer support chatbot: it must first sift through and leverage a vast array of information relevant to a customer’s issue. Then, it needs to navigate a cohesive conversation over several rounds, remember past interactions, and dynamically adapt its responses to the unfolding dialogue — all in the service of effectively resolving customer issues. This dynamic setting demands far more than the ability to answer multiple-choice questions, a skill commonly evaluated by academic benchmarks such as MMLU. Additionally, these benchmarks, often compiled by students from various online sources, may include questionable content.
In response to the need for more accurate assessment of LLMs in complex, real-world interactions, we have started creating ProLLM benchmarks derived from the diverse use cases of Prosus Group companies, which span sectors ranging from EdTech to food delivery and marketplaces. Our aim is to test LLMs against the kind of interactions they encounter in everyday use, where they power real products for real customers, not just theoretical ‘classroom’ scenarios.
We built ProLLM benchmarks on four principles to ensure meaningful and applicable results:
- Usefulness: We create benchmarks directly from real use-case data and adopt meaningful metrics to measure how well models perform, providing actionable insights into their effectiveness.
- Relevance: Our benchmarks are the only ones that enable interactive exploration of LLM performance across multiple aspects, tailored to user interests, such as answering advanced debugging questions for JavaScript code.
- Inclusivity: Our benchmarks are inclusive and dynamic, covering a variety of applications across multiple sectors and languages and evolving with model capabilities and the requirements of use cases.
- Reliability: Our evaluation sets are not publicly disclosed, ensuring the benchmarks’ integrity, with mirror sets shared for insight and transparency.
Building on these principles, we benchmarked over 30 leading LLMs, including both proprietary and open-source models, across a variety of tasks. Our initial effort focused on the coding assistant use case, leveraging data from Stack Overflow. We are actively expanding our benchmarks using the diverse evaluation sets collected from our group’s companies. This blog provides a preview of our benchmarks. Our interactive benchmarks portal is also live and accessible to the public. Details of our benchmarks and evaluation strategy will be published soon in a paper, along with mirror datasets and evaluation scripts.
The Reality Check for LLM Benchmarks
The Mismatch
Current benchmarks for LLMs, such as Hugging Face’s LLM Leaderboard, focus mostly on text completion and multiple-choice question answering and are well-suited for demonstrating basic language understanding. However, to better meet the demands of real-world applications, there is a growing need for benchmarks that assess a model’s proficiency in more practical tasks, such as summarizing key points from a financial report. LMSYS Chatbot Arena offers an engaging way to evaluate LLMs through community-driven, randomized battles, using the Elo rating system for comparative analysis. This innovative approach highlights relative model strengths effectively. However, its focus on subjective assessments and comparative rankings may not adequately address specific real-world scenarios, limiting its practical applicability across diverse applications. Stanford’s HELM benchmark, while primarily featuring Q&A evaluations, takes a step in the right direction with its LegalBench tasks that more closely mirror the work of legal professionals. However, it only scratches the surface of potential applications. Beyond this, the broader landscape of industry-specific LLM tasks remains significantly underrepresented in benchmarking efforts. Moreover, the vast majority of benchmarks, including those for translation, are conducted in English.
In software development, benchmarks for evaluating LLMs are few and don’t quite capture the full spectrum of a programmer’s day-to-day tasks. Consider HumanEval, the widely used coding benchmark, which provides 164 Python coding exercises that are clean-cut and self-contained. These tasks miss the complexity of real-world software development, such as debugging and code optimization, where developers often work with a variety of programming frameworks and libraries. Compounding this issue is the prevalent focus on Python across these benchmarks. While Python is a popular choice, used by 49% of professionals in the latest Stack Overflow survey, these benchmarks do not reflect the industry’s reliance on a diverse array of programming languages. The mismatch between the current benchmarks and industry requirements can be attributed to several underlying factors:
- The rapid pace of LLM development has outstripped the evolution of benchmarks. Benchmarks, mostly created by researchers, often focus on general and more fundamental linguistic tasks that aren’t necessarily representative of complex, industry-specific problems.
- Initially, language models were primarily an academic pursuit. Only recently have industries significantly invested in leveraging these models, driven by the popularity and success of OpenAI’s ChatGPT.
- Evaluating LLMs with human experts is labor-intensive and not scalable, especially given the breadth of LLM tasks and the constant influx of new models. In response, the field relies on structured tests and surface-level metrics, often focused on literal text matching and multiple-choice answer selection, which do not fully capture the depth of real-world model performance.
- Creating high-quality, industry-specific evaluation datasets is a non-trivial and complex task, requiring extensive resources, the expertise of domain specialists, and data that accurately represent real-world scenarios while meeting strict quality and privacy standards.
Aligning Benchmarks with Business
Our mission is to redefine the evaluation of LLMs in a way that our benchmarks clearly show their practical usefulness for business in a reliable way. To achieve this, we have concentrated our efforts in the following areas.
Crafting Realistic Evaluation Sets
To create benchmarks that align with real-world demands, we’ve conducted a thorough analysis of how LLMs are used across Prosus’s vast portfolio. This portfolio comprises over 100 companies worldwide, including Stack Overflow, Udemy, iFood, Swiggy, and OLX. We discovered that, at a higher level, LLMs are commonly used for coding assistance, for gaining insights into complex documents, and as support agents for addressing work-related queries. At a lower level, LLMs are often leveraged as a convenient zero-shot machine learning resource for functions like extracting data insights through tagging, classification, and entity extraction. These common use cases have been prioritized in the creation of our evaluation sets to keep our benchmarks relevant for business.
Preventing Benchmark Data Leakage
One of the main hurdles with current public benchmarks is the unintentional leakage of their data into the training sets of LLMs. This data contamination significantly inflates measured performance, as demonstrated by the gap between coding assistance benchmarks run on historical versus recent Stack Overflow data, shown in Figure 2.
To address this, we’ve implemented a two-step approach for creating our evaluation sets. First, we keep our main evaluation sets private and use them to report our benchmark results. Additionally, we ensure that the LLMs used for automated evaluations are deployed on cloud platforms that give a zero-retention guarantee for our prompts.
Second, we will offer public mirror evaluation sets that closely match our private sets. These public sets allow LLM developers and other interested parties to test their models on evaluation sets that are similar to our private ones. We are also dedicated to regularly updating our evaluation sets to ensure their quality and relevance.
Aligning Evaluation Metrics
Over the past three years, our extensive deployment and testing of various LLM use-cases have consistently shown that standard, one-size-fits-all metrics are inadequate for capturing the specific needs and complexities of LLM applications. Each LLM task is unique, requiring customized evaluation criteria that align closely with its individual objectives. These customized metrics are essential for evaluating the model’s performance in line with user expectations.
To guide decision-makers effectively, these metrics must be intuitive and transparent, distinctly showcasing the model’s success in practical applications. For question-answering tasks, for example, the core metric we focus on is the Acceptance Rate. Inspired by the accepted answer concept on Stack Overflow, we refined this metric to evaluate answers on three important dimensions:
- Accuracy: Is the answer correct, and does it avoid significant inaccuracies that could undermine its validity?
- Completeness: Does the answer cover all essential points thoroughly?
- Relevance: Is the answer directly applicable to the core problem posed by the user?
An answer must score well across all these dimensions to be deemed acceptable. If an answer falls short in any of these areas, leading the user to continue searching for a solution, it is considered unacceptable.
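To make the acceptance criterion concrete, below is a minimal sketch of how an acceptance decision and the resulting Acceptance Rate could be computed from the three dimension scores. The 1-5 scale and the threshold are illustrative assumptions, not our exact rubric.

```python
from dataclasses import dataclass

# Illustrative rubric: each dimension is scored on a 1-5 scale, e.g. by an LLM
# judge or a domain expert. The threshold below is a hypothetical choice, not
# the exact cut-off used in the ProLLM pipeline.
ACCEPTANCE_THRESHOLD = 4

@dataclass
class AnswerScores:
    accuracy: int      # free of significant inaccuracies
    completeness: int  # covers all essential points
    relevance: int     # addresses the user's core problem

def is_acceptable(scores: AnswerScores, threshold: int = ACCEPTANCE_THRESHOLD) -> bool:
    """An answer is acceptable only if it scores well on every dimension."""
    return min(scores.accuracy, scores.completeness, scores.relevance) >= threshold

def acceptance_rate(all_scores: list[AnswerScores]) -> float:
    """Fraction of answers in an evaluation set that are deemed acceptable."""
    if not all_scores:
        return 0.0
    return sum(is_acceptable(s) for s in all_scores) / len(all_scores)

# Example: the second answer misses essential points, so only one of two passes.
print(acceptance_rate([AnswerScores(5, 4, 5), AnswerScores(5, 2, 4)]))  # 0.5
```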
For summarization tasks, the metric of Adherence to Instructions is particularly significant. It measures how well the LLM output aligns with the user’s specific instructions. For instance, if a user requests three key points, the LLM should ideally provide exactly three key points, assuming they are available in the source material. Additionally, the output must be accurate, ensuring that the content is grounded in the original text. The writing quality should also be high, maintaining clarity and readability throughout.
Such tailored metrics are designed to capture what truly matters to the users of these applications.
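As one illustration of an adherence check, the sketch below counts the key points in a generated summary and compares the count with what the user requested. The bullet-detection regex and the exact-match rule are hypothetical simplifications of a fuller rubric.

```python
import re

BULLET_PATTERN = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+\S")

def count_key_points(summary: str) -> int:
    """Count lines that look like key points (bulleted or numbered)."""
    return sum(bool(BULLET_PATTERN.match(line)) for line in summary.splitlines())

def adheres_to_point_count(summary: str, requested_points: int) -> bool:
    """One adherence check: did the model return exactly the number of key
    points the user asked for? Groundedness and writing quality would be
    judged separately in a fuller rubric."""
    return count_key_points(summary) == requested_points

summary = "- Revenue grew 12%\n- Margins narrowed\n- Guidance was raised"
print(adheres_to_point_count(summary, requested_points=3))  # True
```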
Automating the Evaluation Process
To scale up the LLM evaluations, we’ve developed an automated framework that leverages the technology itself. However, using LLMs as evaluators isn’t as simple as it seems. Even the most advanced model to date, GPT-4 Turbo (as of April 2024), may still fall short of domain experts when it comes to complex tasks like evaluating technical answers.
It is often said that to truly improve anything, one must begin by accurately measuring it. With this in mind, we developed an auto-evaluation benchmark using a set of 150 diverse questions gathered from Stack Overflow and our internal AI assistant Toqan, which has processed millions of queries since 2022. We then used different LLMs to generate answers to these questions, which were evaluated and scored by at least three domain experts according to a detailed rubric we defined. This process created a labeled dataset that allows us to accurately quantify how well LLMs perform when evaluating answers.
This auto-evaluation benchmark has been instrumental in testing our evaluation techniques and facilitating their improvement. Moreover, it guided us in establishing rules for creating evaluation sets that are fit for automated assessment, ensuring, for instance, that all questions are self-contained.
Initially, the highest accuracy achieved by our top models on this auto-evaluation benchmark was around 70%. After extensive experimentation and numerous refinements to our automated evaluation methods, we significantly increased the evaluation accuracy to above 85%. As illustrated in Figure 1, this level of accuracy now surpasses that of the average domain expert.
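Conceptually, quantifying an LLM judge against this expert-labeled dataset comes down to comparing its verdicts with the expert consensus. The sketch below assumes a hypothetical record layout with per-item expert votes and a binary accept/reject verdict, and uses majority voting as the consensus rule purely for illustration.

```python
from collections import Counter

# Hypothetical record layout: each item carries the verdicts of at least three
# domain experts plus the automated LLM judge's verdict ("accept" / "reject").
labeled_items = [
    {"expert_votes": ["accept", "accept", "reject"], "llm_judge": "accept"},
    {"expert_votes": ["reject", "reject", "reject"], "llm_judge": "accept"},
    {"expert_votes": ["accept", "accept", "accept"], "llm_judge": "accept"},
]

def majority_vote(votes: list[str]) -> str:
    """Collapse multiple expert verdicts into a single consensus label."""
    return Counter(votes).most_common(1)[0][0]

def judge_accuracy(items: list[dict]) -> float:
    """Share of items where the LLM judge agrees with the expert consensus."""
    hits = sum(item["llm_judge"] == majority_vote(item["expert_votes"]) for item in items)
    return hits / len(items)

print(f"auto-evaluator accuracy: {judge_accuracy(labeled_items):.0%}")  # 67%
```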
Benchmarking Across the Spectrum
We’ve established an automated evaluation pipeline that enables us to benchmark any new model on the day of its release. This ensures we keep pace with the rapidly expanding field, which includes commercial LLM providers like OpenAI, Anthropic, Mistral, and Google, as well as contributions from the open-source community, including specialized models for code and other domains.
Our initial focus lies on a set of coding and Q&A assistant evaluations designed to assist our product teams, both within Prosus and our portfolio companies, including Stack Overflow. We are actively working on expanding our scope of benchmarks in both quantity and diversity. Here’s a snapshot of selected LLM tasks under our evaluation:
Coding Assistant
Our Coding Assistant benchmark consists of 925 Stack Overflow questions asked over the last five years, covering 25 programming languages and four task types: debugging, implementation, optimization, and conceptual understanding. Each question is selected based on its positive score and multiple high-rated answers, including at least one accepted solution, ensuring reliable evaluations against verified solutions.
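To illustrate the selection criteria, a filter along the following lines could be applied to Stack Overflow question records; the field names follow Stack Exchange conventions, while the numeric cut-offs are hypothetical.

```python
def is_benchmark_candidate(question: dict) -> bool:
    """Apply the stated selection criteria to a Stack Overflow question record.

    Field names follow Stack Exchange conventions (score, answers, is_accepted);
    the numeric cut-offs (answer score >= 3, at least two high-rated answers)
    are illustrative, not the exact thresholds behind the benchmark.
    """
    answers = question.get("answers", [])
    high_rated = [a for a in answers if a.get("score", 0) >= 3]
    has_accepted = any(a.get("is_accepted") for a in answers)
    return question.get("score", 0) > 0 and len(high_rated) >= 2 and has_accepted

question = {
    "score": 12,
    "answers": [
        {"score": 15, "is_accepted": True},
        {"score": 6, "is_accepted": False},
    ],
}
print(is_benchmark_candidate(question))  # True
```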
The Coding Assistant — Recent benchmark includes 300 high-quality questions from the past six months (as of October 1, 2023), testing models’ adaptability to new, unseen queries. Significant performance gaps are evident, with GPT-4 Turbo’s accuracy dropping from 90% on the Coding Assistant benchmark to 64% on the Coding Assistant — Recent benchmark. To facilitate fair comparisons in the Coding Assistant — Recent benchmark, we applied a Version tag filter to exclude questions influenced by recent software updates. Even with these adjustments, GPT-4 Turbo’s performance stabilizes at 76%.
Additionally, it’s important to note that GPT-4, with a training cut-off in April 2023, experiences an even more significant performance drop in the Coding Assistant — Recent benchmark, falling from rank 6 to rank 13. This decline highlights the challenge models face with rapidly evolving content, such as recent Stack Overflow posts missing from their training data.
Q&A Assistant
We’ve developed an in-house AI assistant, Toqan, designed to act as an AI colleague supporting employees across the Prosus portfolio companies. Since its launch in 2022, Toqan has addressed over 1.1 million varied queries from thousands of users worldwide. Leveraging this extensive interaction data, we’ve carefully curated an evaluation set of 275 questions with permissive rights for benchmarking. This set includes a balanced mix of technical questions, primarily related to software engineering and IT, and non-technical questions akin to business consultancy. The Q&A Assistant Benchmark evaluates an LLM’s effectiveness as a team member in a business environment by assessing its ability to provide accurate and contextually relevant responses. To ensure a deeper evaluation, we’ve categorized the questions based on their complexity level and type, as seen in Figure 3. Each question in our evaluation set has a corresponding answer generated by GPT-4 Turbo that has received positive ratings and has been validated by domain experts.
Other Benchmarks
We have recently introduced a Summarization Benchmark that evaluates an LLM’s ability, combined with different long-text handling techniques, to accurately summarize and extract key insights from diverse sources such as YouTube video transcripts, websites, PDFs, and direct text inputs. This benchmark uses approximately 50 user-submitted queries to our Toqan assistant, testing a model’s proficiency in following detailed instructions to produce concise summaries and perform targeted information extraction tasks.
Additionally, we are preparing multiple evaluation sets for release, ranging from an SQL agent to data insights in non-English languages. Keep an eye on our ProLLM benchmarks portal for updates on their availability.
Benchmarking Portal
We are proud to provide ProLLM, our benchmarking portal, as a public and open-access platform, which delivers comprehensive insights into the performance of various LLMs — and soon, agents — across diverse real-world scenarios. It features interactive exploration tools that enable users to thoroughly investigate specific tasks and compare the effectiveness of different models. Users also have the opportunity to create customized benchmarks tailored to their precise evaluation needs. We are dedicated to consistently updating the portal with new benchmarks from both the Prosus ecosystem and external sources, ensuring it keeps pace with the rapidly evolving LLM landscape.
Why Interactivity in Benchmarks?
Interactive benchmarks offer a dynamic view into the nuanced performance of LLMs, revealing how their effectiveness can vary significantly across different tasks. As a sneak peek into our benchmark insights, we’ve observed that while GPT-4 Turbo consistently outperforms other models across many tasks, the degree of its superiority fluctuates.
For instance, in overall coding assistant tasks, GPT-4 Turbo stands out, performing considerably better than all other models. However, when it comes to answering C and C++ questions, the performance landscape changes. GPT-4 Turbo’s performance drops noticeably, and we also observe a shift in the leaderboard rankings, as seen in Figure 4. We see a similar shift in JavaScript debugging questions as well.
These insights are just a glimpse of the in-depth analyses available on our benchmarking portal. The site serves as a resource for anyone interested in the nuanced performance metrics of LLMs across a range of real-world scenarios. By providing a detailed view of how various models perform on different tasks, our benchmarks can guide users to choose the most suitable LLM for their specific needs.
Building Better LLM Benchmarks Together
We invite organizations worldwide to enhance the benchmarking landscape for LLMs by contributing your unique use cases. Stay informed with real-time performance updates each time a new model is released, allowing your business to effortlessly keep pace with AI advancements. This collaboration is not just a call for partnership; it’s a chance to join a growing AI ecosystem. Together, we can expand our collective knowledge and influence the development of LLMs across industries. Join us to share in this knowledge and gain recognition for your contributions to AI.
Acknowledgment
All the work presented in this blog — from research, data set curation and tagging, to inference and evaluation pipelines, prompts and interfaces — has been carried out by the Prosus AI Applied LLM team: Nidhish Shah, Doğu Aracı and Zülküf Genç.
We wish to extend our gratitude to our colleagues at Stack Overflow: Ellen Brandenberger, Michael Foree and his team. Their assistance with the creation of the Stack Overflow evaluation sets and their constructive feedback have been instrumental to our initiative. We are also grateful to Euro Beinat and Paul van der Boor for their valuable feedback and for reviewing this blog post. Their insights greatly improved the quality of our work. Lastly, we express our heartfelt appreciation to the team members at Prosus AI and Toqan. Their contributions to the labeling of our LLM-generated evaluations provided significant assistance to our project, for which we are truly grateful.
Stay Tuned
We are excited about the potential of our efforts to influence the future of LLMs and their real-world uses. If you’re intrigued by our benchmarks and wish to contribute, or if you have any questions or would like to share your feedback, don’t hesitate to contact us at zulkuf.genc@prosus.com. For updates on our continued work and new benchmarks, make sure to follow our blog and the ProLLM portal. Thank you!