An Evaluation Framework for AI Model Performance — OpenAI Evals

Dayanithi
Published in Artificial Corner
3 min read · Mar 28, 2023

A framework for creating and running benchmarks for evaluating models like GPT-4 while inspecting performance sample by sample.


Everyone talks about the amazing projects OpenAI ships, but for them to be amazing they have to be tested and evaluated before launch. For DALL·E or GPT to become what they are today, they had to be evaluated rigorously, so these products are only as good as the evaluation processes they have been put through. Imagine how good those metrics have to be.

Open-sourcing this framework will speed up fixes for issues that surface through benchmarks and evaluations. Evals is used mainly in the development of large language models, to identify shortcomings and prevent regressions. Because the code is open source, Evals also supports writing new classes to implement custom evaluation logic (a rough sketch of what that looks like follows the list below), and users can apply it to track performance across model versions and product integrations. A large number of startups and companies have adopted AI to improve their products, and with Evals and the GPT API they can integrate these models seamlessly. Some of those companies are:

  • Duolingo: uses GPT-4 to deepen conversations for language learners on its platform. The new features are available in Spanish and French for now and are set to expand to other languages soon.
  • Stripe: the financial services and SaaS company leverages GPT-4 to streamline the user experience and combat fraud.
  • Morgan Stanley Wealth Management: deploys GPT-4 to organize its vast knowledge base.
  • Be My Eyes: uses GPT-4 to transform visual accessibility. The company is developing a GPT-4-powered Virtual Volunteer™ that, using the model’s new visual input capability, can provide the same level of context and understanding as a human volunteer.

The government of Iceland is also using GPT-4 to help preserve its native language.
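Back to the framework itself: “writing new classes to implement custom evaluation logic” generally means subclassing the library’s base `Eval` class and implementing two methods, one that scores a single sample and one that runs the whole set and aggregates a metric. The sketch below is modeled on the custom-eval example in the openai/evals repo; the eval itself (arithmetic exact match) is made up for illustration, and helper names such as `completion_fn`, `record_and_check_match`, and `get_accuracy` should be checked against the version of the repo you are using.

```python
import random

import evals
import evals.metrics


class ArithmeticMatch(evals.Eval):
    """Illustrative custom eval: exact-match scoring of arithmetic answers."""

    def __init__(self, completion_fns, samples_jsonl, *args, **kwargs):
        super().__init__(completion_fns, *args, **kwargs)
        self.samples_jsonl = samples_jsonl  # path to the JSONL samples file

    def eval_sample(self, sample, rng: random.Random):
        # Score one sample: query the model under test and record whether
        # its completion matches the expected ("ideal") answer.
        result = self.completion_fn(prompt=sample["input"], max_tokens=16)
        sampled = result.get_completions()[0]
        evals.record_and_check_match(
            prompt=sample["input"],
            sampled=sampled,
            expected=sample["ideal"],
        )

    def run(self, recorder):
        # Run the whole eval: load samples, score each one, report accuracy.
        samples = evals.get_jsonl(self.samples_jsonl)
        self.eval_all_samples(recorder, samples)
        return {"accuracy": evals.metrics.get_accuracy(recorder.get_events("match"))}
```

A class like this is then registered in the repo’s YAML registry so it can be run by name, just like the built-in evals.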

OpenAI has also invited developers to use OpenAI Evals to test its models. This benefits OpenAI as well as its users: it improves the product and gives customers a better experience with better features. The company hopes Evals becomes a vehicle to share and crowdsource benchmarks, representing a maximally wide set of failure modes and difficult tasks.
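In practice, many contributed benchmarks do not need a custom class at all: they are just a JSONL file of samples plus a short registry entry that points at one of the built-in eval templates (for example, exact match). The snippet below sketches that sample format as used by the repo’s basic match evals; the benchmark content and file path are hypothetical.

```python
import json
import os

# Hypothetical samples for a contributed benchmark. Each JSONL line pairs a
# chat-formatted "input" with the "ideal" answer the model should produce,
# mirroring the sample format of the repo's basic match evals.
samples = [
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "What is 17 + 25?"},
        ],
        "ideal": "42",
    },
    {
        "input": [
            {"role": "system", "content": "Answer with a single number."},
            {"role": "user", "content": "What is 9 * 8?"},
        ],
        "ideal": "72",
    },
]

# Write one JSON object per line (JSONL); the path follows the repo's layout
# but is illustrative.
path = "evals/registry/data/arithmetic/samples.jsonl"
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")

# A short YAML entry under evals/registry/evals/ then maps an eval name
# (e.g. "arithmetic") to a built-in match class and this samples file.
```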

For example, OpenAI created a logic-puzzles evaluation containing 10 prompts where GPT-4 fails, along with several notebooks implementing academic benchmarks and a small subset of CoQA (A Conversational Question Answering Challenge).
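Registered evals like these can be run from the command line with the `oaieval` tool the repo installs. The invocation below is wrapped in Python only to keep these examples in one language; the eval name is the hypothetical one from the earlier sketch, and running against GPT-4 requires API access to that model.

```python
import subprocess

# Run a registered eval against a model using the repo's `oaieval` CLI
# (first argument: model / completion function, second: eval name).
# "arithmetic" is the hypothetical eval registered above; substitute a
# real eval name from the registry.
subprocess.run(["oaieval", "gpt-3.5-turbo", "arithmetic"], check=True)
```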

This matters because these LLMs are all fine-tuned using RLHF (reinforcement learning from human feedback), so contributions and benchmarking by humans play a vital role.

Unfortunately, OpenAI will not be paying contributors. However, the company plans to grant GPT-4 access to those who contribute high-quality benchmarks.

Artificial Corner’s Free ChatGPT Cheat Sheet

We’re offering a free cheat sheet to our readers. Join our newsletter with 20K+ people and get our free ChatGPT cheat sheet.
