Prompt Evaluation: Systematically testing and improving your Generative AI prompts at scale. Code included!

Justin Muller
7 min read · Apr 1, 2024

EDIT June 2024: Check out this updated blog for a more robust prompt evaluation code example!

As Generative AI workloads move to production, the forest of prompts, prompt templates, and prompt variations continues to grow. Like an overgrown jungle, this morass of prompts can quickly become impossible to navigate and maintain. Prompts are the lifeblood of these workloads; the quality of a prompt can make or break an AI system, so we need to be able to maintain and improve prompts at scale. In this blog, I present a method for automatically and systematically evaluating and improving prompts at scale, along with example code in a Jupyter Notebook on GitHub.
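To make the idea concrete before diving in, here is a minimal sketch of what automated prompt evaluation can look like: a prompt template is run against a small set of test cases, and each response is graded by a second model call acting as a judge. This is an illustration only, not the notebook's code; the `call_llm` helper and `TestCase` structure are hypothetical stand-ins for whatever model API and test format your workload already uses.

```python
# Minimal sketch of automated prompt evaluation using an LLM-as-judge pattern.
# `call_llm` is a hypothetical stand-in for your model provider's SDK call.

from dataclasses import dataclass


@dataclass
class TestCase:
    inputs: dict          # values substituted into the prompt template
    expectation: str      # plain-language description of a good answer


def call_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to your LLM and return its text response."""
    raise NotImplementedError("Wire this up to your model provider's SDK.")


JUDGE_TEMPLATE = """You are grading an AI assistant's response.
Expectation: {expectation}
Response: {response}
Reply with a single integer score from 1 (poor) to 5 (excellent)."""


def evaluate_prompt(prompt_template: str, test_cases: list[TestCase]) -> float:
    """Run a prompt template over test cases and return the average judge score."""
    scores = []
    for case in test_cases:
        # 1. Render the candidate prompt and get the model's response.
        response = call_llm(prompt_template.format(**case.inputs))
        # 2. Ask a second model call to grade the response against the expectation.
        verdict = call_llm(JUDGE_TEMPLATE.format(expectation=case.expectation,
                                                 response=response))
        scores.append(int(verdict.strip()))
    return sum(scores) / len(scores)
```

With a harness like this, comparing two prompt variants becomes a repeatable measurement (e.g., `evaluate_prompt(PROMPT_V1, test_cases)` vs. `evaluate_prompt(PROMPT_V2, test_cases)`) rather than a gut call.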

Where are we going wrong?

Prompts can grow over time for many reasons. As a system grows, parts of a prompt may be extended to support new use cases. If a new use case is different enough, we may branch into an entirely new copy of the prompt and maintain both in production. In addition, we often collect feedback from end users, which can require tweaks to a prompt to address problems. It can start to feel like a game of whack-a-mole: a user reports a problematic LLM response, an ML engineer tweaks the prompt to address the problem, and…
