Using One LLM to Judge Another? Here Are Five Reasons You Shouldn’t
Concrete ‘gotchas’ for people building things with Gen AI.
As teams across the world scramble to build things with Gen AI and Large Language Models, countless startups are racing to build LLM evaluation tools to serve them. Many of these tools use LLMs to judge the system’s output, whether the final results or just the LLM component’s responses. This makes sense for some stages of the product development process. But based on my experience building chat- and voicebots with LLM backends, I have serious concerns about the ‘ubiquitous utility’ of LLMs as judges. In this post, I’m sounding the alarm.
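For readers who haven’t met the pattern yet, here’s a minimal sketch of what ‘LLM as judge’ usually looks like in practice: one model produces a response, and a second prompt asks a model to grade that response against a rubric. This is an illustrative assumption about the general shape of such tools, not the API of any particular product; the model name, rubric wording, and `judge_response` helper are placeholders of my own.

```python
# Minimal sketch of the "LLM as judge" pattern (illustrative only).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment;
# the model name and rubric below are arbitrary placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a chatbot reply.
Question: {question}
Reply: {reply}

Score the reply from 1 (unusable) to 5 (excellent) for correctness and tone.
Answer with the score only."""

def judge_response(question: str, reply: str) -> int:
    """Ask a second LLM to grade another LLM's reply on a 1-5 scale."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, reply=reply),
        }],
        temperature=0,
    )
    return int(completion.choices[0].message.content.strip())

# Example: grade one canned exchange.
print(judge_response("What are your opening hours?",
                     "We're open 9am-5pm, Monday to Friday."))
```

Simple as it looks, every design choice in that sketch (the rubric, the scale, the judge model itself) is a place where the judgement can quietly go wrong, which is what the rest of this post is about.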
My goal is not to decry LLM-based evaluation tools altogether. Instead, I hope to get you thinking about all your evaluation options, rather than just the trendy and obvious ones. That means considering your needs pre- and post-deployment, understanding which metrics are useful at which stages, and working out how you can most efficiently access them.
Everything’s easy… in the beginning
Before I go further, let me be clear: I do believe LLMs as judges can be very useful in the early stages of product development, such as when you’re trying to envision and validate a new idea, or when you’re rapidly testing and iterating…