Data Science Collective

Advice, insights, and ideas from the Medium data science community


Steal My Idea: Evaluating LLM Systems with Production Data at Scale

11 min read · Feb 11, 2025


Building things with LLMs that need to work in the real world? This post is for you. Source: Author provided.

In my last post [1], I described how my team and I have been testing our WIP conversational assistant, despite having no baselines or benchmarks, and despite the LLM testing landscape being relatively immature. But there’s still a gap when it comes to testing complex LLM-based systems and conversations, and today I’m presenting a means to address it: an automated framework for evaluating multi-turn LLM interactions using real-world data.

The idea is that, if you can properly overwrite the “state” of a conversational AI system, you can simulate its behaviour in virtually any situation. Sounds simple, but the devil is in the (implementation) details, and those will depend on your application. This post is full of ideas for how you can build and use this new framework yourself.
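To make “overwriting the state” concrete, here is a deliberately minimal sketch of one way it could look for a chat-style assistant: replay the first n turns of a real production conversation as the message history, let the assistant generate its next reply, and score that reply. The names `Turn`, `llm_complete`, and `judge` are illustrative placeholders for your own transcript format, chat-completion call, and evaluation method, not a specific library API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Turn:
    """A single message from a stored production conversation."""
    role: str      # "user" or "assistant"
    content: str

def replay_to_turn(transcript: list[Turn], n: int) -> list[dict]:
    """Rebuild the assistant's 'state' from the first n turns of a real conversation."""
    return [{"role": t.role, "content": t.content} for t in transcript[:n]]

def evaluate_next_turn(
    transcript: list[Turn],
    n: int,
    system_prompt: str,
    llm_complete: Callable[[list[dict]], str],  # placeholder: your chat-completion call
    judge: Callable[[list[dict], str], float],  # placeholder: your scorer, e.g. LLM-as-judge
) -> float:
    """Drop the assistant into the middle of a real conversation and score its next reply."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += replay_to_turn(transcript, n)
    reply = llm_complete(messages)              # the behaviour under test
    return judge(messages, reply)
```

Run this over many (conversation, turn) pairs sampled from production and you get exactly the kind of multi-turn, realistic-data evaluation at scale this post is about; the hard part, as noted above, is the application-specific detail of what else belongs in the state you overwrite.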

The Final Testing Gap: Multi-Turn Conversations, with Realistic Data, at Scale

Let’s start with a little context. My team are currently building a conversational assistant — a customer service chatbot — for a large telco. It’s essentially a giant prompt describing our business processes and any information needed to execute them. At…




Written by Katherine Munro

Data Scientist, speaker, author, teacher. Follow me on Medium or Twitter (@KatherineAMunro) for resources on data science, AI, tech, ethics, and more.