TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

No Baseline? No Benchmarks? No Biggie! An Experimental Approach to Agile Chatbot Development

13 min read · Aug 26, 2024

A photo of me (Katherine Munro) on stage presenting this article as a talk. To watch or listen to the recording, click here.
Today’s post recaps my recent talk on lessons learned trying to bring LLM-based products to production. You can check out the video here.

What happens when you take a working chatbot that’s already serving thousands of customers a day in four different languages, and try to deliver an even better experience using Large Language Models? Good question.

It’s well known that evaluating and comparing LLMs is tricky. Benchmark datasets can be hard to come by, and metrics such as BLEU are imperfect. But those are largely academic concerns: How are industry data teams tackling these issues when incorporating LLMs into production projects?

In my work as a Conversational AI Engineer, I’m doing exactly that. And that’s how I ended up centre-stage at a recent data science conference, giving the (optimistically titled) talk, “No baseline? No benchmarks? No biggie!” Today’s post recaps that talk, covering:

  • The challenges of evaluating an evolving, LLM-powered PoC against a working chatbot
  • How we’re using different types of testing at different stages of the PoC-to-production process
  • Practical pros and cons of different test types

Published in TDS Archive

Written by Katherine Munro

Data Scientist, speaker, author, teacher. Follow me on Medium or Twitter (@KatherineAMunro) for resources on data science, AI, tech, ethics, and more.