
How to Evaluate LLM Applications with DeepEval — Part 1

Measuring Success in RAG-powered Chatbots

Gary Sharpe
The Model Observer
13 min read · Jan 21, 2025


Overview

In this article, I complete and critique the work illustrated in the ‘tutorial’ series for DeepEval provided by Confident AI. I’ll explore the provided medical chatbot (powered by my own OpenAI key) to demonstrate a replicable process for evaluating LLM-driven applications.

This approach provides a framework that is adaptable to a wide range of other use cases. The key steps involve:

  1. Defining Evaluation Criteria: Choose specific metrics or criteria relevant to your use case.
  2. Using Evaluation Tools: Utilize DeepEval to assess your system’s performance based on the chosen criteria.
  3. Iterating on Results: Refine prompts and model configurations iteratively to improve outcomes (a minimal code sketch of this workflow follows the list).
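
To make these steps concrete, here is a minimal sketch of a single DeepEval evaluation run. It assumes the deepeval package is installed and OPENAI_API_KEY is set in the environment; the question, answer, and retrieval context are placeholder values, and the exact API surface may differ slightly between DeepEval releases.

```python
# Minimal sketch: evaluate one chatbot turn with DeepEval.
# Assumes `pip install deepeval` and OPENAI_API_KEY in the environment.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# 1. Define evaluation criteria: relevancy and faithfulness of the answer.
metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

# 2. Wrap a single chatbot interaction as a test case.
#    The input/output/context values below are placeholders.
test_case = LLMTestCase(
    input="What are common side effects of ibuprofen?",
    actual_output="Common side effects include stomach upset, heartburn, and dizziness.",
    retrieval_context=[
        "Ibuprofen may cause stomach upset, heartburn, dizziness, and nausea."
    ],
)

# 3. Run the evaluation, inspect the scores, then iterate on prompts
#    or retrieval settings and re-run the same test cases.
evaluate(test_cases=[test_case], metrics=metrics)
```

Each metric produces a score between 0 and 1 along with a reason; if a test case falls below its threshold, you adjust the prompt or retrieval configuration and re-run the same test cases to check for improvement.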

NOTE: the complete Google Colab Notebook can be found here.

This article is published under my Repl:it series, in which I identify articles, new or old, that I want to ‘replicate’; a kind of Read, Evaluate, Print (interpret) & Loop. In these attempts I often make changes to suit my own architectural and development habits (e.g. using Terraform to build cloud resources, or Google Colab as the development and runtime platform). I provide a step-by-step (or play-by-play) report of what I did to replicate the article in focus, including missteps and errors encountered along the way.

