
How to Evaluate LLM Applications with DeepEval — Part 1

Measuring Success in RAG-powered Chatbots

Gary Sharpe
The Model Observer
13 min read · Jan 21, 2025


Overview

In this article, I complete and critique the work illustrated in the ‘tutorial’ series for DeepEval provided by Confident AI. I’ll explore the provided medical chatbot (powered by my own OpenAI key) to demonstrate a replicable process for evaluating LLM-driven applications.

This approach provides a framework that is adaptable to a wide range of other use cases. The key steps involve:

  1. Defining Evaluation Criteria: Choose specific metrics or criteria relevant to your use case.
  2. Using Evaluation Tools: Utilize DeepEval to assess your system’s performance based on the chosen criteria.
  3. Iterating on Results: Refine prompts and model configurations iteratively to improve outcomes (a minimal code sketch of this workflow follows the list).
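
To make these steps concrete, here is a minimal sketch of a single DeepEval evaluation run. It assumes the deepeval package is installed and OPENAI_API_KEY is set in the environment; the question, answer, and retrieval context are placeholder values, and the exact API surface may differ slightly between DeepEval releases.

```python
# Minimal sketch: evaluate one chatbot turn with DeepEval.
# Assumes `pip install deepeval` and OPENAI_API_KEY in the environment.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# 1. Define evaluation criteria: relevancy and faithfulness of the answer.
metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.7),
]

# 2. Wrap a single chatbot interaction as a test case.
#    The input/output/context values below are placeholders.
test_case = LLMTestCase(
    input="What are common side effects of ibuprofen?",
    actual_output="Common side effects include stomach upset, heartburn, and dizziness.",
    retrieval_context=[
        "Ibuprofen may cause stomach upset, heartburn, dizziness, and nausea."
    ],
)

# 3. Run the evaluation, inspect the scores, then iterate on prompts
#    or retrieval settings and re-run the same test cases.
evaluate(test_cases=[test_case], metrics=metrics)
```

Each metric produces a score between 0 and 1 along with a reason; if a test case falls below its threshold, you adjust the prompt or retrieval configuration and re-run the same test cases to check for improvement.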

NOTE: the complete Google Colab Notebook can be found here.

This article is published under my Repl:it series, in which I identify articles, new or old, that I want to ‘replicate’; a kind of Read, Evaluate, Print (interpret) & Loop. In these attempts I often make changes to suit my own architectural and development habits (e.g. using Terraform to build cloud resources, or Google Colab as the development and runtime platform). I provide a step-by-step (or play-by-play) report of what I did to replicate the article in focus, including missteps and errors encountered along the way.

