Generative AI Evaluation with Promptfoo: A Comprehensive Guide
In this blog post, we’ll explore how to use Promptfoo, a test framework designed to evaluate the output of generative AI models.
Background
In the process of developing generative AI features and services, you end up making various kinds of changes, such as:
- Changing the generative AI model (e.g., from GPT-4 to Gemini 1.5 Pro, etc.)
- Modifying and improving prompts
- Registering, updating, and adding data for RAG
These changes are made as needed to improve the functionality of generative AI, but even small changes can significantly alter the output.
For example, various regressions can occur:
- Switching to a newly released model resulted in lower output quality than the previous model
- Modifying a single line in the prompt caused unnecessary output to appear
- Changing the RAG data made it impossible to answer questions that could previously be answered
Manually checking for regressions with each change is labor-intensive and requires significant time and resources.
A quick survey turns up a variety of generative AI evaluation frameworks (I haven't checked them all).
Promptfoo
After trying out a few of these tools, Promptfoo, the subject of this article, struck me as user-friendly, so I'll introduce it here.
Here are some features that left a good impression:
- Can be completed locally
- Low learning cost (even non-engineers might be able to use it with some effort)
  - It's relatively easy to set up the environment, since you only need to install the standalone CLI
  - Tests can be described in YAML
  - Basic assert processing is built in
- Can test in a matrix like prompt × model × test case
  - Easy to compare prompts and models
- Surprisingly customizable
  - Assert processing can be written in JavaScript/Python
  - The target of the test can be flexibly changed with the provider mechanism
  - Although it is mainly intended for CLI use, it can also be called from the npm package (see the sketch right after this list)
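As a sketch of that last point, here is roughly what calling Promptfoo from Node.js could look like. This is a minimal example based on my reading of the npm package; the exact evaluate() signature may differ between versions, so treat it as an illustration rather than a reference:

// eval.mjs - a minimal sketch of calling Promptfoo from Node.js
// (assumes the `promptfoo` npm package is installed locally and
//  OPENAI_API_KEY is set; the evaluate() signature may vary by version)
import promptfoo from 'promptfoo';

const results = await promptfoo.evaluate({
  prompts: ['Write a tweet about {{topic}}'],
  providers: ['openai:gpt-4o-mini'],
  tests: [
    {
      vars: { topic: 'bananas' },
      assert: [{ type: 'icontains', value: 'banana' }],
    },
  ],
});

// Print the raw evaluation results as JSON
console.log(JSON.stringify(results, null, 2));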
Try Running It
The sample created for this article is uploaded to the GitHub repository below, so you can clone it and try it yourself.
Installation
Refer to Promptfoo — Installation.
- Create a directory
$ mkdir promptfoo-sample
$ cd promptfoo-sample
- Install the promptfoo CLI
# Recommended
$ npm install -g promptfoo
# or
$ brew install promptfoo
- Verify the installation
It is recommended to install the latest version via npm, but for this example, we will use brew.
$ which promptfoo
/opt/homebrew/bin/promptfoo
$ promptfoo --version
0.83.2
- Initialization (for now, select the most minimal options below)
  - What would you like to do?: 1 (Not sure yet)
  - Which model providers would you like to use?: Choose later
$ promptfoo init
? What would you like to do? 1
1) Not sure yet
2) Improve prompt and model performance
3) Improve RAG performance
4) Improve agent/chain of thought performance
5) Run a red team evaluation
? Which model providers would you like to use? (press enter to skip)
❯◉ Choose later
◯ [OpenAI] GPT 4o, GPT 4o-mini, GPT-3.5, ...
◯ [Anthropic] Claude Opus, Sonnet, Haiku, ...
◯ [HuggingFace] Llama, Phi, Gemma, ...
◯ Local Python script
◯ Local Javascript script
◯ Local executable
✅ Wrote promptfooconfig.yaml. Run `promptfoo eval` to get started!
At this point, the following two files should have been created:
- README.md
- promptfooconfig.yaml
Try Testing
As described in the newly created README.md, you can run the sample test in the following three steps:
- Set the OpenAI API key as an environment variable
$ export OPENAI_API_KEY=<your-openai-api-key>
- Run the evaluation with the following command:
$ promptfoo eval
- Check the test results in the Web GUI. The following command will automatically open the browser at http://localhost:15500/eval/:
$ promptfoo view --yes
Note: 12 cases are executed because the tests run as a matrix of [prompts (2 patterns)] × [providers (2 patterns)] × [test cases (3 patterns)].
Check the Test Contents
All tests are described in promptfooconfig.yaml.
description: "My eval"
prompts:
- "Write a tweet about {{topic}}"
- "Write a concise, funny tweet about {{topic}}"
providers:
- "openai:gpt-4o-mini"
- "openai:gpt-4o"
tests:
- vars:
topic: bananas
- vars:
topic: avocado toast
assert:
- type: icontains
value: avocado
- type: javascript
value: 1 / (output.length + 1)
- vars:
topic: new york city
assert:
- type: llm-rubric
value: ensure that the output is funny
Let’s go through the contents step by step.
- description
This is the overall description of the test. It will be displayed in the top left corner when viewed in the Web GUI.
description: "My eval"
- prompts
Specify the prompts to be passed to the generative AI. Here, two patterns of prompts are listed.
prompts:
  - "Write a tweet about {{topic}}"
  - "Write a concise, funny tweet about {{topic}}"
- providers
Specify the generative AI models to be used. Here, two patterns of models are listed.
Refer to Promptfoo — Providers for the providers that can be specified.
For OpenAI, all details are listed in Promptfoo — Providers > OpenAI.
providers:
  - "openai:gpt-4o-mini"
  - "openai:gpt-4o"
- tests
You can specify variables with vars and test items with assert.
There are many test items that can be specified, so please refer to Promptfoo — Assertions & metrics.
tests:
  - vars: # Execute with the variable `topic` set to `bananas` (output only, without assert)
      topic: bananas
  - vars: # Execute with the variable `topic` set to `avocado toast`
      topic: avocado toast
    assert:
      - type: icontains # Check that the test result contains the string `avocado`
        value: avocado
      - type: javascript # Custom checks can be done with `javascript` or `python`
        value: 1 / (output.length + 1)
  - vars: # Execute with the variable `topic` set to `new york city`
      topic: new york city
    assert:
      - type: llm-rubric # Check the output result with another generative AI model (default: gpt-4o)
        value: ensure that the output is funny
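When the built-in asserts are not enough, a javascript (or python) assert can also point at an external file via value: file://<path>. Below is a minimal sketch of what such a file could look like; custom_assert.js is a hypothetical file name, and the exact context fields are assumptions based on the Assertions & metrics documentation, where the function receives the model output plus a context and returns a pass/fail result:

// custom_assert.js - a minimal sketch of an external JavaScript assertion
// (hypothetical file, referenced from promptfooconfig.yaml with
//  `type: javascript` and `value: file://custom_assert.js`;
//  the exact context fields may differ between versions)
module.exports = (output, context) => {
  // `output` is the model's response text; `context.vars` holds the test variables
  const topic = String(context.vars.topic || '').toLowerCase();
  const mentionsTopic = output.toLowerCase().includes(topic);
  const fitsInTweet = output.length <= 280; // a tweet should fit in 280 characters

  // Returning an object lets you report pass/fail, a score, and a reason
  return {
    pass: mentionsTopic && fitsInTweet,
    score: fitsInTweet ? 1 / (output.length + 1) : 0,
    reason: mentionsTopic ? 'topic is mentioned' : 'topic is missing from the output',
  };
};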
Advanced Usage
There are various usage examples in the promptfoo GitHub repository, so please refer to them.
Additionally, the official documentation is extensive, so you will likely find what you want to do by browsing it.
Caution
Promptfoo is primarily designed to operate locally, but please be aware of the following points.
If you use the share button in the Web GUI or run promptfoo share, the test results will be shared publicly via Cloudflare KV hosted by Promptfoo (and stored there for two weeks).
It is generally recommended to avoid using this feature.
Apart from that explicit share action, Promptfoo operates locally and all data remains on your machine.
Conclusion
In conclusion, Promptfoo is a user-friendly framework for evaluating generative AI outputs. Its local operation, low learning curve, and customization options make it valuable for both developers and non-engineers. Matrix testing and various assertions ensure thorough evaluation of changes.
For more detailed examples and advanced configurations, refer to the Promptfoo GitHub repository and the official documentation.
Happy testing!