Part 1: Tech Stack Selection — How We Used LangSmith to Streamline Evaluation and Experimentation in LLM Product Development

Gaudiy Lab
Gaudiy Web3 and AI Lab
Jul 22, 2024

Posted by seya, LLM App Engineer of Gaudiy Inc.

We at Gaudiy have recently released langsmith-evaluation-helper, a library that enhances the experience of evaluations using LangSmith.

The main functionality is as follows: you write a config, a function that returns a prompt template (or a function that executes the LLM directly), and a function that runs the evaluation.

description: Testing evaluations
prompt:
  entry_function: toxic_example_prompts
providers:
  - id: TURBO
    config:
      temperature: 0.7
  - id: GEMINI_PRO
    config:
      temperature: 0.7
tests:
  dataset_name: Toxic Queries
  experiment_prefix: toxic_queries_multi_provider

Then, by executing this library’s command from the CLI as follows, LLM processes are executed, and the results can be confirmed on the LangSmith UI.

langsmith-evaluation-helper evaluate path/to/config.yml

(Screenshot: LangSmith experiment history page)
(Screenshot: LangSmith experiment result details page)

This system was created for the purpose of experiment management. In this article, I’d like to explain the background of creating this tool and other options we compared it with.

We Want an Experiment Management Environment!

First, let me explain the background of why we wanted an experiment management environment.

At Gaudiy, we’ve been focusing on product development using LLM since last year, but we were getting exhausted by our ad-hoc approach to prompt tuning.

Specifically, for a certain LLM-based feature, we were tuning it in an extremely straightforward manner:

  • Team members would extensively use the feature on the product and create a list of issues in Notion.
  • The person in charge of tuning would keep changing the prompt until those issues were resolved.

There were several problems with this approach:

  • The input patterns weren’t comprehensive
  • We weren’t keeping a history of experiments, so it was troublesome to revert to a previous version if we thought “wasn’t the previous one better?”
  • We couldn’t check for regressions
  • The process of “executing multiple inputs at once” or “repeatedly executing to confirm that problems don’t recur” was cumbersome
  • There was no agreement among stakeholders on what the final output should satisfy

As a result of these issues piling up, we were in a rather painful situation. So we thought, “We need to set up a proper system once, or we’ll be in the dark forever…”

Around that time, we learned about the concept of “experiment management” in the world of LLM App engineering, so we started to explore various tools in this area.

Considering Requirements

First, we organized the requirements we wanted to meet while researching tools. Broadly speaking, we thought the following would be necessary:

  • Make the deliverables related to tuning and evaluation a shared asset of the team that can be grown
    — Dataset (input and expected output)
    — Experiment results
    — Evaluation criteria
  • Make experimentation easier
    — Able to execute in parallel for multiple inputs
    — Able to repeatedly execute for the same input
    — Evaluation runs automatically for execution
    — Able to execute in parallel with multiple providers

Make the deliverables related to tuning and evaluation a shared asset of the team

The key to developing LLM-based features is growing the dataset and evaluation criteria. To check whether an LLM-based feature meets the quality demanded by the product requirements, we need evaluation criteria; and to check whether those criteria are met, we need to actually run the LLM, which means we need a dataset, that is, a list of inputs that covers those requirements.

And what’s important is the idea of “we have to grow these assets”. This is because it’s impossible for humans to create evaluation criteria and datasets that comprehensively verify all product requirements from the start.

In the process of experimenting, we might realize “oh, there’s this aspect too” as we see various outputs, and the evaluation criteria might change. It might also change after release as we see how it’s actually used.

As a side note, this shifting of evaluation criteria over the course of using the system is sometimes called Criteria Drift.

So we thought it would be nice if the “dataset that was ultimately used for quality assurance” and “the evaluation criteria at that time (preferably executable code)” could accumulate as a shared asset of the organization, rather than just remaining in someone’s local work environment.

Making Experimentation Easier

This is simply about work efficiency, but in the process of experimenting, we often want to execute various things in parallel.

  • Check the results of multiple models to see which has the best cost/accuracy balance for that task
  • Want to repeatedly execute the same input to see if issues have been resolved
  • Want to execute multiple inputs in parallel

And so on. It’s a simple matter, but in the process of tuning, it’s not uncommon to execute prompts dozens or even hundreds of times. So this was an aspect we wanted to satisfy because it’s something that can be solved with simple programming and the effects are quickly visible.

When things got tough, I sometimes wrote my own Python scripts for efficiency, but since solutions for this already exist, I thought we should investigate properly and set things up once.
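
For illustration, the kind of throwaway script I mean looked roughly like the sketch below. It is only a sketch: call_llm is a placeholder for whatever client you actually use, and the inputs are made up.

import concurrent.futures

def call_llm(prompt: str) -> str:
    # Placeholder: replace with your actual LLM client call
    ...

inputs = ["query A", "query B", "query C"]
REPEAT = 5  # run each input several times to confirm an issue doesn't recur

def run_once(text: str) -> str:
    return call_llm(f"Is this sentence toxic? {text}")

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_once, text) for text in inputs for _ in range(REPEAT)]
    results = [future.result() for future in futures]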

Requirements We Initially Considered but Dropped: Managing Prompts Outside of Code

As a side note, we initially thought it might be good to be able to manage prompts together, but we eventually dropped this from the initial requirements.

Currently, in our codebase, we manage prompt template strings within Python code and handle them by reading them as variables.

TOXIC_EXAMPLE_PROMPT = """
Given the following user query,
assess whether it contains toxic content.
Please provide a simple 'Toxic' or 'Not toxic'
response based on your assessment.
User content : {text}
"""
llm.invoke(TOXIC_EXAMPLE_PROMPT, **kwargs)

The issues we felt with this were:

  • Prompts are in a place that only engineers can touch
  • Changes to prompts need to be deployed each time to be reflected in the product

However, once we had an experiment management environment, the use case of “repeatedly checking behavior through the product UI” seemed to become less important, so we figured this issue would drop in priority.

As for non-engineers not being able to touch prompts, quite a few tasks need to be addressed to move prompts out of code management, and there didn’t seem to be a demand strong enough to outweigh those costs, so we judged it wasn’t a must-have requirement.

By the way, a few of the tasks that would need to be addressed include:

  • Need to ensure that the inputs required in the prompt template are passed from the code (otherwise it will result in errors)
  • Need to introduce concepts like separating environments (prd, stg, dev) for prompt management as well

However, it’s not an extremely heavy requirement, and being able to manage configs (like temperature and which model to use) together with templates would be nice, so we may well start managing prompts externally at some point in the future.
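
If we do go there, the first task above can be covered with a lightweight check that compares a template’s placeholders against the inputs the code actually passes. A minimal sketch (the function name is ours, and it assumes Python-style {variable} placeholders):

import string

def validate_prompt_inputs(template: str, inputs: dict) -> None:
    # Collect the placeholder names ({text}, etc.) that the template expects
    required = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    missing = required - inputs.keys()
    if missing:
        raise ValueError(f"Missing prompt inputs: {missing}")

validate_prompt_inputs("User content : {text}", {"text": "hello"})  # passes
# validate_prompt_inputs("User content : {text}", {})  # raises ValueError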

Tool Selection: Starting with promptfoo

We looked for tools that would meet the above requirements (and also considered making our own), and tried the following:

  • promptfoo
  • LangSmith
  • PromptLayer
  • MLFlow

There are really a lot of other tools too, but we couldn’t look at all of them, so we just tried these for now. Lists of observability tools, which often include experiment management functions as well, are worth a look if you’re interested.

To state the conclusion first, we proceeded as follows:

  • At first, we thought promptfoo was flexible and nice
  • However, there were a few points that were lacking, so we considered LangSmith to cover those
  • LangSmith seemed good, but there were some experiences in promptfoo that we missed, so we created a small library to fill that gap

By the way, we passed on MLFlow because its UI seemed a bit unrefined and its features seemed insufficient compared to the other options. PromptLayer was also well made, but compared to LangSmith and the others, its only significant differentiator seemed to be prompt template management, which, as mentioned earlier, we didn’t consider important this time; we had also already been using LangSmith as our logging infrastructure for a long time, so we passed on PromptLayer as well.

We first tried promptfoo. Other colleagues also happened to have tasks to experiment with, so when we had them try it, the reviews were good and we thought “Maybe promptfoo is good enough?”

Broadly speaking, it had the following advantages:

  • Can execute in parallel when a dataset is prepared
  • Many preset evaluation functions that can be executed
  • Can execute multiple models in parallel and repeatedly
  • Can output as CSV and share by issuing URLs

However, it also had the following disadvantages (this is information as of April 2024, so some parts might have been resolved by updates now 🙏):

  • Need to return a prompt template = LLM execution is entrusted to promptfoo
    — This was close to a deal-breaker, as we couldn’t use it to evaluate the output of multi-step setups like Chains or LangGraph graphs
  • When managing test datasets with CSV, the way to write evaluations is difficult
    — The way of writing by adding columns like `__expected1`, `__expected2`, `__expected3`
  • Sharing experiment results needs to be done manually
    — Need to output as CSV and upload to Notion or Slack
    — There is a share option, but then anyone with the link can see the results
    — There is a self-host option, but it’s a bit troublesome to prepare the deployment

So we felt that promptfoo would be fine for “simple prompts” and “tuning tasks that can stay closed within an individual’s environment”, but we wanted a solution that goes one step further!

On to LangSmith ~ Developing a Library ~

Promptfoo was quite good and had many of the options we wanted, but we looked into a mechanism built on LangSmith’s evaluate to see if it could resolve the pain points above.

For reference, in LangSmith you first create a dataset of inputs and expected outputs; this can be done through the LangSmith UI.
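
A dataset can also be created programmatically. Here is a minimal sketch with the LangSmith SDK; the dataset name and fields are illustrative and chosen to match the must_mention evaluator below.

from langsmith import Client

client = Client()

# Illustrative dataset: one example with an input and an expected output
dataset = client.create_dataset(dataset_name="QA Example Dataset")
client.create_examples(
    inputs=[{"question": "What is LangSmith?"}],
    outputs=[{"must_mention": ["LangSmith"]}],
    dataset_id=dataset.id,
)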

And execute by importing the `evaluate` function from the LangSmith SDK in your code.

from langsmith import Client
from langsmith.schemas import Run, Example
from langsmith.evaluation import evaluate

client = Client()

# LLM execution
def predict(inputs: dict) -> dict:
    messages = [{"role": "user", "content": inputs["question"]}]
    response = openai_client.chat.completions.create(messages=messages, model="gpt-3.5-turbo")
    return {"output": response}

# Evaluation function to assign scores
def must_mention(run: Run, example: Example) -> dict:
    prediction = run.outputs.get("output") or ""
    required = example.outputs.get("must_mention") or []
    score = all(phrase in prediction for phrase in required)
    return {"key": "must_mention", "score": score}

experiment_results = evaluate(
    predict,
    data=dataset_name,
    evaluators=[must_mention],
)

Quoted and edited code from https://docs.smith.langchain.com/old/evaluation/quickstart, removing parts that are not important for conceptual understanding (⚠️ This code won’t run as is)

Then we realized that all of the promptfoo problems we mentioned earlier could be solved with LangSmith’s evaluate!

  • Need to return a prompt template = LLM execution is entrusted to promptfoo
    — It’s just a function in LangSmith’s evaluate, so you can write anything. (You also get the associated logging for complicated processes, such as when using LangGraph; see the sketch after this list.)
  • When managing test datasets with CSV, the way to write evaluations is difficult
    — Again it’s just a function in LangSmith, so you can write anything
  • Sharing experiment results needs to be done manually
    — Can be shared by URL only to those who can log in to LangSmith
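
For instance, here is a minimal sketch of evaluating a multi-step target. The two-step pipeline is made up for illustration, but the point is that any Python callable works as the target, and each @traceable step shows up as a nested run in the experiment’s traces on LangSmith.

from langsmith import traceable
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

@traceable
def retrieve(question: str) -> str:
    return f"some context for: {question}"  # placeholder retrieval step

@traceable
def generate(question: str, context: str) -> str:
    return f"an answer based on: {context}"  # placeholder generation step

def predict_pipeline(inputs: dict) -> dict:
    context = retrieve(inputs["question"])
    return {"output": generate(inputs["question"], context)}

def not_empty(run: Run, example: Example) -> dict:
    # Trivial evaluator: did the pipeline return anything at all?
    return {"key": "not_empty", "score": bool((run.outputs or {}).get("output"))}

evaluate(
    predict_pipeline,
    data="QA Example Dataset",  # an existing LangSmith dataset (assumed)
    evaluators=[not_empty],
)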

So we found that these requirements could be cleared with LangSmith, but we still didn’t want to give up the various convenient options of promptfoo…

So we created this library that utilizes LangSmith’s evaluate while providing a nice interface like promptfoo for executing prompts and processes.

As mentioned at the beginning, it allows you to write a config, a function to pass the prompt template, and a function to execute the evaluation, and then execute.

▼ Config

description: Testing evaluations
prompt:
  entry_function: toxic_example_prompts
providers:
  - id: TURBO
    config:
      temperature: 0.7
  - id: GEMINI_PRO
    config:
      temperature: 0.7
tests:
  dataset_name: Toxic Queries
  experiment_prefix: toxic_queries_multi_provider

▼ Function to return prompt template

TOXIC_EXAMPLE_PROMPT = """
Given the following user query,
assess whether it contains toxic content.
Please provide a simple 'Toxic' or 'Not toxic'
response based on your assessment.
User content : {text}
"""

def toxic_example_prompts() -> str:
    return TOXIC_EXAMPLE_PROMPT

It’s also possible to execute arbitrary functions instead of prompts. (The ID of the LLM Provider defined in the config is passed as an argument to the function)

from langsmith_evaluation_helper.schema import Provider
from langsmith_evaluation_helper.llm.model import ChatModel, ChatModelName
from langchain.prompts import PromptTemplate

def custom_run_example(inputs: dict, provider: Provider) -> str:
    # replace with your favorite way of calling LLM or RAG or anything!
    id = provider.get("id")
    if id is None:
        raise ValueError("Provider ID is required.")
    llm = ChatModel(default_model_name=ChatModelName[id])
    prompt_template = PromptTemplate(
        input_variables=["text"], template="Is this sentence toxic? {text}."
    )
    messages = prompt_template.format(**inputs)
    formatted_messages = PromptTemplate.from_template(messages)
    result = llm.invoke(formatted_messages)
    return result

▼ Function to execute evaluation

from typing import Any
from langsmith.schemas import Example, Run

def correct_label(run: Run, example: Example) -> dict:
    score = ...  # replace with your evaluation logic
    return {"score": score}

evaluators: list[Any] = [correct_label]
summary_evaluators: list[Any] = []
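
For example, correct_label could be as simple as comparing the run output against an expected label stored in the dataset. A minimal sketch, assuming each example stores the expected answer under an expected_label field (adjust to your dataset’s schema):

from langsmith.schemas import Example, Run

def correct_label_example(run: Run, example: Example) -> dict:
    # Assumes the dataset stores the expected answer under "expected_label"
    prediction = str((run.outputs or {}).get("output", ""))
    expected = str((example.outputs or {}).get("expected_label", ""))
    return {"key": "correct_label", "score": prediction.strip() == expected.strip()}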

By preparing these and specifying the path to the config you want to execute, LangSmith’s evaluate is executed using the defined prompts or functions for the dataset when you run the following command:

langsmith-evaluation-helper evaluate path/to/config.yml

With this, we can now get the various benefits of LangSmith while being able to execute prompts and functions at once by just writing the information specific to each prompt case, with an interface like promptfoo. Hooray!

We’ve made it open source, so please use it if you’re utilizing LangSmith!

Also, my colleague namicky has written an article about the more practical experience of actually using this for prompt tuning. Even with tools in place, how you create datasets and conduct evaluations is very important when putting LLMs into products, so please read that as well.

Conclusion

So, that’s the story of our technology selection for the experiment management environment and creating a library.

While we want to properly accumulate datasets and evaluations as assets, we think the tooling for an experiment management environment is something you can swap out later. So for your first LLM product development, it might be good to quickly pick something that seems reasonable, use it, and get a feel for it.

If you’re going to use LangSmith, please try our library as well!
That’s all! 👋
