Part 2: Process — How We Used LangSmith to Streamline Evaluation and Experimentation in LLM Product Development

Gaudiy Lab
Gaudiy Web3 and AI Lab
13 min read · Jul 22, 2024

Posted by Wataru Namiki, Software Engineer at Gaudiy Inc.

Over the past year or two, excitement around generative AI and LLMs has accelerated significantly, and many people are focusing on how to use them to deliver new value.

Our company, Gaudiy, saw potential in this field relatively early and has been actively taking on the challenge. As we develop LLM products, we keep refining our processes based on the knowledge we accumulate from the issues that come up.

This time, I’d like to focus on the process of prompt tuning, which is inseparable from development in this field. I’ll introduce how we’re improving efficiency and resolving common issues, along with some use cases.

Note: This article is the second part of a two-part series, covering the “Process Implementation” section. I hope you’ll also read the first part by seya, which covers the “Tech Selection/OSS” section.

What we build — Fanlink

At Gaudiy, we provide “Fanlink,” a community platform for multiple IPs (intellectual properties).

Although it’s still in beta and we can’t share many details, we’re currently developing and iterating on a feature that allows users to chat with AI avatars of specific characters (such as actual celebrities or anime characters) on Fanlink, in collaboration with specific IP holders. We call this feature “AI Talk.”

For this feature, we use an LLM to generate the AI's messages. The LLM is an essential component, and it's no exaggeration to say that it largely determines the quality of the feature.

Let me briefly introduce the current AI Talk feature:

Users can send messages to AI avatars (hereafter referred to as character AI) that behave like specific characters on Fanlink. The character AI retains information about its unique characteristics and memories, and is always prompted to behave appropriately in response to user messages and the flow of conversation.

In addition to text messages, we also offer supplementary experiences such as giving digital item gifts. We’re also considering features that allow multimodal inputs like voice in the future.

Prompts include not only replies to user messages but also various scenarios such as initiating conversation topics, self-introductions, and ending conversation flows. We use different prompts in our logic depending on various contexts. This allows us to create a mechanism that can behave more naturally from the start to the end of a conversation segment, or seamlessly transition to the next topic.

Next, let me briefly introduce the architecture:

When a user sends a message, we enqueue a task in the reply-generation queue (using Cloud Pub/Sub). When processing begins, we retrieve the character's feature information from a DB, keyed by a unique ID that identifies which character AI the message is addressed to.
Note: The character AI itself is structured so that multiple individuals can be built separately, and we’re currently operating and verifying two character AIs.

Next, we run a vector search against the Vector Store (using Vertex AI) to retrieve information semantically close to the user's message, drawn from the character AI's pre-stored memory data and from summaries of past conversations between the user and the character AI.
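
As a rough illustration, the retrieval step can look like the sketch below when accessed through a generic LangChain VectorStore interface; the function name and the metadata filter here are illustrative assumptions, not our actual implementation.

from langchain_core.vectorstores import VectorStore

# Illustrative sketch: `memory_store` stands in for our Vertex AI-backed store;
# any LangChain-compatible VectorStore exposes the same search interface.
def retrieve_related_memories(
    memory_store: VectorStore, user_message: str, character_id: str
) -> list[str]:
    # Similarity search over pre-stored memories and conversation summaries.
    # (Whether metadata filters are supported depends on the concrete store.)
    docs = memory_store.similarity_search(
        user_message,
        k=4,
        filter={"character_id": character_id},
    )
    return [doc.page_content for doc in docs]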

Then, based on context such as whether the conversation is just starting or how many exchanges have taken place, we determine which prompt template to use. From the data gathered above, such as character information and related memories, we extract only the inputs that template needs, and generate the message by invoking the LLM through the LangChain interface.

This approach allows us to realize the behavior of multiple character AIs with a single prompt template.
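
As a rough illustration (not our actual production template), a parameterized template along these lines lets one prompt serve any character AI simply by swapping the injected values; the variable names mirror the inputs that appear later in this article.

from langchain_core.prompts import ChatPromptTemplate

# Illustrative template only: the real production templates are more detailed.
reply_prompt = ChatPromptTemplate.from_messages([
    (
        "system",
        "You are {name}. Profile: {bio}\n"
        "Always speak and behave according to the following attitude: {attitude}\n"
        "Related memories:\n{related_memories}",
    ),
    ("human", "{last_message}"),
])

# The same template serves any character AI; only the injected values change.
messages = reply_prompt.format_messages(
    name="Character A",
    bio="...",
    attitude="...",
    related_memories="...",
    last_message="Hello!",
)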

While very simple tasks are composed of single prompts, there are also patterns where we chain-execute prompts in multiple steps using LangGraph for more complex context recognition and cost reduction (we’ll omit the detailed technical content here).

These LLM executions can be traced and viewed in the GUI provided by LangChain's LangSmith, which makes it much easier to debug when hallucinations or other issues occur.
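
For reference, enabling this tracing for LangChain code generally only requires a few environment variables; the project name below is a hypothetical example.

import os

# LangSmith tracing for LangChain code is switched on via environment variables.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "fanlink-ai-talk"  # hypothetical project name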

Challenges Faced in Prompt Tuning Tasks

As mentioned earlier, the character AI talk feature relies on multiple prompts, and tuning each of them to improve its accuracy was very important.

However, we hadn't established an efficient way to tune prompts, and ad-hoc approaches were causing problems such as ballooning time costs.

The biggest challenge was that the only common language between the engineers in charge of tuning and the stakeholders (our Biz members) was qualitative feedback on the application's final behavior. Engineers then bore the heavy cost of breaking those abstract requirements down and analyzing them at the implementation level.

Previously, once a set of features was fully implemented, our Biz members, who were well-versed in the AI's characteristics and in close coordination with the IP holder, essentially acted as the stakeholders and were asked to check its quality.
The result of the quality check was a qualitative list of items drawn from a seemingly endless number of perspectives, such as whether the output matched the characteristics of the underlying character AI, including its speech patterns.

The quality-check results themselves were valid, but the burden on engineers of interpreting each of these qualitative perspectives, prioritizing them when necessary, and implementing them all in prompts was very high.

Because the completion criteria were hard to see, the work sometimes simply fizzled out, or development proceeded without satisfactory tuning. At times this even led to unfounded conclusions, such as blaming a perceived drop in quality on switching to a cheaper LLM model.

Prompt tuning also tends to be a game of whack-a-mole: fixing one behavior breaks another. The communication cost of repeatedly asking which requirements ultimately had to be treated as must-haves was significant as well.

We also used LangSmith's Playground to check and tune prompt behavior, but it left no history and every trial had to be run manually, which was very costly.

This approach was also very costly when comparing LLM models to determine which was optimal for a given prompt.

Improving Efficiency in Prompt Tuning

Planning Phase

Although an AI team had been formed and we were gaining more members familiar with this field, the AI team was mainly focused on R&D and couldn’t approach these messy process issues. As a member of the product development team, I felt there was a big issue here, so I thought it would be good if we could at least achieve the following to resolve this:

  • Create a dataset in some kind of document with multiple expected inputs and examples of correct answers as expected values
  • Prepare an environment that can automatically execute LLM and generate execution results based on the prepared dataset
  • Prepare persistent data that can serve as a common language for tuning personnel (engineers) and stakeholders (Biz) to discuss improvements based on execution results
  • Easily switch to any LLM model and compare results

Implementation Phase

First, we decided to create a dataset of multiple expected inputs and correct cases in a Google Spreadsheet.
We chose spreadsheets because they let multiple people edit online in real time and make it easy to propagate changes to common definitions across the entire dataset via sheet and cell references.

We structured the spreadsheet as follows:

  • A common definition sheet that duplicates DB data to make it easier to tune the character AI’s feature information itself
  • Sheets (multiple) for each prompt, summarizing the necessary inputs and outputs for each case in one row, created for as many patterns as needed

We set up select boxes on the sheets for each prompt to set which character AI the tuning case is for, and used formulas in cell values to auto-fill dynamically from the common definition sheet side.

We asked Biz members to create other necessary user message inputs and expected output values for each case to cover all patterns.

Next came making LLM execution and result extraction more efficient. We had almost no technical knowledge in this area and were about to start digging in when the AI team volunteered to collaborate. After re-syncing on the problems and the end state we wanted to reach, they proposed a mechanism that batch-executes each dataset and persists the verification results on LangSmith, using LangSmith's Datasets & Testing feature!

Since LangChain publishes APIs for working with LangSmith's Datasets & Testing feature, we were able to meet our requirements by wrapping them for our use case.
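
Roughly speaking, the Helper wraps LangSmith SDK calls along these lines (a simplified sketch, not the Helper's actual implementation; the example values are placeholders).

from langsmith import Client

client = Client()

# Create the Dataset (or look it up with client.read_dataset if it already exists)
dataset = client.create_dataset(dataset_name="Test Dataset")

# Register Examples: inputs are the prompt variables, outputs hold the expected
# reply used as ground truth.
client.create_examples(
    inputs=[{"name": "Character A", "bio": "...", "attitude": "...", "last_message": "Hello!"}],
    outputs=[{"output": "Expected reply text"}],
    dataset_id=dataset.id,
)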

*This Helper is currently also published as an OSS library. For more details, please read the repository mentioned below or seya’s article introduced at the beginning.

From here, I'll briefly walk through the overall flow and show part of the implementation for running prompt evaluation efficiently, along with how to use the Helper (please note that it may differ slightly from the current Helper OSS interface).

First, we implement and execute a Python script (dataset.py) that reads the information from the Google Spreadsheet, converts it into the format of LangSmith Dataset Examples, and registers it.

For reading, we used the Google Sheets API on GCP, which Gaudiy already uses as its infrastructure.
To call the Sheets API from the script, we granted viewer permission to a previously created service account via the spreadsheet's sharing settings (note that permissions are assigned differently here than in IAM).

We map and format the rows we read into dictionaries, then call the Helper to register or overwrite the Examples of the target Dataset on LangSmith.

import asyncio
import os
from typing import Any

from google.oauth2 import service_account
from googleapiclient.discovery import build

# `create_examples` comes from the Helper library and `TestPromptInput` is our
# own Pydantic model for the prompt's inputs (project-specific imports omitted).

# LangSmith Dataset name
LANGSMITH_DATASET_NAME = "Test Dataset"
# Google Spreadsheet ID
SPREADSHEET_ID = "hogehoge"
# Google Spreadsheet sheet name
SHEET_RANGE = "test_prompt_sheet"
# Path to the service account JSON for the Google Sheets API
SERVICE_ACCOUNT_KEY_PATH = os.getenv("SERVICE_ACCOUNT_KEY_PATH")


async def dataset() -> None:
    # Fetch the spreadsheet values
    values = get_sheet_values(SHEET_RANGE)

    # Map rows into input data
    header_row = values[0]
    header_columns_count = len(header_row)
    data_rows = values[1:]

    dict_list: list[dict[str, Any]] = []

    for row in data_rows:
        # Pad short rows, then map header names to cell values
        row = row + [""] * (header_columns_count - len(row))
        row_dict = dict(zip(header_row, row))

        name = row_dict.get("name", "")
        bio = row_dict.get("bio", "")
        attitude = row_dict.get("attitude", "")
        last_message = row_dict.get("last_message", "")
        output = row_dict.get("output", "")

        dict_list.append({
            **TestPromptInput(
                name=name,
                bio=bio,
                attitude=attitude,
                last_message=last_message,
            ).model_dump(),
            "output": output,
        })

    # Call the Helper library to register the Examples on LangSmith
    create_examples(dict_list, dataset_name=LANGSMITH_DATASET_NAME)


def get_sheet_values(sheet_range: str) -> list[Any]:
    credentials = service_account.Credentials.from_service_account_file(
        SERVICE_ACCOUNT_KEY_PATH,
        scopes=["https://www.googleapis.com/auth/spreadsheets.readonly"],
    )

    service = build("sheets", "v4", credentials=credentials)

    sheet = service.spreadsheets()
    result = sheet.values().get(spreadsheetId=SPREADSHEET_ID, range=sheet_range).execute()
    values = result.get("values")

    return values


if __name__ == "__main__":
    asyncio.run(dataset())

By executing this script (python dataset.py), we were able to reflect all the inputs and outputs defined in the spreadsheet into the Examples of the target Dataset.

Next, in a YAML file (config.yaml) that defines the config the Helper needs for an Experiment, we specify the Dataset, the LLM models, the built-in check functions we want to use, and so on.

description: test prompt

prompt:
  name: prompt.py
  type: python
  entry_function: prompt

evaluators_file_path: evaluations.py

providers:
  - id: TURBO

tests:
  type: langsmith_db
  dataset_name: Test Dataset
  experiment_prefix: test_
By the way, if you want to apply custom checks, you can easily run that evaluation by writing evaluation logic in evaluations.py and specifying it in the config.

For example, for prompts that return scores that may be positive or negative, we implemented and used a custom check function that judges whether the sign of the score is correct.

from langsmith.schemas import Example, Run


def same_sign(run: Run, example: Example) -> dict[str, bool]:
    if run.outputs is None or example.outputs is None or run.outputs.get("output") is None:
        return {"score": False}

    example_output = example.outputs.get("output")
    if example_output is None:
        raise ValueError("example.outputs is None")

    example_score = int(example_output)
    output_score = int(run.outputs["output"].score)

    # A score of 0 on either side requires an exact match
    if example_score == 0 or output_score == 0:
        return {"score": example_score == output_score}

    # Otherwise, compare only the signs
    example_sign = example_score // abs(example_score)
    output_sign = output_score // abs(output_score)
    return {"score": example_sign == output_sign}

This wasn't in the initial requirements, but we learned that custom functions like this can be plugged into Experiments.

The scores produced by these check functions can be displayed visually as Charts on LangSmith's Experiments, which also made them very easy to read as evaluation indicators.

Finally, we define the prompt template to be executed in a separate file (prompt.py), referencing the actual production-side prompt template definition. (It's also possible to write the prompt text directly in this file.)

from typing import Any

from langsmith import traceable

# TestPromptTemplate / TestPromptInput are our own prompt and input models
# (project-specific imports omitted)


@traceable
async def prompt(inputs: dict[str, Any]) -> TestPromptTemplate:
    return TestPromptTemplate(
        input=TestPromptInput(
            name=inputs.get("name", ""),
            bio=inputs.get("bio", ""),
            attitude=inputs.get("attitude", ""),
        )
    )

Then, when you run the alias command that points at the config, the LLM is executed once for each entry in the registered Dataset, and once execution is complete, all the results are saved to Experiments on LangSmith.

make evaluate <path/to/config.yml>
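
Under the hood, this boils down to something close to LangSmith's evaluate API. The sketch below is only an approximation of what the Helper drives: run_prompt is a hypothetical synchronous wrapper around the prompt and LLM call, not the Helper's real interface.

from typing import Any

from langsmith.evaluation import evaluate

from evaluations import same_sign

# Hypothetical stand-in for the Helper's internal target function: it receives
# one Example's inputs and must return outputs in the shape the evaluators expect.
def run_prompt(inputs: dict[str, Any]) -> dict[str, Any]:
    ...  # build the prompt from inputs and invoke the LLM here
    return {"output": "generated reply"}

evaluate(
    run_prompt,
    data="Test Dataset",        # Dataset name from config.yaml
    evaluators=[same_sign],     # custom check defined in evaluations.py
    experiment_prefix="test_",  # Experiment prefix from config.yaml
)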

Each of these Experiment results has its own URL, so sharing that link made it easy for the engineers doing the tuning and the stakeholders to discuss the results and revision policies 🎉

Operation Phase

As mentioned earlier, the actual prompt tuning task involved multiple prompts, so we divided the work, including the implementation above, among the engineers (basically one person per prompt).

For each prompt, we summarized in Notion what logic we used to correct behavior that was clearly wrong against the expected values and that engineers could tune on their own, so that we could track the tuning history chronologically.

Then, once we got reasonably good results, we shared the resulting LangSmith Experiments URL with Biz members for review, continued making corrections based on their feedback, and wrapped up when no problems remained.

Results and Considerations

Previously, we were in the unfavorable situation where engineers and PdMs familiar with LLM development would be stuck on prompt tuning for about two weeks, repeatedly reviewing and revising prompt improvements, or the tuning itself was sometimes neglected.

With these efficiency improvements, the Biz-led data creation in spreadsheets took a few days, the engineers' implementation of the Datasets and evaluation scripts took half a day to a day, and the tuning and review took a few more days depending on the prompt's complexity (in practice, we completed the tuning period in about a week).

This result should serve as a useful guideline when we take on similar work in the future.

Also, whereas previously only specific members handled this work, having every engineer on the team share the process brought secondary benefits such as spreading prompt-tuning context and knowledge and enabling mutual review.

Until now, we had Biz members actually touch the completed application, perform qualitative quality evaluation, and list modification perspectives. This time, by having them register the expected output values as a dataset in advance as ground truth, we were able to greatly reduce the psychological burden on the tuning side.

Before tuning, we gave Biz members Git links to the prompt definition files and walked them through the actual prompt text, how it works, and in which situations each prompt is executed. Because chain trace data is easy to view from LangSmith's Experiments results, we received concrete suggestions from Biz members on revision policies, such as "Wouldn't it be better to phrase this part of the prompt like this?" As engineers, it's really helpful to receive such detailed perspectives.

Prompt tuning is directly tied to quality and can't be completed by engineers alone. We also came to appreciate how important it is for context to flow both ways, including with the members around us.

Future Challenges

Our efficiency improvements are still quite rough and ad hoc at present.

We expect various challenges to arise as we continue to develop LLM products in the future, but here are some examples of future challenges and the improvements we are considering.

In particular, as we look toward mass-producing character AIs, the manual workload on Biz members is a very big challenge, so we are considering tackling that next.

  • Manual cost of creating the feature and memory data needed to build a character AI
    — Automatically generate feature and memory data from personality definition data using generative AI
  • Manual cost of creating ground-truth data as expected output values
    — Use generative AI to infer and generate correct data for a new character AI from existing datasets and its personality definition data
  • Quality and cost issues where prompt evaluation and tuning still depend on manual effort
    — Consider LLM-as-a-Judge mechanisms in which AI evaluates and tunes (see the sketch after this list)
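
As a rough idea of that last item, an LLM-as-a-Judge check could reuse the same evaluator signature as same_sign above; the model choice and the grading criteria below are purely illustrative, not something we run today.

from langchain_openai import ChatOpenAI
from langsmith.schemas import Example, Run

# Hypothetical LLM-as-a-Judge evaluator with the same signature as same_sign.
judge_llm = ChatOpenAI(model="gpt-4o", temperature=0)

def character_likeness(run: Run, example: Example) -> dict[str, bool]:
    generated = (run.outputs or {}).get("output", "")
    expected = (example.outputs or {}).get("output", "")
    verdict = judge_llm.invoke(
        "You are grading a character AI's reply.\n"
        f"Expected reply: {expected}\n"
        f"Generated reply: {generated}\n"
        "Answer YES if the generated reply matches the character's speech style "
        "and the intent of the expected reply, otherwise answer NO."
    )
    return {"score": "YES" in verdict.content.upper()}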

Conclusion

I've introduced an approach that uses Google Spreadsheets and a LangSmith Helper to address challenges in the fairly common task of prompt tuning.
I feel we were able to build one concrete example of how to streamline labor-intensive, time-consuming work and how to apply new technology like generative AI to solve it.
We hope to continue improving and make LLM product development better.

That’s all, thank you for reading.
