Evaluating GenAI Development: A Data-Centric Approach vs. Purpose-Built Tools
As data professionals, we’re increasingly being called upon to support or lead Generative AI (GenAI) initiatives within our organizations. This shift presents both challenges and opportunities. While purpose-built tools like Braintrust promise streamlined GenAI evaluation, we posit that our existing data expertise might offer a more powerful and flexible approach.
In this article, we’ll explore a crucial question facing data teams venturing into GenAI: Should we adopt specialized evaluation platforms, or can we leverage our data engineering skills to create a more adaptable solution? Drawing from our hands-on experience with both Braintrust and data-driven methodologies, we’ll examine the trade-offs and demonstrate why treating GenAI development — especially the evaluation phase — as a data problem might be the key to success. We first raised this idea in Transforming GenAI Development: The Power of Treating It as a Data Problem.
For data teams already comfortable with tools like SQL, dbt, and data pipelines, this perspective could significantly smooth the transition into GenAI development. Let’s dive into how our data expertise can be a game-changer in the world of GenAI evaluation and iteration.
You Don’t Need a Purpose-Built Tool
When embarking on GenAI evaluation, it’s tempting to reach for a specialized solution like Braintrust. However, our experience suggests that general-purpose, data-centric approaches are often the better fit. Here’s why:
1. Adaptability Challenges: Purpose-built tools, while initially appealing, can struggle to keep pace with the rapidly changing needs of GenAI development. By the time a team fully adopts and learns these tools, they may already be insufficient for evolving requirements.
2. The Data-Centric Mindset Shift: The key insight we’ve gained is that evaluating GenAI development is fundamentally a data problem, not a tooling problem. This shift in perspective opens up more flexible and powerful approaches to assessment.
3. Control and Flexibility: When you frame GenAI evaluation as a data challenge, you retain full control over your processes and maintain the adaptability to pivot as needed — something that purpose-built tools often struggle to provide.
Lessons from Our Braintrust Experience
We applied our reccedog project to Braintrust so we could compare how the platform works against a data-driven approach.
Highlights of Braintrust
- Ease of initial setup
- Predefined evaluation metrics
- User-friendly interface
For those unfamiliar with Braintrust, here’s a brief overview of its key features. This walk-through will help you understand the platform’s capabilities and how it aims to streamline GenAI development and evaluation processes.
Experiments
Also known as evaluations, experiments are Braintrust’s framework for measuring the performance of your AI application.
To evaluate your AI application, you need to provide an evaluation dataset for measuring your application’s results. Braintrust provides an Eval class for defining your evaluation process in Python or TypeScript code.
```python
from braintrust import Eval
from autoevals import Levenshtein

Eval(
    "Say Hi Bot",  # Replace with your project name
    data=lambda: [
        {
            "input": "Foo",
            "expected": "Hi Foo",
        },
        {
            "input": "Bar",
            "expected": "Hello Bar",
        },
    ],  # Replace with your eval dataset
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[Levenshtein],
)
```
Although Braintrust provides the evaluation feature, you still need to prepare your own dataset, implement the evaluation task, and define the scoring mechanism. It’s more of an infrastructure for quickly setting up your GenAI application, letting you focus on adjusting the prompt instead of tuning the engineering details.
Datasets
In the previous section, we set up the dataset inside the evaluation code. However, Braintrust also provides a way to manage evaluation datasets from its cloud platform.
You can manually add your own evaluation dataset or import datasets from CSV/JSON files.
The cloud-managed datasets feature is one of the most useful features of Braintrust. In practice, evaluation datasets are often maintained by people in non-technical roles, and a cloud platform for editing and reviewing datasets is a good interface for them: they don’t have to risk modifying the code base when updating evaluation datasets, and you don’t need to write any code to manage the datasets either. However, it’s important to note that you still need to manually curate these datasets, which can be time-consuming for large or frequently changing datasets.
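As a rough sketch of how this fits into the evaluation code, a cloud-managed dataset can be pulled into an Eval instead of hard-coding the examples. This assumes the Braintrust SDK’s init_dataset helper as we understood it at the time of writing, and the dataset name "Greeting Cases" is our own hypothetical example; check the current docs for exact signatures.

```python
from braintrust import Eval, init_dataset
from autoevals import Levenshtein

Eval(
    "Say Hi Bot",  # Replace with your project name
    # Pull the evaluation records from a dataset managed on the platform
    # instead of embedding them in code ("Greeting Cases" is a hypothetical name).
    data=init_dataset(project="Say Hi Bot", name="Greeting Cases"),
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[Levenshtein],
)
```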
Prompts
A GenAI application is, at its core, the art of using prompts: choosing an appropriate prompt is the core value of your AI application, and comparing the results of different prompts is the main activity when developing one. The Braintrust Prompts feature lets you manage versions of your prompts, and developers can also execute a prompt on the Braintrust platform without modifying the application code.
Logs
Observability is an important part of operating a software application. From a developer’s perspective, it takes a lot of effort to retrofit observation code if it wasn’t implemented from the beginning of the project.
The Braintrust Logs feature provides an easy way to add observation code to your GenAI application. By adding a decorator and wrapping the LLM library call, you can easily collect metrics from your production environment.
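To make this concrete, here is a minimal sketch assuming the SDK’s init_logger, wrap_openai, and traced helpers (names as we understood them when writing; consult the current docs), plus a hypothetical answer_question function of our own.

```python
from braintrust import init_logger, traced, wrap_openai
from openai import OpenAI

# Send traces from this service to a Braintrust project.
init_logger(project="Say Hi Bot")

# Wrapping the client records every LLM call (inputs, outputs, latency, tokens).
client = wrap_openai(OpenAI())

@traced  # Logs this function's inputs and outputs as a span.
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```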
Takeaways from Braintrust
Braintrust offers a comprehensive suite of tools for GenAI application development and evaluation. At its core, it provides a framework to rapidly establish the infrastructure for operating a GenAI application. The platform excels in experiment management, offering robust tools for tracking different iterations and comparisons. It also includes features for efficient management and versioning of evaluation datasets, as well as prompt versioning for easy management and comparison of different prompts.
One of Braintrust’s strengths lies in its logging capabilities, providing built-in features for better observability of GenAI applications. While users can implement custom evaluation logic, Braintrust integrates with the open-source Autoevals library, offering efficient scoring methods out of the box.
The key advantage of Braintrust is its ability to reduce initial development time and resources, making it especially beneficial for teams starting from scratch in GenAI development. Its user-friendly interface and predefined tools can significantly accelerate the GenAI development process.
However, it’s important to note potential limitations. Teams with existing GenAI applications may find less value in adopting the full platform if they’ve already implemented similar features internally. Additionally, Braintrust may introduce a separate workflow from existing data processes, potentially creating silos within an organization’s data ecosystem.
In summary, Braintrust provides valuable infrastructure for GenAI development, particularly suited for teams new to GenAI or those looking for a comprehensive, out-of-the-box solution. However, its value proposition may diminish for teams with established GenAI workflows or those seeking deep customization and integration with existing data ecosystems. The decision to adopt Braintrust should be weighed against an organization’s specific needs, existing capabilities, and long-term GenAI strategy.
Autoevals as a Flexible Tool for Both Approaches
In our exploration of Braintrust, we’ve discovered Autoevals can be a powerful tool that bridges the gap between custom data-centric approaches and purpose-built platforms. This open-source library deserves special attention as it offers flexibility and efficiency regardless of your chosen evaluation strategy.
Autoevals as an Open-Source Library
Autoevals is not a component of the Braintrust platform; it’s a client-side library that you import into your evaluation code. Because it isn’t tied to any specific platform, it’s a versatile choice for data teams. As an open-source library, it can be easily integrated into existing data pipelines or used in conjunction with tools like Braintrust. This flexibility allows teams to leverage its capabilities without being locked into a particular ecosystem.
In addition to mathematical methods for scoring GenAI results, Autoevals also provides LLM-as-a-judge scorers (a usage sketch follows the list below):
- Battle: Test whether an output better performs the instructions than the original (expected) value.
- ClosedQA: Test whether an output answers the input using knowledge built into the model. You can specify criteria to further constrain the answer.
- Humor: Test whether an output is funny.
- Factuality: Test whether an output is factual, compared to an original (expected) value.
- Moderation: A scorer that uses OpenAI’s moderation API to determine if the AI response contains ANY flagged content.
- Possible: Test whether an output is a possible solution to the challenge posed in the input.
- Security: Test whether an output is malicious.
- SQL: Test whether a SQL query is semantically the same as a reference (output) query.
- Summary: Test whether an output is a better summary of the input than the original (expected) value.
- Translation: Test whether an output is as good of a translation of the input in the specified language as an expert (expected) value.
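As a minimal usage sketch, the Factuality scorer can be invoked directly; since these scorers call an LLM under the hood, this assumes an OpenAI API key is configured in your environment.

```python
from autoevals import Factuality

evaluator = Factuality()

# The scorer asks an LLM judge to compare the output against the expected answer.
result = evaluator(
    input="Which country has the highest population?",
    output="People's Republic of China",
    expected="China",
)
print(result.score)     # 1.0 means the judge considered the output factual
print(result.metadata)  # supporting details from the judge
```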
Using Autoevals can significantly reduce the time required to implement evaluation code for GenAI applications. Its pre-built evaluation methods allow developers to quickly set up robust assessment systems without writing everything from scratch.
Practical Application in Custom Pipelines and Braintrust
For data teams adopting a custom, data-centric approach:
- Autoevals can be seamlessly integrated into existing data workflows (see the sketch after this list).
- It provides a solid foundation of evaluation metrics that can be extended or customized as needed.
- The library can be used to automate parts of the evaluation process, freeing up time for more complex analyses.
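To make the first point concrete, here is a minimal sketch of scoring a batch of model outputs inside an ordinary pandas step; the column names and the choice of the Levenshtein scorer are our own, for illustration only.

```python
import pandas as pd
from autoevals import Levenshtein

# A batch of model outputs alongside their expected answers,
# e.g. loaded from a warehouse table or a dbt model.
df = pd.DataFrame({
    "input": ["Foo", "Bar"],
    "output": ["Hi Foo", "Hi Bar"],
    "expected": ["Hi Foo", "Hello Bar"],
})

scorer = Levenshtein()
df["score"] = df.apply(
    lambda row: scorer(output=row["output"], expected=row["expected"]).score,
    axis=1,
)

# The scored frame can then be written back to the warehouse
# and analyzed with the team's usual SQL/dbt tooling.
print(df)
```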
For teams using Braintrust:
- Autoevals can complement Braintrust’s built-in features, offering additional evaluation options.
- It allows for more customized evaluations within the Braintrust framework (see the sketch after this list).
- The library can be used to standardize evaluation metrics across different projects or teams.
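For instance, a plain Python function can sit next to an Autoevals scorer in the Eval call shown earlier. The exact_match helper below is a hypothetical example of our own; our understanding is that Braintrust treats custom scorers as callables returning a value between 0 and 1, but verify this against the current documentation.

```python
from braintrust import Eval
from autoevals import Levenshtein

def exact_match(input, output, expected):
    # Hypothetical custom scorer: 1.0 when the output matches the expected value exactly.
    return 1.0 if output == expected else 0.0

Eval(
    "Say Hi Bot",  # Replace with your project name
    data=lambda: [{"input": "Foo", "expected": "Hi Foo"}],
    task=lambda input: "Hi " + input,  # Replace with your LLM call
    scores=[Levenshtein, exact_match],  # Mix library and custom scorers
)
```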
Benefits:
- Reduces development time for evaluation metrics.
- Provides consistency in evaluation across different projects.
- Offers a balance between out-of-the-box functionality and customization potential.
- Can be used to benchmark custom metrics against standardized evaluations.
Limitations:
- May require some coding knowledge to implement and customize effectively.
- Predefined metrics might not cover all specific use cases, necessitating additional development.
- As with any external library, teams need to stay updated with new versions and potential breaking changes.
In our experience, Autoevals has proven to be a valuable asset in our GenAI evaluation toolkit. Its flexibility allows it to enhance both custom data-centric approaches and platform-based solutions like Braintrust. By leveraging Autoevals, data teams can quickly establish a robust evaluation framework while maintaining the freedom to adapt and extend as project needs evolve.
Whether you choose a fully custom data pipeline or opt for a tool like Braintrust, consider how Autoevals might fit into your evaluation strategy. It could be the key to balancing efficiency, standardization, and customization in your GenAI development process.
Final Thoughts
As data professionals venturing into GenAI development, we’ve explored the trade-offs between purpose-built tools like Braintrust and treating evaluation as a data problem. Our journey has revealed several key insights:
1. Data-Centric Approach: Leveraging our existing data expertise offers unparalleled flexibility and integration with our current workflows. It allows for deeper insights and customization but requires more initial setup.
2. Braintrust: Provides a streamlined, user-friendly platform with predefined metrics and easy experiment tracking. It’s quick to set up but may limit customization and create a separate workflow from existing data processes.
3. Autoevals: Serves as a versatile, open-source tool that can enhance both custom data pipelines and Braintrust implementations. It automates common evaluations, saving time and ensuring consistency across projects.
The choice between these approaches depends on your team’s existing capabilities, project requirements, and long-term vision for GenAI integration. For teams starting from scratch, Braintrust can significantly reduce development time and resources. Its predefined metrics and user-friendly interface make it an attractive option for those new to GenAI evaluation.
However, for teams with existing GenAI applications or robust data infrastructure, the data-centric approach may be preferable. It offers deeper analysis capabilities, greater control, and seamless integration with existing data workflows.
For long-term success, we recommend treating GenAI evaluation as a data problem. This approach ensures flexibility, scalability, and the ability to adapt to evolving project needs. Consider using Autoevals in conjunction with either approach to automate common evaluations and provide a solid foundation while allowing for customization.
Remember, the field of GenAI is rapidly evolving. Whichever approach you choose, maintaining flexibility and continuously refining your evaluation processes will be key to staying ahead in this dynamic landscape. By treating GenAI evaluation as a data problem and leveraging tools like Autoevals, you’re well-positioned to adapt to new challenges and opportunities as they arise.
—
I would like to extend my sincere thanks to Kent Huang for his valuable insights and contributions, which have been instrumental in shaping this article. His findings and expertise provided the foundation for many of the key points discussed here.