Transforming GenAI Development: The Power of Treating It as a Data Problem
GenAI development is a data problem
When developing a GenAI use case, the process typically looks like this: input → LLM logic → output. But here’s the key insight: focusing solely on code and model logic isn’t enough. The success of any GenAI project hinges on the quality of the data feeding the system. What truly drives results are the inputs, the data itself, and how your code changes affect that data. This is the core idea: GenAI development is not just about model tuning; it’s fundamentally a data problem. If you’re already focusing on data in your GenAI projects, you’re ahead of the curve.
Can’t quite see it yet? Let me show you an example.
Developing a GenAI project
To demonstrate why GenAI development is fundamentally a data problem, let me walk you through an internal project. We used to spend hours each week manually sourcing leads from Medium engagements, such as claps or saved posts, and matching those names to LinkedIn profiles for outreach. This process was tedious and heavily relied on human intuition — recognizing patterns like matching names, profile pictures, and other subtle cues that only a trained eye might catch.
Naturally, we wondered: Can GenAI automate this for us? Could it replicate the nuanced judgment that data professionals apply when making these connections? This is the type of logic we aimed to automate with our GenAI solution.
We envisioned a GenAI application that could take over this manual effort. The concept was straightforward: take the users behind Medium claps as input and output matching LinkedIn profiles. We believed GenAI could handle this transformation seamlessly. This is how our solution, Reccedog, came to life.
Here’s how Reccedog works: You input a name, and it provides suggested leads, complete with explanations for each match.
GenAI Limitation
While LLMs like ChatGPT are powerful, they can’t directly access LinkedIn’s database. To bridge this gap, we needed an external source that could provide LinkedIn data. That’s where Apollo.io came in, offering a search API to pull the necessary information.
This divided the mapping process into two parts:
- Search LinkedIn via Apollo
- Suggest mapping via LLM
The entire process unfolds like this:
The system prompt combines Apollo’s search results with a carefully crafted instruction. Here was our initial prompt:
Analyze the CSV file to identify individuals. Provide your response as a Markdown table with the following columns: Name, LinkedIn Profile, Relevance (Explain how the individual meets the criteria).
Evaluating LLM Iterations
With every iteration — whether it’s tweaking the system prompt, adding pre-processing steps, modifying API result parsing, or adjusting configurations — we aim to determine if the LLM’s responses are improving. But how do we evaluate these changes effectively? How do we know if the result is actually better — or at least not worse?
One option is to have humans manually review and rate the results, or use an evaluation service to automate this. However, relying on human reviewers is too slow and disrupts the development cycle. We needed a more efficient solution — one that allows developers to evaluate the results themselves.
For example, when we modify the LLM configuration in the YML file, we want to check whether costs were reduced while output quality was maintained. The YML is semi-structured, so bringing it into a pipeline alongside the output data makes these comparisons straightforward. Without a pipeline, developers would need to write custom Python scripts to cross-check YML files against outputs, which isn’t scalable.
While evaluation services can provide a high-level metric, they often lack transparency — you get a score but not the underlying data. To truly understand what’s improving (or not), you need direct access to the raw evaluation data. In Reccedog, for instance, Apollo outputs JSON files, which aren’t easy to analyze. Writing custom code for each evaluation makes it even less efficient and far from systematic.
To make this evaluation process systematic, we transform the raw JSON files into structured data. For example, we extract key attributes from Apollo’s JSON responses and organize them into a table with columns such as query, apollo_response, and linkedin_profile. By converting unstructured API outputs into structured datasets, it becomes much easier to query, compare, and analyze data across iterations.
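As a rough sketch of what that transformation can look like in DuckDB SQL (the file path and the people/name/headline/linkedin_url fields are assumptions about the response shape, not Apollo’s documented schema):

```sql
-- Minimal sketch: flatten raw Apollo JSON responses into a structured table.
-- The file path and field names are assumptions for illustration only.
CREATE TABLE stg_apollo AS
WITH exploded AS (
    SELECT
        query,
        unnest(people) AS person            -- one row per returned person
    FROM read_json_auto('apollo_responses/*.json')
)
SELECT
    query,
    person.name          AS apollo_name,
    person.headline      AS apollo_headline,
    person.linkedin_url  AS linkedin_profile
FROM exploded;
```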
This transformation allows us to treat the evaluation process like any other data problem — adding structure around the results so we can measure and track improvements consistently.
Querying the Evaluation Data
To evaluate each LLM iteration effectively, we integrated the evaluation process into a structured data pipeline. We converted the prompt configurations, Apollo results, and LLM responses into parquet files, and loaded them into DuckDB as source data. This allows us to transform the data easily and calculate the necessary metrics.
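As a simplified illustration (the file names below are placeholders, not our actual project layout), loading those parquet files into DuckDB is a one-liner per source:

```sql
-- Load the exported parquet files into DuckDB as source tables.
-- File names are placeholders for illustration.
CREATE TABLE prompt_configs AS SELECT * FROM read_parquet('prompt_configs.parquet');
CREATE TABLE apollo_results AS SELECT * FROM read_parquet('apollo_results.parquet');
CREATE TABLE llm_responses  AS SELECT * FROM read_parquet('llm_responses.parquet');
```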
Below is the concept of our Evaluation LLM Pipeline:
Here are the details: We created an evaluation dataset that includes query data from Apollo and ground truth from our previous outreach efforts. This dataset forms the basis for comparing the LLM’s responses with the real LinkedIn profiles (ground truth). For each query, we assess whether the LLM’s response correctly matches the ground truth — classifying each as a “hit” or “miss.” This forms the evaluation result dataset.
- EVAL_DATASET: Contains the query, Apollo response, and LinkedIn ground truth (profile URL).
- EVAL_RESULT: Tracks each LLM response against the ground truth, with a hit/miss classification (True/False).
- EVAL_METRIC: Accuracy is calculated as the ratio of correct hits (True) to total responses (N).
By calculating these metrics, we gain insight into the LLM’s performance on each iteration.
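As a rough sketch, the EVAL_RESULT and EVAL_METRIC steps can be expressed as dbt-style models along these lines (the model and column names are illustrative, not our exact schema):

```sql
-- eval_result (sketch): classify each LLM response as a hit or a miss
-- by comparing it against the LinkedIn ground truth.
SELECT
    d.query,
    r.suggested_profile,
    d.linkedin_profile                         AS ground_truth,
    r.suggested_profile = d.linkedin_profile   AS hit
FROM {{ ref("eval_dataset") }} AS d
JOIN {{ ref("llm_responses") }} AS r USING (query)
```

```sql
-- eval_metric (sketch): accuracy = correct hits / total responses.
SELECT
    count(*) FILTER (WHERE hit) * 1.0 / count(*) AS accuracy
FROM {{ ref("eval_result") }}
```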
Once these evaluations are transformed into structured data models, querying becomes straightforward. For instance, we can easily calculate accuracy or any other key metric. And since we’re the team behind Recce, we use it to quickly query and compare differences between iterations, which lets us track improvements or issues efficiently.
This structured approach to evaluation helps us monitor iteration quality in real time and ensures that our development cycle remains data-driven and transparent.
Immediate Evaluation to Speed Up Iterations
Earlier, we mentioned modifying the LLM configuration in the YML file. Once we set up the evaluation data pipeline, we could immediately see the effect of any model change on cost and output quality.
Here’s an example: we switched between two models — GPT-4o and GPT-4o-mini. With the evaluation pipeline in place, we could quickly see the cost comparison, along with the impact on accuracy and other metrics. This helped us assess whether the reduced cost of using GPT-4o-mini was worth any trade-off in performance.
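A comparison like that boils down to a simple query. As a hypothetical example (the model, cost_usd, and hit columns are assumptions for illustration):

```sql
-- Hypothetical query: compare accuracy and total cost per model across runs.
SELECT
    model,
    count(*) FILTER (WHERE hit) * 1.0 / count(*) AS accuracy,
    sum(cost_usd)                                AS total_cost
FROM {{ ref("eval_result") }}
GROUP BY model
ORDER BY total_cost
```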
Having this data readily available allows us to iterate faster, track key metrics in real time, and make informed decisions without slowing down the development process. Whether it’s a configuration tweak or a prompt change, we can instantly see the impact by querying the evaluation results. No manual review or custom code is needed, making the process seamless and efficient.
Simplifying Analysis by Turning Development into a Data Problem
By approaching LLM development as a data problem, we’ve greatly simplified the analysis process, making it far easier to diagnose and resolve issues during iterations. With the LLM pipeline in place, we can quickly review evaluation results and identify problems in real time.
For instance, during one update, most of the evaluation results were classified as “Miss.” This immediately flagged an issue that needed attention.
After fixing the broader issue, we noticed a particular name continued to appear as “Miss” in several iterations. While overall metrics didn’t shift dramatically, this specific case warranted further investigation.
Thanks to the structured data approach, we could easily dive into the details. Was the LLM failing due to poor search results, or was the prompt filtering out relevant data? To find out, we queried the “Miss” data in Recce:
select * from {{ ref("stg_results") }} where hit = false
This analysis quickly revealed that the “Headline” field from Apollo’s results wasn’t being utilized by the LLM, even though it carried information that was key to improving accuracy. After incorporating this field, the model’s performance improved significantly.
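For illustration, a follow-up query along these lines (the apollo_headline and suggested_profile columns are assumed names, not our exact schema) puts the unused Headline values right next to the misses:

```sql
-- Illustrative drill-down: show what Apollo's Headline contained for each miss,
-- alongside the profile the LLM suggested. Column names are assumptions.
SELECT
    query,
    apollo_headline,
    suggested_profile,
    linkedin_profile AS ground_truth
FROM {{ ref("stg_results") }}
WHERE hit = false
```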
Before adopting this data-driven approach, we would have been forced to manually sift through JSON files and cross-check different outputs. Now, by framing the development as a data problem, we’ve streamlined the process, making analysis faster, easier, and far more accurate.
Benefits of Mapping Evaluation into Data
By mapping our evaluation process into structured data, we’ve unlocked several key benefits that enhance both the efficiency and effectiveness of our LLM development.
- Streamlined Evaluation as Raw Data: Each evaluation generates raw data that feeds directly into our data pipeline. This means that evaluation outputs are automatically structured and ready for analysis — no more manual extraction or transformation.
- SQL for Instant Analysis: With the evaluation data in a structured format, we can leverage the power of SQL to query, analyze, and compare results instantly. This removes the need to write custom code every time we want to investigate the output. For example, rather than manually cross-referencing YML or JSON files, we can use simple SQL queries to inspect model performance, discrepancies, or cost efficiencies.
- Faster, Data-Driven Debugging: Without this structured approach, debugging LLM performance would involve sifting through JSON files and writing custom scripts for each new evaluation. Now, we can easily identify issues like missing data or incorrect configurations using SQL, saving significant development time and reducing the risk of human error.
- Use of dbt for Transformation: We use dbt (data build tool) to handle transformations, which further simplifies the analysis process. dbt allows us to create reproducible, version-controlled transformations that can turn raw evaluation data into actionable insights. This means we can easily manipulate, join, or aggregate data to surface relevant metrics without any manual data wrangling.
- Scalability and Automation: By integrating the evaluation data into a pipeline, we’ve set ourselves up for scalable automation. As our models evolve, the pipeline automatically adapts, so we can focus on improving model performance rather than maintaining manual analysis workflows. This makes the entire development process more agile and scalable as new data sources or evaluation metrics are introduced.
- Real-Time Feedback for Iterations: With everything plugged into a data pipeline, we can get immediate feedback from each iteration, enabling us to spot issues or improvements faster. This real-time insight speeds up the entire development cycle, allowing us to iterate more frequently and with greater confidence.
By framing evaluation as a data problem and using tools like dbt and SQL, we’ve not only improved the accuracy of our results but also significantly reduced the time spent on manual analysis, allowing the team to focus on higher-value tasks.
GenAI Development is a Data Problem
In the Reccedog project, by treating evaluation as a data problem, we were able to streamline development, improve efficiency, and accelerate iteration cycles. The structured evaluation data allowed us to make informed decisions quickly, significantly improving both the accuracy and cost-effectiveness of our model development.
We hope this article has inspired you to approach your GenAI projects with a data-driven mindset. As you develop your own GenAI solutions, remember that the key to success lies in how you handle and analyze the data at every stage of the process.
Stay tuned for our next article, where we’ll dive into Braintrust, an evaluation product, and compare its capabilities against our current approach. We’ll be sharing more insights and strategies on how to further optimize GenAI development.
If you found this article helpful, follow us for more tips, best practices, and updates on GenAI development. Let’s keep exploring the future of data-driven innovation together.
—
A special thanks to Chen-en Lu, 🙌 who carried out all the hard work, and CL Kao, whose vision made this project possible. I’m grateful for the opportunity to contribute by writing this article.