How Intuit data analysts write SQL 2x faster with internal GenAI tool

Robin Oliva-Kraft
Published in Intuit Engineering
6 min read · Mar 27, 2024

Speed to insight is critical for data analysts and decision makers at today’s enterprise companies across myriad industries. For Intuit, turning data into actionable insights is pivotal to our success in delivering awesome experiences to 100 million consumer and small business customers with TurboTax, Credit Karma, QuickBooks, and Mailchimp.

That’s why we’ve developed an internal generative AI (GenAI)-powered tool called Query Kickstart to improve speed to insight by accelerating SQL query authoring for our data workers.

In the throes of peak tax season this year, our technologists can tap into an internal data discovery application to look up a data lake table, use Query Kickstart to ask a business question (bundled with a data dictionary behind the scenes), and retrieve a draft SQL query in a matter of seconds.
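Behind the scenes, the prompt combines the analyst's business question with the table's data dictionary so the model can ground its draft in real column names. A minimal sketch of that assembly (the table, columns, and prompt template below are hypothetical; Query Kickstart's actual internals are not public):

```python
def build_prompt(question: str, table: str, data_dictionary: dict[str, str]) -> str:
    """Bundle a business question with a table's data dictionary,
    steering the model toward columns that actually exist."""
    columns = "\n".join(f"- {col}: {desc}" for col, desc in data_dictionary.items())
    return (
        f"You are a SQL assistant. Using ONLY the columns below from `{table}`,\n"
        f"write a Spark SQL query that answers the question.\n\n"
        f"Columns:\n{columns}\n\n"
        f"Question: {question}\n"
    )

# Hypothetical usage: a business question plus a small data dictionary.
prompt = build_prompt(
    "How many orders did each region place last month?",
    "sales.orders",
    {
        "order_id": "unique order identifier",
        "region": "customer region code",
        "order_date": "date the order was placed",
    },
)
```

The key design point is that the dictionary is bundled automatically; the analyst only types the business question.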

Back in summer 2023, early versions of Query Kickstart showed great promise, and we thought it would improve speed to insight. But in user testing and automated benchmarks, we uncovered a disappointing number of mistakes and hallucinations. So we initiated a study to better understand whether the productivity benefits would outweigh accuracy concerns.

In a study with 25 data analysts from across Intuit, we found that Query Kickstart users wrote SQL queries 2.2x as fast as those who did not use it (i.e., in 55% less time). They also completed 22% more tasks, despite having to correct mistakes and hallucinations when refining GenAI-generated queries. This productivity boost gave our team confidence that the tool could be deployed to data analysts, software engineers, and others querying data across Intuit.

The rest of this post will describe the study and share results and observations.

Study design

Our study was modeled loosely after Noy and Zhang (2023) and a GitHub Copilot evaluation (2022), and designed to ensure variability among data analyst participants across business units, Intuit tenure, and analytics experience. We tested three hypotheses developed during user testing:

  1. Query Kickstart improves productivity with current accuracy.
  2. Query Kickstart is more helpful for analysts with less experience or shorter tenure at Intuit.
  3. Query Kickstart is more useful when applied to unfamiliar data.

Participants, half of whom had access to Query Kickstart, spent up to an hour on tasks that were representative of their day-to-day work, including writing Spark SQL in response to a business question and translating a Spark SQL query into PySpark code. To control for domain expertise, three of the tasks dealt with data that was unfamiliar to participants, while one task allowed participants to choose an option tailored to their business unit.

Results

✅ Hypothesis 1: Query Kickstart is useful with current accuracy.

This hypothesis was validated. Query Kickstart users on average completed the tasks 2.2x as fast as the control group (or in 55% less time). This speed-up is in line with the GitHub Copilot evaluation. More specifically, Query Kickstart users completed each task in 7.4±1.5 minutes, with a 96% completion rate. Users in the control group completed each task in 16.2±2.9 minutes, with a 79% completion rate.

For the Spark SQL to PySpark translation task, users with AI assistance were 1.7x as fast as the control group and completed the task 90% of the time. The control group, however, was able to complete the task only 42% of the time.
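To illustrate the kind of translation this task involved, here is a hypothetical query pair (the table and column names are invented; the PySpark version is kept as a string because executing it would require an active SparkSession named `spark`):

```python
# A Spark SQL query of the kind analysts were asked to translate.
spark_sql = """
SELECT region, COUNT(*) AS orders
FROM sales
WHERE order_date >= '2024-01-01'
GROUP BY region
"""

# One idiomatic PySpark DataFrame equivalent (not executed here).
pyspark_code = """
from pyspark.sql import functions as F

result = (
    spark.table("sales")
    .filter(F.col("order_date") >= "2024-01-01")
    .groupBy("region")
    .agg(F.count("*").alias("orders"))
)
"""
```

The translation is mechanical in simple cases like this, but mapping SQL clauses onto the right DataFrame method chain is exactly where AI assistance appeared to help the most.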

Based on this, we estimated that regular use of Query Kickstart could save analysts as much as 6 hours per week on query-authoring tasks.

✅ ❌ Hypothesis 2: Query Kickstart is more helpful for analysts with less experience or shorter Intuit tenure.

This hypothesis was partially validated. We found that more junior analysts saw the largest productivity boost compared to more senior analysts (2.7x vs. 2.0x as fast as the control group). Unexpectedly, Intuit tenure did not appear to affect the productivity impact. Analysts with <=2 years of tenure at Intuit saw roughly the same benefit from Query Kickstart as those with >2 years of tenure (2.2x vs. 2.3x as fast as the control group).

Note: the small size of these subgroups may be driving this result.

✅ ❌ Hypothesis 3: Query Kickstart is more useful when used with unfamiliar data.

This hypothesis was partially validated. Participants using Query Kickstart were 2.4–3.5x as fast as the control group when working on the two questions that used unfamiliar data. In feedback, some participants specifically pointed out the utility of Query Kickstart for unfamiliar data.

For the question tailored to specific analysts’ business units, where domain expertise and data familiarity could help, we saw mixed results. QuickBooks analysts who relied on Query Kickstart reported times 2.7x as fast on average as QuickBooks analysts in the control group. Conversely, for the TurboTax question answered by TurboTax analysts, there was no difference in response times between the two groups.

Note: the small size of these subgroups may be driving the results.

Observations

  1. Optimize for metrics that drive your desired outcome

Concerns about GenAI accuracy and hallucination are important but can be limiting. As mentioned above, we worried that our tool wouldn’t be useful because of disappointing results on benchmark tests designed to detect hallucinations and other mistakes. This study of productivity impact gave us confidence to move forward with UX investments even without substantial improvements to accuracy. As we’ve seen with the explosive adoption of AI chatbots, when something is useful, people can work around its mistakes.

2. People make mistakes with and without AI, but the type of mistakes may differ

The fact that unassisted humans make mistakes in their work can get lost in discussions of AI accuracy. And while this was not the focus of our study, we noticed that some participants in both groups submitted results with mistakes.

Anecdotally, the types of errors appeared to be different in each group. Query Kickstart users might submit queries that use plausible-sounding but incorrect columns, such as the wrong date column, or be overconfident in AI results and “complete” a task in seconds (these rapid responses were not included in the results reported above). On the other hand, queries from the control group might not fully answer the question due to a missing WHERE or GROUP BY clause.
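As a concrete illustration of the second kind of mistake (using a toy in-memory table, not Intuit data): dropping a GROUP BY can leave a query that still runs but silently collapses a per-group count into a single row, so it no longer answers the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("west", 10.0), ("west", 20.0), ("east", 5.0)],
)

# Intended query: order count per region.
per_region = conn.execute(
    "SELECT region, COUNT(*) FROM orders GROUP BY region ORDER BY region"
).fetchall()
# → [('east', 1), ('west', 2)]

# Same query with the GROUP BY dropped: SQLite still accepts it,
# but it returns a single row with a grand total instead of per-region counts.
collapsed = conn.execute("SELECT region, COUNT(*) FROM orders").fetchall()
```

Errors like this are easy to miss precisely because the query executes without complaint, which is why reviewing output against the original question matters in both groups.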

3. Excellent data documentation and “clean data” improves results

In our accuracy benchmark tests, we noticed that including data documentation in the prompt made Query Kickstart 44% less likely to hallucinate non-existent columns and 37% more likely to use the correct columns. As Intuit continues to invest in clean data and data mesh, the accuracy of SQL generation should improve over time, even without LLM or prompt improvements.

4. Investing in UX can provide better ROI than improving accuracy

Query Kickstart has been available in our internal data discovery tool since September 2023, but it and other GenAI tools were initially buried in a conversational experience and we saw lower-than-expected adoption. Recent changes to embed the GenAI tools in existing workflows appear to have driven a 5x increase in repeat usage since January 2024, without corresponding investments in internal marketing or accuracy.

Conclusions

Measuring productivity was difficult and time-consuming, but this study was worth it as it has been the best way for us to validate the potential impact of Query Kickstart in data workflows. While accuracy benchmarking is critical, our internal users are expected to refine the output of Query Kickstart, not trust it blindly. So it has been important to consider and design around other factors that may also contribute to the value and impact of GenAI-powered tools, including ease of access, conversational UX vs. dedicated workflows, and each use case’s resilience to errors.

We hope that this study provides a useful data point in the literature on the workplace impact of generative AI. We also hope that its limitations will help inspire the industry to develop more automated evaluation methods that can be deployed in workplace settings and support larger groups of participants. This would allow teams like ours to quickly measure the productivity impact of changes like prompt refinements or switching LLMs, while also allowing robust analysis of subgroups like those identified in hypotheses 2 and 3.

Going forward, we’re making improvements to our Query Kickstart tool as we learn more from our users. We’re also designing another study to measure both accuracy and productivity in tandem, so that we can appropriately balance ongoing investments in baseline accuracy, user experience, and feature enhancements.

So, stay tuned for updates here!
