Unlocking insights — Natural language QnA on structured data

Kanchana Patlolla
Google Cloud - Community
6 min read · Sep 11, 2024

Enterprises today are looking to democratize their structured data sources, such as data warehouses and data lakes, to more and more users. This is crucial because it empowers a broader range of employees, from business analysts to decision-makers, to directly access and analyze critical data without relying on technical specialists.

Traditional approaches to data analysis often involve complex SQL queries or the creation of specialized reports, which can lead to bottlenecks and delays in decision-making. Natural language Q&A on structured data bridges this gap by allowing users to ask questions in plain English and receive accurate, data-driven answers.

We have identified multiple benefits:

  • Accelerated Decision-Making: By enabling users to interact with data conversationally, businesses can significantly reduce the time it takes to gather insights, leading to faster and more informed decisions.
  • Increased Productivity: Self-service data access empowers users to find the information they need independently, freeing up IT resources and improving overall productivity.
  • Democratization of Data: Natural language Q&A removes the technical barriers to data analysis, making insights accessible to a wider audience within the organization.
  • Improved Data Literacy: By encouraging interaction with data through natural language, organizations can foster a culture of data literacy and empower employees to make data-driven decisions.

Furthermore, in our customer engagements, we have seen a growing interest in:

a) Creating new applications based on their structured data for end users;

b) Combining multiple data sources (structured and unstructured) behind new interfaces;

c) Chatbots that integrate directly with structured data instead of relying on specific API calls.

Leveraging large language models for SQL generation is a big part of this democratization.

Enterprise challenges

However, enterprises face multiple challenges in building an end-to-end architecture against a structured data source. These include:

  • Nature of LLM: SQL generation is expected to be highly deterministic because the underlying data source is structured. LLMs, however, are probabilistic in nature. Generating an accurate SQL query is challenging, and the same query can be written in multiple ways.
  • Schema Challenges: Foundation models that are aware of data structures and schemas offer a starting point, but they fall short without business context. To generate truly meaningful insights, we must find ways to seamlessly incorporate business logic.
  • SQL generation challenges: Complexity increases with the dialect and version of the source database. Because the dialect and version combinations can be many, the SQL generation step needs to account for each of them.

State of the industry

Current industry benchmarks such as BIRD top out at around 69% execution accuracy despite continuous research. Given that a deterministic outcome is needed, this field will remain an active area of research.

Furthermore, our own benchmarking results indicate high variation in execution accuracy depending on the database being used. We have tested against BIRD and achieved comparable results.

Solution Approach

This solution accelerator (a reusable asset for customers and partners), Open Data QnA, attempts to address the challenges organizations face when dealing with complex schemas, the probabilistic nature of LLMs, and the need for accurate and contextually relevant answers.

The solution involves creating embeddings of database schemas and using these embeddings to retrieve relevant tables and columns when a user poses a question. The system then generates SQL queries, validates them through a dry-run against the database, debugs the SQL query if necessary, and executes them against the source database to retrieve the necessary information. The results are presented to the user in natural language, along with visualizations if requested. This is enabled through a Multi-agent architecture depicted below.
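The multi-agent flow described above can be sketched as a minimal pipeline. Every function below (retrieve_context, generate_sql, dry_run_ok, debug_sql, execute, summarize) is a hypothetical stand-in for one of the agents, not the accelerator's actual API:

```python
from dataclasses import dataclass

@dataclass
class QnAResult:
    sql: str
    rows: list
    answer: str

def retrieve_context(question: str) -> str:
    # Stand-in for vector search over schema embeddings.
    return "table: sales(region TEXT, amount REAL)"

def generate_sql(question: str, context: str) -> str:
    # Stand-in for the LLM SQL-generation agent.
    return "SELECT region, SUM(amount) FROM sales GROUP BY region"

def dry_run_ok(sql: str) -> bool:
    # Stand-in for a dry-run validation against the database.
    return sql.strip().upper().startswith("SELECT")

def debug_sql(sql: str, error: str) -> str:
    # Stand-in for the LLM debugging agent.
    return sql

def execute(sql: str) -> list:
    # Stand-in for real query execution; rows are invented.
    return [("EMEA", 1200.0), ("APAC", 950.0)]

def summarize(question: str, rows: list) -> str:
    # Stand-in for the natural-language summarization step.
    return f"Found {len(rows)} rows for: {question}"

def answer(question: str, max_retries: int = 2) -> QnAResult:
    context = retrieve_context(question)
    sql = generate_sql(question, context)
    for _ in range(max_retries):
        if dry_run_ok(sql):
            break
        sql = debug_sql(sql, "dry-run failed")
    rows = execute(sql)
    return QnAResult(sql=sql, rows=rows, answer=summarize(question, rows))
```

The value of structuring it this way is that each agent can be swapped or tuned independently, which mirrors the multi-agent architecture the accelerator uses.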

The architecture for this solution accelerator and the relevant code can be found here.

Key features of the solution include:

  • Multi-turn Query Handling: Allows for back-and-forth conversation with the system to refine and clarify questions, just like in a chat, making data exploration more intuitive.
  • Analyze and Validate Resultant SQL Query: Ensures that the SQL query generated from your natural language question is correct and can be executed against your database, preventing errors.
  • Table Grouping: Gives administrators control over which tables users can access, enhancing data security and governance.
  • Data Sampling: Uses a portion of your data to optimize query accuracy, especially for large datasets, balancing speed and precision.
  • Data Summarization: Presents the results of your query in clear, natural language summaries, making insights easier to understand.
  • Resolving Ambiguities: Asks follow-up questions to clarify any ambiguities in your query, ensuring accurate results.
  • Visualize SQL Results: Automatically generates charts and graphs based on your query results, enabling faster and more intuitive understanding of data patterns and trends.

Getting Started

To get started there are two essential steps: injecting contextual awareness into the system and improving overall system performance.

Injecting contextual awareness into the system:

  1. To create the contextual awareness of the data source and the business context, we propose creating an embedding layer that will be populated with the embeddings of the table and column schemas along with descriptions.
  2. We also create a query cache with ‘Known Good SQL’, consisting of working SQL queries along with their paired natural language questions stored as embeddings. This layer can be pre-populated with frequently used user questions and can further be leveraged as a caching layer at application runtime.
  3. We provide customers an option to create this contextual awareness through the technology stack of their choice, including Cloud SQL pgVector, AlloyDB pgVector, and BQ Vector Search.
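As a rough illustration of the two stores described in steps 1 and 2, the sketch below builds a schema-embedding index and a ‘Known Good SQL’ cache. It uses a toy hashing embedder so the example stays self-contained; a real deployment would use a text-embedding model with pgVector or BigQuery Vector Search, and the table names and descriptions here are invented:

```python
import hashlib
import math

DIM = 64  # toy embedding dimensionality

def embed(text: str) -> list:
    # Toy bag-of-words hashing embedder, normalized to unit length.
    # A real system would call a text-embedding model instead.
    vec = [0.0] * DIM
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# 1. Embedding layer: table/column schemas plus business descriptions.
schema_store = [
    ("sales", "region TEXT, amount REAL -- revenue per sales region"),
    ("customers", "id INT, name TEXT -- customer master data"),
]
schema_index = [(name, desc, embed(f"{name} {desc}"))
                for name, desc in schema_store]

# 2. Known Good SQL cache: question/SQL pairs stored as embeddings.
known_good_sql = [
    ("total revenue by region",
     "SELECT region, SUM(amount) FROM sales GROUP BY region"),
]
kgq_index = [(q, sql, embed(q)) for q, sql in known_good_sql]
```

At runtime, an incoming question is embedded the same way and compared against both indexes by cosine similarity.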

Improving overall performance:

SQL Generation

Let’s see what happens behind the scenes when generating SQL:

  1. When the user enters a natural language query, the solution searches the query cache for the ‘Known Good SQL’ to look for the closest semantic match. If a result is found, the solution skips the LLM calls and directly runs the known SQL query. This step provides cost control and flexibility for enterprises to create a standardized approach on how they would like the query to be generated and run.
  2. If no sufficiently close match is found in the ‘Known Good SQL’ query cache, a 90-98% similarity threshold is used to retrieve queries that are similar in nature for few-shot prompting.
  3. We then leverage a RAG (retrieval augmented generation) approach to retrieve the relevant table and schema details based on the user's question.
  4. With the retrieved column and table information, the LLM is invoked to generate the SQL query.
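The cache-first routing in the first two steps can be sketched as below. The 90-98% similarity band comes from the description above; the embed function here is a hypothetical lookup over hard-coded toy vectors, standing in for a real embedding model:

```python
import math

# Toy 2-d "embeddings" for the demo; a real system would call an
# embedding model here.
_TOY_VECS = {
    "total revenue by region": [1.0, 0.0],
    "revenue per region": [0.98, 0.2],
    "list all customers": [0.0, 1.0],
}

def embed(text):
    return _TOY_VECS[text]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Known Good SQL cache: (question, approved SQL) pairs.
CACHE = [("total revenue by region",
          "SELECT region, SUM(amount) FROM sales GROUP BY region")]

EXACT, FEW_SHOT = 0.98, 0.90  # similarity band from the article

def route(question):
    scored = [(cosine(embed(question), embed(q)), q, sql)
              for q, sql in CACHE]
    best = max(scored)
    if best[0] >= EXACT:
        return ("cache_hit", best[2])  # run the known SQL directly
    few_shots = [(q, sql) for s, q, sql in scored if s >= FEW_SHOT]
    return ("llm_generate", few_shots)  # prompt the LLM with examples
```

A near-exact match skips the LLM entirely (cost control), while a looser match still contributes few-shot examples to the generation prompt.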

Post Processing & Validation

  1. The LLM-generated SQL should not be run in a production environment without proper validation and post-processing. We therefore recommend a multi-step validation and debugging approach: the generated SQL query is assessed by another LLM for validity, and the query is tested against the database. If both checks pass without errors, the generated SQL is either executed or displayed to the user, depending on the initial configuration settings.
  2. If one of the tests fails, however, the generated SQL is passed to another LLM agent for debugging, along with additional information such as error messages and validation outputs, as a self-correction mechanism. The debugger iteratively modifies the generated SQL query to resolve the previous errors, up to the number of retries configured in the system. This approach helps identify hallucinations and other issues with the SQL response, and allows access-control mechanisms to be incorporated before the generated SQL query is run against the database.
  3. If the debugging step returns a correct and working SQL query, the result is displayed or executed (again, depending on the configuration settings).
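The validate-then-debug loop above can be sketched with SQLite's EXPLAIN serving as the dry run. The debugging "agent" is reduced to a rule-based fix for a hallucinated table name; in the real system it is an LLM call, and the table names and data here are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.execute("INSERT INTO sales VALUES ('EMEA', 1200.0), ('APAC', 950.0)")

def dry_run(sql):
    """Return None on success, else the database error message."""
    try:
        conn.execute("EXPLAIN " + sql)  # compiles without executing
        return None
    except sqlite3.Error as exc:
        return str(exc)

def debug_agent(sql, error):
    # Toy self-correction: repair a hallucinated table name.
    # The real system passes sql + error to an LLM debugging agent.
    return sql.replace("sale_data", "sales")

def validate_and_run(sql, max_retries=3):
    for _ in range(max_retries):
        error = dry_run(sql)
        if error is None:
            return sql, conn.execute(sql).fetchall()
        sql = debug_agent(sql, error)
    raise RuntimeError("could not repair SQL after retries")

# A generated query with a hallucinated table name gets repaired:
fixed_sql, rows = validate_and_run(
    "SELECT region, SUM(amount) FROM sale_data GROUP BY region")
```

The retry cap mirrors the configurable number of debugging attempts; exhausting it surfaces the failure instead of silently executing bad SQL.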

Human centered AI for improved outcomes and accuracy

As with most AI systems, we recommend leveraging the solution to augment the human user instead of fully automating the pipeline. This collaborative approach builds trust in the system and ensures optimal results through detailed evaluations and direct user feedback. Evaluating results against established ground truths is vital during development. In production, a human-in-the-loop approach allows approved AI-generated SQL to be added to the ‘Known Good SQL’ query cache, continuously improving the system.
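The feedback path described above can be sketched as a simple approval gate; the review function and cache structure are illustrative, not part of the accelerator:

```python
# Human-in-the-loop promotion: only reviewer-approved SQL enters the
# Known Good SQL cache, so future questions can bypass generation.
known_good_sql = {}  # question -> approved SQL

def review(question, generated_sql, approved):
    """Record reviewer feedback; approved SQL is promoted to the cache."""
    if approved:
        known_good_sql[question] = generated_sql
    return approved

review("total revenue by region",
       "SELECT region, SUM(amount) FROM sales GROUP BY region",
       approved=True)
```

Each approval grows the cache, so the system answers more questions deterministically over time.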

This solution accelerator helps enterprises:

  • Address Diverse User Needs: Catering to internal and external business users with structured data enhances the overall experience and guides solution architecture.
  • Unleash Insights: Go beyond data extraction to interact with data, unlocking deeper analysis.
  • Build Trust and Accuracy: Proposed validation and self-debugging process in the solution ensures reliability.

Customer adoption

We have seen improved efficiencies with customer adoption; most recently, Copel integrated the application with SAP data. Here is the case study.

Try it out

  • Check out this repository for this solution accelerator with examples
  • Check out this quick walk through of the UI and the solution

Thanks to the entire team in making this possible!

Technical Solutions Manager: Kanchana Patlolla

Team Lead: Meltem Subasioglu

Solutions Architects: Rajesh Thallam, Moksh Atukuri, Sri Lingamneni, Guruprasad S, Chiranjeevi Raghavendra, Varunika Gupta

Technical Program Manager: Vivek Kalbhut

User eXperience: Khulan Davaajav, Melis Aydemir


Kanchana Patlolla is a Technical Solutions Manager at Google. She manages software productivity and data engineering solutions built on GenAI.