Just Ask: Data Insights for Everyone

Ioannis Zempekakis
Prosus AI Tech Blog
10 min read · Sep 6, 2024

There’s No Such Thing As a Dumb Question: Data Insights in Natural Language.

Authors: Donne Stevenson, Floris Fok, Sean Kenny

Every day, people chase analysts for business insights and answers to their questions:

“What’s the 12-month trend for active partners with at least eight monthly orders?”

“How do late-night food orders compare between different cities?”

“How many couriers have received a 5-star review in the last 30 days in the city of London, broken down per vehicle type?”

Companies in sectors like food delivery, retail, and payments generate vast amounts of data. Being able to give employees accurate, contextualized, and prompt answers to questions about this data can boost their performance, drive value and build competitive advantage.

Ideally, employees have access to analysts who can turn business questions into data-driven insights. However, good analysts are often costly and scarce, and they need to focus on the most critical tasks. This makes it harder for many organizations to access insights and data for decision-making.

To understand how GenAI frameworks and tools can help address this problem, we teamed up with operational analysts at food delivery platforms iFood and Glovo, e-commerce marketplace OLX, and other companies in the Prosus portfolio.

Introducing the Toqan Data Analyst

The result of our ongoing work is the Toqan Data Analyst, an AI-powered chatbot integration for databases, available to Toqan users in the business messaging app, Slack. The tool makes data and insights accessible to anyone who needs them, when they need them, in natural language.

People can ask questions in a simple chat interface. They don’t need any technical skills, nor do they need to be deeply familiar with the databases, since the tool generates SQL queries from their text. Our text-to-SQL solution has hundreds of active users and answered thousands of questions during the roll-out.

In this blog, we dive into how we got to this solution. We explain how we had to rethink the text-to-SQL problem space, how that led us to expand our approach beyond existing text-to-SQL solutions, and how we overcame significant challenges along the way.

LLMs are only part of the solution

First, we looked at the use of large language models (LLMs) for text-to-SQL. Existing solutions can translate questions into computer language, reason through uncertainty at scale, and clarify implicit assumptions.

However, benchmarking available LLMs against SQL datasets suggests this solution would be insufficient. Early frameworks for assessing model performance on text-to-SQL focused on a model’s ability to generate SQL queries that address user questions. Top results from WikiSQL’s leaderboards approached 93% accuracy on the test set for generating SQL queries from natural-language questions. But these early results came from various pre-transformer supervised learning models and focused on highly simplified use cases that lacked joins between tables in datasets, or other real-world complexities.

LLMs proved to be broadly capable of achieving similar levels of accuracy on text-to-SQL tasks. Benchmarks for LLM capabilities, such as Toqan’s ProLLM Leaderboard, peg the latest models at a reasonable acceptance level, with top performers around 85–90% acceptance of SQL responses based on Stack Overflow data.

On paper, AI models should be able to formulate SQL queries with 80%+ accuracy

However, most systems cannot adequately handle real data and thus inevitably fall short of expectations. Challenges like identifying how to join tables, spotting columns despite poor metadata, and finding every way of spelling a name quickly made these benchmarks largely useless. Our out-of-the-box use of LLMs for text-to-SQL in use cases with more than two tables started at less than 25% accuracy.

There are two fundamental problems. First, even 95% accuracy is not enough for many business users querying large datasets of transactional, supply chain, or financial data. Second, the ability to close that final gap in accuracy does not appear to depend on a model’s raw performance so much as on the context the system has about the person asking the question, the business, and the data structure being examined.

Context is critical

We like to think that most databases are straightforward; they are relational databases with fixed structures. However, even with the latest SQL knowledge, newly hired data analysts will still struggle initially. This struggle points to a hidden truth: context is critical to translate business questions into effective data queries. In other words, additional context is needed to effectively use business data. People usually gather this context while talking to colleagues, attending meetings, or through trial and error.

When it comes to rule-based or early data-analysis AI solutions, many real-world applications fall short not only because of missing context but also because of poor data quality, lack of documentation, user behavior, and more. The reality is that expecting all business users to follow best practices and work with perfect data, free of implicit or missing information, will never result in a viable solution.

Building a text-to-SQL solution for the real world

Instead of only focusing on generating the right query, we needed to expand our requirements. Working with analyst teams with access to Toqan, we started to experiment with system solutions.

Things to account for include:

● Implicit assumptions and transparency

● Guiding users and clarifying requests

● Handling failed answers and knowing when not to answer; minimizing false positives

If we consider requirements like these, the challenges with the previous approach become clear. Instead of focusing on writing queries and getting answers, we needed to build a system that has the context and even the ability to inquire about implicit context.

Text-to-SQL is not only an engineering problem; it’s also a communication problem.

From a technology perspective, the next advance we needed was flexible, sequential planning and tool use provided by agent frameworks. While building our AI assistant Toqan, we experienced the incredible reasoning capabilities of agents and were sure that it was a matter of time and effort before we could make text-to-SQL something real.

What makes agents perfect for data analysis

An agent can follow instructions in a chat-like manner, reason, and use external services to aid its understanding. This makes it the best problem-solver to date. The traditional chatbot, which answers simple questions, has limited reasoning capability. It also needs all the information required to answer a question up front, since it only works from memory.

An agent doesn’t have these limitations, however. If we want to make an agent that finds the best deals online, we need to provide it with a way to navigate the internet, and it will do a very decent job. Agents can explore external data and have conversations with themselves like: “I will first do A, then I will solve B to bring you answer X.” These problem-solving capabilities make an agent perfect for data exploration and data-related analysis. When given access to a database and enough context about the data in it, an agent will generate code to analyze the data and answer any questions.
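
To make this concrete, here is a minimal sketch of such an agent loop in Python. It assumes a generic chat client (llm) that supports tool calling and a read-only SQLite connection; the names and the llm.chat signature are illustrative assumptions, not the Toqan implementation.

```python
import json
import sqlite3

def run_sql(query: str, db: sqlite3.Connection) -> str:
    """Tool the agent can call: run a query and return a capped set of rows."""
    rows = db.execute(query).fetchmany(50)  # cap what flows back into the prompt
    return json.dumps(rows, default=str)

def agent_answer(question: str, llm, db: sqlite3.Connection, max_steps: int = 5) -> str:
    """Let the model plan, query, observe, and repeat until it gives a final answer."""
    messages = [
        {"role": "system", "content": "You are a data analyst. Use the run_sql tool "
                                      "to inspect the schema and answer the question."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        reply = llm.chat(messages, tools=["run_sql"])  # hypothetical client call
        if reply.tool_call is None:                    # no tool requested: final answer
            return reply.content
        result = run_sql(reply.tool_call.arguments["query"], db)
        messages.append({"role": "tool", "content": result})
    return "Step budget exhausted without a final answer."
```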

Helpfulness is the enemy of accuracy

Building an agent that can turn a question into an effective SQL query comes with some challenges though.

Let’s look at the first issue: being too helpful. Large language models (LLMs), the key components of agents, are smart and can develop logical processes, which means they may find solutions that are not straightforward for the person asking. This is one of the best examples we found:

User: What is the address of user X?

Bot: Here are the driver’s coordinates during delivery.

The LLM did not have access to addresses. Still, it was smart enough to work out that the driver’s coordinates during delivery must be near the address of user X. Although the system provides an answer, the best answer would be: “Sorry, I do not have access to addresses.”

To help mitigate this problem, we equip the agent with two tools:

  1. Ask clarifying questions
  2. State assumptions

The first tool tries to remove uncertainty at every step. When information is added to the system, it evaluates whether the request is ambiguous and, if so, asks clarifying questions. This makes the question more specific and changes users’ behavior in two ways: it ensures expectations are better aligned, and it trains people to ask better questions.

The second tool is designed to ensure that if the agent takes shortcuts or uses creative solutions, the user can spot them more easily. We call this “perceived accuracy”: given the stated assumption X, the user can judge whether answer Y is actually correct. An example we use frequently for this is:

User: Can you provide the average revenue per day for X in the last month?

Bot: That was …
Assumptions: “A month is defined as running from the same day last month until today.”

Since we now know how the agent defines a month, we can’t argue that the answer is wrong; we either correct the assumption or accept it. As with clarifying questions, this teaches the agent to be more explicit in its communication and transparent to users.
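
For illustration, the two tools could be exposed to the agent as function schemas along these lines; the names and fields below are our own assumptions rather than the exact definitions used in Toqan.

```python
# Illustrative tool schemas (OpenAI-style function definitions); names and
# fields are assumptions, not the production Toqan tool definitions.
CLARIFY_TOOL = {
    "name": "ask_clarifying_question",
    "description": "Ask the user a question whenever the request is ambiguous "
                   "(unclear time range, metric, or entity). Do not guess.",
    "parameters": {
        "type": "object",
        "properties": {"question": {"type": "string"}},
        "required": ["question"],
    },
}

STATE_ASSUMPTIONS_TOOL = {
    "name": "state_assumptions",
    "description": "List every assumption made while answering (e.g. how 'month' "
                   "is defined, filters applied) so the user can verify or correct them.",
    "parameters": {
        "type": "object",
        "properties": {"assumptions": {"type": "array", "items": {"type": "string"}}},
        "required": ["assumptions"],
    },
}
```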

Hidden context

The second issue we ran into: people use nuances and context that even the smartest multimodal models struggle with. This is a good example of what can happen when a system moves from concept to real-world testing.

The missing information — or the hidden context — falls into four groups:

  1. Global knowledge: The general structure of the organization, details about the business and its partners, and what kind of system the database represents.
  2. Product knowledge: Each table is designed to serve some KPI or product insight; this is valuable information that can tip the balance between a right and a wrong answer at the right moment.
  3. Table and column knowledge: Are tables slow- or fast-changing, partitioned, or in a special date format? With this information, the model doesn’t have to reinvent the wheel.
  4. Request language knowledge: Team jargon, acronyms, or other language that’s unique to a specific team.

Gathering this information comes with the engineering challenge of knowing when and what is useful. Some solutions focus on embedding retrieval (ER), but we did not see it as a good fit, since ER focuses on matching subjects and topics. We want the agent to see the rules without knowing in advance whether they may be needed. We do this by implementing three mechanisms that improve the request:

1. Business Rules

These rules cover the global knowledge and a large part of the product knowledge. This is the simplest but also the most powerful concept. We want the model to always be exposed to this knowledge. To do so, we include this in the system prompt together with other guidelines and behavioral rules.
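
A minimal sketch of this idea is below; the rule text and names are invented for illustration, and the point is simply that the rules are concatenated into every system prompt so the agent is always exposed to them.

```python
# Invented example rules; real ones would come from the business teams.
BUSINESS_RULES = """
- Revenue figures are net of refunds and reported in EUR.
- Order timestamps are stored in UTC; convert to local time for city-level reporting.
- The database is a daily-refreshed analytical warehouse, not the live transactional system.
"""

def build_system_prompt(behaviour_guidelines: str) -> str:
    # The rules ride along with every request, regardless of the question asked.
    return (f"{behaviour_guidelines}\n\n"
            f"Business rules you must always apply:\n{BUSINESS_RULES}")
```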

2. Glossary

In its purest definition, a glossary is a list of terms in a special subject, field, or area of usage, with accompanying definitions. “Terms” may also include acronyms and short phrases. The goal is to add these definitions to the request whenever we find that they are used in it.
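
A simple sketch of that injection step, assuming plain substring matching and made-up terms, might look like this:

```python
# Made-up glossary entries for illustration; real terms come from each team.
GLOSSARY = {
    "AOV": "Average order value: gross merchandise value divided by number of orders.",
    "churned courier": "A courier with no completed delivery in the last 28 days.",
}

def enrich_request(user_request: str) -> str:
    # Append a definition only when its term actually appears in the request,
    # so unrelated jargon never pollutes the prompt.
    hits = [f"- {term}: {definition}"
            for term, definition in GLOSSARY.items()
            if term.lower() in user_request.lower()]
    if not hits:
        return user_request
    return user_request + "\n\nDefinitions that apply to this request:\n" + "\n".join(hits)
```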

3. Table selection

Storing a lot of information in the columns and table descriptions is useful. However, if we add all this information to the prompt, we have very little control over changing the behavior when it does not pick up the correct details for a given request. To tackle this, we need a separate system that can judge which information is valuable for the request. This second system first assesses which tables might be useful. Secondly, it collects the columns that are needed from these tables. Thirdly, it assesses whether, given the complete overview, it can drop parts of the result that are duplicated or redundant. Now we can add very specific information and even instructions into descriptions without these influencing requests where they are not needed.
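
Sketched in Python, and assuming a generic llm.complete text-completion call rather than any specific library, the selection pass could look roughly like this:

```python
# Rough sketch of the stepwise table/column selection; `llm.complete` is an
# assumed generic text-completion call, not a specific library API.
def select_context(question: str, catalog: dict[str, dict[str, str]], llm) -> str:
    """catalog maps table name -> {column name -> description}."""
    # Step 1: judge which tables might be useful for this request.
    table_summaries = "\n".join(f"{t}: {', '.join(cols)}" for t, cols in catalog.items())
    tables = llm.complete(
        f"Question: {question}\nTables:\n{table_summaries}\n"
        "Return only the table names that could help answer the question, one per line."
    ).splitlines()

    # Step 2: for the chosen tables, collect the columns that are needed.
    selected = {}
    for table in tables:
        if table not in catalog:
            continue
        cols = llm.complete(
            f"Question: {question}\nColumns of {table}: {catalog[table]}\n"
            "Return only the columns needed, one per line, dropping redundant ones."
        ).splitlines()
        selected[table] = {c: catalog[table][c] for c in cols if c in catalog[table]}

    # Step 3 (pruning duplicates across tables) would run one more pass over `selected`.
    return "\n".join(f"{t}.{c}: {d}"
                     for t, columns in selected.items()
                     for c, d in columns.items())
```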

By combining these three systems we can control the information going into the text-to-SQL agent.

Testing our SQL agent

Lastly, some words on testing (more on testing here). Testing has many advantages, the two most important being:

● Confidence to make changes

● Direct feedback from your changes (fast iteration cycle)

We saw a real challenge in creating a test set that represents the production use cases. In this blog, we highlight how we achieved automated testing. Each test case needs a question and a ground-truth query.

We start by running the “ground truth” query to fetch the most recent data, ensuring our agent’s answers are based on the latest information. Next, we ask the agent the same question multiple times in a slightly different way. For example, “What is the capital of France?” and “Which city is the capital of France?”. This helps us evaluate the agent’s robustness.
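
As a rough sketch, a single test case could be run like this, where ask_agent is a hypothetical callable wrapping the agent and the ground-truth SQL runs against a database connection:

```python
import sqlite3

def run_query(sql: str, db: sqlite3.Connection) -> list:
    """Refresh the expected answer by running the ground-truth query on live data."""
    return db.execute(sql).fetchall()

def run_test_case(question: str, paraphrases: list[str], ground_truth_sql: str,
                  db: sqlite3.Connection, ask_agent):
    """ask_agent is a hypothetical callable(question) -> answer string."""
    expected = run_query(ground_truth_sql, db)
    # Ask the same thing several ways to probe robustness.
    answers = [ask_agent(q) for q in [question, *paraphrases]]
    return expected, answers
```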

We use a prompted model to check if the answers are similar, focusing on the core response and ignoring minor differences. For each question, we measure:

Accuracy: How close the agent’s response is to the expected answer.
Consistency: How uniform the responses are across different question formulations.

Based on these metrics, we identify areas for improvement. High accuracy but low consistency indicates correct answers but with variability. Low scores in both suggest the need for further training. These insights help refine the agent’s training data and algorithms.
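
A sketch of how the two metrics could be computed, assuming a judge callable (the prompted model) that returns True when two answers convey the same core result:

```python
from itertools import combinations

def score_case(expected, answers: list[str], judge) -> tuple[float, float]:
    # Accuracy: share of phrasings whose answer matches the ground truth.
    accuracy = sum(judge(expected, a) for a in answers) / len(answers)

    # Consistency: share of answer pairs that agree with each other,
    # regardless of whether they are correct.
    pairs = list(combinations(answers, 2))
    consistency = sum(judge(a, b) for a, b in pairs) / len(pairs) if pairs else 1.0
    return accuracy, consistency
```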

Finally, with automated testing in place, we focus on improving the framework.

Next steps

We’re working on making our SQL agent even better. Some of the things we’re looking into:

Self-learning:

We want to use successfully generated or upvoted SQL as a source for generating snippets that resemble (parts of) requests.

Reformulating the request before execution:

The agent will reformulate your question into one that is unambiguous. By putting more effort into the request before execution, we limit the number of queries to the database.

Validation agent:

By asking the agent “Are you sure?”, we can often catch incorrect queries. We are looking into automating this by adding a supervisory validation agent.

This is just the beginning. Stay tuned for updates!

PS: If you would like to know more, please get in touch with our team via LinkedIn or through Toqan.ai
