Data Co-pilot and Data Concierge

Juan Sequeda
data.world
May 3, 2024

A few weeks ago, I posted on LinkedIn: “When it comes to answering questions over enterprise SQL databases, I distinguish two different approaches: Co-pilot for technical users and Concierge for business users.” That post garnered a lot of discussion. The overall sentiment was agreement on the distinction, acknowledging that the co-pilot is the first step and that the concierge is, and has always been, the holy grail, and is hard to achieve.


Quick historical reminder: the field of Question Answering is, arguably, one of the inspirations for the field of computer science. It is a problem that computer scientists have been tackling for close to half a century.

Data Co-pilot

Data Co-pilot is the marketing term for Text-to-SQL. Modern Data Co-pilots take a very specific question as input and return a SQL query, possibly with a technical explanation of the query. Interacting with a co-pilot is like asking a question of a person who has limited knowledge about the data.

The target users for Data Co-pilots are technical users. When a SQL query is returned, the technical user decides whether to leave it as-is, edit it, or extend it, and then executes the query to get an answer. The technical user will usually have knowledge of the database schema (and if they don’t, they are probably just hoping for the best). This co-pilot feature may also manifest as query completion, similar to the GitHub Copilot approach.

The Data Co-pilots of today are LLM-based, using either a foundation LLM or an LLM fine-tuned for SQL. The metadata of tables and columns is passed in the prompt, so these co-pilots are usually constrained to a handful of tables.
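To make this concrete, here is a minimal sketch of the core of such a co-pilot. The schema snippet and the `ask_llm` helper (standing in for any chat-completion client) are hypothetical placeholders, not any vendor’s actual implementation:

```python
# Minimal Text-to-SQL sketch: table/column metadata is pasted into the
# prompt, which is why co-pilots are constrained to a handful of tables.
# SCHEMA and ask_llm are hypothetical placeholders.

SCHEMA = """
CREATE TABLE policy (policy_id INT, holder_id INT, premium DECIMAL);
CREATE TABLE holder (holder_id INT, name TEXT, state TEXT);
"""

def text_to_sql(question: str, ask_llm) -> str:
    prompt = (
        "Given the following SQL schema:\n"
        f"{SCHEMA}\n"
        "Write a single SQL query that answers the question below. "
        "Return only the SQL.\n"
        f"Question: {question}"
    )
    return ask_llm(prompt)  # returns a query for the user to review, not an answer
```

Note that the output is a query, not an answer: the human in the loop is still expected to inspect and run it.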

The accuracy of Data Co-pilots is low. In our benchmark research, we showed that with a zero-shot prompt on GPT-4, the accuracy of questions on low-complexity schemas (fewer than 5 tables) is between 25% and 37%. If there are more than 5 tables, the accuracy is 0%! These results are consistent with what other vendors are announcing. For example, the Snowflake team shared their co-pilot approach and results: “We have combined a fine-tuned Snowflake SQL generation model and Mistral’s new state of the art Mistral Large model to create a new text-to-SQL engine for Snowflake Copilot.” The results: “Delightfully, this combined architecture is greater than the sum of its parts, achieving 46.4% on Execution Accuracy.”

Also, it is important to note that many of the accuracy numbers reported by Text-to-SQL vendors rely on academic benchmarks like Spider. As valuable as these are, it is paramount to acknowledge that they are disconnected from enterprise reality (low-complexity questions on low-complexity schemas). This is one of the motivators for our benchmark, which leverages OMG’s Property & Casualty Insurance model and divides the questions into four quadrants based on low vs. high complexity of the questions and the schema.

Even though the accuracy of Data Co-pilots isn’t high, it’s passable because the user is technical: they understand the schema and can edit the query. But this doesn’t bring data to more people; it just speeds things up for the people who would have gotten the answer anyway.

Data Co-pilots focus on syntax: making sure that the returned SQL query can execute. For example, some approaches have a self-correction loop to fix queries that are syntactically incorrect. But how does a user know whether a syntactically correct SQL query is also semantically correct? Does the query represent the intention of the user and return the correct answer, even though it executes successfully? A technical user who has no business context may have no way of verifying that.
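As an illustration, a syntax-only self-correction loop can be as simple as the sketch below. It uses sqlglot to parse the query (any SQL parser would do) and the same hypothetical `ask_llm` helper; note that it can only catch syntax errors, never a mismatch with the user’s intent:

```python
# Syntax-only self-correction loop (a sketch, not a specific vendor's code).
import sqlglot
from sqlglot.errors import ParseError

def self_correct(sql: str, ask_llm, max_attempts: int = 3) -> str:
    for _ in range(max_attempts):
        try:
            sqlglot.parse_one(sql)  # parses -> syntactically valid, accept it
            return sql
        except ParseError as err:   # feed the parser error back to the LLM
            sql = ask_llm(f"Fix this SQL query.\nError: {err}\nQuery: {sql}")
    return sql  # best effort: may now execute, but semantic correctness is unknown
```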

Explainability for Data Co-pilots is constrained to providing a textual description of the query, so the explanation is technical: we need table A and use column B to join it with column C, which is part of table B. Frankly, these “explanations” describe what is happening, which is already evident to any technical SQL user, and thus, in my opinion, not very useful.

Data Co-pilots are low-hanging fruit. LLMs have commoditized this feature; a simple LLM wrapper can do a decent job (see the prompt for Text-to-SQL in our benchmark paper). At data.world, we added this Text-to-SQL functionality a year ago! All database vendors are now adding these capabilities (actually, who isn’t?). LangChain already incorporates Text-to-SQL. There is a new startup doing Data Co-pilots almost every week. Tech companies are creating their own co-pilots. This proliferation is an indicator of the low-hanging fruit. The question remains: how valuable is it? How much productivity uplift will be achieved, and is it worth the investment? Time will tell, and I’m sure it will tell pretty soon.

Data Concierge

Data Concierge is the holy grail that executives have always wanted: the user asks a question and the output is an answer they can trust. Interacting with a concierge is like asking a question of a person who has extensive knowledge about the data and the business context overall.

The target users for a Data Concierge are users in lines of business (executives, finance, sales, marketing) and consumers (customers, lawyers, patients, etc.). These users do not have knowledge of what is going on under the hood, and they don’t need to. Users can ask follow-up questions. Answers can also be “I don’t know,” “I can’t answer the question because I don’t know about X,” “there are five different definitions of customer, which are the following …,” or “you should talk to Alice.”

A Data Concierge system consists of intelligent agents that leverage a knowledge graph and can autonomously follow a plan to answer the question. These agents are built on sophisticated state machines, with non-deterministic approaches (e.g., LLMs) and deterministic approaches (e.g., formal algorithms) used across the various states.

For example, a question-answering intelligent agent may go through the following states (a condensed sketch in code follows the list and note below):

  • Determine the concepts in the question
  • Retrieve the context from a knowledge graph. What are the concepts, attributes, and relationships in the ontology (i.e., the semantic layer) that correspond to the question? Are there mappings from the source database to the ontology? Is something missing?
  • With the given context, does the original question need to be rewritten?
  • What type of question is it? Is it subjective or fact-based? Is it about data or metadata? Has this question been asked before? Each of these decisions can take you down a different route.
  • Generate a SPARQL semantic query based on the question.
  • Is the generated SPARQL semantic query syntactically correct?
  • Does the generated SPARQL semantic query match the ontology, meaning it is semantically correct?
  • Use the mappings to convert the SPARQL query to a SQL query
  • Execute the SQL query on the database

Note: This is a subset of what happens inside the Question Answering agent in the data.world AI Context Engine.
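To ground this, here is a condensed, hypothetical sketch of such a state machine in Python. Every helper on the `kg` object (lookup, validation, SPARQL-to-SQL translation) is a placeholder for the corresponding capability described above, not the AI Context Engine’s actual API:

```python
# Hypothetical state machine for a question-answering agent. LLM calls
# (ask_llm) are the non-deterministic states; kg lookups, validation, and
# mapping-driven translation are the deterministic ones.
from enum import Enum, auto

class State(Enum):
    EXTRACT_CONCEPTS = auto()
    RETRIEVE_CONTEXT = auto()
    GENERATE_SPARQL = auto()
    VALIDATE_SPARQL = auto()
    TRANSLATE_TO_SQL = auto()
    EXECUTE = auto()
    DONE = auto()

def answer(question, kg, ask_llm, run_sql):
    state, ctx = State.EXTRACT_CONCEPTS, {}
    while state is not State.DONE:  # a real agent would bound retries
        if state is State.EXTRACT_CONCEPTS:
            ctx["concepts"] = ask_llm(f"List the concepts in: {question}")
            state = State.RETRIEVE_CONTEXT
        elif state is State.RETRIEVE_CONTEXT:
            # deterministic lookup in the ontology (semantic layer) + mappings
            ctx["ontology"] = kg.lookup(ctx["concepts"])
            state = State.GENERATE_SPARQL
        elif state is State.GENERATE_SPARQL:
            ctx["sparql"] = ask_llm(
                f"Write SPARQL for '{question}' using: {ctx['ontology']}")
            state = State.VALIDATE_SPARQL
        elif state is State.VALIDATE_SPARQL:
            # syntactic and semantic checks against the ontology
            state = (State.TRANSLATE_TO_SQL if kg.validates(ctx["sparql"])
                     else State.GENERATE_SPARQL)
        elif state is State.TRANSLATE_TO_SQL:
            # deterministic: source-to-ontology mappings drive the rewrite
            ctx["sql"] = kg.sparql_to_sql(ctx["sparql"])
            state = State.EXECUTE
        elif state is State.EXECUTE:
            ctx["answer"] = run_sql(ctx["sql"])
            state = State.DONE
    return ctx["answer"]
```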

Different intelligent agents can interact with each other; thus, Data Concierge systems are actually multi-agent systems. For example, another type of agent is tasked with Knowledge Engineering. Given existing context in a data catalog, it can suggest the concepts and attributes that should be added to the ontology and what the mappings should look like. This agent can be invoked by the question-answering agent when it determines that context is missing from the knowledge graph.
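A hypothetical sketch of that hand-off, with all agent objects and their methods as illustrative placeholders:

```python
# Hypothetical agent hand-off: the QA agent detects missing context and
# delegates to a knowledge-engineering (KE) agent instead of guessing.
def answer_or_delegate(question, qa_agent, ke_agent, knowledge_graph):
    context = qa_agent.retrieve_context(question, knowledge_graph)
    if context.missing:                                   # gap in the KG
        proposal = ke_agent.suggest_additions(context.missing)
        knowledge_graph.submit_for_stewardship(proposal)  # governed: humans approve
        return "I don't know yet; the missing context has been flagged."
    return qa_agent.answer(question, context)
```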

The point is that Data Concierges are neuro-symbolic AI systems that leverage both LLMs and GOFAI (Good Old-Fashioned AI).

A Data Concierge leverages a knowledge graph, which is how it provides higher accuracy. Our benchmark research provided evidence that LLM accuracy when answering questions on enterprise SQL databases increased 3x with knowledge graphs. That is why investing in a knowledge graph and a data catalog, which provide the context of your organization, is foundational for a Data Concierge.

The ontology defining the semantics of the data in a knowledge graph is critical to ensure that queries are not only syntactically correct but, more importantly, semantically correct. If an LLM generates a semantically incorrect query, one of the states of the agent can check the query against the ontology and try to repair it. This process further improves accuracy.
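For instance, a lightweight semantic check might verify that every IRI used in the generated SPARQL is actually declared in the ontology, using rdflib. The ontology file name is hypothetical, and a production check would also validate domains and ranges:

```python
# Sketch of a lightweight semantic check with rdflib: does every IRI used
# in the generated SPARQL exist in the ontology? (Hypothetical file name;
# a real check would go further, e.g., domain/range validation.)
import re
from rdflib import Graph
from rdflib.plugins.sparql import prepareQuery

ontology = Graph().parse("insurance_ontology.ttl")  # hypothetical ontology
declared = {str(s) for s in ontology.subjects()}    # terms the ontology declares

def semantic_check(sparql: str) -> set:
    prepareQuery(sparql)                             # raises on bad syntax
    used = set(re.findall(r"<([^>]+)>", sparql))     # IRIs in the query
    return used - declared                           # unknown terms
```

Any unknown terms returned here can be fed back to the LLM as repair hints, which is exactly the kind of repair loop described above.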

Explainability is achieved because the knowledge graph is governed. Everything in the knowledge graph exists for a reason and should have stewardship approval (which provides further explanation). If an agent finds something in the knowledge graph, it knows it exists. If it doesn’t find it in the knowledge graph, it knows it doesn’t exist. Therefore, a Data Concierge knows what it knows and, more importantly, knows what it doesn’t know!

This combination of accuracy, explainability and governance is what provides trust to a user of a Data Concierge.

A Data Concierge provides value to users in the lines of business and to end consumers, a much wider audience than just technical users. It’s not just about improving productivity. It’s about enabling what has previously been considered impossible. It’s about changing the entire game!

Today, when business users need answers to questions they can’t self-serve in a dashboard or report, they have to ask the data team, which is swamped with requests and takes too long to provide an answer, which might then be too late. A Data Concierge enables answering what I call “just-in-time questions” (inspired by just-in-time inventory management) that would otherwise need to be answered by data engineers, minimizing their backlog of requests while giving business users trusted answers instantly.

We acknowledge that an investment in semantics and knowledge graphs must be made, but it can start small. In our experience across a series of hackathons over the past six months, we have been able to set up a Data Concierge in a couple of days. Start small by focusing on a few business questions that need to be answered.

Conclusion

The market today is focused on Data Co-pilots. That makes sense because it’s low-hanging fruit: the problem is constrained (fewer tables, accuracy that doesn’t have to be high). Tech companies are addressing this first because it is in their comfort zone: solving technical problems for technical users. The focus is on productivity.

Data Concierge changes the game completely. It gives every business user a trusted advisor who provides instant, trusted answers. It makes the impossible possible. As my colleague Patrick Frasier mentioned to me: “sometimes the best questions come up in a moment of curiosity and deep creativity, and if you don’t get an answer in that moment, the magic can be lost.”

Let’s not lose the magic. We now have the tools to make the impossible, possible!
