How Vanna works, how to train it, & data security

Ashish Singal
Vanna AI
Published in
3 min readJul 9, 2023

Vanna is a Python-based AI SQL co-pilot. Our initial users are data-savvy data analysts, data scientists, engineers, and similar people that use Vanna to automate writing complex SQL.

Vanna significantly outperforms using ChatGPT or similar tools in writing SQL because we train a specific Vanna model on your data. But Vanna’s effectiveness depends on how well you train it. We’ll dive into how Vanna works and how to train it effectively so it can best answer your questions.

Vanna’s architecture

Diagram of how Vanna works

When you ask a question to Vanna, here’s what happens —

  1. We first search for similar questions that have been asked in the past in our training set.
  2. If we find similar questions, we pass the validated SQL from those questions to our model.
  3. Otherwise, we pass in the DDL, documentation, or bootstrapped queries.
  4. Then, the Vanna model unique to your schema generates the SQL. This is where the “magic” happens.
  5. We then run the SQL to validate it. If it’s validated, it goes into the training data.
  6. Otherwise, an analyst can correct the SQL and put it in the training data.

Over time, Vanna continues to improve its understanding of your schema and gets more and more questions right.

How Vanna generates SQL using AI

Vanna uses a form of generative AI called LLMs (large language models). In a nutshell, these models are trained on huge corpuses of data (including a bunch of SQL queries available online), and work by predicting the most likely next word or “token” in a response to a prompt. Vanna optimizes both the prompt (using embedding searches via vector databases) as well as fine tuning the LLM model to generate better SQL.

There are a bunch of different LLMs that Vanna is using and experimenting with to get the most accurate results. OpenAI’s GPT models often outperform, but there are cases where Google’s Bard, Meta’s LLAMA and Falcon models are best.

(If you are up for more technical discussions about LLMs, we are too — join our Slack and start a discussion, or message us on LinkedIn).

Training Vanna effectively

In order to train Vanna effectively, we need to give Vanna question and SQL pairs that are similar to what users will eventually ask. Think of Vanna like a super smart, but super fresh, junior analyst. Vanna will remember what you teach her (teach her once and she’ll never forget), but you still have to teach her.

As an example, let’s take the TPC-H dataset and ask Vanna —

What are the top customers by sales in the United States?

Vanna will go search through previously asked questions and use them as references.

Therefore, the more examples (eg question / SQL pairs) you give Vanna that are similar to questions users will ask, the better the performance of Vanna will be. And as users ask questions, keep feeding the pairs back into Vanna so your model improves over time.

Data & security

We understand the strict requirements for enterprise data security and privacy and have built Vanna specifically to respect these boundaries.

We take efforts to do the following —

  1. Minimization. Offer as much functionality as possible by sending as little data as possible outside your company.
  2. Transparency. Where functionality would be enhanced by sending data to Vanna, we are transparent in terms of what gets sent and how we use it.
  3. Deletion. We make it easy to see what data Vanna retains and how to safely and completely remove it.

In general, most of Vanna’s functionality does not need your specific data, but does need your schema (eg SQL queries and structure — table names, column names).

--

--