Using the TPC-H sample dataset on Vanna (demo-sales)

Ashish Singal
Published in Vanna AI
2 min read · Jul 9, 2023

Vanna makes extensive use of the TPC-H sample dataset. This dataset resembles a small business, with customers, orders, and suppliers, and it comes by default with each installation of Snowflake.

It is regularly used as a performance testing dataset, but we’ll use it as a test dataset for generating SQL using Vanna.

Before we dive in, let’s think about some common questions we may want to ask:

  1. What is the number of customers in each region?
  2. What are the names of the top 10 customers in terms of total sales?

What’s in the TPC data (tables & columns)

Let’s take a quick peek at what’s inside this dataset.

Here’s the structure provided by Snowflake. Customers create orders, which have line items, which are parts that are supplied by suppliers.
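To make the structure concrete, here is a minimal sketch of how the first question above (customers per region) maps onto the tables. It assumes the standard TPC-H table and column names (customer, nation, region with their c_, n_, r_ prefixes), uses an in-memory SQLite database as a stand-in for Snowflake, and fills it with a handful of made-up rows purely for illustration.

```python
import sqlite3

# Toy stand-in for three TPC-H tables. Only the columns needed for this
# query are included, and the rows are illustrative, not real sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region   (r_regionkey INTEGER PRIMARY KEY, r_name TEXT);
CREATE TABLE nation   (n_nationkey INTEGER PRIMARY KEY, n_name TEXT, n_regionkey INTEGER);
CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_name TEXT, c_nationkey INTEGER);

INSERT INTO region VALUES (0, 'AFRICA'), (1, 'AMERICA');
INSERT INTO nation VALUES (0, 'ALGERIA', 0), (1, 'ARGENTINA', 1), (2, 'BRAZIL', 1);
INSERT INTO customer VALUES
    (1, 'Customer#000000001', 0),
    (2, 'Customer#000000002', 1),
    (3, 'Customer#000000003', 2);
""")

# Customers link to nations, and nations link to regions, so counting
# customers per region takes two joins.
CUSTOMERS_PER_REGION = """
SELECT r.r_name AS region, COUNT(*) AS customer_count
FROM customer c
JOIN nation n ON c.c_nationkey = n.n_nationkey
JOIN region r ON n.n_regionkey = r.r_regionkey
GROUP BY r.r_name
ORDER BY r.r_name
"""

rows = conn.execute(CUSTOMERS_PER_REGION).fetchall()
print(rows)
```

The same two-join shape should carry over to the Snowflake copy of the data, though the database and schema qualifiers there will differ from this toy setup.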

Training on the dataset

We train Vanna on the data using this JSON containing question / SQL pairs. Test SQL queries for this dataset are available publicly. Here’s an example of some question / SQL pairs:
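As a rough illustration of what such a file could look like, the sketch below builds two question / SQL pairs from the questions listed earlier and serializes them to JSON. The field names "question" and "sql", the helper validate_pairs, and the SQL itself are assumptions made for this example; the actual training file may use a different layout. The column names follow the standard TPC-H schema.

```python
import json

# Hypothetical layout: the keys "question" and "sql" are assumptions,
# not a documented Vanna file format.
training_pairs = [
    {
        "question": "What is the number of customers in each region?",
        "sql": (
            "SELECT r.r_name, COUNT(*) AS customer_count "
            "FROM customer c "
            "JOIN nation n ON c.c_nationkey = n.n_nationkey "
            "JOIN region r ON n.n_regionkey = r.r_regionkey "
            "GROUP BY r.r_name"
        ),
    },
    {
        "question": "What are the names of the top 10 customers in terms of total sales?",
        "sql": (
            "SELECT c.c_name, SUM(o.o_totalprice) AS total_sales "
            "FROM customer c "
            "JOIN orders o ON c.c_custkey = o.o_custkey "
            "GROUP BY c.c_custkey, c.c_name "
            "ORDER BY total_sales DESC "
            "LIMIT 10"
        ),
    },
]


def validate_pairs(pairs):
    """Return the pairs unchanged; raise ValueError if any entry is malformed."""
    for i, pair in enumerate(pairs):
        for key in ("question", "sql"):
            value = pair.get(key)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"pair {i} is missing a non-empty '{key}'")
    return pairs


# Serialize the validated pairs to a JSON string.
training_json = json.dumps(validate_pairs(training_pairs), indent=2)
print(training_json)
```

Validating before serializing catches empty or missing fields early, which is cheaper than discovering a bad pair after training.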

To see how to implement this in a notebook, please see Getting started with Vanna and the vn-full.ipynb notebook.

Asking questions

Now that we’ve trained on this dataset, here are some questions to ask:
