AI-generated SQL — Using Cybersyn data & Vanna AI to analyze US population growth — May 2023

Ashish Singal
Vanna AI
4 min read · May 29, 2023

For release on Mon, May 29, 2023. Happy Memorial Day! Today, we’re going to see how to use AI to generate SQL from natural language, with census data as our example.

Our series of questions

  1. Which state has the highest population in the United States in 2018?
  2. What percentage of the US population was living in California in 2018?
  3. Compare the population growth of the US and California from 2010 to 2018.

Follow along on ask.vanna.ai — create an account (completely free) and navigate to the Cybersyn dataset called “cybersyn-us-global-public”.

If you want to pull this dataset into your own Snowflake account, it’s available free on the Snowflake Data Marketplace.

If you have issues, just hit us up on Slack and we’ll get back to you quickly.

Vanna, Cybersyn, and Snowflake

First, a bit about what we’ll use today. We’ll ask our questions to Vanna, which will use AI to translate each question into SQL, run that SQL against Cybersyn data hosted on Snowflake, and return the results to you.

Vanna AI is an AI SQL generator that we recently started working on (it’s been about a month, so excuse the bugs!).

Cybersyn is a cool data provider founded by Alexander Izydorczyk that has a variety of useful and easily accessible data on Snowflake.

Snowflake is a popular data warehouse that we’ll be utilizing for this demo.
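
To make the question-to-SQL-to-results flow concrete, here is a minimal sketch in Python. Everything in it is a stand-in: `fake_translator` is a hardcoded lookup rather than Vanna’s AI, the `answer` function and the `demo` table are hypothetical, and an in-memory SQLite database takes the place of Snowflake.

```python
import sqlite3
from typing import Callable, List, Tuple


def answer(question: str,
           to_sql: Callable[[str], str],
           conn: sqlite3.Connection) -> Tuple[str, List[tuple]]:
    """Translate a question to SQL, run it, and return the SQL plus its rows."""
    sql = to_sql(question)               # step 1: natural language -> SQL
    rows = conn.execute(sql).fetchall()  # step 2: run the SQL on the warehouse
    return sql, rows                     # step 3: hand the results back


def fake_translator(question: str) -> str:
    """Stand-in for the AI step: a canned lookup, NOT how Vanna generates SQL."""
    canned = {"How many rows are there?": "SELECT COUNT(*) FROM demo"}
    return canned[question]


# In-memory SQLite database as a stand-in for Snowflake (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE demo (x INTEGER); INSERT INTO demo VALUES (1), (2), (3);"
)

sql, rows = answer("How many rows are there?", fake_translator, conn)
print(sql)
print(rows)
```

Returning the SQL alongside the rows mirrors what makes this kind of tool checkable: you can always read the query that produced the answer.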

Brief look through the data

So let’s take a look at the Cybersyn tables we’ll be using. You’ll see the data catalog to the left:

geography_relationships: this helps us establish the hierarchy and relationships between different geographies (e.g., so we can understand that a given geography is a state within a country).

geography_index: this is the table that lists all geographic indices.

datacommons_timeseries: this is the primary table, which contains all geographies, variables, and values.

Question 1 — State with the highest population

For our first question, we find the state with the highest population. The generated query joins three tables and finds that California, with a population of 39.4 million, was the most populous state in 2018.
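
For a feel of what a three-table join like this might look like, here is a rough sketch that runs on an in-memory SQLite database. Only the three table names come from the catalog above. Every column name (`geo_id`, `geo_name`, `level`, `related_geo_id`, `variable_name`, `year`, `value`) and every row is a made-up placeholder, so this is not the actual Cybersyn schema and not the SQL Vanna generated.

```python
import sqlite3

# Table names are from the article; all columns and rows are placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geography_index (geo_id TEXT PRIMARY KEY, geo_name TEXT, level TEXT);
CREATE TABLE geography_relationships (geo_id TEXT, related_geo_id TEXT);
CREATE TABLE datacommons_timeseries (
    geo_id TEXT, variable_name TEXT, year INTEGER, value REAL
);

INSERT INTO geography_index VALUES
  ('country/X', 'Country X', 'Country'),
  ('state/A',   'State A',   'State'),
  ('state/B',   'State B',   'State'),
  ('state/C',   'State C',   'State');

INSERT INTO geography_relationships VALUES
  ('state/A', 'country/X'),
  ('state/B', 'country/X'),
  ('state/C', 'country/X');

INSERT INTO datacommons_timeseries VALUES
  ('state/A',   'Total Population', 2018, 300),
  ('state/B',   'Total Population', 2018, 500),
  ('state/C',   'Total Population', 2018, 200),
  ('state/A',   'Total Population', 2017, 900),
  ('country/X', 'Total Population', 2018, 1000);
""")

# Join values -> geography names -> parent country, keep only states in the
# target year, then sort by population and take the top row.
TOP_STATE_QUERY = """
SELECT gi.geo_name, ts.value
FROM datacommons_timeseries AS ts
JOIN geography_index AS gi ON gi.geo_id = ts.geo_id
JOIN geography_relationships AS gr ON gr.geo_id = gi.geo_id
WHERE gi.level = 'State'
  AND gr.related_geo_id = 'country/X'
  AND ts.variable_name = 'Total Population'
  AND ts.year = 2018
ORDER BY ts.value DESC
LIMIT 1
"""

top_state = conn.execute(TOP_STATE_QUERY).fetchone()
print(top_state)
```

The 2017 row for State A is larger than any 2018 value, so the query only returns State B if the year filter is doing its job.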

Question 2 — Percentage of population in that state

A quick and obvious follow-up question is what percentage of the US population lived in California (it turns out to be 12.07%). This is a common pattern, where one question naturally leads to the next. The generated SQL is actually quite complex. It would take a while to write manually!
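
The core of a percentage question is a ratio between a state row and the matching country row for the same year. Here is a small sketch of that idea, again on in-memory SQLite. The single `population` table, its columns, and its values are all hypothetical placeholders, and the real generated query works against the three Cybersyn tables rather than this simplified layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-table layout with placeholder values (not Cybersyn data).
conn.executescript("""
CREATE TABLE population (geo_name TEXT, level TEXT, year INTEGER, value REAL);
INSERT INTO population VALUES
  ('Country X', 'Country', 2018, 1000),
  ('Country X', 'Country', 2017, 960),
  ('State A',   'State',   2018, 250),
  ('State A',   'State',   2017, 240);
""")

# Self-join: pair the state row with the country row for the same year,
# then express the state's value as a percentage of the country's.
SHARE_QUERY = """
SELECT ROUND(100.0 * s.value / c.value, 2) AS pct_of_country
FROM population AS s
JOIN population AS c
  ON c.year = s.year AND c.level = 'Country'
WHERE s.geo_name = ? AND s.year = ?
"""

(share,) = conn.execute(SHARE_QUERY, ("State A", 2018)).fetchone()
print(share)  # 25.0
```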

Question 3 — Comparing population growth of California to the US

Finally, let’s compare California’s population growth to the US’s. Note that we had to specify a data cleaning step — “The California state has multiple values for the population in each year, take the maximum for cleaning the data.” We also hinted that California is a state while the US is a country to help the AI out in this case.
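
The cleaning step and the growth comparison can be sketched as follows. This is an illustration under stated assumptions: the `population` table, its columns, and all values are placeholders, the `yearly_growth` helper is hypothetical, and "growth" is taken to mean percent change from the previous year, which may differ from what the generated SQL computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Placeholder data: State A has duplicate values for 2010 and 2011.
conn.executescript("""
CREATE TABLE population (geo_name TEXT, year INTEGER, value REAL);
INSERT INTO population VALUES
  ('State A',   2010, 100),
  ('State A',   2010, 90),
  ('State A',   2011, 110),
  ('State A',   2011, 105),
  ('State A',   2012, 121),
  ('Country X', 2010, 1000),
  ('Country X', 2011, 1050),
  ('Country X', 2012, 1155);
""")

# Cleaning step: collapse duplicates by keeping the maximum value per year.
CLEAN_QUERY = """
SELECT geo_name, year, MAX(value) AS value
FROM population
GROUP BY geo_name, year
ORDER BY geo_name, year
"""


def yearly_growth(rows):
    """Return {geo_name: {year: percent change vs. the previous row}}.

    Assumes rows are sorted by geography and year, with no missing years.
    """
    growth = {}
    previous = {}  # geo_name -> value from the prior year
    for geo_name, year, value in rows:
        if geo_name in previous:
            pct = 100.0 * (value - previous[geo_name]) / previous[geo_name]
            growth.setdefault(geo_name, {})[year] = round(pct, 2)
        previous[geo_name] = value
    return growth


growth = yearly_growth(conn.execute(CLEAN_QUERY).fetchall())
for geo_name, by_year in growth.items():
    print(geo_name, by_year)
```

Without the `MAX` step, the duplicate rows would make State A look like it shrank and regrew within a single year, which is why the hint in the question matters.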

Incredibly, the auto-generated chart shows exactly what we asked for.

California’s population growth stayed above the US’s until 2015, then dropped sharply through 2018.

Next question, please

Head over to Vanna AI, create an account (completely free), and test out Vanna for yourself. It’s not perfect by any means, and it may take a few tries to phrase the question the right way, but it’s way better and faster than writing these complex queries yourself.

More important than US population growth are the questions that relate to your own company’s data. With Vanna, you can train an AI model specifically on your company’s data so that it can answer nuanced questions like a data analyst. Get in touch for a demo!
