How to Perform NL-to-SQL using Dataherald AI with Snowflake

Ainesh Pandey
Dataherald

Dataherald AI is a tool that adds a semantic layer on top of your data warehouses to enable natural language question answering. Using Dataherald’s intuitive APIs and best practices, anyone can build native applications servicing non-technical stakeholders (both internal and external), unlocking the potential of your production databases. Read on to learn how the tool interfaces with Snowflake.

Prerequisites

  • A Snowflake account, set up with the data you want to expose to Dataherald AI (DHAI). For this example, we’ll use real estate data from Redfin and RentHub with the following tables:
    — redfin_homes_sold
    — redfin_inventory
    — redfin_median_list_price
    — redfin_median_ppsf
    — redfin_median_sale_price
    — redfin_new_listings
    — renthub_median_rent
  • An app built on top of the Dataherald API service. For this example, we’ll provide Python code that calls the Dataherald AI service; a minimal REST sketch follows this list.
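
Throughout this post, the SDK snippets assume an authenticated Dataherald client named dh_client; see the Dataherald docs for how to instantiate it for your account. If you would rather call the REST API directly, here is a minimal sketch. The base URL and API-key header shown are illustrative assumptions, so substitute the values documented for your deployment.

import requests

# NOTE: the base URL and auth header below are assumptions for illustration;
# use the values documented for your Dataherald account.
BASE_URL = "https://api.dataherald.com/api/v1"
HEADERS = {"X-API-Key": "<your-api-key>"}

# List the database connections currently registered with Dataherald.
resp = requests.get(f"{BASE_URL}/database-connections", headers=HEADERS)
resp.raise_for_status()
print(resp.json())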

Setting Up Snowflake Connection with DHAI

Our tool needs to connect to the data in Snowflake and understand how to use it to answer natural language questions. Set this up as follows:

Establish Connection to Snowflake

You’ll need to create a database connection using a connection URI for your data warehouse instance. This can be done from the admin console’s Databases tab, through the SDK (shown below), or via the /database-connections POST endpoint.

# Connection details for the Snowflake warehouse instance.
alias = "Snowflake_RealEstate"
use_ssh = False
connection_uri = "snowflake://<user>:<password>@<organization>-<account-name>/<database>/<schema>"

# Register the connection with Dataherald.
db_connection = dh_client.database_connections.create(
    alias=alias,
    use_ssh=use_ssh,
    connection_uri=connection_uri,
)

Scan Tables

Next, Dataherald AI scans the specified tables to better understand how to use the data. The tool will:

  • read the schema to infer data types and the type of content contained in the table
  • extract identified categorical variable values
  • determine relationships between tables

Use the /table-descriptions/sync-schemas POST endpoint to start the scan. You may specify tables in the endpoint, or just have the tool scan all available tables. The SDK code would look like this:

db_connection_id = db_connection.id
table_names = [
    "redfin_median_sale_price", "redfin_median_list_price", "redfin_median_ppsf",
    "redfin_homes_sold", "redfin_inventory", "redfin_new_listings", "renthub_median_rent",
]

# Kick off the schema scan for the listed tables.
dh_client.table_descriptions.sync_schemas(
    db_connection_id=db_connection_id,
    table_names=table_names,
)
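
The scan runs asynchronously, so larger tables can take a while. Here is a minimal sketch for waiting on completion, assuming each table description exposes a status field that reads "SYNCHRONIZED" once scanning finishes and that listing without a table_name returns all tables, as in the open-source engine:

import time

# Poll the table descriptions until every table has finished scanning.
# NOTE: the "SYNCHRONIZED" status value is an assumption based on the
# open-source engine; check your deployment's documentation.
while True:
    descriptions = dh_client.table_descriptions.list(db_connection_id=db_connection_id)
    pending = [d.table_name for d in descriptions if d.status != "SYNCHRONIZED"]
    if not pending:
        break
    print(f"Still scanning: {pending}")
    time.sleep(10)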

Querying your Data in Natural Language

DHAI can now be leveraged to answer questions. The tool can generate the SQL to answer the question, extract the relevant slice of data, and then convert the response back to natural language for the user. It utilizes the following API endpoints:

  • /prompts creates a prompt, without trying to generate any SQL code or natural language response
  • /prompts/sql-generations creates the prompt and generates a valid SQL query in response
  • /prompts/sql-generations/nl-generations creates the prompt, generates a valid SQL query, and generates a natural language response from the results of the query

The SDK code below uses the last of these endpoints, generating the SQL query and the natural language response in a single call.

# Ask a question in natural language; this generates SQL and an NL answer.
sql_generation = {
    "prompt": {
        "text": "Which county in California has the highest number of townhouses on sale?",
        "db_connection_id": db_connection.id,
    }
}

response = dh_client.nl_generations.create(sql_generation=sql_generation)
nl_generation_id = response.id
sql_generation_id = response.sql_generation_id
prompt_id = dh_client.sql_generations.retrieve(id=sql_generation_id).prompt_id

print(f"Question: {dh_client.prompts.retrieve(id=prompt_id).text}")
print(f"NL response: {dh_client.nl_generations.retrieve(id=nl_generation_id).text}")
print()
print(f"Generated SQL query: \n{dh_client.sql_generations.retrieve(id=sql_generation_id).sql}")

Providing Additional Context to DHAI

While the generated SQL query may be valid and seem reasonable, it may not actually answer the question correctly, primarily because we haven’t provided important context. We can add context in the following ways:

Table/Column Descriptions

Descriptions aid the tool in choosing the right tables and columns to answer a given question. Text like “This table contains monthly inventory data for various types of property across geographic regions” or “This Boolean column identifies if the data point is seasonally adjusted” helps the tool understand which tables or columns are relevant to the questions being asked. Use the /table-descriptions/{table_description_id} PATCH endpoint to update table and column descriptions. The SDK code would be as follows:

# Fetch the existing description record for the table we want to document.
db_connection_id = db_connection.id
table_name = "redfin_median_sale_price"
table_description = dh_client.table_descriptions.list(
    db_connection_id=db_connection_id,
    table_name=table_name,
)[0]

# Attach a human-readable description to the table.
id = table_description.id
description = "Contains median sale price for specified geo, property type, and time period"
table_description = dh_client.table_descriptions.update(
    id=id,
    description=description,
)
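
The same PATCH endpoint also covers column descriptions. A minimal sketch, assuming the update method accepts a columns list of name/description pairs in the shape the REST endpoint documents:

# Describe the metric_value column of the same table.
# NOTE: the `columns` payload shape is an assumption; check the
# /table-descriptions/{table_description_id} PATCH documentation.
table_description = dh_client.table_descriptions.update(
    id=id,
    columns=[
        {
            "name": "metric_value",
            "description": "Median sale price for the row's geo, property type, and period",
        }
    ],
)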

Database-level Instructions

We can provide database-level instructions that capture organizational nuances in query writing. For example, an “active customer” may be one that logged onto the platform in the past week, made a purchase in the past quarter, or satisfies some other specific criteria. Let’s add a list of instructions to the database using our SDK.

db_connection_id = db_connection.id
instructions = [
    "Never run an aggregate function on the metric_value column (AVG, SUM or COUNT). The metric_value column already includes aggregated or averaged data.",
    "If and only if geo_type is 'zip', filter date ranges by period_end, not period_start. The date should be the last date in the month.",
    "Always filter by 'property_type' in the where clause. If the requested property_type is unclear from the question, set it to 'All Residential'.",
    "If the metric being asked is price per square foot, filter any value over 15,000.",
    "If the geo_type is 'city' or 'county', always include the dh_state_name in the select part of the query.",
    "If you are doing a JOIN on geo_type of 'city' or 'county', you should also match the dh_state_name column.",
    "If the requested geo_type is unclear from the question, set it to 'national'.",
]

# Register each instruction with the database connection.
for instruction in instructions:
    dh_client.instructions.create(
        db_connection_id=db_connection_id,
        instruction=instruction,
    )
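
Instructions accumulate on the connection, so it is worth auditing them periodically. A short sketch, assuming the SDK mirrors the /instructions GET and DELETE endpoints with list and delete methods:

# Review the instructions attached to this connection and prune stale ones.
# NOTE: the list/delete method names are assumptions mirroring the REST endpoints.
for stored in dh_client.instructions.list(db_connection_id=db_connection_id):
    print(stored.id, stored.instruction)
    # dh_client.instructions.delete(id=stored.id)  # remove if no longer needed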

Golden SQLs

Finally, we can provide “golden SQLs”: verified question-and-SQL pairs. The more representative and varied the golden SQLs we provide, the more accurate and reliable the tool’s generated SQL becomes.

# A verified question/SQL pair for the rental dataset.
db_connection_id = db_connection.id
prompt_text = "what was the most expensive zip code to rent in Los Angeles county in May 2022?"
sql = """
SELECT location_name,
       metric_value
FROM renthub_median_rent
WHERE dh_county_name = 'Los Angeles'
  AND dh_state_name = 'California'
  AND property_type = 'All Residential'
  AND period_start = '2022-05-01'
  AND geo_type = 'zip'
ORDER BY metric_value DESC
LIMIT 1;
"""

body = [
    {
        "db_connection_id": db_connection_id,
        "prompt_text": prompt_text,
        "sql": sql,
    }
]

dh_client.golden_sqls.upload(body=body)
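
With descriptions, instructions, and golden SQLs in place, re-running the earlier question should yield a query that respects the added context:

# Re-ask the original question now that context has been added.
response = dh_client.nl_generations.create(sql_generation=sql_generation)
print(f"NL response: {dh_client.nl_generations.retrieve(id=response.id).text}")
print(f"Generated SQL query: \n{dh_client.sql_generations.retrieve(id=response.sql_generation_id).sql}")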

Conclusion

With Dataherald, you can build apps that let users ask questions of your data in natural language, removing the technical barriers to gaining insights. Get started now by contacting Dataherald!

About Dataherald

  • Sign up for free and use the hosted version of Dataherald.
  • Our open-source engine is available on GitHub.
  • Join our Discord server to learn more about the project.

Ainesh is the Founding Data Product Manager for Dataherald.