LLM-Powered Snowflake Data Crawler

Use GenAI to create a catalog of your Snowflake tables and views


When creating a table in Snowflake you always add a comment detailing the table’s purpose or contents, right? Don’t worry…your secret is safe with me. Admittedly, this is a rarity for me too. Luckily, we can easily build a utility that leverages Snowflake’s Cortex Large Language Model (LLM) functions to update our table comments and generate a simple data catalog of our data artifacts.

Snowflake Cortex LLM

Snowflake Cortex gives you instant access to industry-leading LLMs. Since these LLMs are fully hosted and managed by Snowflake, using them requires no setup. Your data stays within Snowflake, giving you the performance, scalability, and governance you expect.

Snowflake Cortex features are provided as SQL functions and are also available in Python.
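For example, the SQL entry point for text generation is the SNOWFLAKE.CORTEX.COMPLETE function. As a minimal sketch (not the crawler's actual code), a Python helper that assembles such a statement for execution via a Snowpark session might look like this; the model name and prompt are illustrative:

```python
def cortex_complete_sql(model: str, prompt: str) -> str:
    """Build a SQL statement invoking SNOWFLAKE.CORTEX.COMPLETE.

    Single quotes in the prompt are doubled, per standard SQL escaping,
    so the generated statement stays well-formed.
    """
    escaped = prompt.replace("'", "''")
    return f"SELECT SNOWFLAKE.CORTEX.COMPLETE('{model}', '{escaped}')"

# The statement could then be run with session.sql(stmt) in Snowpark.
stmt = cortex_complete_sql("mistral-large", "Describe this table's contents")
```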

Data Crawler Overview

The Data Crawler utility is a Snowflake Stored Procedure that prompts a Cortex LLM to generate a natural language description of every table and view contained in a given Snowflake database and/or schema. The generated descriptions are stored in a catalog table and can be written back to the table comments directly (making hand-entered comments a thing of the past).

With the advent of semantic search and semantic modeling for GenAI use cases, natural language descriptions of underlying data are critical. The generated catalog can be easily searched, reviewed, and revised by team members in a Streamlit in Snowflake (SiS) app.

Crawler Setup

The Data Crawler consists of Python logic encapsulated in a Snowflake Stored Procedure. Setup, or more specifically registration of the Stored Procedure in your Snowflake account, requires uploading Python scripts to a Snowflake stage and registering the Python script functions as Snowflake Stored Procedures. This simple 3-step registration process is shown below and only needs to be done once.

It is recommended to use VSCode with the Snowflake extension, or SnowCLI, for the setup steps below.

Step 1: Download or clone the sfguide-data-crawler repo locally.

Step 2: Update QUERY_WAREHOUSE in sql/03__app.sql to an available warehouse that will serve as the Streamlit app warehouse. This script will create a Streamlit UI to manage table descriptions and initiate crawling.

Step 3 (using VSCode): Execute the scripts in sql/ in order of their leading filename numbers, e.g. 00__setup.sql before 01__catalog_table.sql.

or

Step 3 (using SnowCLI): Navigate to the project root in a terminal and execute the commands below. Note that you may need to pass your SnowCLI connection name with the --connection flag.

snow sql --connection="[connection-name]" -f sql/00__setup.sql
snow sql --connection="[connection-name]" -f sql/01__catalog_table.sql
snow sql --connection="[connection-name]" -f sql/02__catalog.sql
snow sql --connection="[connection-name]" -f sql/03__app.sql

Run Crawler

All necessary functions and Stored Procedures are now registered in the DATA_CATALOG.TABLE_CATALOG schema in Snowflake. Any desired database and/or schema available to the current user/role can be crawled.

Below is an example of calling the utility to crawl schema CATALOG in database JSUMMER. Results will be written to table DATA_CATALOG.TABLE_CATALOG.TABLE_CATALOG.

CALL DATA_CATALOG.TABLE_CATALOG.DATA_CATALOG(
    target_database => 'JSUMMER',
    target_schema => 'CATALOG',
    catalog_database => 'DATA_CATALOG',
    catalog_schema => 'TABLE_CATALOG',
    catalog_table => 'TABLE_CATALOG',
    sampling_mode => 'fast',
    update_comment => FALSE
);

The crawler calls an asynchronous Stored Procedure for each crawled table to allow parallelization. Increasing warehouse size and/or max concurrency will improve runtimes when crawling a large number of tables. Crawling the well-known Spider dataset of more than 900 tables completed in 10 minutes on an extra-small warehouse with a max concurrency of 10.
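The fan-out pattern, one description job per table bounded by a concurrency limit, can be sketched outside Snowflake with a thread pool. Here `describe_table` is a hypothetical stand-in for the per-table Stored Procedure call, which in the real crawler prompts a Cortex LLM:

```python
from concurrent.futures import ThreadPoolExecutor

def describe_table(table_name: str) -> str:
    # Hypothetical stand-in for the asynchronous per-table Stored Procedure,
    # which would prompt a Cortex LLM with the table's columns and samples.
    return f"Description for {table_name}"

def crawl(tables, max_concurrency: int = 10) -> dict:
    # Submit one job per table and gather results as they complete,
    # mirroring the crawler's asynchronous fan-out.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return dict(zip(tables, pool.map(describe_table, tables)))

catalog = crawl(["JSUMMER.CATALOG.ORDERS", "JSUMMER.CATALOG.CUSTOMERS"])
```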

Streamlit UI


The final script, sql/03__app.sql, creates a simple Streamlit user interface with 2 pages:

  • manage: Search, review, and revise any table descriptions.
  • run: Specify a database and/or schema to crawl.

The search feature on the manage page is a semantic search based on vector embeddings. Upon searching, table descriptions are listed according to their semantic similarity to the search text. You may update descriptions on the manage page directly.
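Semantic search of this kind ranks descriptions by cosine similarity between embedding vectors. Below is a minimal, dependency-free sketch with toy two-dimensional vectors standing in for the real embeddings (which the app would obtain from Cortex):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_tables(query_vec, catalog):
    # catalog maps table name -> (embedding, description); most similar first.
    return sorted(catalog, key=lambda t: cosine_similarity(query_vec, catalog[t][0]), reverse=True)

catalog = {
    "ORDERS": ([1.0, 0.0], "Customer orders and totals"),
    "WEATHER": ([0.0, 1.0], "Daily temperature readings"),
}
ranked = rank_tables([0.9, 0.1], catalog)  # ORDERS ranks first
```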

Table names not previously recorded in table DATA_CATALOG.TABLE_CATALOG.TABLE_CATALOG will be captured when crawling a database/schema on the run page.

Prompt and Context

Take a look at src/prompts.py (or below) to gain an understanding of what is passed to the LLM for table description generation. Variables wrapped in {} are replaced at runtime when table details are gathered.

You are a data analyst tasked with cataloging database tables.
Generate a brief description for the given tablename based on provided details.
For the given table, the generated description should characterize:
- data contained in the table
- column makeup
- pertinent details about related tables and referential keys in schema
For the given tablename, you will receive:
- column information
- user-entered comments, if available
- sample rows
- list of tables and their columns in the same schema, labeled schema_tables
Samples containing vector types have been truncated but do not comment on truncation.
The table name is prefixed by the parent database and schema name.
Follow the rules below.
<rules>
1. Do not comment on the vector truncation.
2. Generated descriptions should be concise and contain 50 words or less.
3. Do not use apostrophes or single quotes in your descriptions.
4. Do not make assumptions. If unsure, return Unable to generate table description with high degree of certainty.
</rules>
<tablename>
{tablename}
</tablename>
<table_columns>
{table_columns}
</table_columns>
<table_comment>
{table_comment}
</table_comment>
<table_samples>
{{table_samples}}
</table_samples>
<schema_tables>
{schema_tables}
</schema_tables>
Description:
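The {} variables above look like Python str.format placeholders, and the doubled braces around table_samples would survive a first formatting pass, suggesting samples are substituted in a later, second pass. That two-pass reading is an assumption from the template, not confirmed by the source; a minimal demonstration of the mechanics:

```python
# Abbreviated stand-in for the template in src/prompts.py.
template = (
    "<tablename>\n{tablename}\n</tablename>\n"
    "<table_samples>\n{{table_samples}}\n</table_samples>"
)

# First pass fills most variables; {{table_samples}} collapses to {table_samples}.
partial = template.format(tablename="JSUMMER.CATALOG.ORDERS")

# A later pass can then fill the samples once they are gathered.
final = partial.format(table_samples="id, amount\n1, 9.99")
```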

If customizing the prompt, run the command below in VSCode or SnowCLI to use the latest prompt in your Data Crawler utility. Alternatively, you may upload src/prompts.py directly into stage @DATA_CATALOG.TABLE_CATALOG.SRC_FILES via Snowsight if desired.

PUT file://src/prompts.py @DATA_CATALOG.TABLE_CATALOG.SRC_FILES OVERWRITE = TRUE AUTO_COMPRESS = FALSE;

Final Remarks

I’m a firm believer that the war for the optimal enterprise GenAI offering will be won based on convenience (e.g. ease of access and integration) and security. Furthermore, I think the ideal adoption opportunity for enterprise GenAI is the automation of time-consuming, perhaps monotonous tasks. Here, we leverage the convenience of Snowflake Cortex LLM functions to securely crawl enterprise data and generate a searchable enterprise data catalog.


Jason Summer
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Solution Innovation Architect - AI/ML @ Snowflake. Developing and scaling data engineering, ML modeling, AI, and LLMs in the data cloud