LLM hallucinations can actually be a feature: Using Snowflake Cortex and Streamlit to generate table and column descriptions

In my conversations at this year’s Snowflake Summit, many companies expressed concern about using LLMs because of their habit of “hallucinating”: they are so determined to answer your questions that they simply make things up. However, there are use cases where we actually want an LLM to invent things based on very little data, and where we will naturally have a human process to review the results.

One of my customers asked how Snowflake Cortex could help their data governance team come up with column descriptions more quickly. This is a fairly universal need and a great use of Cortex, because it is far easier for business analysts to review descriptions than to write them from scratch. Snowflake’s Copilot features also benefit from good column descriptions.

Streamlit Application

Below is the application I built in an afternoon: it lets you choose a table and then dynamically generates prompts for Cortex. Later in this article, I include the full source code for the application.

[Screenshot: the Streamlit application in action]

Prompts for Cortex

Snowflake’s Cortex SQL Functions are very simple and require only two parameters: the name of an LLM and a prompt.
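To make the signature concrete, here is a minimal sketch of calling the function from Snowpark, assuming an active session (the model name and prompt are just placeholders):

# Minimal sketch: SNOWFLAKE.CORTEX.COMPLETE takes only a model name and a prompt
from snowflake.snowpark.context import get_active_session

session = get_active_session()
row = session.sql(
    "SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS LLM_RESPONSE",
    params=["mixtral-8x7b", "Describe a table named FLIGHTS in one sentence."],
).first()
print(row["LLM_RESPONSE"])

For example, this is one of the prompts I came up with to generate both table DDL and dbt project YAML: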

You are a tool with one goal: Generate descriptions and a primary key for a specific Snowflake table and its related columns. 
You will only respond with markdown text.
Do not provide explanations.
Include a description of the table, outlining its overall purpose and contents.
Include a description for every column of what it represents, including how it relates to other columns.
Descriptions can be up to 30 words.
In the markdown, include a dbt_constraints.primary_key test with the most likely primary key columns for the table.
The primary key may have only one column or may need multiple columns to be unique.
Only specify the dbt_constraints.primary_key test in a "tests:" block.
To help you, I will give you context about the table, the table ddl and a sample CSV extract of the data.
I need you to return markdown text exactly like:
###
Alter Table DDL:
---------------
```sql

ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" SET COMMENT = '{ table description }';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ADD CONSTRAINT FLIGHTS_PK PRIMARY KEY ("{ PRIMARY_KEY_COLUMN_1 }", "{ PRIMARY_KEY_COLUMN_2 }");

ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "{ COLUMN_NAME_1 }" COMMENT '{ column description }';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "{ COLUMN_NAME_2 }" COMMENT '{ column description }';

```

DBT YAML:
---------
```yaml

version: 2
models:
  - name: "FLIGHTS"
    description: "{ table description }"
    columns:
      - name: "{ COLUMN_NAME_1 }"
        description: "{ column description }"
      - name: "{ COLUMN_NAME_2 }"
        description: "{ column description }"
    tests:
      - dbt_constraints.primary_key:
          column_names:
            - "{ PRIMARY_KEY_COLUMN_1 }"
            - "{ PRIMARY_KEY_COLUMN_2 }"

sources:
  - name: "DFLIPPO_PUBLIC"
    database: "DFLIPPO"
    schema: "PUBLIC"
    tables:
      - name: "FLIGHTS"
        description: "{ table description }"
        columns:
          - name: "{ COLUMN_NAME_1 }"
            description: "{ column description }"
          - name: "{ COLUMN_NAME_2 }"
            description: "{ column description }"
        tests:
          - dbt_constraints.primary_key:
              column_names:
                - "{ PRIMARY_KEY_COLUMN_1 }"
                - "{ PRIMARY_KEY_COLUMN_2 }"

```
###
context:
Not Applicable
###
ddl:
create or replace TABLE FLIGHTS (
    "year" NUMBER(38,0),
    "month" NUMBER(38,0),
    "day" NUMBER(38,0),
    "dep_time" NUMBER(38,0),
    "sched_dep_time" NUMBER(38,0),
    "dep_delay" FLOAT,
    "arr_time" NUMBER(38,0),
    "sched_arr_time" NUMBER(38,0),
    "arr_delay" FLOAT,
    "carrier" VARCHAR(255),
    "flight" NUMBER(38,0),
    "tailnum" VARCHAR(255),
    "origin" VARCHAR(255),
    "dest" VARCHAR(255),
    "air_time" FLOAT,
    "distance" FLOAT,
    "hour" FLOAT,
    "minute" FLOAT,
    "time_hour" VARCHAR(255)
);
###
data:
year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
2013,12,3,2153.0,2159,-6.0,2241.0,2301,-20.0,EV,4233,N16963,EWR,ALB,29.0,143.0,21.0,59.0,2013-12-03 21:00:00
2013,12,17,1845.0,1835,10.0,1940.0,1954,-14.0,US,2070,N752US,LGA,PHL,35.0,96.0,18.0,35.0,2013-12-17 18:00:00
2013,7,3,1458.0,1500,-2.0,1734.0,1755,-21.0,AA,1813,N5FEAA,JFK,MCO,124.0,944.0,15.0,0.0,2013-07-03 15:00:00
2013,11,22,1941.0,1905,36.0,2132.0,2106,26.0,DL,1473,N339NB,LGA,MEM,151.0,963.0,19.0,5.0,2013-11-22 19:00:00
2013,2,26,1549.0,1600,-11.0,1911.0,1925,-14.0,AA,1467,N3GTAA,LGA,MIA,171.0,1096.0,16.0,0.0,2013-02-26 16:00:00
2013,12,4,1348.0,1400,-12.0,1450.0,1507,-17.0,US,2150,N946UW,LGA,BOS,37.0,184.0,14.0,0.0,2013-12-04 14:00:00
2013,11,7,1359.0,1330,29.0,1715.0,1634,41.0,B6,431,N663JB,LGA,SRQ,162.0,1047.0,13.0,30.0,2013-11-07 13:00:00
2013,11,30,645.0,650,-5.0,852.0,918,-26.0,UA,245,N668UA,EWR,DEN,223.0,1605.0,6.0,50.0,2013-11-30 06:00:00
2013,3,7,2200.0,2105,55.0,2358.0,2259,59.0,EV,4700,N15572,EWR,CLT,83.0,529.0,21.0,5.0,2013-03-07 21:00:00
2013,8,21,2121.0,2030,51.0,2233.0,2203,30.0,FL,357,N958AT,LGA,CAK,57.0,397.0,20.0,30.0,2013-08-21 20:00:00
2013,7,30,1854.0,1855,-1.0,2155.0,2205,-10.0,VX,169,N523VA,EWR,LAX,331.0,2454.0,18.0,55.0,2013-07-30 18:00:00
2013,3,20,1521.0,1529,-8.0,1806.0,1815,-9.0,UA,987,N492UA,EWR,LAS,320.0,2227.0,15.0,29.0,2013-03-20 15:00:00
2013,1,7,1036.0,1020,16.0,1324.0,1330,-6.0,AA,731,N3GJAA,LGA,DFW,188.0,1389.0,10.0,20.0,2013-01-07 10:00:00
2013,7,16,858.0,900,-2.0,1120.0,1120,0.0,DL,485,N334NB,EWR,ATL,104.0,746.0,9.0,0.0,2013-07-16 09:00:00
2013,12,24,1513.0,1520,-7.0,1630.0,1705,-35.0,AA,341,N542AA,LGA,ORD,118.0,733.0,15.0,20.0,2013-12-24 15:00:00
2013,3,21,,1430,,,1554,,EV,5766,N828AS,LGA,IAD,,229.0,14.0,30.0,2013-03-21 14:00:00
2013,5,9,1610.0,1559,11.0,1804.0,1814,-10.0,MQ,3985,N640MQ,JFK,CVG,93.0,589.0,15.0,59.0,2013-05-09 15:00:00
2013,12,18,1015.0,946,29.0,1321.0,1313,8.0,UA,297,N820UA,EWR,LAX,333.0,2454.0,9.0,46.0,2013-12-18 09:00:00
2013,11,24,909.0,845,24.0,1202.0,1211,-9.0,UA,1162,N71411,EWR,SEA,330.0,2402.0,8.0,45.0,2013-11-24 08:00:00
2013,2,3,2105.0,1945,80.0,2323.0,2241,42.0,9E,3314,N921XJ,JFK,JAX,118.0,828.0,19.0,45.0,2013-02-03 19:00:00

And this is the resulting ALTER TABLE DDL:

ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" SET COMMENT = 'Table storing flight information, including flight status, times, and flight numbers.';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ADD CONSTRAINT FLIGHTS_PK PRIMARY KEY ("year", "month", "day", "flight");

ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "year" COMMENT 'The year of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "month" COMMENT 'The month of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "day" COMMENT 'The day of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "dep_time" COMMENT 'The actual departure time of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "sched_dep_time" COMMENT 'The scheduled departure time of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "dep_delay" COMMENT 'The difference between scheduled and actual departure time';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "arr_time" COMMENT 'The actual arrival time of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "sched_arr_time" COMMENT 'The scheduled arrival time of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "arr_delay" COMMENT 'The difference between scheduled and actual arrival time';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "carrier" COMMENT 'The airline carrier code';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "flight" COMMENT 'The flight number';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "tailnum" COMMENT 'The tail number of the aircraft';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "origin" COMMENT 'The airport code for the departure airport';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "dest" COMMENT 'The airport code for the arrival airport';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "air_time" COMMENT 'The actual duration of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "distance" COMMENT 'The distance between the departure and arrival airports';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "hour" COMMENT 'The hour of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "minute" COMMENT 'The minute of the flight';
ALTER TABLE "DFLIPPO"."PUBLIC"."FLIGHTS" ALTER COLUMN "time_hour" COMMENT 'The date and time of the flight in the format YYYY-MM-DD HH:MM:SS';

This is the resulting YAML for dbt, which could be dropped into a source or model file:

version: 2
models:
  - name: "FLIGHTS"
    description: 'Table storing flight information, including flight status, times, and flight numbers.'
    columns:
      - name: "year"
        description: 'The year of the flight'
      - name: "month"
        description: 'The month of the flight'
      - name: "day"
        description: 'The day of the flight'
      - name: "dep_time"
        description: 'The actual departure time of the flight'
      - name: "sched_dep_time"
        description: 'The scheduled departure time of the flight'
      - name: "dep_delay"
        description: 'The difference between scheduled and actual departure time'
      - name: "arr_time"
        description: 'The actual arrival time of the flight'
      - name: "sched_arr_time"
        description: 'The scheduled arrival time of the flight'
      - name: "arr_delay"
        description: 'The difference between scheduled and actual arrival time'
      - name: "carrier"
        description: 'The airline carrier code'
      - name: "flight"
        description: 'The flight number'
      - name: "tailnum"
        description: 'The tail number of the aircraft'
      - name: "origin"
        description: 'The airport code for the departure airport'
      - name: "dest"
        description: 'The airport code for the arrival airport'
      - name: "air_time"
        description: 'The actual duration of the flight'
      - name: "distance"
        description: 'The distance between the departure and arrival airports'
      - name: "hour"
        description: 'The hour of the flight'
      - name: "minute"
        description: 'The minute of the flight'
      - name: "time_hour"
        description: 'The date and time of the flight in the format YYYY-MM-DD HH:MM:SS'
    tests:
      - dbt_constraints.primary_key:
          column_names:
            - "year"
            - "month"
            - "day"
            - "flight"

sources:
  - name: "DFLIPPO_PUBLIC"
    database: "DFLIPPO"
    schema: "PUBLIC"
    tables:
      - name: "FLIGHTS"
        description: 'Table storing flight information, including flight status, times, and flight numbers.'
        columns:
          - name: "year"
            description: 'The year of the flight'
          - name: "month"
            description: 'The month of the flight'
          - name: "day"
            description: 'The day of the flight'
          - name: "dep_time"
            description: 'The actual departure time of the flight'
          - name: "sched_dep_time"
            description: 'The scheduled departure time of the flight'
          - name: "dep_delay"
            description: 'The difference between scheduled and actual departure time'
          - name: "arr_time"
            description: 'The actual arrival time of the flight'
          - name: "sched_arr_time"
            description: 'The scheduled arrival time of the flight'
          - name: "arr_delay"
            description: 'The difference between scheduled and actual arrival time'
          - name: "carrier"
            description: 'The airline carrier code'
          - name: "flight"
            description: 'The flight number'
          - name: "tailnum"
            description: 'The tail number of the aircraft'
          - name: "origin"
            description: 'The airport code for the departure airport'
          - name: "dest"
            description: 'The airport code for the arrival airport'
          - name: "air_time"
            description: 'The actual duration of the flight'
          - name: "distance"
            description: 'The distance between the departure and arrival airports'
          - name: "hour"
            description: 'The hour of the flight'
          - name: "minute"
            description: 'The minute of the flight'
          - name: "time_hour"
            description: 'The date and time of the flight in the format YYYY-MM-DD HH:MM:SS'
        tests:
          - dbt_constraints.primary_key:
              column_names:
                - "year"
                - "month"
                - "day"
                - "flight"

Final Complete Streamlit Application

Below is a fully functional Streamlit application that you are welcome to deploy in your own Snowflake environment.

Source Code:

# Import python packages
import streamlit as st
from snowflake.snowpark.context import get_active_session

st.set_page_config(layout="wide")
# Get the current credentials
session = get_active_session()

# Query the list of databases using INFORMATION_SCHEMA
@st.cache_data
def get_db():
    return session.sql("""
        select database_name
        from INFORMATION_SCHEMA.DATABASES
        where TYPE = 'STANDARD'
        order by 1
    """).to_pandas()

# Query all the tables in the selected database using INFORMATION_SCHEMA
@st.cache_data
def get_schema_tables(db_name):
    # We want to exclude any tables without sample data to provide
    return session.sql(f"""
        select table_schema, table_name
        from "{db_name}".INFORMATION_SCHEMA.TABLES
        where table_type = 'BASE TABLE'
        and row_count >= 10
        order by 1, 2
    """).to_pandas()

# Query a Snowflake table's DDL
def get_table_ddl(fully_qualified_table_name):
    df = session.sql("select get_ddl('table', ?)", params=[fully_qualified_table_name]).to_pandas()
    return df.iloc[0][0]

# Execute SNOWFLAKE.CORTEX.COMPLETE and return an async job
# This allows multiple concurrent queries
def get_llm_response_async(llm_name, prompt_text):
    return session.sql("""
        SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS LLM_RESPONSE
    """, params=[llm_name, prompt_text]).first(block=False)

# Write directly to the app
st.title("Magic Data Description Generator")
st.subheader("Choose a specific table to generate table and column descriptions.")

a, b, c = st.columns(3)
with a:
    db_option = st.selectbox('Choose a database', get_db())

df = get_schema_tables(db_option)

# If the database has tables with data
if not df.empty:

    with b:
        list_of_schemas = df.sort_values('TABLE_SCHEMA')["TABLE_SCHEMA"].unique()
        schema_option = st.selectbox('Choose a schema', list_of_schemas)

    with c:
        list_of_tables = df.sort_values('TABLE_NAME')[df["TABLE_SCHEMA"] == schema_option]["TABLE_NAME"].unique()
        table_option = st.selectbox('Choose a table', list_of_tables)

    fully_qualified_table_name = f"{db_option}.{schema_option}.{table_option}"
    table = session.table(fully_qualified_table_name)

    # Display the first 20 rows
    st.dataframe(table.limit(20), use_container_width=True)

    # Sample 20 rows
    extract = table.sample(n=20).to_pandas().to_csv(index=False)

    # Query the Snowflake table's DDL
    existing_ddl = get_table_ddl(fully_qualified_table_name)

    with st.expander("Existing DDL"):
        st.code(existing_ddl)

    human_information = st.text_area("Additional context about the table that Snowflake should consider:", "Not Applicable")

    llm_name = st.selectbox("Large Language Model", ['mixtral-8x7b', 'snowflake-arctic', 'mistral-large', 'reka-flash', 'reka-core', 'llama2-70b-chat', 'llama3-8b', 'llama3-70b', 'mistral-7b', 'gemma-7b'])

    if st.button('Ask Cortex to generate descriptions'):

        first_ddl_prompt = f"""
You are a tool with one goal: Generate descriptions and a primary key for a specific Snowflake table and its related columns.
You will only respond with markdown text.
Do not provide explanations.
Include a description of the table, outlining its overall purpose and contents.
Include a description for every column of what it represents, including how it relates to other columns.
Descriptions can be up to 30 words.
In the markdown, include a dbt_constraints.primary_key test with the most likely primary key columns for the table.
The primary key may have only one column or may need multiple columns to be unique.
Only specify the dbt_constraints.primary_key test in a "tests:" block.
To help you, I will give you context about the table, the table ddl and a sample CSV extract of the data.
I need you to return markdown text exactly like:
###
Alter Table DDL:
---------------
```sql

ALTER TABLE "{db_option}"."{schema_option}"."{table_option}" SET COMMENT = '{{ table description }}';
ALTER TABLE "{db_option}"."{schema_option}"."{table_option}" ADD CONSTRAINT {table_option}_PK PRIMARY KEY ("{{ PRIMARY_KEY_COLUMN_1 }}", "{{ PRIMARY_KEY_COLUMN_2 }}");

ALTER TABLE "{db_option}"."{schema_option}"."{table_option}" ALTER COLUMN "{{ COLUMN_NAME_1 }}" COMMENT '{{ column description }}';
ALTER TABLE "{db_option}"."{schema_option}"."{table_option}" ALTER COLUMN "{{ COLUMN_NAME_2 }}" COMMENT '{{ column description }}';

```

DBT YAML:
---------
```yaml

version: 2
models:
  - name: "{table_option}"
    description: "{{ table description }}"
    columns:
      - name: "{{ COLUMN_NAME_1 }}"
        description: "{{ column description }}"
      - name: "{{ COLUMN_NAME_2 }}"
        description: "{{ column description }}"
    tests:
      - dbt_constraints.primary_key:
          column_names:
            - "{{ PRIMARY_KEY_COLUMN_1 }}"
            - "{{ PRIMARY_KEY_COLUMN_2 }}"

sources:
  - name: "{db_option}_{schema_option}"
    database: "{db_option}"
    schema: "{schema_option}"
    tables:
      - name: "{table_option}"
        description: "{{ table description }}"
        columns:
          - name: "{{ COLUMN_NAME_1 }}"
            description: "{{ column description }}"
          - name: "{{ COLUMN_NAME_2 }}"
            description: "{{ column description }}"
        tests:
          - dbt_constraints.primary_key:
              column_names:
                - "{{ PRIMARY_KEY_COLUMN_1 }}"
                - "{{ PRIMARY_KEY_COLUMN_2 }}"

```
###
context:
{human_information}
###
ddl:
{existing_ddl}
###
data:
{extract}
"""

        # Start the first query
        async_job1 = get_llm_response_async(llm_name, first_ddl_prompt)

        with st.status("Running Snowflake Cortex queries"):
            st.code(first_ddl_prompt)
            async_df1 = async_job1.result()

        # Some LLMs have a bad habit of adding escape characters
        # for underscores but we can remove them easily with replace
        ddl_response = async_df1[0][0].replace("\\_", "_")

        st.subheader("Response")
        st.markdown(ddl_response)
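A closing note on the code: get_llm_response_async issues the Cortex query with first(block=False), which returns a Snowpark AsyncJob instead of waiting for the result. That is what makes it possible to fan out several prompts at once and collect the answers later. A sketch of the pattern, reusing the function above (the second prompt variable is hypothetical):

# Both queries run on the warehouse concurrently;
# result() only blocks when we collect each answer
async_job1 = get_llm_response_async(llm_name, first_ddl_prompt)
async_job2 = get_llm_response_async(llm_name, second_prompt)  # hypothetical second prompt

first_response = async_job1.result()[0][0]
second_response = async_job2.result()[0][0]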

Dan Flippo
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Principal Solutions Architect for Healthcare and Life Sciences @Snowflake; author of dbt Constraints. My views & opinions do not represent those of my employer.