Bringing Gen AI to ETL: Redefining Data Engineering through LLMs

The Future of ETL is Here (through Snowflake)

The modern Data Engineering space has been marked by incredibly rapid product innovation that keeps testing and expanding the diagram of the “Modern Data Stack”. The newest tidal wave to wash the data shores is what we might call ETL-LLM.

ETL-LLM is exactly what it sounds like: LLMs, specifically Generative AI, applied to ETL/ELT pipelines. The benefits here range from accelerating and automating manual workflows to performance boosts on key KPIs.

In fact, some would say this revolution is already here. Data Observability tools like Fiddler AI are now brandishing their AI observability functionality. Data Lineage companies have quickly pivoted to AI to stay ahead of the product and marketing curve.

Snowflake’s entrance into LLM-powered ETL workflows is far more expansive. It is no secret that Snowflake has been the best-in-class Data Warehousing solution on the market. Always has been. But now, Snowflake is taking one giant step further by enabling customers to bake AI calls into its core data products. The innovation here is to seamlessly blend Data Science into Data Engineering workloads in order to supercharge the latter. ETL-LLM.

Data Quality + LLM: DQ for Unstructured Data

One simple but powerful illustration of this blend is the ability to call LLMs within Snowflake’s Data Quality monitoring functions.

Quick backdrop: Snowflake recently released Data Quality and data metric functions as part of the Enterprise Edition. With this feature, customers can easily call built-in or create custom data metric functions (DMFs) to measure data freshness, accuracy, completeness, and more.
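For instance, the built-in system DMFs live under the SNOWFLAKE.CORE schema and can be called directly; here is a minimal sketch against a hypothetical customers table:

-- Call a built-in system DMF ad hoc: count NULLs in the email column
SELECT SNOWFLAKE.CORE.NULL_COUNT(SELECT email FROM customers);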

CREATE DATA METRIC FUNCTION IF NOT EXISTS
  invalid_email_count (ARG_T table(ARG_C1 STRING))
  RETURNS NUMBER AS
  'SELECT COUNT_IF(FALSE = (
    ARG_C1 REGEXP ''^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$''))
  FROM ARG_T';

Here’s an example of a custom data metric function created through a single SQL call. The regex matches email addresses stored in a specific format; the query returns, as a number, the count of email addresses that do not match this expression.

ALTER TABLE customers ADD DATA METRIC FUNCTION
invalid_email_count ON (email);

Next, you apply the data metric function to the email column of the customers table. Pretty standard stuff. [Full tutorial here]
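One related detail: attached DMFs run on the table’s data metric schedule, which is set separately (this is what drives the five-minute cadence shown later in the results). A minimal sketch:

-- Set how often all DMFs attached to this table are evaluated
ALTER TABLE customers SET DATA_METRIC_SCHEDULE = '5 MINUTE';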

Where this gets exciting is the ability to call Snowflake’s LLM functions natively within the DMF.

CREATE OR REPLACE DATA METRIC FUNCTION
  llm_email_check (ARG_T table(ARG_C1 STRING))
  RETURNS NUMBER AS
  $$SELECT SUM(CAST(snowflake.cortex.complete('llama3.1-70b',
    CONCAT(ARG_C1, ' is this email in this text written correctly? Respond only as 0 if yes and 1 if no')) AS NUMBER(5,2)))
  FROM ARG_T$$;

This call nests Snowflake’s Cortex Complete function within the DMF. The call executes the exact same task as the regex above, but through free-form text.
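If you want to see what the raw model output looks like before wiring it into a DMF, you can call Cortex Complete directly on a single value; a quick sketch with a made-up input:

-- Ad-hoc spot check: the model should answer '1' for an invalid address
SELECT SNOWFLAKE.CORTEX.COMPLETE(
  'llama3.1-70b',
  CONCAT('not-an-email', ' is this email in this text written correctly? Respond only as 0 if yes and 1 if no'));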

This is just the beginning. Because the prompt is natural-language text, you can run virtually any kind of creative command over unstructured textual data for data quality checking.

You can ask all kinds of questions that transcend regex and traditional NLP capabilities. Think profanity checking. Or PII detection. Or sentiment analysis. All kinds of data quality and monitoring evaluations can be executed with slight changes to the prompt:

CREATE OR REPLACE DATA METRIC FUNCTION
  profanity_checker (ARG_T table(ARG_C1 STRING))
  RETURNS NUMBER AS
  $$SELECT SUM(CAST(snowflake.cortex.complete('llama3.1-70b',
    CONCAT(ARG_C1, ' is there profanity in this text? Respond only as 1 if yes and 0 if no')) AS NUMBER(5,2)))
  FROM ARG_T$$;

CREATE OR REPLACE DATA METRIC FUNCTION
  llm_grammar_check (ARG_T table(ARG_C1 STRING))
  RETURNS NUMBER AS
  $$SELECT SUM(CAST(snowflake.cortex.complete('llama3.1-70b',
    CONCAT(ARG_C1, ' is there a grammatical mistake in this text? Respond only as 1 if yes and 0 if no')) AS NUMBER(5,2)))
  FROM ARG_T$$;

You can see that with very slight variations in the prompt, you can execute very different kinds of data quality and data governance checks.
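For the sentiment case in particular, Cortex also ships a dedicated SENTIMENT function that returns a score between -1 and 1, so a variant of the same pattern can skip the free-form prompt entirely. A minimal sketch (the -0.5 threshold is an arbitrary choice for illustration):

-- Count rows whose text reads as strongly negative
CREATE OR REPLACE DATA METRIC FUNCTION
  negative_sentiment_count (ARG_T table(ARG_C1 STRING))
  RETURNS NUMBER AS
  $$SELECT COUNT_IF(snowflake.cortex.sentiment(ARG_C1) < -0.5)
  FROM ARG_T$$;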

Let’s see what the results look like with the profanity checker DQ policy. To review the results, run a query against Snowflake’s metadata layer:

SELECT scheduled_time, measurement_time, metric_name, value
FROM SNOWFLAKE.LOCAL.DATA_QUALITY_MONITORING_RESULTS
WHERE TRUE
AND METRIC_DATABASE = 'YOUR_DATABASE'
AND TABLE_NAME = 'YOUR_TABLE';
Query Results

With the DMF set to run every 5 minutes, you can see that the profanity checker has detected 3 rows containing profanity in the sample table.
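For completeness, the profanity checker in this run was attached the same way as the earlier email example; a minimal sketch, where the table and column names are stand-ins:

-- Attach the LLM-powered DMF to a hypothetical free-text column
ALTER TABLE reviews ADD DATA METRIC FUNCTION
  profanity_checker ON (review_text);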

Going Further: Dynamic Real-time Data Quality Checks through String Substitution

Snowflake’s Cortex LLM capabilities unlock a new dimension in data quality validation, allowing for flexible, context-aware checks that adapt to dynamically changing parameters rather than relying on static rule-based metrics. This feature enables DMFs to leverage real-time contextual information, such as time ranges or threshold values, directly within LLM-powered queries via string substitution.

Consider a scenario where we need to validate the VIDEOS_UPLOADED column of a table, checking whether any values fall outside the minimum and maximum range for the previous week. Each week, the validation range updates dynamically, allowing us to assess new data against the latest weekly bounds.

Here’s how you could implement this dynamic, LLM-powered data quality check in Snowflake:

import pandas as pd
from datetime import datetime, timedelta
from snowflake.snowpark.context import get_active_session

session = get_active_session()

# Load the source table into pandas (table name is illustrative)
df = session.table('SAMPLE_TABLE').to_pandas()

# Calculate last week's date range
today = datetime.now()
last_week_start = today - timedelta(days=7)

# Filter for the last week (assumes 'date' is a datetime column)
filtered_df = df[(df['date'] >= last_week_start) & (df['date'] <= today)]

# Get min and max values for last week
min_filtered_value = filtered_df['VIDEOS_UPLOADED'].min()
max_filtered_value = filtered_df['VIDEOS_UPLOADED'].max()

# Inject last week's (dynamically changing) min and max values
# into the SQL statement via string substitution
sql_query = f"""CREATE OR REPLACE DATA METRIC FUNCTION
videos_uploaded_range_check (ARG_T table(ARG_C1 STRING))
RETURNS NUMBER AS
$$SELECT SUM(CAST(snowflake.cortex.complete(
'llama3.1-70b',
CONCAT(ARG_C1,
' Does this data fall within the range of
{min_filtered_value} and {max_filtered_value}?
Respond only as 1 if yes and 0 if no')) AS NUMBER(5,2)))
FROM ARG_T
$$;"""

session.sql(sql_query).collect()

In this code, we dynamically embed last week’s min and max values into the LLM-based validation query. By incorporating live data into the prompt, this approach allows for real-time, context-sensitive data quality assessments that evolve based on your data’s requirements.
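Because the bounds shift every week, a natural extension is to re-run this recreation logic on a schedule. Below is a hedged sketch using a Snowflake task; it assumes the Python above has been wrapped in a stored procedure named refresh_range_dmf_proc, which is a hypothetical name:

-- Hypothetical task: rebuild the DMF every Monday with fresh weekly bounds
CREATE OR REPLACE TASK refresh_range_dmf
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0 6 * * 1 UTC'
AS
  CALL refresh_range_dmf_proc();

-- Tasks start suspended; activate with: ALTER TASK refresh_range_dmf RESUME;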

This flexible and adaptive setup expands data quality monitoring capabilities by moving beyond traditional static checks to incorporate dynamic, real-world context. Whether you’re managing time-sensitive metrics or responding to rapidly changing data patterns, LLM-powered DMFs in Snowflake bring new possibilities for agile, intelligent data quality management.

Dynamic Data Masking + LLM: Toward Ultra-Granularity through AI

Another exciting application of LLMs is in Snowflake’s Dynamic Data Masking (DDM) policies. This feature masks specified columns at query runtime without altering the underlying data itself. The dynamic evaluation is based on the execution context, which can be almost anything, such as the following:

  1. User roles: the policy checks the role of the user executing the query.
  2. Query conditions: the policy evaluates conditions specified in the query logic.

Now, the user-roles condition is easy to understand. This demo showcases it through the simple query below:

CREATE OR REPLACE MASKING POLICY CUSTOMER_INFO_MASK AS (val string) RETURNS string ->
  CASE
    WHEN CURRENT_ROLE() IN ('AUDIT_USER') THEN val
    ELSE '*********'
  END;

In this example, a simple condition is set so that if the role is not AUDIT_USER, Snowflake masks the column. The policy is attached to a specific column of a specific table with this query:

ALTER TABLE IF EXISTS SAMPLE_TABLE MODIFY COLUMN CUSTOMER_DETAILS SET MASKING POLICY CUSTOMER_INFO_MASK;

With the policy applied, the CUSTOMER_DETAILS values that render in plain text for AUDIT_USER now appear as '*********' for every other role.

Now, where this gets interesting is in the second proposition, “Query conditions”: you can apply any kind of SQL logic to the data masking policy, including, you guessed it, LLMs.


CREATE OR REPLACE MASKING POLICY llm_dynamic_pii_masking_policy AS (val STRING) RETURNS STRING ->
  CASE
    WHEN CURRENT_ROLE() != 'AUDIT_USER' THEN
      snowflake.cortex.complete(
        'llama3.1-70b',
        CONCAT(val, ' This is just dummy data. Detect PII from this data and mask it as "***" when you see it. Return only the newly-masked data as the result. Do not return any other text.')
      )
    ELSE val
  END;

In this SQL call, just as with data metric functions, you can nest Cortex LLMs within Snowflake’s masking policy object.
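Attaching the policy works the same way as before; a minimal sketch that swaps the static mask on the sample table for the LLM-powered one (FORCE replaces the policy already set on the column):

-- Replace the existing mask with the LLM-powered policy in one step
ALTER TABLE IF EXISTS SAMPLE_TABLE MODIFY COLUMN CUSTOMER_DETAILS
  SET MASKING POLICY llm_dynamic_pii_masking_policy FORCE;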

The benefits of this approach speak for themselves. Compared to user-based masking or other kinds of conditional logic, Llama’s Generative AI model uses its reasoning and logic (not regex) to read the entire text and mask whatever it deems sensitive.

This level of granularity is what LLMs offer when applied to data masking functionality, and it can be applied at scale. The masking again executes at runtime, meaning the underlying data remains unchanged.

Conclusion: The Future of ETL is Here

The integration of LLMs into ETL pipelines marks a paradigm shift in how we approach data quality, observability, and automation. By baking AI capabilities directly into its platform, Snowflake is expanding the Modern Data Stack into new frontiers.

With the ability to seamlessly execute sophisticated checks like PII detection, profanity screening, and grammar validation through natural language prompts, Snowflake is setting a new standard. It’s not just about moving and transforming data anymore — it’s about making it intelligent and adaptable. As Snowflake pushes this evolution forward, the possibilities are limitless.
