How to use YData Profiling in Snowflake

Profile your Snowflake data with ease

Image of a polar bear profiling data, courtesy of DALL-E 2024

Introduction

YData Profiling is a powerful Python library for creating comprehensive profiling reports for pandas DataFrames. It is great for exploratory data analysis: with just a couple of commands, it produces visually appealing HTML reports showing variable distributions, correlations, missing-data patterns, and samples of the actual data.

If your data is stored in Snowflake, Snowpark for Python enables you to use ydata-profiling natively, without moving the data out of Snowflake. The package is part of the Snowflake Anaconda channel, meaning it is readily available to code running on a Snowflake virtual warehouse.
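You can confirm the package is available in your account by querying the INFORMATION_SCHEMA.PACKAGES view; a quick check might look like this:

```sql
-- List the versions of ydata-profiling available in the Anaconda channel
SELECT package_name, version, runtime_version
FROM information_schema.packages
WHERE language = 'python'
  AND package_name = 'ydata-profiling';
```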

In this article, I walk you through a basic example of using this package to create a profile report for a Snowflake table and save it to a Snowflake stage.

Step 1.

Let’s create some sample data. We’ll use the Python Faker package to generate synthetic PII.

-- Create database and schema
CREATE OR REPLACE DATABASE test_db;
USE SCHEMA test_db.public;
USE WAREHOUSE analytic_wh;

-- Create a function to make some sample data
CREATE OR REPLACE FUNCTION test_db.public.faker(a STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('faker')
HANDLER = 'udf'
AS $$
from faker import Faker

def udf(a):
    fake = Faker()
    if a == 'first_name':
        return fake.first_name()
    elif a == 'last_name':
        return fake.last_name()
    elif a == 'ssn':
        return fake.ssn()
    else:
        return 'other'
$$;

-- Create a table for sample PII data
CREATE OR REPLACE TABLE test_db.public.sample_pii AS
SELECT
    faker('first_name') AS first_name,
    faker('last_name') AS last_name,
    faker('ssn') AS ssn,
    UNIFORM(0, 100, RANDOM()) AS age
FROM TABLE(GENERATOR(ROWCOUNT => 1000)) v;

-- View sample data
SELECT * FROM test_db.public.sample_pii LIMIT 100;
Sample synthetic PII data

Step 2.

Create a stage for outputting profile reports. This can be an internal Snowflake stage, as in my example, or an external stage pointing to cloud storage such as Amazon S3, Azure Blob Storage, or Google Cloud Storage.

-- Create stage - This is where the HTML report will be saved
CREATE STAGE test_db.public.profile_reports_stg
DIRECTORY = (ENABLE = TRUE)
ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

Step 3.

Create a stored procedure that runs the profile with ydata-profiling. The procedure expects the database, schema, and table names, as well as an output file name and the target stage where the profile report will be saved. For ease of management, the provided output file name is concatenated with the table name and the execution date.

-- Create a procedure for ydata-profiling
CREATE OR REPLACE PROCEDURE test_db.public.ydataprofiling(
    database_name STRING,
    schema_name STRING,
    table_name STRING,
    output_file_name STRING,
    target_stage STRING)
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = '3.10'
PACKAGES = ('snowflake-snowpark-python', 'pandas', 'ydata-profiling')
HANDLER = 'profile_table'
AS
$$
import os
from datetime import datetime

from ydata_profiling import ProfileReport

def profile_table(session, database_name, schema_name, table_name, output_file_name, target_stage):
    # Format the current date
    current_date = datetime.now().strftime("%Y%m%d")

    # Build a report file name that includes the table name and the date of execution
    report_file_name = f"{table_name}_{output_file_name}_{current_date}.html"
    report_file_path = f"/tmp/{report_file_name}"

    # Convert the Snowflake table to a pandas DataFrame
    df = session.table(f"{database_name}.{schema_name}.{table_name}").to_pandas()

    # Generate the profile report
    profile = ProfileReport(df, title=f"Table Profile for {table_name}")
    profile.to_file(report_file_path)

    # Upload the report to the target stage; auto_compress=True stores it gzip-compressed (.html.gz)
    session.file.put(report_file_path, f"@{target_stage}", auto_compress=True, overwrite=True)

    # Clean up the local report file
    os.remove(report_file_path)

    return f"Profile report {report_file_name} generated and uploaded successfully."
$$;
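The report-naming convention inside the procedure is easy to verify in plain Python. The helper below is not part of the procedure, just a standalone sketch of the same string logic:

```python
from datetime import datetime

def make_report_name(table_name, output_file_name, when=None):
    """Reproduce the procedure's naming: <table>_<file>_<YYYYMMDD>.html"""
    when = when or datetime.now()
    return f"{table_name}_{output_file_name}_{when.strftime('%Y%m%d')}.html"

# Example: a report for sample_pii generated on 2024-05-01
print(make_report_name("sample_pii", "rpt", datetime(2024, 5, 1)))
# → sample_pii_rpt_20240501.html
```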

Step 4.

Call the stored procedure.

-- Call the procedure
CALL test_db.public.ydataprofiling(
    'test_db',
    'public',
    'sample_pii',
    'rpt',
    'profile_reports_stg');

-- List files in the stage
LIST @test_db.public.profile_reports_stg;
Profile report is saved to a Snowflake stage

To download the report from the stage, navigate to the stage in the Snowflake UI. Alternatively, you can use the GET command from SnowSQL, or Snowpark's file API from local Python, to move files from a Snowflake stage to your local environment. Because the procedure uploads with auto_compress=True, the staged file is gzip-compressed and needs to be decompressed before viewing in a browser.
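As a sketch, the SnowSQL route might look like this (the local path is just an example):

```sql
-- Run from SnowSQL: download matching staged reports to a local directory
GET @test_db.public.profile_reports_stg file:///tmp/reports/
  PATTERN = '.*sample_pii.*';
```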

Download profile report from Snowflake UI

That’s it! Your report will look something like this:

Profile report snapshot

Conclusion

You can easily profile your Snowflake data with the ydata-profiling Python package without moving the data out of Snowflake. If you need to run the profile over many tables at once, this procedure can be modified to loop through a list of tables (ask me for an example!). You can also modify it to profile a sample of the data, and schedule it to run regularly using Snowflake tasks.
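As a hint of what the multi-table variant could look like, here is a hypothetical pure-Python sketch. The names are illustrative, not part of the article's code; profile_fn stands in for the procedure's per-table logic:

```python
def profile_many(profile_fn, tables, output_file_name, target_stage):
    """Apply a per-table profiling callable to (database, schema, table) tuples."""
    return [
        profile_fn(db, schema, table, output_file_name, target_stage)
        for db, schema, table in tables
    ]

# Illustration with a stub callable instead of a real Snowflake session
messages = profile_many(
    lambda db, schema, table, f, s: f"profiled {db}.{schema}.{table}",
    [("test_db", "public", "sample_pii"), ("test_db", "public", "other_table")],
    "rpt",
    "profile_reports_stg",
)
print(messages)
```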


Daria Rostovtseva
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Daria is a Sales Engineer at Snowflake helping government customers better serve the public by leveraging data