Minimalist Snowflake Table Compare (using data-diff)

The most exciting part of being a Snowflake Data Superhero!

Here is a fun little project for anyone wanting to dive deep into Streamlit and data-diff. The objective is to showcase how seamless it is to integrate any python library with Snowflake using Streamlit. Once Streamit is available natively in Snowflake, the entire ‘Data App’ will be inside Snowflake!

We are only scratching the surface of what we can accomplish with Streamlit, Snowpark, and Snowflake together.

Objective

We aim to create a bare, skin and bone, Snowflake Data Integrity checker in Streamlit. We will use a fantastic open-source tool by the folks at Datafold called data-diff.

The purpose is to check if the data in the target table is in sync with the source table.

Use-Cases

  1. Spot check for any table across different Snowflake environments (same account) or across two different Snowflake accounts.
  2. Quickly get the delta between two tables for further analysis and reconciliation.
  3. Keep constant checks on important tables and have a single view to check if the sync is working as expected.

MVP Limitations

The project is meant as a quickstart to help developers. Consequently, this is a minimal product with limitations.

  1. This only supports comparing one table at a time (more on this later).
  2. The current version does not have an automatic reconciliation mechanism in place.

Assumptions

For the purpose of this demo, we are going to make some assumptions:

  1. Our source and our target is Snowflake tables (although you can have different databases just by minor tweaking)
  2. We are assuming the tables have synced already.
  3. We only want to spot-check two Snowflake tables for variance.
  4. These tables consist of primary keys (again, not necessary if you use data-diff features)

In Scope

  1. Streamlit and data-diff application which shows the difference between the synced data.
  2. Ability to call this application and check as-of-now status.
  3. Ability to materialize the delta to a Snowflake table for visibility and reconciliation.
  4. Check missing rows, additional rows, and delta rows.

Out of Scope

For this utility, the following steps are out of scope:

  1. Setting up Snowflake accounts. You can check here.
  2. Comparing two different Snowflake accounts (a matter of changing the connection parameters).

About Data-Diff by Data Fold

As evident, this application won’t be possible without the data-diff tool.

As per the doc:

data-diff enables data professionals to detect differences in values between any two tables. It’s fast, easy to use, and reliable. Even at massive scale.

With our app, we are just over the surface of data-diff. This tool has massive capabilities and it’s worth exploring. Thanks to the team at Datafold for open-sourcing it.

Below are some use cases for data-diff. More details here.

  1. Verify data migrations.
  2. Verify data pipelines.
  3. Maintain data integrity SLOs.
  4. Debug complex data pipelines.
  5. Detect hard deletes for an updated_at-based pipeline.
  6. Make your replication self-healing.

High-Level Code

Everything below is explaining the code. To get this up and running in your environment, all you have to do is change the connection setting and you are good. Here is the repo with the code:

Note: I am not an expert Python developer and hence the code is not optimal (using functions, mains, error handling, etc.)

  1. Create a Secrets file. Note to be careful with these files and exclude them while checking the code in a public repository.
[snowflake]
user = "yourusername"
password = "secret"
account = "abc12355"
warehouse = "DIFF_WH"
database = "CUSTOMER_DB"
schema = "CUST"
role = 'DIFF_RW'

2. Imports: Next step is to import the modules

import logging
import tzlocal
import pandas as pd
import snowflake.connector
import streamlit as st
from data_diff import connect_to_table, diff_tables
import logging
from plotly.subplots import make_subplots
import plotly.graph_objects as go

3. Configurations: We will then want to set certain configurations and add some markdown for our Streamlit app to look more appealing.

Please veiw the code in git for details.

4. Connection Establish: Next we want to establish a connection to Snowflake. Pretty well documented in Streamlit doc, but here is a snippet of the code.

@st.cache_resource
def init_connection():
return snowflake.connector.connect(
**st.secrets["snowflake"], client_session_keep_alive=True
)
conn = init_connection()

5. The Streamlit app

Here is how our Streamlit app should look. The idea is to give user the option to select the source and target details. It also provides a summary view of the counts of mismatch and then the option to download the file

The Final Application

Wait for it!

Next Steps:

I have few ideas on how to improve the App further:

  1. Scheduled data diff to run at certain interval.
  2. Provide capability to find data diff across multiple tables and schemas.
  3. Add other databases as source to do cross database diff.

Code Git Repo: https://github.com/SudhenduP/snowflake-data-diff

Datafold data-diff: https://github.com/datafold/data-diff

Streamlit: https://streamlit.io/

Snowflake Quickstart: https://quickstarts.snowflake.com/

--

--

Sudhendu
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Mostly running, hiking. Snowflake Data Superhero. Quora Top Writer २०१४. Work @ kipi.bi. Views are my own।