Regression testing in Snowflake using data-diff

Published in

Inside Business Insider

4 min readJul 25, 2023

Once upon a time there was a data engineer named Naghul who worked for a company that used Snowflake as its main data warehousing solution. Naghul’s job was to design, build, and maintain the data pipelines that kept the company’s data flowing smoothly and accurately.

One day, Naghul received a report from one of his colleagues that revealed discrepancies in some of the data. Some of the numbers did not add up, which created confusion among the teams who used the data.

Naghul did some digging and found there were recent changes made to the data pipelines that potentially caused the issues.

Fortunately, he remembered learning about regression testing, a method of testing involving re-running a set of tests to make sure that changes to the code do not cause any unintended consequences. Better yet, he found a helpful Python package called data-diff to automate the regression testing process for Snowflake.

Naghul installed data-diff through the Python package manager to compare two sets of data.

pip3 install data-diff

In addition, he used a TOML configuration file called config.toml that included the necessary database connection information and run parameters.

# config.toml
[database.REPORT_DATASTORE_DB]
driver = "snowflake"
database = "REPORT_DATASTORE_DB"
user = "snowflakeuser"
password = "snowflakepassword"
# You can either store the password as a raw text string or an environmental variable.
account = "np12345.us-central1.gcp"
schema = "PUBLIC"
warehouse = "LOAD"
role = "ACCOUNTADMIN"
### RUN PARAMS
# Specify the run parameters for a run called xdb_validation
[run.regression_testing]
# Source 1 ("left")
1.database = "REPORT_DATASTORE_DB"
1.table = "TEST_TABLE"
# Source 2 ("right")
2.database = "REPORT_DATASTORE_DB"
2.table = "PROD_TABLE"

Once Naghul set up his configuration file, he used the data-diff CLI to validate the data.

data-diff \
 - conf ~/config_files/datadiff.toml \
 - run regression_testing \
-k pk_id

This command compared the data in the TEST_TABLE and PROD_TABLE tables and used the primary key id, pk_id, to match rows between the two tables.

The output showed the pk_id value 1c44f14f-4164–4e6f-b0cf-cef0193f0b67 was not present in the TEST_TABLE, but was found in the PROD_TABLE. Similarly, the pk_id value 44393a6c-fb47–4310–99ae-343865dca1a0 was present in the TEST_TABLE, but not in PROD_TABLE.

While this provides a visual validation, Naghul wanted to materialize the results. So he appended the command with -m REPORT_DATASTORE_TEST_DB.PUBLIC.TESTRESULTS. He noted results can be materialized in any database and schema, not necessarily in the same database.

data-diff \
 - conf ~/config_files/datadiff.toml \
 - run regression_testing \
-k pk_id
-m REPORT_DATASTORE_TEST_DB.PUBLIC.TESTRESULTS

Naghul also added a couple of other columns, like subscription_id and product_id, to the validation.

data-diff \
 - conf ~/config_files/datadiff.toml \
 - run regression_testing \
-c subscription_id \
-c product_id \
-k pk_id
-m REPORT_DATASTORE_TEST_DB.PUBLIC.TESTRESULTS

The data looks good. He wonders whether he can get a data difference score and other statistics, so he appended -s to the CLI.

data-diff \
 - conf ~/config_files/datadiff.toml \
 - run regression_testing \
-c subscription_id \
-c product_id \
-k pk_id
-m REPORT_DATASTORE_TEST_DB.PUBLIC.TESTRESULTS
-s

The results of above:

1008 rows in table A
1008 rows in table B
1 rows exclusive to table A (not present in B)
1 rows exclusive to table B (not present in A)
0 rows updated
1007 rows unchanged
0.10% difference score
Extra-Info:
rows_downloaded = 53

After reading the results, he is now confident that only the intended data has been modified and there are no regressions in the dataset.

Thanks to his use of regression testing, data-diff, and the provided TOML configuration file, Naghul quickly identified and resolved the data issues, and he can easily re-run the same comparison in the future to make sure the data remains accurate and up-to-date.

Additional use cases for data-diff

Validate the success of data migrations, replication, and pipelines.
Verify data pipelines that move data from a relational database to a warehouse or data lake.
Alert your team if data is missing. Debug complex data pipelines and locate lost rows efficiently using data-diff.
Compare tables within the same database to confirm successful transformations.
Validate the stability of downstream tables that depend on upstream models you have refactored.
Conduct better code reviews by running a diff to catch errors that may have been missed.

References

Regression testing in Snowflake using data-diff

Additional use cases for data-diff

Published in Inside Business Insider

Written by Vinoth Nageshwaran