How To Generate an Audit Table with Python and Matillion ETL for Snowflake

Techniques to improve the effectiveness of Python in your ETL/ELT workflows.

Stuart Werbner
Hashmap, an NTT DATA Company
9 min read · Feb 17, 2022


What is Matillion ETL for Snowflake? Why should I use it?

Matillion ETL for Snowflake is a powerful, graphical tool that can be used to both load and transform data for storage in the Snowflake Data Cloud. Matillion ETL provides a Python script orchestration component to aid in processing and storing data from external data sources into Snowflake. In this article, we will look at some techniques that can help you more effectively use Python within your ETL/ELT workflows.

The workflow we will study generates an Audit Table in Snowflake for a set of 200 tables that were ingested into Snowflake from a Microsoft SQL Server database; the ingestion itself was done in a separate workflow that falls outside the scope of this article. The audit table records the row counts of each table in the external Microsoft SQL Server database and its corresponding ingested table in Snowflake.

For the purposes of this example, the two sets of row counts are expected to be identical at the time of the audit. The audit table will also capture some other information about each ingested table such as the table name, schema name, and the date-time stamp recorded for the completion of the table ingestion.

So where do we start?

The first step in the audit workflow is to access a previously created Control Table in Snowflake that contains basic information on each table ingested from the external Microsoft SQL Server database including the table name, the name of its corresponding schema, and the date-time stamp recorded at the completion of the table ingestion.

For both convenience and speed in Matillion, these fields are loaded from the Control Table and stored in a Grid Variable (GV), which resembles a database table but resides in memory. Since it is known from the outset that there are only 200 ingested tables, one per row of the Control Table, storing the relevant fields in a GV is well within Matillion's capability.

Additionally, the Audit ID, a value labeled as EV_AUDIT_ID, is shared by all the tables for a particular ingestion job, and it is provided by an Environment Variable (EV) that’s set by an earlier ingestion workflow. As an EV, the Audit ID is available to all jobs within the illustrated workflow.

The remaining steps to generate the Audit Table entries are performed by iterating on each row of the GV initialized in the previous step. They are repeated for each table that was ingested from the SQL Server database into Snowflake.

Initial Iteration Steps: Perform Row Counts on External and Ingested Tables

  1. Perform a row count on an external table (residing in Microsoft SQL Server) that was ingested into Snowflake, and store the value in a Snowflake Temporary Table that is cleared for each iteration. In order to get an accurate row count audit, we require that no modifications be made to the external table once it’s been ingested into Snowflake. Once this audit step is completed, the external table may be safely modified.
  2. The external table row count is then moved from the Snowflake Temporary Table into a Job Variable (JV) JV_SRC_ROW_CNT for easy access by subsequent steps.
  3. Perform a row count on the corresponding ingested table in Snowflake, and store this value directly in another JV, JV_TGT_ROW_CNT. The “no modifications” rule mentioned above applies here as well and is equally necessary for an accurate row count audit.
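The three steps above can be sketched in plain Python. This is a stand-in only: an in-memory sqlite3 table plays the role of the external SQL Server table and its ingested copy, and ordinary variables play the roles of the Snowflake Temporary Table and the JVs (in the real job, Matillion query components run the COUNT(*) queries and populate the variables):

```python
import sqlite3

# Stand-in for the external SQL Server table (the real job queries SQL Server
# and Snowflake through Matillion components, not sqlite3).
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE customers (id INTEGER)')
cur.executemany('INSERT INTO customers VALUES (?)', [(i,) for i in range(200)])

# Step 1: count the rows of the external table; in Matillion this value lands
# in a Snowflake Temporary Table, then (step 2) in the JV_SRC_ROW_CNT job variable.
cur.execute('SELECT COUNT(*) FROM customers')
jv_src_row_cnt = cur.fetchone()[0]

# Step 3: count the rows of the corresponding ingested table and store the
# result directly in JV_TGT_ROW_CNT (the same stand-in table here, for brevity).
cur.execute('SELECT COUNT(*) FROM customers')
jv_tgt_row_cnt = cur.fetchone()[0]
```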

Matillion Python Script Component 1: Compare the Row Counts

Now that we have the row counts for both the external table and its ingested copy, we can use Python to set another JV, JV_IS_MATCH, to the Boolean result of comparing the two row counts, JV_SRC_ROW_CNT and JV_TGT_ROW_CNT. This value is eventually written to the Audit Table as part of the row insert for this iteration.

Notice that the Python code for this module contains a print statement to aid in debugging. Rather than adding the print statement unconditionally, we guard it with a test of a pre-defined JV set within the Matillion environment: JV_PY_DEBUG (set to 1 for True and 0 for False). We do the same thing in the subsequent Python script modules. That way, via Matillion and independent of any Python code, we can control all of the Python debug print statements throughout all of the Python script modules with a single JV that's local to this orchestration workflow.
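A minimal sketch of the comparison and the guarded debug print. The function below is illustrative: in the actual component the counts arrive through the Matillion job variables JV_SRC_ROW_CNT and JV_TGT_ROW_CNT, and the result would be written back to JV_IS_MATCH through Matillion's variable API rather than returned:

```python
def compare_row_counts(src_row_cnt, tgt_row_cnt, py_debug=0):
    """Return 1 if the source and target row counts match, otherwise 0."""
    is_match = 1 if int(src_row_cnt) == int(tgt_row_cnt) else 0
    if int(py_debug) == 1:
        # Debug printing is gated by the JV_PY_DEBUG job variable (1 = on, 0 = off)
        print('JV_SRC_ROW_CNT=%s JV_TGT_ROW_CNT=%s JV_IS_MATCH=%s'
              % (src_row_cnt, tgt_row_cnt, is_match))
    return is_match
```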

Matillion Python Script Component 2: Organize the Audit Table Row Attributes

In the interest of making the Matillion Python code more readable and maintainable, we organize the Audit Table column names with their corresponding values for this particular iteration into a Python dictionary (dict), leveraging the power of this built-in Python data structure.

Each entry in the dict is a key-value pair: the key is a string containing the name of the column, and the corresponding value is the attribute value for that column. We then load the column names and their corresponding values from the Python dict into separate Matillion Grid Variables (GVs), labeled GV_COLUMN_NAMES and GV_TABLE_ATTRIBUTES, while preserving the correspondence between them.
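A sketch of this step, with hypothetical column names and values (the real column set comes from the workflow described above). Deriving both lists from the same key list keeps names and values in one-to-one correspondence; in Matillion the two lists would then be written to the grid variables through the component's variable API:

```python
from datetime import datetime

# Hypothetical Audit Table attributes for one iteration of the workflow
audit_row = {
    'TABLE_NAME': 'CUSTOMERS',
    'SCHEMA_NAME': 'DBO',
    'SRC_ROW_CNT': 200,
    'TGT_ROW_CNT': 200,
    'IS_MATCH': 1,
    'INGESTED_AT': datetime(2022, 2, 17, 12, 0, 0).isoformat(),
}

# Build the values list from the same key list so the two lists stay aligned
# (destined for GV_COLUMN_NAMES and GV_TABLE_ATTRIBUTES, respectively).
gv_column_names = list(audit_row.keys())
gv_table_attributes = [audit_row[name] for name in gv_column_names]
```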

Note: This Python script component also captures the Audit ID from the EV_AUDIT_ID environment variable.

Matillion Python Script Component 3: Insert Audit Table Row Attributes

In the final Python script module within the iteration, we perform a SQL Insert operation to add a new row to the table containing the audit information gathered from the previous steps.

Since the actual Insert operation could fail for various reasons, we embed it in a try-except block, both to document this possibility and to produce better diagnostic messages. If the Insert operation fails, we catch the exception thrown by the cursor, print a custom descriptive diagnostic message, and then re-raise the original exception.
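A sketch of the guarded Insert. An in-memory sqlite3 cursor stands in for the database cursor that Matillion's Jython environment provides, and the helper builds the Insert statement from the column-name and value lists described above:

```python
import sqlite3

def insert_audit_row(cursor, insert_table, column_names, values):
    """Generate and issue the SQL Insert; print diagnostics and re-raise on failure."""
    placeholders = ', '.join(['?'] * len(values))
    sql = 'INSERT INTO %s (%s) VALUES (%s)' % (
        insert_table, ', '.join(column_names), placeholders)
    try:
        cursor.execute(sql, values)
    except Exception as exc:
        # Print a descriptive message, then re-raise the original exception so
        # the orchestration job terminates along the error path.
        print('Failed to insert row into %s: %s' % (insert_table, exc))
        raise
```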

Doing this also ensures the Audit Table Orchestration Job terminates along the so-called error path, clearly indicating that an insertion error occurred while attempting to modify the Audit Table in Snowflake. The so-called error path is FAIL_Insert_Audit_Table_Entry within the iteration and FAIL_Modify_Ingestion_Audit_Table at the top level. We have designed the Python code to continue the Audit Table row insertions unless such a failure occurs.

Note that we do not raise an error when the external and ingested row counts for a table do not match. This leaves subsequent workflows free to process the row count mismatch condition independently.

Audit Table processing along the so-called success path continues provided that every processed row is successfully inserted into the Audit Table. The success path is PASS_Insert_Audit_Table_Entry within the iteration and PASS_Modify_Ingestion_Audit_Table at the top level.

Because the operation of inserting a row in a Snowflake table is potentially a common one, we’ve designed this entire Python script module to be reusable, provided that input parameters to the module are properly configured:

  • JV_PY_DEBUG — Previously mentioned JV that controls the printing of debug messages.
  • JV_INSERT_TABLE — The fully qualified name of the Snowflake table into which a row is inserted.
  • GV_COLUMN_NAMES — List of names of columns whose corresponding values (see GV_TABLE_ATTRIBUTES) are to be inserted.
  • GV_TABLE_ATTRIBUTES — List of attribute values (see GV_COLUMN_NAMES) to be inserted.

Note: This parameterized approach is currently the only way to reuse a particular pre-defined Matillion Orchestration component.

Additional Python Tips

Did you notice that we broke up the Python code for the iteration workflow into three separate Python script components? The purpose of this is to improve the readability and maintainability of the Matillion Orchestration job by organizing the Python code into separate logical components, each performing its own limited and distinct function.

In the last Python script module, we added a method that forces all character data inserted via the Matillion Jython code to be UTF-8 encoded; otherwise, an error occurs that prevents the insertion of the row data into the Audit Table. This condition appears to be a limitation of Matillion Jython, which is needed to perform the SQL Insert operations. The UTF-8 conversion is written as a separate method, which allows it to be reused more easily both inside and outside of this Orchestration Job and also improves readability. For the same reasons, we also placed the code that generates the SQL Insert query and the code that issues it in separate methods.
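A Python 3 sketch of such an encoding helper. This is an assumption about the shape of the original code: under Matillion's Jython 2 runtime the type check would target unicode rather than str, but the idea is the same:

```python
def force_utf8(value):
    """Return character data UTF-8 encoded; pass non-string values through unchanged."""
    if isinstance(value, str):
        return value.encode('utf-8')
    return value

def force_utf8_row(values):
    # Apply the conversion to every attribute value destined for the row insert
    return [force_utf8(v) for v in values]
```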

Wrap-Up

Python can be an effective tool for your ETL/ELT workflows, and I hope this walkthrough helped you in understanding its capabilities, especially in the realm of using it with Matillion ETL for Snowflake. Are there any aspects of Python (or Matillion ETL for Snowflake) that you want to understand better? Leave a comment to let me know!

Ready to Accelerate Your Digital Transformation?

At Hashmap, an NTT DATA Company, we work with our clients to build better, together. We are partnering with companies across a diverse range of industries to solve the toughest data challenges and design and build data products — we can help you shorten time to value!

We offer a range of enablement workshops and assessment services, data modernization and migration services, and consulting service packages for building new data products as part of our service offerings. We would be glad to work through your specific requirements. Connect with us here.

Stuart Werbner is a Python developer and Senior Software Engineer with Hashmap, an NTT DATA Company, providing Data, Cloud, IoT, and AI/ML solutions and consulting expertise across industries with a group of innovative technologists and domain experts accelerating high-value business outcomes for our customers. Have a question? Ask it in the comments or reach out to the Hashmap team for more info.
