Unit testing approach for SQL based data platform development — Part 3

Murali Suraparaju
6 min read · Feb 16, 2022


In the previous article, we examined techniques for unit testing on Databricks. In this article, we will see how to do the same in Snowflake. As with Databricks, the sample dataset provided with the Snowflake free trial account has been used for developing the unit tests. The code shown in this article can be found here.

Figure 1: Sample dataset

Before we get into the details, it is worth noting a few capabilities of Snowflake and its differences from Databricks.

Capabilities

Initially, JavaScript was the only language Snowflake supported for writing stored procedures. Snowflake Scripting, an extension to Snowflake SQL that adds support for procedural logic, is available as a preview feature (as of Feb 2022). It allows stored procedures to be written much as they would be in a traditional database, an approach that is more familiar to SQL developers. Hence, most of the techniques SQL developers have used earlier carry over to Snowflake.
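To give a flavor of the syntax, here is a minimal, illustrative Snowflake Scripting block (not part of the article's code) that can be run as-is in Snowsight. It counts the rows in the ORDERS table of the sample dataset and branches on the result; the table path is the standard trial-account sample share.

```sql
-- Illustrative Snowflake Scripting anonymous block (runs directly in Snowsight).
DECLARE
  order_count INTEGER;
BEGIN
  SELECT COUNT(*) INTO :order_count
  FROM snowflake_sample_data.tpch_sf1.orders;

  IF (order_count > 0) THEN
    RETURN 'Sample data is available';
  ELSE
    RETURN 'Sample data is missing';
  END IF;
END;
```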

Additionally, the recently released Snowpark enables developers to use the DataFrame programming paradigm that is popular in Spark and Pandas. As of this writing, Snowpark supports Scala; support for Python has not yet been made generally available.

Differences with Databricks

  1. Unlike Databricks, there is no support for sub-folders, so organizing code in worksheets is not a viable option. We could create separate schemas (one for the main driver stored procedures and one for the unit tests), but that might be overkill. Instead, we can stick to simple naming conventions for the unit test stored procedures.
  2. A worksheet cannot be invoked from a scheduler or an API; the code must be wrapped in a stored procedure. Hence, unlike a Databricks notebook, a Snowflake worksheet cannot (yet) be readily productionized.
  3. Unlike Databricks, we cannot switch between a general-purpose language such as Python or Scala and SQL within a worksheet or stored procedure. Procedural logic has to be expressed using Snowflake Scripting, the extension to Snowflake SQL described above.

Hence, the techniques used for Snowflake differ from those used for Databricks. The following figure shows the approach that can be used for Snowflake.

Figure 2: Snowflake program structure

The newer Snowsight web UI has been used for developing the stored procedures, but the older web UI can also be used. The only thing to keep in mind is that the older web UI needs additional "$$" delimiters around the procedure body when writing stored procedures in Snowflake Scripting (SQL), as shown below.
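For example, a procedure that can be written without the extra delimiters in Snowsight needs its body wrapped in $$ ... $$ in the classic UI or SnowSQL. The procedure below is only a placeholder to show the delimiters:

```sql
-- Classic web UI / SnowSQL: the Snowflake Scripting body must be wrapped in $$ ... $$.
-- (In Snowsight, the same procedure can be written without the $$ delimiters.)
CREATE OR REPLACE PROCEDURE dummy_proc()
RETURNS VARCHAR
LANGUAGE SQL
AS
$$
BEGIN
  RETURN 'Hello from Snowflake Scripting';
END;
$$;
```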

Functions to be tested

These are written as stored procedures, each containing an atomic piece of business logic. Each one acts on a temporary table (the input) and produces another temporary table. The main driver sets up the input temporary table from the source data and writes the final results to persistent tables.

Figure 3: Stored procedure to transform data
Figure 4: Stored procedure to calculate order values by priority
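The figures above are screenshots, so here is a sketch of what such procedures could look like, reconstructed from the description. TRANSFORM_ORDER_DATA and RAW_ORDERS come from the article; the output table names, the derived column, and the specific TPC-H columns are assumptions made for illustration.

```sql
-- Sketch of Figure 3: transform the session-scoped RAW_ORDERS temporary table
-- into another temporary table (column choices are illustrative).
CREATE OR REPLACE PROCEDURE transform_order_data()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
  CREATE OR REPLACE TEMPORARY TABLE transformed_orders AS
    SELECT o_orderkey,
           o_orderpriority,
           o_totalprice,
           YEAR(o_orderdate) AS order_year   -- example derived column
    FROM raw_orders;
  RETURN 'SUCCESS';
END;

-- Sketch of Figure 4: aggregate order values by priority into a temporary table.
CREATE OR REPLACE PROCEDURE calc_order_value_by_priority()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
  CREATE OR REPLACE TEMPORARY TABLE order_value_by_priority_tmp AS
    SELECT o_orderpriority,
           SUM(o_totalprice) AS total_order_value
    FROM transformed_orders
    GROUP BY o_orderpriority;
  RETURN 'SUCCESS';
END;
```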

Main data setup

This sets up the data by creating a temporary table over the existing source table; the same can be done for multiple tables. This step is essential because the called procedure acts on the same table name in both the main flow and the unit test flow. Since this is a temporary table, its scope is limited to the session and there is no Fail-safe storage overhead. Additionally, the Time Travel period can be set to 0 in the table definition (data_retention_time_in_days = 0) to eliminate storage overheads entirely. In the code below, RAW_ORDERS is the temporary table that the procedure TRANSFORM_ORDER_DATA() operates on.

Figure 5: Main data setup
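A minimal sketch of such a data setup procedure follows. RAW_ORDERS and the data_retention_time_in_days setting come from the article; the procedure name and the use of the TPC-H ORDERS sample table are assumptions.

```sql
-- Sketch of Figure 5: create the session-scoped input table from the source data.
-- DATA_RETENTION_TIME_IN_DAYS = 0 disables Time Travel to avoid storage overhead.
CREATE OR REPLACE PROCEDURE setup_order_data()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
  CREATE OR REPLACE TEMPORARY TABLE raw_orders
    DATA_RETENTION_TIME_IN_DAYS = 0
  AS
    SELECT * FROM snowflake_sample_data.tpch_sf1.orders;
  RETURN 'SUCCESS';
END;
```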

Main driver

This invokes the stored procedures for setting up the data, transforming it, and computing order totals by priority. After the procedures have run, the results are stored in persistent tables. A simple CREATE OR REPLACE is shown here, but it could equally be a MERGE INTO a target table based on a key.

Figure 6: The main driver
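A sketch of the main driver based on the description above; apart from TRANSFORM_ORDER_DATA and RAW_ORDERS, the procedure and table names are assumed for illustration.

```sql
-- Sketch of Figure 6: the main driver calls the setup and transformation
-- procedures, then persists the results.
CREATE OR REPLACE PROCEDURE main_driver()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
  CALL setup_order_data();              -- Figure 5: builds the RAW_ORDERS temp table
  CALL transform_order_data();          -- Figure 3
  CALL calc_order_value_by_priority();  -- Figure 4
  -- Persist the results; a MERGE INTO a target table keyed on o_orderpriority
  -- could be used here instead of CREATE OR REPLACE.
  CREATE OR REPLACE TABLE order_value_by_priority AS
    SELECT * FROM order_value_by_priority_tmp;
  RETURN 'SUCCESS';
END;
```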

Unit test driver

Now that we have covered the main flow, let us look at the test flow. Just like the main flow, there is a unit test driver, which invokes the data setup and unit test stored procedures in sequence. This driver also creates a table to hold test results: either a permanent or a transient (no Fail-safe) table that keeps track of unit test runs. In the example below, UNIT_TEST_RESULTS is the table that stores the test results.

Figure 7: Unit test driver
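A sketch of the unit test driver based on the description; UNIT_TEST_RESULTS comes from the article, while the other procedure names and the results-table columns are assumptions.

```sql
-- Sketch of Figure 7: the unit test driver creates the results table and then
-- runs the test data setup and the unit tests in sequence.
CREATE OR REPLACE PROCEDURE unit_test_driver()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
  -- Transient table: kept across sessions to track test runs, but with no Fail-safe.
  CREATE TRANSIENT TABLE IF NOT EXISTS unit_test_results (
    test_name     VARCHAR,
    test_result   VARCHAR,
    run_timestamp TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
  );
  CALL setup_unit_test_data();            -- Figure 8
  CALL test_transform_order_data();       -- Figure 9
  CALL test_order_value_by_priority();    -- Figure 10
  RETURN 'SUCCESS';
END;
```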

Unit test data setup

This creates the test dataset that we want to use for testing. A temporary table with the same name as in the main flow is used. This table can be populated either with static INSERT INTO ... VALUES statements or with INSERT INTO ... SELECT from another table. The second approach is shown here.

Figure 8: The unit test data setup
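A sketch of the test data setup using the INSERT INTO ... SELECT approach described above; the filter used to pick a small, known slice of the sample data is an assumption.

```sql
-- Sketch of Figure 8: build a RAW_ORDERS temp table with a small, known dataset
-- so that the procedures under test run unchanged against it.
CREATE OR REPLACE PROCEDURE setup_unit_test_data()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
  -- Create an empty copy of the source table with the same structure
  CREATE OR REPLACE TEMPORARY TABLE raw_orders AS
    SELECT * FROM snowflake_sample_data.tpch_sf1.orders LIMIT 0;
  -- Populate it from another table (INSERT INTO ... SELECT)
  INSERT INTO raw_orders
    SELECT * FROM snowflake_sample_data.tpch_sf1.orders
    WHERE o_orderpriority = '1-URGENT'
    LIMIT 100;
  RETURN 'SUCCESS';
END;
```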

Unit tests

Now that the flow has been established, let us look at the unit tests. These are stored procedures that invoke the procedure under test and contain the necessary assertion code. Just like in the Databricks example, the result of each test is stored in the UNIT_TEST_RESULTS table to be processed later; that is, individual tests do not fail outright when an expected condition is not met. The test driver can look at the results and take the necessary action. Also as in the Databricks example, a combination of CTEs (WITH clause) and conditional expressions (IFF, CASE .. WHEN .. THEN) is used to assert on the data. This is usually the hardest part of SQL-based unit testing for developers from a Java/Scala background, as getting this logic right takes some practice.

Figure 9: Test case for checking data transformation
Figure 10: Test case for checking results of aggregation
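A sketch of one such test, combining a CTE with IFF() to record a PASSED/FAILED row instead of raising an error. The procedure and table names (other than UNIT_TEST_RESULTS) and the specific assertion are illustrative assumptions.

```sql
-- Sketch of Figure 10: run the aggregation procedure and assert on its output,
-- recording the outcome in UNIT_TEST_RESULTS rather than failing the procedure.
CREATE OR REPLACE PROCEDURE test_order_value_by_priority()
RETURNS VARCHAR
LANGUAGE SQL
AS
BEGIN
  CALL transform_order_data();
  CALL calc_order_value_by_priority();

  INSERT INTO unit_test_results (test_name, test_result)
    WITH expected AS (
      SELECT o_orderpriority, SUM(o_totalprice) AS expected_value
      FROM raw_orders
      GROUP BY o_orderpriority
    ),
    mismatches AS (
      SELECT COUNT(*) AS mismatch_count
      FROM expected e
      LEFT JOIN order_value_by_priority_tmp a
        ON e.o_orderpriority = a.o_orderpriority
      WHERE a.total_order_value IS NULL
         OR a.total_order_value <> e.expected_value
    )
    SELECT 'test_order_value_by_priority',
           IFF(mismatch_count = 0, 'PASSED', 'FAILED')
    FROM mismatches;

  RETURN 'COMPLETED';
END;
```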

Tracking test results

When the test driver is invoked, the test results are captured in the UNIT_TEST_RESULTS table.

Figure 11: Test results
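Reviewing the results is then a simple query; the column names follow the results-table sketch shown earlier.

```sql
-- Inspect the outcome of the latest test runs
SELECT test_name, test_result, run_timestamp
FROM unit_test_results
ORDER BY run_timestamp DESC;
```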

As seen above, unit testing in SQL-based development can be achieved without any third-party tools, but it is tedious. It needs solid hands-on SQL skills, and development cycles will be longer. This blog has not covered integration with orchestration tools such as Apache Airflow or Azure Data Factory, nor integration with CI tools such as Jenkins and CircleCI. However, given that both Databricks and Snowflake have REST interfaces for invoking notebooks and stored procedures, such integration can be achieved.

