Unit testing approach for SQL based data platform development — Part 2

Murali Suraparaju
5 min read · Feb 16, 2022


In the previous article, we examined the unit testing approach conceptually. In this part, we will build it out on Databricks. As mentioned in the previous part, this is just an approach and does not require any additional frameworks or tools, so any developer can use it even in a restricted environment. The code shown in this article can be found here.

Dataset and use cases

The sample weather dataset available with the Community Edition of Databricks is used to implement this approach. The dataset consists of a single Parquet file with the following structure:

Figure 1: Sample dataset

To showcase the unit testing approach, the following questions are asked of the dataset (use cases):

1. Calculate monthly average temperatures

2. Calculate monthly average temperatures by elevation zone

3. Calculate monthly average temperatures by latitude

4. Calculate the daily temperature variance for each station

Logical program structure

The following figure shows the various components involved in implementing this technique on Databricks:

Figure 2: Databricks program structure

Notebook folder structure

To mirror the directory structure familiar from Java/Scala projects, developers can organize the notebook folders in a similar way:

Figure 3: Notebook folder structure

The main folder contains the following notebooks:

actual-data-load: Loads the data (the Data Load notebook shown in Figure 2)

main-driver: Invokes the other notebooks needed and forms the end-to-end program (the Driver Notebook shown in Figure 2)

functions (folder): Contains the reusable functions that need to be tested. These are called both from the main program and from the unit test driver.

Figure 4: The functions to be tested (functions folder)

Functions to be tested

The function notebooks themselves are very modular. Each notebook implements a specific piece of functionality with one or more input tables and one or more output tables. These can be persistent tables or in-memory tables (temporary views).

Figure 5: The function to be tested — raw data transformation
Figure 6: Function to be tested — daily temperature variance
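For illustration, a function notebook of this shape could consist of a single SQL cell along the following lines. This is only a sketch: the input view WEATHER_CLEANED and its column names are assumptions, not the actual names used in the figures.

-- Sketch of a function notebook: monthly average temperatures.
-- Assumes an input temporary view WEATHER_CLEANED with columns
-- station_id, observation_date and temperature (illustrative names).
CREATE OR REPLACE TEMP VIEW MONTHLY_AVG_TEMPERATURES AS
SELECT
  station_id,
  date_format(observation_date, 'yyyy-MM') AS observation_month,
  AVG(temperature) AS avg_temperature
FROM WEATHER_CLEANED
GROUP BY station_id, date_format(observation_date, 'yyyy-MM');

Because the notebook reads from a view and writes to a view, it does not care whether its input was created by the production data load or by a unit test.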

Unit tests

The unit test folder contains the following notebooks:

unit-test-driver: The notebook that calls the other tests. It can be thought of as a test suite.

Notebooks containing test cases: average_temperatures_test, data_transformation_test

Figure 7: The unit tests notebook folder

Test driver

The test driver creates a table to hold the test results (CI_BUILD.TEST_RESULTS) and invokes the test cases using the %run command. Note that the test results table has to be persistent, at least when invoking test cases, because temporary views/tables are read-only. The test results table can be truncated after each run or kept for a longer period to observe trends in test case results. It can also be queried by CI build engines through the Databricks JDBC interface.

Figure 8: The unit test driver
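As a rough sketch, the first cell of such a test driver could create the results table as shown below; the schema and column names are assumptions for illustration, not a verbatim copy of the figure.

-- Create the database and the persistent results table if they do not exist
CREATE DATABASE IF NOT EXISTS CI_BUILD;

CREATE TABLE IF NOT EXISTS CI_BUILD.TEST_RESULTS (
  test_suite STRING,
  test_case STRING,
  status STRING,
  executed_at TIMESTAMP
);

-- Optionally clear out the results of the previous run
TRUNCATE TABLE CI_BUILD.TEST_RESULTS;

The subsequent cells then invoke each test case notebook with a %run command, one notebook per cell.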

Test case notebook — Data setup and invocation

Figure 9: Test case notebook — data setup and invoking function

The first cell sets up the test data. In the above example, the data has been loaded from a test data file. The data can also be created within the notebook if the test dataset is small. Note that this step needs Java/Scala/Python, as we are creating a temporary view from a file. Alternatively, test data can be set up in a persistent Spark/Delta table and used directly. Example:

CREATE OR REPLACE TEMP VIEW TEST_DATA AS (SELECT * FROM <some_persistent_table_containing_test_data> )

The second approach is more native to SQL developers, but it involves the overhead of maintaining tables, as opposed to files in the first approach. Additionally, using Java/Scala/Python lets us create data dynamically within the notebook, eliminating persistent files or tables altogether. Choosing between these approaches for setting up test data is purely a matter of preference.
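For very small test datasets there is also a middle ground: the rows can be created inline in plain SQL with a VALUES clause, avoiding both files and persistent tables. The column names in this sketch are illustrative and not the actual schema of the weather dataset.

-- Inline test data as a temporary view (illustrative columns and values)
CREATE OR REPLACE TEMP VIEW TEST_DATA AS
SELECT * FROM VALUES
  ('ST001', DATE'2022-01-01', 12.5),
  ('ST001', DATE'2022-01-02', 14.0),
  ('ST002', DATE'2022-01-01', -3.2)
AS t(station_id, observation_date, temperature);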

The second cell creates the temporary view that is needed by the function under test. Think of this as the object a unit test creates and passes to the function under test. Just as that object is the interface between the function under test and the unit test case in the Java/Scala world, this temporary view is the interface between the notebook under test and the unit test case notebook.

After the temporary view is created, the next step is to invoke the function under test. This is achieved with the %run command. It can also be done with the dbutils.notebook.run() command in a Python cell.
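For example, the invoking cell could be a single %run line; the relative path and notebook name below are only illustrative of the folder layout described earlier.

%run ../main/functions/data-transformation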

Test case notebook — assertion

After the notebook is run, we assert on the conditions we need and write the results to the test results table. Although databases such as Oracle provide support for ASSERTION constraints, they are not yet part of Spark SQL. However, assertions can be implemented using plain SQL and the logic to assert on. In the following example, assertions are made on the number of rows and on the content of a specific column. Using a WITH clause (CTE) greatly helps in building the temporary data needed for the assertions; think of it as a way of holding an intermediate data structure, similar to a List of objects in Scala.

Figure 10: Assertions in test case
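As a sketch, a row-count assertion written in plain SQL could look like the following; the view name, the expected count, and the result columns are assumptions for illustration, not the exact code shown in the figure.

-- Assert on the row count produced by the function under test and
-- record the outcome in the persistent results table
INSERT INTO CI_BUILD.TEST_RESULTS
WITH actual AS (
  SELECT COUNT(*) AS actual_row_count
  FROM MONTHLY_AVG_TEMPERATURES
)
SELECT
  'average_temperatures_test' AS test_suite,
  'row_count_check' AS test_case,
  CASE WHEN actual_row_count = 3 THEN 'PASSED' ELSE 'FAILED' END AS status,
  current_timestamp() AS executed_at
FROM actual;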

Checking the test run results

This can be done by running a SELECT on the results table. CI tools can also use the JDBC interface to query this data, fail the build if needed, and report the results.
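For example, assuming the results table sketched earlier, the check could be as simple as:

SELECT test_suite, test_case, status, executed_at
FROM CI_BUILD.TEST_RESULTS
ORDER BY executed_at DESC;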

Figure 11: Test suite run results

Finally, the main program, the one that runs in production, invokes the functions one by one to achieve the end-to-end functionality. It is shown here to illustrate how the main driver program can be organized in a more modular way.

Figure 12: The main driver program — snippet 1 (Load data + invoke functions)
Figure 13: The main driver program — snippet 2 (Load data + invoke functions)
Figure 14: The main driver program — snippet 3 (Save output)
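Conceptually, the main driver runs the data load notebook and each function notebook in turn with %run (one per cell) and then persists the final output in a last cell, for example with a statement like the one below; the output table name is illustrative and assumes the view produced by the earlier sketch.

-- Persist the final output as a managed (Delta) table
CREATE OR REPLACE TABLE monthly_avg_temperatures_output
AS SELECT * FROM MONTHLY_AVG_TEMPERATURES;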

In the next part of this series, we will look at how to implement a similar unit testing approach in Snowflake.
