Unit testing approach for SQL based data platform development — Part 2
In the previous article, we examined the unit test approach conceptually. In this part, we will build it out on Databricks. As mentioned in the previous part, this is just an approach and does not require any additional frameworks or tools, so any developer can use it even in a restricted environment. Code shown in this article can be found here.
Dataset and use cases
The sample weather dataset available with the Community Edition of Databricks is used to implement this approach. The dataset has one Parquet file with this structure -
To showcase the unit testing approach, the following questions are asked of the dataset (use cases):
1. Calculate monthly average temperatures
2. Calculate monthly average temperatures by elevation zone
3. Calculate monthly average temperatures by latitude
4. Calculate the daily temperature variance for each station
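As a sketch of use case 1, monthly averages reduce to a GROUP BY over a month key. Python's built-in sqlite3 stands in for Spark SQL here, and the column names (station_id, obs_date, temp_f) are illustrative assumptions, not the actual schema of the Databricks sample dataset; in Spark SQL the month would typically be derived with date_format rather than strftime.

```python
import sqlite3

# Hypothetical weather rows; the real dataset's schema may differ.
rows = [
    ("S1", "2021-01-05", 30.0),
    ("S1", "2021-01-20", 34.0),
    ("S1", "2021-02-10", 40.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (station_id TEXT, obs_date TEXT, temp_f REAL)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", rows)

# Use case 1: monthly average temperature. In Spark SQL, the equivalent
# month key would be date_format(obs_date, 'yyyy-MM').
monthly = conn.execute("""
    SELECT strftime('%Y-%m', obs_date) AS month, AVG(temp_f) AS avg_temp
    FROM weather
    GROUP BY month
    ORDER BY month
""").fetchall()
print(monthly)  # [('2021-01', 32.0), ('2021-02', 40.0)]
```

The other three use cases follow the same shape, varying only the grouping key (elevation zone, latitude band, station/day).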
Logical program structure
The following figure shows the various components involved in implementing this technique on Databricks:
Notebook folder structure
To relate easily to the directory structure found in Java/Scala projects, developers can organize the notebook folders in a similar way -
The main folder contains the following notebooks -
actual-data-load : Loads the data (Data Load notebook shown in figure 1)
main-driver: Invokes the other notebooks as needed and forms the end-to-end program (Driver Notebook shown in figure 1)
functions (folder): Contains the reusable functions that need to be tested. These are called from both the main program and the unit test driver
Functions to be tested
The function notebooks themselves are very modular. Each notebook implements a specific piece of functionality with one or more input tables and one or more output tables. These can be persistent tables or in-memory tables (temporary views).
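As an illustration of this input/output contract, the sketch below mimics a function notebook: it reads an input table that the caller has already created and materialises its output as a view for downstream consumers. sqlite3 stands in for Spark SQL, and all table and column names are hypothetical; on Databricks the output would be a CREATE OR REPLACE TEMP VIEW in a SQL cell.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Input table: the caller (main driver or unit test) creates this first.
conn.execute("CREATE TABLE raw_weather (station_id TEXT, obs_date TEXT, temp_f REAL)")
conn.executemany("INSERT INTO raw_weather VALUES (?, ?, ?)",
                 [("S1", "2021-01-05", 30.0), ("S2", "2021-01-06", None)])

# Body of the function notebook: one focused transformation whose output
# is materialised as a view for the next notebook to consume.
conn.execute("""
    CREATE VIEW clean_weather AS
    SELECT station_id, obs_date, temp_f
    FROM raw_weather
    WHERE temp_f IS NOT NULL
""")

out = conn.execute("SELECT COUNT(*) FROM clean_weather").fetchone()[0]
print(out)  # 1
```

Because the function only speaks to the outside world through these named tables/views, the unit test can supply the input view and inspect the output view without any code changes.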
Unit tests
The unit test folder contains the following notebooks -
unit-test-driver: The notebook that calls other tests. It can be thought of as a test suite
notebooks containing test cases: average_temperatures_test, data_transformation_test
Test driver
The test driver creates a table to hold the test results (CI_BUILD.TEST_RESULTS) and invokes the test cases using the %run command. Note that the test results table has to be persistent, at least when invoking test cases, as temporary views/tables are read-only. The test results table can be truncated after each run, or retained for a longer period to view trends in test case results. It can also be queried by CI build engines using the Databricks JDBC interface.
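The shape of such a results table can be sketched as follows. The article only names the table (CI_BUILD.TEST_RESULTS); the columns here are an assumed schema, sqlite3 stands in for a persistent Delta table, and the schema qualifier is dropped because sqlite3 has no schemas.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema for the test results table created by the test driver.
conn.execute("""
    CREATE TABLE TEST_RESULTS (
        test_suite TEXT,
        test_case  TEXT,
        status     TEXT,   -- 'PASS' or 'FAIL'
        run_ts     TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Each test-case notebook appends one row per assertion.
conn.execute("""
    INSERT INTO TEST_RESULTS (test_suite, test_case, status)
    VALUES ('average_temperatures_test', 'row_count', 'PASS')
""")

# A CI build engine can fail the build when this count is non-zero.
failed = conn.execute(
    "SELECT COUNT(*) FROM TEST_RESULTS WHERE status = 'FAIL'").fetchone()[0]
print(failed)  # 0
```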
Test case notebook — Data setup and invocation
The first cell sets up the test data. In the above example, the data has been loaded from a test data file; it can also be created within the notebook if the test data is small. Note that this step needs Java/Scala/Python, as we are creating a temporary view. Alternatively, the test data can be set up in a persistent Spark/Delta table and used directly. Example:
CREATE OR REPLACE TEMP VIEW TEST_DATA AS (SELECT * FROM <some_persistent_table_containing_test_data> )
The second approach is more native to SQL developers, but involves the overhead of maintaining tables, as opposed to files in the first approach. Additionally, using Java/Scala/Python lets us create data dynamically within the notebook, eliminating persistent files or tables altogether. Choosing between the two approaches for setting up test data is purely a matter of preference.
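The second approach can be sketched like this, again with sqlite3 standing in for Spark SQL and hypothetical table names: a persistent table holds the curated test data, and a temporary view exposes it to the function under test.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Persistent table holding curated test data (hypothetical name).
conn.execute("CREATE TABLE test_weather_source (station_id TEXT, temp_f REAL)")
conn.executemany("INSERT INTO test_weather_source VALUES (?, ?)",
                 [("S1", 30.0), ("S2", 50.0)])

# Equivalent of: CREATE OR REPLACE TEMP VIEW TEST_DATA AS
#                SELECT * FROM <some_persistent_table_containing_test_data>
conn.execute("CREATE TEMP VIEW TEST_DATA AS SELECT * FROM test_weather_source")

n = conn.execute("SELECT COUNT(*) FROM TEST_DATA").fetchone()[0]
print(n)  # 2
```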
The second cell creates a temporary view that is needed by the function under test. Think of this as an object created by the unit test and passed to the function under test. Just as that object in the Java/Scala world is the interface between the function under test and the unit test case, this temporary view is the interface between the notebook under test and the unit test case notebook.
After the temporary view is created, the next step is to invoke the function under test. This is achieved with the %run command; it can also be done using the dbutils.notebook.run() command in a Python cell.
Test case notebook — assertion
After the notebook runs, we assert on the conditions we need and populate the results into the test database. Although databases such as Oracle provide support for ASSERTIONs, this is not yet a part of Spark SQL. However, assertions can be implemented using plain SQL and the logic to assert on. In the following example, assertions are made on the number of rows and on the contents of a specific column. The WITH clause (CTE) greatly helps in building the temporary data needed for the assertion; think of it as a way of holding a vectorized data structure, akin to a List of objects in Scala.
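A plain-SQL assertion along these lines can be sketched as follows, with sqlite3 standing in for Spark SQL and all names hypothetical: a CTE holds the expected rows, and a CASE expression compares them with the actual output and writes PASS or FAIL into the results table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TEST_RESULTS (test_case TEXT, status TEXT)")

# Output of the function under test, as it would appear in its output view.
conn.execute("CREATE TABLE monthly_avg (month TEXT, avg_temp REAL)")
conn.executemany("INSERT INTO monthly_avg VALUES (?, ?)",
                 [("2021-01", 32.0), ("2021-02", 40.0)])

# Plain-SQL assertion: the 'expected' CTE holds the reference rows, and the
# CASE expression records PASS only if every expected row matches an actual
# row and the row counts agree.
conn.execute("""
    WITH expected(month, avg_temp) AS (
        VALUES ('2021-01', 32.0), ('2021-02', 40.0)
    )
    INSERT INTO TEST_RESULTS (test_case, status)
    SELECT 'monthly_avg_matches',
           CASE WHEN (SELECT COUNT(*) FROM monthly_avg a
                      JOIN expected e
                        ON a.month = e.month AND a.avg_temp = e.avg_temp)
                     = (SELECT COUNT(*) FROM expected)
                 AND (SELECT COUNT(*) FROM monthly_avg)
                     = (SELECT COUNT(*) FROM expected)
                THEN 'PASS' ELSE 'FAIL' END
""")

status = conn.execute(
    "SELECT status FROM TEST_RESULTS WHERE test_case = 'monthly_avg_matches'"
).fetchone()[0]
print(status)  # PASS
```

The same pattern extends to any condition expressible in SQL; each assertion simply inserts one more PASS/FAIL row into the results table.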
Checking the test run results
This can be done by running a SELECT on the results table. CI tools can also use the JDBC interface to query this data, fail the build on failures, and report the results.
Finally, the main program, the one that runs in production, invokes the functions one by one to achieve the overall functionality. This is shown just to illustrate how the main driver program can be organized in a more modular way.
In the next part of this series, let us look at how to implement a similar unit testing approach in Snowflake.