Unit testing approach for SQL-based data platform development — Part 1

Murali Suraparaju
5 min read · Feb 16, 2022


Background

Implementing transformations using SQL in data pipelines has been the preferred approach of ETL developers for a very long time. Things were very different in the big data processing world that dominated the last decade: Scala, Python, and to some extent Java were the popular choices for building data platforms on frameworks such as Apache Spark and Hadoop. SQL, however, has the advantage of being a declarative, data-centric language, as opposed to the imperative style of 3rd generation languages such as Java and Scala.

The following figure shows the evolution in data engineering.

Figure 1: Evolution of Data Engineering Technologies

This blog is aimed at data engineers who have been developing data solutions using 3rd generation programming languages such as Java/Scala and are looking to switch over to cloud-based Data-as-a-Service (DaaS) platforms.

During the 2010s, there were significant capability gaps between developing in Scala and developing in SQL. This was mainly because the underlying technology (written in languages such as Java/Scala) was not mature enough to provide a cross-compiler/language interface for multiple languages. After widespread adoption and more production usage, these technologies started providing a SQL interface on top of the underlying processing framework; a good example is Spark SQL. These interfaces have now matured to the point where they support full ANSI SQL. This evolution, coupled with the fact that most data processing has now moved to cloud-native platforms, has greatly simplified the creation of data engineering programs. In platforms such as Databricks, code can be written in SQL in a notebook and readily productionized with minimal effort.

Problem

The above evolution has also posed some challenges for data engineering teams. When coding in 3rd generation languages, developers could follow engineering practices such as unit testing, test-driven development (TDD), and continuous integration (CI), which were standard in the application development (e.g., microservices) world. Frameworks such as JUnit and ScalaTest have enabled these practices for data processing as well. However, no such robust frameworks exist when coding in a notebook with SQL. Some frameworks, such as Microsoft Nutter (https://github.com/microsoft/nutter), do exist, but they are limited to a specific cloud/DaaS combination (Azure/Databricks). In addition, that approach requires knowledge and usage of IDEs and Python.

There are also full-fledged tools such as dbt. However, this is a tool that needs to be purchased and adopted by the organization, and developers usually do not have the luxury of using such comprehensive tools.

A more lightweight and simpler approach is presented in this blog. The downside is that developers need to carefully hand-code the tests, the test drivers, and the main code. If the codebase is small or medium sized and a tool cannot be purchased, developers can use this approach. This three-part blog presents the approach for unit testing on two very popular DaaS solutions: Databricks and Snowflake. The first part details the approach conceptually, the second part details the approach for Databricks along with code samples, and the third part gives the approach and code samples for Snowflake.

There could be multiple other approaches, and what is given in this blog can be considered one of them. As it is an approach rather than a framework, it does not involve the use of any frameworks or tools. Developers can refer to this approach and use it as-is or tweak it to suit their needs.

Solution

The actual techniques used for unit testing vary with the underlying platform (Databricks, Snowflake, Synapse, etc.). However, the overall approach and method remain the same. The following diagram shows a conceptual view of this approach:

Figure 2: Solution — Conceptual view

Driver trigger

In the regular flow, this is an orchestration engine, such as Apache Airflow or Azure Data Factory (ADF), that triggers the main driver code. In the unit-test flow, this is the continuous integration (CI) build trigger, such as a Jenkins or CircleCI job. It also gathers the test results and reports them.

Driver code

This is the code that invokes the atomic business module(s) to achieve the end-to-end functionality. Like the atomic business logic code, it could be a notebook or another stored procedure. The driver code differs between production runs (e.g., Main.scala) and unit-test runs (e.g., MonthlyAvgTemparatureCalculatorTest.scala). The production code loads actual data, invokes a series of modular business logic components, and saves the output. The unit-test code invokes a test-data setup script, invokes the modular business logic code, and asserts on the output. The usual challenge when developing test scripts is creating an isolated environment, such as a physical or in-memory database, where the business logic can be tested and asserted in isolation. Temporary databases or non-persistent in-memory tables, provided by platforms such as Snowflake and Databricks, can be used to create and destroy these ephemeral environments. A minimal sketch of such a test driver is shown below.
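As a minimal sketch in generic SQL (the table, columns, and sample values are hypothetical, chosen only to illustrate the flow), a unit-test driver might look like this. Exact temporary-table syntax varies by platform:

```sql
-- Hypothetical unit-test driver, sketched in generic SQL.
-- Step 1: create an ephemeral fixture. TEMPORARY tables are
-- session-scoped in Snowflake; Databricks offers TEMPORARY VIEWs.
CREATE TEMPORARY TABLE temperature_readings (
    city       STRING,
    reading_ts TIMESTAMP,
    temp_c     DOUBLE
);

-- Step 2: seed a small, known test data set.
INSERT INTO temperature_readings VALUES
    ('Hyderabad', '2022-01-05 09:00:00', 22.0),
    ('Hyderabad', '2022-01-20 09:00:00', 26.0);

-- Step 3: invoke the modular business logic under test
-- (the monthly_avg_temperature view sketched in the next section).
SELECT * FROM monthly_avg_temperature;

-- Step 4: assert on the output (see the "Assert output" section).
```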

Modular code

This is the atomic business logic code that needs to be tested. Developers coming from Java/Scala/Python can think of this as the function that contains their independently testable business logic (e.g., AvgTemparatureCalculator.scala#calculateMonthly()). In most cases, these are transformations and aggregations, built as modular components: cells in a notebook (Databricks), stored procedures (Snowflake), or simple worksheets in some other technology. The main requirement for this modular code is that it must be independently invocable from another (caller) module. A sketch of such a unit, continuing the example above, is shown below.
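Continuing the hypothetical example from the driver sketch, the modular unit could be a view (or a stored procedure wrapping the same query) that is independently invocable; the names are assumptions carried over from above:

```sql
-- Hypothetical modular business logic: average temperature per
-- city per month. Table and column names are assumptions.
CREATE OR REPLACE VIEW monthly_avg_temperature AS
SELECT
    city,
    DATE_TRUNC('month', reading_ts) AS reading_month,
    AVG(temp_c)                     AS avg_temp_c
FROM temperature_readings
GROUP BY city, DATE_TRUNC('month', reading_ts);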

Assert output

This is the trickiest part for Java/Scala/Python developers. In 3rd generation languages, especially when using unit test frameworks, the assert*() methods eliminate a lot of boilerplate code. In addition, 3rd generation languages have the advantage of storing the expected output and the actual output in two variables and doing a simple comparison on them. Databases such as Oracle have considered supporting ASSERTION as a keyword, but such capabilities have not yet made it to DaaS products. Hence, doing an assertion using SQL becomes a bit challenging. We need to resort to constructs such as SELECT .. IF or SELECT .. CASE .. WHEN .., along with the WITH clause (common table expression, CTE), to achieve an assertion, as sketched below.
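Here is a minimal sketch of such a CTE-plus-CASE assertion, continuing the hypothetical example above (the expected value 24.0 follows from the two seeded test rows):

```sql
-- Hypothetical assertion: compare actual vs. expected output via
-- CTEs and report PASS/FAIL. Names and values are assumptions.
WITH actual AS (
    SELECT avg_temp_c
    FROM monthly_avg_temperature
    WHERE city = 'Hyderabad'
      AND CAST(reading_month AS DATE) = DATE '2022-01-01'
),
expected AS (
    SELECT 24.0 AS avg_temp_c
)
SELECT
    CASE
        WHEN a.avg_temp_c = e.avg_temp_c THEN 'PASS'
        ELSE 'FAIL'
    END AS test_monthly_avg_temperature
FROM actual a
CROSS JOIN expected e;
```

The driver trigger can then collect these PASS/FAIL rows and fail the CI build whenever any assertion reports FAIL.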

In the next part, we will dive into the exact techniques used to achieve this in Databricks — along with code samples.
