Data testing framework (with Python and dbt)
TL;DR: This article presents the data testing possibilities offered by our dbt/Python pipeline stack. It gives an overview of how, what & when we can test. Understanding what our own stack allows is a crucial step towards building a data testing philosophy & strategy!
In this short article we present the data testing framework that fits our stack. It comes from our need, as a data team, to benchmark the possibilities we have when it comes to data testing. Indeed, a data testing strategy is a balance between the philosophy you have developed and the technological possibilities at your disposal. So let’s dive right into our stack!
1. Description of the stack
Pipeline transformation
In the Ovrsea data team, we are in charge of transforming raw data into actionable data, either for analytics or to feed external software with data. The variety of data sources, as well as the multitude of tables, forces us to structure our pipeline.
The principle of the pipeline design is clear: define one general task per layer and parallelize the specific transformations of the tables within each layer.
Our main data pipeline is made up of the following layers (a simplified dbt folder layout is sketched after this list):
- first, a raw data schema that stores every source database (loaded with a dump and restore)
- then, a first cleaning layer where soft deletes and simple cleaning tasks are executed
- after that, the company definitions layer, which enforces the “single source of truth” guideline and provides a clear definition of any indicator shared at company level
- then, intermediate tables in which more general resolution steps are performed
- finally, ready-to-use tables and tables for exposures.
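To make these layers more concrete, here is a simplified sketch of what the corresponding dbt model folders could look like. The folder and model names below are hypothetical illustrations, not our actual project:

```
models/
├── staging/                # cleaning layer: soft deletes, renaming, type casting
│   └── stg_orders.sql
├── company_definitions/    # "single source of truth" indicators shared at company level
│   └── def_shipments.sql
├── intermediate/           # more general resolution logic
│   └── int_shipments_enriched.sql
└── marts/                  # ready-to-use tables and exposures
    └── mart_shipments_kpi.sql
```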
Pipeline stack
To perform these transformations we take advantage of three powerful and well-known technologies. Transformation is done using a mix of Python and dbt, while the pipeline is orchestrated by Airflow; the details of each use are recalled in the schema below.
This means that, without adding any software to our pipeline, Python and dbt are the only frameworks available to implement tests!
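To give an idea of how these pieces fit together, here is a minimal Airflow DAG sketch mixing a Python transformation task with a dbt run, assuming Airflow 2.x and a dbt Core project invoked from the command line. The DAG id, the project path and the clean_raw_tables function are hypothetical, not a description of our actual DAG:

```python
# Minimal orchestration sketch (names and paths are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def clean_raw_tables():
    # Hypothetical Python cleaning step (soft deletes, renaming, casting).
    ...


with DAG(
    dag_id="data_transformation_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean = PythonOperator(
        task_id="clean_raw_tables",
        python_callable=clean_raw_tables,
    )

    # dbt handles the SQL transformations of each layer.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt_project && dbt run",
    )

    clean >> dbt_run
```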
2. Data testing types
One thing to understand is that when talking about data, you can test two things:
- The data itself
- The transformations of the data
Here are the possibilities our stack gives us to test these two elements.
dbt tests
The dbt framework enables the user to implement tests on any raw or transformed table using SQL. dbt is a powerful ally when it comes to testing, as it provides two very different types of tests (sketches of both follow the list below):
- Singular tests: one-off SQL queries that test a specific behaviour of a specific table. This category enables complex, solution-oriented tests at a medium to low cost: the test is a single SQL statement that addresses one complex check you want to perform on a specific table.
- Generic tests: reusable dbt tests that can be customized through parametrized inputs and applied many times across tables using the dbt YAML properties files. These tests are often simple, but have the advantage of being very easy to add on any table or column in seconds. By creating new generic tests (through dbt’s custom generic test feature) you can design your testing plan and address recurring issues.
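As a rough sketch of what each type looks like (table, column and test names are hypothetical examples, not our real models): a singular test is a SQL file that passes when it returns zero rows,

```sql
-- tests/assert_quotation_prices_match_currency.sql (hypothetical singular test)
-- The test fails if any row is returned.
select
    quotation_id
from {{ ref('quotations') }}
where price is not null
  and currency_index is null
```

while a generic test such as the built-in unique or not_null is attached to columns in a YAML properties file:

```yaml
# models/marts/schema.yml (hypothetical)
models:
  - name: quotations
    columns:
      - name: quotation_id
        tests:
          - unique
          - not_null
```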
Pytest library
pytest handles the tests on the Python functions. These are simply unit tests where you provide a specific input and the expected output, and check that your function behaves as expected. pytest is a very common framework for developers, but here it is used to test the Python functions that directly transform our data within the pipeline.
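A minimal sketch of such a unit test, built around a hypothetical cleaning function normalize_country_code (not an actual function from our codebase):

```python
# transformations.py (hypothetical module)
def normalize_country_code(raw_code: str) -> str:
    """Uppercase and strip a raw country code, e.g. ' fr ' -> 'FR'."""
    return raw_code.strip().upper()


# test_transformations.py
import pytest

from transformations import normalize_country_code


@pytest.mark.parametrize(
    "raw, expected",
    [(" fr ", "FR"), ("US", "US"), ("de\n", "DE")],
)
def test_normalize_country_code(raw, expected):
    # "Dummy" inputs designed to exercise the specific behaviour of the function.
    assert normalize_country_code(raw) == expected
```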
3. Data testing orchestration
The last question we need to answer is: when do we test? We have three possibilities:
- Testing at each merge request (as part of our CI/CD)
- Tests embedded in the main data transformation airflow pipeline
- Tests on general data using another, dedicated pipeline (launched once a day)
These possibilities are very interesting as they offer different opportunities in terms of testing and alerting.
Testing at each merge request
During a PR we are mostly interested in testing things that do not depend on the data and only need to be tested once the code is ready; that is: Python functions.
Indeed, the tests on these functions are not impacted by the real data, as they rely on “dummy” data designed to test the specific behaviour of each function using pytest. Testing them only when the code changes, during the PR, therefore makes perfect sense!
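In practice this boils down to running pytest in the CI. Here is a minimal sketch of such a step, assuming a GitHub Actions workflow; the actual CI provider and configuration may differ:

```yaml
# .github/workflows/ci.yml (hypothetical sketch)
name: ci
on: pull_request
jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - run: pytest tests/
```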
Note: we are also working on triggering an Airflow pipeline during the CI/CD to check that the entire data pipeline can still be computed with the modified code. This is still a work in progress, however.
Testing embedded in the data transformation pipeline
During the data pipeline, what we want to test are the tables themselves, and in particular to trigger the critical tests. Indeed, the point of testing during the pipeline is to stop it if a test fails. So here we only consider the dbt tests that are crucial to the good health of the pipeline: generating downstream tables wouldn’t make sense if these tests failed, and pushing corrupted data to our BI tools or integrated software would be risky.
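One way to wire this up, continuing the hypothetical DAG sketch from the pipeline stack section and assuming the critical tests carry a "critical" tag in dbt (an assumption about naming, not a description of our exact setup), is to place a dbt test task between the dbt run and the downstream publishing tasks, so that a failure stops the DAG:

```python
# Inside the same hypothetical `with DAG(...)` block as the transformation tasks.
dbt_test_critical = BashOperator(
    task_id="dbt_test_critical",
    # Only the tests tagged as "critical" run here; a non-zero exit code fails the
    # task and therefore blocks everything downstream of it.
    bash_command="cd /opt/dbt_project && dbt test --select tag:critical",
)

# publish_to_bi is a hypothetical downstream task defined elsewhere in the DAG.
dbt_run >> dbt_test_critical >> publish_to_bi
```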
Tests on general data in a dedicated pipeline
Finally, on all generated and stored tables, we want to run the dbt tests that are not critical but still important. For these less critical tests, it makes sense to prevent them from breaking the pipeline: alerts will enable us to act before the next data pipeline run.
Note: to trigger these tests once a day, we define another Airflow pipeline, independent from the data pipeline.
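A sketch of such a dedicated daily DAG, reusing the hypothetical "critical" tag from above to exclude the blocking tests (again, ids and paths are illustrative assumptions):

```python
# Sketch of a dedicated daily testing DAG, independent from the transformation pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_data_quality_checks",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Run every dbt test except the critical ones already covered in the main pipeline.
    dbt_test_non_critical = BashOperator(
        task_id="dbt_test_non_critical",
        bash_command="cd /opt/dbt_project && dbt test --exclude tag:critical",
    )
```

A failure here only raises an alert, without blocking the transformation pipeline.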
In fact, choosing whether to trigger a test during the data transformation pipeline or in a separate dedicated pipeline is really about finding a balance between freshness and accuracy. Blocking the entire pipeline for an error has a real impact on BI, so we should choose wisely what should block the pipeline and what should only raise our attention without breaking it.
Here are two examples, one breaking and one non-breaking test (a possible dbt configuration for both is sketched below).
A breaking test: “There is a mismatch between the currency index and the prices in our quotations.” This could have a massive impact and should be breaking to avoid misuse in BI.
A non-breaking test: “The ETD date of a shipment is inaccurate.” This should still trigger an alert, but as no critical data is affected, we should keep providing data to our BI team, as we believe this could be a specific case to deal with.
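dbt lets you encode this distinction directly in the test configuration, using tags (to route tests to one pipeline or the other) and the severity setting (error vs warn). Here is a sketch with hypothetical model and column names, not our real schema:

```yaml
# models/marts/schema.yml (hypothetical sketch)
models:
  - name: quotations
    columns:
      - name: currency_index
        tests:
          - not_null:
              config:
                tags: ["critical"]
                severity: error   # breaking: fails the pipeline run
  - name: shipments
    columns:
      - name: etd_date
        tests:
          - not_null:
              config:
                tags: ["non_critical"]
                severity: warn    # non-breaking: raises a warning/alert only
```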
Conclusion
dbt mixed with Python is a powerful and efficient engine to handle testing across the entire pipeline. This first study was very important for us, as it enabled us to implement our testing strategy in line with our philosophical drivers. To know more about the data testing philosophy at Ovrsea, check out this other article!
Thank you for reading, and feel free to reach out to us! We will be more than happy to discuss and debate data testing best practices with other startups! Improving and challenging ourselves is key when it comes to testing!