Updates around Dataform assertions

Fumiaki Kobayashi
Towards Data Engineering
4 min readJun 21, 2024

New features and improvements to enhance the productivity of implementing tests

Although still in beta, Dataform is undergoing a major upgrade from 2.X to 3.0.

Among new features including new yaml setting files, the assertion-related updates caught my attention. I would like to introduce them in this article, including updates to a package that enables us to customize assertions with more flexibility.

Photo by JESHOOTS.COM on Unsplash

In Dataform 3.0, two options called dependOnDependencyAssertions and includeDependentAssertions, which can be a bit confusing terms, have been added. To give some background on the reason for adding this option, in Dataform, the model in the subsequent process is still executed even if a test fails. In other words, only the model creation process is a dependency by default. it is a different behavior of the dbt build command.

To prevent subsequent models from being created if a test fails, it was necessary to define dependencies option explicitly for each test. For example, if this first_view has 5 assertions, the sqlx file would look like this.

config {
type: "view",
dependencies: ["test1", "test2", "test3", "test4", "test5"]
}

SELECT
*
FROM
${ref("first_view")}

An important data mart likely has five or more tests. In such cases, it is quite tedious to add tests to dependencies each time. Furthermore, the risk of forgetting to modify the dependencies when adding tests is also increased.

That is why the dependeOnDependencyAssertions option is introduced. when enabled in the config block, it automatically creates a dependency for all assertions for the models referenced in ref and dependencies . It results in Dataform executing the model only if all assertions are successful.

config {
type: "view",
dependOnDependencyAssertions: true,
}

SELECT
*
FROM
${ref("first_view")}

You can see at a glance that the code has been simplified. As for the includeDependentAssertions option, all assertions specified by model name can be included in the dependency of the model you will execute. This one can be used when you want to set up assertion dependencies only for a specific model.

config { 
type: "view",
dependencies: [{name: "some_table", includeDependentAssertions: true}]
}

select test from ${ref({name: "some_other_table", includeDependentAssertions: true})}

In this case, the model will only be executed if all the some_table and some_other_table assertions are successful.

Enhancement of dataform-assertions package functionality

The dataform-assertions package, which makes it easy to implement rich tests in dataform, was introduced in my previous article. In a few months, some features have been added to this package, or rather, some of them were merged after I implemented them myself and sent a PR.

Extended source freshness assertions

  • TIEMSTAMP columns are now supported for comparisons with more granular units such as SECOND, MINUTE, and HOUR.
  • The time zone can now be specified for DATE types.

Due to this update, the assertion is richer than dbt’s source freshness test, partly because it is originally based on the DATE comparison.

dataFreshnessConditions: {
"first_table": {
"dateColumn": "updated_date",
"timeUnit": "DAY",
"delayCondition": 1,
"timeZone": "America/Los_Angeles"
},
"second_table": {
"dateColumn": "TIMESTAMP(updated_date)",
"timeUnit": "HOUR",
"delayCondition": 3,
}
},

Allow the config block to specify the test target of the model

This feature is useful when the test model is a table of TB size and only some proportion of the data is updated, and you want to reduce the cost by narrowing down the test target. If a past test was successful with the data that has not been changed since then, there is no reason to test again.

In dbt, this feature is implemented in the dbt core. you can specify the test range for each test using the config block.

version: 2

models:
- name: large_table
columns:
- name: my_column
tests:
- accepted_values:
values: ["a", "b", "c"]
config:
where: "created_at > CURRENT_DATE() - 7"

On the other hand, Dataform had no equivalent functionality. The only way to do this was to write custom tests or create individual models for testing. Therefore, I implemented the same concept directly in this package.

const commonAssertions = require("../index");

const commonAssertionsResult = commonAssertions({
globalAssertionsParams: {
"database": {project},
"schema": "assertions_" + dataform.projectConfig.vars.env,
"location": "EU",
"tags": ["assertions"],
},
// write test range config in this level
config: {
"first_table": {
"where": "updated_date >= CURRENT_DATE() - 7"
},
},
rowConditions: {
"first_table": {
"id_not_null": "id IS NOT NULL",
"id_strict_positive": "id > 0"
},
"second_table": {
"id_in_accepted_values": "id IN (1, 2, 3)"
}
},
...

In the case of the id_not_null test for first_table, the query would look like this. The original test condition is executed after filtering the table.

WITH
filtering AS (
SELECT
*
FROM
`{project}.dataform.first_table`
WHERE
updated_date >= CURRENT_DATE() - 7
)
SELECT "Condition not met: id IS NOT NULL, Table: `{project}.dataform.first_table`" AS assertion_description
FROM filtering
WHERE NOT (id IS NOT NULL)

Since dataform-assertions is still a small package, I could add this functionality within a few hours. There is still plenty of room to add more features, so if you are using dataform or want to be involved in OSS, please try to contribute.

--

--