As we’ve mentioned before, one of the core design goals of Dataform is to make project compilation hermetic. The idea is to ensure that your final ELT pipeline is as reproducible as possible given the same input (your project code), with a few tightly-controlled exceptions (like support for ‘incremental’ tables).
Being able to reason this way about the code in Dataform pipelines gives us the opportunity to build some cool features into the Dataform framework. An example is our “run caching” feature.
Don’t waste time and money re-computing the same data
Most analytics pipelines are executed periodically as part of some schedule. Generally, these schedules are configured to run as often as necessary to keep the final data as up-to-date as the business requires.
Unfortunately, this can lead to a waste of resources. Consider a pipeline that is executed once an hour. If its input data doesn’t change between one execution and the next, then the next execution will result in no changes to the output data, but it’ll still cost time and money to run.
Instead, we believe that the pipeline should automatically detect if it’s not going to change the output data — and if so, then the affected stage(s) should be skipped, saving those resources.
We’ve built this feature into Dataform.
Run caching in Dataform
Try out an example project with run caching here! (click “incidents_by_date.sqlx” on the left-hand side)
You can turn run caching on in your project with a few small changes which are described here. Once enabled, run caching skips re-execution of code which cannot result in a change to output data.
For example, consider the following SQLX file, which configures Dataform to publish a table age_count containing the transformed results of a query reading a people relation:

    config { name: "age_count", type: "table" }
    select age, count(1) from ${ref("people")} group by age
Dataform only needs to (re-)publish this table if any of the following conditions are true:
- The output table age_count doesn't exist
- The output table age_count has changed since the last time this table was published (i.e. it was modified by something other than Dataform itself)
- The query has changed since the last time the age_count table was published
- The input table people has changed since the last time the age_count table was published (or, if people is a view, if any of the input(s) to people have changed)
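To make the decision logic concrete, here is a minimal Python sketch of how a check like this could work. This is purely illustrative: the names (TableState, should_republish, last_published_at, and so on) are assumptions for the example, not Dataform's actual implementation or API.

```python
# Hypothetical sketch of a run-caching decision for a single table.
# All names and helpers here are illustrative assumptions, not Dataform's real API.

from dataclasses import dataclass

@dataclass
class TableState:
    exists: bool
    last_modified: int  # e.g. a timestamp from warehouse metadata
    query_hash: str     # hash of the SQL that last built the table

def should_republish(output: TableState,
                     current_query_hash: str,
                     last_published_at: int,
                     input_last_modified: int) -> bool:
    """Return True if any of the four conditions forces a re-publish."""
    if not output.exists:
        return True   # output table doesn't exist
    if output.last_modified > last_published_at:
        return True   # output modified by something other than the pipeline
    if output.query_hash != current_query_hash:
        return True   # query changed since the last publish
    if input_last_modified > last_published_at:
        return True   # input data changed since the last publish
    return False      # nothing changed: safe to skip this action

# Example: nothing has changed since the last publish, so the run is skipped.
state = TableState(exists=True, last_modified=100, query_hash="abc")
print(should_republish(state, "abc", last_published_at=100, input_last_modified=90))
# False -> the publish action would be skipped
```

The key design point is that every check compares cheap metadata (existence, modification timestamps, a query hash) rather than the data itself, so the decision costs almost nothing relative to re-running the query.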
Dataform uses these rules to decide whether or not to publish the table. If none of the conditions hold, i.e. re-publishing the table would result in no change to the output table, then this action is skipped.
Building in intelligence so you don’t have to
At Dataform we believe that you shouldn’t have to manage the infrastructure involved in running analytics workloads.
This philosophy is what drives us to build out features like run caching, which automatically help to manage and operationalize analytics workloads, so that you don’t have to. All you need to do is define your business-logic transformations, and we’ll handle the rest.
If you’d like to learn more, the Dataform framework documentation is here. Join us on Slack and let us know what you think!
Originally published at https://dataform.co.